1. Overview #
Because the kernel is the operating system’s core and holds ultimate control over every part of the system, attacks against it are a persistent threat. Kernel vulnerabilities that enable such attacks can arise in many components, including device drivers and various subsystems. Device drivers, in particular, often run in kernel mode, making them prime targets for exploitation. This report examines vulnerability types stemming from device drivers and reviews the protection mechanisms that major desktop operating systems (Windows and Linux) employ to prevent them.
2. Kernel Types #
Kernels come in several varieties—monolithic kernels, microkernels, hybrid kernels, and exokernels. Among these, monolithic, micro, and hybrid kernels are the most widely used.
2.1 Monolithic Kernel #
A monolithic kernel—sometimes called a “single” kernel—executes all core operating-system functions within the kernel itself. Because essential services such as the file system, networking, device drivers, and process management all run in kernel space, monolithic kernels are typically faster and more efficient than other designs. Device drivers also run in kernel mode for direct hardware access and quick interaction with core OS facilities. The downside of this tightly integrated structure is that a fault in one component can impact the entire system, and the high interdependency among components can make debugging difficult.
2.2 Microkernel #
In contrast to the “all-in-one” approach of a monolithic kernel, a microkernel provides only the minimal mechanisms necessary for an OS. Most services—including device drivers—run in user mode, which limits unrestricted access to low-level system resources. This isolation reduces the likelihood that a single fault will cascade across the whole system. However, frequent context switches and message passing—for example, between file services—can hurt performance compared with a monolithic design.
2.3 Hybrid Kernel #
A hybrid kernel combines concepts from monolithic and microkernels. Apart from adding extra code in kernel space for performance, it resembles a pure microkernel and is sometimes called a “modified microkernel.” For device drivers, most run in kernel mode for speed, while some may run in user mode when stability and security warrant it.
2.4 Kernel Architectures Used by Major Desktop OSes #
In the global desktop OS market, Windows holds the largest share, followed by macOS and Linux. On the server side, Linux leads, with Windows close behind. Among these, Windows (Windows NT) employs a hybrid kernel, while Linux uses a monolithic kernel. As noted above, both architectures execute device drivers in kernel mode, which raises the likelihood of kernel-level vulnerabilities.
3. Purpose of Kernel Exploitation #
The main objective of a kernel exploit is privilege escalation. Once an attacker gains higher-level privileges, they can arbitrarily change system settings, steal or manipulate sensitive data, and ultimately take full control of the machine. Windows and Linux have fundamentally different system architectures, including their privilege-management models. Before examining common vulnerability types and defense techniques, we briefly outline each OS’s privilege-management structure and the flow of a privilege-escalation attack.
3.1 Privilege Management in Windows #
In Windows, access-control information is stored in an Access Token, which can be reached from the EPROCESS (Executive Process) structure residing in kernel memory.
3.1.1 EPROCESS Structure #
When a process first starts, Windows creates this structure, so every process has an EPROCESS.
The structure holds everything the OS needs to manage the process, including its Access Token.
dt nt!_eprocess
Running the command above in the Windows debugger, WinDbg, lists the members of the EPROCESS structure:
0: kd> dt nt!_eprocess
+0x000 Pcb : _KPROCESS
...
+0x4b8 Token : _EX_FAST_REF
...
As shown, the structure contains a Token member.
3.1.2 Token Overwrite #
As explained in 3.1.1 EPROCESS Structure, every process, including system processes, has an EPROCESS structure. Because Windows identifies privileges through an Access Token, overwriting a user process’s token with that of a system process enables privilege escalation.
Let’s see how to find the Access Token for the system process lsass.exe using WinDbg:
0: kd> !process 0 0 lsass.exe
PROCESS ffffad08231af080
SessionId: 0 Cid: 02c0 Peb: 77b03a5000 ParentCid: 0214
DirBase: 101901000 ObjectTable: ffffd3098618b800 HandleCount: 1287.
Image: lsass.exe
First, run !process 0 0 lsass.exe to obtain basic information about the process. The address shown after PROCESS (ffffad08231af080) is the starting address of lsass.exe’s EPROCESS structure.
0: kd> !process ffffad08231af080 7
PROCESS ffffad08231af080
SessionId: 0 Cid: 02c0 Peb: 77b03a5000 ParentCid: 0214
DirBase: 101901000 ObjectTable: ffffd3098618b800 HandleCount: 1287.
Image: lsass.exe
VadRoot ffffad0824b55150 Vads 152 Clone 0 Private 1351. Modified 394. Locked 3.
DeviceMap ffffd3098245d1e0
Token ffffd30983f4b220
ElapsedTime 00:09:02.496
UserTime 00:00:00.078
KernelTime 00:00:00.046
...
Passing that address to !process with the 7 flag yields more detail, including the address of the token object that the Token member points to.
0: kd> !token ffffd30983f4b220
_TOKEN 0xffffd30983f4b220
TS Session ID: 0
User: S-1-5-18
User Groups:
00 S-1-5-32-544
Attributes - Default Enabled Owner
01 S-1-1-0
Attributes - Mandatory Default Enabled
02 S-1-5-11
Attributes - Mandatory Default Enabled
03 S-1-16-16384
Attributes - GroupIntegrity GroupIntegrityEnabled
Primary Group: S-1-5-18
Privs:
02 0x000000002 SeCreateTokenPrivilege Attributes - Enabled
03 0x000000003 SeAssignPrimaryTokenPrivilege Attributes -
04 0x000000004 SeLockMemoryPrivilege Attributes - Enabled Default
05 0x000000005 SeIncreaseQuotaPrivilege Attributes -
07 0x000000007 SeTcbPrivilege Attributes - Enabled Default
08 0x000000008 SeSecurityPrivilege Attributes -
09 0x000000009 SeTakeOwnershipPrivilege Attributes -
...
Running !token with that address displays detailed privilege information for the Access Token. If you write this token’s value into a user process’s Token member, the user process will inherit lsass.exe’s privileges—achieving privilege escalation. (In an actual exploit, this procedure is automated in code.)
After overwriting the user process’s Access Token with one from a system process, the process effectively runs as NT AUTHORITY\SYSTEM, the highest privilege level on Windows, able to access almost all system resources.
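The token swap described above can be modeled in ordinary user-space C. This is only a sketch under stated assumptions: mock_eprocess, token_object(), and steal_token() are hypothetical names, the structure is a stand-in rather than the real EPROCESS layout, and the sketch assumes the x64 convention in which the low 4 bits of an _EX_FAST_REF value hold a reference count instead of address bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for EPROCESS; the real layout differs per build. */
typedef struct mock_eprocess {
    uint32_t pid;
    uint64_t token;  /* models _EX_FAST_REF: pointer plus ref count in low bits */
} mock_eprocess;

/* Assumption: on x64 the low 4 bits of the fast reference are a ref count. */
#define FAST_REF_MASK UINT64_C(0xF)

/* Strip the reference-count bits to obtain the token object's address. */
static uint64_t token_object(uint64_t fast_ref)
{
    return fast_ref & ~FAST_REF_MASK;
}

/* Copy the privileged process's token reference into the target process. */
static void steal_token(mock_eprocess *target, const mock_eprocess *privileged)
{
    assert(target != NULL && privileged != NULL);
    target->token = privileged->token;
}
```

After steal_token(), both mock processes reference the same token object, which is the state a real exploit tries to reach by writing to kernel memory.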
3.2 Privilege Management in Linux #
In Linux, access-control information is stored in the cred structure, which can be reached from the task_struct residing in kernel memory.
3.2.1 task_struct and cred Structures #
struct task_struct {
...
/* Process credentials: */
/* Tracer's credentials at attach: */
const struct cred __rcu *ptracer_cred;
/* Objective and real subjective task credentials (COW): */
const struct cred __rcu *real_cred;
/* Effective (overridable) subjective task credentials (COW): */
const struct cred __rcu *cred;
...
};
A task_struct is created when a process first starts, and a new one is also created whenever a new thread is spawned.
The cred structure is likewise created when the process starts, but—unlike task_struct—it is not duplicated for new threads; instead, the same cred is shared by all threads in the process.
struct cred {
atomic_long_t usage;
kuid_t uid; /* real UID of the task */
kgid_t gid; /* real GID of the task */
kuid_t suid; /* saved UID of the task */
kgid_t sgid; /* saved GID of the task */
kuid_t euid; /* effective UID of the task */
kgid_t egid; /* effective GID of the task */
kuid_t fsuid; /* UID for VFS ops */
kgid_t fsgid; /* GID for VFS ops */
unsigned securebits; /* SUID-less security management */
...
} __randomize_layout;
As it holds the credential data, members such as uid and euid appear here.
3.2.2 prepare_kernel_cred(), commit_creds(), and the init_cred Structure #
The kernel uses two helper functions to manipulate process privileges: prepare_kernel_cred() and commit_creds().
struct cred *prepare_kernel_cred(struct task_struct *daemon)
{
...
if (daemon)
old = get_task_cred(daemon);
else
old = get_cred(&init_cred);
...
}
prepare_kernel_cred() creates a new cred structure with the desired identity.
If it is called with a NULL argument, it invokes get_cred(&init_cred), producing credentials defined in init_cred:
struct cred init_cred = {
.usage = ATOMIC_INIT(4),
.uid = GLOBAL_ROOT_UID,
.gid = GLOBAL_ROOT_GID,
.suid = GLOBAL_ROOT_UID,
.sgid = GLOBAL_ROOT_GID,
.euid = GLOBAL_ROOT_UID,
.egid = GLOBAL_ROOT_GID,
.fsuid = GLOBAL_ROOT_UID,
.fsgid = GLOBAL_ROOT_GID,
...
};
The privileges encoded in init_cred correspond to the root user, i.e., the highest privilege level.
Thus a call such as prepare_kernel_cred(NULL) yields a cred structure with root privileges. Creating the structure alone does not immediately grant root, however; it merely prepares the credentials.
int commit_creds(struct cred *new)
{
struct task_struct *task = current;
const struct cred *old = task->real_cred;
...
if (new->user != old->user || new->user_ns != old->user_ns)
inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);
...
}
commit_creds() is the routine that actually applies new privileges.
The critical lines assign the supplied cred—new—to the current process’s real_cred and cred, thereby changing its rights.
commit_creds(prepare_kernel_cred(NULL));
In practice, chaining the two calls as above elevates the current process to root, but only on kernels earlier than 6.2.
Beginning with Linux 6.2, prepare_kernel_cred() no longer falls back to get_cred(&init_cred) when passed NULL; it returns NULL instead, so the chained call no longer works.
commit_creds(&init_cred);
Nevertheless, init_cred still exists, so passing its address directly to commit_creds() can still achieve privilege escalation.
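The difference between the two call patterns can be illustrated with a small user-space model. Everything here is hypothetical (mock_cred, mock_task, mock_commit_creds(), mock_prepare_kernel_cred_62()); it mirrors only the pointer assignments discussed above and the Linux 6.2 behavior of returning NULL for a NULL argument. The model reports an error for NULL credentials, whereas a real kernel would fault.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced versions of struct cred and task_struct. */
typedef struct mock_cred { unsigned uid, gid, euid, egid; } mock_cred;
typedef struct mock_task {
    const mock_cred *real_cred;
    const mock_cred *cred;
} mock_task;

/* Models init_cred: every ID is root (0). */
static const mock_cred mock_init_cred = { 0, 0, 0, 0 };

/* Models Linux 6.2+: a NULL daemon no longer yields init_cred. */
static const mock_cred *mock_prepare_kernel_cred_62(const mock_task *daemon)
{
    return daemon ? daemon->cred : NULL;
}

/* Models the two pointer assignments in commit_creds(). */
static int mock_commit_creds(mock_task *task, const mock_cred *new_cred)
{
    assert(task != NULL);
    if (new_cred == NULL)
        return -1;  /* the model refuses; a real kernel would oops here */
    task->real_cred = new_cred;
    task->cred = new_cred;
    return 0;
}
```

In this model, committing the result of mock_prepare_kernel_cred_62(NULL) fails, while committing &mock_init_cred sets every ID of the task to 0.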
4. Vulnerability Types #
4.1 Buffer Overflow #
A buffer overflow is a flaw in which input data exceeds the bounds of a buffer and overwrites adjacent memory, a problem that can be just as catastrophic in kernel space as it is in user space. Below, we examine the functions and macros that can cause buffer-overflow vulnerabilities.
4.1.1 Windows #
void RtlCopyMemory(
void* Destination,
const void* Source,
size_t Length
);
RtlCopyMemory() is a macro that copies the contents of a source memory block to a destination memory block. (It can copy data from user memory into kernel memory and vice versa.)
// wdm.h
#define RtlCopyMemory(Destination,Source,Length) memcpy((Destination),(Source),(Length))
As the definition shows, RtlCopyMemory() is merely a wrapper around memcpy(). Because memcpy() performs no bounds checking and copies exactly as many bytes as it is told to, it can introduce buffer-overflow vulnerabilities.
RtlCopyMemory(KernelBuffer, UserBuffer, Size);
The snippet above is vulnerable: the third argument, Size, is supplied by the user, and no size check is performed.
RtlCopyMemory(KernelBuffer, UserBuffer, sizeof(KernelBuffer));
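Passing the destination buffer’s actual size as the third argument keeps the copy within the destination. This works only when KernelBuffer is an array (sizeof applied to a pointer yields the pointer size), and the source must be at least that large. Alternatively, the driver can reject any user-supplied Size that exceeds the buffer before copying.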
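The alternative of validating the length first can be sketched in portable C. bounded_copy() and KERNEL_BUFFER_SIZE are hypothetical names used only for illustration, and plain memcpy() stands in for RtlCopyMemory(); a real driver would also need to probe the user pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define KERNEL_BUFFER_SIZE 64  /* hypothetical destination size */

/*
 * Copy len bytes into dst only if they fit in dst_size.
 * Returns 0 on success and -1 if the request would overflow dst.
 */
static int bounded_copy(void *dst, size_t dst_size, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    if (len > dst_size)
        return -1;  /* reject instead of truncating silently */
    memcpy(dst, src, len);
    return 0;
}
```

Rejecting an oversized request, rather than clamping it, makes the caller aware that its input was invalid; whether to reject or truncate is a design choice that depends on the interface.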
4.1.2 Linux #
unsigned long _copy_from_user(
void *to,
const void __user *from,
unsigned long n
);
_copy_from_user()—as its name suggests—copies data from user space into kernel space. Like RtlCopyMemory() in 4.1.1 Windows, it performs no bounds checking on its third parameter, so a buffer overflow can occur.
static __always_inline unsigned long __must_check
copy_from_user(void *to, const void __user *from, unsigned long n)
{
if (check_copy_size(to, n, false))
n = _copy_from_user(to, from, n);
return n;
}
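To mitigate this risk, Linux provides copy_from_user(), a wrapper around _copy_from_user(). As the source shows, it calls check_copy_size() before performing the copy. That helper compares the requested length with the destination object’s size when the size is known, so many (though not all) oversized copies are rejected before they can overflow the buffer. Drivers should still validate user-supplied lengths themselves.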
4.2 Use After Free #
A use-after-free (UaF) vulnerability arises when a program continues to access memory that has already been freed.
4.2.1 Windows #
typedef struct _USE_AFTER_FREE_NON_PAGED_POOL
{
FunctionPointer Callback;
CHAR Buffer[0x54];
} USE_AFTER_FREE_NON_PAGED_POOL, *PUSE_AFTER_FREE_NON_PAGED_POOL;
typedef struct _FAKE_OBJECT_NON_PAGED_POOL
{
CHAR Buffer[0x54 + sizeof(PVOID)];
} FAKE_OBJECT_NON_PAGED_POOL, *PFAKE_OBJECT_NON_PAGED_POOL;
PUSE_AFTER_FREE_NON_PAGED_POOL g_UseAfterFreeObjectNonPagedPool = NULL;
NTSTATUS AllocateUaFObjectNonPagedPool(VOID)
{
UseAfterFree = ExAllocatePoolWithTag(
NonPagedPool,
sizeof(USE_AFTER_FREE_NON_PAGED_POOL),
(ULONG)POOL_TAG
);
UseAfterFree->Callback = &UaFObjectCallbackNonPagedPool;
g_UseAfterFreeObjectNonPagedPool = UseAfterFree;
...
}
NTSTATUS FreeUaFObjectNonPagedPool(VOID)
{
if (g_UseAfterFreeObjectNonPagedPool)
{
ExFreePoolWithTag((PVOID)g_UseAfterFreeObjectNonPagedPool, (ULONG)POOL_TAG);
}
...
}
NTSTATUS UseUaFObjectNonPagedPool(VOID)
{
if (g_UseAfterFreeObjectNonPagedPool->Callback)
{
g_UseAfterFreeObjectNonPagedPool->Callback();
}
...
}
NTSTATUS AllocateFakeObjectNonPagedPool(PFAKE_OBJECT_NON_PAGED_POOL UserFakeObject)
{
KernelFakeObject = (PFAKE_OBJECT_NON_PAGED_POOL)ExAllocatePoolWithTag(
NonPagedPool,
sizeof(FAKE_OBJECT_NON_PAGED_POOL),
(ULONG)POOL_TAG
);
RtlCopyMemory(
(PVOID)KernelFakeObject,
(PVOID)UserFakeObject,
sizeof(FAKE_OBJECT_NON_PAGED_POOL)
);
...
}
The code above illustrates a UaF scenario. AllocateUaFObjectNonPagedPool() allocates pool memory with ExAllocatePoolWithTag(); FreeUaFObjectNonPagedPool() frees it with ExFreePoolWithTag(); and UseUaFObjectNonPagedPool() dereferences the global pointer g_UseAfterFreeObjectNonPagedPool to invoke its Callback.
Even after the memory is freed, the global pointer still references the same address. Consequently, calling UseUaFObjectNonPagedPool() after the free still executes g_UseAfterFreeObjectNonPagedPool->Callback(). If an attacker overwrites this member with the address of their shell-code, they can escalate privileges.
AllocateFakeObjectNonPagedPool() makes that overwrite possible: if its call to ExAllocatePoolWithTag() happens to reuse the address just freed, RtlCopyMemory() can fill the structure with attacker-controlled data. The overall attack flow is:
ALLOCATE_UAF_OBJECT → FREE_UAF_OBJECT → ALLOCATE_FAKE_OBJECT → USE_UAF_OBJECT
The critical requirement is that the reallocation must occur at the same address. The Windows pool allocator will reuse a freed block if the requested size matches, but adjacent free blocks may coalesce, lowering the chance. Attackers therefore rely on heap-spray techniques to create predictable layouts; the IoCompletionReserve object is often used because of its regular size and pattern.
85407000 size: 60 previous size : 0 (Allocated)IoCo(Protected)
85407060 size : 60 previous size : 60 (Free)IoCo
85407100 size : 60 previous size : 60 (Allocated)IoCo(Protected)
...
By mass-allocating these objects and freeing every other one, attackers obtain the alternating pattern above. A subsequent allocation fills one of the free slots, and freeing it again leaves an isolated hole that will not merge with neighbors. Repeating AllocateFakeObjectNonPagedPool() eventually places the fake object at a chosen address, enabling privilege escalation.
ExFreePoolWithTag((PVOID)g_UseAfterFreeObjectNonPagedPool, (ULONG)POOL_TAG);
g_UseAfterFreeObjectNonPagedPool = NULL;
The root cause is that the pointer still references memory after it is freed. Setting the pointer to NULL immediately after freeing removes the dangling reference; combined with a NULL check before every use, this prevents the freed object from being accessed again.
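A user-space sketch of this free-and-clear pattern follows. The names (uaf_object, g_object, allocate_object(), free_object(), use_object()) are hypothetical, and calloc()/free() stand in for the pool routines; the sketch is single-threaded and ignores the locking a real driver would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void (*callback_fn)(void);

/* Hypothetical object mirroring the callback-plus-buffer layout above. */
typedef struct uaf_object {
    callback_fn callback;
    char buffer[0x54];
} uaf_object;

static uaf_object *g_object = NULL;
static int g_calls = 0;

static void sample_callback(void) { g_calls++; }

static int allocate_object(void)
{
    assert(g_object == NULL);  /* the sketch tracks a single object */
    uaf_object *obj = calloc(1, sizeof *obj);
    if (obj == NULL)
        return -1;
    obj->callback = sample_callback;
    g_object = obj;
    return 0;
}

static void free_object(void)
{
    free(g_object);
    g_object = NULL;  /* clear the pointer so no dangling reference remains */
}

static int use_object(void)
{
    /* Check the pointer itself before touching any member. */
    if (g_object == NULL || g_object->callback == NULL)
        return -1;
    g_object->callback();
    return 0;
}
```

Once free_object() has run, use_object() returns -1 instead of calling through freed memory.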
4.2.2 Linux #
Linux follows the same overall attack flow. Here, we highlight UaF specifics stemming from Linux characteristics.
g_buf = kzalloc(BUFFER_SIZE, GFP_KERNEL);
static int module_close(struct inode *inode, struct file *file)
{
printk(KERN_INFO "module_close called\n");
kfree(g_buf);
return 0;
}
A device driver is accessed via a file descriptor; when that descriptor is closed, module_close() is invoked. In this example, it frees memory with kfree().
int fd1 = open("/dev/driver", O_RDWR);
int fd2 = open("/dev/driver", O_RDWR);
close(fd1);
write(fd2, "Hello", 5);
Multiple file descriptors can remain open simultaneously. Thus, even after one descriptor closes and frees g_buf, another descriptor can still access the global variable and manipulate the freed memory, creating a UaF. As in Windows, attackers use heap-spray tactics; on Linux, /dev/ptmx is popular because its allocations show predictable size and layout.
static int module_close(struct inode *inode, struct file *file)
{
printk(KERN_INFO "module_close called\n");
kfree(g_buf);
g_buf = NULL;
return 0;
}
As in Windows, setting the pointer to NULL after freeing the memory removes the dangling reference, provided that every other code path checks g_buf for NULL before using it.
4.3 NULL Pointer Dereference #
A NULL pointer dereference occurs when a program dereferences a pointer whose value is NULL. Today this error usually cannot be exploited directly, but in the past it was feasible because the kernel could freely access user-space memory and user programs could map the zero page. Below, we review the APIs that were historically used to map the zero page on each OS.
4.3.1 Windows #
On legacy versions of Windows, the documented memory-mapping APIs—VirtualAlloc() and VirtualAllocEx()—could not allocate memory below 0x1000. However, the undocumented function NtAllocateVirtualMemory() imposed no such limit. Attackers would call NtAllocateVirtualMemory() to map the zero page, copy shell-code into it, and then trigger a kernel-mode dereference of a NULL pointer.
NTSYSCALLAPI NTSTATUS NtAllocateVirtualMemory(
HANDLE ProcessHandle,
PVOID *BaseAddress,
ULONG_PTR ZeroBits,
PSIZE_T RegionSize,
ULONG AllocationType,
ULONG Protect
);
NtAllocateVirtualMemory() is now documented. It reserves or commits a region of pages in the user-mode virtual address space of a specified process. On Windows 8 and later, user-mode requests to map the zero page are rejected, so this technique no longer works on current systems.
4.3.2 Linux #
Linux once allowed zero-page mappings via mmap():
static inline unsigned long round_hint_to_min(unsigned long hint)
{
hint &= PAGE_MASK;
if (((void *)hint != NULL) &&
(hint < mmap_min_addr))
return PAGE_ALIGN(mmap_min_addr);
return hint;
}
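Today, an mmap() request without MAP_FIXED has its address hint passed through round_hint_to_min(), which compares the hint with the global variable mmap_min_addr and moves any non-NULL hint below that threshold up to it. Fixed mappings below the threshold are rejected for unprivileged processes by a separate permission check.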
$ sysctl vm.mmap_min_addr
vm.mmap_min_addr = 65536
mmap_min_addr sets the minimum user-space mapping address; its value can be queried with sysctl. On most systems the default is 65,536 bytes (0x10000), meaning unprivileged mappings are allowed only at or above 64 KB.
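The rounding logic can be tried outside the kernel with a small model. The sketch below assumes a 4 KB page size and takes the threshold as a parameter; model_round_hint_to_min() and the MODEL_* macros are hypothetical names that mirror the kernel function shown earlier.

```c
#include <assert.h>

/* Assumption: 4 KB pages, as on typical x86-64 configurations. */
#define MODEL_PAGE_SIZE 4096UL
#define MODEL_PAGE_MASK (~(MODEL_PAGE_SIZE - 1))
#define MODEL_PAGE_ALIGN(x) (((x) + MODEL_PAGE_SIZE - 1) & MODEL_PAGE_MASK)

/* User-space model of round_hint_to_min(); min_addr plays mmap_min_addr. */
static unsigned long model_round_hint_to_min(unsigned long hint,
                                             unsigned long min_addr)
{
    assert((MODEL_PAGE_SIZE & (MODEL_PAGE_SIZE - 1)) == 0);
    hint &= MODEL_PAGE_MASK;            /* drop the offset within the page */
    if (hint != 0 && hint < min_addr)
        return MODEL_PAGE_ALIGN(min_addr);
    return hint;                        /* 0 means "let the kernel choose" */
}
```

With min_addr set to 65536, a hint of 0x1000 is moved up to 0x10000, while a hint of 0x20000 is returned unchanged.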
As shown above, modern memory-mapping APIs disallow zero-page mappings by default, and even if they did not, SMEP (Supervisor Mode Execution Prevention) and SMAP (Supervisor Mode Access Prevention) render NULL-pointer dereference an impractical direct exploit vector. Nevertheless, indirect risks remain: for example, a NULL dereference in the Linux kernel triggers a kernel Oops, which can be leveraged for a denial-of-service attack. Design flaws can also reintroduce vulnerabilities, so it is still essential to verify that a pointer is not NULL before use.
4.4 Double Fetch #
Double Fetch is a race-condition vulnerability—specifically a TOCTOU (Time of Check to Time of Use) bug—that arises between kernel mode and user mode.
Ordinary race conditions typically occur when two pieces of code in the same space (e.g., kernel-to-kernel or user-to-user) run concurrently. By contrast, a double-fetch issue appears during data exchanges across spaces (kernel ↔ user). Let’s see how this happens.
A double-fetch bug occurs when the kernel fetches the same user-space data more than once. Typically, a kernel function reads user data once for validation and again for actual use. If a user thread modifies that data between the two reads, the second access may see a different value, potentially triggering buffer overflows, out-of-bounds accesses, and other flaws.
UserBuffer = UserDoubleFetch->Buffer;
ProbeForRead(UserBuffer, sizeof(KernelBuffer), (ULONG)__alignof(UCHAR));
if (UserDoubleFetch->Size > sizeof(KernelBuffer))
{
Status = STATUS_INVALID_PARAMETER;
return Status;
}
RtlCopyMemory((PVOID)KernelBuffer, UserBuffer, UserDoubleFetch->Size);
The snippet above looks safe: before calling RtlCopyMemory(), the driver compares UserDoubleFetch->Size with the kernel buffer. Yet because UserDoubleFetch->Size is fetched twice (once for the check, once for the copy), an attacker can race two user threads: one that keeps mutating UserDoubleFetch->Size, and another that repeatedly invokes the driver. If the size is enlarged between the check and the copy, a buffer overflow results.
UserBuffer = UserDoubleFetch->Buffer;
UserBufferSize = UserDoubleFetch->Size; /* fetch once */
if (UserBufferSize > sizeof(KernelBuffer))
{
Status = STATUS_INVALID_PARAMETER;
return Status;
}
RtlCopyMemory((PVOID)KernelBuffer, UserBuffer, UserBufferSize);
The root cause is dereferencing user memory more than once. A simple fix is to dereference exactly once, copy the value into a kernel-space local variable, and use that copy thereafter; user space can no longer tamper with it, preserving integrity.
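The single-fetch rule can be sketched as a user-space function. user_request and copy_request() are hypothetical names; the volatile qualifier only models the fact that another thread may change the field at any time, and memcpy() stands in for the kernel copy routine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical request structure that lives in memory the caller controls. */
typedef struct user_request {
    const char *buffer;
    volatile size_t size;  /* may change between any two reads */
} user_request;

/* Returns 0 on success, -1 if the request is invalid or too large. */
static int copy_request(char *kernel_buf, size_t kernel_size,
                        const user_request *req)
{
    assert(kernel_buf != NULL && req != NULL);

    /* Fetch each shared field exactly once into a local copy. */
    const size_t size = req->size;
    const char *src = req->buffer;

    if (src == NULL || size > kernel_size)
        return -1;

    /* Only the local copy is used from here on. */
    memcpy(kernel_buf, src, size);
    return 0;
}
```

Because the check and the copy both use the local variable, a concurrent change to req->size after the fetch cannot enlarge the copy.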
5. Mitigation Techniques #
Because Windows and Linux share comparable kernel architectures, many hardening measures overlap. (This report focuses on kernel-specific defenses; generic mechanisms such as stack canaries and ASLR are not covered.)
5.1 SMEP (Supervisor Mode Execution Prevention) #
SMEP prevents kernel mode from executing code that resides in user-space memory. If SMEP is enabled and the kernel jumps to shell-code stored in user space, a kernel panic ensues. Conceptually, SMEP is similar to NX/DEP but applies specifically to supervisor-mode execution. Although such restrictions can be emulated in software, we focus on the hardware implementation.
5.1.1 CR4 Register #
SMEP, and SMAP discussed next, are hardware features on x86/x86-64 CPUs. In addition to the instruction pointer (RIP) and general-purpose registers (RAX, …), these CPUs expose control registers for system configuration.
SMEP is toggled by bit 20 of CR4:
- 1 → enabled | 0 → disabled
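Testing or clearing a CR4 bit is plain bit arithmetic, which the following sketch shows on an ordinary integer. cr4_bit_set() and cr4_clear_bit() are hypothetical helpers that never touch the real register; the bit positions (20 for SMEP, 21 for SMAP) follow the x86 architecture definition.

```c
#include <assert.h>
#include <stdint.h>

#define CR4_SMEP_BIT 20u
#define CR4_SMAP_BIT 21u

/* Return 1 if the given bit is set in a CR4 value, otherwise 0. */
static int cr4_bit_set(uint64_t cr4, unsigned bit)
{
    assert(bit < 64);
    return (int)((cr4 >> bit) & 1u);
}

/* Return the CR4 value with one bit cleared and all others unchanged. */
static uint64_t cr4_clear_bit(uint64_t cr4, unsigned bit)
{
    assert(bit < 64);
    return cr4 & ~(UINT64_C(1) << bit);
}
```

For a CR4 value of 0x350ef8, cr4_bit_set(value, CR4_SMEP_BIT) returns 1, and cr4_clear_bit(value, CR4_SMEP_BIT) yields 0x250ef8.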
5.1.2 Checking Whether SMEP Is Enabled #
As shown in 5.1.3 SMEP Bypass, kernel code can read/write control registers. For a quick check, however, you can use a debugger or system files.
5.1.2.1 Windows #
0: kd> r cr4
cr4=0000000000350ef8
0: kd> ? ((@cr4 >> 20) & 1)
Evaluate expression: 1 = 00000000`00000001
Reading CR4 in WinDbg and masking bit 20 reveals the SMEP state. For the value above (0x350ef8), bit 20 is 1, so SMEP is enabled.
5.1.2.2 Linux #
$ cat /proc/cpuinfo | grep smep
flags : ... smep bmi2 rdseed adx smap clflushopt ...
If SMEP is active, the smep flag appears in /proc/cpuinfo.
5.1.3 SMEP Bypass #
Both Windows and Linux let kernel-mode code modify control registers, enabling drivers or modules to tweak CPU behavior.
mov rax, 0xFFFEFFFFF ; clear bit 20
mov cr4, rax
ret
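The gadget above is illustrative: it loads CR4 with a value whose bit 20 is clear, which disables SMEP. In a real exploit the attacker writes the current CR4 value with only bit 20 cleared, because setting reserved CR4 bits raises a general-protection fault. By chaining such gadgets in a ROP/JOP sequence, an attacker can bypass SMEP and jump to user-space shell-code.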
5.2 SMAP (Supervisor Mode Access Prevention) #
SMAP is a protection mechanism designed to complement SMEP. Whereas SMEP only forbids execution of user-space code from kernel space, SMAP blocks the read and write access itself. Enabling SMAP therefore provides several security benefits—chief among them, preventing stack-pivoting attacks that rely on user memory.
mov esp, 0x12345678
ret
Even with SMEP enabled, attackers can often find a kernel-space gadget like the one above (or its x64 equivalent that writes to rsp). If they can control RIP, they redirect the stack pointer to a user-mapped region and build a ROP/JOP chain there. Although the shell-code remains unexecutable, the gadgets execute in kernel space, which bypasses SMEP. When SMAP is enabled, however, any attempt by kernel code to read that user-space ROP chain triggers a kernel panic, significantly raising the bar for exploitation.
How, then, do legitimate drivers—which routinely copy data to and from user space—operate while SMAP is on? They temporarily lift the restriction via the stac / clac instructions (“set AC” / “clear AC”):
- The AC (Alignment Check) flag in the EFLAGS register also governs SMAP access.
- When stac sets AC, the CPU allows kernel code to access user memory even while SMAP is active.
- clac clears AC again, re-enabling SMAP’s block. (Note that this does not flip the SMAP bit in CR4.)
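The interaction between the SMAP bit and the AC flag can be summarized as a small decision function. This is a simplified model under stated assumptions: cpu_state, model_stac(), model_clac(), and kernel_access_allowed() are hypothetical names, and only explicit kernel data accesses are considered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the two bits that matter for SMAP. */
typedef struct cpu_state {
    bool cr4_smap;   /* SMAP enabled in CR4 */
    bool eflags_ac;  /* AC flag in EFLAGS */
} cpu_state;

/* Model of stac: set AC, leave CR4 untouched. */
static void model_stac(cpu_state *cpu)
{
    assert(cpu != NULL);
    cpu->eflags_ac = true;
}

/* Model of clac: clear AC, leave CR4 untouched. */
static void model_clac(cpu_state *cpu)
{
    assert(cpu != NULL);
    cpu->eflags_ac = false;
}

/* May kernel code read or write a page with this ownership? */
static bool kernel_access_allowed(const cpu_state *cpu, bool page_is_user)
{
    assert(cpu != NULL);
    if (!page_is_user)
        return true;         /* kernel pages are not affected by SMAP */
    if (!cpu->cr4_smap)
        return true;         /* SMAP disabled: no restriction */
    return cpu->eflags_ac;   /* SMAP enabled: allowed only while AC is set */
}
```

In this model, access to a user page is denied with SMAP on and AC clear, allowed between model_stac() and model_clac(), and the SMAP bit itself never changes.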
static __always_inline __must_check unsigned long
copy_user_generic(void *to, const void *from, unsigned long len)
{
stac();
/*
* If the CPU supports FSRM, use 'rep movs';
* otherwise, fall back to rep_movs_alternative.
*/
asm volatile(
"1:\n\t"
ALTERNATIVE("rep movsb",
"call rep_movs_alternative", ALT_NOT(X86_FEATURE_FSRM))
"2:\n"
_ASM_EXTABLE_UA(1b, 2b)
:"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
: : "memory", "rax");
clac();
return len;
}
As noted in 4.1.2 Buffer Overflow (Linux), Linux’s copy_from_user() ultimately calls copy_user_generic(), which wraps the user-memory copy between stac() and clac(). Because those helpers already exist in the kernel, attackers sometimes build ROP/JOP chains that invoke stac to bypass SMAP, a reminder that most instructions needed for exploitation also appear in legitimate kernel code. (Like SMEP, SMAP is toggled via the CR4 register, in this case bit 21; checking status or flipping the bit uses the same techniques described in 5.1.2 and 5.1.3.)
5.3 KASLR (Kernel Address Space Layout Randomization) #
User space benefits from ASLR (Address Space Layout Randomization); kernel space has an analogous defense: KASLR. Each OS randomizes slightly different regions:
Windows
- Kernel image (ntoskrnl.exe)
- Kernel-mode drivers
- Kernel heap
Linux
- Kernel image
- Kernel-mode drivers (kernel modules)
- Kernel stack
- Kernel heap
Because KASLR is applied only once at boot, leaking any symbol inside the kernel reveals the base address. (To mitigate this, researchers proposed FGKASLR—Function-Granular KASLR—but it has not yet landed in the upstream Linux kernel.) In addition, all drivers share the same kernel address space; even if one driver is well-hardened, another might leak an address and defeat KASLR system-wide.
KASLR is therefore weaker than user-mode ASLR—especially on Linux, which offers at most 9 bits of entropy for x64 kernels, making brute-force attacks feasible. Windows fares better with roughly 18 bits of kernel entropy but is still vulnerable once an address leaks.
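To put the entropy figures in perspective, n bits of entropy correspond to 2^n equally likely base positions. The sketch below does that arithmetic; kaslr_positions() and average_guesses() are hypothetical helpers, and the average assumes the attacker tries each position at most once and can survive failed attempts, which is often not true in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Number of equally likely base positions for a given entropy. */
static uint64_t kaslr_positions(unsigned entropy_bits)
{
    assert(entropy_bits < 64);
    return UINT64_C(1) << entropy_bits;
}

/* Average guesses needed when each position is tried at most once. */
static double average_guesses(unsigned entropy_bits)
{
    return ((double)kaslr_positions(entropy_bits) + 1.0) / 2.0;
}
```

Nine bits give 512 positions (256.5 guesses on average under these assumptions), while 18 bits give 262,144 positions.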
5.4 KPTI (Kernel Page-Table Isolation) #
KPTI was introduced to mitigate Meltdown, a vulnerability found mainly in Intel CPUs that lets user-mode code read kernel-space memory, enabling kernel-memory leaks, KASLR bypass, and more. (A full explanation of Meltdown is outside this post’s scope.)
Without KPTI, a user-mode process’s page table normally maps both kernel-space and user-space addresses to avoid the overhead of TLB flushing. Because system calls and context switches frequently jump into kernel mode, keeping those entries yields TLB hits and better performance, and OS-controlled page protections keep things safe in theory.
Meltdown affected these CPUs because, for performance, their out-of-order execution lets a load bring data into the cache and forward it to dependent instructions before the access-rights check takes effect. The faulting load is eventually discarded, but its cache footprint remains, which lets user code infer kernel data through a cache timing side channel.
KPTI counters this by splitting page tables in two—one for user mode, one for kernel mode. In the user-mode table, all kernel mappings are removed except the bare minimum (e.g., interrupt stubs). Mode switches must now swap page tables, forcing a TLB flush and incurring a performance hit.
The CR3 control register holds the pointer to the active page table, so switching tables also means writing CR3—something ROP chains often need when returning from kernel to user space.
...
0xffffffff81800e7f: or rdi,0x1000
0xffffffff81800e86: mov cr3,rdi
0xffffffff81800e89: pop rax
0xffffffff81800e8a: pop rdi
0xffffffff81800e8b: swapgs
0xffffffff81800e8e: jmp 0xffffffff81800eb0
...
These instructions—part of Linux’s swapgs_restore_regs_and_return_to_usermode—show the page-table switch embedded in the normal kernel-to-user return path, providing useful gadgets for exploits.
Because not every CPU is Meltdown-vulnerable, KPTI is enabled only on affected chips. While it does not directly deter most traditional exploits, enabling it is advisable whenever possible.
5.5 KADR (Kernel Address Display Restriction) #
KADR (Kernel Address Display Restriction) blocks leakage of kernel symbols and addresses so attackers cannot easily map the kernel’s internal layout. When KADR is on, non-privileged users cannot read key files and paths such as /boot/vmlinuz*, /boot/System.map*, /sys/kernel/debug/, /proc/slabinfo, or /proc/kallsyms.
$ cat /proc/kallsyms | grep prepare_kernel_cred
0000000000000000 T prepare_kernel_cred
0000000000000000 t prepare_kernel_cred.cold
/proc/kallsyms is a virtual file listing every kernel symbol. With KADR enabled, addresses are shown as 0.
$ sysctl -w kernel.kptr_restrict=0
$ sysctl -w kernel.perf_event_paranoid=0
Setting kptr_restrict and perf_event_paranoid to 0 via sysctl (which requires root) disables KADR.
$ cat /proc/kallsyms | grep prepare_kernel_cred
ffffffffab12ae90 T prepare_kernel_cred
ffffffffabec1e78 t prepare_kernel_cred.cold
Once disabled, /proc/kallsyms again reveals full symbol names and addresses, eliminating the need for separate info leaks—hence attackers care greatly whether KADR is active.
5.6 Driver Signature Enforcement #
Driver Signature Enforcement verifies the integrity of a device driver and confirms the identity of its publisher. Beginning with Windows Vista, the feature is enabled by default on all 64-bit editions of Windows. Each time a new driver is installed or loaded, Windows checks the following requirements; if any one fails, installation and execution are blocked:
- The driver is signed with a valid code-signing certificate.
- The driver has been validated and signed by the Windows Hardware Developer Center.
- The driver is signed by Microsoft.
With Driver Signature Enforcement enabled, untrusted drivers cannot be loaded, and integrity checks thwart man-in-the-middle tampering. The feature is not perfect, however. The Hardware Developer Center requirement applies only to Windows 10 version 1607 and later, and Microsoft allowed three transitional exceptions:
- Drivers already distributed on earlier Windows versions that are later upgraded to Windows 10.
- Drivers distributed while Secure Boot was disabled in the BIOS.
- Drivers signed with a user certificate that was valid before 29 July 2015, issued by a certificate authority trusted by Windows.
The third exception can be abused. An attacker can sign a malicious driver with a certificate from a trusted CA and back-date the timestamp to appear as though it was signed before 29 July 2015, thereby bypassing Driver Signature Enforcement.
A real-world example involves the LAPSUS$ hacking group, which stole NVIDIA’s code-signing certificates. Malware signed with those certificates appeared within a single day. Because the stolen certificates had been issued before 29 July 2015, the attackers did not even need to tamper with timestamps. This incident shows how far attackers will go to get around Driver Signature Enforcement, and why users should keep it enabled and routinely check for revoked or compromised certificates.
6. Conclusion #
Vulnerabilities in device drivers—components deeply intertwined with the kernel—can have catastrophic consequences on any major OS. Combining multiple protection techniques is therefore essential to minimize bugs and block potential attacks. Continuous research and improvement of driver and kernel defenses will remain vital to maintaining system stability and trustworthiness.
References #
- KernJC: Automated Vulnerable Environment Generation for Linux Kernel Vulnerabilities
- Desktop Operating System Market Share Worldwide | Statcounter Global Stats
- fortunebusinessinsights.com/ko/server-operating-system-market-106601
- Kernel (operating system) - Wikipedia
- Operating Systems for Dummies
- Difference between microkernel and monolithic kernel – IT Release
- Kernel in Operating System - GeeksforGeeks
- Hybrid kernel - Wikipedia
- Hybrid kernel | Microsoft Wiki | Fandom
- Why Is Linux a Monolithic Kernel? | Baeldung on Linux
- Access Tokens - Win32 apps | Microsoft Learn
- EPROCESS structure in Windows Kernel | by S12 - 0x12Dark Development | Medium
- sched.h source code [linux/include/linux/sched.h] - Codebrowser
- cred.h source code [linux/include/linux/cred.h] - Codebrowser
- cred.c source code [linux/kernel/cred.c] - Codebrowser
- prepare_kernel_cred(), commit_creds() 함수란?
- HackSysExtremeVulnerableDriver/Driver/HEVD/Windows/BufferOverflowStack.c at master · hacksysteam/HackSysExtremeVulnerableDriver
- RtlCopyMemory macro (wdm.h) - Windows drivers | Microsoft Learn
- RtlCopyMemory() Vs Memcpy() - NTDEV - OSR Developer Community
- usercopy.c source code [linux/lib/usercopy.c] - Codebrowser
- uaccess.h source code [linux/include/linux/uaccess.h] - Codebrowser
- ExFreePoolWithTag function (wdm.h) - Windows drivers | Microsoft Learn
- HEVD Windows Kernel Exploitation 6: Use-After-Free – Binary Exploitation
- Heap Spray - 기본 Heap Spra.. : 네이버블로그
- Holstein v3: Use-after-Freeの悪用 | PAWNYABLE!
- 05.Null pointer dereference(32bit & 64bit) - TechNote - Lazenca.0x0
- Zero page - Wikipedia
- FuzzySecurity | Windows ExploitDev: Part 12
- NtAllocateVirtualMemory function (ntifs.h) - Windows drivers | Microsoft Learn
- sys.c source code [linux/arch/arm64/kernel/sys.c] - Codebrowser
- mmap.c source code [linux/mm/mmap.c] - Codebrowser
- Linux kernel oops - Wikipedia
- sec17-wang.pdf
- HackSysExtremeVulnerableDriver/Driver/HEVD/Windows/DoubleFetch.c at master · hacksysteam/HackSysExtremeVulnerableDriver
- HEVD writeups - yuvaly0’s blog
- Double Fetch | PAWNYABLE!
- 레지스터 (Register)
- Control register - Wikipedia
- Linux kernel protection
- IA-32e 모드 전환 : 네이버 블로그
- PowerPoint Presentation - Windows SMEP bypass U equals S_0.pdf
- 메모리 보호기법 정리
- x64 Architecture - Windows drivers | Microsoft Learn
- VirtualAlloc function (memoryapi.h) - Win32 apps | Microsoft Learn
- mmap(2) - Linux manual page
- Supervisor Mode Access Prevention - Wikipedia
- CLAC — Clear AC Flag in EFLAGS Register
- STAC — Set AC Flag in EFLAGS Register
- uaccess_64.h source code [linux/arch/x86/include/asm/uaccess_64.h] - Codebrowser
- FGKASLR - CTF Wiki
- Kernel page-table isolation - Wikipedia
- [Linux Kernel] KPTI: Kernel Page-Table Isolation
- Paging in Operating System - GeeksforGeeks
- Segmentation과 Paging(3) - 페이징
- Linux Kernel PWN | 01 From Zero to One
- A Guide for Driver Signature Enforcement for Windows 7/10/11
- Driver Signing - Windows drivers | Microsoft Learn
- Stolen Nvidia certificates used to sign malware—here’s what to do - ThreatDown by Malwarebytes
- Hackers exploit Windows driver signature enforcement loophole for malware persistence | CSO Online
- Windows systems prior to the NT line used simpler, more monolithic kernel structures.
- System processes are those launched directly by the OS. They start with the system itself and run with high privileges to maintain stability and management.
- lsass.exe (Local Security Authority Subsystem Service) manages user logins and security policies.
- Most system processes run under the NT AUTHORITY\SYSTEM account.
- Task is the Linux-kernel term equivalent to a process.
- Function wrapping encloses one function within another to extend or modify behavior. Here it standardizes usage in specific environments.
- The IoCompletionReserve structure, used for asynchronous I/O completion, is predictably allocated and often reused, making it useful for heap spraying.
- /dev/ptmx (pseudo-terminal master multiplexer) manages virtual terminals and mediates user interactions.
- A page is a fixed-size block of memory; the zero page is the page at address 0.
- Internally, mmap() calls round_hint_to_min() via ksys_mmap_pgoff() and do_mmap().
- sysctl lets you query or alter kernel parameters at runtime.
- On Windows, threads are created with CreateThread(); on Linux with pthread_create().
- A kernel panic is a fatal error after which the OS halts; on Windows this appears as a blue screen of death (BSOD).
- Intel and AMD are the primary manufacturers of x86 / x86-64 CPUs.
- On ARM, features analogous to SMEP/SMAP are called PXN and PAN.
- Formerly called KAISER; Windows implements a similar concept as KVA Shadow.
- The TLB (Translation Lookaside Buffer) caches virtual-to-physical address translations; context switches flush it for correctness.
- Out-of-order execution lets CPUs run instructions non-sequentially for higher throughput.
- Virtual files are generated by the kernel on demand and are not stored on disk.
- kernel.kptr_restrict controls the exposure level of kernel pointer addresses.
- kernel.perf_event_paranoid limits unprivileged access to performance-monitoring data.
- sysctl typically requires root privileges (sudo).