Device Driver Vulnerability Types and Kernel Security Techniques


1. Overview

Figure 1. Number of kernel vulnerabilities since 2010

Because the kernel is the operating system’s core and holds ultimate control over every part of the system, it remains a persistent target for attackers. Kernel vulnerabilities that enable such attacks can arise in many components, including device drivers and various subsystems. Device drivers, in particular, often run in kernel mode, making them prime targets for exploitation. This report examines vulnerability types stemming from device drivers and reviews the protection mechanisms that major desktop operating systems (Windows and Linux) employ to prevent them.

2. Kernel Types

Kernels come in several varieties—monolithic kernels, microkernels, hybrid kernels, and exokernels. Among these, monolithic, micro, and hybrid kernels are the most widely used.

2.1 Monolithic Kernel

Figure 2. Architecture of a monolithic kernel

A monolithic kernel—sometimes called a “single” kernel—executes all core operating-system functions within the kernel itself. Because essential services such as the file system, networking, device drivers, and process management all run in kernel space, monolithic kernels are typically faster and more efficient than other designs. Device drivers also run in kernel mode for direct hardware access and quick interaction with core OS facilities. The downside of this tightly integrated structure is that a fault in one component can impact the entire system, and the high interdependency among components can make debugging difficult.

2.2 Microkernel

In contrast to the “all-in-one” approach of a monolithic kernel, a microkernel provides only the minimal mechanisms necessary for an OS. Most services—including device drivers—run in user mode, which limits unrestricted access to low-level system resources. This isolation reduces the likelihood that a single fault will cascade across the whole system. However, frequent context switches and message passing—for example, between file services—can hurt performance compared with a monolithic design.

2.3 Hybrid Kernel

Figure 4. Architecture of a hybrid kernel

A hybrid kernel combines concepts from monolithic and microkernels. Apart from adding extra code in kernel space for performance, it resembles a pure microkernel and is sometimes called a “modified microkernel.” For device drivers, most run in kernel mode for speed, while some may run in user mode when stability and security warrant it.

2.4 Kernel Architectures Used by Major Desktop OSes

Figure 5. Global desktop OS market share

In the global desktop OS market, Windows holds the largest share, followed by macOS and Linux. On the server side, Linux leads, with Windows close behind. Among these, Windows (Windows NT)¹ employs a hybrid kernel, while Linux uses a monolithic kernel. As noted above, both architectures execute device drivers in kernel mode, which raises the likelihood of kernel-level vulnerabilities.

3. Purpose of Kernel Exploitation

The main objective of a kernel exploit is privilege escalation. Once an attacker gains higher-level privileges, they can arbitrarily change system settings, steal or manipulate sensitive data, and ultimately take full control of the machine. Windows and Linux have fundamentally different system architectures, including their privilege-management models. Before examining common vulnerability types and defense techniques, we briefly outline each OS’s privilege-management structure and the flow of a privilege-escalation attack.

3.1 Privilege Management in Windows

In Windows, access-control information is stored in an Access Token, which can be reached from the EPROCESS (Executive Process) structure residing in kernel memory.

3.1.1 `EPROCESS` Structure

When a process first starts, Windows creates this structure, so every process has an EPROCESS. The structure holds everything the OS needs to manage the process, including its Access Token.

dt nt!_eprocess

Running the command above in the Windows debugger, WinDbg, lists the members of the EPROCESS structure:

0: kd> dt nt!_eprocess
   +0x000 Pcb              : _KPROCESS
...
   +0x4b8 Token            : _EX_FAST_REF
...

As shown, the structure contains a Token member.
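The Token member is an _EX_FAST_REF rather than a plain pointer: on 64-bit Windows the low 4 bits of the stored value hold a small reference count. As a minimal user-space sketch (the helper names are hypothetical, not Windows APIs), masking those bits recovers the token object's address:

```c
#include <stdint.h>

/* Assumption: 64-bit Windows, where _EX_FAST_REF keeps a reference
 * count in the low 4 bits of the stored pointer value. */
#define FAST_REF_MASK 0xFULL

/* Hypothetical helper: strip the reference count to get the object address. */
static uint64_t fast_ref_to_pointer(uint64_t fast_ref)
{
    return fast_ref & ~FAST_REF_MASK;
}

/* Hypothetical helper: extract the embedded reference count. */
static unsigned fast_ref_count(uint64_t fast_ref)
{
    return (unsigned)(fast_ref & FAST_REF_MASK);
}
```

For an illustrative raw value of 0xffffd30983f4b22a, the sketch yields the address 0xffffd30983f4b220 and a count of 10.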

3.1.2 Token Overwrite

As explained in 3.1.1 EPROCESS Structure, every process—including system processes²—has an EPROCESS structure. Because Windows identifies privileges through an Access Token, overwriting a user process’s token with that of a system process enables privilege escalation.

Let’s see how to find the Access Token for the system process lsass.exe³ using WinDbg:

0: kd> !process 0 0 lsass.exe
PROCESS ffffad08231af080
    SessionId: 0  Cid: 02c0    Peb: 77b03a5000  ParentCid: 0214
    DirBase: 101901000  ObjectTable: ffffd3098618b800  HandleCount: 1287.
    Image: lsass.exe

First, run !process 0 0 lsass.exe to obtain basic information about the process. The address shown after PROCESS (ffffad08231af080) is the starting address of lsass.exe’s EPROCESS structure.

0: kd> !process ffffad08231af080 7
PROCESS ffffad08231af080
    SessionId: 0  Cid: 02c0    Peb: 77b03a5000  ParentCid: 0214
    DirBase: 101901000  ObjectTable: ffffd3098618b800  HandleCount: 1287.
    Image: lsass.exe
    VadRoot ffffad0824b55150 Vads 152 Clone 0 Private 1351. Modified 394. Locked 3.
    DeviceMap ffffd3098245d1e0
    Token                             ffffd30983f4b220
    ElapsedTime                       00:09:02.496
    UserTime                          00:00:00.078
    KernelTime                        00:00:00.046
...

Passing that address to !process with the 7 flag yields more detail, including the Token member’s address.

0: kd> !token ffffd30983f4b220
_TOKEN 0xffffd30983f4b220
TS Session ID: 0
User: S-1-5-18
User Groups: 
 00 S-1-5-32-544
    Attributes - Default Enabled Owner 
 01 S-1-1-0
    Attributes - Mandatory Default Enabled 
 02 S-1-5-11
    Attributes - Mandatory Default Enabled 
 03 S-1-16-16384
    Attributes - GroupIntegrity GroupIntegrityEnabled 
Primary Group: S-1-5-18
Privs: 
 02 0x000000002 SeCreateTokenPrivilege            Attributes - Enabled 
 03 0x000000003 SeAssignPrimaryTokenPrivilege     Attributes - 
 04 0x000000004 SeLockMemoryPrivilege             Attributes - Enabled Default 
 05 0x000000005 SeIncreaseQuotaPrivilege          Attributes - 
 07 0x000000007 SeTcbPrivilege                    Attributes - Enabled Default 
 08 0x000000008 SeSecurityPrivilege               Attributes - 
 09 0x000000009 SeTakeOwnershipPrivilege          Attributes - 
...

Running !token with that address displays detailed privilege information for the Access Token. If you write this token’s value into a user process’s Token member, the user process will inherit lsass.exe’s privileges—achieving privilege escalation. (In an actual exploit, this procedure is automated in code.)

After overwriting the user process’s Access Token with one from a system process, the process effectively runs as NT AUTHORITY\SYSTEM⁴, the highest privilege level on Windows, able to access almost all system resources.
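The automated version of this procedure usually walks the kernel's list of processes, finds a system process, and copies its token into the current process. The following is a user-space model of that logic only: `mock_process` is a hypothetical stand-in for EPROCESS (real field offsets differ per Windows build), and PID 4 is used because it is the System process.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for EPROCESS; real layouts differ per Windows build. */
struct mock_process {
    uint32_t pid;
    uint64_t token;              /* stands in for the Token member */
    struct mock_process *next;   /* stands in for the circular process list */
};

/* Walk the circular list from `self`, find `source_pid`, and copy its token.
 * Returns 0 on success, -1 if the source process is not in the list. */
static int steal_token(struct mock_process *self, uint32_t source_pid)
{
    struct mock_process *p = self->next;

    while (p != self) {
        if (p->pid == source_pid) {
            self->token = p->token;
            return 0;
        }
        p = p->next;
    }
    return -1;
}
```

In a real exploit the same walk runs as kernel-mode shellcode against live structures; the model only shows why a single pointer-sized write is enough to change a process's identity.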

3.2 Privilege Management in Linux

In Linux, access-control information is stored in the cred structure, which can be reached from the task_struct⁵ residing in kernel memory.

3.2.1 `task_struct` and `cred` Structures

struct task_struct {
    ...
	/* Process credentials: */

	/* Tracer's credentials at attach: */
	const struct cred __rcu		*ptracer_cred;

	/* Objective and real subjective task credentials (COW): */
	const struct cred __rcu		*real_cred;

	/* Effective (overridable) subjective task credentials (COW): */
	const struct cred __rcu		*cred;
    ...
}

A task_struct is created when a process first starts, and a new one is also created whenever a new thread is spawned. The cred structure is likewise created when the process starts, but—unlike task_struct—it is not duplicated for new threads; instead, the same cred is shared by all threads in the process.

struct cred {
	atomic_long_t	usage;
	kuid_t		uid;		/* real UID of the task */
	kgid_t		gid;		/* real GID of the task */
	kuid_t		suid;		/* saved UID of the task */
	kgid_t		sgid;		/* saved GID of the task */
	kuid_t		euid;		/* effective UID of the task */
	kgid_t		egid;		/* effective GID of the task */
	kuid_t		fsuid;		/* UID for VFS ops */
	kgid_t		fsgid;		/* GID for VFS ops */
	unsigned	securebits;	/* SUID-less security management */
...
} __randomize_layout;

Because cred holds the credential data, it contains members such as uid and euid.

3.2.2 `prepare_kernel_cred()`, `commit_creds()`, and the `init_cred` Structure

The kernel uses two helper functions to manipulate process privileges: prepare_kernel_cred() and commit_creds().

struct cred *prepare_kernel_cred(struct task_struct *daemon)
{
...
	if (daemon)
		old = get_task_cred(daemon);
	else
		old = get_cred(&init_cred);
...
}

prepare_kernel_cred() creates a new cred structure with the desired identity. If it is called with a NULL argument, it invokes get_cred(&init_cred), producing credentials defined in init_cred:

struct cred init_cred = {
	.usage			= ATOMIC_INIT(4),
	.uid			= GLOBAL_ROOT_UID,
	.gid			= GLOBAL_ROOT_GID,
	.suid			= GLOBAL_ROOT_UID,
	.sgid			= GLOBAL_ROOT_GID,
	.euid			= GLOBAL_ROOT_UID,
	.egid			= GLOBAL_ROOT_GID,
	.fsuid			= GLOBAL_ROOT_UID,
	.fsgid			= GLOBAL_ROOT_GID,
...
};

The privileges encoded in init_cred correspond to the root user, i.e., the highest privilege level. Thus a call such as prepare_kernel_cred(NULL) yields a cred structure with root privileges. Creating the structure alone does not immediately grant root, however; it merely prepares the credentials.

int commit_creds(struct cred *new)
{
	struct task_struct *task = current;
	const struct cred *old = task->real_cred;
...
	if (new->user != old->user || new->user_ns != old->user_ns)
		inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
	rcu_assign_pointer(task->real_cred, new);
	rcu_assign_pointer(task->cred, new);
...
}

commit_creds() is the routine that actually applies new privileges. The critical lines assign the supplied cred—new—to the current process’s real_cred and cred, thereby changing its rights.

commit_creds(prepare_kernel_cred(NULL));

In practice, chaining the two calls as above elevates the current process to root, but only on kernels earlier than 6.2. Beginning with Linux 6.2, prepare_kernel_cred() no longer falls back to get_cred(&init_cred) when passed NULL; it returns NULL instead.

commit_creds(&init_cred);

Nevertheless, init_cred still exists, so passing it directly to commit_creds() can still achieve privilege escalation.
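The pointer swap at the heart of commit_creds() can be modeled in user space. This is a sketch under stated assumptions, not kernel code: `mock_cred`, `mock_task`, and `mock_commit_creds()` are hypothetical names, and the model returns an error for a NULL credential where a real kernel would dereference it and oops.

```c
#include <stddef.h>

/* User-space stand-ins for the kernel types; only the fields needed here. */
struct mock_cred { unsigned uid, gid, euid, egid; };
struct mock_task { const struct mock_cred *real_cred, *cred; };

/* Models init_cred: every ID is root (0). */
static const struct mock_cred mock_init_cred = { 0, 0, 0, 0 };

/* Models commit_creds(): point both credential pointers at `new_cred`. */
static int mock_commit_creds(struct mock_task *task,
                             const struct mock_cred *new_cred)
{
    if (new_cred == NULL)
        return -1;   /* model only: a real kernel would crash here */
    task->real_cred = new_cred;
    task->cred = new_cred;
    return 0;
}
```

The NULL case shows why the chained call stops working once prepare_kernel_cred(NULL) returns NULL, while passing the root credentials directly still succeeds.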

4. Vulnerability Types

4.1 Buffer Overflow

A buffer overflow is a flaw in which input data exceed a memory boundary and overwrite adjacent data—a problem that can be just as catastrophic in kernel space as it is in user space. Below, we examine the functions and macros that can cause buffer-overflow vulnerabilities.

4.1.1 Windows

void RtlCopyMemory(
   void*       Destination,
   const void* Source,
   size_t      Length
);

RtlCopyMemory() is a macro that copies the contents of a source memory block to a destination memory block. (It can copy data from user memory into kernel memory and vice versa.)

// wdm.h
#define RtlCopyMemory(Destination,Source,Length) memcpy((Destination),(Source),(Length))

As the macro definition shows, RtlCopyMemory() is merely a wrapper around memcpy()⁶. Because memcpy() performs no bounds checking and copies exactly as many bytes as requested, an unchecked length can introduce a buffer overflow.

RtlCopyMemory(KernelBuffer, UserBuffer, Size);

The snippet above is vulnerable: the third argument, Size, is supplied by the user, and no size check is performed.

RtlCopyMemory(KernelBuffer, UserBuffer, sizeof(KernelBuffer));

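Passing the destination buffer’s actual size as the third argument keeps the copy within the destination. (This works only when KernelBuffer is a true array, so that sizeof yields the full buffer size rather than the size of a pointer; the source must also be at least that large.)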
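When the length genuinely has to come from the caller, the usual pattern is to validate it against the destination size before copying. A minimal sketch in portable C follows; `copy_checked()` is a hypothetical helper, not a WDK routine, and a real driver would additionally have to validate the user pointer itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `len` bytes only if they fit in the destination.
 * Returns 0 on success, -1 if the request is rejected. */
static int copy_checked(void *dst, size_t dst_size,
                        const void *src, size_t len)
{
    if (dst == NULL || src == NULL || len > dst_size)
        return -1;
    memcpy(dst, src, len);
    return 0;
}
```

Rejecting an oversized request outright, rather than silently truncating it, makes the failure visible to the caller.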

4.1.2 Linux

unsigned long _copy_from_user(
	void *to,
	const void __user *from,
	unsigned long n
);

_copy_from_user()—as its name suggests—copies data from user space into kernel space. Like RtlCopyMemory() in 4.1.1 Windows, it performs no bounds checking on its third parameter, so a buffer overflow can occur.

static __always_inline unsigned long __must_check
copy_from_user(void *to, const void __user *from, unsigned long n)
{
	if (check_copy_size(to, n, false))
		n = _copy_from_user(to, from, n);
	return n;
}

To mitigate this risk, Linux provides copy_from_user(), a wrapper around _copy_from_user(). As the source shows, it calls check_copy_size() before performing the copy, validating the requested length against the destination object’s size when that size can be determined, which catches many buffer-overflow bugs before any data is copied.

4.2 Use After Free

A use-after-free (UaF) vulnerability arises when a program continues to access memory that has already been freed.

4.2.1 Windows

typedef struct _USE_AFTER_FREE_NON_PAGED_POOL
{
    FunctionPointer Callback;
    CHAR Buffer[0x54];
} USE_AFTER_FREE_NON_PAGED_POOL, *PUSE_AFTER_FREE_NON_PAGED_POOL;

typedef struct _FAKE_OBJECT_NON_PAGED_POOL
{
    CHAR Buffer[0x54 + sizeof(PVOID)];
} FAKE_OBJECT_NON_PAGED_POOL, *PFAKE_OBJECT_NON_PAGED_POOL;

PUSE_AFTER_FREE_NON_PAGED_POOL g_UseAfterFreeObjectNonPagedPool = NULL;

NTSTATUS AllocateUaFObjectNonPagedPool(VOID)
{
	UseAfterFree = ExAllocatePoolWithTag(
		NonPagedPool,
		sizeof(USE_AFTER_FREE_NON_PAGED_POOL),
		(ULONG)POOL_TAG
	);
	UseAfterFree->Callback = &UaFObjectCallbackNonPagedPool;
	g_UseAfterFreeObjectNonPagedPool = UseAfterFree;
	...
}

NTSTATUS FreeUaFObjectNonPagedPool(VOID)
{
	if (g_UseAfterFreeObjectNonPagedPool)
	{
		ExFreePoolWithTag((PVOID)g_UseAfterFreeObjectNonPagedPool, (ULONG)POOL_TAG);
	}
	...
}

NTSTATUS UseUaFObjectNonPagedPool(VOID)
{
	if (g_UseAfterFreeObjectNonPagedPool->Callback)
	{
		g_UseAfterFreeObjectNonPagedPool->Callback();
	}
	...
}

NTSTATUS AllocateFakeObjectNonPagedPool(PFAKE_OBJECT_NON_PAGED_POOL UserFakeObject)
{
	KernelFakeObject = (PFAKE_OBJECT_NON_PAGED_POOL)ExAllocatePoolWithTag(
		NonPagedPool,
		sizeof(FAKE_OBJECT_NON_PAGED_POOL),
		(ULONG)POOL_TAG
	);

	RtlCopyMemory(
		(PVOID)KernelFakeObject,
		(PVOID)UserFakeObject,
		sizeof(FAKE_OBJECT_NON_PAGED_POOL)
	);
    ...
}

The code above illustrates a UaF scenario. AllocateUaFObjectNonPagedPool() allocates pool memory with ExAllocatePoolWithTag(); FreeUaFObjectNonPagedPool() frees it with ExFreePoolWithTag(); and UseUaFObjectNonPagedPool() dereferences the global pointer g_UseAfterFreeObjectNonPagedPool to invoke its Callback.

Even after the memory is freed, the global pointer still references the same address. Consequently, calling UseUaFObjectNonPagedPool() after the free still executes g_UseAfterFreeObjectNonPagedPool->Callback(). If an attacker overwrites this member with the address of their shell-code, they can escalate privileges.

AllocateFakeObjectNonPagedPool() makes that overwrite possible: if its call to ExAllocatePoolWithTag() happens to reuse the address just freed, RtlCopyMemory() can fill the structure with attacker-controlled data. The overall attack flow is:

ALLOCATE_UAF_OBJECT
FREE_UAF_OBJECT
ALLOCATE_FAKE_OBJECT
USE_UAF_OBJECT

The critical requirement is that the reallocation must occur at the same address. Windows’s allocator will reuse a freed block if the requested size matches, but blocks may coalesce on free, lowering the chance. Attackers therefore rely on heap-spray techniques to create predictable layouts; the IoCompletionReserve⁷ structure is often used because of its regular size and pattern.

85407000 size:  60 previous size:   0  (Allocated) IoCo (Protected)
85407060 size:  60 previous size:  60  (Free)      IoCo
85407100 size:  60 previous size:  60  (Allocated) IoCo (Protected)
...

By mass-allocating these objects and freeing every other one, attackers obtain the alternating pattern above. A subsequent allocation fills one of the free slots, and freeing it again leaves an isolated hole that will not merge with neighbors. Repeating AllocateFakeObjectNonPagedPool() eventually places the fake object at a chosen address, enabling privilege escalation.

ExFreePoolWithTag((PVOID)g_UseAfterFreeObjectNonPagedPool, (ULONG)POOL_TAG);
g_UseAfterFreeObjectNonPagedPool = NULL;

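The root cause is that the pointer still references memory after it is freed. Setting the pointer to NULL immediately after freeing removes the dangling reference; every code path that uses the pointer must then check it for NULL before dereferencing it, otherwise the UaF simply becomes a NULL pointer dereference.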
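The free-then-clear pattern, together with the NULL check it relies on, can be sketched as a user-space model. The names below are hypothetical and malloc()/free() stand in for the pool allocator:

```c
#include <stdlib.h>

typedef void (*callback_t)(void);

struct uaf_object {
    callback_t callback;
    char buffer[0x54];
};

static struct uaf_object *g_object;   /* models the global pointer */
static int g_callback_runs;           /* counts callback invocations */

static void sample_callback(void) { g_callback_runs++; }

static int allocate_object(void)
{
    g_object = malloc(sizeof(*g_object));
    if (g_object == NULL)
        return -1;
    g_object->callback = sample_callback;
    return 0;
}

static void free_object(void)
{
    free(g_object);
    g_object = NULL;          /* drop the dangling reference */
}

static int use_object(void)
{
    if (g_object == NULL || g_object->callback == NULL)
        return -1;            /* nothing valid to call */
    g_object->callback();
    return 0;
}
```

After free_object(), use_object() returns -1 instead of calling through stale memory. The model is single-threaded; a driver would also need a lock so that a free cannot race a concurrent use.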

4.2.2 Linux

Linux follows the same overall attack flow. Here, we highlight UaF specifics stemming from Linux characteristics.

g_buf = kzalloc(BUFFER_SIZE, GFP_KERNEL);

static int module_close(struct inode *inode, struct file *file)
{
  printk(KERN_INFO "module_close called\n");
  kfree(g_buf);
  return 0;
}

A device driver is accessed via a file descriptor; when that descriptor is closed, module_close() is invoked. In this example, it frees memory with kfree().

int fd1 = open("/dev/driver", O_RDWR);
int fd2 = open("/dev/driver", O_RDWR);
close(fd1);
write(fd2, "Hello", 5);

Multiple file descriptors can remain open simultaneously. Thus, even after one descriptor closes and frees g_buf, another descriptor can still access the global variable and manipulate the freed memory—creating a UaF. As in Windows, attackers use heap-spray tactics; on Linux, /dev/ptmx⁸ is popular because its allocations show predictable size and layout.

static int module_close(struct inode *inode, struct file *file)
{
  printk(KERN_INFO "module_close called\n");
  kfree(g_buf);
  g_buf = NULL;
  return 0;
}

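As in Windows, setting the pointer to NULL after freeing the memory removes the dangling reference, provided that every handler that uses g_buf checks it for NULL first and that access to it is serialized, since another thread could otherwise race the free.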
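Because the problem here comes from several descriptors sharing one buffer, another common approach is to count open descriptors and free the buffer only when the last one closes. The following user-space model illustrates the idea; `dev_open()`, `dev_close()`, and `dev_write()` are hypothetical stand-ins for the driver's handlers, and the model is single-threaded, so a real module would need a mutex around this state.

```c
#include <stdlib.h>
#include <string.h>

#define BUFFER_SIZE 64

static char *g_buf;        /* shared buffer, as in the driver example */
static int g_open_count;   /* number of descriptors currently open */

static int dev_open(void)
{
    if (g_open_count == 0) {
        g_buf = calloc(1, BUFFER_SIZE);
        if (g_buf == NULL)
            return -1;
    }
    g_open_count++;
    return 0;
}

static void dev_close(void)
{
    /* Free only when the last descriptor goes away. */
    if (g_open_count > 0 && --g_open_count == 0) {
        free(g_buf);
        g_buf = NULL;
    }
}

static long dev_write(const char *data, size_t len)
{
    if (g_buf == NULL || len > BUFFER_SIZE)
        return -1;
    memcpy(g_buf, data, len);
    return (long)len;
}
```

With two descriptors open, closing the first leaves the buffer valid for the second; only after the second close does a write fail cleanly.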

4.3 NULL Pointer Dereference

A NULL pointer dereference error occurs when a program dereferences a pointer that points to NULL. Today this error cannot be exploited directly, but in the past it was feasible because the kernel could freely access user-space memory and user programs could map the zero page⁹. Below, we review the APIs that were historically used to map the zero page on each OS.

4.3.1 Windows

On legacy versions of Windows, the documented memory-mapping APIs—VirtualAlloc() and VirtualAllocEx()—could not allocate memory below 0x1000. However, the undocumented function NtAllocateVirtualMemory() imposed no such limit. Attackers would call NtAllocateVirtualMemory() to map the zero page, copy shell-code into it, and then trigger a kernel-mode dereference of a NULL pointer.

NTSYSCALLAPI NTSTATUS NtAllocateVirtualMemory(
	HANDLE    ProcessHandle,
	PVOID     *BaseAddress,
	ULONG_PTR ZeroBits,
	SIZE_T    RegionSize,
	ULONG     AllocationType,
	ULONG     Protect
);

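NtAllocateVirtualMemory() is now documented; it reserves or commits pages in the virtual address space of a process. Its BaseAddress parameter points to a variable holding the requested address, and a NULL value there asks the system to choose an address rather than to map page zero. In addition, Windows 8 and later refuse user-mode requests to map the lowest pages of the address space, which prevents zero-page mappings.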

4.3.2 Linux

Linux once allowed zero-page mappings via mmap():

static inline unsigned long round_hint_to_min(unsigned long hint)
{
	hint &= PAGE_MASK;
	if (((void *)hint != NULL) &&
	    (hint < mmap_min_addr))
		return PAGE_ALIGN(mmap_min_addr);
	return hint;
}

Today, every mmap() call passes through round_hint_to_min()¹⁰, which compares the requested address hint with the global variable mmap_min_addr and raises any non-NULL hint below that threshold to mmap_min_addr.

$ sysctl vm.mmap_min_addr
vm.mmap_min_addr = 65536

mmap_min_addr sets the minimum user-space mapping address; its value can be queried with sysctl¹¹. On most systems the default is 65,536 bytes (0x10000), meaning mappings are allowed only at or above 64 KB.
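To make the rounding concrete, here is a user-space model of round_hint_to_min(). It assumes 4 KiB pages and takes the threshold as a parameter instead of reading the kernel's global; the `_ASSUMED` macros and `round_hint_model()` are illustrative names.

```c
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096UL                       /* assumption: 4 KiB pages */
#define PAGE_MASK_ASSUMED (~(PAGE_SIZE_ASSUMED - 1))

/* Round an address up to the next page boundary. */
static unsigned long page_align(unsigned long addr)
{
    return (addr + PAGE_SIZE_ASSUMED - 1) & PAGE_MASK_ASSUMED;
}

/* Model of round_hint_to_min(): page-align the hint, then lift any
 * non-zero hint below `min_addr` up to `min_addr`. */
static unsigned long round_hint_model(unsigned long hint, unsigned long min_addr)
{
    hint &= PAGE_MASK_ASSUMED;
    if (hint != 0 && hint < min_addr)
        return page_align(min_addr);
    return hint;
}
```

With a threshold of 65536, a hint of 0x1000 becomes 0x10000, a hint of 0x20123 is only page-aligned to 0x20000, and a hint of 0 passes through unchanged because a NULL hint means "let the kernel choose".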

As shown above, modern memory-mapping APIs disallow zero-page mappings by default, and even if they did not, SMEP (Supervisor Mode Execution Prevention) and SMAP (Supervisor Mode Access Prevention) render NULL-pointer dereference an impractical direct exploit vector. Nevertheless, indirect risks remain: for example, a NULL dereference in the Linux kernel triggers a kernel Oops, which can be leveraged for a denial-of-service attack. Design flaws can also reintroduce vulnerabilities, so it is still essential to verify that a pointer is not NULL before use.

4.4 Double Fetch

Double Fetch is a race-condition vulnerability—specifically a TOCTOU (Time of Check to Time of Use) bug—that arises between kernel mode and user mode.

Ordinary race conditions typically occur when two pieces of code in the same space (e.g., kernel-to-kernel or user-to-user) run concurrently. By contrast, a double-fetch issue appears during data exchanges across spaces (kernel ↔ user). Let’s see how this happens.

Figure 6. Double-fetch vulnerability flow

A double-fetch bug occurs when the kernel fetches the same user-space data more than once. In the diagram, the kernel function reads user data once for validation and again for actual use. If a user thread modifies that data between the two reads, the second access may see a different value—potentially triggering buffer overflows, out-of-bounds accesses, and other flaws.

UserBuffer = UserDoubleFetch->Buffer;
ProbeForRead(UserBuffer, sizeof(KernelBuffer), (ULONG)__alignof(UCHAR));

if (UserDoubleFetch->Size > sizeof(KernelBuffer))
{
	Status = STATUS_INVALID_PARAMETER;
	return Status;
}

RtlCopyMemory((PVOID)KernelBuffer, UserBuffer, UserDoubleFetch->Size);

The snippet above looks safe: before calling RtlCopyMemory(), the driver compares UserDoubleFetch->Size with the size of the kernel buffer. Yet because UserDoubleFetch->Size is fetched twice (once for the check, once for the copy), an attacker can race two user threads: one that keeps mutating UserDoubleFetch->Size, and another that repeatedly invokes the driver. If the size is enlarged between the check and the copy, a buffer overflow results.¹²

UserBuffer      = UserDoubleFetch->Buffer;
UserBufferSize  = UserDoubleFetch->Size;   /* fetch once */

if (UserBufferSize > sizeof(KernelBuffer))
{
	Status = STATUS_INVALID_PARAMETER;
	return Status;
}

RtlCopyMemory((PVOID)KernelBuffer, UserBuffer, UserBufferSize);

The root cause is dereferencing user memory more than once. A simple fix is to dereference exactly once, copy the value into a kernel-space local variable, and use that copy thereafter; user space can no longer tamper with it, preserving integrity.
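The race itself is timing-dependent, but its effect can be reproduced deterministically by simulating user memory that changes between reads. In the sketch below, `fetch_user_size()` is a hypothetical stand-in for reading UserDoubleFetch->Size: it returns a small value on the first read and a large one afterwards, as if an attacker thread had flipped it.

```c
#include <stddef.h>

#define KERNEL_BUFFER_SIZE 64

static unsigned g_fetches;   /* how many times "user memory" has been read */

/* Simulated user memory: the attacker flips the size after the first read. */
static size_t fetch_user_size(void)
{
    return (g_fetches++ == 0) ? 8 : 4096;
}

static void reset_user_memory(void) { g_fetches = 0; }

/* Double fetch: one read for the check, another for the copy length.
 * Returns the length that would reach the copy, or 0 if rejected. */
static size_t length_double_fetch(void)
{
    if (fetch_user_size() > KERNEL_BUFFER_SIZE)
        return 0;
    return fetch_user_size();
}

/* Single fetch: the checked value is the value used. */
static size_t length_single_fetch(void)
{
    size_t size = fetch_user_size();

    if (size > KERNEL_BUFFER_SIZE)
        return 0;
    return size;
}
```

The double-fetch variant passes the check with 8 and then hands 4096 to the copy, while the single-fetch variant can only ever use the value it validated.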

5. Mitigation Techniques

Because both Windows and Linux run device drivers in kernel mode on the same hardware, many hardening measures overlap. (This report focuses on kernel-specific defenses; generic mechanisms such as stack canaries and ASLR are not covered.)

5.1 SMEP (Supervisor Mode Execution Prevention)

SMEP prevents kernel mode from executing code that resides in user-space memory. If SMEP is enabled and the kernel jumps to shell-code stored in user space, a kernel panic¹³ ensues. Conceptually, SMEP is similar to NX/DEP but applies specifically to supervisor-mode execution. Although such restrictions can be emulated in software, we focus on the hardware implementation.

5.1.1 CR4 Register

SMEP, like SMAP (discussed next), is a hardware feature of x86/x86-64 CPUs¹⁴ ¹⁵. In addition to general-purpose registers (RAX, RBX, …) and the instruction pointer (RIP), these CPUs expose control registers for system configuration.

SMEP is toggled by bit 20 of CR4:

1 → enabled | 0 → disabled
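Decoding these bits from a raw CR4 value is plain bit arithmetic. A small sketch follows; the helper names are hypothetical, and bit 21 is the corresponding SMAP bit.

```c
#include <stdint.h>

#define CR4_SMEP_BIT 20   /* CR4.SMEP */
#define CR4_SMAP_BIT 21   /* CR4.SMAP */

/* Return 1 if the given bit is set in a CR4 value, else 0. */
static int cr4_bit_set(uint64_t cr4, unsigned bit)
{
    return (int)((cr4 >> bit) & 1);
}

/* Return the CR4 value with one bit cleared and all others preserved. */
static uint64_t cr4_clear_bit(uint64_t cr4, unsigned bit)
{
    return cr4 & ~(1ULL << bit);
}
```

For a sample value of 0x350ef8, both bit 20 and bit 21 read as 1, and clearing bit 20 yields 0x250ef8.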

5.1.2 Checking Whether SMEP Is Enabled

As shown in 5.1.3 SMEP Bypass, kernel code can read/write control registers. For a quick check, however, you can use a debugger or system files.

5.1.2.1 Windows

0: kd> r cr4
cr4=0000000000350ef8

0: kd> ? ((@cr4 >> 20) & 1)
Evaluate expression: 1 = 00000000`00000001

Reading CR4 in WinDbg and masking bit 20 reveals the SMEP state.

5.1.2.2 Linux

$ cat /proc/cpuinfo | grep smep  
flags           : ... smep bmi2 rdseed adx smap clflushopt ...

If the CPU supports SMEP and the kernel has not disabled it (for example with the nosmep boot parameter), the smep flag appears in /proc/cpuinfo.
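The same check can be done programmatically by scanning a flags line for the exact word. A minimal sketch follows; `has_cpu_flag()` is a hypothetical helper that works on a line already read from /proc/cpuinfo, and it matches whole words so that a longer flag name containing "smep" is not mistaken for it.

```c
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-separated word in `line`. */
static int has_cpu_flag(const char *line, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = line;

    if (n == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        int starts_word = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        int ends_word = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';

        if (starts_word && ends_word)
            return 1;
        p += n;   /* keep searching after this partial match */
    }
    return 0;
}
```

Passing "smap" instead of "smep" checks for SMAP in the same way.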

5.1.3 SMEP Bypass

Both Windows and Linux let kernel-mode code modify control registers, enabling drivers or modules to tweak CPU behavior.

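mov rax, cr4
btr rax, 20            ; clear bit 20 (SMEP), leave the other bits intact
mov cr4, rax
ret

The sequence above clears bit 20 of CR4, disabling SMEP, while preserving the remaining bits (writing 1s to reserved CR4 bits raises a general-protection fault). By chaining gadgets with this effect in a ROP/JOP sequence, an attacker can bypass SMEP and jump to user-space shell-code.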

5.2 SMAP (Supervisor Mode Access Prevention)

SMAP is a protection mechanism designed to complement SMEP. Whereas SMEP only forbids execution of user-space code from kernel space, SMAP blocks the read and write access itself. Enabling SMAP therefore provides several security benefits—chief among them, preventing stack-pivoting attacks that rely on user memory.

mov esp, 0x12345678
ret

Even with SMEP enabled, attackers can often find a kernel-space gadget like the one above (or its x64 equivalent that writes to rsp). If they can control RIP, they redirect the stack pointer to a user-mapped region and build a ROP/JOP chain there. Although user-space shell-code remains unexecutable, the gadgets themselves execute in kernel space, so SMEP is bypassed. When SMAP is enabled, however, any attempt by kernel code to read that user-space ROP chain triggers a kernel panic, significantly raising the bar for exploitation.

How, then, do legitimate drivers—which routinely copy data to and from user space—operate while SMAP is on? They temporarily lift the restriction via the stac / clac instructions (“set AC” / “clear AC”):

The AC (Alignment Check) flag in the EFLAGS register also governs SMAP access.
When stac sets AC, the CPU allows kernel code to access user memory even while SMAP is active.
clac clears AC again, re-enabling SMAP’s block. (Note that this does not flip the SMAP bit in CR4.)

static __always_inline __must_check unsigned long
copy_user_generic(void *to, const void *from, unsigned long len)
{
	stac();
	/*
	 * If the CPU supports FSRM, use 'rep movs';
	 * otherwise, fall back to rep_movs_alternative.
	 */
	asm volatile(
		"1:\n\t"
		ALTERNATIVE("rep movsb",
			    "call rep_movs_alternative", ALT_NOT(X86_FEATURE_FSRM))
		"2:\n"
		_ASM_EXTABLE_UA(1b, 2b)
		:"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
		: : "memory", "rax");
	clac();
	return len;
}

As noted in 4.1.2 Linux, Linux’s copy_from_user() ultimately calls copy_user_generic(), which wraps the user-memory copy between stac() and clac(). Because those helpers already exist in the kernel, attackers sometimes build ROP/JOP chains that invoke stac to bypass SMAP, a reminder that most instructions needed for exploitation also appear in legitimate kernel code. (Like SMEP, SMAP is toggled via the CR4 register, in this case bit 21; checking its status or flipping the bit uses the same techniques described in 5.1.2 and 5.1.3.)
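The interplay between the SMAP bit in CR4 and the AC flag can be summarized as a small decision function. This is a simplified model with hypothetical names; it covers ordinary (explicit) supervisor data accesses only.

```c
#include <stdint.h>

#define CR4_SMAP  (1ULL << 21)   /* CR4.SMAP */
#define EFLAGS_AC (1ULL << 18)   /* EFLAGS.AC: set by stac, cleared by clac */

/* Returns 1 if a supervisor-mode data access to the page would fault
 * because of SMAP, 0 otherwise (simplified model). */
static int smap_blocks_access(uint64_t cr4, uint64_t eflags, int page_is_user)
{
    if (!page_is_user)
        return 0;                 /* SMAP only concerns user pages */
    if (!(cr4 & CR4_SMAP))
        return 0;                 /* feature disabled in CR4 */
    return (eflags & EFLAGS_AC) ? 0 : 1;
}
```

With SMAP enabled, the access is blocked while AC is clear and allowed between stac and clac, which is exactly the window copy_user_generic() opens.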

5.3 KASLR (Kernel Address Space Layout Randomization)

User space benefits from ASLR (Address Space Layout Randomization); kernel space has an analogous defense: KASLR. Each OS randomizes slightly different regions:

Windows

Kernel image (ntoskrnl.exe)
Kernel-mode drivers
Kernel heap

Linux

Kernel image
Kernel-mode drivers (kernel modules)
Kernel stack
Kernel heap

Because KASLR is applied only once at boot, leaking any symbol inside the kernel reveals the base address. (To mitigate this, researchers proposed FGKASLR—Function-Granular KASLR—but it has not yet landed in the upstream Linux kernel.) In addition, all drivers share the same kernel address space; even if one driver is well-hardened, another might leak an address and defeat KASLR system-wide.

KASLR is therefore weaker than user-mode ASLR—especially on Linux, which offers at most 9 bits of entropy for x64 kernels, making brute-force attacks feasible. Windows fares better with roughly 18 bits of kernel entropy but is still vulnerable once an address leaks.
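Both points, that a single leak reveals the base and that few entropy bits mean few candidate positions, reduce to simple arithmetic. In the sketch below the function names are hypothetical, and the symbol offset used in the usage note is an illustrative assumption; in practice it would come from the symbol table of the matching kernel image.

```c
#include <stdint.h>

/* Base address = leaked runtime address minus the symbol's fixed offset. */
static uint64_t kernel_base_from_leak(uint64_t leaked_addr, uint64_t symbol_offset)
{
    return leaked_addr - symbol_offset;
}

/* Number of distinct load positions for a given number of entropy bits. */
static uint64_t kaslr_positions(unsigned entropy_bits)
{
    return 1ULL << entropy_bits;
}
```

For example, a leaked address of 0xffffffffab12ae90 with an assumed offset of 0x12ae90 gives a base of 0xffffffffab000000. Nine bits of entropy correspond to 512 candidate positions, and 18 bits to 262,144.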

5.4 KPTI (Kernel Page-Table Isolation)

KPTI¹⁶ was introduced to mitigate Meltdown, a vulnerability found mainly in Intel CPUs that lets user-mode code read kernel-space memory, enabling kernel-memory leaks, KASLR bypass, and more. (A full explanation of Meltdown is outside this report’s scope.)

Figure 8. Page tables before KPTI (left) and after KPTI (right)

Without KPTI, a user-mode process’s page table normally maps both kernel-space and user-space addresses to avoid the overhead of TLB flushing¹⁷. Because system calls and context switches frequently jump into kernel mode, keeping those entries yields TLB hits and better performance, and OS-controlled page protections keep things safe in theory.

Meltdown affected Intel CPUs because of how they implement out-of-order execution¹⁸: a load from a forbidden address can forward its data to dependent instructions before the permission check takes effect. The fault is raised only when the load retires, and by then the dependent instructions have already left a trace in the cache, which user code can measure to infer the kernel data.

KPTI counters this by splitting page tables in two: one for user mode, one for kernel mode. In the user-mode table, all kernel mappings are removed except the bare minimum (e.g., interrupt stubs). Mode switches must now swap page tables, which costs performance, particularly on CPUs without PCID support, where each switch also flushes the TLB.

The CR3 control register holds the pointer to the active page table, so switching tables also means writing CR3—something ROP chains often need when returning from kernel to user space.

...
   0xffffffff81800e7f:  or     rdi,0x1000
   0xffffffff81800e86:  mov    cr3,rdi
   0xffffffff81800e89:  pop    rax
   0xffffffff81800e8a:  pop    rdi
   0xffffffff81800e8b:  swapgs
   0xffffffff81800e8e:  jmp    0xffffffff81800eb0
...

These instructions—part of Linux’s swapgs_restore_regs_and_return_to_usermode—show the page-table switch embedded in the normal kernel-to-user return path, providing useful gadgets for exploits.
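The `or rdi,0x1000` step in the listing can be read as selecting the user-mode copy of the top-level page table. The sketch below assumes the x86-64 Linux layout in which that copy sits one 4 KiB page above the kernel copy, so the two CR3 values differ only in bit 12; it ignores the PCID bits a real kernel also manages, and the function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: bit 12 selects the user-mode half of the page-table pair. */
#define PTI_USER_PGTABLE_MASK (1ULL << 12)

/* Kernel-mode CR3 value -> user-mode CR3 value. */
static uint64_t user_cr3_from_kernel_cr3(uint64_t kernel_cr3)
{
    return kernel_cr3 | PTI_USER_PGTABLE_MASK;
}

/* User-mode CR3 value -> kernel-mode CR3 value. */
static uint64_t kernel_cr3_from_user_cr3(uint64_t user_cr3)
{
    return user_cr3 & ~PTI_USER_PGTABLE_MASK;
}
```

Because the switch is a single bit operation followed by a CR3 write, an exploit returning to user mode can reuse the kernel's own return path instead of reimplementing it.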

Because not every CPU is Meltdown-vulnerable, KPTI is enabled only on affected chips. While it does not directly deter most traditional exploits, enabling it is advisable whenever possible.

5.5 KADR (Kernel Address Display Restriction)

KADR (Kernel Address Display Restriction) blocks leakage of kernel symbols and addresses so attackers cannot easily map the kernel’s internal layout. When KADR is on, non-privileged users cannot read key files and paths such as /boot/vmlinuz*, /boot/System.map*, /sys/kernel/debug/, /proc/slabinfo, or /proc/kallsyms.

$ cat /proc/kallsyms | grep prepare_kernel_cred   
0000000000000000 T prepare_kernel_cred
0000000000000000 t prepare_kernel_cred.cold

/proc/kallsyms is a virtual file¹⁹ listing every kernel symbol. With KADR enabled, addresses are shown as 0.

$ sysctl -w kernel.kptr_restrict=0
$ sysctl -w kernel.perf_event_paranoid=0

Setting kptr_restrict²⁰ and perf_event_paranoid²¹ to 0 via sysctl²² disables KADR.

$ cat /proc/kallsyms | grep prepare_kernel_cred 
ffffffffab12ae90 T prepare_kernel_cred
ffffffffabec1e78 t prepare_kernel_cred.cold

Once disabled, /proc/kallsyms again reveals full symbol names and addresses, eliminating the need for separate info leaks—hence attackers care greatly whether KADR is active.
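A tool can tell the two situations apart by parsing a line in the "address type name" layout shown above and checking whether the address is all zeros. A minimal sketch, with hypothetical helper names, operating on a line already read from /proc/kallsyms:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Parse one "<hex address> <type> <name>" line.
 * Returns 1 and fills *addr if the line names exactly `symbol`, else 0. */
static int kallsyms_line_addr(const char *line, const char *symbol, uint64_t *addr)
{
    unsigned long long value;
    char type;
    char name[128];

    if (sscanf(line, "%llx %c %127s", &value, &type, name) != 3)
        return 0;
    if (strcmp(name, symbol) != 0)
        return 0;
    *addr = (uint64_t)value;
    return 1;
}

/* An all-zero address means the real value was hidden from this reader. */
static int address_is_hidden(uint64_t addr)
{
    return addr == 0;
}
```

The exact name comparison keeps prepare_kernel_cred.cold from being mistaken for prepare_kernel_cred.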

5.6 Driver Signature Enforcement

Driver Signature Enforcement verifies the integrity of a device driver and confirms the identity of its publisher. Beginning with Windows Vista, the feature is enabled by default on all 64-bit editions of Windows. Each time a new driver is installed or loaded, Windows checks the following requirements; if any one fails, installation and execution are blocked:

The driver is signed with a valid code-signing certificate.
The driver has been validated and signed by the Windows Hardware Developer Center.
The driver is signed by Microsoft.

Figure 10. Mitigating man-in-the-middle attacks with Driver Signature Enforcement

With Driver Signature Enforcement enabled, untrusted drivers cannot be loaded, and integrity checks thwart man-in-the-middle tampering. The feature is not perfect, however. The Hardware Developer Center requirement applies only to Windows 10 version 1607 and later, and Microsoft allowed three transitional exceptions:

Drivers already distributed on earlier Windows versions that are later upgraded to Windows 10.
Drivers distributed while Secure Boot was disabled in the BIOS.
Drivers signed with a user certificate that was valid before 29 July 2015, issued by a certificate authority trusted by Windows.

The third exception can be abused. An attacker can sign a malicious driver with a certificate from a trusted CA and back-date the timestamp to appear as though it was signed before 29 July 2015, thereby bypassing Driver Signature Enforcement.

A real-world example involves the LAPSUS$ hacking group, which stole NVIDIA’s code-signing certificates. Malware signed with those certificates appeared within a single day of the leak. Because the stolen certificates expired before 29 July 2015, the attackers did not even need to tamper with timestamps. That attackers went as far as stealing certificates shows how much of an obstacle Driver Signature Enforcement is to them, and why users should keep it enabled and routinely check for revoked or compromised certificates.

6. Conclusion
#

Vulnerabilities in device drivers, components deeply intertwined with the kernel, can have catastrophic consequences on any major OS. Combining multiple protection techniques is therefore essential to contain the damage that bugs can do and to block potential attacks. Continuous research and improvement of driver and kernel defenses will remain vital to maintaining system stability and trustworthiness.

References
#

1. Windows systems prior to the NT line used simpler, more monolithic kernel structures.
2. System processes are those launched directly by the OS. They start with the system itself and run with high privileges to maintain stability and management.
3. lsass.exe (Local Security Authority Subsystem Service) manages user logins and security policies.
4. Most system processes run under the NT AUTHORITY\SYSTEM account.
5. Task is the Linux-kernel term equivalent to a process.
6. Function wrapping encloses one function within another to extend or modify behavior. Here it standardizes usage in specific environments.
7. The IoCompletionReserve structure, used for asynchronous I/O completion, is predictably allocated and often reused, making it useful for heap spraying.
8. /dev/ptmx (pseudo-terminal master multiplexer) manages virtual terminals and mediates user interactions.
9. A page is a fixed-size block of memory; the zero page is the page at address 0.
10. Internally, mmap() calls round_hint_to_min() via ksys_mmap_pgoff() and do_mmap().
11. sysctl lets you query or alter kernel parameters at runtime.
12. On Windows, threads are created with CreateThread(); on Linux with pthread_create().
13. A kernel panic is a fatal error after which the OS halts; on Windows this appears as a blue screen of death (BSOD).
14. Intel and AMD are the primary manufacturers of x86 / x86-64 CPUs.
15. On ARM, features analogous to SMEP/SMAP are called PXN and PAN.
16. Formerly called KAISER; Windows implements a similar concept as KVA Shadow.
17. The TLB (Translation Lookaside Buffer) caches virtual-to-physical address translations; context switches flush it for correctness.
18. Out-of-order execution lets CPUs run instructions non-sequentially for higher throughput.
19. Virtual files are generated by the kernel on demand and are not stored on disk.
20. kernel.kptr_restrict controls the exposure level of kernel pointer addresses.
21. kernel.perf_event_paranoid limits unprivileged access to performance-monitoring data.
22. sysctl typically requires root privileges (sudo).
