This post walks through exploiting CVE-2014-4943, a type confusion bug in the Linux kernel's L2TP subsystem. I found this bug with my old colleague @vegard and turned his initial denial of service PoC into a local privilege escalation. Many, many exploitation mitigations have been added to the kernel since then, so a lot of the techniques used here no longer work.

The vulnerability

The upstream fix for this vulnerability simply removed the functionality because it never ever worked:

net/l2tp: don't fall back on UDP [get|set]sockopt

The l2tp [get|set]sockopt() code has fallen back to the UDP functions for socket option levels != SOL_PPPOL2TP since day one, but that has never actually worked, since the l2tp socket isn't an inet socket.

As David Miller points out:

"If we wanted this to work, it'd have to look up the tunnel and then use tunnel->sk, but I wonder how useful that would be"

Since this can never have worked so nobody could possibly have depended on that functionality, just remove the broken code and return -EINVAL.

diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
index 950909f04ee6a..13752d96275e8 100644
--- a/net/l2tp/l2tp_ppp.c
+++ b/net/l2tp/l2tp_ppp.c
@@ -1365,7 +1365,7 @@ static int pppol2tp_setsockopt(struct socket *sock, int level, int optname,
         int err;
 
         if (level != SOL_PPPOL2TP)
-                return udp_prot.setsockopt(sk, level, optname, optval, optlen);
+                return -EINVAL;
 
         if (optlen < sizeof(int))
                 return -EINVAL;
@@ -1491,7 +1491,7 @@ static int pppol2tp_getsockopt(struct socket *sock, int level, int optname,
         struct pppol2tp_session *ps;
 
         if (level != SOL_PPPOL2TP)
-                return udp_prot.getsockopt(sk, level, optname, optval, optlen);
+                return -EINVAL;
 
         if (get_user(len, optlen))
                 return -EFAULT;

The commit description and patch are fairly opaque but contain a hint: L2TP sockets aren't INET sockets. Looking at the code before the fix was applied, we can see that any socket option call on an L2TP socket with a level other than SOL_PPPOL2TP is forwarded to the UDP subsystem:

net/l2tp/l2tp_ppp.c

static int pppol2tp_setsockopt(struct socket *sock, int level, int optname,
                              char __user *optval, unsigned int optlen)
{
       struct sock *sk = sock->sk;
       struct l2tp_session *session;
       struct l2tp_tunnel *tunnel;
       struct pppol2tp_session *ps;
       int val;
       int err;

       if (level != SOL_PPPOL2TP)
               return udp_prot.setsockopt(sk, level, optname, optval, optlen);

struct sock defines generic data for sockets in the kernel. Each specific type of socket (UDP, TCP, SCTP, L2TP, etc.) defines its own structure that embeds a struct sock as its first member. This allows generic networking code to handle sock structures and then "upcast" them when protocol-specific logic is needed. In pppol2tp_setsockopt, the sk pointer actually points to a struct pppox_sock allocation:

include/linux/if_pppox.h

struct pppox_sock {
        /* struct sock must be the first member of pppox_sock */
        struct sock sk;
        ...
};

When we call setsockopt on an L2TP socket, we can specify an arbitrary "level", which lets us call straight into the UDP socket option handler. If we choose SOL_UDP, we reach udp_lib_setsockopt, which blindly casts the sk pointer to a struct udp_sock via udp_sk:

net/ipv4/udp.c

int udp_setsockopt(struct sock *sk, int level, int optname,
                   char __user *optval, unsigned int optlen)
{
        if (level == SOL_UDP  ||  level == SOL_UDPLITE)
                return udp_lib_setsockopt(sk, level, optname, optval, optlen,
                                          udp_push_pending_frames);
        ...

net/ipv4/udp.c

int udp_lib_setsockopt(struct sock *sk, int level, int optname,
                       char __user *optval, unsigned int optlen,
                       int (*push_pending_frames)(struct sock *))
{
        struct udp_sock *up = udp_sk(sk);
        ...

include/linux/udp.h

static inline struct udp_sock *udp_sk(const struct sock *sk)
{
        return (struct udp_sock *)sk;
}

This is a form of type confusion, which is a relatively rare class of kernel bug. We can use this bug either to trigger an out-of-bounds read/write (because a struct udp_sock is larger than a struct pppox_sock) or to corrupt a member of the underlying struct pppox_sock allocation. I explored both options.
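
To make the aliasing concrete, here is a minimal userspace sketch of the same pattern. The mock_* types are hypothetical stand-ins I made up for illustration, not the kernel's real layouts, and a union is used so the reinterpretation is well-defined C rather than a raw pointer cast:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types; layouts are illustrative only. */
struct mock_sock { int family; };

struct mock_pppox_sock {        /* what the allocation really is */
    struct mock_sock sk;        /* shared header, must be first */
    void *chan;
};

struct mock_udp_sock {          /* what the UDP code believes it is */
    struct mock_sock sk;        /* shared header, must be first */
    void *opt;
    int pending;
    unsigned char encap_type;
};

/* Both views place their first pointer member right after the shared header. */
_Static_assert(offsetof(struct mock_pppox_sock, chan) ==
               offsetof(struct mock_udp_sock, opt),
               "chan and opt must alias for this demo");

/* One block of storage, two interpretations of it. */
union mock_allocation {
    struct mock_pppox_sock pppox;
    struct mock_udp_sock udp;
};

/* Write through the "UDP" view, then read back through the "PPPoX" view. */
uintptr_t confused_write(uintptr_t value)
{
    union mock_allocation a = { .pppox = { .sk = { .family = 1 }, .chan = NULL } };

    a.udp.opt = (void *)value;          /* the code that has the wrong type */
    return (uintptr_t)a.pppox.chan;     /* the code that has the right type */
}
```

Calling confused_write with any value hands that same value back through the chan member, which is the whole bug in miniature: a write that is perfectly valid for one type silently lands in a field that belongs to the other.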

Arbitrary read/write

For example, the encap_type field in struct udp_sock sits at an offset greater than the size of struct pppox_sock. We can use the UDP_ENCAP socket option to read and write this field:

net/ipv4/udp.c

int udp_lib_setsockopt(struct sock *sk, int level, int optname,
                       char __user *optval, unsigned int optlen,
                       int (*push_pending_frames)(struct sock *))
{
        struct udp_sock *up = udp_sk(sk);
        ...
        switch (optname) {
        ...
        case UDP_ENCAP:
                switch (val) {
                ...
                case UDP_ENCAP_L2TPINUDP:
                        up->encap_type = val;

net/ipv4/udp.c

int udp_lib_getsockopt(struct sock *sk, int level, int optname,
                       char __user *optval, int __user *optlen)
{
        struct udp_sock *up = udp_sk(sk);
        ...
        switch (optname) {
        ...
        case UDP_ENCAP:
                val = up->encap_type;
                break;

Unfortunately, on the kernel we're targeting, generic allocations are rounded up to the next power of two. As a result, reading or writing the encap_type field does not allow us to manipulate other objects on the heap. Rather, we can only read or write the extra padding added by the allocator, which is always zeroed on allocation. Instead, I used type confusion to get code execution.
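
The arithmetic behind that dead end can be sketched in a few lines. This is only a model of power-of-two size classes; the function names are mine, and any sizes you feed it are hypothetical rather than measured from a real kernel:

```c
#include <stddef.h>
#include <stdint.h>

/* Round n up to the next power of two. Saturates at the largest
 * representable power of two instead of overflowing. */
size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    while (p < n && p <= SIZE_MAX / 2)
        p <<= 1;
    return p;
}

/* Returns 1 if a write of len bytes at offset, made through the larger
 * (confused) type, still lands inside the slot that the allocator handed
 * out for an object of obj_size bytes. Assumes offset + len does not
 * overflow. */
int write_stays_in_slot(size_t obj_size, size_t offset, size_t len)
{
    return offset + len <= round_up_pow2(obj_size);
}
```

If the object is, say, 1300 bytes, it occupies a 2048-byte slot, so a one-byte write at offset 1400 is past the end of the object but still inside its own slot. The write only reaches a neighbouring object when the offset crosses the rounded-up size.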

Type confusion

Comparing the two structures, we can see that reading or writing through the struct udp_sock pointer allows us to read or write elements of the struct pppox_sock allocation. For example, writing to the pending field (1) would actually corrupt the underlying chan field (2):

include/net/inet_sock.h

struct inet_sock {
        /* sk and pinet6 has to be the first two members of inet_sock */
        struct sock                        sk;
#if IS_ENABLED(CONFIG_IPV6)
        struct ipv6_pinfo                  *pinet6;
#endif
        __be32                             inet_saddr;
        __s16                              uc_ttl;
        __u16                              cmsg_flags;
        __be16                             inet_sport;
        __u16                              inet_id;
        struct ip_options_rcu __rcu        *inet_opt;
        ...

include/linux/udp.h

struct udp_sock {
        /* inet_sock has to be the first member */
        struct inet_sock    inet;
        int                 pending;           /* Any pending frames ? */               (1)
        unsigned int        corkflag;          /* Cork is required */
        __u8                encap_type;        /* Is this an Encapsulation socket? */
        ...

include/linux/if_pppox.h

struct pppox_sock {
        /* struct sock must be the first member of pppox_sock */
        struct sock sk;
        struct ppp_channel chan;                                          (2)
        struct pppox_sock        *next;          /* for hash table */
        union {
                struct pppoe_opt pppoe;
                struct pptp_opt  pptp;
        } proto;
        ...

Digging further into the layout of the two structures, we find a promising candidate for our type confusion, the ppp field, which sits at exactly the same offset as inet_opt:

$ gdb -q net/ipv4/udp.o
Reading symbols from net/ipv4/udp.o...done.
(gdb) p/x &((struct udp_sock *) 0)->inet.inet_opt
$1 = 0x510
...
$ gdb -q net/l2tp/l2tp_ppp.o
Reading symbols from net/l2tp/l2tp_ppp.o...done.
(gdb) p/x &((struct pppox_sock *) 0)->chan.ppp
$1 = 0x510
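
The gdb expression above is the classic offsetof idiom, so the same search can be sketched in C. The struct field table, FIELD_OF and find_alias below are hypothetical helpers of my own, not anything from the kernel tree:

```c
#include <stddef.h>

/* One member of a structure, described by where it lives in memory. */
struct field {
    const char *name;
    size_t offset;
    size_t size;
};

/* Describe `member` of `type` using the same arithmetic as the gdb
 * expression (offsetof is the portable spelling of it). */
#define FIELD_OF(type, member) \
    { #member, offsetof(type, member), sizeof(((type *)0)->member) }

/* Returns 1 if the byte ranges of the two fields intersect. */
int fields_overlap(const struct field *a, const struct field *b)
{
    return a->offset < b->offset + b->size &&
           b->offset < a->offset + a->size;
}

/* Returns the index of the first candidate that overlaps target,
 * or -1 if none does. */
int find_alias(const struct field *target,
               const struct field *candidates, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (fields_overlap(target, &candidates[i]))
            return (int)i;
    return -1;
}
```

Feeding it a target of { "inet.inet_opt", 0x510, 8 } and a candidate of { "chan.ppp", 0x510, 8 } reports an overlap, assuming 8-byte pointers as on a 64-bit build.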

The inet_opt field embedded in the UDP socket structure contains the set of IP options, which can be directly controlled via the IP_OPTIONS socket option. The ppp field is a chunk of opaque data, represented as a void * in the structure. At runtime, though, it points to a struct ppp allocation.

These Point-to-Point Protocol (PPP) structures are used when sending or receiving data tunneled via an L2TP socket. For this exploit we will target the optional compression feature in PPP, which is implemented as a series of function pointers embedded deep in the struct ppp allocation. We can trigger the use of this corrupted structure deep in the call stack when receiving data from a UDP socket. The packet is first received by the generic UDP processing, which calls into the L2TP module, which finally calls down into the PPP decompression logic. The call stack at this point looks like this:

When we finally reach the decompression code, we fully control the contents of the ppp argument (1) and we can choose whether the packet we send is compressed or not (2), which will trigger a call through a function pointer we control (3):

drivers/net/ppp/ppp_generic.c

static struct sk_buff *
ppp_decompress_frame(struct ppp *ppp, struct sk_buff *skb)                      (1)
{
        int proto = PPP_PROTO(skb);
        ...
        if (proto == PPP_COMP) {                                                (2)
                ...
                /* the decompressor still expects the A/C bytes in the hdr */
                len = ppp->rcomp->decompress(ppp->rc_state, skb->data - 2,      (3)
                                skb->len + 2, ns->data, obuff_size);
}
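
The shape of that dispatch is easy to model in userspace. The following is a simplified sketch with hypothetical mock_* types, not the kernel's definitions; in particular it ignores what the real function does with non-compressed frames:

```c
#include <stddef.h>

#define MOCK_PPP_COMP 0xfd  /* stand-in for the kernel's PPP_COMP protocol number */

/* Table of operations that the owner of the object is trusted to fill in. */
struct mock_compressor {
    int (*decompress)(void *state, unsigned char *in, int in_len,
                      unsigned char *out, int out_len);
};

struct mock_ppp {
    struct mock_compressor *rcomp;
    void *rc_state;
};

/* Same shape as the excerpt above: the indirect call only happens for
 * frames marked as compressed. */
int mock_decompress_frame(struct mock_ppp *ppp, int proto,
                          unsigned char *in, int in_len,
                          unsigned char *out, int out_len)
{
    if (proto != MOCK_PPP_COMP)
        return in_len;  /* simplified: this sketch leaves other frames alone */

    return ppp->rcomp->decompress(ppp->rc_state, in, in_len, out, out_len);
}

/* Counts how often the substituted handler was reached. */
int fake_decompress_calls;

/* A handler supplied by whoever controls the mock_ppp contents. */
int fake_decompress(void *state, unsigned char *in, int in_len,
                    unsigned char *out, int out_len)
{
    (void)state; (void)in; (void)in_len; (void)out; (void)out_len;
    fake_decompress_calls++;
    return 0;
}
```

Whoever controls the memory that ppp points to controls rcomp, and therefore controls where the indirect call goes. The protocol field of the frame acts as the switch that decides whether the call happens at all.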

Top-half vs bottom-half

When using this codepath in our exploit, we need to be aware of how the kernel handles different types of sockets.

For packets received from a physical or virtual network adapter, the adapter fires an interrupt to let the kernel know that a packet has arrived. While running in the interrupt handler, we cannot process the packet directly because the kernel is in an indeterminate state. This is the so-called "top half" routine, which is fired by an interrupt and is solely responsible for adding the packet to a queue. The queue is then processed later, outside the interrupt handler, in what is termed the "bottom half". This has the effect of clearly separating the send logic from the receive logic:

bottom half:

 recv(2)
 called
   |
   |
   v
   |
   \---->---- packet processed
                 from queue
                    |    
                    ^
                    |
                +---+---+
                | queue |
                +---^---+
                    |
--------------------+-----------------------
top half:           |
                    |
      /---->--- add packet --->---\
      |          to queue         |
   interrupt                      |
   handler                        |
      |                           v
      ^                           |
      |                           |
   interrupt                  interrupt
   fired when                  handler
 packet received              completes

The kernel behaves differently for sockets bound to localhost. When sending a packet to a localhost socket, no interrupts are involved and there is no longer a clear separation between sending and receiving data. Instead, when a packet is destined for a local interface, it is parsed and routed immediately in the same kernel context as the send operation. Why is this important for our exploit? It means that after corrupting the L2TP structure in the kernel, we can send a packet on the L2TP socket that directly triggers the decompression function pointer in the context of our exploit process. This allows us to place shellcode in our exploit process's memory, which can then be called directly from kernelspace because we are in the same context. In diagram form:

kernelspace:                 | userspace:
                             |
                             |
         /-------------------|-----  write(l2tp_fd, packet, sizeof(packet))
         v                   |
     send logic              |
         |                   |
         v                   |
     recv logic              |
         |                   |
         v                   |
  PPP decompression          |
       logic                 |
         |                   |
         |                   |
sk->ppp->rcomp->incomp()     |
         |                   |
         |                   |
         \-------------------|----------->   fake
                             |             compressor
                             |                 |
                             |                 |
                             |                 v
                             |              shellcode
                             |
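
The difference between the two delivery paths can be modelled with a toy queue. Everything here is a hypothetical simulation with names of my own choosing, nothing like the real networking stack, but it shows why the caller's context matters:

```c
#include <stddef.h>

#define QUEUE_CAP 8

/* Fixed-size ring of pending packets; head and tail only ever grow. */
struct rx_queue {
    int packets[QUEUE_CAP];
    size_t head, tail;
};

typedef void (*rx_handler)(int packet, void *ctx);

/* Adapter path, top half: only enqueue, never process. Returns -1 if full. */
int top_half_enqueue(struct rx_queue *q, int packet)
{
    if (q->tail - q->head == QUEUE_CAP)
        return -1;
    q->packets[q->tail++ % QUEUE_CAP] = packet;
    return 0;
}

/* Adapter path, bottom half: process everything queued so far, at some
 * later point and in a different context from the sender. */
size_t bottom_half_drain(struct rx_queue *q, rx_handler handle, void *ctx)
{
    size_t n = 0;

    while (q->head != q->tail) {
        handle(q->packets[q->head++ % QUEUE_CAP], ctx);
        n++;
    }
    return n;
}

/* Local path: the receive handler runs before the send call returns,
 * on the sender's own stack. */
void loopback_send(int packet, rx_handler handle, void *ctx)
{
    handle(packet, ctx);
}

/* Example handler: counts packets in the int that ctx points to. */
void record_packet(int packet, void *ctx)
{
    (void)packet;
    (*(int *)ctx)++;
}
```

With top_half_enqueue the handler has not run by the time the call returns; with loopback_send it already has. That second property is what lets the receive-side function pointer fire inside the exploit process's own write call.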

Cleanup

Once the kernel jumps to our shellcode in userspace memory, we can elevate our privileges by simply setting the process's uid and gid to zero. But what happens after our shellcode has run? Looking again at ppp_decompress_frame, there is still packet-handling logic that runs after our shellcode returns, and we need to get through it cleanly:

drivers/net/ppp/ppp_generic.c

static struct sk_buff *
ppp_decompress_frame(struct ppp *ppp, struct sk_buff *skb)
{
        int proto = PPP_PROTO(skb);
        ...
        if (proto == PPP_COMP) {
                ...
                /* the decompressor still expects the A/C bytes in the hdr */
                len = ppp->rcomp->decompress(ppp->rc_state, skb->data - 2,
                                skb->len + 2, ns->data, obuff_size);
                if (len < 0) {
                        /* Pass the compressed frame to pppd as an
                           error indication. */
                        if (len == DECOMP_FATALERROR)                                (4)
                                ppp->rstate |= SC_DC_FERROR;
                        kfree_skb(ns);
                        goto err;
                }
                ...
        } else {
                ...
        }
        ...
 err:
        ppp->rstate |= SC_DC_ERROR;
        ppp_receive_error(ppp);
        return skb;
}

We can bypass most of this logic by returning DECOMP_FATALERROR from our shellcode, which puts the PPP socket into an error state (4). As noted before, we are pretty deep in the call stack at this point and still need to return back through the L2TP and UDP logic, but this is mostly a case of ensuring that the various fields in our fake compressor object are set up correctly.
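
The effect of the return value on the receive state can be traced with a small model of the error path shown above. The MOCK_* constants are local stand-ins that I believe mirror the kernel's values, but treat them as assumptions and check your own headers:

```c
/* Local stand-ins for the kernel constants; values are assumptions. */
#define MOCK_DECOMP_ERROR       (-1)
#define MOCK_DECOMP_FATALERROR  (-2)
#define MOCK_SC_DC_ERROR        0x00400000u
#define MOCK_SC_DC_FERROR       0x00800000u

/* Models only the flag updates in the excerpt: a negative length takes
 * the error path, and the fatal code additionally sets the fatal flag. */
unsigned int rstate_after_decompress(unsigned int rstate, int len)
{
    if (len < 0) {
        if (len == MOCK_DECOMP_FATALERROR)
            rstate |= MOCK_SC_DC_FERROR;
        rstate |= MOCK_SC_DC_ERROR;     /* the "err:" label */
    }
    return rstate;
}
```

Returning the fatal code sets both flags, returning the plain error code sets only the non-fatal one, and a non-negative length leaves the state untouched and continues into the rest of the frame handling.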

Once we finally return to userspace from the write(2) syscall, our process is root. 🍨