PoC or GTFO, Volume 2
Exploiting DMG→SGB command packets for gaining a foothold on SNES
The Super Game Boy command packet protocol has two nifty commands for gaining control of the SNES. 0x79 writes data to an arbitrary memory location, while 0x91 sets the NMI vector and jumps to an arbitrary address. Both commands are real, documented command packets; they are not undocumented debug commands.
Since the Stage 2 code executing on the DMG is so small, we needed to minimize the number of packets required. The SNES’s controller registers are memory-mapped I/O registers that automatically update each video frame when enabled. It is possible to execute code from those registers, but it isn’t particularly easy to do so, largely because it is very unsafe to execute anything from them while they are in the middle of an update. (There are all sorts of intermediate stages.)
The solution is to find some way for the SNES CPU to waste time elsewhere during that update. The NMI vector and the NMI handler are perfect for this: when enabled, the handler starts running just before the registers start updating. We just need an NMI handler that wastes somewhere between roughly four and 260 scanlines, so it hits after the current NMI returns but before the next NMI starts. Scanning descriptions of various SNES I/O registers, a useful one seems to be $4212, which has bit 7 set while the console is performing a vertical retrace. The NMI occurs immediately after the vertical retrace starts and the retrace lasts for about 40 scanlines, so waiting for $4212 bit 7 to clear works out perfectly. Since the retrace bit is bit 7 and the SNES CPU happens to be in a mode where the A register is 8 bits wide,¹⁸ numbers with bit 7 set show as negative, so it’s trivial to branch on those using the BMI instruction. Handily enough, the LDA instruction that loads the contents of that memory address into the A register sets the condition flags, so we can just loop around that one instruction using BMI.
After the loop, we must return from the NMI. This is done using the RTI instruction, so the final NMI handler looks like:
in_vblank:
LDA $4212
BMI in_vblank
RTI
This handler trashes the A register, which is generally considered bad style, but we can get away with doing that.
We send two packets; the first one writes six bytes (AD 12 42 30 FB 40) into the memory address 0x001800. This is the NMI routine.
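As a quick sanity check on those six bytes, the following Python sketch decodes the relative branch and confirms that it loops back onto the LDA. The helper name `branch_target` is made up for this illustration; the only inputs taken from the text are the byte string and the 0x1800 load address.

```python
def branch_target(opcode_addr: int, offset_byte: int) -> int:
    """Target of a two-byte relative branch whose opcode sits at opcode_addr."""
    # The operand is a signed 8-bit displacement from the next instruction.
    rel = offset_byte - 0x100 if offset_byte & 0x80 else offset_byte
    return (opcode_addr + 2 + rel) & 0xFFFF


handler = bytes.fromhex("AD 12 42 30 FB 40")
base = 0x1800                 # where the first packet writes the handler

lda_addr = base               # AD 12 42 : LDA $4212
bmi_addr = base + 3           # 30 FB    : BMI <rel>
rti_addr = base + 5           # 40       : RTI

loop_target = branch_target(bmi_addr, handler[4])
print(hex(loop_target))
```

The branch lands on 0x1800, the LDA itself, so the handler spins on $4212 until bit 7 clears and then falls through to RTI.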
The second one jumps to 0x004218, which is the start of the controller registers, with the NMI vector set to 0x001800, the address of the routine we just wrote.¹⁹
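The two packets can be sketched in Python as follows. This is only an illustration: the 16-byte layout (header byte, little-endian 24-bit address, length byte, data, zero padding) is an assumption based on the commonly published SGB packet descriptions, and the function names `data_snd` and `jump` are invented here.

```python
def le24(value: int) -> bytes:
    """24-bit little-endian encoding (low byte, high byte, bank)."""
    return bytes([value & 0xFF, (value >> 8) & 0xFF, (value >> 16) & 0xFF])


def data_snd(addr: int, data: bytes) -> bytes:
    """Assumed layout of the 0x79 packet: address, length, then the data."""
    if not 1 <= len(data) <= 11:
        raise ValueError("one packet carries 1 to 11 data bytes")
    packet = bytes([0x79]) + le24(addr) + bytes([len(data)]) + data
    return packet.ljust(16, b"\x00")


def jump(pc: int, nmi: int) -> bytes:
    """Assumed layout of the 0x91 packet: program counter, then NMI handler."""
    return (bytes([0x91]) + le24(pc) + le24(nmi)).ljust(16, b"\x00")


NMI_HANDLER = bytes.fromhex("AD 12 42 30 FB 40")
packets = [
    data_snd(0x001800, NMI_HANDLER),   # write the handler
    jump(0x004218, 0x001800),          # run from the controller registers
]
for p in packets:
    print(p.hex(" "))
```

How these bytes are then clocked out by the DMG is outside the scope of this sketch.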
Figure 10.9: Inception
Stage 3: From stable loop in autopoller registers to loading payloads.
480 bytes per second; 60 payload bytes per second.
We have transferred control flow to controller registers, but we aren’t done just yet. The controller registers are only eight bytes in size, and normally not all bits are even controllable. However, there are some tricks we can play to control all the bits. First, even though a standard SNES controller only has twelve buttons, the autopoller reads all 16 bits. Normally the last four are controller type identification bits. Since those bits are read from the controller, the controller can set those bits to whatever it likes, including changing those bits every frame. Second, the last four bytes of the register are read from the second data line that is normally not connected to anything unless there is a multitap device. It isn’t possible to just connect a multitap device whenever we like as the game will softlock. Fortunately, it is possible to connect the second controller so that it shares all the other pins (+5V, ground, latch and clock), but uses the second data pin instead of the first.
These two tricks allow controlling all 64 bits in the controller registers, which gives us eight bytes of data per frame. While this is a huge improvement over our Stage 1 effective data rate of a nybble per frame, it still only amounts to a datarate of 300 bytes per second, because three of those eight bytes need to be used for looping in the controller registers, leaving only five bytes usable. (Although, as you’ll see, only one byte of payload data can be sent per frame.)
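The arithmetic behind these rates is easy to lose track of, so here it is spelled out as a small Python sketch. It assumes a nominal 60 frames per second, which is what the stage headings imply; the constant names are purely illustrative.

```python
FRAMES_PER_SECOND = 60           # nominal rate assumed by the stage headings

DATA_LINES = 4                   # two ports, two data lines each
BITS_PER_LINE = 16               # the autopoller reads 16 bits per line
REGISTER_BYTES = DATA_LINES * BITS_PER_LINE // 8

LOOP_BYTES = 3                   # bytes spent on looping in the registers
USABLE_BYTES = REGISTER_BYTES - LOOP_BYTES
PAYLOAD_BYTES = 1                # one poked byte per frame

raw_rate = REGISTER_BYTES * FRAMES_PER_SECOND
usable_rate = USABLE_BYTES * FRAMES_PER_SECOND
payload_rate = PAYLOAD_BYTES * FRAMES_PER_SECOND

print(REGISTER_BYTES, raw_rate, usable_rate, payload_rate)
```

The raw and payload figures match the "480 bytes per second; 60 payload bytes per second" in the Stage 3 heading.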
Specifically, to loop successfully in the controller registers we need to wait for the NMI, in order to avoid it happening at an unpredictable instruction (because the NMI handler trashes A), and then jump back to the start of the controller registers. Then there is the issue that the NMI is not initially enabled, even if the handler is set, so the first frame has to enable it. Fortunately, this can be done rather compactly:
Since the code is idempotent, this is a good time to switch from sending input once per frame to sending input once per latch poll. The way the SGB BIOS polls the controllers is completely crazy, often polling more than once per frame, polling too many bits, trying to poll but leaving the latch held high, etc. Because this is a somewhat common problem even in other games, the bot connected to the controller ports has a mode where it synchronizes what input to send based on the edge of each video frame (1/60th of a second in a polling window) by keeping track of how much time has elapsed; if the game asks for input more than once on the same frame, we give it that frame’s input again until we know it is time for the next frame’s polls, which means we can follow the polling no matter how crazy it is. The obvious trade-off is that this mode is limited to eight bytes per frame with four controllers attached, so we need to switch the bot’s mode to one that is strictly polling based, sending the next set of button presses on each latch. Making that transition can be a bit glitchy, considering it was added as a firmware hack, but because this piece of code is idempotent we can just spam the same input several times, as we only need it to hit in the range. This happens from frame 12,117 to 12,212 in the movie.
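The difference between the two bot modes can be modelled in a few lines of Python. This is a toy model, not the bot firmware: it represents each poll by the frame number in which it occurs, and both function names are invented for the illustration.

```python
def frame_synced(inputs, poll_frames):
    """Answer every poll with the input belonging to the frame it falls in.

    Repeated polls within one frame get the same input again.
    """
    return [inputs[frame] for frame in poll_frames]


def latch_based(inputs, poll_count):
    """Answer every latch with the next queued input, ignoring frame timing."""
    return list(inputs[:poll_count])


queued = ["A", "B", "C", "D", "E", "F"]
polls = [0, 0, 1, 2, 2, 2]       # a game that polls erratically

print(frame_synced(queued, polls))
print(latch_based(queued, len(polls)))
```

With the same six polls, the frame-synchronized mode delivers only three distinct inputs, while the latch-based mode delivers all six. That is why the faster later stages need the latch-based mode, and why the switch is only safe while repeated input is harmless.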
We now have a stable loop in the controller registers that we can use to poke some code into RAM. The five bytes per frame are enough to write one byte per frame into an arbitrary address in the first 8kB of the SNES’s RAM:
LDA #$xx
STA $yyyy
This assembles to five bytes, A9 xx 8D yy yy. Finally, after the writes, we can use JML (four bytes) to jump to the desired address. Since the DMG is still playing some annoying tunes, the first order of business is to try to crash it. Writing 00 to the clock control/reset register at 0x6003 should do the trick by stopping the DMG clock, and in fact this works in the LSNES emulator, but on a real console the annoying tunes keep playing until the DMG corrupts itself enough to crash completely.²⁰
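A minimal Python sketch of how those per-frame byte strings could be generated is shown below. The function names `poke` and `jml` are hypothetical, and the sketch only produces the five-byte write and the four-byte jump; the three looping bytes that share the registers with each write are left out.

```python
def poke(value: int, addr: int) -> bytes:
    """LDA #value / STA addr, i.e. A9 xx 8D yy yy with a 16-bit address."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("STA absolute takes a 16-bit address")
    return bytes([0xA9, value & 0xFF, 0x8D, addr & 0xFF, addr >> 8])


def jml(addr: int) -> bytes:
    """JML long: opcode 5C followed by a 24-bit little-endian address."""
    return bytes([0x5C, addr & 0xFF, (addr >> 8) & 0xFF, (addr >> 16) & 0xFF])


# One frame's write: the attempt to stop the DMG clock.
stop_dmg = poke(0x00, 0x6003)
# After all writes, jump to the freshly poked routine.
go = jml(0x001900)

print(stop_dmg.hex(" "), "|", go.hex(" "))
```

A routine of N bytes therefore costs N frames of pokes plus one frame for the jump.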
Figure 10.10: Now using four controllers!
Stage 4: Increasing the datarate even further.
480 bytes per second.
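One byte per frame is rather slow, as it would take us several minutes to write our payload at that speed, so we poke the following routine (Stage 4) into address 0x1900. It reads eight bytes per frame from the autopoller registers and writes them sequentially to RAM, from 0x1A00 up to 0x1B1F.
SEP #$30 ; 8-bit A/X/Y.
LDA #$01 ; Autopoll on, NMI off.
STA $4200
REP #$10 ; 16-bit X/Y.
LDY #$1A00
not_in_vblank:
LDA $4212
BPL not_in_vblank
in_vblank:
LDA $4212
BMI in_vblank
LDX #$4218
LDA #$00
XBA
LDA #$07 ; Copy eight bytes.
PHB
MVN $7E,$00
PLB
CPY #$1B20
BNE not_in_vblank
JML $7E1A08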
As machine code, e2 30 a9 01 8d 00 42 c2 10 a0 00 1a ad 12 42 10 fb ad 12 42 30 fb a2 18 42 a9 00 eb a9 07 8b 54 7e 00 ab c0 20 1b d0 e4 5c 08 1a 7e.
Why jump to eight bytes after the start of the payload? It turns out that the code loads some junk, left over from what was previously in the controller registers, on the first frame, so we just ignore the first few bytes and start the payload code afterwards. Eight bytes per frame still isn’t fast enough, so the routine this code pokes into RAM is another loader routine that uses the serial controller registers to read eight bytes eight times per frame, for a total of 64 bytes per frame.
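The key constants of the Stage 4 routine can be read straight out of its machine code. The Python sketch below does that with hard-coded byte offsets (counted by hand from the hex string above) and assumes the routine is loaded at 0x1900, the jump address mentioned later for stage4.dat.

```python
STAGE4 = bytes.fromhex(
    "e2 30 a9 01 8d 00 42 c2 10 a0 00 1a ad 12 42 10 fb ad 12 42 30 fb "
    "a2 18 42 a9 00 eb a9 07 8b 54 7e 00 ab c0 20 1b d0 e4 5c 08 1a 7e"
)
BASE = 0x1900  # assumed load address of the routine


def le(data: bytes) -> int:
    return int.from_bytes(data, "little")


buffer_start = le(STAGE4[10:12])   # operand of LDY (a0 at offset 9)
buffer_end = le(STAGE4[36:38])     # operand of CPY (c0 at offset 35)
jump_target = le(STAGE4[41:44])    # operand of JML (5c at offset 40)

# BNE (d0 at offset 38) branches relative to the next instruction, offset 40.
loop_target = BASE + 40 + (STAGE4[39] - 0x100)

frames_needed = (buffer_end - buffer_start) // 8
skipped_junk = (jump_target & 0xFFFF) - buffer_start

print(hex(loop_target), frames_needed, skipped_junk)
```

Under these assumptions the loop re-enters at the first vblank wait, the 0x120 byte buffer takes 36 frames to fill, and the final jump skips exactly the first eight bytes received.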
Let’s take a look at the Stage 5 payload:
; 0000 => Current transfer adr
; 0002 => Transfer end address
; 0004 => Blocks to transfer.
; 0006 => Current xfr bank.
; 0008 => 0: No transfer.
; 1: Transfer in progress.
; 000C => Blocks transferred.
; 0010 => Jump vector to next
; in chain.
; 0020-0027 => Buffer
; 0080-00BF => Buffer.
Start:
NOP ; 8 NOPs, for the junk
NOP ; at start.
NOP
NOP
NOP
NOP
NOP
NOP
SEI
LDA #$00 ; Autopoll off,
; NMI and IRQ off.
STA $4200
REP #$30 ; 16-bit A/X/Y.
; Initially no transfer.
LDA #$0000
STA $0008
frame_loop:
SEP #$20
not_in_vblank:
; Wait until next vblank ends
LDA $4212
BPL not_in_vblank
in_vblank:
LDA $4212
BMI in_vblank
REP #$20
LDA #$0008
STA $0004
LDA #$0000
STA $000C
rx_block:
LDA #$0001
STA $4016
LDX #$0003
latch_high_wait:
DEX
BNE latch_high_wait
STZ $4016
LDX #$0004
latch_low_wait:
DEX
BNE latch_low_wait
LDA #$0000
STA $0020
STA $0022
STA $0024
STA $0026
LDY #$0010
read_loop:
LDA $4016
PHA
; Bit 0 => 0020,
; Bit 1 => 0024,
; Bit 8 => 0022,
; Bit 9 => 0026
BIT #$0001
BNE b0nz
LDA $0020
ASL A
BRA b0d
b0nz:
LDA $0020
ASL A
EOR #$0001
b0d:
STA $0020
PLA
PHA
BIT #$0002
BNE b1nz
LDA $0024
ASL A
BRA b1d
b1nz:
LDA $0024
ASL A
EOR #$0001
b1d:
STA $0024
PLA
PHA
BIT #$0100
BNE b8nz
LDA $0022
ASL A
BRA b8d
b8nz:
LDA $0022
ASL A
EOR #$0001
b8d:
STA $0022
PLA
BIT #$0200
BNE b9nz
LDA $0026
ASL A
BRA b9d
b9nz:
LDA $0026
ASL A
EOR #$0001
b9d:
STA $0026
DEY
BNE read_loop
; Move the block from 0020
; to its final place
LDA $000C
ASL A
ASL A
ASL A
CLC
ADC #$0080
TAY
LDX #$0020
LDA #$0007
MVN $00, $00
; Increment the count at 000C,
; decrement the count at 0004.
; If no more blocks, exit.
LDA $000C
INA
STA $000C
LDA $0004
DEA
STA $0004
BEQ exit_rx_loop
JMP rx_block
exit_rx_loop:
LDA $0008
BNE doing_transfer
; Okay, setup transfer.
LDA $0082
CMP #$FF
BMI not_jump
; This is jump, copy the adr.
STA $12
LDA $0080
STA $10
BRA out
not_jump:
LDA $0080; Starting address.
STA $0000
LDA $0082; Bank.
STA $0006
LDA $0084; Ending address.
STA $0002
; Self-modify the move.
LDX #move_instruction
LDA $0006
AND #$FF
STA $01,X
; Enter transfer.
LDA #$0001
STA $0008
; See you next frame.
JMP no_reset_transfer
doing_transfer:
; Copy the stuff to its final
; place in WRAM.
LDY $0000
LDX #$0080
LDA #$003F
PHB
move_instruction:
MVN $40,$00 ; Bogus bank,
; to be modified.
PLB
TYA
STA $0000
CMP $0002
BNE no_reset_transfer
STZ $0008 ; End transfer.
no_reset_transfer:
; Next frame.
JMP frame_loop
out:
JMP [$10]
Figure 10.11: Why should we wait for next frame? Go sub-frame!
Stage 5: Transfers of data in blocks with headers.
3,840 bytes per second.
This routine is rather complex, so let’s review some of its trickier parts. The serial protocol works by first setting the latch bit, bit 0 in 0x4016, then clearing it, then reading the appropriate number of times from 0x4016 (port #1) and 0x4017 (port #2). Bit 0 of the read result is the first data line value, while bit 1 is the second data line value. After each read, the line is automatically clocked so the next bit is read. The two port latch lines are connected together; bit 0 of 0x4016 controls both.
The bot is slow, so we wait after setting/clearing the latch bit. We properly reassemble the input in the usual order of the controller registers, since we have CPU time available to do that. Since we read 16-bit quantities, port 0x4017 is read as high 8 bits, so the data lines there appear as bits 8 and 9.
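The reassembly done by read_loop can be modelled in Python as follows. This is a sketch of the logic only: each element of `reads` stands for one 16-bit read of 0x4016 (low byte) and 0x4017 (high byte), and the function name is invented for the illustration.

```python
def reassemble(reads):
    """Rebuild the four 16-bit controller words from 16 serial reads.

    Returns the words in buffer order: 0020, 0022, 0024, 0026.
    """
    # (bit mask in the 16-bit read, destination buffer address)
    lanes = [(0x0001, 0x0020), (0x0002, 0x0024),
             (0x0100, 0x0022), (0x0200, 0x0026)]
    words = {dst: 0 for _, dst in lanes}
    for value in reads:
        for mask, dst in lanes:
            bit = 1 if value & mask else 0
            # Same effect as ASL A followed by EOR #$0001 when the bit is set.
            words[dst] = ((words[dst] << 1) | bit) & 0xFFFF
    return [words[addr] for addr in (0x0020, 0x0022, 0x0024, 0x0026)]


# Only the very first bit on port #1, data line 1 is set:
print([hex(w) for w in reassemble([0x0001] + [0x0000] * 15)])
```

The first bit read ends up as the most significant bit of its word, since every later read shifts it one position to the left.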
To handle large payloads, the payload is divided into blocks with headers. Each header tells where the payload is to be written, or, if it is the last block, where to begin execution.
The routine uses self-modifying code: the source and destination banks of an MVN instruction are fixed in the code, so the instruction itself is rewritten at run time to refer to the correct target bank.
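How the loader interprets a header can be read off the listing: the words at 0080, 0082 and 0084 are the start address, the bank and the end address, and the CMP #$FF / BMI pair decides between a transfer and a jump. The Python sketch below mirrors that decision by emulating the negative flag of a 16-bit compare. The value 0x017E used for the jump example is only an illustrative guess at what a jump header might carry.

```python
import struct


def parse_header(block: bytes):
    """Classify a 64-byte block received while no transfer is in progress."""
    start, bank_word, end = struct.unpack_from("<HHH", block, 0)
    # CMP #$00FF sets N from bit 15 of the 16-bit difference; BMI tests N.
    negative = ((bank_word - 0x00FF) & 0x8000) != 0
    if negative:
        return ("transfer", bank_word & 0xFF, start, end)
    # JMP [$10] uses only the low byte of the word stored at $12 as the bank.
    return ("jump", bank_word & 0xFF, start)


transfer = bytes.fromhex("00 80 7e 00 00 c0").ljust(64, b"\x00")
jump = bytes.fromhex("00 80 7e 01").ljust(64, b"\x00")

print(parse_header(transfer))
print(parse_header(jump))
```

A transfer header is followed by 64-byte data blocks until the current address reaches the end address, after which the next block is treated as a header again.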
Automating the Movie Creation
Since manually editing, recompiling and transforming inputs gets old very fast when iterating payload ROMs, tools to automate this are very useful. This is the whole reason for having Stage 5 use block headers. Furthermore, to not have one person doing the work every time, it’s helpful to have a tool that even script-kiddies can run. The tool to do this is a Lua script that runs inside the emulator. (The LSNES emulator has built-in support for running Lua scripts, with all sorts of functions for manipulating the emulator.)
This code, the main Lua script, refers to four external files. “stage4.dat” contains the memory writes to load the Stage 4 payload from page 176 while executing in the controller registers.
This file contains the Stage 4 payload, plus the ill-fated attempt to shut up the DMG. (As noted previously, it dies on its own later.) The first line containing 0x001900 is the address to jump to after all bytes are written.
A filename is taken as a parameter, which is the payload ROM to use. As you can see, the Lua script hard-codes the memory mappings, but this is okay, as those are not difficult to modify.
The specified memory mappings copy the 0x8000 byte (32kB) region starting from file offset 0x8000 into 0x7E8000, and the 0x7A00 byte region starting from offset 0x10000 into 0x7F8000. (The first 32kB contain initialization code for testing.)
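To get a feel for what such a mapping means in frames, here is a Python sketch that cuts one mapped region into 64-byte blocks, one per frame of the Stage 5 loader. It is not the Lua script itself: the function name and the tuple layout are invented, and the ROM contents are a zero-filled placeholder.

```python
BLOCK_SIZE = 64  # bytes delivered per frame by the Stage 5 loader


def blocks_for(rom: bytes, file_offset: int, length: int, target: int):
    """Split one mapped region into (target_address, data) blocks."""
    region = rom[file_offset:file_offset + length]
    return [
        (target + i, region[i:i + BLOCK_SIZE])
        for i in range(0, len(region), BLOCK_SIZE)
    ]


# Placeholder ROM image just large enough to hold the second mapping.
rom = bytes(0x10000 + 0x7A00)
blocks = blocks_for(rom, 0x10000, 0x7A00, 0x7F8000)

print(len(blocks), hex(blocks[0][0]), hex(blocks[-1][0]))
```

The 0x7A00 byte region alone comes to 488 data blocks, which is roughly eight seconds of input at 60 frames per second, not counting headers.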
The script assumes that the loaded movie causes the SNES to jump into controller registers and then enable NMI, using the methods described earlier. It appends the rest of the stages and payload to the movie. Also, since it edits the loaded input, it is possible to just load state near the point of gaining control of the SNES and then append the payload for very fast testing. (Otherwise it would take about two minutes for it to reach that point when executing from the start.)
Stage 6: Twitch Chat Interface
After successfully transferring our payload, execution of the exploit payload (created by P4Plus2) can officially begin. There are three parts to the final payload: Reset, the Chat Interface, and a TASVideos Webview.
The Reset
Because much of the hardware state is either unknown or unreliable at the point of control transfer, we need to initialize much of the system to a known state. On the SNES this usually implies setting a myriad of registers from audio to display state, but also just as important is clearing out WRAM such that a clean slate is presented to the payload. Once we have a cleared state it is possible to perform screen setup.
In the initial case we set the tile data and tilemap VRAM addresses and set the video mode to 0x01, which gives us two layers of 4–bit depth (Layers 1 and 2) and a single layer of 2–bit depth, Layer 3.
Layer 1 is used as a background which displays the chat interface, while Layer 2 is used for emoji and text. Layer 3 is unused. A special case for the text and emoji, however, is Red’s own text, which is on the sprite layer, allowing code to easily update that text independently.
The Chat Interface
Now that we have the screen itself set up and able to run we need to stream data from Twitch chat to the SNES. But we only have 64 bytes per frame available to support emoji as well as the alphabet, numbers, various symbols, and even special triggers for controlling the payload execution. This complexity quickly bogged down our throughput per frame, so we created special encodings for performance! On average the most common characters will be a-z in lower case, which conveniently fit into a 5–bit encoding with several more characters to spare.
The SNES has both 16–bit and 8–bit modes, so in 16–bit mode we can easily process three characters with a bit to spare! But what about the rest of our character space? Well, we have a single bit remaining and can set it to allow the remaining characters to be alternatively encoded. The alternate encoding allows for two 7–bit characters, with an additional toggle bit on the second character.
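The two encodings can be sketched in Python as below. The text only gives the field widths, so the exact bit positions here are an assumption made for illustration: bit 15 selects the encoding, the short form packs three 5-bit codes below it, and the wide form packs a 7-bit character, the toggle bit, and a second 7-bit character.

```python
def pack_short(a: int, b: int, c: int) -> int:
    """Three 5-bit codes in one 16-bit word, selector bit clear (assumed layout)."""
    if not all(0 <= v < 32 for v in (a, b, c)):
        raise ValueError("short codes must fit in 5 bits")
    return (a << 10) | (b << 5) | c


def pack_wide(a: int, b: int, toggle: int) -> int:
    """Two 7-bit codes plus a toggle bit, selector bit set (assumed layout)."""
    if not all(0 <= v < 128 for v in (a, b)) or toggle not in (0, 1):
        raise ValueError("wide codes must fit in 7 bits, toggle in 1 bit")
    return 0x8000 | (a << 8) | (toggle << 7) | b


def unpack(word: int):
    if word & 0x8000:
        return ("wide", (word >> 8) & 0x7F, word & 0x7F, (word >> 7) & 1)
    return ("short", (word >> 10) & 0x1F, (word >> 5) & 0x1F, word & 0x1F)


print(hex(pack_short(1, 2, 3)), hex(pack_wide(0x41, 0x42, 1)))
```

With 64 bytes per frame there are 32 such words, so under this layout a frame carries at most 96 short characters or 64 wide ones.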