Dis86: A decompiler for x86 16-bit real-mode binaries
Today I'm publicly releasing a decompiler I've been building. Source code is (here).
I plan to document the internals here over a series of posts, but today's announcement will remain fairly high-level.
Backstory
Over the last year, I've been casually working on a reverse-engineering and reimplementation project for an old MS-DOS video game that got me curious about science and engineering as a child. Along with this project have come all kinds of fascinating discoveries about the original IBM PC that I was too young to appreciate at the time. I've been intending to write publicly about these discoveries for far too long.
The most traditional tool for this task is IDA Pro. Sadly, they dropped x86 16-bit real-mode support some time ago. And to complicate matters further, I use a Mac M1 Pro (AArch64) machine, so I can't simply run some older version without resorting to a complicated emulation stack (argh). I also tried Ghidra at some point and ran into issues getting it to work for my needs.
For a very experienced systems engineer, it doesn't take much annoyance before you're seriously considering just building the thing you want from scratch. It's a curse. A curse I love.
Prehistoric version 1
The first version was a very simple 1-to-1 converter from assembly instructions to C statements.
For example:
| x86-16 Asm | C Code |
|---|---|
| `inc ax` | `AX += 1;` |
| `push bx` | `PUSH(BX);` |
| `leave` | `SP = BP; BP = POP();` |
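The C on the right leans on a small runtime layer that models the CPU state: register globals, an emulated address space, and stack helpers. Below is a minimal sketch of what such a layer could look like. This is my own illustration of the idea, assuming a flat 1 MiB memory array and 16-bit register globals; the actual definitions used by dis86/Hydra may differ.

```c
#include <stdint.h>

typedef uint8_t  u8;
typedef uint16_t u16;
typedef uint32_t u32;

/* Hypothetical machine state: a flat real-mode address space plus a subset
   of the 16-bit register file, exposed as globals to the generated C. */
static u8  MEM[1u << 20];
static u16 AX, BX, SI, DI, SP, BP, DS, ES, SS;

/* Real-mode stack slot: linear address = SS*16 + SP. */
static u16 *stack_slot(void) {
  return (u16 *)&MEM[((u32)SS << 4) + SP];
}

/* PUSH/POP helpers used by the generated statements above. */
#define PUSH(v) do { SP -= 2; *stack_slot() = (u16)(v); } while (0)
static u16 POP(void) { u16 v = *stack_slot(); SP += 2; return v; }
```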
The generated C code could then execute in place of the original machine code, using a hybrid runtime I developed that switches between executing inside DosBox and natively on the Mac M1 (AArch64), depending on which functions have been decompiled and which are still only available as the original machine code. I call this runtime Hydra. Hydra is an interesting systems engineering project in its own right, and I plan to release it someday as well (TBD).
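Conceptually, the runtime intervenes at far-call boundaries and decides which world executes the callee. The sketch below is only my illustration of that idea, with made-up names (`hydra_call`, `emulate_far_call`); it is not Hydra's actual interface.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t u16;
typedef void (*native_fn)(void);

/* Hypothetical registry of functions that have already been decompiled. */
struct native_entry { u16 seg, off; native_fn fn; };
static struct native_entry g_native[64];
static size_t g_native_count;

/* Assumed emulator entry point: runs the original machine code in DosBox. */
void emulate_far_call(u16 seg, u16 off);

/* At a far-call boundary, prefer the decompiled C body when one exists;
   otherwise fall back to emulating the original machine code. */
void hydra_call(u16 seg, u16 off) {
  for (size_t i = 0; i < g_native_count; i++) {
    if (g_native[i].seg == seg && g_native[i].off == off) {
      g_native[i].fn();             /* run natively on the host */
      return;
    }
  }
  emulate_far_call(seg, off);       /* still machine code: emulate it */
}
```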
The typical reverse-engineering flow looked like:
1. Identify an interesting function and where it is in the binary (e.g. from `0452:012a` to `0452:0151`)
2. Run `dis86` to generate equivalent C code
3. Configure the Hydra Runtime to use this new code instead of the old machine code
4. Verify it works identically to the original
5. Begin refactoring the function: simplifying, naming variables, introducing higher-level constructs, etc.
6. Goto 4 ... until satisfied
You spend a lot of time in the 4-6 loop here. And there are obvious and not-so-obvious transformations that appear repeatedly that you quickly get sick of doing manually. You really want to focus on the less trivial parts such as understanding a data-structure or algorithm.
Here are some examples:
| Assembly pattern | Semantic operation |
|---|---|
| `xor si,si` | `SI = 0` |
| `or bx,bx`<br>`jne 0x234` | `if (BX != 0) goto addr_234` |
| `push ax`<br>`call 0x356:0x78`<br>`pop cx` | call function `0x356:0x78` with one argument (`AX`) |
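Version 2 of the tool (described below) automates rewrites like these. Here's a sketch of how one such rule might be expressed over the decoded instructions; the instruction model is my own illustration, not dis86's actual data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative instruction model (not dis86's real one). */
typedef enum { OP_XOR, OP_MOV_IMM, OP_OR, OP_JNE /* ... */ } opcode;
typedef enum { REG_AX, REG_BX, REG_SI /* ... */ } reg;

typedef struct {
  opcode op;
  reg    dst, src;
  int    imm;
} insn;

/* Peephole rule: "xor r,r" is really "r = 0". */
static bool rewrite_xor_clear(insn *i) {
  if (i->op == OP_XOR && i->dst == i->src) {
    i->op  = OP_MOV_IMM;
    i->imm = 0;
    return true;
  }
  return false;
}

/* Run the rule over a decoded instruction buffer. */
void peephole(insn *code, size_t n) {
  for (size_t i = 0; i < n; i++)
    rewrite_xor_clear(&code[i]);
}
```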
You also want to recognize function parameters and local variables. Instead of accessing them with raw load/store instructions, you want to symbolize them and use ordinary assignment.
Some examples of variables:
| Assembly pattern | Semantic operation |
|---|---|
| `mov di,WORD PTR ss:[bp+0x6]` | `DI = <first function parameter>` |
| `les bx,DWORD PTR ss:[bp-0x6]` | `ES:BX = <local variable at offset -0x6>` |
| `mov WORD PTR ss:[bp-0x8],dx` | `<local variable at offset -0x8> = DX` |
And you want to symbolize global variables and function calls.
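In the generated C, this shows up as the `ARG_16`/`LOCAL_16`-style accessors you'll see in the version-2 output below, keyed off the bp-relative offset. Here's a minimal sketch of how such accessors could be defined, building on the runtime sketch above; the actual definitions in dis86/Hydra may differ.

```c
/* Assumes the MEM / SS / BP globals from the earlier runtime sketch.
   In the far-call frames used here, the saved bp sits at [bp+0] and the
   4-byte far return address at [bp+2], so arguments start at [bp+6];
   locals carved out by "sub sp,N" live at negative offsets [bp-N]. */
#define FRAME_16(disp) (*(u16 *)&MEM[((u32)SS << 4) + (u16)(BP + (disp))])
#define FRAME_32(disp) (*(u32 *)&MEM[((u32)SS << 4) + (u16)(BP + (disp))])

#define ARG_16(off)    FRAME_16(+(off))  /* e.g. ARG_16(0x6): first parameter */
#define LOCAL_16(off)  FRAME_16(-(off))  /* e.g. LOCAL_16(0x8)                */
#define LOCAL_32(off)  FRAME_32(-(off))  /* e.g. LOCAL_32(0x6): far pointer   */
```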
Prehistoric version 2
So version 2 is born, which performs a handful of "peephole-style" transformations and symbolization.
Example output:
```c
#define _param_0006 ARG_16(0x6)
#define _local_0002 LOCAL_16(0x2)
#define _local_0008 LOCAL_16(0x8)
#define _local_0006 LOCAL_32(0x6)
#define _local_000a LOCAL_16(0xa)
void func_00006b42__0622_0922(void)
{
PUSH(BP); // push bp
BP = SP; // mov bp,sp
SP -= 0xa; // sub sp,0xa
PUSH(SI); // push si
PUSH(DI); // push di
DI = _param_0006; // mov di,WORD PTR ss:[bp+0x6]
BX = DI; // mov bx,di
BX <<= 0x2; // shl bx,0x2
AX = *PTR_16(DS, BX+0x946); // mov ax,WORD PTR ds:[bx+0x946]
DX = *PTR_16(DS, BX+0x944); // mov dx,WORD PTR ds:[bx+0x944]
*(u16*)((u8*)&_local_0006 + 2) = AX; // mov WORD PTR ss:[bp-0x4],ax
*(u16*)((u8*)&_local_0006 + 0) = DX; // mov WORD PTR ss:[bp-0x6],dx
SI = 0; // xor si,si
goto label_00006b93; // jmp 0x6b93
label_00006b64:
LOAD_SEG_OFF(ES, BX, _local_0006); // les bx,DWORD PTR ss:[bp-0x6]
// test WORD PTR es:[bx],0x40
if ((*PTR_16(ES, BX) & 0x40) == 0) goto label_00006b72; // je 0x6b72
*PTR_16(ES, BX) ^= 0x50; // xor WORD PTR es:[bx],0x50
label_00006b72:
LOAD_SEG_OFF(ES, BX, _local_0006); // les bx,DWORD PTR ss:[bp-0x6]
AX = *PTR_16(ES, BX); // mov ax,WORD PTR es:[bx]
_local_0002 = AX; // mov WORD PTR ss:[bp-0x2],ax
// test WORD PTR ss:[bp-0x2],0x14
if ((_local_0002 & 0x14) != 0) goto label_00006b8e; // jne 0x6b8e
// push WORD PTR ss:[bp-0x4]
// push bx
// callf 0x581:0x496
F_vga_dyn_append(m, BX, (u16)(_local_0006>>16)); // add sp,0x4
label_00006b8e:
// ... SNIP ...
```
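The `PTR_16` and `LOAD_SEG_OFF` helpers above follow from real-mode addressing: a segment:offset pair names the linear byte segment*16 + offset, and `les` loads the low word of a far pointer into a register and the high word into `ES`. Here's roughly what they could look like, again my guess at the runtime layer rather than its actual source.

```c
/* Again assuming the MEM and register globals from the earlier sketches. */

/* Real-mode address translation: segment*16 + offset. */
#define PTR_16(seg, off) ((u16 *)&MEM[((u32)(seg) << 4) + (u16)(off)])

/* Mirror of "les reg, m32" on a 32-bit far-pointer value: low word is the
   offset, high word is the segment. */
#define LOAD_SEG_OFF(seg_reg, off_reg, farptr) \
  do {                                         \
    u32 _v = (u32)(farptr);                    \
    (off_reg) = (u16)(_v & 0xffff);            \
    (seg_reg) = (u16)(_v >> 16);               \
  } while (0)
```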
The refactor iteration loop improves, but naturally we want more! In particular we want control-flow synthesis.
For example:
| Assembly pattern | Pseudo C Code |
|---|---|
| `00006b80: jne 0x6b85`<br>`00006b82: mov ax,1`<br>`00006b85: ...` | `if (<something> != 0) { AX = 1 }` |
| `00006bcb: jmp 0x6bfc`<br>`00006bcd: ...`<br>`00006bfc: cmp si,0x42`<br>`00006bff: jl 0x6bcd`<br>`00006c01: ...` | `while (SI < 0x42) { ... }` |
| `0000754c: jmp WORD PTR cs:[bx+0x1c]` | `switch (<val>) { ... }` |
These are not easy to infer in a decompiler architecture that lacks proper intermediate representations, so ultimately the decision was made to redesign from the bottom up. That redesign leads directly to the current form of dis86.
The inference problem
In many ways dis86 is just like any ordinary (forward) compiler. All compilers are tasked with translation from some source language to some destination language. In this case we have source=`x86-16` and target=`c-source-code`. This reality means that many established compiler design techniques can be used.
However, on top of all the normal challenges of building a high-quality compiler, decompilers also have a complicated "inference problem" of trying to invert a function that is NOT a bijection! Control flow synthesis is a simple example of this:
Consider:
```c
void my_function_1() {
  if (cond) {
    // .... do thing
    return;
  }
  // ... do another thing
}
```
And:
```c
void my_function_1() {
  if (cond) {
    // .... do thing
  } else {
    // ... do another thing
  }
}
```
These can both be compiled to the same machine code. But which ONE would that machine-code decompile to?
The above is a very trivial example, but it demonstrates the principle of trying to invert a function that is not invertible. In reality, these problems can get very complicated quickly.
Here's a more practical example you find in much x86-16 code:
```
000079c0 (074b:0510) | 55              push bp;
000079c1 (074b:0511) | 8b ec           mov bp,sp;
000079cb (074b:051b) | 9a fc 04 81 05  call 0x581:0x4fc;
000079d0 (074b:0520) | 5d              pop bp;
000079d1 (074b:0521) | cb              retf;
```
Does this function return a value? If so, where does it come from? Well, the calling conventions on this system return `u16` in `AX` and `u32` in `DX:AX`. But neither is set here. So is it a void function? Maybe.
It could be:
```c
void func_074b_0510() {
  func_0581_04fc();
}

void func_0581_04fc() {
  // ... body ...
}
```
Or it could be:
```c
u16 func_074b_0510() {
  return func_0581_04fc();
}

u16 func_0581_04fc() {
  // ... body ...
}
```
Or it could be:
```c
u32 func_074b_0510() {
  return func_0581_04fc();
}

u32 func_0581_04fc() {
  // ... body ...
}
```
How do we decide? And this is just one case. A solution for this particular ambiguity (e.g. whole-program analysis) may not be sufficient for every such ambiguity. In the limit, complexity dominates.
Many decompilers depend on various heuristics and assumptions to deal with this issue, and in that sense dis86 is no different. But, as a principle, we value producing code that executes correctly over code that maximizes readability. For cases like the one above, we'll depend on manual annotation unless the distinction can be made with certainty. Over time, we hope to continually improve this and/or make inference good at suggesting solutions to the operator. But ultimately, the operator remains the driver.
This is a trade-off that depends a lot on your application. If you're using a decompiler simply to read code and perform a security audit, then you may wish to trade some correctness for readability; if you find a security hole, you'll go read the machine code directly to verify it anyway. But if you're trying to semantically lift an entire binary into C code, then you instead worry about breaking some corner case in an infrequently executed code path.
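As a concrete picture of the manual-annotation approach for the return-value example above: the operator effectively supplies the signature that inference couldn't pin down, and the generated code follows it. This is only an illustration of the idea, not dis86's actual annotation format.

```c
/* Operator-supplied knowledge: declare the callee's true signature once
   (verified by inspecting the callee and testing), and the wrapper falls
   out naturally. */
u16 func_0581_04fc(void);

u16 func_074b_0510(void) {
  return func_0581_04fc();   /* AX simply passes through */
}
```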
Dis86 Design Overview
Finally we come to the overall internal structure of dis86.
Dis86 is a series of transform steps over well-defined data-structures:
- Disassembly: Raw binary is decoded into an in-memory representation of the instructions
- Basic blocks inference: Detect basic block boundaries based on jmp instructions and targets
- Build SSA IR [1]: Convert each instruction into one or more IR operations, inserting them into their respective basic-blocks (a toy sketch follows this list)
- Optimize IR: xor reduction, or reduction, phi reduction, branch simplification, stack ptr accumulation, common-subexpression-elimination, value-propagation, deadcode-elimination, etc
- Symbolize and forward memory: name parameters and locals, forward non-escaping memory, fuse upper/lower split addresses, etc
- Infer control flow: Detect loops, if-stmts, and switch-stmts. Schedule basic-block ordering/placement and jump fallthrough.
- Build an AST [2]: Build an AST representing C code using the final IR and the Control Flow
- Generate C Code: Walk the AST and emit the final C code
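To make step 3 slightly more concrete, here's a toy illustration of the kind of structure involved. It is not dis86's actual IR, just a minimal SSA-flavored sketch in which each x86 instruction lowers to one or more operations whose operands are earlier definitions, grouped into basic blocks.

```c
/* Toy SSA-style IR, purely illustrative. "inc ax" might lower to:
   v1 = const 1; v2 = add v0, v1   (where v0 is the incoming value of AX). */
typedef enum { IR_CONST, IR_ADD, IR_LOAD, IR_STORE, IR_PHI /* ... */ } ir_opcode;

typedef struct ir_op {
  ir_opcode     kind;
  int           dest;     /* SSA value id defined by this op (defined once) */
  int           args[2];  /* ids of operand values                          */
  int           imm;      /* immediate payload for IR_CONST                 */
  struct ir_op *next;     /* ops live in per-basic-block lists              */
} ir_op;

typedef struct basic_block {
  ir_op              *ops;     /* straight-line ops, no internal jumps      */
  struct basic_block *succ[2]; /* fallthrough / branch-taken successors     */
} basic_block;
```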
Each of these intermediate steps can be inspected, which is very useful for debugging and development.
Next Steps
In future posts, we'll drill down into specific components in gory detail.
If you'd like to learn more about this project, you can stay tuned to this blog via the RSS feed (here). And feel free to check out the source code yourself on GitHub (here). If you find this interesting, you can also buy me a coffee (here).
Appendix