Obfuscated Tiny C Compiler: Deobfuscated

I was looking for a minimalist C compiler that is easy enough to retarget for another ridiculous project of mine (which I’ll possibly publish some day here). This search led me to TCC (Tiny C Compiler) and subsequently OTCC (Obfuscated Tiny C Compiler).

This is not my first time discovering Fabrice Bellard’s OTCC. However, on this particular day, I decided I wanted more. I wanted to understand. Fully. So I deobfuscated the entire compiler, turning over every rock, poking at the innards of every worm, and not stopping until enlightenment was attained.

Show Me The Code Already!

Here are links to code for impatient people:

Original (otcc.c)

Strange to say, but it’s remarkable how unobfuscated it actually is! What looks like obfuscation is largely done to save as much space as possible: single-char variables, reused variables, K&R function style, omitted types, etc. Even tricks often associated with obfuscation, such as #define, are actually used to compress and save space. The source has a lot of obscure constants, but those are often just x86-32 machine code literals, naturally obfuscated. Finally, there is an impressive operator table encoded in a compact string literal. Code formatting is likewise very minimal: line-breaks after each ‘;’ or ‘}’.

The space savings are reinvested in implementing a larger set of C features, and an impressively large set at that (here and here).

All of this is accomplished in less than one 4KB x86-32 memory page of source code (3301 bytes).

Expanded (otcc_expanded.c)

Here I took the obvious first steps:

  1. Run the preprocessor (gcc -E)
  2. Sensibly indent and format

Otherwise, little else was done.

Commented (otcc_commented.c)

Here I took apart every detail and commented it obsessively. My goal, however, was to avoid large breaking source transformations. The original obfuscated source structure remains intact, and I believe a straightforward mechanical re-obfuscator could regenerate the original source exactly (I did not verify this claim, however).

Fixed (otcc_fix.c and otcc_commented_fix.c)

These versions contain fixes that allow the code to compile and run on a modern x86-32 Linux distro (more on this later).

High-level Overview

The commented version is quite extensive, and reading inline explanations is better anyway, so I won’t repeat that work here. Go read it there. Instead, I want to discuss some bigger-picture structural observations.

Raw Arrays as an Associative Data Structure

Data structures are key to compiler construction, but this dialect of C has no struct. Instead, the symbol table is built out of 2 ordinary arrays. The first array (symkey_base) stores ASCII text and can be thought of as the “keys” of an associative array. It’s easy to search with strstr(). Linear search here means quadratic runtimes, but it’s a great trade: it conserves the most precious resource, source-code chars. This search yields an index.

The index is used to access the second array (symval_base), which serves as the value array of symbol information.

Both of these arrays are abused in many fabulous ways which you can read in the code comments.
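
To make the shape of this concrete, here is a minimal self-contained sketch of the two-array technique (my own reconstruction, heavily simplified; OTCC’s real versions are more compact and pull several extra tricks):

#include <stdio.h>
#include <string.h>

char symkey_base[1024] = " ";   /* growing text of "name " entries, space-delimited */
int  symval_base[1024];         /* symbol info, indexed by key offset */

/* Find a symbol's offset in the key text, interning it on first use. */
int sym_lookup(const char *name) {
    char pat[64];
    sprintf(pat, " %s ", name);          /* delimit to avoid substring hits */
    char *p = strstr(symkey_base, pat);  /* linear search (quadratic overall) */
    if (p) {
        p++;                             /* skip the leading delimiter */
    } else {
        p = symkey_base + strlen(symkey_base);
        sprintf(p, "%s ", name);         /* append; buffer already ends in ' ' */
    }
    return p - symkey_base;              /* this index keys symval_base */
}

int main(void) {
    symval_base[sym_lookup("x")] = 42;
    symval_base[sym_lookup("foo")] = 7;
    printf("%d %d\n", symval_base[sym_lookup("x")],
                      symval_base[sym_lookup("foo")]);  /* prints: 42 7 */
    return 0;
}

The returned index doubles as the key into the value array, which is what makes two flat arrays behave like one associative map.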

Parser Style

The parser is a fairly standard textbook Pratt-style recursive-descent parser that proceeds top-down: declarations ⇒ statements ⇒ binary expressions ⇒ unary expressions.

Not surprisingly, though rarely seen these days, it’s a single-pass compiler that tokenizes, expands macros, parses, and emits code all at the same time. This is key: it avoids requiring complex data structures in a minimalist implementation language.
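
For flavor, here is a tiny self-contained sketch of the precedence-climbing core (my own toy, not OTCC’s code): it handles only integer literals and four operators, and it computes values on the fly in the same single-pass spirit where OTCC would instead emit machine code.

#include <ctype.h>
#include <stdio.h>

const char *src;            /* cursor into the expression text */

int prec(char op) {         /* 0 means "not a binary operator" */
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0;
    }
}

int primary(void) {         /* the "unary expression" level */
    int v = 0;
    while (isdigit((unsigned char)*src))
        v = v * 10 + (*src++ - '0');
    return v;
}

int expr(int min_prec) {    /* the "binary expression" level */
    int lhs = primary();
    while (prec(*src) >= min_prec) {
        char op = *src++;
        int rhs = expr(prec(op) + 1);   /* +1 => left-associative */
        switch (op) {                   /* OTCC emits x86 code here */
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs /= rhs; break;
        }
    }
    return lhs;
}

int main(void) {
    src = "2+3*4-6/2";
    printf("%d\n", expr(1));   /* prints 11 */
    return 0;
}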

But, to do so, it needs to handle…

Back-patching

Compiling in a single pass means that lots of unresolved branch instructions get emitted before their targets are known. These need to get fixed up later. And there can be lots of them to fix up, thanks to possibly deep nesting of control-flow structures! Another data structure is required.

OTCC uses a clever solution here. I’ve seen it elsewhere, and don’t know where it originated. I presume it was widely known before this compiler was written. Or perhaps not. I would love to know if you, Ms Future Blog Reader, happen to know.

Effectively, it constructs a linked list through the unresolved jump-target immediates, chaining together jumps that share the same unresolved target. When a target is finally resolved, the compiler walks this linked-list chain, patching each jump.

If that's confusing, I'm sorry. A visual diagram would help, but I don't feel like making one.

But perhaps this helps instead:
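
Here is a compact reconstruction of the idea in plain C (my own sketch, not OTCC’s exact code):

#include <stdio.h>
#include <string.h>

unsigned char code[256];   /* output buffer           */
int here;                  /* current emit position   */

/* Emit "jmp rel32" with an unknown target; the immediate slot holds
   the old chain head, and this slot's offset becomes the new head. */
int emit_fwd_jump(int chain) {
    code[here++] = 0xE9;                 /* x86 jmp rel32 opcode  */
    memcpy(code + here, &chain, 4);      /* link to previous hole */
    here += 4;
    return here - 4;
}

/* Target now known: walk the chain, patching every immediate. */
void patch_chain(int chain, int target) {
    while (chain) {          /* offset 0 always holds an opcode, never
                                an immediate, so 0 ends the chain */
        int next;
        memcpy(&next, code + chain, 4);  /* read the link                 */
        int rel = target - (chain + 4);  /* rel32 is end-of-insn relative */
        memcpy(code + chain, &rel, 4);   /* overwrite with the fixup      */
        chain = next;
    }
}

int main(void) {
    int chain = 0;                  /* empty chain                  */
    chain = emit_fwd_jump(chain);   /* e.g. an if's "else" jump     */
    chain = emit_fwd_jump(chain);   /* e.g. a nested break          */
    patch_chain(chain, here);       /* both now jump to this offset */
    printf("emitted %d bytes, chain fully patched\n", here);
    return 0;
}

The trick is that no memory beyond the code buffer itself is needed: the not-yet-meaningful immediates are the list nodes.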

Operator Data Encoding

A surprisingly large set of C operators is supported by employing a clever encoding to build an operator table in an ordinary C string. A base-64 encoding is used that is carefully constructed to avoid both the double-quote character (") and the escape character (\). For each candidate operator, the tokenizer fully unpacks this table and linearly searches it for a match. On success, the tokenizer produces both the precedence of the operator (for the Pratt parser) and a blob of x86-32 machine code to emit (except when that is also abused, haha).
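
The real encoding is denser than I want to reproduce here, but the underlying idea, a searchable operator table packed into one string literal, can be sketched like this (my own toy encoding, not OTCC’s base-64 scheme):

#include <stdio.h>
#include <string.h>

/* Each entry is "op@P@C " where P is a precedence digit and C stands
   in for the machine-code blob. OTCC instead packs real x86 bytes
   using a base-64 alphabet chosen to dodge '"' and '\'. */
const char *op_table = "+@2@1 -@2@2 *@3@3 ==@1@4 ";

/* Look up an operator token; fill prec/code and return 1 on success. */
int find_op(const char *tok, int *prec, int *code) {
    char pat[16];
    sprintf(pat, "%s@", tok);
    const char *p = strstr(op_table, pat);
    if (!p || (p != op_table && p[-1] != ' '))
        return 0;                  /* don't match a suffix (= vs ==) */
    p += strlen(tok) + 1;
    *prec = p[0] - '0';
    *code = p[2] - '0';
    return 1;
}

int main(void) {
    int prec, code;
    if (find_op("==", &prec, &code))
        printf("prec=%d code=%d\n", prec, code);   /* prec=1 code=4 */
    return 0;
}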

Executing In-Process and Calling libc

OTCC is really a JIT-compiler runtime of sorts. It generates machine code into a buffer, casts it into a function pointer, and jumps into it. Since it’s executing in the same process as the compiler, it doesn’t need to deal with a linker, a loader, or a binary file format (e.g. ELF). This greatly simplifies things. But it also gives a special super-power: it can look up libc functions via dlsym() and compile calls to them too. Since the generated code executes in the same process, it Just Works. This gives OTCC access to all of the important libc functions and is a major source-text savings (e.g. strstr to do symbol-table lookups).
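
A minimal demo of this super-power (my own code, x86-32 only, and with no error checks, in the spirit of the original): emit a stub into an executable buffer, patch in an address obtained from dlsym(), then call the buffer. Compile with something like gcc -m32 demo.c -ldl.

/* Emits "mov eax, imm32 ; ret", which hands back the patched address. */
#define _GNU_SOURCE              /* for RTLD_DEFAULT on glibc */
#include <dlfcn.h>
#include <string.h>
#include <sys/mman.h>

typedef int (*puts_fn_t)(const char *);

int main(void) {
    unsigned char tmpl[6] = { 0xB8, 0, 0, 0, 0, 0xC3 };
    void *addr = dlsym(RTLD_DEFAULT, "puts");   /* OTCC-style lookup */

    unsigned char *buf = mmap(0, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memcpy(buf, tmpl, sizeof tmpl);
    memcpy(buf + 1, &addr, 4);       /* patch imm32 (32-bit pointers) */

    /* Cast the buffer into a function pointer and jump into it. */
    puts_fn_t (*jitted)(void) = (puts_fn_t (*)(void))buf;
    jitted()("hello from JITed code");
    return 0;
}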

Codegen

The x86-32 codegen is quite vanilla, despite being obfuscated as decimal integer literals. The EAX register is used for all expression results until a binary operator is encountered, at which point the left-hand side moves to the stack and then into ECX. There are some fun tricks for doing memory operations on both global and local variables with the same codegen routines by setting one specific bit in the ModR/M byte.
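
Concretely (my sketch of the encoding, not OTCC’s routine): with opcode 0x8B (mov eax, r/m32), a ModR/M byte of 0x05 addresses a global at [disp32], while setting bit 7 gives 0x85, addressing a local at [ebp+disp32].

#include <stdio.h>

unsigned char out[16];
int pos;

/* Emit "mov eax, [addr]" (global) or "mov eax, [ebp+addr]" (local). */
void emit_load_eax(int addr, int is_local) {
    int i;
    out[pos++] = 0x8B;                 /* mov eax, r/m32              */
    out[pos++] = (is_local << 7) | 5;  /* 0x85 (local) or 0x05 (global) */
    for (i = 0; i < 4; i++)            /* little-endian disp32        */
        out[pos++] = (addr >> (8 * i)) & 0xFF;
}

int main(void) {
    int i;
    emit_load_eax(-8, 1);              /* mov eax, [ebp-8]     */
    emit_load_eax(0x0804a010, 0);      /* mov eax, [0x804a010] */
    for (i = 0; i < pos; i++)
        printf("%02X ", out[i]);
    printf("\n");
    return 0;
}

That (is_local << 7) | 5 computation is the same pattern that shows up in the fix diff later on.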

Luckily, the x86-32 Linux ABI is quite friendly to simple codegen, as all function arguments and all function locals can be stored on the stack and indexed with a negative or small positive stack-frame offset. Contrast this with x86-64, where specific argument registers must be used (more on this later).

Also, this compiler benefits from and heavily exploits the fact that sizeof(int) == sizeof(void*) on x86-32 by not having to implement pointer types and just using int everywhere. The syntax *(int*) is a special case handled by the parser. It could be viewed as a load_u32 / store_u32 operator. Similarly, we have *(char*) as a load_u8 / store_u8 operator. Despite having the char keyword, this compiler doesn't really have true types: everything is an int, often heavily abused.
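
As a usage sketch, here’s a toy program in the OTCC dialect (my own example, in the style of otccex.c, untested):

/* "Pointers" are plain ints; the special forms act as load/store ops. */
main() {
    int p;
    p = malloc(4);                   /* the "pointer" lives in an int */
    *(int *)p = 65;                  /* store_u32                     */
    *(char *)p = *(char *)p + 1;     /* load_u8 then store_u8         */
    printf("%d\n", *(int *)p);       /* prints 66                     */
    return 0;
}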

Error-handling

I just had to say something about error-handling because this has ... NONE. To be fair, it’s a toy that was entered in a contest and not intended to be Production Quality by any stretch of the phrase. But, it’s interesting just how far the “lack of error-handling” is exploited, and that is noteworthy. Everywhere in the source is the assumption that input data is well-formed. Any deviation will very likely lead to a nasty crash.

For example:
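
In the same spirit (my paraphrase, not a verbatim excerpt): buffers are fixed-size, allocations are unchecked, and appends have no bounds checks, so enough (or malformed) input crashes.

#include <stdlib.h>
#include <string.h>

char *sym_stk;

int main(void) {
    sym_stk = malloc(99999);            /* result never checked for NULL */
    strcpy(sym_stk, " ");
    strcat(sym_stk, "a_new_symbol ");   /* appended with no bounds check */
    return 0;
}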

This should go without saying: If you're not developing a toy program, DON'T DO THIS. And if you are developing a toy program: Send It To Me =)

Getting it running

Sadly, there isn’t much in the way of x86-32 machines these days, and my normal machine is now an Aarch64 MacBook M1. So, I set up qemu-system-i386 with Alpine Linux.

Sadly, this wasn’t enough. This modern distro uses musl, and its malloc() tries to use mmap() where possible, producing allocations in the upper half of the virtual address space. Viewed as signed ints, those addresses are negative, which collides with OTCC’s use of the negative range as “stack offsets”.

A large number of fixes were attempted, including malloc shims and LD_PRELOAD tricks, but none of them were satisfying.

The next problem is that Alpine also maps musl libc itself into high memory, which means that global vars from musl (e.g. stdin) also get miscompiled!

Removing the stdin default gets the compiler compiling itself, but otccex.c is then miscompiled by the stage-2 compiler for the same reason.

The solution to all of this ended up being simple (ha):

$ diff otcc.c otcc_fix.c
168c168
< s((e<512)<<7|5,n);
---
> s((e<512&-e<512)<<7|5,n);

Just have the OTCC codegen treat a value as a local stack offset only when it is small, i.e. in (-512, 512); everything else is addressed as a global.
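
Here’s a tiny demo of why the original e < 512 test misfires (my illustration; since < binds tighter than &, the fixed expression reads as (e < 512) & (-e < 512)):

#include <stdio.h>

int main(void) {
    int e = (int)0xC0001234;                /* an mmap'd high-half address */
    printf("%d\n", e < 512);                /* 1: wrongly looks local      */
    printf("%d\n", (e < 512) & (-e < 512)); /* 0: correctly a global       */
    e = -8;                                 /* a real stack-frame offset   */
    printf("%d\n", (e < 512) & (-e < 512)); /* 1: still treated as local   */
    return 0;
}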

No more funky malloc shims or LD_PRELOAD tricks.

All of the following commands work now:

$ ./otcc otccex.c 5
fib(5) = 5
fact(5) = 120

$ ./otcc otcc_fix.c otccex.c 5
fib(5) = 5
fact(5) = 120

$ ./otcc_commented otccex.c 5
fib(5) = 5
fact(5) = 120

$ ./otcc_commented otcc_commented_fix.c otccex.c 5
fib(5) = 5
fact(5) = 120

Modernizing?

I considered modernizing OTCC to x86-64, but it’s not so straightforward due to the ABI: compiling to the system ABI is exactly what opens the door to the dlsym trick, so it can’t be sidestepped. Porting to x86-64 has at least two problems (and likely more that I haven’t yet realized):

  1. The x86-64 ABI puts arguments in registers rather than the stack
  2. We no longer have sizeof(int) == sizeof(void*), so the aggressive type-punning is broken

These are probably both possible to overcome, but I haven’t attempted it.

Future work I suppose …

Closing thoughts

These days, we live in a world built on seemingly endless piles of code. If you’re fortunate (or unfortunate) enough to work on core parts of an important software system, you know how hard it is to keep it all in your head. In fact, you simply can’t. Software is big, it’s built by teams, and knowledge is distributed across many minds.

Taking apart OTCC is like visiting another planet in a secluded part of the universe, where few care to venture. In just ~500 lines, you can have immense complexity and functionality. But, perhaps surprisingly, you can actually fit it all in one brain. There’s a certain beauty to understanding how everything fits together end-to-end in maximum detail, with nothing over-simplified as a crutch for the human brain. It’s a luxury that software in 2023 doesn’t provide.

I’m not claiming it’s better. That would assume a “metric-space for software-quality”, which is almost certainly too hard to define. But a little reflection might be useful.

And arguably today’s software is better. Is it not better to have knowledge distributed across minds versus a single expert mind who can get hit by a bus at any time? And anyone who complains about bad software only has to go back a few decades to find significantly worse problems. For example: a good feature of old Sun machines was that they booted so fast, which of course was needed because they crashed so often (“Pros and Cons of Suns” from The UNIX-HATERS Handbook). These days, unless you use some sketchy kernel driver/module, it’s not easy to panic the kernel. Software has gotten unreasonably stable. When we complain about software being awful and broken, it’s more an indicator of how far we’ve shifted the goal-posts over the years.

But at the same time, when modern software breaks, it is so incredibly complex that it can be shockingly hard to fix. There is something grand about having a complete mental model and forcing things to be overly simple that leads to efficiency on many levels.

To be clear, OTCC is just a toy. But perhaps I’m surprised at how interesting a toy it is… and perhaps, by contrast, how uninteresting much of modern software often is.

But there’s no real point to any of this; it was all for fun. And I hope you had fun too. If not, here’s an entertaining cat video.

Stay Calm and Carry On!