Lifting Binaries, Part 0: Devirtualizing VMProtect and Themida: It's Just Flattening?
Table Of Contents
- Table Of Contents
- Intro
- Day 0
- Failed attempts
- Whiplash
- Challenges
- Action
- Whats next?
- Special thanks to
- Thanks for reading!
- Useful resources:
Intro
This is going to be the first part of multipart series in which I discuss about using compiler techniques for reverse engineering purposes. I want this first post to be more of a blog about why I decided on this approach and why I developed Mergen rather than focusing on the more technical aspects.
Let’s go back to day 0, where everything started.
Day 0
Commercial VM based obfuscators like VMProtect and Themida are considered as industry standards because of their black box implementation. Even though these solutions have been in this position for a long time and there are public projects that are able to deobfuscate a particular solution or a particular version of a solution, there are (or were) no public projects that can deobfuscate multiple versions or multiple solutions.
So bored enough to learn a new topic, stupid enough to make it wildly ambitious, I started by creating a basic executable that just returns the sum of two numbers, applied VMProtect to it, and started to mess around. When I looked at the function in IDA, I saw the function was getting replaced with a jump to VMProtect generated .???0 section with a weird looking code block.
.???0:000000014000AE7B push r10
.???0:000000014000AE7D pushfq
.???0:000000014000AE7E push rsi
.???0:000000014000AE7F bsf r10, r10
.???0:000000014000AE83 bts r10w, di
.???0:000000014000AE88 push r13
.???0:000000014000AE8A test cx, r15w
.???0:000000014000AE8E btc r10w, sp
.???0:000000014000AE93 cmc
.???0:000000014000AE94 push r8
.???0:000000014000AE96 stc
.???0:000000014000AE97 bswap r10
.???0:000000014000AE9A push rbx
.???0:000000014000AE9B bswap bx
.???0:000000014000AE9E sbb r10b, sil
.???0:000000014000AEA1 cmc
.???0:000000014000AEA2 push rdx
.???0:000000014000AEA3 btr bx, dx
.???0:000000014000AEA7 push r11
.???0:000000014000AEA9 clc
.???0:000000014000AEAA movzx r10d, bp
.???0:000000014000AEAE push rax
.???0:000000014000AEAF push r15
.???0:000000014000AEB1 sal rax, 5Ah
.???0:000000014000AEB5 push rbp
.???0:000000014000AEB6 push r9
.???0:000000014000AEB8 push rdi
.???0:000000014000AEB9 and ebp, 38D937E3h
.???0:000000014000AEBF push rcx
.???0:000000014000AEC0 push r14
.???0:000000014000AEC2 bts r9d, r9d
.???0:000000014000AEC6 btc rax, 57h ; 'W'
.???0:000000014000AECB push r12
.???0:000000014000AECD and r9b, 0CCh
.???0:000000014000AED1 shld r10w, r8w, 3Ah
.???0:000000014000AED7 add bpl, al
.???0:000000014000AEDA mov rax, 0
.???0:000000014000AEE4 push rax
.???0:000000014000AEE5 bswap rbx
.???0:000000014000AEE8 bsf bx, r14w
.???0:000000014000AEED mov rdi, [rsp+88h+arg_0]
.???0:000000014000AEF5 cmc
.???0:000000014000AEF6 shr bpl, cl
.???0:000000014000AEF9 bswap edi
.???0:000000014000AEFB add edi, 66931E79h
.???0:000000014000AF01 movzx rbp, r12w
.???0:000000014000AF05 or r11b, bpl
.???0:000000014000AF08 adc bl, r14b
.???0:000000014000AF0B bswap edi
.???0:000000014000AF0D inc r9b
.???0:000000014000AF10 sub edi, 6573517Fh
.???0:000000014000AF16 add rdi, rax
.???0:000000014000AF19 mov r10, 100000000h
.???0:000000014000AF23 bts r11, rdi
.???0:000000014000AF27 add rdi, r10
.???0:000000014000AF2A movsx r11, bx
.???0:000000014000AF2E mov rbx, rsp
.???0:000000014000AF31 movsx ebp, bp
.???0:000000014000AF34 btc r11w, r8w
.???0:000000014000AF39 mov bpl, r15b
.???0:000000014000AF3C sub rsp, 180h
.???0:000000014000AF43 bsf r9d, r15d
.???0:000000014000AF47 and bpl, 0D0h
.???0:000000014000AF4B movzx bp, r13b
.???0:000000014000AF50 and rsp, 0FFFFFFFFFFFFFFF0h
.???0:000000014000AF57
.???0:000000014000AF57 loc_14000AF57: ; CODE XREF: .???0:loc_14001D58B↓j
.???0:000000014000AF57 ; .???0:000000014001FFEC↓j ...
.???0:000000014000AF57 mov rbp, rdi
.???0:000000014000AF5A rcr r11w, 7Fh
.???0:000000014000AF5F xor r11w, 0E0Ch
.???0:000000014000AF65 xadd r9, r9
.???0:000000014000AF69 mov r9, 0
.???0:000000014000AF73 sub rbp, r9
.???0:000000014000AF76 dec r11b
.???0:000000014000AF79 or r11b, r11b
.???0:000000014000AF7C mov r9b, 0C2h
.???0:000000014000AF7F
.???0:000000014000AF7F loc_14000AF7F: ; DATA XREF: sub_14000AE7B:loc_14000AF7F↓o
.???0:000000014000AF7F lea r11, loc_14000AF7F
.???0:000000014000AF86 sar r9b, cl
.???0:000000014000AF89 bt r9, rdi
.???0:000000014000AF8D sub rdi, 4
.???0:000000014000AF94 ror r9b, cl
.???0:000000014000AF97 mov r9d, [rdi]
.???0:000000014000AF9A jmp loc_140049977
We can make sense of some instructions in this block, such as the red ones, which simply pushes the code into the (virtual) stack. So I debugged it in hopes of finding our numbers and sum of our numbers, and voila, we have the block that we do the addition operation. This is the block that does the addition:
.???0:0000000140020893 mov eax, [rbx]
.???0:0000000140020895 mov rsi, 42DB0376h
.???0:000000014002089C mov edx, [rbx+4]
.???0:000000014002089F xor si, r15w
.???0:00000001400208A3 test dl, cl
.???0:00000001400208A5 sub rbx, 4
.???0:00000001400208AC dec esi
.???0:00000001400208AE add eax, edx
.???0:00000001400208B0 movsx esi, bx
.???0:00000001400208B3 mov [rbx+8], eax
.???0:00000001400208B6 not si
.???0:00000001400208B9 xchg si, si
.???0:00000001400208BC bswap esi
.???0:00000001400208BE pushfq
.???0:00000001400208BF mov rsi, 6F022F06h
.???0:00000001400208C6 sal sil, cl
.???0:00000001400208C9 cmp r10b, sil
.???0:00000001400208CC pop qword ptr [rbx]
.???0:00000001400208CE xor sil, dl
.???0:00000001400208D1 sar si, cl
.???0:00000001400208D4 sub rdi, 4
.???0:00000001400208DB shl sil, 0E5h
.???0:00000001400208DF mov esi, [rdi]
.???0:00000001400208E1 stc
.???0:00000001400208E2 xor esi, ebp
.???0:00000001400208E4 jmp loc_14007A020
Red instruction loads the first operand, blue instruction loads the second, and purple will do the addition and store it in the slot. So essentially the handle is just this:
.???0:0000000140020893 mov eax, [rbx]
.???0:000000014002089C mov edx, [rbx+4]
.???0:00000001400208AE add eax, edx
.???0:00000001400208B3 mov [rbx+8], eax
You could also imagine that it kind of looks like this in high level implementation:
int arg1 = stack.top();
stack.pop();
int arg2 = stack.top();
stack.push_back(arg1 + arg2);
However, we don’t always have the luxury of attaching a debugger, tracing it, and analyzing each handler manually. Even though this was a good learning experience, we need to come up with a solution that requires less manual work.
Failed attempts
If we wanted to patch or change the behavior of the program, a good solution would be creating our own function. Initially, I had different approaches and different projects such as using Triton to uncover the control flow and apply existing optimizations in Triton and just pasting the code somewhere else in the binary. However, generating the output takes quite a while, and the quality of the output is not very good.
So I decided to use unicorn engine and apply my own optimizations. You can guess that it takes quite a while to implement optimization passes when you have 0 experience with compilers, so that project was also scratched.
I remember finding retdec and playing with it; while playing with it, I found a good friend that was also was interested into the project. However, retdec wasn’t sufficent for our purposes, and I wasn’t experienced enough to make the neccesary improvements to it.
Whiplash
The failed attempts made me even more determined to work on it. So I decided to get more experience and started working on a project that will lift assembly to LLVM IR. The idea was to create something like a tracer and optimize it with LLVM passes like the last project where I used unicorn engine, but instead of writing our own optimizations, LLVM would take care of it.
I also realized I could just use LLVM passes to find the next block, so we didn’t need to use unicorn engine either.
Using LLVM would introduce me to SSA (single static assignment) format, which allows us to read the output easier than asm because every variable is only defined once, making control flow and data dependencies easier to follow. Also, LLVM is maintained by other people, and it’s already a widely used project.
Assembly
add rax, rcx
sub rdx, rbx
sub rax, rdx
LLVM IR (SSA format)
%rax2 = add i64 %rax, %rcx
%rdx2 = sub i64 %rdx, %rbx
%rax3 = sub i64 %rax2, %rdx2
So the ability to produce assembly code, passes, SSA and active community made using LLVM a no-brainer.
Unlike other devirtualization oriented projects, we are going to create a generic approach that would work with everything. So no special VIP, VSP or any vm register will be tracked; we will take the assembly code and directly lift it to LLVM IR and apply optimizations. Our only assumption is the CPU is able to execute this code, so we keep track of the memory, registers and additionally assumptions that come with branches.
Challenges
In order for our project to “trace” our virtualized function correctly, there were several challenges such as:
- Partial reads between memory slots could not be resolved.
- Storing a value with higher bits symbolized and truncating again would confuse value-tracking.
This is because LLVM is a compiler, not a deobfuscator, a normal program wouldn’t use memory slots like this.
We need to implement these solutions:
- Custom aliasing system
- Pattern matching for memory tracking
I also encountered LLVM’s KnownBits class; basically, it tracks one’s and zero’s in a value and allows us to concretize values more easily. The idea is simple: we have KnownOnes and KnownZeros.
??01
represents 4 bits, KnownOnes is 0001
and KnownZeros is 0010
, so we know the lower two bits and the rest is unknown. We can keep tracking the value if its bits get and’d with 0, or’d with 1, etc.
Action
After solving these challenges and implementing 60+ instructions, we could get the trace of instructions and optimize them; it was a huge breakthrough, even though it was taking 200+ seconds to trace a simple function.
This bottleneck was caused because we were creating a clone of our module and applying optimizations each time we come across a jump. So I came up with a solution that would optimize the instructions while lifting and making sure we don’t invalidate any necessary value. After doing this, our runtime dropped to 500~ ms, 400x speed up, an incredible milestone for our project!
The idea was to treat each instruction as a unique instruction. This gives us more room to analyze and modify the “original” function. Here, these graphs explain the overall idea of a VM.
If we flatten the control flow and reorder the blocks sequentially, its much easier to analyze.
Before jumping into VMProtect or Themida examples, let’s see how our lifter performs on this toy example.
If we call callvm
control flow of this program somewhat looks like this:
and we just reorder the blocks to this:
The result after compiling -O3 and custom passes:
define i64 @main(i64 %rax, i64 %rcx, i64 %rdx, i64 %rbx, i64 %rsp, i64 %rbp, i64 %rsi, i64 %rdi, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, ptr nocapture readnone %TEB, ptr nocapture readnone %memory) local_unnamed_addr #0 {
entry:
%0 = trunc i64 %rdx to i32
%realxor-5368715542-84 = xor i64 %rdx, %rcx
%realxor-5368715542- = trunc i64 %realxor-5368715542-84 to i32
%realadd-5368715382- = add i32 %realxor-5368715542-, %0
%1 = zext i32 %realadd-5368715382- to i64
ret i64 %1
}
We can also throw this into a decompiler, but I believe LLVM IR format is easier to read because we don’t need to fix arguments. If we wanted to see this in IDA, it would look like this:
int __fastcall main(
__int64 rax,
__int64 rcx,
__int64 rdx,
__int64 rbx,
__int64 rsp,
__int64 rbp,
__int64 rsi,
__int64 rdi,
__int64 r8,
__int64 r9,
__int64 r10,
__int64 r11,
__int64 r12,
__int64 r13,
__int64 r14,
__int64 r15)
{
return (rdx ^ rcx) + rdx;
}
After that, I just needed to implement more instructions and lift flags appropriately for supporting VMProtect 3.8 and Themida.
For comparison, here is how a function protected by VMProtect 3.8 and Themida differ in terms of control flow and results.
int foo(int a, int b) {
return a + b;
}
VMProtect 3.8 control flow (39894 non-unique instructions):
Result after -O3:
define i64 @main(i64 %rax, i64 %rcx, i64 %rdx, i64 %rbx, i64 %rsp, i64 %rbp, i64 %rsi, i64 %rdi, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, ptr nocapture readnone %TEB, ptr nocapture writeonly %memory) local_unnamed_addr #0 {
entry:
%0 = trunc i64 %rdx to i32
%1 = trunc i64 %rcx to i32
%realadd-5370932149- = add i32 %0, %1
%3 = zext i32 %realadd-5370932149- to i64
ret i64 %3
}
Themida(Fish White) control flow (45138 non-unique instructions):
Result after -O3:
After a little manual clean-up (cleaning up excess stores to .data section):
define range(i64 0, 4294967296) i64 @main(i64 %rax, i64 %rcx, i64 %rdx, i64 %rbx, i64 %rsp, i64 %rbp, i64 %rsi, i64 %rdi, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, ptr readnone captures(none) %TEB, ptr readnone captures(none) %memory) local_unnamed_addr #0 {
%0 = trunc i64 %rdx to i32
%1 = trunc i64 %rcx to i32
%realadd-5368880833- = add i32 %0, %1
%2 = zext i32 %realadd-5368880833- to i64
ret i64 %2
}
VM based obfuscators do not provide significantly stronger protection against static analysis compared to control flow flattening, as both techniques primarily aim to obscure control flow without altering the original program logic. However, bin2bin VM based solutions face additional challenges, such as reliably translating assembly instructions, reconstructing jump tables, and handling exceptions, which can introduce complexity without necessarily enhancing protection against static analysis. Additionally, the added layers of abstraction in VM based obfuscation often result in increased runtime overhead, making the protected program slower and less efficient.
Even though VM based obfuscation is more advanced than flattening, if a VM based obfuscator did not use dispatchers, and simply executed the handlers in sequence, it would be trivial to analyze it even without any specialized tools. That said, when combined with other protection mechanisms, VM based obfuscation remains a unique and valuable tool in software protection.
Whats next?
So, we’ve learned it’s possible to create a generic deobfuscator that even works with commercial VM based obfuscations. Even though there are several challenges, such as MBA (Mixed Boolean Arithmetics) and unrollable loops. Hopefully, we will talk about them in next posts.
In the next post, we will dive deeper into the technical details mentioned here and demonstrate devirtualization using Tigress examples.
Special thanks to
phage
sneed
Thanks for reading!
If you found this post helpful, consider supporting me on Github Sponsors!
Useful resources:
https://www.youtube.com/watch?v=9EgnstyACKA
https://www.msreverseengineering.com/blog/2014/6/23/vmprotect-part-0-basics
https://0xnobody.github.io/devirtualization-intro/
https://secret.club/2021/09/08/vmprotect-llvm-lifting-1.html
https://github.com/JonathanSalwan/VMProtect-devirtualization
https://blog.back.engineering/17/05/2021/
https://blog.thalium.re/posts/llvm-powered-devirtualization/