Data flow analysis roadmap
em-eight announced in Announcements
Update: Most of the stuff mentioned has been implemented, with more or less the desired features.
Milestone: 32-bit PPC data flow analysis
Goal: Extract the control flow and data flow graphs of assembly functions
Given those two analyses, a semantic (instead of bytewise) equivalence checker can be implemented.
This equivalence check can eliminate differences in register allocation, operand order in commutative operations such as addition (see the sketch below), and dead code (which requires a liveness analysis).
This similarity check also removes the need for strict slice definitions and all other forms of global ABI engineering (making sure that symbols have exactly the same alignment and order). Functions and data can be implemented anywhere in the code, as long as they have the correct symbol names.
These two checks, if implemented correctly, can produce a decompilation that has all the desired properties (equivalence, shiftability, portability) while letting us focus on actual reverse engineering rather than tedious and unproductive compiler chasing.
Also, a data flow analysis puts us very close to (another) custom decompiler.
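To make the commutativity point concrete, here is a minimal sketch of how the checker could canonicalize DFG nodes so that operand order and register allocation stop mattering. The Node type, the operation names, and canonical() are illustrative assumptions, not an existing API:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

COMMUTATIVE = {"add", "mul", "and", "or", "xor"}

@dataclass(frozen=True)
class Node:
    op: str                         # "arg"/"imm" for sources, else an operation
    args: Tuple["Node", ...] = ()
    value: Optional[int] = None     # argument index or immediate payload

def canonical(n: Node) -> tuple:
    """Structural key that ignores operand order of commutative operations."""
    keys = [canonical(a) for a in n.args]
    if n.op in COMMUTATIVE:
        keys.sort()                 # operand order no longer matters
    return (n.op, n.value, tuple(keys))

# add r3, r4, r5  and  add r3, r5, r4  produce the same canonical key
a, b = Node("arg", value=0), Node("arg", value=1)
assert canonical(Node("add", (a, b))) == canonical(Node("add", (b, a)))
```

Two DFGs would then be considered equivalent when their root nodes produce the same canonical key.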
Control Flow Analysis
Data flow analysis is first done at the basic block level, so a control flow analysis must be conducted first. Each basic block is a contiguous range of assembly that is entered only at its first instruction and transfers control only at its end. Each basic block represents a node in the control flow graph and is connected to other nodes, which can be either predecessors or successors.
Control flow analysis begins at the top of the function and consumes non-branching instructions (sketched below). Each basic block ends with a branching instruction, whose kind (unconditional jump, conditional branch, call, return, or indirect branch) determines the type of the succeeding basic block.
In the case of internal blocks (branch targets inside the function), the analysis is performed recursively, whereas for the other cases a dummy block is created. The semantic equivalence checker for each basic block checks that the jump targets are the same (for the jump and conditional branch cases) and that the CR expression is equivalent.
Unreachable blocks are ignored and do not affect equivalence.
*Knowledge of all symbol locations and sizes is required; this can already be accomplished with tools like ppcdis and decomp-toolkit.
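A simplified sketch of this pass, assuming the disassembler yields (address, mnemonic, branch target) triples. It only records intra-function edges, so the dummy blocks for calls and returns are left out, and all names are illustrative:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# PPC branch mnemonics handled in this toy: jump, conditional, call, return/indirect
JUMPS, CALLS, TERMINATORS = {"b", "bc"}, {"bl"}, {"blr", "bctr"}
BRANCHES = JUMPS | CALLS | TERMINATORS

@dataclass
class Insn:
    addr: int
    mnemonic: str
    target: Optional[int] = None    # branch destination, if statically known

@dataclass
class Block:
    start: int
    insns: List[Insn] = field(default_factory=list)
    succs: Set[int] = field(default_factory=set)   # successor block addresses

def build_cfg(insns: List[Insn]) -> Dict[int, Block]:
    # pass 1: leaders are the entry point, jump targets, and fall-throughs
    leaders = {insns[0].addr}
    for i, ins in enumerate(insns):
        if ins.mnemonic in BRANCHES:
            if ins.target is not None and ins.mnemonic in JUMPS:
                leaders.add(ins.target)
            if i + 1 < len(insns):
                leaders.add(insns[i + 1].addr)
    # pass 2: slice the instructions into blocks and record edges
    blocks: Dict[int, Block] = {}
    cur: Optional[Block] = None
    for ins in insns:
        if ins.addr in leaders:
            prev, cur = cur, blocks.setdefault(ins.addr, Block(ins.addr))
            # fall-through edge, unless the previous block cannot fall through
            if prev and prev.insns[-1].mnemonic not in {"b"} | TERMINATORS:
                prev.succs.add(ins.addr)
        cur.insns.append(ins)
        if ins.mnemonic in JUMPS and ins.target is not None:
            cur.succs.add(ins.target)
    return blocks
```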
Data flow analysis
Some of the data flow analysis must be performed at the instruction level. This will be a function that takes a disassembled instruction and outputs its input and output operands (see the sketch below).
In a data flow graph, each node is either a source (immediates, addresses, function arguments, global registers) or the result of an intermediate operation.
The data flow analysis starts with an empty machine state (OSContext) and gradually maps each register to the data flow node it represents at each point in the program.
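The per-instruction function could look like the following, driven by a small hand-written table; a real implementation would cover the full PPC ISA, and these table entries are assumptions for illustration only:

```python
from typing import List, Tuple

# mnemonic -> (operand indices read, operand indices written)
DEF_USE = {
    "add":  ((1, 2), (0,)),   # add rD, rA, rB    reads rA, rB; writes rD
    "addi": ((1,),   (0,)),   # addi rD, rA, imm  reads rA;     writes rD
    "mr":   ((1,),   (0,)),   # mr rD, rS         reads rS;     writes rD
    "lwz":  ((1,),   (0,)),   # lwz rD, off(rA)   reads rA;     writes rD
    "stw":  ((0, 1), ()),     # stw rS, off(rA)   reads rS, rA; writes memory
}

def def_use(mnemonic: str, operands: List[str]) -> Tuple[List[str], List[str]]:
    """Return the (input, output) register operands of one instruction."""
    reads, writes = DEF_USE[mnemonic]
    return [operands[i] for i in reads], [operands[i] for i in writes]

print(def_use("add", ["r3", "r4", "r5"]))   # (['r4', 'r5'], ['r3'])
```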
Basic block data flow
Here is some WIP, possibly incorrect/inconvenient pseudocode for a basic block data flow analysis.
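A minimal Python rendering of the idea, assuming each instruction has already been reduced to (mnemonic, input registers, output registers) by a def/use pass like the one above; DfgNode and analyze_block are illustrative names, not an existing API:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class DfgNode:
    op: str                            # "input" for sources, else a mnemonic
    args: Tuple["DfgNode", ...] = ()
    name: str = ""                     # register name for "input" sources

Insn = Tuple[str, List[str], List[str]]   # (mnemonic, input regs, output regs)

def analyze_block(insns: List[Insn],
                  in_state: Dict[str, DfgNode]) -> Dict[str, DfgNode]:
    """Forward transfer function of one basic block: in-state -> out-state."""
    state = dict(in_state)             # never mutate the predecessor's state
    for mnemonic, inputs, outputs in insns:
        # a register read before any write in this block is a block input
        args = tuple(state.setdefault(r, DfgNode("input", name=r)) for r in inputs)
        node = DfgNode(mnemonic, args)
        for r in outputs:              # kill the old definition, gen the new one
            state[r] = node
    return state

# r5 = r3 + r4; r5 = r5 + r5  -> r5 holds a two-level expression over the inputs
out = analyze_block([("add", ["r3", "r4"], ["r5"]),
                     ("add", ["r5", "r5"], ["r5"])], {})
print(out["r5"])
```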
Function data flow
That was the data flow analysis for a single block, implementing the basic block data flow equations (https://en.wikipedia.org/wiki/Data-flow_analysis#Basic_principles). Data flow analysis continues in the succeeding blocks, breadth first.
The starting machine state of each block should be the union of the machine states of its predecessor nodes (the input definitions of a basic block form a set, unlike the inputs of an instruction).
The data flow analysis on all the blocks in the function is repeated until convergence, optionally using the worklist approach for efficiency (see https://en.wikipedia.org/wiki/Data-flow_analysis#An_iterative_algorithm).
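An illustrative worklist loop, treating each register's definitions as a set per the union above; the transfer function is left abstract, and all names are assumptions:

```python
from collections import deque
from typing import Callable, Dict, List

State = Dict[str, frozenset]   # register -> set of reaching DFG definitions

def union(states: List[State]) -> State:
    """Meet operator: per-register union of the predecessors' out-states."""
    merged: Dict[str, set] = {}
    for s in states:
        for reg, defs in s.items():
            merged.setdefault(reg, set()).update(defs)
    return {reg: frozenset(d) for reg, d in merged.items()}

def solve(blocks: List[str],
          preds: Dict[str, List[str]],
          transfer: Callable[[str, State], State]) -> Dict[str, State]:
    """Iterate the block transfer functions to a fixed point (worklist)."""
    out: Dict[str, State] = {b: {} for b in blocks}
    work = deque(blocks)                  # seed with every block once
    while work:
        b = work.popleft()
        new = transfer(b, union([out[p] for p in preds.get(b, [])]))
        if new != out[b]:                 # out-state grew: revisit successors
            out[b] = new
            work.extend(s for s in blocks if b in preds.get(s, []))
    return out
```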
Other
Optional: liveness analysis
The data flow analysis described so far creates a data flow graph by processing all inputs (forward flow). We can ignore outputs that are detected to be unused (prune the dead variable nodes of the DFG) by performing a backward flow analysis (see https://en.wikipedia.org/wiki/Data-flow_analysis#Backward_analysis).
A liveness analysis can be performed to figure out which nodes of the DFG are actually used, and the rest can be eliminated from the scope of the equivalence checker.
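A hedged sketch of that backward pass, computing live registers per block; the block representation matches the earlier sketches and the names are illustrative:

```python
from typing import Dict, List, Set, Tuple

Insn = Tuple[str, List[str], List[str]]   # (mnemonic, input regs, output regs)

def block_liveness(insns: List[Insn], live_out: Set[str]) -> Set[str]:
    """Backward transfer function: live-out -> live-in for one basic block."""
    live = set(live_out)
    for _, inputs, outputs in reversed(insns):
        live -= set(outputs)   # a definition kills liveness above it...
        live |= set(inputs)    # ...and every read makes its register live
    return live

def liveness(blocks: Dict[str, List[Insn]],
             succs: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Iterate to a fixed point over the whole function (backward flow)."""
    live_in: Dict[str, Set[str]] = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, insns in blocks.items():
            live_out = set().union(*(live_in[s] for s in succs.get(b, [])))
            new = block_liveness(insns, live_out)
            if new != live_in[b]:
                live_in[b], changed = new, True
    return live_in
```

DFG nodes whose outputs never reach a live register can then be pruned before the equivalence check.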
What has been done?
TODO
- function semantic equivalence checker:
- program equivalence checker:
- extra:
References
[1] Cifuentes, Cristina. "Reverse Compilation Techniques." PhD thesis, Queensland University of Technology, 1994.
[2] Wikipedia, "Data-flow analysis", https://en.wikipedia.org/wiki/Data-flow_analysis