Data flow analysis roadmap
em-eight announced in Announcements
Update: Most of the stuff mentioned has been implemented, with more or less the desired features.
Milestone: 32-bit PPC data flow analysis
Goal: Extract the control flow and data flow graphs of assembly functions
Given those two analyses, a semantic (instead of bytewise) equivalence checker can be implemented.
This equivalence check can eliminate differences in register allocation, operand order in commutative operations such as addition (see the sketch below), and dead code (which requires a liveness analysis).
This similarity check also removes the need for strict slice definitions and all other forms of global ABI engineering (making sure that symbols have exactly the same alignment and order). Functions and data can be implemented anywhere in the code, as long as they have the correct symbol names.
These two checks, if implemented correctly, can produce a decompilation that has all the desired properties (equivalence, shiftability, portability) while letting us focus on actual reverse engineering rather than tedious and unproductive compiler chasing.
Also, a data flow analysis puts us very close to (another) custom decompiler.
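To make the commutativity point concrete, here is a minimal sketch of how the checker could canonicalize DFG nodes so that operand order and register allocation stop mattering. The Node type, the operation names, and canonical() are illustrative assumptions, not an existing API:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

COMMUTATIVE = {"add", "mul", "and", "or", "xor"}

@dataclass(frozen=True)
class Node:
    op: str                         # "arg"/"imm" for sources, else an operation
    args: Tuple["Node", ...] = ()
    value: Optional[int] = None     # argument index or immediate payload

def canonical(n: Node) -> tuple:
    """Structural key that ignores operand order of commutative operations."""
    keys = [canonical(a) for a in n.args]
    if n.op in COMMUTATIVE:
        keys.sort()                 # operand order no longer matters
    return (n.op, n.value, tuple(keys))

# add r3, r4, r5  and  add r3, r5, r4  produce the same canonical key
a, b = Node("arg", value=0), Node("arg", value=1)
assert canonical(Node("add", (a, b))) == canonical(Node("add", (b, a)))
```

Two DFGs would then be considered equivalent when their root nodes produce the same canonical key.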
Control Flow Analysis
Data flow analysis is first done at the basic block level, so a control flow analysis must be conducted first. Each basic block is a contiguous range of assembly that is entered only at its first instruction and transfers control only at its end. Each basic block represents a node in the control flow graph and is connected to other nodes, which can be either predecessors or successors.
Control flow analysis begins at the top of the function and consumes non-branching instructions (sketched below). Each basic block ends with a branching instruction, whose kind (unconditional jump, conditional branch, call, return, or indirect branch) determines the type of the succeeding basic block.
In the case of internal blocks (branch targets inside the function), the analysis is performed recursively, whereas for the other cases a dummy block is created. The semantic equivalence checker for each basic block checks that the jump targets are the same (for the jump and conditional branch cases) and that the CR expression is equivalent.
Unreachable blocks are ignored and do not affect equivalence.
*Knowledge of all symbol locations and sizes is required; this can already be accomplished with tools like ppcdis and decomp-toolkit.
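A simplified sketch of this pass, assuming the disassembler yields (address, mnemonic, branch target) triples. It only records intra-function edges, so the dummy blocks for calls and returns are left out, and all names are illustrative:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# PPC branch mnemonics handled in this toy: jump, conditional, call, return/indirect
JUMPS, CALLS, TERMINATORS = {"b", "bc"}, {"bl"}, {"blr", "bctr"}
BRANCHES = JUMPS | CALLS | TERMINATORS

@dataclass
class Insn:
    addr: int
    mnemonic: str
    target: Optional[int] = None    # branch destination, if statically known

@dataclass
class Block:
    start: int
    insns: List[Insn] = field(default_factory=list)
    succs: Set[int] = field(default_factory=set)   # successor block addresses

def build_cfg(insns: List[Insn]) -> Dict[int, Block]:
    # pass 1: leaders are the entry point, jump targets, and fall-throughs
    leaders = {insns[0].addr}
    for i, ins in enumerate(insns):
        if ins.mnemonic in BRANCHES:
            if ins.target is not None and ins.mnemonic in JUMPS:
                leaders.add(ins.target)
            if i + 1 < len(insns):
                leaders.add(insns[i + 1].addr)
    # pass 2: slice the instructions into blocks and record edges
    blocks: Dict[int, Block] = {}
    cur: Optional[Block] = None
    for ins in insns:
        if ins.addr in leaders:
            prev, cur = cur, blocks.setdefault(ins.addr, Block(ins.addr))
            # fall-through edge, unless the previous block cannot fall through
            if prev and prev.insns[-1].mnemonic not in {"b"} | TERMINATORS:
                prev.succs.add(ins.addr)
        cur.insns.append(ins)
        if ins.mnemonic in JUMPS and ins.target is not None:
            cur.succs.add(ins.target)
    return blocks
```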
Data flow analysis
Some of the data flow analysis must be performed at the instruction level. This will be a function that takes a disassembled instruction and outputs its input and output operands (see the sketch below).
In a data flow graph, each node is either a source (immediates, addresses, function arguments, global registers) or the result of an intermediate operation.
The data flow analysis starts with an empty machine state (OSContext) and gradually maps each register to the data flow node it represents at each point in the program.
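The per-instruction function could look like the following, driven by a small hand-written table; a real implementation would cover the full PPC ISA, and these table entries are assumptions for illustration only:

```python
from typing import List, Tuple

# mnemonic -> (operand indices read, operand indices written)
DEF_USE = {
    "add":  ((1, 2), (0,)),   # add rD, rA, rB    reads rA, rB; writes rD
    "addi": ((1,),   (0,)),   # addi rD, rA, imm  reads rA;     writes rD
    "mr":   ((1,),   (0,)),   # mr rD, rS         reads rS;     writes rD
    "lwz":  ((1,),   (0,)),   # lwz rD, off(rA)   reads rA;     writes rD
    "stw":  ((0, 1), ()),     # stw rS, off(rA)   reads rS, rA; writes memory
}

def def_use(mnemonic: str, operands: List[str]) -> Tuple[List[str], List[str]]:
    """Return the (input, output) register operands of one instruction."""
    reads, writes = DEF_USE[mnemonic]
    return [operands[i] for i in reads], [operands[i] for i in writes]

print(def_use("add", ["r3", "r4", "r5"]))   # (['r4', 'r5'], ['r3'])
```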
Basic block data flow
Here is some WIP, possibly incorrect/inconvenient pseudocode for a basic block data flow analysis.
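A minimal Python rendering of the idea, assuming each instruction has already been reduced to (mnemonic, input registers, output registers) by a def/use pass like the one above; DfgNode and analyze_block are illustrative names, not an existing API:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class DfgNode:
    op: str                            # "input" for sources, else a mnemonic
    args: Tuple["DfgNode", ...] = ()
    name: str = ""                     # register name for "input" sources

Insn = Tuple[str, List[str], List[str]]   # (mnemonic, input regs, output regs)

def analyze_block(insns: List[Insn],
                  in_state: Dict[str, DfgNode]) -> Dict[str, DfgNode]:
    """Forward transfer function of one basic block: in-state -> out-state."""
    state = dict(in_state)             # never mutate the predecessor's state
    for mnemonic, inputs, outputs in insns:
        # a register read before any write in this block is a block input
        args = tuple(state.setdefault(r, DfgNode("input", name=r)) for r in inputs)
        node = DfgNode(mnemonic, args)
        for r in outputs:              # kill the old definition, gen the new one
            state[r] = node
    return state

# r5 = r3 + r4; r5 = r5 + r5  -> r5 holds a two-level expression over the inputs
out = analyze_block([("add", ["r3", "r4"], ["r5"]),
                     ("add", ["r5", "r5"], ["r5"])], {})
print(out["r5"])
```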
Function data flow
That was the data flow analysis for a single block, implementing the basic block data flow equations (https://en.wikipedia.org/wiki/Data-flow_analysis#Basic_principles). Data flow analysis continues in the succeeding blocks, breadth first.
The starting machine state of each block should be the union of the machine states of its predecessor nodes (the input definitions of a basic block form a set, unlike the inputs of an instruction).
The data flow analysis on all the blocks in the function is repeated until convergence, optionally using the worklist approach for efficiency (see https://en.wikipedia.org/wiki/Data-flow_analysis#An_iterative_algorithm).
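An illustrative worklist loop, treating each register's definitions as a set per the union above; the transfer function is left abstract, and all names are assumptions:

```python
from collections import deque
from typing import Callable, Dict, List

State = Dict[str, frozenset]   # register -> set of reaching DFG definitions

def union(states: List[State]) -> State:
    """Meet operator: per-register union of the predecessors' out-states."""
    merged: Dict[str, set] = {}
    for s in states:
        for reg, defs in s.items():
            merged.setdefault(reg, set()).update(defs)
    return {reg: frozenset(d) for reg, d in merged.items()}

def solve(blocks: List[str],
          preds: Dict[str, List[str]],
          transfer: Callable[[str, State], State]) -> Dict[str, State]:
    """Iterate the block transfer functions to a fixed point (worklist)."""
    out: Dict[str, State] = {b: {} for b in blocks}
    work = deque(blocks)                  # seed with every block once
    while work:
        b = work.popleft()
        new = transfer(b, union([out[p] for p in preds.get(b, [])]))
        if new != out[b]:                 # out-state grew: revisit successors
            out[b] = new
            work.extend(s for s in blocks if b in preds.get(s, []))
    return out
```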
Other
Optional: liveness analysis
The data flow analysis described so far creates a data flow graph by processing all inputs (forward flow). We can ignore outputs that are detected to be unused (prune the dead variable nodes of the DFG) by performing a backward flow analysis (see https://en.wikipedia.org/wiki/Data-flow_analysis#Backward_analysis).
A liveness analysis can be performed to figure out which nodes of the DFG are actually used, and the rest can be eliminated from the scope of the equivalence checker.
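A hedged sketch of that backward pass, computing live registers per block; the block representation matches the earlier sketches and the names are illustrative:

```python
from typing import Dict, List, Set, Tuple

Insn = Tuple[str, List[str], List[str]]   # (mnemonic, input regs, output regs)

def block_liveness(insns: List[Insn], live_out: Set[str]) -> Set[str]:
    """Backward transfer function: live-out -> live-in for one basic block."""
    live = set(live_out)
    for _, inputs, outputs in reversed(insns):
        live -= set(outputs)   # a definition kills liveness above it...
        live |= set(inputs)    # ...and every read makes its register live
    return live

def liveness(blocks: Dict[str, List[Insn]],
             succs: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Iterate to a fixed point over the whole function (backward flow)."""
    live_in: Dict[str, Set[str]] = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, insns in blocks.items():
            live_out = set().union(*(live_in[s] for s in succs.get(b, [])))
            new = block_liveness(insns, live_out)
            if new != live_in[b]:
                live_in[b], changed = new, True
    return live_in
```

DFG nodes whose outputs never reach a live register can then be pruned before the equivalence check.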
What has been done?
TODO
- function semantic equivalence checker:
- program equivalence checker:
- extra:
References
[1] Cifuentes, Cristina. "Reverse Compilation Techniques." PhD thesis, Queensland University of Technology, 1994.
[2] Wikipedia, "Data-flow analysis", https://en.wikipedia.org/wiki/Data-flow_analysis