Binary Generic Objects: a framework for binary analysis
See README.STATUS for current status of the project.
There are many open-source tools available for binary analysis: binutils, capstone, elfsh, llvm, metasm, radare, volatility... the list keeps growing.
The BGO project aims to provide a platform for integrating these targeted tool into a more general-purpose analysis suite. The BGO framework provides no native loaders, disassemblers, debggers, or tracers; instead, it uses plugins to invoke external tools and convert their output into BGO's data model. This allows unrelated analysis tools to be chained together, and additional plugins can be developed to operate entirely on the BGO representation.
Applications such as user interfaces or server-side processes can be built on top of the BGO framework in order to provide a standard interface to the user of these external anaysis tools.
- scriptable in Ruby
- extensible, plugin-based architecture
- Git-backed projects for version control and distributed collaboration
- an object model for binary code and data objects
- JSON serialization
- access to Java classes (and JARs) via jruby
- "layer"-based representation of memory addresses to accomodate different interpretations of code and data, or changes to code and data over time
A 'binary' object refers to a collection of bytes which may contain code, data, a mixture of code and data, or metadata for another binary object.
The following are examples of binary objects:
- a binary data file such as a PNG
- an ASCII text file
- an executable file, such as an ELF or PE program
- a shared object file, such as a DLL or .so
- an unlinked object file
- an archive file, such as a ZIP or AR file
- a disk image
- a network packet
- a boot sector
- a "snapshot" of CPU registers and memory contents
- a control-flow trace of a process
- a trace of the heap usage for a process
- a trace of system calls invoked by a process
BGO provides an object model for managing these objects. The objects themselves are loaded into the object model by plugins.
See README.STATUS and README.COMMAND for examples of using the framework and the CLI.
The BGO framework includes the bgo command line utility, which provides a toolchain for binary analysis.
Ream (unreleased) is a Qt4 reverse engineering and refactoring tool built on top of BGO.
ReWB (unreleased) is a Qt4 UI for BGO.
ReHash (unreleased) is a GitLab-based web application for collaborative reverse engineering, based on the BGO framework.
The basic working environment in BGO. A Project can contain multiple related Files, Images, Processes, etc. A Git-backed Project is also a Git repository, meaning that Git can be used to collaborate on a single Project.
- name
- description
- bgo_version
- created
- files : Iterator over TargetFile objects
- packets : Iterator over Packet objects
- images : Iterator over Image objects
- processes : Iterator over Process objects
- comments : All ModelItem comments in Project objects
- tags : All ModelItem tags in Project objects
- properties : All ModelItem properties in Project objects
- symbols : All symbol tables in File and Process objects
A sort of ActiveRecord for BGO data types. ModelItems store metadata and track parent/child nodes. Git-backed ModelItems serialize to disk; in-memory ModelItems do not.
- obj_path : path to object in current project (e.g. /process/1000/map/999)
- uuid : project-independent obj_path
- ident : Unique identifier for this object, often used as a key
- properties : Hash of user-supplied data
- comments : Hash of Context (Symbol) to a Hash of Author (String) to Comment (String). This means each author can have a single context-specific comment per object.
- tags : Array of Symbols which represent tags for objects
A comment attached to a ModelItem.
- author
- content
- timestamp
- text
Associates an Image object with an on-disk file. A TargetFile can contain nested child TargetFile objects, as in an archive.
- name
- full_path
- dir
- child_files
A contiguous sequence of bytes from an Image which can be divided into Section objects, but which cannot be loaded into a Process object. This is used to represent network packet data.
A sequence of bytes. This can be the contents of a file, the contents of a location in memory, or the contents of a patch, and so forth. All binary data is stored in an Image object.
- ident : the SHA digets of the Image contents
- contents
A sequence of bytes whose contents lie outside the Project (and the Project repository). This allows binary images to be stored outside of the Project in order to accomodate concerns about storage space or security.
An Image that has no on-disk contents, such as a zero-initialized area of memory.
- size
- fill
An in-memory representation of an Image whose bytes have been modified ("patched"). This associates a base Image object with an ImageChangeset and a Revision number.
- start_addr
- changeset
- revision
A sequence of revisions to an Image. Each revision is a collection of bytes that have changed in the Image, or Address objects that have been defined on the Image.
- current_revision
- base_image
- start_addr
This contains Address objects and byte patches to the base Image object that are defined in this ImageRevision. This is managed by ImageChangeset.
Architecture information for a file or image.
- arch : CPU architecture
- mach : CPU model or revision
- endian
Identification information for code or data
- format : file format (e.g. ELF, JPG)
- summary
- full
- mime
- contents (:code | :data)
A contiguous sequence of bytes in a TargetFile or Packet. This is normally used in object file formats to distiguish between code, data, and metadata. The Section object contains Address objects, such as the results of static disassemby.
- ident
- name
- file_offset
- size
- flags : (rwx)
- addresses
Representation of an OS process, which includes mapping one or more Image objects into a virtual memory region.
- ident : A numeric ID for this process. This corresponds to PID, and is arbitrary (but unique within the project) unless the Process object is being created from a tracing utility. The same command line or executable can be used with different idents in order to represent multiple executions of the same program (e.g. a server process) which vary based on runtime input
- command : Command line used to launch this process. The same executable can be used with multiple commands in order to represent different executions of the same program which vary based on command line arguments
- filename : Name of the TargetFile object for the main executable of this process, if applicable
- arch_info
- maps
- addresses
A mapping of a portion of an Image into a Procss address space.
- vma
- size
- image
- image_offset
- arch_info
- addresses
An address definition. This defines a region of fixed size at a specific VMA in a Process, or at an offset into a File.
- image
- vma
- offset
- size
- bytes
- names
- references
- content_type (:code | :data | :unknown)
- contents : An Instruction or data object
An assembly language instruction. This associates an Opcode object with an OperandList object to represent a disassembled instruction for a specific byte sequence. An Instruction object is considered to be an instance-of an Opcode object. It consists of an Opcode object and an OperandList of zero or more Operand objects. Each Opcode object and each Operand object can be considered a Singleton object, and may be stored as such in the backend database in order to conserve space.
- arch : CPU Architecture
- ascii : ASCII (String) representation of the instruction
- opcode : Opcode object
- prefixes : Array of instruction prefixes (Strings)
- side effects : Array of side effects for combination of Opcode + Operands
- operands : Array of Operand objects
- comment : Comment for all occurrences of instruction
A CPU instruction definition, ignoring operands. Generally, there will only be a single instance of each Opcode in memory or in a stored database; all Instruction objects reference the same instruction.
- mnemonic : ASCII (String) mnemonic of the instruction
- isa : ISA subset (e.g. General, MMX, SMM, etc) of the instruction
- category : Descriptive category (e.g. MATH, STACK) of the instruction
- operations : Array of operations (ADD, SUB, etc) performed by instruction
- flags_read : Array of flags read by instruction
- flags_set : Array of flags modified by instruction
A subclass of Array for storing an arbitrary number of ordered operands
- dest : Destination operand
- src : Source operand
- target : Target of a branch instruction
- access : Array of access strings (rwx) for each operand
- ascii
- value : a Register, IndirectAddress, or Numeric object
As with Opcode, there will only be a single instance of a Register object. This object is shared betweenthe disassembler and the VM.
- mnemonic : ASCII (String) name of register
- size : Size of register in bytes
- type : Type of register, e.g. GEN, FPU, SIMD
- purpose : Array of strings desribing the general purpose of a register (e.g. accumulator, stack pointer, instruction pointer, flags)
- id : A numeric identifier for the register, unique to each physical register (e.g. RAX, EAX, AX, AH, AL have the same id on x86-64)
- mask : A binary mask of bits that are significant for this register. On x86, EAX would be 0xFFFFFFFF, AX would be 0xFFFF, AH would be 0xFF00, and AL would be 0xFF. Shift is implicit in mask.
- scale
- index
- base
- shift
- segment
A collection of related instructions. The instructions need not be sequential or contiguous. A block is used for descriptive purposes and has its own namespace/scope.
- start_addr
- size
- parent
- scope
- revision
A sequence of instructions which has no entry or exit points between the start and end instructions.
- start_addr
- size
- parent
A mapping from a name to a value.
- name
- namespace
- value
an address in a code segment
an address in a data segment
a constant value
a symbol in a file header (metadata) such as a section name
A collection of Symbols defined in a single scope (usually a Block)
- children : child (enclosed) Scope objects
- parent : parent (enclosing) Scope object
- symbols : Array of Symbol objects defined in this and child scopes
A reference from one object to another.
- from : Referrer
- to : Referent
- access (rwx)
- revision : Revision number in changeset
Reference to an Address object
Reference to a Function object
Reference to a TargetFile object
Reference to a Library object
Reference to a Process object
Reference to a URI (String)
A description of a disassembly task passed to the :disassemble method of a Plugin. The Task defines a perform() method which determines the disassembly algorithm (e.g. linear, control-flow, emulated).
- start_addr
- output : Collection of VMA:IAddress key:value pairs
- handler : optional callback to invoke on each Address object
- options : a Hash of plugin-specific options
BGO uses the tg-plugins module for its plugin system.
All of the core functionality is implemented in plugins; the framework itself simply provides project management and a data model.
The following Plugin interfaces are defined:
- analysis : Analyze the target, returning an AnalysisResults object
- decode_insn : Generate an Instruction object from a disassembled instruction (String)
- disassemble : Disassemble the target, creating Address and Instruction objects for each instruction. Takes a DisasmTask to determine the algorithm, and uses a decoder plugin for the ISA to create the Instruction object
- ident
- export : Export data from BGO to a plugin-defined file format
- import : Import data to BGO from a plugin-defined file format
- ident : Identify the type of a file or String of bytes
- load : Create memory maps for a Process object based on an executable file format (e.g. ELF)
- load_target : This loads one or more object files into a Process, then performs a disassembly on all entrypoints and (function) symbols in the Process. Performs the operations of load and disassemble plugins, and is generally provided by plugins for 3rd-party applications which perform the entire load-parse-disassemble cycle themselves, such as Metasm or IDA. Can also be used by an application to automate frequently-used combinations of load and disassemble plugins.
- parse_file : Create sections for a TargetFile object based on an executable file format (e.g. ELF).
- target : Add targets into a Project and perform analysis on them. The actions taken are determined by the plugin, and should be general (e.g. ident, parse, and load an executable file). This is used by applications to automate commonly-used analyses.
- unpack : Unpack an Image object to a new Image object. Used to convert compressed or encrypted Images to plaintext.
unpack parse load target export analysis
https://github.com/mkfs/pogo-license
This is the standard BSD 3-clause license with a 4th clause added to prohibit non-collaborative communication with project developers.