DeepExtract - PE Context Extraction Framework

Project Overview

DeepExtract is an IDA Pro 9.x plugin and headless analysis framework for batch processing of PE binaries in support of AI-assisted Vulnerability Research (VR).

Traditional reverse engineering is manual and GUI-centric. This framework provides an automated interface for processing large datasets of PE (Portable Executable) files using IDA Pro's analysis engine. It transforms binary data (assembly, control flow graphs, and decompiled code) into a structured, AI-ready SQLite database.

Architecture: Built on IDA 9.x's plugmod_t plugin architecture, DeepExtract provides dual-mode operation:

Headless Mode: Command-line batch processing for large-scale analysis
Interactive Mode: GUI integration for targeted analysis

By converting unstructured data into a queryable schema, this tool facilitates:

Programmatic Agentic Systems: Utilize the structured SQLite output as a semantic knowledge base for Research Agents (e.g., via LangGraph) to perform automated code analysis and Agentic Vulnerability Research. This establishes a data layer for Cyber Reasoning Systems (CRS) to process binary logic at scale.
AI-Native Code Review: Export sanitized decompiled code to C++ for analysis in Claude Code, Cursor, or Codex, so that LLMs can reason about function logic and data-flow invariants without having to work through large volumes of raw disassembly.
Large-Scale Threat Hunting: Automate the analysis of binaries to identify cross-ecosystem vulnerability patterns, insecure API usage, and structural characteristics.

Use Cases

The primary purpose of DeepExtract is to extract structured data from PE binaries to support specialized research workflows. The following use cases demonstrate the application of this data in automated and interactive research. Detailed documentation for each case is pending and will be released in the coming days.

AI-Assisted Code Understanding (Cursor / Claude Code)

This use case demonstrates the headless extractor feature by generating the necessary context for AI grounding. The tool exports a structured C++ representation of the binary into the extracted_code/ directory, organized by module folders.

Grounding Architecture: LLMs (e.g., Claude Code, Cursor) utilize the generated .cpp files and file_info.md index to evaluate implementation logic.
Workflow: A researcher uses Cursor to audit specific functions, such as ShellExecuteW. The AI leverages the local context to explain parameters, detect call patterns, and identify logical invariants.
Reporting: Automated generation of technical reports based on the source-level representation of decompiled logic.

Interactive Analysis & Structured Data Export

This use case focuses on the interactive UI plugin for targeted analysis of individual binaries. It is designed to capture the latest state of a researcher's session, including renamed variables, custom comments, and manual type definitions stored in the .idb/.i64.

Data Capture: The plugin exports the current IDA database state into a structured SQLite database.
Schema Visibility: Researchers can query the functions and file_info tables to analyze data types, cross-references, and metadata directly via SQL.
Session Integration: Facilitates the transfer of manual reverse engineering insights into a format compatible with external analysis tools.
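
As a minimal sketch of the Schema Visibility point, the exported database can be read with Python's standard sqlite3 module. The table and column names used here (file_info, functions, file_name, sha256_hash, function_name, decompiled_code) come from the schema summary in this README. The in-memory stand-in database, the sample rows, and the assumption that decompiled_code is NULL when no decompilation is stored are illustrative only; a real export may differ.

```python
import sqlite3


def build_stand_in_db() -> sqlite3.Connection:
    """Create an in-memory stand-in with a few of the documented columns."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE file_info (file_name TEXT, sha256_hash TEXT);
        CREATE TABLE functions (function_name TEXT, decompiled_code TEXT);
        """
    )
    # Placeholder rows, not real extraction output.
    conn.execute("INSERT INTO file_info VALUES (?, ?)", ("example.dll", "00" * 32))
    conn.executemany(
        "INSERT INTO functions VALUES (?, ?)",
        [
            ("InitWorker", "int InitWorker() { return 0; }"),
            ("RunWorker", None),  # assumed: NULL when no decompiled code exists
        ],
    )
    return conn


def functions_with_decompilation(conn: sqlite3.Connection) -> list:
    """Return the names of functions that have decompiled code stored."""
    rows = conn.execute(
        "SELECT function_name FROM functions "
        "WHERE decompiled_code IS NOT NULL ORDER BY function_name"
    )
    return [name for (name,) in rows]


conn = build_stand_in_db()
file_name = conn.execute("SELECT file_name FROM file_info").fetchone()[0]
decompiled = functions_with_decompilation(conn)
print(file_name, decompiled)
```

Against a real export, replacing ":memory:" with the path of the generated .db file and dropping the stand-in setup should be enough to run the same query.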

Deep Research Agents via Callgraph Traversal

Automated agents utilize the LangGraph Deep Agent abstraction to perform semantic reasoning across the binary's execution graph using structured inbound_xrefs and outbound_xrefs.

Callgraph Reasoning: Agents traverse the simple_outbound_xrefs to evaluate reachability and component dependencies.
Automated Synthesis: The system generates high-level technical summaries of subroutines by analyzing their position and interactions within the global callgraph.
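
The reachability step can be sketched as a breadth-first traversal. This example assumes that simple_outbound_xrefs decodes to a JSON list of callee names; the README does not specify the exact shape, so the rows below are hypothetical and the parsing may need adjusting for real data.

```python
import json
from collections import deque

# Hypothetical rows: (function_name, simple_outbound_xrefs as JSON text).
ROWS = [
    ("main", '["parse_args", "run"]'),
    ("run", '["ShellExecuteW", "log"]'),
    ("parse_args", "[]"),
    ("log", "[]"),
    ("unused_helper", '["ShellExecuteW"]'),
]


def build_callgraph(rows):
    """Map each function name to the list of callees decoded from JSON."""
    return {name: json.loads(xrefs) for name, xrefs in rows}


def reachable_from(graph, start):
    """Return every function reachable from start, including start itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        # Imported APIs may appear only as callees, hence the default.
        for callee in graph.get(current, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


graph = build_callgraph(ROWS)
reachable = reachable_from(graph, "main")
print(sorted(reachable))
```

An agent could use the resulting set to decide which functions are worth fetching decompiled code for before reasoning about a sink such as ShellExecuteW.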

Autonomous Vulnerability Research (Claude Agent SDK)

This configuration implements an autonomous auditor using the Claude Agent SDK.

Skill-Based Extraction: The agent utilizes "Skills" to interface with the SQLite backend, retrieving decompiled code and cross-reference data on-demand.
Primitive Discovery: Automated scanning for vulnerability sinks (e.g., insecure API usage) grounded by the structured data layer.
Fail-Safe Monitoring: Evaluation of complex logical paths where standard automated heuristics may require agentic verification.

Extraction Capabilities

The extractor performs hierarchical analysis, transitioning from global binary metadata to function-level data.

Binary & Metadata Extraction (File Level)

The tool captures more than 30 metadata points for every binary, creating a comprehensive metadata profile for the file:

Identification & Hashes: MD5, SHA256, file size, and extension.
PE Header Intelligence: Extraction of sections, entry_point JSON, rich_header (linker data), and tls_callbacks.
Version & Authenticity: Product/Company names, legal copyright, original filenames, and internal PDB paths (pdb_path).
Security Posture: dll_characteristics and security_features (ASLR, DEP, NX), along with load_config and exception_info.
Runtime Environment: Detection of .NET assemblies (is_net_assembly) and full clr_metadata extraction.
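
To illustrate how the Security Posture fields relate to the PE header, the sketch below decodes a raw DllCharacteristics value into flag names. The bit values are the standard IMAGE_DLLCHARACTERISTICS_* constants from the PE format; whether DeepExtract stores dll_characteristics as a raw integer or as an already decoded structure is not stated here, so treat the input format as an assumption.

```python
# Subset of the IMAGE_DLLCHARACTERISTICS_* bits defined by the PE format.
DLL_CHARACTERISTICS = {
    0x0020: "HIGH_ENTROPY_VA",  # 64-bit ASLR
    0x0040: "DYNAMIC_BASE",     # ASLR
    0x0080: "FORCE_INTEGRITY",
    0x0100: "NX_COMPAT",        # DEP
    0x0400: "NO_SEH",
    0x4000: "GUARD_CF",         # Control Flow Guard
}


def decode_dll_characteristics(value: int) -> list:
    """Return the names of the known flags set in value, lowest bit first."""
    return [name for bit, name in sorted(DLL_CHARACTERISTICS.items()) if value & bit]


# 0x4160 = GUARD_CF | NX_COMPAT | DYNAMIC_BASE | HIGH_ENTROPY_VA
flags = decode_dll_characteristics(0x4160)
print(flags)
```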

Function-Level Analysis

For every identified function, the tool extracts:

Identity & Signatures: Both function_signature and function_signature_extended, including demangled and mangled names.
Assembly & Decompiled Code: Full assembly_code and high-level decompiled_code (if Hex-Rays is available) are stored for direct semantic analysis.

Security Context & Semantic Analysis

Beyond raw code, the tool performs deep heuristics to find vulnerability signals:

Dangerous API Detection: Scans for 480+ security-critical APIs (e.g., strcpy, CreateProcess) stored in dangerous_api_calls.
String & Data Analysis: Extracts string_literals and global_var_accesses specific to each function.
Stack & Memory Intelligence: Detailed stack_frame layouts and variable sizes to identify potential overflow primitives.
Loop Intelligence: Uses Tarjan's algorithm for loop_analysis to identify natural loops and infinite loops, and computes cyclomatic complexity for each function.
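
For readers unfamiliar with the loop analysis mentioned above, here is a minimal sketch of Tarjan's strongly connected components algorithm applied to a control flow graph, plus the usual cyclomatic complexity formula M = E - N + 2. It is not DeepExtract's loop_analysis code: the graph representation (a dict of block name to successor list) and the rule "an SCC with a cycle is a loop candidate" are simplifying assumptions, and natural-loop detection proper normally also relies on dominators and back edges.

```python
def tarjan_scc(graph):
    """Return the strongly connected components of a directed graph."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, []):
            if succ not in index:
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop its members off the stack.
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return components


def loop_components(graph):
    """Keep SCCs that contain a cycle: several blocks, or a self-edge."""
    return [
        sorted(c)
        for c in tarjan_scc(graph)
        if len(c) > 1 or c[0] in graph.get(c[0], [])
    ]


def cyclomatic_complexity(graph):
    """M = E - N + 2, assuming one connected CFG with every block as a key."""
    edges = sum(len(succs) for succs in graph.values())
    return edges - len(graph) + 2


# Hypothetical CFG: entry -> header -> body -> header (loop), header -> exit.
cfg = {"entry": ["header"], "header": ["body", "exit"], "body": ["header"], "exit": []}
loops = loop_components(cfg)
complexity = cyclomatic_complexity(cfg)
print(loops, complexity)
```

The recursive form keeps the sketch short; very large functions would need an iterative version to stay within Python's recursion limit.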

Relationship & Control Flow Intelligence

Graph Connectivity: Full inbound_xrefs (callers) and outbound_xrefs (callees), including "simple" versions for faster graph traversal.
C++ Reconstruction: Resolves vtable_contexts and traces virtual function calls to reconstruct class hierarchies and polymorphism logic.

Usage Guide

Installation

DeepExtract supports multiple deployment methods:

Plugin Deployment: Installation into the IDA plugins directory for integrated headless and interactive execution.
Standalone Execution: Execution directly from the source directory.

To install as a plugin, you can use hcli:

hcli plugin install DeepExtract

Headless Batch Extraction (PowerShell Script)

For large-scale analysis, clone the repository and use the headless_batch_extractor.ps1 PowerShell script to automate batch processing with concurrent IDA instances.

Features

Three Extraction Modes:
- Directory Scan: Recursively scan directories for PE files
- File List: Process files from a text list (one path per line)
- PID Mode: Extract all modules loaded by a running process
IDA Auto-Detection: Automatically identifies the IDA installation (9.x series)
Concurrent Processing: Spawns multiple IDA processes (default: 4) for parallel analysis
Conditional Filtering: Tracks analyzed files to prevent redundant processing
Detailed Logging: Per-file logs and error reporting

IDA Auto-Detection

The script automatically searches for IDA Pro installations in common paths:

C:\Program Files\IDA Professional 9.x\
C:\Program Files\IDA Pro 9.x\
C:\Program Files (x86)\IDA Professional 9.x\
C:\Program Files (x86)\IDA Pro 9.x\

The latest installed version is selected automatically. To override the detected path, pass the -IdaPath parameter.
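
The selection rule can be illustrated with a short Python sketch. The actual logic lives in the PowerShell script and is not shown in this README, so the regular expression, the function name pick_latest_ida, and the candidate list below are assumptions that only mirror the directory naming pattern listed above.

```python
import re

# Matches directory names such as "IDA Professional 9.2" or "IDA Pro 9.1".
_VERSION_RE = re.compile(r"IDA (?:Professional|Pro) (\d+)\.(\d+)")


def pick_latest_ida(install_dirs):
    """Return the directory with the highest (major, minor) version, or None."""
    best_dir, best_version = None, None
    for path in install_dirs:
        match = _VERSION_RE.search(path)
        if not match:
            continue  # not an IDA installation directory
        version = (int(match.group(1)), int(match.group(2)))
        if best_version is None or version > best_version:
            best_dir, best_version = path, version
    return best_dir


# Hypothetical candidates found on disk.
candidates = [
    r"C:\Program Files\IDA Professional 9.0",
    r"C:\Program Files\IDA Professional 9.2",
    r"C:\Program Files (x86)\IDA Pro 9.1",
    r"C:\Tools\NotIDA",
]
latest = pick_latest_ida(candidates)
print(latest)
```

Comparing versions as integer tuples rather than strings avoids ordering mistakes such as "9.10" sorting before "9.2".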

Usage Examples

Directory Scan Mode (Recursive)

.\headless_batch_extractor.ps1 -ExtractDir "C:\Windows\System32" -StorageDir "C:\Analysis" -Recursive

Scans all PE files in System32 and subdirectories.

File List Mode

.\headless_batch_extractor.ps1 -FilesToAnalyze "targets.txt" -StorageDir "C:\Analysis"

Where targets.txt contains:

C:\Windows\System32\kernel32.dll
C:\Windows\System32\ntdll.dll
C:\Program Files\MyApp\app.exe

PID Mode (Process Module Extraction)

.\headless_batch_extractor.ps1 -TargetPid 1234 -StorageDir "C:\Analysis"

Extracts all modules loaded by process ID 1234. Creates a dedicated subfolder with naming format:

C:\Analysis\pid_1234_processname_20260115_143022\

Custom IDA Path

.\headless_batch_extractor.ps1 -ExtractDir "C:\Malware" -StorageDir "C:\Analysis" -IdaPath "C:\IDA92\idat64.exe"

Disable Specific Features (Faster Analysis)

# Skip string extraction and C++ generation for faster processing
.\headless_batch_extractor.ps1 -ExtractDir "C:\Binaries" -StorageDir "C:\Analysis" -NoExtractStrings -NoGenerateCpp

Adjust Concurrency

# Run 8 concurrent IDA processes (for high-core systems)
.\headless_batch_extractor.ps1 -ExtractDir "C:\Large\Dataset" -StorageDir "C:\Analysis" -MaxConcurrentProcesses 8

Analysis Flags

| Flag | Description |
| --- | --- |
| `-NoExtractDangerousApis` | Skip dangerous API detection |
| `-NoExtractStrings` | Skip string literal extraction |
| `-NoExtractStackFrame` | Skip stack frame analysis |
| `-NoExtractGlobals` | Skip global variable tracking |
| `-NoAnalyzeLoops` | Skip loop analysis (Tarjan's algorithm) |
| `-NoPeInfo` | Skip PE version information extraction |
| `-NoPeMetadata` | Skip PE metadata extraction |
| `-NoAdvancedPe` | Skip Rich header and TLS callback analysis |
| `-NoRuntimeInfo` | Skip .NET and delay-load DLL analysis |
| `-ForceReanalyze` | Force re-analysis even if already processed |
| `-NoGenerateCpp` | Skip C++ code generation for AI review |

Output Structure

<StorageDir>/
├─ analyzed_modules_list.txt      # List of files analyzed (all modes)
├─ extraction_report.json         # Summary report with success/failure stats
├─ analyzed_files.db              # Master tracking database
├─ extracted_dbs/
│  └─ <filename>_<hash>.db        # Individual analysis databases (one per file)
├─ extracted_code/
│  └─ <filename>/                 # Generated C++ code (if enabled)
│     └─ *.cpp
├─ logs/
│  └─ <filename>_<timestamp>.log  # IDA analysis logs
└─ idb_cache/
   └─ <filename>_<hash>.i64       # IDA database files

The extraction_report.json contains:

Extraction timestamp and mode
Summary statistics (total, successful, failed)
List of successfully extracted files with paths
List of failed extractions with error details
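
Downstream tooling usually starts by enumerating the per-file databases. The sketch below relies only on the directory layout documented above (an extracted_dbs/ folder containing one .db per analyzed file). The file names it creates are hypothetical placeholders, and the temporary directory stands in for a real StorageDir.

```python
import tempfile
from pathlib import Path


def list_analysis_dbs(storage_dir):
    """Return the per-file databases under <StorageDir>/extracted_dbs, sorted."""
    return sorted(Path(storage_dir, "extracted_dbs").glob("*.db"))


# Build a throwaway directory that mimics the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    dbs_dir = Path(tmp, "extracted_dbs")
    dbs_dir.mkdir()
    (dbs_dir / "kernel32_abc123.db").touch()  # hypothetical <filename>_<hash>.db
    (dbs_dir / "ntdll_def456.db").touch()
    (Path(tmp) / "analyzed_files.db").touch()  # master tracking db, not per-file
    found = [p.name for p in list_analysis_dbs(tmp)]

print(found)
```

Because the glob is limited to extracted_dbs/, the master tracking database at the top level is not picked up by mistake.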

Getting Help

# Display built-in help with colorized output
.\headless_batch_extractor.ps1 -Help

# Use PowerShell's Get-Help for detailed parameter documentation
Get-Help .\headless_batch_extractor.ps1 -Detailed

# Show all available parameters
Get-Help .\headless_batch_extractor.ps1 -Full

# Show usage examples only
Get-Help .\headless_batch_extractor.ps1 -Examples

Enterprise Workflow Example

# Phase 1: Initial scan of system binaries (skip C++ for speed)
.\headless_batch_extractor.ps1 `
    -ExtractDir "C:\Windows\System32" `
    -StorageDir "C:\Analysis\SystemBinaries" `
    -Recursive `
    -NoGenerateCpp `
    -MaxConcurrentProcesses 8

# Phase 2: Targeted analysis of specific malware samples with full extraction
.\headless_batch_extractor.ps1 `
    -FilesToAnalyze "C:\Samples\targets.txt" `
    -StorageDir "C:\Analysis\MalwareSamples" `
    -MaxConcurrentProcesses 4

# Phase 3: Runtime module extraction from suspicious process
.\headless_batch_extractor.ps1 `
    -TargetPid 5678 `
    -StorageDir "C:\Analysis\RuntimeExtraction"

Headless Mode (Individual File Extraction)

For single-file analysis or custom scripting, run the plugin directly in headless mode using IDA's command-line tool (idat.exe or idat64.exe).

Example: Analyze a single binary

"C:\Program Files\IDA Professional 9.2\idat.exe" -A -L"C:\temp\pe_extraction_tests\output.log" -S"main.py --sqlite-db C:\temp\pe_extraction_tests\bitlockercsp.db" "C:\windows\system32\bitlockercsp.dll"

Command-Line Arguments:

-A: Autonomous mode (no GUI)
-L: Log file path
-S: Plugin script to execute (main.py)
--sqlite-db: Absolute path to the output SQLite database (required)

Optional Analysis Flags:

# Disable specific extraction features
--no-extract-dangerous-apis   # Skip dangerous API detection
--no-extract-strings          # Skip string literal extraction
--no-extract-stack-frame      # Skip stack frame analysis
--no-extract-globals          # Skip global variable tracking
--no-analyze-loops            # Skip loop analysis
--no-pe-info                  # Skip PE version info
--no-pe-metadata              # Skip PE metadata
--no-advanced-pe              # Skip Rich header/TLS callbacks
--no-runtime-info             # Skip .NET/delay-load analysis

# Additional options
--force-reanalyze            # Force re-analysis even if already complete
--generate-cpp               # Generate C++ output files for AI review
--cpp-output-dir <path>      # Custom directory for C++ output (defaults to extracted_raw_code/ next to db)
--thunk-depth N              # Maximum thunk resolution depth (default: 10)
--min-call-conf N            # Minimum confidence for call validation (10-100)
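
For custom scripting, the single-file invocation can be assembled programmatically. This is a sketch under stated assumptions: build_idat_command is a hypothetical helper, it only constructs the argument list from the switches documented above (-A, -L, -S, --sqlite-db and the optional flags), and it does not quote paths inside the -S string, so script arguments containing spaces would need extra handling.

```python
def build_idat_command(idat_path, target, sqlite_db, log_path, extra_flags=()):
    """Build the argument list for one headless DeepExtract run."""
    # Everything after -S is passed to IDA as a single script string.
    script_args = " ".join(["main.py", "--sqlite-db", sqlite_db, *extra_flags])
    return [idat_path, "-A", f"-L{log_path}", f"-S{script_args}", target]


cmd = build_idat_command(
    idat_path=r"C:\Program Files\IDA Professional 9.2\idat.exe",
    target=r"C:\windows\system32\bitlockercsp.dll",
    sqlite_db=r"C:\temp\pe_extraction_tests\bitlockercsp.db",
    log_path=r"C:\temp\pe_extraction_tests\output.log",
    extra_flags=["--no-extract-strings", "--generate-cpp"],
)
print(cmd)
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a machine where IDA is installed; the example stops short of launching anything.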

Interactive Mode (GUI)

When a binary is open in the IDA Pro GUI, the plugin is accessible via:

Menu: Edit → Plugins → DeepExtract
Hotkey: Ctrl-Shift-E

The interactive mode provides a configuration interface for:

Output Management: Specification of the SQLite database path and C++ output directory.
Feature Selection: Selection of analysis modules (Dangerous APIs, Strings, Loops, Stack Frames).
PE Metadata Configuration: Selection of PE extraction parameters (Metadata, Advanced PE, Runtime Info).
Analysis Parameters: Configuration of thunk resolution depth and confidence thresholds for call validation.
Execution Monitoring: A progress indicator displays the status of the analysis pipeline.

Output Architecture

For a comprehensive technical reference of the data architecture, schemas, and analysis heuristics, see the Data Format Reference.

The results are stored in two primary relational tables within the SQLite database.

Table: `file_info`

High-level metadata for the binary, including:

file_path, file_name, file_extension, file_size_bytes.
md5_hash, sha256_hash.
imports, exports, entry_point.
file_version, product_version, company_name, pdb_path.
rich_header, tls_callbacks, is_net_assembly, clr_metadata.
dll_characteristics, security_features, exception_info.

Table: `functions`

The core table containing granular data for every function in the binary:

function_signature, mangled_name, function_name.
assembly_code, decompiled_code.
inbound_xrefs, outbound_xrefs (Full & Simple JSON).
vtable_contexts, global_var_accesses, dangerous_api_calls.
string_literals, stack_frame, loop_analysis.
analysis_errors, created_at.
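
Several functions columns hold structured data rather than plain text. As a hedged sketch, the example below assumes dangerous_api_calls is stored as a JSON array of API names, which this README does not confirm, and filters functions by one API in Python. The in-memory table and its rows are placeholders.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE functions (function_name TEXT, dangerous_api_calls TEXT)")
conn.executemany(
    "INSERT INTO functions VALUES (?, ?)",
    [
        # Hypothetical rows; the JSON shape is an assumption.
        ("CopyUserName", json.dumps(["strcpy"])),
        ("LaunchHelper", json.dumps(["CreateProcess", "strcpy"])),
        ("ComputeChecksum", json.dumps([])),
    ],
)


def functions_calling(conn, api_name):
    """Return the sorted names of functions whose API list contains api_name."""
    hits = []
    query = "SELECT function_name, dangerous_api_calls FROM functions"
    for name, raw in conn.execute(query):
        calls = json.loads(raw) if raw else []  # tolerate NULL or empty text
        if api_name in calls:
            hits.append(name)
    return sorted(hits)


strcpy_callers = functions_calling(conn, "strcpy")
createprocess_callers = functions_calling(conn, "CreateProcess")
print(strcpy_callers, createprocess_callers)
```

Decoding in Python keeps the sketch independent of whether the local SQLite build ships the JSON functions.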

Directory: `extracted_raw_code/` (Optional)

If --generate-cpp is used, the tool creates a folder structure containing one individual .cpp file per function. Additionally, it generates a single Markdown file per module (file_info.md) that serves as a high-level index and technical report for the binary.

Technical Requirements

Operating System: Windows 10/11
IDA Pro: Version 9.0 or later (Pro edition required for headless mode)
Decompiler: Hex-Rays Decompiler (optional, but required for C-code generation and advanced analysis)
Python: Python 3 environment configured within IDA (built-in with IDA 9.x)
Dependencies:
- pefile (Bundled in deps/; used for PE header parsing)
- IDA Python SDK (built-in with IDA Pro)

Plugin Architecture

DeepExtract conforms to the IDA 9.x plugin architecture for compatibility and maintainability:

Entry Point: main.py - IDA plugin entry point using PLUGIN_ENTRY()

Core Modules:

deep_extract/pe_context_extractor.py - Main analysis pipeline and orchestration
deep_extract/extractor_core.py - Core extraction functions (xrefs, strings, stack frames)
deep_extract/xref_analysis.py - Cross-reference analysis and call graph building
deep_extract/vtable_analysis.py - C++ vtable reconstruction
deep_extract/loop_analysis.py - Control flow and loop detection (Tarjan's algorithm)
deep_extract/pe_metadata.py - PE header, Rich header, TLS callback extraction
deep_extract/cpp_generator.py - C++ code generation for AI consumption
deep_extract/schema.py - SQLite schema management and migration
deep_extract/config.py - Configuration dataclass and validation
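
To give a feel for what a "configuration dataclass and validation" module might look like, here is a small sketch. It is not the contents of deep_extract/config.py: the class name ExtractionConfig and its field names are hypothetical, and only the thunk depth default (10) and the 10-100 confidence range are taken from the command-line options documented in this README.

```python
from dataclasses import dataclass


@dataclass
class ExtractionConfig:
    """Hypothetical settings object mirroring a few documented options."""

    sqlite_db: str            # --sqlite-db (required)
    thunk_depth: int = 10     # --thunk-depth, documented default
    min_call_conf: int = 10   # --min-call-conf; placeholder default
    generate_cpp: bool = False  # --generate-cpp

    def __post_init__(self):
        if not self.sqlite_db:
            raise ValueError("sqlite_db is required")
        if self.thunk_depth < 1:
            raise ValueError("thunk_depth must be at least 1")
        if not 10 <= self.min_call_conf <= 100:
            raise ValueError("min_call_conf must be between 10 and 100")


config = ExtractionConfig(sqlite_db="out.db", generate_cpp=True)
print(config)
```

Validating in __post_init__ means an invalid combination fails at construction time, before any analysis work starts.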

Plugin Lifecycle:

IDA loads main.py and calls PLUGIN_ENTRY()
Plugin factory (DeepExtractPlugin) initializes and creates module instance
Plugin module (DeepExtractModule) handles per-database execution
Detects headless vs. interactive mode based on command-line arguments
In headless mode: runs full pipeline and exits via ida_pro.qexit()
In interactive mode: displays the configuration interface for user selection.

Plugin Distribution

DeepExtract is packaged as an IDA 9.x plugin following Hex-Rays' HCLI plugin format:

Package Contents:

ida-plugin.json - Plugin metadata and dependency specification
main.py - Plugin entry point
deep_extract/ - Core analysis framework
deps/ - Bundled dependencies (pefile)

Distribution Methods:

Manual Installation: Copy to IDA plugins directory
HCLI Package: Distribute as ZIP with ida-plugin.json for automated installation
GitHub Release: Publish tagged releases for version management

Compatibility:

IDA Pro 9.0+
Windows
x86-64 architecture

DeepExtract - Developed by Marcos Oviedo for Agentic Vulnerability Research
