DeepExtract is an IDA Pro 9.x Plugin and Headless Analysis Framework designed primarily for headless batch processing of PE binaries to facilitate AI-assisted Vulnerability Research (VR).
Traditional reverse engineering is manual and GUI-centric. This framework provides an automated interface for processing large datasets of PE (Portable Executable) files using IDA Pro's analysis engine. It transforms binary data—including assembly, control flow graphs, and decompiled code—into a structured, AI-ready SQLite database.
Architecture: Built on IDA 9.x's plugmod_t plugin architecture, DeepExtract provides dual-mode operation:
- Headless Mode: Command-line batch processing for large-scale analysis
- Interactive Mode: GUI integration for targeted analysis
By converting unstructured data into a queryable schema, this tool facilitates:
- Programmatic Agentic Systems: Utilize the structured SQLite output as a semantic knowledge base for Research Agents (e.g., via LangGraph) to perform automated code analysis and Agentic Vulnerability Research. This establishes a data layer for Cyber Reasoning Systems (CRS) to process binary logic at scale.
- AI-Native Code Review: Export sanitized decompiled code to C++ for analysis in Claude Code, Cursor, or Codex, enabling LLMs to process function logic and data-flow invariants without the volume of unprocessed disassembly.
- Large-Scale Threat Hunting: Automate the analysis of binaries to identify cross-ecosystem vulnerability patterns, insecure API usage, and structural characteristics.
The primary purpose of DeepExtract is to extract structured data from PE binaries to support specialized research workflows. The following use cases demonstrate the application of this data in automated and interactive research. Detailed documentation for each case is pending and will be released in the coming days.
This use case demonstrates the headless extractor feature by generating the necessary context for AI grounding. The tool exports a structured C++ representation of the binary into the extracted_code/ directory, organized by module folders.
- Grounding Architecture: LLMs (e.g., Claude Code, Cursor) utilize the generated .cpp files and file_info.md index to evaluate implementation logic.
- Workflow: A researcher uses Cursor to audit specific functions, such as ShellExecuteW. The AI leverages the local context to explain parameters, detect call patterns, and identify logical invariants.
- Reporting: Automated generation of technical reports based on the source-level representation of decompiled logic.
This use case focuses on the interactive UI plugin for targeted analysis of individual binaries. It is designed to capture the latest state of a researcher's session, including renamed variables, custom comments, and manual type definitions stored in the .idb/.i64.
- Data Capture: The plugin exports the current IDA database state into a structured SQLite database.
- Schema Visibility: Researchers can query the functions and file_info tables to analyze data types, cross-references, and metadata directly via SQL.
- Session Integration: Facilitates the transfer of manual reverse engineering insights into a format compatible with external analysis tools.
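As a rough illustration of querying these tables, the sketch below builds a tiny in-memory stand-in and searches it with plain SQL. The table and column names (functions, file_info, function_name, function_signature, decompiled_code, file_name, sha256_hash) come from this README; the column types, the sample rows, and the helper functions_mentioning are assumptions made for the example, not the plugin's actual schema or API.

```python
import sqlite3

# Assumed, simplified stand-in for a DeepExtract database: only a few of the
# documented columns, with guessed types and made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file_info (file_name TEXT, sha256_hash TEXT);
CREATE TABLE functions (
    function_name TEXT,
    function_signature TEXT,
    decompiled_code TEXT
);
""")
conn.execute("INSERT INTO file_info VALUES (?, ?)", ("demo.dll", "0" * 64))
conn.executemany(
    "INSERT INTO functions VALUES (?, ?, ?)",
    [
        ("LaunchHelper", "int LaunchHelper(const wchar_t *)",
         "return ShellExecuteW(0, 0, path, 0, 0, 1);"),
        ("AddNumbers", "int AddNumbers(int, int)", "return a + b;"),
    ],
)


def functions_mentioning(conn, needle):
    """Hypothetical helper: functions whose decompiled code contains `needle`."""
    query = (
        "SELECT function_name, function_signature FROM functions "
        "WHERE decompiled_code LIKE ? ORDER BY function_name"
    )
    return conn.execute(query, (f"%{needle}%",)).fetchall()


matches = functions_mentioning(conn, "ShellExecuteW")
file_name = conn.execute("SELECT file_name FROM file_info").fetchone()[0]
print(file_name, matches)
```

Against a real export, the sqlite3.connect call would point at the database file written by the plugin, and the CREATE/INSERT statements would be dropped.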
Automated agents utilize the LangGraph Deep Agent abstraction to perform semantic reasoning across the binary's execution graph using structured inbound_xrefs and outbound_xrefs.
- Callgraph Reasoning: Agents traverse simple_outbound_xrefs to evaluate reachability and component dependencies.
- Automated Synthesis: The system generates high-level technical summaries of subroutines by analyzing their position and interactions within the global callgraph.
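The reachability question above can be sketched as a breadth-first walk over callee lists. This is a minimal sketch that assumes simple_outbound_xrefs holds a JSON array of callee names per function; the README does not specify the exact JSON shape, and the rows, build_callgraph, and reachable_from below are hypothetical.

```python
import json
from collections import deque

# Hypothetical rows: (function_name, simple_outbound_xrefs as JSON text).
rows = [
    ("main", '["parse_args", "run"]'),
    ("run", '["load_config", "CreateProcessW"]'),
    ("parse_args", "[]"),
    ("load_config", '["run"]'),       # cycle back to run
    ("unused_helper", '["strcpy"]'),  # never called from main
]


def build_callgraph(rows):
    """Map each function name to its list of callee names."""
    return {name: json.loads(xrefs or "[]") for name, xrefs in rows}


def reachable_from(graph, start):
    """Breadth-first search: every function reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for callee in graph.get(current, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


graph = build_callgraph(rows)
reach = reachable_from(graph, "main")
print(sorted(reach))
```

The visited set keeps the traversal finite on recursive or mutually recursive call chains, which are common in real callgraphs.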
This configuration implements an autonomous auditor using the Claude Agent SDK.
- Skill-Based Extraction: The agent utilizes "Skills" to interface with the SQLite backend, retrieving decompiled code and cross-reference data on-demand.
- Primitive Discovery: Automated scanning for vulnerability sinks (e.g., insecure API usage) grounded by the structured data layer.
- Fail-Safe Monitoring: Evaluation of complex logical paths where standard automated heuristics may require agentic verification.
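A sink scan of this kind can be approximated by grouping functions by the APIs recorded for them. The sketch assumes dangerous_api_calls stores a JSON array of API names, which this README does not confirm; the in-memory table, its rows, and the sink_usage helper are illustrative only.

```python
import json
import sqlite3

# Assumed stand-in table: dangerous_api_calls is treated as a JSON array.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE functions (function_name TEXT, dangerous_api_calls TEXT)")
conn.executemany(
    "INSERT INTO functions VALUES (?, ?)",
    [
        ("CopyName", json.dumps(["strcpy"])),
        ("SpawnChild", json.dumps(["CreateProcess", "strcpy"])),
        ("Checksum", json.dumps([])),
        ("NoData", None),  # tolerate rows with no recorded calls
    ],
)


def sink_usage(conn):
    """Hypothetical helper: map each API name to the functions that call it."""
    usage = {}
    cursor = conn.execute("SELECT function_name, dangerous_api_calls FROM functions")
    for name, raw in cursor:
        apis = json.loads(raw) if raw else []
        for api in apis:
            usage.setdefault(api, []).append(name)
    return {api: sorted(callers) for api, callers in usage.items()}


usage = sink_usage(conn)
print(usage)
```

An agent skill could return this mapping as a first triage step, then fetch decompiled_code only for the functions that touch a sink of interest.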
The extractor performs hierarchical analysis, transitioning from global binary metadata to function-level data.
The tool captures more than 30 metadata points for every binary, creating a comprehensive metadata profile for the file:
- Identification & Hashes: MD5, SHA256, file size, and extension.
- PE Header Intelligence: Extraction of sections, entry_point JSON, rich_header (linker data), and tls_callbacks.
- Version & Authenticity: Product/Company names, legal copyright, original filenames, and internal PDB paths (pdb_path).
- Security Posture: dll_characteristics and security_features (ASLR, DEP, NX), along with load_config and exception_info.
- Runtime Environment: Detection of .NET assemblies (is_net_assembly) and full clr_metadata extraction.
For every identified function, the tool extracts:
- Identity & Signatures: Both function_signature and function_signature_extended, including demangled and mangled names.
- Assembly & Decompiled Code: Full assembly_code and high-level decompiled_code (if Hex-Rays is available) are stored for direct semantic analysis.
Beyond raw code, the tool performs deep heuristics to find vulnerability signals:
- Dangerous API Detection: Scans for 480+ security-critical APIs (e.g., strcpy, CreateProcess) and stores the matches in dangerous_api_calls.
- String & Data Analysis: Extracts string_literals and global_var_accesses specific to each function.
- Stack & Memory Intelligence: Detailed stack_frame layouts and variable sizes to identify potential overflow primitives.
- Loop Intelligence: Implements Tarjan’s Algorithm for loop_analysis, identifying natural loops, infinite loops, and cyclomatic complexity.
- Graph Connectivity: Full inbound_xrefs (callers) and outbound_xrefs (callees), including "simple" versions for faster graph traversal.
- C++ Reconstruction: Resolves vtable_contexts and traces virtual function calls to reconstruct class hierarchies and polymorphism logic.
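To make the loop-detection idea concrete, here is a minimal sketch of Tarjan's strongly connected components algorithm applied to a toy control flow graph. It is not DeepExtract's loop_analysis code: the graph representation (a dict of basic-block names to successor lists) and the rule "an SCC with more than one block, or a block with a self-edge, contains a loop" are assumptions chosen for illustration.

```python
def tarjan_scc(graph):
    """Return the strongly connected components of a {node: [successors]} graph."""
    next_index = 0
    index = {}       # discovery order of each node
    lowlink = {}     # smallest index reachable from the node's DFS subtree
    stack = []
    on_stack = set()
    components = []

    def visit(node):
        nonlocal next_index
        index[node] = lowlink[node] = next_index
        next_index += 1
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, []):
            if succ not in index:
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop its members off the stack
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return components


def find_loops(graph):
    """Components that contain a cycle: size > 1, or a single self-looping block."""
    return sorted(
        sorted(c) for c in tarjan_scc(graph)
        if len(c) > 1 or c[0] in graph.get(c[0], [])
    )


# Toy CFG: a while-loop (cond <-> body) plus a block that jumps to itself.
cfg = {
    "entry": ["cond"],
    "cond": ["body", "exit"],
    "body": ["cond"],
    "exit": [],
    "spin": ["spin"],
}
loops = find_loops(cfg)
print(loops)
```

The recursive form is the easiest to read; for very large functions an iterative variant would avoid Python's recursion limit.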
DeepExtract supports multiple deployment methods:
- Plugin Deployment: Installation into the IDA plugins directory for integrated headless and interactive execution.
- Standalone Execution: Execution directly from the source directory.
To install as a plugin, you can use hcli:
hcli plugin install DeepExtract
For large-scale analysis, clone the repository and use the headless_batch_extractor.ps1 PowerShell script to automate batch processing with concurrent IDA instances.
- Three Extraction Modes:
  - Directory Scan: Recursively scan directories for PE files
  - File List: Process files from a text list (one path per line)
  - PID Mode: Extract all modules loaded by a running process
- IDA Auto-Detection: Automatically identifies the IDA installation (9.x series)
- Concurrent Processing: Spawns multiple IDA processes (default: 4) for parallel analysis
- Conditional Filtering: Tracks analyzed files to prevent redundant processing
- Detailed Logging: Per-file logs and error reporting
The script automatically searches for IDA Pro installations in common paths:
C:\Program Files\IDA Professional 9.x\
C:\Program Files\IDA Pro 9.x\
C:\Program Files (x86)\IDA Professional 9.x\
C:\Program Files (x86)\IDA Pro 9.x\
The latest version is selected automatically. Override this with the -IdaPath parameter.
Directory Scan Mode (Recursive)
.\headless_batch_extractor.ps1 -ExtractDir "C:\Windows\System32" -StorageDir "C:\Analysis" -Recursive
Scans all PE files in System32 and subdirectories.
File List Mode
.\headless_batch_extractor.ps1 -FilesToAnalyze "targets.txt" -StorageDir "C:\Analysis"
Where targets.txt contains:
C:\Windows\System32\kernel32.dll
C:\Windows\System32\ntdll.dll
C:\Program Files\MyApp\app.exe
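A target list like this can be generated rather than written by hand. The sketch below walks a directory and keeps files that start with the DOS "MZ" magic bytes, writing one path per line. The helper names are hypothetical, the "MZ" test is only a cheap heuristic (it does not validate the PE signature), and the text encoding the PowerShell script expects is not stated in this README, so UTF-8 is an assumption.

```python
import tempfile
from pathlib import Path


def looks_like_pe(path):
    """Cheap heuristic: True if the file starts with the DOS 'MZ' magic bytes."""
    try:
        with open(path, "rb") as handle:
            return handle.read(2) == b"MZ"
    except OSError:
        return False


def write_target_list(root, list_path):
    """Write one candidate path per line to list_path and return the paths."""
    targets = sorted(
        str(p) for p in Path(root).rglob("*") if p.is_file() and looks_like_pe(p)
    )
    text = "\n".join(targets) + ("\n" if targets else "")
    Path(list_path).write_text(text, encoding="utf-8")  # encoding is an assumption
    return targets


# Demo on a throwaway directory with one fake PE file and one text file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "fake.dll").write_bytes(b"MZ\x90\x00")
    (Path(tmp) / "notes.txt").write_bytes(b"hello")
    list_file = Path(tmp) / "targets.txt"
    targets = write_target_list(tmp, list_file)
    listed = list_file.read_text(encoding="utf-8").splitlines()

print(targets)
```

The resulting file can then be passed to the script through -FilesToAnalyze.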
PID Mode (Process Module Extraction)
.\headless_batch_extractor.ps1 -TargetPid 1234 -StorageDir "C:\Analysis"
Extracts all modules loaded by process ID 1234. Creates a dedicated subfolder with naming format:
C:\Analysis\pid_1234_processname_20260115_143022\
Custom IDA Path
.\headless_batch_extractor.ps1 -ExtractDir "C:\Malware" -StorageDir "C:\Analysis" -IdaPath "C:\IDA92\idat64.exe"
Disable Specific Features (Faster Analysis)
# Skip string extraction and C++ generation for faster processing
.\headless_batch_extractor.ps1 -ExtractDir "C:\Binaries" -StorageDir "C:\Analysis" -NoExtractStrings -NoGenerateCpp
Adjust Concurrency
# Run 8 concurrent IDA processes (for high-core systems)
.\headless_batch_extractor.ps1 -ExtractDir "C:\Large\Dataset" -StorageDir "C:\Analysis" -MaxConcurrentProcesses 8

| Flag | Description |
|---|---|
| -NoExtractDangerousApis | Skip dangerous API detection (300+ APIs) |
| -NoExtractStrings | Skip string literal extraction |
| -NoExtractStackFrame | Skip stack frame analysis |
| -NoExtractGlobals | Skip global variable tracking |
| -NoAnalyzeLoops | Skip loop analysis (Tarjan's algorithm) |
| -NoPeInfo | Skip PE version information extraction |
| -NoPeMetadata | Skip PE metadata extraction |
| -NoAdvancedPe | Skip Rich header and TLS callback analysis |
| -NoRuntimeInfo | Skip .NET and delay-load DLL analysis |
| -ForceReanalyze | Force re-analysis even if already processed |
| -NoGenerateCpp | Skip C++ code generation for AI review |

<StorageDir>/
├─ analyzed_modules_list.txt # List of files analyzed (all modes)
├─ extraction_report.json # Summary report with success/failure stats
├─ analyzed_files.db # Master tracking database
├─ extracted_dbs/
│ └─ <filename>_<hash>.db # Individual analysis databases (one per file)
├─ extracted_code/
│ └─ <filename>/ # Generated C++ code (if enabled)
│ └─ *.cpp
├─ logs/
│ └─ <filename>_<timestamp>.log # IDA analysis logs
└─ idb_cache/
└─ <filename>_<hash>.i64 # IDA database files
The extraction_report.json contains:
- Extraction timestamp and mode
- Summary statistics (total, successful, failed)
- List of successfully extracted files with paths
- List of failed extractions with error details
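Given the storage layout above, a post-processing script can iterate over the per-file databases in extracted_dbs/. The sketch below counts rows in each database's functions table. The directory name, the .db naming, and the functions table come from this README; the helper name and the throwaway demo database are assumptions for the example.

```python
import sqlite3
import tempfile
from pathlib import Path


def count_functions_per_db(storage_dir):
    """Hypothetical helper: {database file name: number of rows in functions}."""
    counts = {}
    for db_path in sorted(Path(storage_dir, "extracted_dbs").glob("*.db")):
        conn = sqlite3.connect(db_path)
        try:
            (count,) = conn.execute("SELECT COUNT(*) FROM functions").fetchone()
        finally:
            conn.close()
        counts[db_path.name] = count
    return counts


# Demo: build a throwaway storage directory with one minimal database.
with tempfile.TemporaryDirectory() as storage:
    db_dir = Path(storage, "extracted_dbs")
    db_dir.mkdir()
    demo = sqlite3.connect(db_dir / "demo.dll_abc123.db")
    demo.execute("CREATE TABLE functions (function_name TEXT)")
    demo.executemany("INSERT INTO functions VALUES (?)", [("a",), ("b",), ("c",)])
    demo.commit()
    demo.close()
    counts = count_functions_per_db(storage)

print(counts)
```

Pointing count_functions_per_db at a real StorageDir gives a quick sanity check that each extraction produced a populated database.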
# Display built-in help with colorized output
.\headless_batch_extractor.ps1 -Help
# Use PowerShell's Get-Help for detailed parameter documentation
Get-Help .\headless_batch_extractor.ps1 -Detailed
# Show all available parameters
Get-Help .\headless_batch_extractor.ps1 -Full
# Show usage examples only
Get-Help .\headless_batch_extractor.ps1 -Examples
# Phase 1: Initial scan of system binaries (skip C++ for speed)
.\headless_batch_extractor.ps1 `
-ExtractDir "C:\Windows\System32" `
-StorageDir "C:\Analysis\SystemBinaries" `
-Recursive `
-NoGenerateCpp `
-MaxConcurrentProcesses 8
# Phase 2: Targeted analysis of specific malware samples with full extraction
.\headless_batch_extractor.ps1 `
-FilesToAnalyze "C:\Samples\targets.txt" `
-StorageDir "C:\Analysis\MalwareSamples" `
-MaxConcurrentProcesses 4
# Phase 3: Runtime module extraction from suspicious process
.\headless_batch_extractor.ps1 `
-TargetPid 5678 `
-StorageDir "C:\Analysis\RuntimeExtraction"
For single-file analysis or custom scripting, run the plugin directly in headless mode using IDA's command-line tool (idat.exe or idat64.exe).
Example: Analyze a single binary
"C:\Program Files\IDA Professional 9.2\idat.exe" -A -L"C:\temp\pe_extraction_tests\output.log" -S"main.py --sqlite-db C:\temp\pe_extraction_tests\bitlockercsp.db" "C:\windows\system32\bitlockercsp.dll"
Command-Line Arguments:
- -A: Autonomous mode (no GUI)
- -L: Log file path
- -S: Plugin script to execute (main.py)
- --sqlite-db: Absolute path to the output SQLite database (required)
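For custom scripting, the same invocation can be assembled programmatically. This is a sketch only: build_idat_command is a hypothetical helper, it mirrors the switches shown in the example (-A, -L, -S, --sqlite-db) without launching IDA, and it assumes the paths placed inside the -S argument contain no spaces, since how such paths should be quoted is not covered in this README.

```python
def build_idat_command(idat_path, target, sqlite_db, log_path, extra_flags=()):
    """Assemble an argv list for a headless run; nothing is executed here."""
    # Script name and its options travel together inside the single -S switch.
    script_args = " ".join(["main.py", "--sqlite-db", str(sqlite_db), *extra_flags])
    return [
        str(idat_path),
        "-A",                # autonomous mode (no GUI)
        f"-L{log_path}",     # log file path
        f"-S{script_args}",  # plugin script plus its arguments
        str(target),
    ]


cmd = build_idat_command(
    r"C:\Program Files\IDA Professional 9.2\idat.exe",
    r"C:\windows\system32\bitlockercsp.dll",
    r"C:\temp\pe_extraction_tests\bitlockercsp.db",
    r"C:\temp\pe_extraction_tests\output.log",
    extra_flags=["--no-extract-strings"],
)
print(cmd)
# To launch it, the list could be handed to subprocess.run(cmd, check=True).
```

Passing a list rather than a single string leaves command-line quoting to the process-spawning layer.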
Optional Analysis Flags:
# Disable specific extraction features
--no-extract-dangerous-apis # Skip dangerous API detection
--no-extract-strings # Skip string literal extraction
--no-extract-stack-frame # Skip stack frame analysis
--no-extract-globals # Skip global variable tracking
--no-analyze-loops # Skip loop analysis
--no-pe-info # Skip PE version info
--no-pe-metadata # Skip PE metadata
--no-advanced-pe # Skip Rich header/TLS callbacks
--no-runtime-info # Skip .NET/delay-load analysis
# Additional options
--force-reanalyze # Force re-analysis even if already complete
--generate-cpp # Generate C++ output files for AI review
--cpp-output-dir <path> # Custom directory for C++ output (defaults to extracted_raw_code/ next to db)
--thunk-depth N # Maximum thunk resolution depth (default: 10)
--min-call-conf N # Minimum confidence for call validation (10-100)
When a binary is open in the IDA Pro GUI, the plugin is accessible via:
- Menu: Edit → Plugins → DeepExtract
- Hotkey: Ctrl-Shift-E
The interactive mode provides a configuration interface for:
- Output Management: Specification of the SQLite database path and C++ output directory.
- Feature Selection: Selection of analysis modules (Dangerous APIs, Strings, Loops, Stack Frames).
- PE Metadata Configuration: Selection of PE extraction parameters (Metadata, Advanced PE, Runtime Info).
- Analysis Parameters: Configuration of thunk resolution depth and confidence thresholds for call validation.
- Execution Monitoring: A progress indicator displays the status of the analysis pipeline.
For a comprehensive technical reference of the data architecture, schemas, and analysis heuristics, see the Data Format Reference.
The results are stored in two primary relational tables within the SQLite database.
The file_info table holds high-level metadata for the binary, including:
- file_path, file_name, file_extension, file_size_bytes.
- md5_hash, sha256_hash.
- imports, exports, entry_point.
- file_version, product_version, company_name, pdb_path.
- rich_header, tls_callbacks, is_net_assembly, clr_metadata.
- dll_characteristics, security_features, exception_info.
The functions table is the core table, containing granular data for every function in the binary:
- function_signature, mangled_name, function_name.
- assembly_code, decompiled_code.
- inbound_xrefs, outbound_xrefs (Full & Simple JSON).
- vtable_contexts, global_var_accesses, dangerous_api_calls.
- string_literals, stack_frame, loop_analysis.
- analysis_errors, created_at.
If --generate-cpp is used, the tool creates a folder structure containing one individual .cpp file per function. Additionally, it generates a single Markdown file per module (file_info.md) that serves as a high-level index and technical report for the binary.
- Operating System: Windows 10/11
- IDA Pro: Version 9.0 or later (Pro edition required for headless mode)
- Decompiler: Hex-Rays Decompiler (optional, but required for C-code generation and advanced analysis)
- Python: Python 3 environment configured within IDA (built-in with IDA 9.x)
- Dependencies:
  - pefile (bundled in deps/; used for PE header parsing)
  - IDA Python SDK (built-in with IDA Pro)
DeepExtract conforms to the IDA 9.x plugin architecture for compatibility and maintainability:
Entry Point: main.py - IDA plugin entry point using PLUGIN_ENTRY()
Core Modules:
- deep_extract/pe_context_extractor.py - Main analysis pipeline and orchestration
- deep_extract/extractor_core.py - Core extraction functions (xrefs, strings, stack frames)
- deep_extract/xref_analysis.py - Cross-reference analysis and call graph building
- deep_extract/vtable_analysis.py - C++ vtable reconstruction
- deep_extract/loop_analysis.py - Control flow and loop detection (Tarjan's algorithm)
- deep_extract/pe_metadata.py - PE header, Rich header, TLS callback extraction
- deep_extract/cpp_generator.py - C++ code generation for AI consumption
- deep_extract/schema.py - SQLite schema management and migration
- deep_extract/config.py - Configuration dataclass and validation
Plugin Lifecycle:
- IDA loads main.py and calls PLUGIN_ENTRY()
- Plugin factory (DeepExtractPlugin) initializes and creates module instance
- Plugin module (DeepExtractModule) handles per-database execution
- Detects headless vs. interactive mode based on command-line arguments
- In headless mode: runs full pipeline and exits via ida_pro.qexit()
- In interactive mode: displays the configuration interface for user selection.
DeepExtract is packaged as an IDA 9.x plugin following Hex-Rays' HCLI plugin format:
Package Contents:
- ida-plugin.json - Plugin metadata and dependency specification
- main.py - Plugin entry point
- deep_extract/ - Core analysis framework
- deps/ - Bundled dependencies (pefile)
Distribution Methods:
- Manual Installation: Copy to IDA plugins directory
- HCLI Package: Distribute as ZIP with ida-plugin.json for automated installation
- GitHub Release: Publish tagged releases for version management
Compatibility:
- IDA Pro 9.0+
- Windows
- x86-64 architecture
DeepExtract - Developed by Marcos Oviedo for Agentic Vulnerability Research