Core Explorer is a comprehensive development audit and analysis platform designed to systematically review and assess the health of large-scale open source projects, with a primary focus on Bitcoin Core. The platform addresses a critical question in open source development: "Who watches the watcher?" It provides tools to identify when code changes may have received insufficient peer review.

At its core, Core Explorer processes git repository data (extracting commit history, tracking authors and committers, and analyzing relationships between contributors and their code changes) and stores this information in a Neo4j graph database that models the complex relationships between developers, commits, and code paths. The system includes a Flask-based backend with a GraphQL API for flexible data querying, a Next.js web interface for visualizing repository metrics and contributor activity, and automated processing pipelines that can analyze entire repositories or drill down into specific files and directories.

Key health metrics tracked include self-merge ratios (when authors merge their own code, indicating potential gaps in peer review), contributor acknowledgment patterns, and per-line code quality indices. By providing transparent, data-driven insights into the peer review process, Core Explorer helps maintainers, contributors, and auditors understand the review coverage and quality of code contributions, ultimately strengthening the integrity and security of critical open source projects.
Core Explorer uses a Git-first schema that models repository history as a graph database. The schema is designed around the principle that commits are the primary stable keys (SHA-based), enabling efficient queries about who changed what, when, and how.
Identity - Represents raw contributor identities from Git
- Properties: `source` (e.g., "git"), `name`, `email` (composite unique key)
- Relationships: `AUTHORED` → Commit, `COMMITTED` → Commit, `TAGGED` → TagObject
Commit - Represents Git commits
- Properties: `commit_hash` (unique SHA), `authoredAt`, `committedAt`, `message`, `summary`, `isMerge` (boolean)
- Relationships: `HAS_PARENT` → Commit (with `idx` property for parent order), `HAS_CHANGE` → FileChange, `MERGED_INCLUDES` → Commit (for merge analysis)
FileChange - Tracks file-level changes per commit
- Properties: `status` (A/M/D/R for Added/Modified/Deleted/Renamed), `add` (lines added), `del` (lines deleted), `rename_from` (nullable), `isSensitive` (boolean), `commit_hash`, `path` (composite unique key with `commit_hash`)
- Relationships: `OF_PATH` → Path
Path - Represents file paths in the repository
- Properties: `path` (unique string)
- Relationships: Connected via FileChange nodes
Ref - Represents Git branches and tags
- Properties: `kind` ("branch" or "tag"), `name`, `remote` (nullable, e.g., "origin")
- Relationships: `POINTS_TO` → Commit or TagObject
TagObject - Represents annotated Git tags
- Properties: `name`, `taggerAt` (datetime), `message`
- Relationships: `TAG_OF` → Commit, `HAS_SIGNATURE` → PGPKey
PGPKey - Represents PGP/GPG keys used for signing
- Properties: `fingerprint` (unique), `createdAt` (nullable), `revokedAt` (nullable)
- Relationships: Connected via `HAS_SIGNATURE` from Commits and TagObjects
IngestRun - Tracks each data import session
- Properties: `id` (unique UUID), `pulledAt` (datetime), `status`, progress counters
- Relationships: `SAW_REF` → RefState
RefState - Snapshots of ref positions at import time
- Properties: `name`, `kind`, `remote`, `tipSha` (commit SHA at snapshot time)
- Relationships: `POINTS_TO` → Commit
- `AUTHORED` / `COMMITTED`: Links Identity nodes to Commit nodes with timestamp properties
- `HAS_PARENT`: Links commits to their parent commits (enables ancestry traversal)
- `MERGED_INCLUDES`: For merge commits, links to commits introduced by the merge (reachable from the 2nd parent but not the 1st)
- `HAS_CHANGE`: Links commits to FileChange nodes representing file modifications
- `HAS_SIGNATURE`: Links Commits and TagObjects to PGPKey nodes with validation status
- `SAW_REF`: Links IngestRun to RefState snapshots (enables tracking ref movement over time)
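To make the edge properties above concrete, here is a minimal, hypothetical sketch of how one commit's backbone could be merged into Neo4j with the official Python driver. The Cypher, parameter names, edge property names, and credentials are illustrative assumptions; the real ingest logic lives in `backend/app/git_processor.py` and `backend/app/neo4j_driver.py`, is batched, and also writes `COMMITTED` edges the same way.

```python
# Illustrative sketch only - not the backend's actual implementation.
# Shows AUTHORED (with an assumed timestamp property) and HAS_PARENT (with idx)
# edges being merged for a single commit.
from neo4j import GraphDatabase

BACKBONE_CYPHER = """
MERGE (a:Identity {source: $source, name: $author_name, email: $author_email})
MERGE (c:Commit {commit_hash: $sha})
  ON CREATE SET c.message = $message, c.summary = $summary,
                c.authoredAt = $authored_at, c.committedAt = $committed_at,
                c.isMerge = $is_merge
MERGE (a)-[:AUTHORED {at: $authored_at}]->(c)
WITH c
UNWIND range(0, size($parent_shas) - 1) AS i
WITH c, i, $parent_shas[i] AS parent_sha
MERGE (p:Commit {commit_hash: parent_sha})
MERGE (c)-[:HAS_PARENT {idx: i}]->(p)
"""

def write_commit(driver, commit_params):
    # commit_params: sha, source, author_name, author_email, message, summary,
    # authored_at, committed_at, is_merge, parent_shas (list of SHAs in order)
    driver.execute_query(BACKBONE_CYPHER, **commit_params)

# Credentials normally come from .env (APP_NEO4J_USER / APP_NEO4J_PASSWORD).
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "change_me"))
```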
The schema enforces uniqueness constraints on:
- `Commit.commit_hash`
- `Identity (source, email, name)`
- `Path.path`
- `Ref (kind, name, remote)`
- `PGPKey.fingerprint`
- `IngestRun.id`
- `FileChange (commit_hash, path)`
These constraints ensure data integrity and enable efficient lookups and merges during incremental imports.
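For reference, a hedged sketch of what those constraints could look like as Neo4j 5 Cypher statements, executed through the Python driver. The constraint names are made up here, and the composite-uniqueness syntax assumes a Neo4j 5 server; the backend's schema-setup code is the source of truth.

```python
# Illustrative only: Neo4j 5-style uniqueness constraints matching the list above.
# Constraint names are invented; adjust if your Neo4j edition does not support
# composite uniqueness constraints.
from neo4j import GraphDatabase

CONSTRAINTS = [
    "CREATE CONSTRAINT commit_hash_unique IF NOT EXISTS FOR (c:Commit) REQUIRE c.commit_hash IS UNIQUE",
    "CREATE CONSTRAINT identity_unique IF NOT EXISTS FOR (i:Identity) REQUIRE (i.source, i.email, i.name) IS UNIQUE",
    "CREATE CONSTRAINT path_unique IF NOT EXISTS FOR (p:Path) REQUIRE p.path IS UNIQUE",
    "CREATE CONSTRAINT ref_unique IF NOT EXISTS FOR (r:Ref) REQUIRE (r.kind, r.name, r.remote) IS UNIQUE",
    "CREATE CONSTRAINT pgpkey_unique IF NOT EXISTS FOR (k:PGPKey) REQUIRE k.fingerprint IS UNIQUE",
    "CREATE CONSTRAINT ingest_run_unique IF NOT EXISTS FOR (ir:IngestRun) REQUIRE ir.id IS UNIQUE",
    "CREATE CONSTRAINT file_change_unique IF NOT EXISTS FOR (fc:FileChange) REQUIRE (fc.commit_hash, fc.path) IS UNIQUE",
]

def ensure_constraints(uri="bolt://localhost:7687", user="neo4j", password="change_me"):
    # Credentials come from .env (APP_NEO4J_USER / APP_NEO4J_PASSWORD) in the real stack.
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        for stmt in CONSTRAINTS:
            driver.execute_query(stmt)
```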
The schema supports powerful analysis queries such as:
- Sensitive path analysis: Track changes to critical code paths (e.g., consensus, policy)

  MATCH (c:Commit)-[:HAS_CHANGE]->(fc:FileChange)-[:OF_PATH]->(p:Path)
  WHERE p.path STARTS WITH "src/consensus"
  RETURN c, fc, p
  ORDER BY c.authoredAt DESC LIMIT 50;
- Merge ancestry analysis: Identify which commits were introduced by each merge

  MATCH (m:Commit {isMerge: true})-[:MERGED_INCLUDES]->(a:Commit)
  RETURN m, collect(a) AS introduced_commits
  ORDER BY m.committedAt DESC LIMIT 20;
- Temporal ref tracking: Show the commits added between two tagged releases, including the author identities that help answer "who contributed to this release?"

  MATCH (tagRef:Ref {kind: "tag"})-[:POINTS_TO]->(tag:TagObject)-[:TAG_OF]->(tagTip:Commit)
  WHERE tag.name IN ["v30.2", "v28.1"]
  WITH tagRef, tag, tagTip
  ORDER BY tag.taggerAt DESC
  WITH collect({ref: tagRef, tag: tag, tip: tagTip}) AS tags
  WITH tags[0] AS newer, tags[1] AS older
  WITH newer, older, newer.tip AS newerTip, older.tip AS olderTip
  MATCH (newerTip)-[:HAS_PARENT*0..]->(c:Commit)
  WHERE NOT EXISTS { MATCH (olderTip)-[:HAS_PARENT*0..]->(c) }
  OPTIONAL MATCH (author:Identity)-[:AUTHORED]->(c)
  RETURN newer.ref AS newer_tag_ref, newer.tag AS newer_tag, newer.tip AS newer_tip,
         older.ref AS older_tag_ref, older.tag AS older_tag, older.tip AS older_tip,
         c, author
  LIMIT 200;
- Temporal ref tracking: Detect force-pushes and history rewrites by comparing RefState snapshots
  // NOTE: this will only show force-pushes since your last IngestRun
  MATCH (run:IngestRun)-[:SAW_REF]->(r:RefState)-[:POINTS_TO]->(c:Commit)
  WHERE r.kind = "branch" AND r.name = "main"
  WITH run, r, c
  ORDER BY run.pulledAt DESC
  WITH collect({run: run, state: r, commit: c}) AS states
  UNWIND range(0, size(states) - 2) AS idx
  WITH states[idx] AS newer, states[idx + 1] AS older
  WITH newer, older, newer.commit AS newerCommit, older.commit AS olderCommit
  WHERE NOT EXISTS { MATCH (newerCommit)-[:HAS_PARENT*0..]->(olderCommit) }
  RETURN newer.run AS newer_run, newer.state AS newer_ref, newerCommit AS newer_tip,
         older.run AS older_run, older.state AS older_ref, olderCommit AS older_tip
  LIMIT 10;
- PGP signature auditing: Track which commits and tags are signed, and by which keys

  MATCH (c:Commit)-[:HAS_SIGNATURE]->(k:PGPKey)
  WITH k, collect(c)[0..20] AS signed_commits, count(c) AS signed_count
  RETURN k, signed_commits
  ORDER BY signed_count DESC;
The Core Explorer Kit is organized into several key directories, each serving a specific purpose in the data processing and visualization pipeline. Below is a detailed breakdown of the project structure, including Docker configuration and data dependencies.
core-explorer-kit/
│
├── backend/ # Flask backend service (Docker service: "backend")
│ ├── app/ # Python application code
│ │ ├── app.py # Flask app with REST & GraphQL endpoints
│ │ ├── schema.py # GraphQL schema definitions
│ │ ├── git_processor.py # Git repository processing logic
│ │ ├── neo4j_driver.py # Neo4j database connection & queries
│ │ ├── commit_details.py # Commit metadata extraction
│ │ └── config.py # Configuration (Neo4j connection, repo paths)
│ ├── Dockerfile # Backend container build configuration
│ ├── Pipfile # Python dependencies (pipenv)
│ └── wsgi.py # WSGI entry point for production
│
├── CE_demo/ # Next.js frontend application
│ ├── app/ # Next.js app directory
│ │ ├── api/ # API route handlers
│ │ ├── page.jsx # Main dashboard page
│ │ └── pr/[id]/ # Pull request detail pages
│ ├── components/ # React components
│ ├── public/ # Static assets
│ ├── package.json # Node.js dependencies
│ └── README.md # Frontend documentation
│
├── repo_explorer/ # Ruby scripts for data processing
│ ├── github_scrape_commits_or_pulls.rb # GitHub API scraping
│ ├── process_commit_data.rb # Commit data processing
│ └── README.md # Processing pipeline documentation
│
├── frontend/ # Static HTML frontend (served by nginx)
│ ├── index.html # Landing page
│ ├── project.html # Project view page
│ └── profile.html # Profile view page
│
├── data/ # Data persistence directory (⚠️ REQUIRED)
│ ├── neo4j/ # Neo4j database storage (Docker volume)
│ │ ├── databases/ # Neo4j database files
│ │ └── transactions/ # Transaction logs
│ │ └── [Persisted in Docker volume: ./data/neo4j:/data]
│ │
│ └── user_supplied_repo/ # Git repository to analyze (⚠️ REQUIRED)
│ └── [Cloned repository, e.g., bitcoin/bitcoin]
│ └── [Mounted to backend as: ./data/user_supplied_repo:/app/bitcoin]
│
├── docker-compose.yml # Docker orchestration configuration
├── nginx.conf # Nginx reverse proxy configuration
└── README.md # This file
The project uses Docker Compose to orchestrate three main services:
- neo4j (Database)
  - Image: `neo4j:5.20.0`
  - Ports: `7474` (HTTP), `7687` (Bolt protocol)
  - Env File: `.env` (uses `APP_NEO4J_USER` / `APP_NEO4J_PASSWORD`)
  - Volume: `./data/neo4j:/data` - persists database files
  - Health Check: Waits for Neo4j to be ready before starting dependent services
  - Dependencies: None (starts first)
- backend (Flask API)
  - Build: `./backend` (uses `backend/Dockerfile`)
  - Ports: `5000:5000`
  - Env File: `.env`
  - Volumes:
    - `./backend/app:/app` - app code for live reloading
    - `./backend/wsgi.py:/app/wsgi.py` - WSGI entry point
    - `./backend/wsgi.ini:/app/wsgi.ini` - WSGI configuration
    - `${USER_SUPPLIED_REPO_PATH}:/app/bitcoin` - git repository access (environment-configurable)
  - Dependencies: Waits for the `neo4j` health check
  - Network: Connects to `appnet` to communicate with Neo4j
- nginx (Reverse Proxy)
  - Image: `nginx:alpine`
  - Ports: `8080:8080`
  - Volumes:
    - `./nginx.conf:/etc/nginx/nginx.conf:ro` - Nginx configuration
    - `./frontend:/app/frontend` - static HTML files
  - Dependencies: Waits for the `neo4j` and `backend` services
  - Routing:
    - `/api/*` → proxies to `backend:5000`
    - `/` → serves static files from `/app/frontend`
Required Data Directories:
- `data/user_supplied_repo/` (⚠️ REQUIRED, or set `USER_SUPPLIED_REPO_PATH`)
  - Purpose: Contains the git repository to be analyzed
  - Setup: Clone your target repository here (e.g., `git clone https://github.com/bitcoin/bitcoin.git data/user_supplied_repo`)
  - Docker Mount: Mounted to the backend container at `/app/bitcoin`
  - Used By: `backend/app/git_processor.py` reads from `config.CONTAINER_SIDE_REPOSITORY_PATH`
- `data/neo4j/` (auto-created, but required for persistence)
  - Purpose: Stores Neo4j graph database files
  - Setup: Created automatically on first run
  - Docker Mount: Mounted to the Neo4j container at `/data`
  - Persistence: Database data persists across container restarts
  - Note: Delete this folder to reset the database
- `.env`: Environment configuration for sensitive credentials and deployment-specific settings (not committed to git)
  - `APP_NEO4J_USER`: Neo4j database username (default: `neo4j`)
  - `APP_NEO4J_PASSWORD`: Neo4j database password (⚠️ change for production!)
  - `CONTAINER_SIDE_REPOSITORY_PATH`: Path to the repository inside the container (default: `/app/bitcoin`)
  - `USER_SUPPLIED_REPO_PATH`: Path to the repository on the host (default: `./data/user_supplied_repo`)
- `.env.example`: Template for the `.env` file with placeholder values (committed to git)
- `backend/app/config.py`: Reads configuration from environment variables with fallback defaults
- `nginx.conf`: Routes API requests to the backend and serves static frontend files
- `docker-compose.yml`: Orchestrates all services and defines the network topology
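As a rough illustration of the pattern `backend/app/config.py` follows (environment variables with fallback defaults), something like the following; the actual module may organize this differently:

```python
# Hypothetical sketch of the config pattern: values come from .env via the
# environment, with fallback defaults. The real backend/app/config.py may differ.
import os

APP_NEO4J_USER = os.environ.get("APP_NEO4J_USER", "neo4j")
APP_NEO4J_PASSWORD = os.environ.get("APP_NEO4J_PASSWORD", "your_secure_password_here")
CONTAINER_SIDE_REPOSITORY_PATH = os.environ.get("CONTAINER_SIDE_REPOSITORY_PATH", "/app/bitcoin")
```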
New contributors can stay productive by developing the Python backend locally while still relying on Docker Compose for the stateful services (Neo4j and nginx). The checklist below assumes macOS or Linux, but the same steps work on Windows with WSL2.
- Python 3.11 (earlier versions work but match the Docker image for fewer surprises)
- `pipenv` for dependency + virtualenv management
- Docker Desktop (or Docker Engine) with Compose v2 enabled
- Git, curl, and a modern browser for inspecting GraphQL + Neo4j UIs
git clone https://github.com/coreexplorer-org/core-explorer-kit.git
cd core-explorer-kit
# Create environment configuration file
cp .env.example .env
# Edit .env and update APP_NEO4J_PASSWORD and other settings as needed
# Create + populate the data mounts expected by docker-compose.yml
mkdir -p data
cd data
git clone https://github.com/bitcoin/bitcoin.git user_supplied_repo
cd ..
# Install backend dependencies inside a virtual environment
cd backend
pipenv install --dev

- Launch the infrastructure: from the repo root run `docker compose up -d neo4j nginx`. This keeps Neo4j + nginx identical to production while freeing the backend for local iteration. Use `docker compose logs -f neo4j` if you need to confirm readiness.
- Enter the backend virtualenv: `cd backend && pipenv shell`.
- Start the Flask server with live reload: `FLASK_APP=app.app FLASK_RUN_PORT=5000 flask run --debug`. The app will connect to the Neo4j container via the hostname defined in `backend/app/config.py`.
- Iterate: edit files under `backend/app/` and Flask reloads automatically. Hit `http://localhost:5000/api/graphql` (direct) or `http://localhost:8080/api/graphql` (via nginx) to interact with GraphQL.
To stop everything, exit the Pipenv shell and run docker compose down from the repository root. If you need to reset Neo4j data, delete the host folder at ./data/neo4j (bind mount) before starting the stack again.
Two bootstrap scripts are provided for automated deployment:
For local development or single-user setups:
./scripts/bootstrap-stack.sh

This script:
- Clones the repository to `~/core-explorer-kit` if not present
- Resets the cloned repo to the newest `origin/main` state (discarding local changes)
- Checks for a `.env` file and prompts to create it if missing
- Rebuilds the backend Docker image
- Starts the Neo4j, backend, and nginx services
- Configures git `safe.directory` for the mounted repository (reads `CONTAINER_SIDE_REPOSITORY_PATH` from `.env`)
For production deployments on dedicated servers:
./scripts/bootstrap-sov-stack.sh

This script:
- Must be run as the `deploy` user (enforces security)
- Clones/updates the repository to `/opt/core-explorer-kit`
- Resets the cloned repo to the newest `origin/main` state (discarding local changes)
- Checks for a `.env` file and prompts to create it if missing
- Links persistent data storage from `/srv/core-explorer-kit/data`
- Pulls pre-built Docker images (no local builds)
- Starts the stack with production-ready configuration
- Configures git `safe.directory` for the mounted repository (reads `CONTAINER_SIDE_REPOSITORY_PATH` from `.env`)
If you are intentionally discarding Neo4j data during a production upgrade, follow this exact sequence to avoid bind-mount confusion:
- Stop the stack from `/opt/core-explorer-kit`:

  docker compose down

- Remove the host data directory (this is the bind mount target):

  rm -rf /srv/core-explorer-kit/data/neo4j

  If you are running from the repo directory and `./data` is a symlink to `/srv/core-explorer-kit/data`, the equivalent is `rm -rf ./data/neo4j`.

- Re-run the production bootstrap:

  ./scripts/bootstrap-sov-stack.sh
Both scripts will interactively prompt you to create a .env file if one doesn't exist:
WARNING: .env file not found
Would you like to create a .env file now? (y/n)
y
Creating .env file...
APP_NEO4J_USER [neo4j]:
APP_NEO4J_PASSWORD [your_secure_password_here]: my_secure_password
CONTAINER_SIDE_REPOSITORY_PATH [/app/bitcoin]:
USER_SUPPLIED_REPO_PATH [./data/user_supplied_repo]:
.env file created successfully!
You can accept defaults by pressing Enter, or provide custom values. The .env file is automatically ignored by git to protect sensitive credentials.
# Lint + format (if you add tooling later, wire it up here)
pipenv run pytest backend/tests -q # run fast unit tests
docker compose logs -f backend # tail backend logs when containerized
docker compose exec backend bash # hop into the container when debugging
pipenv run flask --app app.app routes # inspect available Flask routes

These commands mirror what CI/CD will do: install dependencies with Pipenv, talk to Dockerized services, and run pytest. Staying close to this flow locally keeps surprises to a minimum.
The repository path configuration is a critical aspect of Core Explorer's setup, as it determines where the system looks for the git repository to analyze. Understanding this configuration is essential for both initial setup and troubleshooting.
The repository path is configured through a combination of Docker volume mounts and Python configuration:
- Host Directory: `./data/user_supplied_repo/` (on your local machine)
- Container Path: `/app/bitcoin` (inside the backend Docker container)
- Configuration Variable: `CONTAINER_SIDE_REPOSITORY_PATH = "/app/bitcoin"` in `backend/app/config.py`
The path mapping is established in docker-compose.yml:
backend:
  volumes:
    - ./data/user_supplied_repo:/app/bitcoin

This Docker volume mount creates a bridge between:
- Host path: `./data/user_supplied_repo/` (relative to the `core-explorer-kit` directory)
- Container path: `/app/bitcoin` (absolute path inside the container)
When the backend container runs, it sees the cloned repository at /app/bitcoin, regardless of what the repository is actually called on the host system.
The repository path is referenced in several places:
- `backend/app/config.py` (Line 5):

  CONTAINER_SIDE_REPOSITORY_PATH = "/app/bitcoin"  # Where a cloned repo exists

  This is the primary configuration that all Python code uses.

- `backend/app/git_processor.py` (Line 28):

  repo = Repo(config.CONTAINER_SIDE_REPOSITORY_PATH)

  The git processor uses this path to initialize the GitPython `Repo` object for processing commits.

- `backend/app/schema.py` (Line 144):

  repo_path = os.path.join(config.CONTAINER_SIDE_REPOSITORY_PATH, folder)
  gitfame.main(['-t', repo_path, '--format=json', '--show-email'])

  The GraphQL `fame` resolver uses the configuration variable to construct the repository path dynamically, ensuring consistency across the codebase.
Standard Setup (Bitcoin Core):
- Create the data directory structure:

  mkdir -p data
  cd data/

- Clone the repository into the expected location:

  git clone https://github.com/bitcoin/bitcoin.git user_supplied_repo

- The repository structure should be:

  data/user_supplied_repo/
  ├── .git/   # Required: Git metadata directory
  └── ...     # (repository files and directories)

  Note: Core Explorer only requires a valid git repository with a `.git` directory. The specific files and structure of the repository are not important - any git repository will work.

- When Docker starts, this maps to `/app/bitcoin/` inside the container, which matches `config.CONTAINER_SIDE_REPOSITORY_PATH`.
If you want to analyze a different repository, you have two options:
Option 1: Keep the same container path (Recommended)
- Clone your repository to `data/user_supplied_repo/`:

  rm -rf data/user_supplied_repo   # Remove old repo if needed
  cd data/
  git clone <your-repo-url> user_supplied_repo

- No code changes needed - the Docker mount and `config.py` remain the same.
Option 2: Change the container path
If you need to use a different container path:
- Update `docker-compose.yml`:

  volumes:
    - ./data/user_supplied_repo:/app/your_repo_name

- Update `backend/app/config.py`:

  CONTAINER_SIDE_REPOSITORY_PATH = "/app/your_repo_name"
- Path Consistency: The path in `config.py` must match the Docker volume mount destination path. If the mount destination is `/app/bitcoin`, then `CONTAINER_SIDE_REPOSITORY_PATH` must be `/app/bitcoin`.
- Working Directory: The backend container's working directory is `/app` (set in `backend/Dockerfile`). This means:
  - Absolute paths like `/app/bitcoin` work from anywhere
  - Relative paths like `./bitcoin/` work when the working directory is `/app`
- Repository Requirements: The repository must be a valid git repository with:
  - A `.git` directory
  - At least one commit
  - Readable by the container user (typically root or the user specified in the Dockerfile)
- Path in GraphQL: The `fame` resolver in `schema.py` uses `config.CONTAINER_SIDE_REPOSITORY_PATH` to construct paths dynamically, so it automatically adapts to any repository path configuration.
Error: "fatal: not a git repository"
- Check that `data/user_supplied_repo/` contains a valid git repository
- Verify the Docker volume mount is working: `docker exec backend ls -la /app/bitcoin`
- Ensure the `.git` directory exists in the mounted location
Error: "No such file or directory"
- Verify the path in `config.py` matches the Docker mount destination
- Check that the repository was cloned correctly before starting Docker
- Ensure the volume mount path in `docker-compose.yml` is correct
Error: "SHA is empty, possible dubious ownership in the repository"
- Git 2.35+ blocks repositories whose owner differs from the current container user; the backend image now preconfigures `/app/bitcoin` as a safe directory.
- If you pulled the repo before this fix, rebuild the backend image so the setting is baked in:

  docker compose build backend
  docker compose up -d backend
- For a running container that you do not want to rebuild yet, run the following once to trust the mounted repo:
docker compose exec backend git config --global --add safe.directory /app/bitcoin
GraphQL fame query fails
- Verify that `CONTAINER_SIDE_REPOSITORY_PATH` in `config.py` matches your Docker mount destination
- Check that the folder path parameter is relative to the repository root (e.g., `"src/policy"`, not `"/app/bitcoin/src/policy"`)
- Ensure the repository is properly mounted and accessible at the configured path
When processing git data:
- Initial Import: `process_git_data()` in `git_processor.py` reads from `config.CONTAINER_SIDE_REPOSITORY_PATH` to get all commits
- File-Level Analysis: `find_relevant_commits()` uses `repo.iter_commits(paths=folder_or_file_path)`, where paths are relative to the repository root
- GraphQL Queries: The `fame` resolver constructs paths using `os.path.join(config.CONTAINER_SIDE_REPOSITORY_PATH, folder)` and passes them to `gitfame`, where the folder parameter is relative to the repository root
All paths used in the codebase should be relative to the repository root (e.g., "src/policy", "src/consensus"), not absolute container paths.
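A small, assumed example of that convention using GitPython, the library the git processor is built on. `REPO_ROOT` stands in for `config.CONTAINER_SIDE_REPOSITORY_PATH`, and the loop is illustrative rather than the backend's exact code:

```python
# Sketch only: relative paths combined with the configured repository root.
# find_relevant_commits() in git_processor.py follows this GitPython pattern;
# exact function signatures in the backend may differ.
from git import Repo  # GitPython

REPO_ROOT = "/app/bitcoin"  # config.CONTAINER_SIDE_REPOSITORY_PATH inside the container

repo = Repo(REPO_ROOT)
# Paths passed to iter_commits are relative to the repository root:
for commit in repo.iter_commits(paths="src/policy", max_count=10):
    print(commit.hexsha[:12], commit.summary)
```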
To get started, clone the necessary repositories in the parent directory.
Navigate one directory up from your current location. This ensures you're outside of the core-explorer-kit repo folder:
mkdir data
cd data/
git clone https://github.com/bitcoin/bitcoin.git user_supplied_repo
cd ..
# we are now back at the root
cd ..
# Clone the required repositories in parent folder
git clone https://github.com/coreexplorer-org/repo_explorer.git
git clone https://github.com/coreexplorer-org/repex.git
git clone https://github.com/coreexplorer-org/CE_demo.git

Before running the stack, create a .env file with your configuration:
# Copy the example file
cp .env.example .env
# Edit with your preferred editor
nano .env # or vim, code, etc.

Important: Update at least the APP_NEO4J_PASSWORD for security:
APP_NEO4J_USER=neo4j
APP_NEO4J_PASSWORD=your_secure_password_here # ⚠️ Change this!
CONTAINER_SIDE_REPOSITORY_PATH=/app/bitcoin
USER_SUPPLIED_REPO_PATH=./data/user_supplied_repo

The .env file is automatically ignored by git to protect your credentials.
From inside core-explorer-kit (here), run:
docker compose up

After running docker compose up, Core Explorer starts its services in a specific order. Here's what happens and what you need to do next:
- Neo4j Database starts first
  - Initializes the graph database
  - Creates the `data/neo4j/` directory if it doesn't exist
  - Waits for the health check to pass (checks the HTTP interface on port 7474)
  - Access: Neo4j browser UI available at `http://localhost:7474` (log in with the `APP_NEO4J_USER` / `APP_NEO4J_PASSWORD` values from your `.env`)
- Backend Service starts after Neo4j is healthy
  - Flask application starts on port 5000
  - Connects to the Neo4j database
  - Note: The backend does NOT automatically process git data on startup
  - Access: API available at `http://localhost:5000/api/` or via nginx at `http://localhost:8080/api/`
- Nginx Reverse Proxy starts last
  - Routes API requests to the backend
  - Serves static frontend files
  - Access: Main entry point at `http://localhost:8080/`
Important: Core Explorer does not automatically import git data when it starts. You must manually trigger the import process.
Step 1: Verify Services Are Running
Check that all services are up:
docker compose ps

You should see all three services (neo4j, backend, nginx) with status "Up".
Step 2: Trigger Git Data Processing
Navigate to the processing endpoint in your browser or use curl:
# Via nginx (recommended)
curl http://localhost:8080/api/initiate_data_ingest/
# Or directly to backend
curl http://localhost:5000/api/initiate_data_ingest/

Or open in your browser:
http://localhost:8080/api/initiate_data_ingest/
Note: The processing runs asynchronously in a background thread and returns immediately with a Run ID. You can monitor progress using the status endpoint.
Step 3: What Happens During Ingestion
When you trigger the processing endpoint:
- Background Execution: The ingestion starts in a separate thread, returning an immediate Run ID.
- Schema Setup: The system creates all required Neo4j constraints and indexes, including uniqueness constraints for commits, identities, paths, refs, PGP keys, ingest runs, and file changes.
- Ingest Run Creation: The system creates an `IngestRun` node with a `STARTED` status to track this import session.
- Commit Processing (Backbone):
  - Reads commits from the git repository (incrementally processes only new commits if the database already contains data).
  - For each commit, creates/updates:
    - Identity nodes for authors and committers (with `source`, `name`, and `email` properties).
    - Commit nodes with `commit_hash`, `message`, `summary`, `authoredAt`, `committedAt`, and `isMerge` properties.
    - Relationships: `AUTHORED` and `COMMITTED` edges (with timestamp properties), and `HAS_PARENT` edges (with `idx` property for parent order).
  - Processes commits in batches for efficiency.
  - Marks status as `COMMITS_COMPLETE` upon successful backbone sync.
- Stage Gate Verification:
  - The system verifies the integrity of the commit backbone before proceeding.
  - If verification fails (e.g., an interrupted run), advanced analysis is skipped to protect data integrity.
- Advanced Enrichment (status transitions to `ENRICHING`):
  - Refs and Tags: Creates `Ref` and `TagObject` nodes, and `RefState` snapshots linked to the `IngestRun`.
  - File Changes: Tracks additions/deletions/renames for specified paths (defaults to sensitive paths like `src/policy`, `src/consensus`), creating `FileChange` and `Path` nodes with `HAS_CHANGE` and `OF_PATH` relationships.
  - PGP Signatures: Extracts GPG signatures from commits and tags, creating `PGPKey` nodes and `HAS_SIGNATURE` relationships with validation status.
  - Merge Analysis: Computes `MERGED_INCLUDES` relationships to identify which commits were introduced by each merge commit.
- Completion: The `IngestRun` status is updated to `COMPLETED`.
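As a sketch of the merge-analysis idea (not necessarily how the backend implements it), `MERGED_INCLUDES` edges can be derived in Cypher as "commits reachable from the merge's second parent but not from its first", using the `idx` property on `HAS_PARENT`:

```python
# Hedged sketch: derive MERGED_INCLUDES for merge commits. The backend's merge
# analysis may be batched or computed differently (e.g., via git itself).
from neo4j import GraphDatabase

MERGED_INCLUDES_CYPHER = """
MATCH (m:Commit {isMerge: true})-[:HAS_PARENT {idx: 0}]->(p1:Commit),
      (m)-[:HAS_PARENT {idx: 1}]->(p2:Commit)
MATCH (p2)-[:HAS_PARENT*0..]->(c:Commit)
WHERE NOT EXISTS { MATCH (p1)-[:HAS_PARENT*0..]->(c) }
MERGE (m)-[:MERGED_INCLUDES]->(c)
"""

with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "change_me")) as driver:
    driver.execute_query(MERGED_INCLUDES_CYPHER)
```

Note that an unbounded ancestry traversal like this is expensive on a repository the size of Bitcoin Core; treat it as an explanatory sketch rather than a production query.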
Step 4: Monitor Progress
You can monitor the import progress in two ways:
- Status Endpoint:
  - Visit http://localhost:8080/api/ingest_status/<run_id>/
  - Shows real-time status (e.g., `STARTED`, `COMMITS_COMPLETE`, `COMPLETED`) and counters for commits, signatures, and merges.
- Backend Logs:
  - Run `docker compose logs -f backend`
  - Look for progress messages such as "Updated IngestRun <id> status to COMMITS_COMPLETE"
Step 5: Verify Import Success
Once the status endpoint shows COMPLETED, verify the data:
- Check Neo4j directly - run queries to check node counts:

  MATCH (i:Identity) RETURN count(i) AS identities;
  MATCH (c:Commit) RETURN count(c) AS commits;

- Query via GraphQL (http://localhost:8080/api/graphql):

  query { identities { name email source } }

- Check Neo4j directly - list the most prolific authors:

  MATCH (i:Identity)-[:AUTHORED]->(c:Commit)
  RETURN i.name, count(c) AS commits
  ORDER BY commits DESC LIMIT 10;
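To run the GraphQL check from a script, the following sketch assumes the endpoint accepts a standard `{"query": ...}` POST body; if the backend expects a different request shape, adjust accordingly:

```python
# Assumption: standard GraphQL-over-HTTP POST with a JSON {"query": ...} payload.
import requests

query = "{ identities { name email source } }"
resp = requests.post("http://localhost:8080/api/graphql", json={"query": query})
print(resp.json())
```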
The latest version introduces several powerful analysis features:
- PGP Signature Extraction
  - Automatically extracts PGP fingerprints from signed commits and tags.
  - Enables auditing of signed vs. unsigned code in sensitive directories.
- Granular File Change Tracking
  - Tracks additions, deletions, and renames at the file level.
  - Automatically flags changes to `SENSITIVE_PATHS` defined in `file_change_processor.py`.
- Merge Ancestry Analysis
  - Computes exactly which commits are brought in by a merge (reachable from the 2nd parent but not the 1st).
  - Enables "Self-Merge Detection" to identify when developers merge their own work without sufficient peer review (see the sketch below).
- Incremental Ingestion
  - Only processes new commits added since the last run.
  - Efficiently snapshots branch movements over time.
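As a sketch of how self-merge detection could be expressed against this schema (the production metric may be computed differently), the following finds merge commits whose committer also authored commits introduced by that merge:

```python
# Hedged sketch of "self-merge detection". Labels, relationship types, and
# properties come from the schema section; the real metric may differ.
from neo4j import GraphDatabase

SELF_MERGE_CYPHER = """
MATCH (i:Identity)-[:COMMITTED]->(m:Commit {isMerge: true})-[:MERGED_INCLUDES]->(c:Commit),
      (i)-[:AUTHORED]->(c)
RETURN i.name AS identity, count(DISTINCT m) AS self_merges
ORDER BY self_merges DESC
LIMIT 25
"""

with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "change_me")) as driver:
    records, _, _ = driver.execute_query(SELF_MERGE_CYPHER)
    for record in records:
        print(record["identity"], record["self_merges"])
```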
- Small repository (< 1,000 commits): 1-5 minutes
- Medium repository (1,000-10,000 commits): 5-30 minutes
- Large repository (10,000+ commits, like Bitcoin Core): 30 minutes - 2+ hours
Note: Processing time depends on:
- Number of commits in the repository
- Number of unique authors/committers
- System resources (CPU, memory, disk I/O)
Issue: "fatal: not a git repository"
- Ensure `data/user_supplied_repo/` contains a valid git repository
- Check that the repository was cloned before starting Docker
- Verify the Docker volume mount: `docker exec backend ls -la /app/bitcoin`
Issue: Processing endpoint times out (504 error)
- This is normal for large repositories - the import is still running
- Check backend logs: `docker compose logs -f backend`
- The process continues even if the HTTP request times out
- Wait for the "Processed X commits" message in the logs
Issue: Neo4j connection errors
- Verify Neo4j is healthy: `docker compose ps`
- Check Neo4j logs: `docker compose logs neo4j`
- Ensure the Neo4j health check passed before the backend started
Issue: No data appears in GraphQL queries
- Verify the import completed successfully (check backend logs)
- Check Neo4j browser to see if nodes exist
- Ensure you're querying the correct GraphQL endpoint
Once the initial import is complete:
- Explore the GraphQL API: Visit `http://localhost:8080/api/graphql` for the GraphiQL interface
- Query repository data: Use GraphQL queries to explore identities, commits, and relationships
- Access the frontend: Visit `http://localhost:8080/` to see the web interface
- Re-run processing: Subsequent calls to `/api/initiate_data_ingest/` will process additional file paths (if configured)
The system is now ready to analyze your repository's development history and peer review patterns!
End-to-end tests now cover the git → Neo4j pipeline using disposable resources. They rely on Docker to launch a temporary Neo4j instance, so ensure Docker Desktop is running before executing them.
- Install backend dependencies (production + dev): `cd backend && pipenv install --dev`
- Run the pytest suite (spins up a short-lived Neo4j container automatically): `pipenv run pytest`
The fixture fabricates a small Git repository with multiple authors and merge commits, keeping the suite fast while protecting your real data directories.
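For orientation, a rough sketch of that fixture pattern; the real fixture in the backend test suite may use different names and produce a richer history:

```python
# Rough sketch only: a throwaway repo with two authors and one merge commit,
# built with GitPython in a temp directory. The real backend fixture may differ.
import pytest
from git import Actor, Repo

@pytest.fixture
def tiny_repo(tmp_path):
    repo = Repo.init(tmp_path)
    alice = Actor("Alice", "alice@example.com")
    bob = Actor("Bob", "bob@example.com")

    (tmp_path / "README.md").write_text("hello\n")
    repo.index.add([str(tmp_path / "README.md")])
    base = repo.index.commit("initial commit", author=alice, committer=alice)
    default = repo.active_branch

    feature = repo.create_head("feature", base)
    repo.head.reference = feature
    repo.head.reset(index=True, working_tree=True)
    (tmp_path / "feature.txt").write_text("new feature\n")
    repo.index.add([str(tmp_path / "feature.txt")])
    tip = repo.index.commit("add feature", author=bob, committer=bob)

    # Back on the default branch, record a merge commit; only the commit-graph
    # shape matters for these tests, so no real tree merge is performed.
    repo.head.reference = default
    repo.head.reset(index=True, working_tree=True)
    repo.index.commit("merge feature", parent_commits=(base, tip),
                      author=alice, committer=alice)
    return repo
```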