GitlabHarvester is a fast, scalable tool for searching keywords across an entire GitLab instance using the API — without cloning repositories. Built for security audits, secret discovery, compliance checks, and large-scale code intelligence across thousands of projects.
Global term search across a full GitLab instance — especially valuable for GitLab CE environments.
Search a keyword:

    gitlab-harvester -u https://gitlab.example.com -t $TOKEN --search password

Search from file:

    gitlab-harvester -u https://gitlab.example.com -t $TOKEN --terms-file words.txt

Build project index only:

    gitlab-harvester -u https://gitlab.example.com -t $TOKEN -m dump-index

Deduplicate results:

    gitlab-harvester -m dedup --input-file session.jsonl --output-file clean.jsonl

Convert JSONL → JSON:

    gitlab-harvester -m convert --input-file session.jsonl --output-file result.json

GitLab Community Edition does not provide full instance-wide code search like EE. GitlabHarvester fills this gap by:
- building a lightweight instance project index
- scanning repositories via API
- streaming results in JSONL
- supporting resumable sessions
- keeping memory usage constant
Designed to operate efficiently on environments with 10k–100k repositories.
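
To illustrate what "scanning repositories via API" can look like, here is a minimal sketch that builds a project-scoped blob search request against the GitLab REST API (`GET /api/v4/projects/:id/search` with `scope=blobs`). Whether GitlabHarvester uses this exact endpoint is an assumption; the function name `build_blob_search_request` is hypothetical and the request is only constructed, not sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_blob_search_request(base_url, token, project_id, term, ref=None):
    """Build (but do not send) a project-scoped blob search request.

    Assumption: this mirrors the public GitLab search API, not necessarily
    the calls GitlabHarvester makes internally.
    """
    params = {"scope": "blobs", "search": term}
    if ref is not None:
        params["ref"] = ref  # search a specific branch instead of the default
    url = (
        f"{base_url.rstrip('/')}/api/v4/projects/"
        f"{quote(str(project_id), safe='')}/search?{urlencode(params)}"
    )
    # GitLab accepts personal access tokens in the PRIVATE-TOKEN header.
    return Request(url, headers={"PRIVATE-TOKEN": token})


req = build_blob_search_request(
    "https://gitlab.example.com", "TOKEN", 42, "password", ref="main"
)
print(req.full_url)
# https://gitlab.example.com/api/v4/projects/42/search?scope=blobs&search=password&ref=main
```

Because only file matches travel over the network, no working copy of the repository is ever needed on disk.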
| Problem | Solution |
|---|---|
| No global search | Instance-wide scan |
| Cloning thousands of repos | API-only scanning |
| Large instances | Streaming architecture |
| Repeated audits | Cached project index |
- Instance-wide keyword search
- No repository cloning
- JSONL project index
- Branch scanning strategies
- Smart fork analysis
- Resume interrupted scans
- Streaming output
- Low memory footprint
- Automation-friendly
- Built-in post-processing tools
Install with pipx:

    pipx install gitlab-harvester

Run:

    gitlab-harvester --help

Install with pip:

    pip install gitlab-harvester

Install from source:

    git clone https://github.com/Cur1iosity/GitlabHarvester.git
    cd GitlabHarvester
    pip install .

Editable mode:

    pip install -e .

Install with pipx from GitHub:

    pipx install git+https://github.com/Cur1iosity/GitlabHarvester.git

Requirements:

- Python 3.10+
- GitLab token with read_api permission
Two independent controls:

- --index-branches: stored branches
- --scan-branches: scanned branches

Example:

    gitlab-harvester -u ... -t ... --scan-branches 10

Store all + scan all:

    gitlab-harvester -u ... -t ... --index-branches all --scan-branches all

Shortcut:

    --branches N

Fork handling:

    --forks skip|include|branch-diff|all-branches

Recommended → branch-diff
| Mode | Behavior |
|---|---|
| skip | ignore forks |
| include | treat as normal repos |
| branch-diff | scan default + unique branches |
| all-branches | full exhaustive scan |
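
The branch-diff idea can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual logic: a fork branch is treated as "unique" when its name is missing upstream or its head commit differs, and the function name `plan_fork_branches` and the name-to-SHA mappings are hypothetical.

```python
def plan_fork_branches(fork_branches, upstream_branches, default_branch):
    """Return the fork branches worth scanning under a branch-diff style policy.

    Both branch arguments map branch name -> head commit SHA (assumed shape).
    """
    # Always scan the fork's default branch first, if it exists.
    planned = [default_branch] if default_branch in fork_branches else []
    for name, sha in sorted(fork_branches.items()):
        if name == default_branch:
            continue
        # Skip branches whose head commit is identical in the upstream project.
        if upstream_branches.get(name) == sha:
            continue
        planned.append(name)
    return planned


upstream = {"main": "a1", "dev": "b2"}
fork = {"main": "a1", "dev": "b2", "feature/x": "c3", "hotfix": "d4"}
print(plan_fork_branches(fork, upstream, "main"))
# ['main', 'feature/x', 'hotfix']
```

Here `dev` is skipped because it points at the same commit as upstream, which is where the savings over an all-branches scan would come from.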
Create session:

    gitlab-harvester -u ... -t ... --terms-file words.txt --session audit

Resume:

    gitlab-harvester -u ... -t ... --session-file audit.jsonl --resume

Two file types:
| File | Purpose |
|---|---|
| Project index | cached project metadata |
| Session file | hits + checkpoints |
Format → JSONL (streaming-friendly)
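
One way a JSONL session file with hits and checkpoints could support resuming is sketched below. The record layout (`type`, `project_id`, `path`) is a hypothetical schema chosen for illustration; the real session format may differ.

```python
import io
import json


def completed_project_ids(lines):
    """Collect project IDs that already have a checkpoint record.

    Assumes a hypothetical schema where each line is a JSON object with a
    "type" field of either "hit" or "checkpoint".
    """
    done = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # A partially written last line is expected after an interruption.
            continue
        if isinstance(record, dict) and record.get("type") == "checkpoint":
            done.add(record["project_id"])
    return done


# Simulated session file whose final line was cut off mid-write.
session = io.StringIO(
    '{"type": "hit", "project_id": 1, "path": "config.yml"}\n'
    '{"type": "checkpoint", "project_id": 1}\n'
    '{"type": "hit", "project_id": 2, "pa'
)
done = completed_project_ids(session)
remaining = [pid for pid in (1, 2, 3) if pid not in done]
print(remaining)  # [2, 3]
```

Because each record is a self-contained line, a truncated tail can be ignored without losing earlier results, which is what makes an append-only JSONL file a reasonable fit for checkpoints.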
GitlabHarvester includes built-in post-processing utilities.

Deduplicate results:

    gitlab-harvester -m dedup \
        --input-file session.jsonl \
        --output-file clean.jsonl

Options:

- --sqlite-path file.sqlite
- --hash-algo blake2b|sha1|sha256
- --no-normalize-hits
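
Hash-based deduplication of JSONL records can be sketched as below. This is not the tool's implementation: the function name `dedup_hits` is hypothetical, and "normalization" is assumed here to mean re-serializing each record with sorted keys so that key order and whitespace do not produce distinct digests.

```python
import hashlib
import json


def dedup_hits(lines, algo="blake2b", normalize=True):
    """Yield each distinct JSONL record once, keyed by a content digest."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if normalize:
            # Canonical form: key order and whitespace no longer affect the digest.
            payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
        else:
            payload = line
        digest = hashlib.new(algo, payload.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield record


lines = ['{"a": 1, "b": 2}', '{"b": 2, "a": 1}', '{"a": 3}']
print(len(list(dedup_hits(lines))))                   # 2
print(len(list(dedup_hits(lines, normalize=False))))  # 3
```

This sketch keeps digests in an in-memory set, so memory grows with the number of distinct hits. Keeping the digests in an on-disk store instead, which is presumably what an option like --sqlite-path is for, would avoid that growth on very large sessions.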
Convert JSONL → JSON:

    gitlab-harvester -m convert \
        --input-file session.jsonl \
        --output-file result.json

Pretty print:

    jq . result.json > formatted.json

Pipeline:

    GitLab API
    ↓
    Indexer
    ↓
    Branch planner
    ↓
    Matcher
    ↓
    JSONL stream
Constant memory usage regardless of instance size.
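
A staged pipeline like this can keep memory flat by passing one item at a time between stages. The sketch below uses Python generators to show the pattern; all function names, the project dictionaries, and the `fetch` callback are hypothetical stand-ins, and real API calls are replaced by an in-memory fake.

```python
import io
import json


def index_projects(projects):
    """Indexer stage: yield one project at a time."""
    for project in projects:
        yield project


def plan_branches(projects, limit):
    """Branch planner stage: yield (project, branch) pairs, up to `limit` per project."""
    for project in projects:
        for branch in project["branches"][:limit]:
            yield project, branch


def match(pairs, term, fetch):
    """Matcher stage: yield one hit per matching line."""
    for project, branch in pairs:
        for path, text in fetch(project, branch):
            for lineno, line in enumerate(text.splitlines(), start=1):
                if term in line:
                    yield {"project": project["name"], "branch": branch,
                           "path": path, "line": lineno}


def stream_jsonl(hits, out):
    """Output stage: write each hit as one JSON line and return the hit count."""
    count = 0
    for hit in hits:
        out.write(json.dumps(hit) + "\n")
        count += 1
    return count


# In-memory stand-in for file contents that would otherwise come from the API.
files = {("app", "main"): [("settings.py", "DEBUG = True\npassword = 'x'")]}


def fake_fetch(project, branch):
    return files.get((project["name"], branch), [])


projects = [{"name": "app", "branches": ["main", "dev"]}]
out = io.StringIO()
pairs = plan_branches(index_projects(projects), limit=1)
n = stream_jsonl(match(pairs, "password", fake_fetch), out)
print(n)  # 1
print(out.getvalue(), end="")
# {"project": "app", "branch": "main", "path": "settings.py", "line": 2}
```

No stage builds a full list of projects, branches, or hits, so peak memory depends on the largest single item rather than on the size of the instance.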
- secret discovery
- credential leak detection
- internal audits
- red team / pentest reconnaissance
- DevSecOps validation
- large-scale code search
Use only on GitLab instances where you are authorized to perform scanning.
Pull requests and ideas welcome.
License: MIT