Office Cloc and Count — document metrics, structure extraction, content inspection, and code exploration for real repositories.
Experimental: All features in OCC are currently experimental. This project cannot be considered stable software yet. APIs, output formats, and command interfaces may change between minor versions.
OCC started as a way to make office documents visible in the same workflows that already work well for code metrics tools like scc and cloc. It has since grown into a multi-purpose CLI that can:
- scan office documents for word/page/sheet/slide metrics
- extract document heading structure for navigation and RAG-style use cases
- inspect documents (`occ doc inspect`), spreadsheets (`occ sheet inspect`), and presentations (`occ slide inspect`) for metadata, risk flags, and content previews
- extract structured table content from documents (`occ table inspect`)
- analyze workspaces for combined code, document, and structure metrics (`occ workspace analyze`) and cross-document references (`occ workspace documents`)
- summarize code metrics through `scc`
- explore JavaScript, TypeScript, and Python repositories with symbol search, call analysis, dependency inspection, and inheritance queries (`occ code`)
- Office document metrics: words, pages, paragraphs, slides, sheets, rows, cells
- Seven formats supported: DOCX, XLSX, PPTX, PDF, ODT, ODS, ODP
- Document structure extraction: `--structure` parses heading hierarchy into a navigable tree with dotted section codes (1, 1.1, 1.2, ...)
- Document inspection via `occ doc inspect`: metadata, risk flags, content stats, heading structure, and content preview for DOCX and ODT
- Spreadsheet inspection via `occ sheet inspect`: workbook properties, hidden sheets, names, formulas, links, comments, schema preview, and token estimates for XLSX
- Presentation inspection via `occ slide inspect`: metadata, risk flags, per-slide inventory, and content preview for PPTX and ODP
- Table extraction via `occ table inspect`: structured table content from DOCX, XLSX, PPTX, ODT, and ODP with auto-detected headers, sample row limits, and merged cell support
- Code metrics via scc: auto-detects code files and integrates scc output
- Code exploration via `occ code`: JS/TS and Python-first symbol lookup, content search, callers/callees, dependency categories, inheritance, module coupling, and ambiguity-aware chains
- Workspace analysis via `occ workspace`: combined code, document, and structure analysis with versioned JSON contracts, per-document summaries, and cross-reference detection
- Multiple output modes: grouped by type, per-file breakdown, or JSON
- CI-friendly: ASCII-only, no-color mode for pipelines
- Flexible filtering: include/exclude extensions, exclude directories, .gitignore-aware
- Progress bar: with ETA for large scans
- Zero config: auto-downloads scc binary on install, works out of the box
Global install:
npm i -g @cesarandreslopez/occ
occ
No-install usage:
npx @cesarandreslopez/occ docs/ reports/
From source:
git clone https://github.com/cesarandreslopez/occ.git && cd occ
npm install
npm run build
npm test
npm start
# Scan current directory
occ
# Scan specific directories
occ docs/ reports/
# Per-file breakdown
occ --by-file docs/
# JSON output
occ --format json docs/
# Extract document structure (heading hierarchy)
occ --structure docs/
# Structure as JSON
occ --structure --format json docs/
# Inspect a document for metadata, risk flags, and content preview
occ doc inspect report.docx
occ doc inspect report.docx --format json
# Inspect an XLSX workbook before reading its contents deeply
occ sheet inspect finance.xlsx
occ sheet inspect finance.xlsx --format json --sample-rows 3 --max-columns 12
# Inspect a presentation for slide inventory and content preview
occ slide inspect deck.pptx
occ slide inspect deck.pptx --format json --slide 3
# Extract structured table data from documents
occ table inspect report.docx --format json
occ table inspect finance.xlsx --table 1 --sample-rows 10
# Explore JS/TS and Python code
occ code find name UserService --path .
occ code analyze callers createUser --path .
occ code analyze deps src/deps --path .
occ code analyze chain ambiguousCaller duplicate --path .
# Module coupling metrics
occ code analyze coupling src/code --path .
# Dump full codebase index as JSON
occ code index --path . --format json
# Workspace-level analysis (code + documents + structures)
occ workspace analyze --format json
# Document summaries with cross-references
occ workspace documents --format json
# Only specific formats
occ --include-ext pdf,docx docs/
# Skip code analysis
occ --no-code docs/
# CI-friendly (ASCII, no color)
occ --ci docs/
-- Documents ---------------------------------------------------------------
Format Files Words Pages Details Size
----------------------------------------------------------------------------
Word 12 34,210 137 1,203 paras 1.2 MB
PDF 8 22,540 64 4.5 MB
Excel 3 12 sheets 890 KB
----------------------------------------------------------------------------
Total 23 56,750 201 1,203 paras 6.5 MB
-- Code (via scc) ----------------------------------------------------------
Language Files Lines Blanks Comments Code
----------------------------------------------------------------------------
JavaScript 15 2340 180 320 1840
Python 8 1200 90 150 960
----------------------------------------------------------------------------
Total 23 3540 270 470 2800
Scanned 23 documents (56,750 words, 201 pages) in 120ms
-- Structure: report.docx --------------------------------------------------
1 Executive Summary
1.1 Background ......................................... p.1
1.2 Key Findings ....................................... p.1-2
2 Methodology
2.1 Data Collection .................................... p.3
2.2 Analysis Framework ................................. p.4
2.2.1 Quantitative Methods ........................... p.4
2.2.2 Qualitative Methods ............................ p.5
3 Results ................................................ p.6-8
4 Conclusions ............................................ p.9
4 sections, 10 nodes, max depth 3
| Format | Extension | Metrics | Structure |
|---|---|---|---|
| Word | .docx | words, pages*, paragraphs | Yes |
| PDF | .pdf | words, pages | Yes (with page mapping) |
| Excel | .xlsx | sheets, rows, cells | No |
| PowerPoint | .pptx | words, slides | Yes (slide headers) |
| ODT | .odt | words, pages*, paragraphs | Yes (best-effort) |
| ODS | .ods | sheets, rows, cells | No |
| ODP | .odp | words, slides | Yes (slide headers) |
* Pages for Word/ODT are estimated at 250 words/page.
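As a minimal sketch of the estimate in the footnote above, assuming partial pages are rounded up (the rounding rule is an assumption, although it is consistent with the sample output, where 34,210 words are reported as 137 pages). `estimatePages` and `WORDS_PER_PAGE` are illustrative names, not part of the OCC API.

```typescript
// Illustrative only: estimatePages is a hypothetical helper, not an OCC export.
const WORDS_PER_PAGE = 250; // figure taken from the footnote above

function estimatePages(wordCount: number): number {
  if (wordCount <= 0) return 0;
  // Assumption: a partially filled page counts as a whole page.
  return Math.ceil(wordCount / WORDS_PER_PAGE);
}

// Word row of the sample output: 34,210 words.
const samplePages = estimatePages(34210);
```

Because the value is derived from word count alone, it ignores layout, images, and page breaks, so it should be read as a rough size indicator rather than a true page count.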
| Flag | Description | Default |
|---|---|---|
| `--by-file` / `-f` | Row per file | grouped by type |
| `--format <type>` | `tabular` or `json` | `tabular` |
| `--structure` | Extract and display document heading hierarchy | off |
| `--include-ext <exts>` | Comma-separated extensions to include | all supported |
| `--exclude-ext <exts>` | Comma-separated extensions to skip | none |
| `--exclude-dir <dirs>` | Directories to skip | `node_modules,.git` |
| `--no-gitignore` | Do not respect .gitignore files | .gitignore respected |
| `--sort <col>` | Sort by: files, name, words, size | files |
| `--output <file>` / `-o` | Write to file | stdout |
| `--ci` | ASCII-only, no color | off |
| `--large-file-limit <mb>` | Skip files over this size (MB) | 50 |
| `--no-code` | Skip scc code analysis | off |
| `--show-confidence` | Show confidence levels for each metric | off |
occ code adds on-demand code exploration without changing the existing document-scan workflow. It builds an in-memory repository graph for each command and does not require a database, daemon, or background indexer.
The first-class support path is JavaScript, TypeScript, and Python. Other languages may be discovered and partially parsed, but the current resolver, fixtures, and output contracts are intentionally optimized around JS/TS and Python behavior.
# Exact symbol lookup
occ code find name Greeter --path test/fixtures/code-explore
# Substring search
occ code find pattern service --path .
# Full-text content search
occ code find content normalize_name --path .
# Outgoing and incoming call analysis
occ code analyze calls bootstrap --path test/fixtures/code-explore
occ code analyze callers createUser --path test/fixtures/code-explore
# Dependency and inheritance inspection
occ code analyze deps src/service --path test/fixtures/code-explore
occ code analyze tree UserService --path test/fixtures/code-explore
# Module coupling analysis
occ code analyze coupling src/code --path test/fixtures/code-explore
# Ambiguity-aware chain analysis
occ code analyze chain ambiguousCaller duplicate --path test/fixtures/code-explore
Highlights of the current code exploration behavior:
- Full index export via `occ code index`: dump the complete graph (files, symbols, edges, language capabilities) as JSON or a summary line
- Exact, pattern, type, and content search over the repository graph
- Call analysis with explicit `resolved`, `ambiguous`, and `unresolved` states
- Receiver-aware method resolution for `this`, `super`, `self`, and `cls`
- Dependency analysis grouped into local, external, and unresolved imports
- Module coupling analysis with afferent/efferent coupling, instability, and key classes
- Chain analysis that reports when a path is blocked by ambiguity instead of silently returning nothing
- Shared CLI ergonomics with `--path`, `--format`, `--output`, `--exclude-dir`, and `.gitignore` support
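The coupling terms in the list above can be illustrated with a small sketch. It uses the conventional definitions: afferent coupling (Ca) counts modules that depend on a module, efferent coupling (Ce) counts modules it depends on, and instability is Ce / (Ca + Ce). Whether OCC computes exactly these values is an assumption, and `ImportGraph`, `CouplingMetrics`, and `computeCoupling` are hypothetical names, not OCC exports.

```typescript
// Hypothetical input shape: module name -> modules it imports.
type ImportGraph = Record<string, string[]>;

interface CouplingMetrics {
  afferent: number;    // Ca: number of modules that depend on this one
  efferent: number;    // Ce: number of modules this one depends on
  instability: number; // Ce / (Ca + Ce); 0 for an isolated module
}

function computeCoupling(graph: ImportGraph): Map<string, CouplingMetrics> {
  const afferent = new Map<string, Set<string>>();
  const efferent = new Map<string, Set<string>>();
  const ensure = (m: string): void => {
    if (!afferent.has(m)) afferent.set(m, new Set<string>());
    if (!efferent.has(m)) efferent.set(m, new Set<string>());
  };

  for (const from of Object.keys(graph)) {
    ensure(from);
    for (const to of graph[from] ?? []) {
      ensure(to);
      if (to === from) continue; // ignore self-imports
      efferent.get(from)?.add(to);
      afferent.get(to)?.add(from);
    }
  }

  const metrics = new Map<string, CouplingMetrics>();
  afferent.forEach((dependents, m) => {
    const ca = dependents.size;
    const ce = efferent.get(m)?.size ?? 0;
    const instability = ca + ce === 0 ? 0 : ce / (ca + ce);
    metrics.set(m, { afferent: ca, efferent: ce, instability });
  });
  return metrics;
}

// Example: a CLI module importing two others, one of which imports the third.
const couplingExample = computeCoupling({
  "src/cli": ["src/code", "src/utils"],
  "src/code": ["src/utils"],
  "src/utils": [],
});
```

An instability near 0 marks a module many others rely on and that is costly to change; a value near 1 marks a module that mostly consumes others.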
All occ code commands support --format tabular|json. Most symbol-targeted commands also support --file for disambiguation, and JSON output includes repository metadata, query metadata, results, repository stats, and per-language capability flags.
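To make the three call states and the role of a file hint concrete, here is a minimal sketch that resolves a callee name against a flat symbol list. The rule used (zero matches is unresolved, one is resolved, several is ambiguous) is an assumption about the general idea, not OCC's resolver, and `SymbolRecord`, `CallResolution`, and `resolveCall` are hypothetical names.

```typescript
// Hypothetical symbol record; OCC's real CodeNode type may differ.
interface SymbolRecord {
  symbolName: string;
  file: string;
}

type CallResolution =
  | { state: "resolved"; target: SymbolRecord }
  | { state: "ambiguous"; candidates: SymbolRecord[] }
  | { state: "unresolved" };

function resolveCall(
  symbols: SymbolRecord[],
  callee: string,
  fileHint?: string,
): CallResolution {
  let candidates = symbols.filter((s) => s.symbolName === callee);
  // A file hint (analogous to --file) narrows same-named candidates.
  if (fileHint !== undefined) {
    candidates = candidates.filter((s) => s.file === fileHint);
  }
  const first = candidates[0];
  if (first === undefined) return { state: "unresolved" };
  if (candidates.length === 1) return { state: "resolved", target: first };
  return { state: "ambiguous", candidates };
}

const demoSymbols: SymbolRecord[] = [
  { symbolName: "duplicate", file: "src/a.ts" },
  { symbolName: "duplicate", file: "src/b.ts" },
  { symbolName: "createUser", file: "src/service.ts" },
];
```

Reporting the ambiguous case explicitly, rather than picking the first match, is what lets a caller decide whether to retry with a file hint.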
The code exploration module is available as a library via subpath exports:
import { buildCodebaseIndex } from '@cesarandreslopez/occ/code/build';
import { discoverCodeFiles } from '@cesarandreslopez/occ/code/discover';
import { findByName, analyzeCalls } from '@cesarandreslopez/occ/code/query';
import type { CodebaseIndex, CodeNode } from '@cesarandreslopez/occ/code/types';
const index = await buildCodebaseIndex({ repoRoot: './my-repo' });
const results = findByName(index, 'UserService');
For a stateful session that caches the index across queries:
import { createCodeQuerySession } from '@cesarandreslopez/occ/code/session';
const session = await createCodeQuerySession({ repoRoot: './my-repo' });
session.findByName('UserService');
session.analyzeCalls('bootstrap');
session.chunk({ maxChunkWords: 200 });
await session.refresh(); // rebuild index when files change
For workspace-level analysis:
import { analyzeWorkspace } from '@cesarandreslopez/occ/workspace/analyze';
import { inspectWorkspaceDocumentSet } from '@cesarandreslopez/occ/workspace/documents';
const analysis = await analyzeWorkspace('./my-project', { includeCode: true });
const docs = await inspectWorkspaceDocumentSet('./my-project', { maxFiles: 20 });
Available subpath exports:
| Import path | Description |
|---|---|
| `@cesarandreslopez/occ/code/build` | `buildCodebaseIndex`: graph construction |
| `@cesarandreslopez/occ/code/types` | TypeScript types (`CodebaseIndex`, `CodeNode`, `CodeEdge`, etc.) |
| `@cesarandreslopez/occ/code/query` | Query functions (`findByName`, `analyzeCalls`, `analyzeDeps`, etc.) |
| `@cesarandreslopez/occ/code/discover` | `discoverCodeFiles`: file discovery |
| `@cesarandreslopez/occ/code/chunk` | `chunkCodebase`, `chunkFromIndex`: semantic code chunking |
| `@cesarandreslopez/occ/code/session` | `createCodeQuerySession`: stateful code query session |
| `@cesarandreslopez/occ/code/cache` | Index caching utilities |
| `@cesarandreslopez/occ/doc/inspect` | `inspectDocument`: document metadata and content extraction |
| `@cesarandreslopez/occ/doc/types` | Document inspection types |
| `@cesarandreslopez/occ/doc/discover` | Document file discovery |
| `@cesarandreslopez/occ/doc/batch` | Batch document inspection |
| `@cesarandreslopez/occ/doc/entities` | Entity and keyword extraction |
| `@cesarandreslopez/occ/doc/references` | Cross-reference detection |
| `@cesarandreslopez/occ/workspace/analyze` | `analyzeWorkspace`: workspace-level analysis |
| `@cesarandreslopez/occ/workspace/documents` | `inspectWorkspaceDocumentSet`: document summaries and cross-references |
| `@cesarandreslopez/occ/workspace/types` | Workspace analysis types |
| `@cesarandreslopez/occ/markdown/convert` | `documentToMarkdown`: document-to-markdown conversion |
| `@cesarandreslopez/occ/structure/extract` | `extractFromMarkdown`: heading tree extraction |
| `@cesarandreslopez/occ/structure/types` | Structure types and helpers |
| `@cesarandreslopez/occ/sheet/inspect` | `inspectWorkbook`: XLSX workbook inspection |
| `@cesarandreslopez/occ/sheet/types` | Sheet inspection types |
| `@cesarandreslopez/occ/slide/inspect` | `inspectPresentation`: presentation inspection |
| `@cesarandreslopez/occ/table/inspect` | Table extraction from documents |
| `@cesarandreslopez/occ/types` | Shared types (`ConfidenceLevel`, `ParseResult`, `ParserOutput`, etc.) |
| `@cesarandreslopez/occ/tokens` | Token estimation utilities |
| `@cesarandreslopez/occ/progress-event` | Progress event types |
| `@cesarandreslopez/occ/stats` | Stats types (`StatsRow`, `AggregateResult`) and `aggregate()` |
TypeScript ships with OCC as a direct dependency, so the code exploration module works after a normal install. You only need a separate TypeScript setup if your own project uses tsc.
occ doc inspect extracts metadata, risk flags, content stats, heading structure, and a content preview from DOCX and ODT documents.
# Document overview with content preview
occ doc inspect report.docx
# Machine-readable payload
occ doc inspect report.docx --format json
# More paragraphs in the preview
occ doc inspect report.docx --sample-paragraphs 10
Current document inspection surfaces:
- Document properties — title, author, dates, keywords
- Risk flags — comments, tracked changes, hyperlinks, embedded objects, macros, tables, encryption
- Content stats — words, pages, paragraphs, characters, tables, images
- Heading structure — tree with section codes and depth
- Content preview — first N paragraphs with heading detection
- Token estimates — preview and full-document token estimates
occ sheet inspect is a lightweight XLSX preflight command aimed at both humans and agents. It helps answer "is this workbook worth reading in depth?" before spending tokens serializing cells or opening the file in Excel.
# Workbook-level summary + per-sheet schema/sample preview
occ sheet inspect finance.xlsx
# Machine-readable inspection payload
occ sheet inspect finance.xlsx --format json
# Narrow to one sheet and reduce preview width
occ sheet inspect finance.xlsx --sheet Revenue --sample-rows 3 --max-columns 8
Current XLSX inspection highlights:
- Workbook metadata — file size, workbook properties, custom properties, workbook-scoped names
- Sheet inventory — visible / hidden / very hidden sheets, used ranges, cell counts, formula/comment/link counts
- Schema preview — detected header row, inferred column types, coverage ratios, example values
- Lightweight sampling — small row previews designed for preflight rather than full extraction
- Token estimates — sample and full-sheet token estimates to guide downstream agent reads
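The schema preview terms above (inferred column types, coverage ratios, example values) can be illustrated with a small sketch. The inference rule here, where a column of mixed value types falls back to `string`, is an assumption for illustration and not OCC's actual logic; `CellValue`, `ColumnProfile`, `kindOf`, and `profileColumn` are hypothetical names.

```typescript
// Hypothetical helper names; OCC's real inference rules may differ.
type CellValue = string | number | boolean | Date | null;
type ColumnKind = "number" | "boolean" | "date" | "string" | "empty";

interface ColumnProfile {
  kind: ColumnKind;   // inferred type of the column
  coverage: number;   // share of rows with a non-empty value, 0..1
  examples: string[]; // a few sample values for the preview
}

function kindOf(value: CellValue): ColumnKind {
  if (value === null || value === "") return "empty";
  if (value instanceof Date) return "date";
  if (typeof value === "number") return "number";
  if (typeof value === "boolean") return "boolean";
  return "string";
}

function profileColumn(values: CellValue[], maxExamples = 3): ColumnProfile {
  const present = values.filter((v) => kindOf(v) !== "empty");
  const coverage = values.length === 0 ? 0 : present.length / values.length;
  const kinds = present.map((v) => kindOf(v));
  const firstKind = kinds[0];
  let kind: ColumnKind = "empty";
  if (firstKind !== undefined) {
    // Assumption: a column with mixed value types falls back to "string".
    kind = kinds.every((k) => k === firstKind) ? firstKind : "string";
  }
  const examples = present
    .slice(0, maxExamples)
    .map((v) => (v instanceof Date ? v.toISOString() : String(v)));
  return { kind, coverage, examples };
}

// A revenue column with one blank cell out of four.
const revenueProfile = profileColumn([1200, 980, null, 1430]);
```

A low coverage ratio or a `string` fallback on a column that should be numeric is the kind of signal that suggests a sheet needs a closer look before full extraction.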
occ slide inspect provides presentation metadata, risk flags, per-slide inventory, and content previews for PPTX and ODP files.
# Presentation overview with slide preview
occ slide inspect deck.pptx
# Machine-readable payload
occ slide inspect deck.pptx --format json
# Inspect a specific slide
occ slide inspect deck.pptx --slide 3
Current presentation inspection surfaces:
- Presentation properties — title, author, dates
- Risk flags — comments, speaker notes, hyperlinks, embedded media, animations, macros, charts, tables
- Slide inventory — per-slide title, word count, notes, images, tables, charts
- Content preview — text preview for sample slides
- Token estimates — preview and full-presentation token estimates
occ table inspect extracts structured table content from DOCX, XLSX, PPTX, ODT, and ODP documents. For AI agents, this is the primary way to read financial summaries, comparison matrices, and data tables without parsing raw XML.
# Extract all tables as JSON
occ table inspect report.docx --format json
# Tabular preview of table content
occ table inspect finance.xlsx
# Extract a specific table
occ table inspect finance.xlsx --table 1
# Limit sample rows
occ table inspect report.docx --sample-rows 5
Current table extraction highlights:
- Multi-format support: DOCX (via mammoth HTML), XLSX (via SheetJS), PPTX (from slide XML), ODT and ODP (from content.xml)
- Auto-detected headers: first row is treated as headers when values are unique strings
- Merged cell support: colspan and rowspan are preserved in the output
- Sample row limits: configurable maximum rows per table (default: 20)
- Table filtering: extract a specific table by index with `--table N`
- Token estimates: per-table and total token estimates
- PDF graceful degradation: returns empty tables with an informative note instead of unreliable heuristic output
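The header rule and the sample row limit from the list above can be sketched as follows. The details (trimming, case-insensitive uniqueness, rejecting empty strings) are assumptions added for illustration, and `Cell`, `firstRowLooksLikeHeader`, and `extractTable` are hypothetical names rather than OCC exports.

```typescript
// Hypothetical cell type for a table already read into memory.
type Cell = string | number | boolean | null;

function firstRowLooksLikeHeader(rows: Cell[][]): boolean {
  const first = rows[0];
  if (first === undefined || first.length === 0) return false;
  const seen = new Set<string>();
  for (const cell of first) {
    // Every header candidate must be a non-empty string.
    if (typeof cell !== "string" || cell.trim() === "") return false;
    // Assumption: uniqueness is checked after trimming and lowercasing.
    const key = cell.trim().toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
  }
  return true;
}

function extractTable(
  rows: Cell[][],
  sampleRows = 20, // default taken from the list above
): { headers: string[] | null; rows: Cell[][] } {
  const hasHeader = firstRowLooksLikeHeader(rows);
  const headers = hasHeader ? (rows[0] ?? []).map((c) => String(c)) : null;
  const body = hasHeader ? rows.slice(1) : rows;
  return { headers, rows: body.slice(0, sampleRows) };
}

// Header row detected, body truncated to a single sample row.
const tablePreview = extractTable(
  [["Region", "Q1"], ["North", 10], ["South", 12]],
  1,
);
```

When the first row fails the check, the sketch returns `headers: null` and treats every row as data, which avoids promoting a data row to a header by mistake.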
occ workspace provides combined analysis of code, documents, and structures in a single versioned JSON payload — useful for AI agents that need a complete workspace overview.
# Full workspace analysis (code + documents + structures)
occ workspace analyze --format json
# Skip code analysis
occ workspace analyze --no-code --format json
# Document summaries with cross-reference detection
occ workspace documents --format json
# Limit documents and include markdown content
occ workspace documents --max-files 20 --include-markdown --format json
`occ workspace analyze` returns a `schemaVersion: 1` JSON envelope containing code metrics (via scc), document aggregates, heading structures, skipped files, and errors. `occ workspace documents` returns per-document summaries with cross-references (filename mentions, hyperlinks, citations) and unresolved mentions detected across the document set.
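As a rough illustration of filename-mention detection and unresolved mentions, the sketch below scans document text for filename-like tokens and checks them against the known document set. The matching rule (a regular expression over the seven supported extensions, compared on base name) is an assumption for illustration; it is not OCC's detection logic, and `DocText`, `DocReference`, and `detectFilenameReferences` are hypothetical names.

```typescript
// Hypothetical input: extracted plain text per document.
interface DocText {
  file: string;
  text: string;
}

interface DocReference {
  from: string; // document containing the mention
  to: string;   // resolved path, or the raw token when unresolved
}

// Filename-like tokens ending in one of the supported extensions.
const FILE_TOKEN = /[\w.-]+\.(?:docx|xlsx|pptx|pdf|odt|ods|odp)\b/gi;

function baseName(path: string): string {
  const parts = path.split("/");
  return parts[parts.length - 1] ?? path;
}

function detectFilenameReferences(docs: DocText[]): {
  resolved: DocReference[];
  unresolved: DocReference[];
} {
  // Lowercased base name -> full path of a known document.
  const known = new Map<string, string>();
  for (const d of docs) known.set(baseName(d.file).toLowerCase(), d.file);

  const resolved: DocReference[] = [];
  const unresolved: DocReference[] = [];
  for (const d of docs) {
    const tokens = d.text.match(FILE_TOKEN) ?? [];
    for (const token of tokens) {
      const target = known.get(token.toLowerCase());
      if (target === d.file) continue; // skip self-mentions
      if (target !== undefined) resolved.push({ from: d.file, to: target });
      else unresolved.push({ from: d.file, to: token });
    }
  }
  return { resolved, unresolved };
}

const referenceDemo = detectFilenameReferences([
  { file: "docs/plan.docx", text: "See budget.xlsx and appendix.pdf for details." },
  { file: "finance/budget.xlsx", text: "" },
]);
```

Keeping unresolved mentions as a separate list is useful for spotting documents that are referenced but missing from the scanned set.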
Full documentation is available at cesarandreslopez.github.io/occ.
Tools like scc, cloc, and tokei give you instant visibility into codebases — lines, languages, complexity. But most projects also contain Word documents, PDFs, spreadsheets, and presentations that are invisible to these tools. OCC fills that gap.
- Project audits: instantly see how much documentation lives alongside your code: total word counts, page counts, spreadsheet sizes, and presentation lengths
- Tracking documentation growth: run OCC in CI to monitor how documentation scales over time, catch bloat early, or enforce minimums
- Onboarding: new team members get a quick sense of a project's documentation footprint before diving in
- Migration planning: when moving to a new platform, know exactly what you're dealing with across hundreds of files and formats
- Context budgeting: LLMs have finite context windows. OCC's word and page counts let agents estimate how much of a document set they can ingest before hitting token limits
- Prioritization: an agent deciding which documents to read can use OCC's JSON output to rank files by size, word count, or type, focusing on the most relevant content first
- RAG chunk mapping: `--structure --format json` outputs heading trees with character offsets, enabling chunk-to-section mapping, scoped retrieval, and citation paths in RAG pipelines
- Document triage: `occ doc inspect --format json` surfaces risk flags, content stats, structure, and token estimates before an agent reads the full document
- Spreadsheet triage: `occ sheet inspect --format json` exposes sheet visibility, formulas, links, comments, schema hints, and token estimates before an agent expands workbook contents
- Presentation triage: `occ slide inspect --format json` provides slide inventory, risk flags, and content previews for quick assessment
- Table extraction: `occ table inspect --format json` extracts structured table data (headers, rows, cells) from documents, giving agents direct access to tabular content without parsing raw XML
- Repository mapping: agents exploring an unfamiliar codebase can combine `occ --format json` for document inventory with `occ code ... --format json` for symbol and relationship data
- Pipeline integration: JSON output pipes directly into agent toolchains for automated document analysis, summarization, or compliance checking
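The context budgeting and prioritization ideas above can be sketched as a small selection routine over per-file word counts. The `DocStat` shape is a hypothetical subset of per-file output, not OCC's documented JSON schema, and the ratio of 4 tokens per 3 words is a rough heuristic for English text, not OCC's token estimator.

```typescript
// Hypothetical subset of per-file scan output.
interface DocStat {
  file: string;
  words: number;
}

// Assumption: roughly 4 tokens for every 3 words of English text.
function estimateTokens(words: number): number {
  return Math.ceil((words * 4) / 3);
}

function pickWithinBudget(docs: DocStat[], tokenBudget: number): DocStat[] {
  const picked: DocStat[] = [];
  let used = 0;
  // Smallest documents first, so as many files as possible fit the budget.
  const ordered = [...docs].sort((a, b) => a.words - b.words);
  for (const doc of ordered) {
    const cost = estimateTokens(doc.words);
    if (used + cost > tokenBudget) break;
    picked.push(doc);
    used += cost;
  }
  return picked;
}

const budgetDemo = pickWithinBudget(
  [
    { file: "spec.docx", words: 3000 },
    { file: "notes.odt", words: 300 },
    { file: "report.pdf", words: 900 },
  ],
  2000,
);
```

Sorting by size is only one possible policy; an agent could equally rank by file type or recency before applying the same budget check.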
OCC is written in TypeScript. It uses fast-glob for file discovery, dispatches to format-specific parsers (mammoth for DOCX, pdf-parse for PDF, SheetJS for XLSX, JSZip + officeparser for PPTX/ODF), aggregates metrics, and renders output via cli-table3. For code metrics, it shells out to a vendored scc binary (auto-downloaded during npm install, with PATH fallback).
For structure extraction (--structure), documents are first converted to markdown (mammoth + turndown for DOCX, pdf-parse with page markers for PDF), then headers are extracted and assembled into a tree with dotted section codes.
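The last step described above, turning markdown headings into a tree with dotted section codes, can be sketched as follows. This is a minimal illustration that assumes ATX-style (`#`) headings and ignores page mapping and character offsets; it is not the OCC implementation, and `HeadingNode` and `buildHeadingTree` are hypothetical names.

```typescript
// Hypothetical node shape for illustration; OCC's structure types may differ.
interface HeadingNode {
  code: string;  // dotted section code, e.g. "2.2.1"
  title: string;
  depth: number; // 1 for top-level sections
  children: HeadingNode[];
}

interface StackEntry {
  level: number; // markdown heading level (number of "#")
  node: HeadingNode;
}

function buildHeadingTree(markdown: string): HeadingNode[] {
  const roots: HeadingNode[] = [];
  const stack: StackEntry[] = []; // current ancestor chain, shallowest first

  for (const line of markdown.split("\n")) {
    const match = /^(#{1,6})\s+(.+?)\s*$/.exec(line);
    if (!match) continue;
    const level = (match[1] ?? "").length;
    const title = match[2] ?? "";

    // Pop until the top of the stack is a strictly shallower heading.
    while (stack.length > 0) {
      const top = stack[stack.length - 1];
      if (top !== undefined && top.level < level) break;
      stack.pop();
    }

    const parent = stack[stack.length - 1];
    const siblings = parent ? parent.node.children : roots;
    const index = siblings.length + 1;
    const code = parent ? `${parent.node.code}.${index}` : `${index}`;
    const node: HeadingNode = { code, title, depth: stack.length + 1, children: [] };
    siblings.push(node);
    stack.push({ level, node });
  }
  return roots;
}

const headingDemo = buildHeadingTree(
  [
    "# Executive Summary",
    "## Background",
    "## Key Findings",
    "# Methodology",
    "## Data Collection",
    "## Analysis Framework",
    "### Quantitative Methods",
  ].join("\n"),
);
```

Depth is derived from the ancestor chain rather than the raw heading level, so a document that jumps from `#` straight to `###` still gets contiguous codes. A real implementation would also need to skip `#` lines inside code blocks.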
For occ code, OCC builds an in-memory code graph on demand. JavaScript and TypeScript are parsed with the TypeScript compiler API, Python uses a language-specific parser, and the query engine resolves symbols, imports, calls, inheritance, ambiguities, and dependency categories without a persistent database.
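To illustrate the dependency categories mentioned above, here is a minimal sketch that classifies JS/TS-style import specifiers as local, external, or unresolved. The rule (relative specifiers that match a known file are local, relative specifiers that match nothing are unresolved, everything else is external) and the list of candidate suffixes are assumptions for illustration, not OCC's resolver. `classifyImport` and `joinPath` are hypothetical helpers.

```typescript
type DependencyCategory = "local" | "external" | "unresolved";

// Assumed candidate suffixes tried when resolving a relative specifier.
const CANDIDATE_SUFFIXES = ["", ".ts", ".tsx", ".js", ".jsx", "/index.ts", "/index.js"];

// Join a directory and a relative specifier using "/" separators only.
function joinPath(dir: string, relative: string): string {
  const parts: string[] = dir === "" ? [] : dir.split("/");
  for (const segment of relative.split("/")) {
    if (segment === "." || segment === "") continue;
    if (segment === "..") parts.pop();
    else parts.push(segment);
  }
  return parts.join("/");
}

function classifyImport(
  specifier: string,
  importerDir: string,
  knownFiles: Set<string>,
): DependencyCategory {
  const isRelative =
    specifier.indexOf("./") === 0 || specifier.indexOf("../") === 0;
  // Bare specifiers such as "fast-glob" are treated as external packages.
  if (!isRelative) return "external";

  const base = joinPath(importerDir, specifier);
  for (const suffix of CANDIDATE_SUFFIXES) {
    if (knownFiles.has(base + suffix)) return "local";
  }
  // Relative import that matches no file in the repository.
  return "unresolved";
}

const repoFiles = new Set<string>(["src/service.ts", "src/utils/index.ts"]);
const importKinds = [
  classifyImport("./service", "src", repoFiles),
  classifyImport("fast-glob", "src", repoFiles),
  classifyImport("./missing", "src", repoFiles),
];
```

Python imports, path aliases, and package self-references would need additional rules that this sketch does not cover.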
Contributions are welcome! See CONTRIBUTING.md for setup instructions and guidelines.