feat: add France Market Scanner with INPI data loader #1

Acid3croco · 2025-12-30T21:08:58Z

Summary

Tools to identify French company "gems": profitable companies that are candidates for AI automation or acquisition.

Data Added (2.9GB Parquet files)

Key Columns

Source	Column	Coverage	What it tells you
SIRENE	`siren`	100%	Company ID (join key)
SIRENE	`denomination`	53%	Company name
SIRENE	`activite_principale`	99%	Industry (NAF code)
SIRENE	`tranche_effectifs`	6%	Employee bracket - ⚠️ UNRELIABLE
INPI	`chiffre_affaires`	23%	Revenue
INPI	`resultat_net`	29%	Net profit
INPI	`charges_personnel`	23%	Payroll
INPI	`confidentialite`	100%	55% are confidential

Key Limitations

Issue	Impact
77% revenue hidden	Small companies file confidential
94% employees unknown	SIRENE `tranche_effectifs` useless
Can't find "small teams"	No reliable employee counts

The Payroll Trap

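Estimating employees as payroll / 70K and then computing profit per employee is circular:

profit_per_employee = profit / (payroll / 70K) = profit × 70K / payroll

This is just profit/payroll multiplied by a constant, so it ranks companies exactly as profit/payroll does and inherits the same bias toward low-wage sectors.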
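A minimal sketch of why the estimate is circular. The company names and figures are made up for illustration and do not come from the dataset; only the 70K constant is taken from the text above. Dividing payroll by a fixed assumed salary rescales every company by the same factor, so the ranking cannot differ from a plain profit/payroll ranking:

```python
ASSUMED_SALARY = 70_000  # flat per-employee cost assumed by the payroll trick

# Hypothetical companies: (name, net profit, payroll), all in euros.
companies = [
    ("low_wage_retail", 300_000, 400_000),
    ("high_wage_software", 600_000, 1_500_000),
    ("mid_wage_lab", 500_000, 1_000_000),
]


def profit_per_estimated_employee(profit, payroll):
    """Profit divided by a headcount estimated as payroll / ASSUMED_SALARY."""
    estimated_employees = payroll / ASSUMED_SALARY
    return profit / estimated_employees


def profit_per_payroll(profit, payroll):
    """Plain profit/payroll ratio, with no salary assumption."""
    return profit / payroll


def ranking(metric):
    """Company names sorted from best to worst under the given metric."""
    ordered = sorted(companies, key=lambda c: metric(c[1], c[2]), reverse=True)
    return [name for name, _, _ in ordered]


rank_by_employee = ranking(profit_per_estimated_employee)
rank_by_payroll = ranking(profit_per_payroll)
print(rank_by_employee == rank_by_payroll)
```

In this toy data the low-wage company ranks first under both metrics even though the software company earns the largest absolute profit, which is the bias described above.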

What We CAN Find

Metric	Formula	Reliability
Profit margin	`profit / revenue`	✅ Unbiased
Profit/payroll	`profit / payroll`	⚠️ Biased to low-wage
By sector	Filter by NAF code	✅ Medical labs, software have good data
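The two ratios in the table can be sketched as follows. The field names mirror the INPI columns listed under Key Columns; the function names, and the choice to return `None` when a value is missing or the denominator is not positive, are assumptions made for this example and not necessarily what the analysis scripts do.

```python
from typing import Optional


def safe_ratio(numerator: Optional[float], denominator: Optional[float]) -> Optional[float]:
    """Return numerator / denominator, or None if either is missing or the denominator is <= 0."""
    if numerator is None or denominator is None or denominator <= 0:
        return None
    return numerator / denominator


def company_metrics(row: dict) -> dict:
    """Compute profit margin and profit/payroll for one company row."""
    profit = row.get("resultat_net")
    return {
        "siren": row.get("siren"),
        "profit_margin": safe_ratio(profit, row.get("chiffre_affaires")),
        "profit_per_payroll": safe_ratio(profit, row.get("charges_personnel")),
    }


# Hypothetical row: revenue and profit are filed, payroll is missing.
example = company_metrics({
    "siren": "000000000",
    "chiffre_affaires": 1_000_000.0,
    "resultat_net": 200_000.0,
    "charges_personnel": None,
})
```

Returning `None` instead of zero keeps companies with confidential or missing figures out of any ranking, which matters given how sparse the revenue and payroll columns are.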

Best Sectors (visible data)

NAF	Sector	Avg Margin
86.90B	Medical labs	22%
58.29C	Software publishing	18%
62.02A	IT consulting	12%

Files

README.md - Full data dictionary & honest limitations
find_*.py - Analysis scripts
data/parquet/ - 2.9GB (gitignored)

🤖 Generated with Claude Code

Add complete data collection system for French company analysis:

## Data Sources
- SIRENE (INSEE): 29M companies, 42M establishments
- BODACC: Legal announcements with date windowing (7-day chunks)
- INPI: Annual accounts via data.cquest.org mirror (2017-2023)

## Features
- DuckDB database with optimized schema
- CLI commands for download, load, sync operations
- Support for both Complete (C) and Simplified (S) bilan types
- XML parser for INPI liasse fiscale codes
- Automatic date windowing to handle API limits

## Key Files
- cli.py: Click-based CLI interface
- src/extractors/inpi.py: INPI data loader with mirror support
- src/extractors/bodacc.py: BODACC API client with windowing
- src/extractors/sirene.py: SIRENE bulk data loader
- src/core/database.py: DuckDB schema and manager

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
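The 7-day date windowing mentioned for BODACC in the commit above could look roughly like the sketch below. The function name `date_windows` and the use of inclusive start and end dates are assumptions for illustration; the actual client in `src/extractors/bodacc.py` may split ranges differently.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def date_windows(start: date, end: date, days: int = 7) -> Iterator[Tuple[date, date]]:
    """Yield consecutive inclusive (window_start, window_end) pairs covering start..end."""
    if days < 1:
        raise ValueError("days must be at least 1")
    window_start = start
    while window_start <= end:
        # The last window is clipped so it never runs past the requested end date.
        window_end = min(window_start + timedelta(days=days - 1), end)
        yield window_start, window_end
        window_start = window_end + timedelta(days=1)


windows = list(date_windows(date(2025, 1, 1), date(2025, 1, 17)))
for window_start, window_end in windows:
    print(window_start.isoformat(), window_end.isoformat())
```

Each pair would then drive one API request, keeping every response under whatever result limit the API enforces.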

vercel · 2025-12-30T21:09:02Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Review	Updated (UTC)
exploration-app	Ready	Preview, Comment	Jan 21, 2026 0:38am

- Document SIRENE vs INPI data reliability (payroll > employee brackets)
- Add profit/payroll ratio as primary metric (no circular assumptions)
- Document holding company detection (profit > revenue = dividends)
- Add analysis scripts for finding PME gems by sector
- Best sectors: medical labs (86.90B), software publishing (58.29C)
- Note: 80% of small PMEs file confidential accounts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
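The holding company rule from the commit above (profit greater than revenue, which suggests dividend income) can be sketched as a simple predicate. The function name, the extra requirement that profit be positive, and the treatment of missing values as "not a holding" are assumptions added for this example.

```python
from typing import Optional


def looks_like_holding(resultat_net: Optional[float],
                       chiffre_affaires: Optional[float]) -> bool:
    """Flag companies whose net profit exceeds their revenue."""
    if resultat_net is None or chiffre_affaires is None:
        # Assumption: without both figures we cannot apply the rule.
        return False
    # Assumption: only positive profits count, so loss-making companies are never flagged.
    return resultat_net > 0 and resultat_net > chiffre_affaires


print(looks_like_holding(500_000.0, 50_000.0))   # profit far above revenue
print(looks_like_holding(100_000.0, 900_000.0))  # ordinary operating company
```

Flagged companies would typically be excluded before ranking by profit margin, since a margin above 100% reflects dividends received, not operating performance.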

- SIRENE unités légales: 25 columns, coverage stats
- SIRENE établissements: 36 columns, address fields
- INPI comptes: 29 columns, balance sheet + income statement
- BODACC annonces: 24 columns, event types
- Data quality summary highlighting key gaps

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The payroll trick (employees = payroll / 70K) is mathematically circular:
- profit_per_employee = profit × 70K / payroll = just profit/payroll scaled
- Biased toward low-wage sectors (look like gems)
- Misses high-wage tech gems (look mediocre)

Bottom line: without real employee counts, we can only find high-margin or high-profit/payroll businesses, not "small teams"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Pipeline now saves both raw JSON and flattened Parquet
- Schema-agnostic flattening based on JSON structure
- Recursively flattens nested STRUCT columns
- Extracts fields from JSON string columns (jugement, acte, depot)
- Handles column name collisions with distinguishing prefixes
- ZSTD compression reduces 665MB JSON to 36MB Parquet

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
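The flattening steps in the commit above can be illustrated in plain Python on a single record. This is a sketch of the idea only: the real pipeline works on nested STRUCT columns, and the function names, the inner field names (`date`, `type`), and the rule of prefixing a colliding name with its parent path are assumptions made here.

```python
import json
from typing import Any, Dict, Iterator, Tuple

Path = Tuple[str, ...]


def iter_leaves(value: Any, path: Path = ()) -> Iterator[Tuple[Path, Any]]:
    """Yield (path, leaf) pairs, descending into dicts and JSON-object strings."""
    if isinstance(value, str) and value.lstrip().startswith("{"):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            parsed = None  # not valid JSON: keep the raw string as a leaf
        if isinstance(parsed, dict):
            value = parsed
    if isinstance(value, dict):
        for key, child in value.items():
            yield from iter_leaves(child, path + (key,))
    else:
        yield path, value


def flatten_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a nested record; colliding leaf names get their parent path as a prefix."""
    leaves = list(iter_leaves(record))
    counts: Dict[str, int] = {}
    for path, _ in leaves:
        counts[path[-1]] = counts.get(path[-1], 0) + 1
    flat: Dict[str, Any] = {}
    for path, leaf in leaves:
        name = path[-1] if counts[path[-1]] == 1 else "_".join(path)
        flat[name] = leaf
    return flat


# Hypothetical announcement: `jugement` is a JSON string, `depot` a nested object.
record = {
    "id": "A1",
    "jugement": '{"date": "2025-01-02", "type": "ouverture"}',
    "depot": {"date": "2025-03-04"},
}
flat = flatten_record(record)
print(flat)
```

Because the walk depends only on the shape of each record, no field list has to be maintained by hand, which is what makes the approach schema-agnostic. A production version would also need to guard against a prefixed name clashing with an existing top-level key.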

vercel bot deployed to Preview December 30, 2025 21:09 View deployment

vercel bot deployed to Preview January 7, 2026 08:33 View deployment

vercel bot deployed to Preview January 7, 2026 08:45 View deployment

vercel bot deployed to Preview January 7, 2026 09:01 View deployment

vercel bot deployed to Preview January 21, 2026 12:38 View deployment
