@Acid3croco Acid3croco commented Dec 30, 2025

Summary

Tools to identify French company "gems": profitable companies that are candidates for AI automation or acquisition.

Data Added (2.9GB Parquet files)

Key Columns

| Source | Column | Coverage | What it tells you |
|---|---|---|---|
| SIRENE | siren | 100% | Company ID (join key) |
| SIRENE | denomination | 53% | Company name |
| SIRENE | activite_principale | 99% | Industry (NAF code) |
| SIRENE | tranche_effectifs | 6% | Employee bracket - ⚠️ UNRELIABLE |
| INPI | chiffre_affaires | 23% | Revenue |
| INPI | resultat_net | 29% | Net profit |
| INPI | charges_personnel | 23% | Payroll |
| INPI | confidentialite | 100% | 55% are confidential |
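
As a rough illustration of what the Coverage column could mean, here is a minimal sketch that treats coverage as the share of rows where a column is not null. That definition, the `coverage` helper, and the sample rows are assumptions for illustration; the PR does not state how the percentages were computed.

```python
def coverage(rows, column):
    """Share of rows where `column` is present and not None, as a fraction in [0, 1]."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)


# Hypothetical rows, not taken from the dataset.
sample = [
    {"siren": "000000001", "chiffre_affaires": 1_200_000},
    {"siren": "000000002", "chiffre_affaires": None},
    {"siren": "000000003", "chiffre_affaires": None},
    {"siren": "000000004"},
]

siren_coverage = coverage(sample, "siren")
revenue_coverage = coverage(sample, "chiffre_affaires")
```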

Key Limitations

| Issue | Impact |
|---|---|
| 77% revenue hidden | Small companies file confidential |
| 94% employees unknown | SIRENE tranche_effectifs useless |
| Can't find "small teams" | No reliable employee counts |

The Payroll Trap

Using payroll / 70K to estimate employees is circular:

profit_per_employee = profit / (payroll / 70K) = profit × 70K / payroll

Because 70K is a constant, this is just profit/payroll with extra steps: it ranks companies exactly as profit/payroll does and is biased toward low-wage sectors.
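
The equivalence can be checked numerically. The sketch below uses made-up figures (the company names and amounts are hypothetical) and shows that ranking by profit per estimated employee gives the same order as ranking by profit/payroll, because the two differ only by the constant 70K.

```python
ASSUMED_SALARY = 70_000  # flat cost per employee, the assumption questioned above

# Hypothetical figures for illustration only.
companies = {
    "low_wage_co": {"profit": 200_000, "payroll": 400_000},
    "high_wage_co": {"profit": 300_000, "payroll": 1_200_000},
    "mid_wage_co": {"profit": 150_000, "payroll": 500_000},
}


def profit_per_estimated_employee(c):
    """Profit divided by a headcount estimated as payroll / ASSUMED_SALARY."""
    estimated_employees = c["payroll"] / ASSUMED_SALARY
    return c["profit"] / estimated_employees


def profit_over_payroll(c):
    """Plain profit/payroll ratio, with no headcount assumption."""
    return c["profit"] / c["payroll"]


rank_by_employee = sorted(
    companies, key=lambda k: profit_per_estimated_employee(companies[k]), reverse=True
)
rank_by_payroll = sorted(
    companies, key=lambda k: profit_over_payroll(companies[k]), reverse=True
)
```

Both lists come out in the same order, with the low-wage company on top even though the high-wage company earns more profit in absolute terms.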

What We CAN Find

| Metric | Formula | Reliability |
|---|---|---|
| Profit margin | profit / revenue | ✅ Unbiased |
| Profit/payroll | profit / payroll | ⚠️ Biased to low-wage |
| By sector | Filter by NAF code | ✅ Medical labs, software have good data |

Best Sectors (visible data)

| NAF | Sector | Avg Margin |
|---|---|---|
| 86.90B | Medical labs | 22% |
| 58.29C | Software publishing | 18% |
| 62.02A | IT consulting | 12% |
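
A per-sector average like the one above could be computed along these lines. This is a sketch under assumptions: it takes an unweighted mean of per-company margins and skips companies whose revenue is hidden or not positive. The PR does not say which averaging method the scripts use, and the rows and helper names here are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows; None marks a value hidden by a confidential filing.
rows = [
    {"siren": "000000001", "naf": "86.90B", "revenue": 1_000_000, "profit": 200_000},
    {"siren": "000000002", "naf": "86.90B", "revenue": 500_000, "profit": 150_000},
    {"siren": "000000003", "naf": "58.29C", "revenue": None, "profit": 90_000},
    {"siren": "000000004", "naf": "58.29C", "revenue": 2_000_000, "profit": 400_000},
    {"siren": "000000005", "naf": "62.02A", "revenue": 0, "profit": -10_000},
]


def profit_margin(row):
    """Return profit / revenue, or None when a figure is missing or revenue is not positive."""
    revenue, profit = row["revenue"], row["profit"]
    if revenue is None or profit is None or revenue <= 0:
        return None
    return profit / revenue


def average_margin_by_naf(rows):
    """Unweighted mean margin per NAF code, ignoring rows without a usable margin."""
    margins = defaultdict(list)
    for row in rows:
        margin = profit_margin(row)
        if margin is not None:
            margins[row["naf"]].append(margin)
    return {naf: sum(values) / len(values) for naf, values in margins.items()}


result = average_margin_by_naf(rows)
```

Sectors where every company is skipped simply do not appear in the result, which mirrors the "visible data" caveat in the heading.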

Files

  • README.md - Full data dictionary & honest limitations
  • find_*.py - Analysis scripts
  • data/parquet/ - 2.9GB (gitignored)

🤖 Generated with Claude Code

Add complete data collection system for French company analysis:

## Data Sources
- SIRENE (INSEE): 29M companies, 42M establishments
- BODACC: Legal announcements with date windowing (7-day chunks)
- INPI: Annual accounts via data.cquest.org mirror (2017-2023)

## Features
- DuckDB database with optimized schema
- CLI commands for download, load, sync operations
- Support for both Complete (C) and Simplified (S) bilan types
- XML parser for INPI liasse fiscale codes
- Automatic date windowing to handle API limits
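
The date windowing mentioned above can be sketched as follows. This is not the code in src/extractors/bodacc.py; the `date_windows` name and the choice of inclusive, non-overlapping windows are assumptions made for illustration.

```python
from datetime import date, timedelta


def date_windows(start, end, days=7):
    """Yield (window_start, window_end) pairs covering [start, end], both ends inclusive."""
    if days < 1:
        raise ValueError("days must be >= 1")
    current = start
    while current <= end:
        # The last window is clipped so it never runs past `end`.
        window_end = min(current + timedelta(days=days - 1), end)
        yield current, window_end
        current = window_end + timedelta(days=1)


windows = list(date_windows(date(2025, 1, 1), date(2025, 1, 17)))
```

Each pair could then drive one API request, which keeps every response under whatever result limit the API imposes.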

## Key Files
- cli.py: Click-based CLI interface
- src/extractors/inpi.py: INPI data loader with mirror support
- src/extractors/bodacc.py: BODACC API client with windowing
- src/extractors/sirene.py: SIRENE bulk data loader
- src/core/database.py: DuckDB schema and manager

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
vercel bot commented Dec 30, 2025


| Project | Deployment | Review | Updated (UTC) |
|---|---|---|---|
| exploration-app | Ready | Preview, Comment | Jan 21, 2026 0:38am |

- Document SIRENE vs INPI data reliability (payroll > employee brackets)
- Add profit/payroll ratio as primary metric (no circular assumptions)
- Document holding company detection (profit > revenue = dividends)
- Add analysis scripts for finding PME gems by sector
- Best sectors: medical labs (86.90B), software publishing (58.29C)
- Note: 80% of small PMEs file confidential accounts
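
The holding company heuristic in this list (profit > revenue = dividends) could look like the sketch below. The function name, the extra requirement that profit be positive, and the choice to return None when a figure is hidden are assumptions, not the repository's implementation.

```python
def looks_like_holding(revenue, net_profit):
    """Heuristic: net profit above revenue suggests income from dividends.

    Returns None when either figure is missing (e.g. a confidential filing).
    """
    if revenue is None or net_profit is None:
        return None
    # Assumption: require a positive profit so loss-making companies are not flagged.
    return net_profit > 0 and net_profit > revenue


flag_holding = looks_like_holding(revenue=50_000, net_profit=400_000)
flag_operating = looks_like_holding(revenue=1_000_000, net_profit=100_000)
flag_unknown = looks_like_holding(revenue=None, net_profit=100_000)
```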

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- SIRENE unités légales: 25 columns, coverage stats
- SIRENE établissements: 36 columns, address fields
- INPI comptes: 29 columns, balance sheet + income statement
- BODACC annonces: 24 columns, event types
- Data quality summary highlighting key gaps

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The payroll trick (employees = payroll / 70K) is mathematically circular:
- profit_per_employee = profit × 70K / payroll = just profit/payroll scaled
- Biased toward low-wage sectors (look like gems)
- Misses high-wage tech gems (look mediocre)

Bottom line: without real employee counts, we can only find
high-margin or high-profit/payroll businesses, not "small teams"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Pipeline now saves both raw JSON and flattened Parquet
- Schema-agnostic flattening based on JSON structure
- Recursively flattens nested STRUCT columns
- Extracts fields from JSON string columns (jugement, acte, depot)
- Handles column name collisions with distinguishing prefixes
- ZSTD compression reduces 665MB JSON to 36MB Parquet
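
The flattening described in this list can be illustrated on plain Python dicts. This is a simplified sketch, not the pipeline's code: it always prefixes nested fields with their parent name (rather than only when names collide), and the field names inside `jugement` and `depot` are hypothetical.

```python
import json


def _maybe_parse_json_object(text):
    """Return the parsed dict if `text` is a JSON object, otherwise return `text` unchanged."""
    try:
        parsed = json.loads(text)
    except ValueError:
        return text
    return parsed if isinstance(parsed, dict) else text


def flatten(record, prefix=""):
    """Recursively flatten nested dicts, parsing JSON-encoded string values first."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, str):
            value = _maybe_parse_json_object(value)
        if isinstance(value, dict):
            # The parent name becomes a prefix, so nested "date" cannot clash with top-level "date".
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


# Hypothetical record shaped loosely like an announcement with nested and JSON-string fields.
record = {
    "id": "A1",
    "date": "2025-01-02",
    "jugement": '{"date": "2024-12-20", "nature": "liquidation"}',
    "depot": {"date": "2024-11-30", "type": {"code": "C"}},
}
flat = flatten(record)
```

The three `date` fields end up as `date`, `jugement_date`, and `depot_date`, so none of them overwrites another.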

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>