feat: add France Market Scanner with INPI data loader #1
Open: Acid3croco wants to merge 5 commits into `master` from `feat/inpi-data-loader`
Conversation
Add complete data collection system for French company analysis:

## Data Sources
- SIRENE (INSEE): 29M companies, 42M establishments
- BODACC: Legal announcements with date windowing (7-day chunks)
- INPI: Annual accounts via data.cquest.org mirror (2017-2023)

## Features
- DuckDB database with optimized schema
- CLI commands for download, load, sync operations
- Support for both Complete (C) and Simplified (S) bilan types
- XML parser for INPI liasse fiscale codes
- Automatic date windowing to handle API limits

## Key Files
- cli.py: Click-based CLI interface
- src/extractors/inpi.py: INPI data loader with mirror support
- src/extractors/bodacc.py: BODACC API client with windowing
- src/extractors/sirene.py: SIRENE bulk data loader
- src/core/database.py: DuckDB schema and manager

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
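The 7-day date windowing mentioned for BODACC can be illustrated with a small helper. This is only a sketch of the general technique, not the code in `src/extractors/bodacc.py`: the function name `date_windows` and its parameters are hypothetical, and it assumes the API accepts inclusive start and end dates.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def date_windows(start: date, end: date, days: int = 7) -> Iterator[Tuple[date, date]]:
    """Yield inclusive (window_start, window_end) pairs that cover start..end.

    Hypothetical helper: each pair would be used for one API request so that
    no single request exceeds the API's result limit.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    current = start
    while current <= end:
        # The last window is clipped so it never runs past the requested end date.
        window_end = min(current + timedelta(days=days - 1), end)
        yield current, window_end
        current = window_end + timedelta(days=1)


if __name__ == "__main__":
    for window_start, window_end in date_windows(date(2024, 1, 1), date(2024, 1, 20)):
        print(window_start.isoformat(), window_end.isoformat())
```

Windows are contiguous and non-overlapping, so a sync that fails partway can resume from the first window that was not completed.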
- Document SIRENE vs INPI data reliability (payroll > employee brackets)
- Add profit/payroll ratio as primary metric (no circular assumptions)
- Document holding company detection (profit > revenue = dividends)
- Add analysis scripts for finding PME gems by sector
- Best sectors: medical labs (86.90B), software publishing (58.29C)
- Note: 80% of small PMEs file confidential accounts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
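The two rules in this commit (profit/payroll as the primary metric, and profit greater than revenue as a holding-company signal) could be expressed roughly as follows. This is a sketch under assumptions: the `Accounts` class and both function names are hypothetical, and only the field names `siren`, `chiffre_affaires`, `resultat_net` and `charges_personnel` come from the PR's column list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Accounts:
    """Hypothetical container for one company's filed accounts."""
    siren: str
    chiffre_affaires: Optional[float]   # revenue
    resultat_net: Optional[float]       # net profit
    charges_personnel: Optional[float]  # payroll


def profit_payroll_ratio(a: Accounts) -> Optional[float]:
    """Return net profit divided by payroll, or None when it cannot be computed."""
    # Missing, zero, or negative payroll makes the ratio meaningless.
    if a.resultat_net is None or not a.charges_personnel or a.charges_personnel <= 0:
        return None
    return a.resultat_net / a.charges_personnel


def looks_like_holding(a: Accounts) -> bool:
    """Flag companies whose profit exceeds revenue (income is likely dividends)."""
    if a.resultat_net is None or a.chiffre_affaires is None:
        return False
    return a.resultat_net > a.chiffre_affaires


if __name__ == "__main__":
    operating = Accounts("000000001", 1_000_000.0, 200_000.0, 400_000.0)
    holding = Accounts("000000002", 50_000.0, 300_000.0, 0.0)
    print(profit_payroll_ratio(operating), looks_like_holding(operating))
    print(profit_payroll_ratio(holding), looks_like_holding(holding))
```

Returning `None` instead of raising keeps confidential or incomplete filings from interrupting a sector-wide scan.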
- SIRENE unités légales: 25 columns, coverage stats
- SIRENE établissements: 36 columns, address fields
- INPI comptes: 29 columns, balance sheet + income statement
- BODACC annonces: 24 columns, event types
- Data quality summary highlighting key gaps

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The payroll trick (employees = payroll / 70K) is mathematically circular:
- profit_per_employee = profit × 70K / payroll = just profit/payroll scaled
- Biased toward low-wage sectors (look like gems)
- Misses high-wage tech gems (look mediocre)

Bottom line: without real employee counts, we can only find high-margin or high-profit/payroll businesses, not "small teams"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
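The circularity is easy to check numerically. The sketch below uses two made-up companies with identical real headcount and profit but different wage levels; all figures and names are hypothetical, and only the flat 70K assumption comes from the commit message.

```python
ASSUMED_COST_PER_EMPLOYEE = 70_000  # the flat 70K figure from the commit message


def estimated_employees(payroll: float) -> float:
    """Estimate headcount by dividing payroll by a flat cost per employee."""
    return payroll / ASSUMED_COST_PER_EMPLOYEE


def profit_per_estimated_employee(profit: float, payroll: float) -> float:
    """Profit divided by the payroll-derived headcount estimate."""
    return profit / estimated_employees(payroll)


# Hypothetical companies: same profit, same real headcount, different wages.
COMPANIES = {
    "low_wage": {"profit": 500_000, "payroll": 300_000, "employees": 10},
    "high_wage": {"profit": 500_000, "payroll": 1_200_000, "employees": 10},
}


def compare(companies: dict) -> dict:
    """Compute the real metric, the estimated metric, and plain profit/payroll."""
    rows = {}
    for name, c in companies.items():
        rows[name] = {
            "real_profit_per_employee": c["profit"] / c["employees"],
            "estimated_profit_per_employee": profit_per_estimated_employee(
                c["profit"], c["payroll"]
            ),
            "profit_payroll_ratio": c["profit"] / c["payroll"],
        }
    return rows


if __name__ == "__main__":
    for name, row in compare(COMPANIES).items():
        print(name, row)
```

Both companies earn the same real profit per employee, yet the estimated metric ranks the low-wage one about four times higher. In every row the estimate equals `profit_payroll_ratio` multiplied by 70,000 (up to floating-point rounding), so it carries no information beyond profit/payroll.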
- Pipeline now saves both raw JSON and flattened Parquet
- Schema-agnostic flattening based on JSON structure
- Recursively flattens nested STRUCT columns
- Extracts fields from JSON string columns (jugement, acte, depot)
- Handles column name collisions with distinguishing prefixes
- ZSTD compression reduces 665MB JSON to 36MB Parquet

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
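The flattening described here can be sketched in plain Python. This is an illustration of the idea, not the pipeline's implementation: the `flatten` function is hypothetical, the inner field names in the sample record are invented, and it always prefixes nested fields with their parent name rather than only on collision. Writing the result to ZSTD-compressed Parquet is not shown.

```python
import json
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Recursively flatten nested dicts and JSON-encoded string fields.

    Nested keys are joined with "_" and prefixed with the parent key, so
    "date" under "jugement" and "date" under "acte" stay distinct.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        # Some fields arrive as JSON text; decode them when they look like objects.
        if isinstance(value, str) and value.lstrip().startswith("{"):
            try:
                value = json.loads(value)
            except json.JSONDecodeError:
                pass  # keep the original string if it is not valid JSON
        if isinstance(value, dict):
            items = flatten(value, prefix=f"{name}_")
        else:
            items = {name: value}
        for column, cell in items.items():
            # Fail loudly rather than silently overwrite a column.
            if column in flat:
                raise ValueError(f"column name collision: {column}")
            flat[column] = cell
    return flat


if __name__ == "__main__":
    row = {
        "id": "A1",
        "jugement": '{"date": "2024-01-05", "nature": "x"}',
        "acte": {"date": "2024-02-01", "detail": {"type": "y"}},
    }
    print(flatten(row))
```

Because the function inspects only the shape of each value, it needs no schema up front, which matches the "schema-agnostic" goal in the commit message.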
## Summary
Tools to identify French company "gems" - profitable companies for AI automation or acquisition targets.

## Data Added (2.9GB Parquet files)

## Key Columns
- `siren`
- `denomination`
- `activite_principale`
- `tranche_effectifs`
- `chiffre_affaires`
- `resultat_net`
- `charges_personnel`
- `confidentialite`

## Key Limitations
- `tranche_effectifs`: useless

## The Payroll Trap
Using `payroll / 70K` to estimate employees is circular: it is just `profit/payroll` with extra steps, and it is biased toward low-wage sectors.

## What We CAN Find
- `profit / revenue`
- `profit / payroll`

## Best Sectors (visible data)

## Files
- `README.md` - Full data dictionary & honest limitations
- `find_*.py` - Analysis scripts
- `data/parquet/` - 2.9GB (gitignored)

🤖 Generated with Claude Code