A scraper for ema.europa.eu, hobby project
The scraper stores pages from the site in a MongoDB database, with the aim of providing a dataset for developing a graph RAG retrieval pipeline.
- ema-rag/
  - `config.yaml` - All configuration (patterns, paths, etc.)
  - `config_loader.py` - YAML config loading
  - `run_crawl.py` - Entry point for crawling
  - `explore_graph.py` - Explore graph after crawling
  - `requirements.txt`
  - scraper/
    - `spider.py` - Thin orchestrator
    - `classifiers.py` - URL classification (Strategy pattern)
    - `extractors.py` - Content extraction (Strategy pattern)
    - `items.py` - Data containers
    - `pipelines.py` - Spider output → Graph
    - `settings.py` - Scrapy settings
  - storage/
    - `pymongodb.py` - Connector to MongoDB (Repository pattern)
  - parsers/ - PDF parsing (Strategy + Factory)
    - `base.py`
    - `__init__.py` - Factory: `get_parser()`
    - `pymupdf_parser.py`
  - embeddings/ - Embedding models (Strategy + Factory)
    - `base.py`
    - `__init__.py`
    - `local_hf.py`
  - vectordb/ - Vector store (for later)
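
The "Strategy + Factory" layout noted for `parsers/` can be illustrated with a minimal sketch. This is not the project's actual code: only the name `get_parser()` comes from the layout above, while its signature, the `BaseParser` interface, the `PlainTextParser` stand-in, and the `"plain"` key are assumptions made so the example runs without PyMuPDF.

```python
from abc import ABC, abstractmethod


class BaseParser(ABC):
    """Strategy interface (assumed): turn raw document bytes into text."""

    @abstractmethod
    def parse(self, data: bytes) -> str:
        ...


class PlainTextParser(BaseParser):
    """Hypothetical stand-in strategy; a real one would wrap a PDF library."""

    def parse(self, data: bytes) -> str:
        return data.decode("utf-8", errors="replace").strip()


# Registry mapping a configuration name to a parser class.
# The key "plain" is an assumption for illustration only.
_PARSERS: dict[str, type[BaseParser]] = {"plain": PlainTextParser}


def get_parser(name: str) -> BaseParser:
    """Factory: look up a strategy by name and return a new instance."""
    try:
        return _PARSERS[name]()
    except KeyError:
        raise ValueError(
            f"Unknown parser {name!r}; available: {sorted(_PARSERS)}"
        ) from None


parser = get_parser("plain")
print(parser.parse(b"  EMA product information  "))
```

With this shape, adding another parser would only mean writing a new `BaseParser` subclass and registering it; callers keep using `get_parser()` with a name that could come from `config.yaml`. The same idea would presumably apply to `embeddings/`.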