A human-first document intelligence store for drug discovery, built to keep scientific work explainable, searchable, and ready for the next breakthrough.
Drug discovery generates a river of documents: experimental protocols, assay results, regulatory artifacts, and research narratives. Docu Store exists to keep that river navigable. We want a system where every update is traceable, every decision is defensible, and every insight is easy to rediscover.
- Event-sourced core for durable provenance and rollback analysis.
- CQRS read models tuned for search, dashboards, and review flows.
- Streaming integration via Kafka to connect lab systems and pipelines.
- API-first architecture built on FastAPI.
- AI-powered enrichment (in progress): OCR extraction of SMILES from images and PDFs, document embeddings across formats, and a vector database for semantic retrieval.
flowchart LR
UI[Client / Integrations] --> API[FastAPI Command API]
API -->|Commands| ES[(Event Store)]
ES -->|Events| Kafka[Kafka Topics]
Kafka --> Proj[Projection Workers]
Proj --> RM[(MongoDB Read Models)]
RM --> QAPI[FastAPI Query API]
QAPI --> UI
sequenceDiagram
participant U as Scientist
participant C as Command API
participant E as Event Store
participant K as Kafka
participant P as Projector
participant M as Read Models
U->>C: Submit Document Update
C->>E: Append Events
E-->>K: Publish Events
K-->>P: Stream Events
P->>M: Update Projections
M-->>U: Eventually Consistent Query Results
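The write-to-read flow in the diagrams above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Docu Store's actual code: `InMemoryEventStore`, `DocumentProjector`, the `Event` fields, and the event names are all hypothetical, a synchronous subscriber list stands in for Kafka, and a dictionary stands in for the MongoDB read models.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Event:
    """An immutable fact about one document (hypothetical shape)."""
    document_id: str
    version: int
    kind: str
    payload: dict[str, Any]


class InMemoryEventStore:
    """Append-only log that stands in for the real event store."""

    def __init__(self) -> None:
        self._streams: dict[str, list[Event]] = defaultdict(list)
        self._subscribers: list[Callable[[Event], None]] = []

    def subscribe(self, handler: Callable[[Event], None]) -> None:
        # Stand-in for Kafka: handlers are invoked synchronously on append.
        self._subscribers.append(handler)

    def append(self, document_id: str, kind: str, payload: dict[str, Any]) -> Event:
        stream = self._streams[document_id]
        # Versions are assigned per document stream, starting at 1.
        event = Event(document_id, len(stream) + 1, kind, payload)
        stream.append(event)
        for handler in self._subscribers:
            handler(event)
        return event


class DocumentProjector:
    """Builds a query-friendly view; a dict stands in for the read-model database."""

    def __init__(self) -> None:
        self.read_model: dict[str, dict[str, Any]] = {}

    def handle(self, event: Event) -> None:
        doc = self.read_model.setdefault(event.document_id, {"id": event.document_id})
        doc.update(event.payload)
        doc["version"] = event.version


store = InMemoryEventStore()
projector = DocumentProjector()
store.subscribe(projector.handle)

# Commands become appended events; the projector keeps the read side current.
store.append("doc-1", "DocumentCreated", {"title": "Assay protocol"})
store.append("doc-1", "DocumentRetitled", {"title": "Assay protocol v2"})
print(projector.read_model["doc-1"])
```

In a real deployment the projector would run in a separate worker and consume from a topic, so queries may briefly lag behind the latest write.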
Docu Store is in active development. The vision is to pair trustworthy data lineage with modern retrieval so scientists can move from “where is that file?” to “what does it imply?” in seconds.
flowchart TD
Raw[Images / PDFs / Lab Docs] --> OCR[OCR + SMILES Extraction]
Raw --> Parser[Structured Parsers]
OCR --> Embed[Embedding Pipeline]
Parser --> Embed
Embed --> VectorDB[(Vector Database)]
VectorDB --> Search[Semantic Search + Reranking]
Search --> UI[Discovery UI / API]
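To make the retrieval half of this pipeline concrete, here is a toy sketch of the embed, index, and search steps. Everything in it is an assumption for illustration: `embed` uses raw token counts where a real pipeline would use a learned embedding model, `ToyVectorIndex` is a brute-force list rather than a vector database, and the OCR, parsing, and reranking stages are left out.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (stand-in for a learned model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Brute-force nearest-neighbour search (stand-in for a vector database)."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, doc_id: str, text: str) -> None:
        self._items.append((doc_id, embed(text)))

    def search(self, query: str, k: int = 3) -> list[tuple[str, float]]:
        query_vec = embed(query)
        scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in self._items]
        # Highest similarity first; ties keep insertion order.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


index = ToyVectorIndex()
index.add("protocol-7", "kinase inhibition assay protocol")
index.add("report-2", "quarterly regulatory filing summary")
index.add("smiles-1", "extracted SMILES CCO from assay image")

results = index.search("assay protocol", k=2)
print([doc_id for doc_id, _ in results])
```

Token overlap only matches exact words, which is why a learned embedding is needed before retrieval can account for chemistry artifacts and experimental context.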
- Context-aware history: every document state is derived, not overwritten.
- Separation of concerns: write paths stay correct, read paths stay fast.
- Composable signals: events and embeddings become reusable blocks for analytics.
- Search that feels human: semantic retrieval that understands chemistry artifacts and experimental context.
- Operational clarity: streaming pipelines are explicit, observable, and testable.
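The "derived, not overwritten" principle can be illustrated with a small replay function: current state is a fold over the event history, and any earlier state can be rebuilt by stopping the fold early. This is a sketch under assumptions; the event shape (`version` plus `changes`) and the names `apply` and `state_at` are hypothetical, not Docu Store's schema.

```python
from functools import reduce
from typing import Any, Optional

Event = dict[str, Any]


def apply(state: dict[str, Any], event: Event) -> dict[str, Any]:
    """Return a new state; earlier states are never mutated."""
    return {**state, **event["changes"], "version": event["version"]}


def state_at(events: list[Event], version: Optional[int] = None) -> dict[str, Any]:
    """Replay events up to and including `version` (all events if None)."""
    selected = [e for e in events if version is None or e["version"] <= version]
    return reduce(apply, selected, {})


history = [
    {"version": 1, "changes": {"title": "IC50 results", "status": "draft"}},
    {"version": 2, "changes": {"status": "reviewed"}},
    {"version": 3, "changes": {"title": "IC50 results (corrected)"}},
]

current = state_at(history)
as_reviewed = state_at(history, version=2)
print(current)
print(as_reviewed)
```

Because nothing is overwritten, the state a reviewer saw at version 2 can be reconstructed after later corrections, which is what makes rollback analysis possible.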
- Skim TESTING_QUICK_REFERENCE.md for a fast local setup.
- Review WORKER_SETUP.md for projector and worker configuration.
- Event sourcing architecture
- CQRS pattern for read/write separation
- Kafka for event streaming
- MongoDB for read models
- FastAPI for REST API