sidxz/docu-store

DAIKON DOCU STORE

A human-first document intelligence store for drug discovery, built to keep scientific work explainable, searchable, and ready for the next breakthrough.

Status

Tests · Python · Event Sourcing · CQRS · Kafka · MongoDB · Docker · Temporal

Why it exists

Drug discovery generates a river of documents: experimental protocols, assay results, regulatory artifacts, and research narratives. Docu Store exists to keep that river navigable. We want a system where every update is traceable, every decision is defensible, and every insight is easy to rediscover.

Capabilities

  • Event-sourced core for durable provenance and rollback analysis.
  • CQRS read models tuned for search, dashboards, and review flows.
  • Streaming integration via Kafka to connect lab systems and pipelines.
  • API-first architecture built on FastAPI.
  • AI-powered enrichment (in progress): OCR extraction of SMILES from images and PDFs, document embeddings across formats, and a vector database for semantic retrieval.
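To make the first capability concrete, here is a minimal sketch of an event-sourced document: an append-only log whose state is rebuilt by replay, which is what makes "what did this document look like at version N?" answerable. The `Event` and `EventLog` names, the event kinds, and the last-write-wins merge are illustrative assumptions, not the project's actual model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Event:
    """One immutable fact about a document (hypothetical shape)."""
    kind: str                  # e.g. "DocumentCreated", "TitleChanged"
    payload: Dict[str, Any]    # fields this event sets


@dataclass
class EventLog:
    """Append-only log for a single document; events are never edited."""
    events: List[Event] = field(default_factory=list)

    def append(self, event: Event) -> int:
        self.events.append(event)
        return len(self.events)  # version number after this event

    def state_at(self, version: Optional[int] = None) -> Dict[str, Any]:
        """Rebuild state by replaying events up to `version` (all if None)."""
        state: Dict[str, Any] = {}
        upto = self.events if version is None else self.events[:version]
        for event in upto:
            state.update(event.payload)  # assumed merge rule: last write wins
        return state


log = EventLog()
log.append(Event("DocumentCreated", {"title": "Assay protocol", "status": "draft"}))
log.append(Event("TitleChanged", {"title": "Assay protocol v2"}))
log.append(Event("DocumentApproved", {"status": "approved"}))

current = log.state_at()     # latest state, derived by full replay
as_of_v1 = log.state_at(1)   # state as it was after the first event
```

Because nothing is overwritten, comparing `state_at(n)` across versions gives a simple form of rollback analysis without any extra audit table.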

Architecture at a glance

flowchart LR
    UI[Client / Integrations] --> API[FastAPI Command API]
    API -->|Commands| ES[(Event Store)]
    ES -->|Events| Kafka[Kafka Topics]
    Kafka --> Proj[Projection Workers]
    Proj --> RM[(MongoDB Read Models)]
    RM --> QAPI[FastAPI Query API]
    QAPI --> UI
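The hop from the event store to Kafka implies some serialized form for each event. The sketch below shows one plausible envelope and a JSON round trip using only the standard library. The `EventEnvelope` fields and the `to_message` and `from_message` helpers are assumptions for illustration, since the README does not specify a wire format.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class EventEnvelope:
    """Hypothetical wire format for one event on a topic."""
    event_id: str
    aggregate_id: str   # the document this event belongs to
    version: int        # position in that document's event stream
    event_type: str
    occurred_at: str    # ISO 8601 timestamp, UTC
    data: Dict[str, Any]


def to_message(envelope: EventEnvelope) -> Tuple[bytes, bytes]:
    """Return (key, value) bytes suitable for a keyed Kafka record."""
    key = envelope.aggregate_id.encode("utf-8")
    value = json.dumps(asdict(envelope), sort_keys=True).encode("utf-8")
    return key, value


def from_message(value: bytes) -> EventEnvelope:
    """Decode a record value back into an envelope."""
    return EventEnvelope(**json.loads(value.decode("utf-8")))


envelope = EventEnvelope(
    event_id=str(uuid.uuid4()),
    aggregate_id="doc-123",
    version=1,
    event_type="DocumentCreated",
    occurred_at=datetime.now(timezone.utc).isoformat(),
    data={"title": "Assay protocol"},
)
key, value = to_message(envelope)
restored = from_message(value)
```

Keying records by `aggregate_id` is a common choice: with Kafka's default partitioner, records that share a key land on the same partition, so a projector sees each document's events in order.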

Event lifecycle

sequenceDiagram
    participant U as Scientist
    participant C as Command API
    participant E as Event Store
    participant K as Kafka
    participant P as Projector
    participant M as Read Models

    U->>C: Submit Document Update
    C->>E: Append Events
    E-->>K: Publish Events
    K-->>P: Stream Events
    P->>M: Update Projections
    M-->>U: Consistent Query Results
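The lifecycle above can be walked through in memory. In this sketch a `queue.Queue` stands in for a Kafka topic and plain dictionaries stand in for the event store and the MongoDB read models. All function names and the event shape are hypothetical, chosen only to mirror the diagram's steps.

```python
import queue
from collections import defaultdict

event_store = defaultdict(list)   # stands in for the Event Store
topic = queue.Queue()             # stands in for a Kafka topic
read_models = {}                  # stands in for MongoDB read models


def handle_update(doc_id, changes):
    """Command side: append an event, then publish it."""
    version = len(event_store[doc_id]) + 1
    event = {"doc_id": doc_id, "version": version, "changes": changes}
    event_store[doc_id].append(event)
    topic.put(event)
    return version


def run_projector():
    """Projector: drain the topic and update read models.

    Events at or below the view's version are skipped, so a duplicate
    delivery does not change the projection.
    """
    applied = 0
    while not topic.empty():
        event = topic.get()
        view = read_models.setdefault(event["doc_id"], {"version": 0})
        if event["version"] <= view["version"]:
            continue
        view.update(event["changes"])
        view["version"] = event["version"]
        applied += 1
    return applied


def query(doc_id):
    """Query side: read only from the projection, never from the event store."""
    return read_models.get(doc_id)


handle_update("doc-1", {"title": "Binding assay"})
handle_update("doc-1", {"status": "reviewed"})
before = query("doc-1")   # projector has not run yet, so nothing is visible
run_projector()
after = query("doc-1")    # both events are now reflected in the read model
```

The gap between `before` and `after` is the practical meaning of separating the write and read paths: queries see an update only once the projector has caught up.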

Intelligence roadmap

Docu Store is in active development. The vision is to pair trustworthy data lineage with modern retrieval so scientists can move from “where is that file?” to “what does it imply?” in seconds.

flowchart TD
    Raw[Images / PDFs / Lab Docs] --> OCR[OCR + SMILES Extraction]
    Raw --> Parser[Structured Parsers]
    OCR --> Embed[Embedding Pipeline]
    Parser --> Embed
    Embed --> VectorDB[(Vector Database)]
    VectorDB --> Search[Semantic Search + Reranking]
    Search --> UI[Discovery UI / API]
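The embed, store, and search stages of this roadmap can be sketched with a toy example. The token-count "embedding" below is a stand-in for a learned model, and `VectorIndex` is a hypothetical in-memory substitute for a vector database. OCR, SMILES extraction, and reranking are out of scope here.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase token counts (a real pipeline would use a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorIndex:
    """Brute-force in-memory index; fine for a demo, not for production scale."""

    def __init__(self):
        self._items = []  # list of (doc_id, vector)

    def add(self, doc_id, text):
        self._items.append((doc_id, embed(text)))

    def search(self, query, k=3):
        """Return the k most similar documents as (doc_id, score) pairs."""
        q = embed(query)
        scored = [(cosine(q, vec), doc_id) for doc_id, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(doc_id, score) for score, doc_id in scored[:k]]


index = VectorIndex()
index.add("protocol-7", "kinase inhibition assay protocol with IC50 measurement")
index.add("report-2", "regulatory submission summary for toxicology study")
index.add("notes-9", "synthesis notes for a salicylate analogue")

results = index.search("IC50 kinase assay", k=2)
```

Swapping `embed` for a real embedding model and `VectorIndex` for a vector database keeps the same shape: add documents as vectors, then rank by similarity to the query vector.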

What makes it intelligent

  • Context-aware history: every document state is derived from its event history, never overwritten in place.
  • Separation of concerns: write paths stay correct, read paths stay fast.
  • Composable signals: events and embeddings become reusable blocks for analytics.
  • Search that feels human: semantic retrieval that understands chemistry artifacts and experimental context.
  • Operational clarity: streaming pipelines are explicit, observable, and testable.

Next steps

  • Skim TESTING_QUICK_REFERENCE.md for a fast local setup.
  • Review WORKER_SETUP.md for projector and worker configuration.

Features

  • Event sourcing architecture
  • CQRS pattern for read/write separation
  • Kafka for event streaming
  • MongoDB for read models
  • FastAPI for REST API
