@r-chong r-chong commented Apr 7, 2025

Key changes:

- Introduced content_buffer to DocumentProcessingContext for temporary batching of KnowledgeBaseContent objects.
- Reworked insert_content to use a dynamic buffer that auto-flushes to the database when full.
- Implemented bulk_save_objects for fast batched database insertions, replacing individual add/commit calls.
- Ensured constant memory usage during ingestion to handle multi-thousand page documents without stalls or blowups.
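
For readers skimming the description, here is a minimal sketch of the buffering pattern described above. It is not the code from this PR. The model fields, the `buffer_size` threshold, the `flush_content` helper, and the `RecordingSession` fake are all hypothetical and exist only for illustration. The sketch assumes `db` behaves like a SQLAlchemy `Session`, which provides `bulk_save_objects` and `commit`.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class KnowledgeBaseContent:
    """Stand-in for the ORM model named in the PR; these fields are hypothetical."""
    document_id: int
    text: str


@dataclass
class DocumentProcessingContext:
    """Stand-in context; only the content_buffer name comes from the PR text."""
    content_buffer: List[KnowledgeBaseContent] = field(default_factory=list)
    buffer_size: int = 500  # hypothetical flush threshold


def flush_content(ctx: DocumentProcessingContext, db: Any) -> int:
    """Write all buffered objects in one batch and empty the buffer."""
    if not ctx.content_buffer:
        return 0
    count = len(ctx.content_buffer)
    db.bulk_save_objects(ctx.content_buffer)  # one batched insert
    db.commit()                               # one commit per batch
    ctx.content_buffer.clear()
    return count


def insert_content(ctx: DocumentProcessingContext, db: Any,
                   item: KnowledgeBaseContent) -> None:
    """Buffer the item; flush automatically once the buffer is full."""
    ctx.content_buffer.append(item)
    if len(ctx.content_buffer) >= ctx.buffer_size:
        flush_content(ctx, db)


class RecordingSession:
    """Fake session for the demo: records batches instead of touching a database."""

    def __init__(self) -> None:
        self.batches: List[List[KnowledgeBaseContent]] = []
        self.commits = 0

    def bulk_save_objects(self, objects: List[KnowledgeBaseContent]) -> None:
        self.batches.append(list(objects))  # copy, since the caller clears its list

    def commit(self) -> None:
        self.commits += 1


# Demo: seven items with a buffer of three produce two automatic flushes,
# and one explicit flush at the end for the partial batch.
ctx = DocumentProcessingContext(buffer_size=3)
db = RecordingSession()
for i in range(7):
    insert_content(ctx, db, KnowledgeBaseContent(document_id=1, text=f"block {i}"))
flush_content(ctx, db)
batch_sizes = [len(batch) for batch in db.batches]
```

In a pattern like this, the buffer never holds more than `buffer_size` objects, which is what keeps the memory used by pending objects bounded. The caller has to flush once more at the end of ingestion, otherwise the last partial batch is never written.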

r-chong added 12 commits April 5, 2025 18:17
- Replace invalid temp_dict.add() usage with a dict literal to fix runtime errors.
- Pass section_stack into traverse_blocks to ensure proper block hierarchy traversal.
- Pass db parameter into insert_content to allow database access where needed.
- Remove dead or commented-out config_parser code to clean up unused logic.
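
As a small illustration of the first fix in this commit: Python dicts have no `add` method (that method belongs to sets), so a call like `temp_dict.add()` raises `AttributeError` at runtime. The key and value below are hypothetical, not taken from the PR.

```python
# Calling .add() on a dict fails at runtime because dict has no such method.
temp_dict = {}
try:
    temp_dict.add("section", "Introduction")
    error_type = None
except AttributeError as exc:
    error_type = type(exc).__name__

# A dict literal builds the same mapping without the failing call.
temp_dict = {"section": "Introduction"}
```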
… speedup

- Added content_buffer to DocumentProcessingContext
- Modified insert_content to buffer KnowledgeBaseContent objects
- Used bulk_save_objects for batched database insertions
- Improved processing speed for large PDFs and textbooks
- Implemented dynamic buffering for database inserts
- Auto-flushes content when buffer fills to maintain constant memory usage
- Enables smooth streaming ingestion for multi-thousand page documents without stalls
feat: document processing

This PR significantly improves the performance and scalability of document ingestion in the document_processing module.

@r-chong r-chong merged commit 08fd9ef into main Apr 7, 2025
1 check failed