Features-api: Add download audit system with PDF generation, analytics, and CSV export#45
Features-api: Add download audit system with PDF generation, analytics, and CSV export#45nando-bingani wants to merge 28 commits intov2-prodfrom
Conversation
**Backend API Endpoints:** - GET /download/logs - Paginated download logs with filtering (date range, email, organisation, download_type) - GET /download/stats - Download statistics (total downloads, unique users, data volume, success rate) - GET /download/export/csv - CSV export of download logs **Database Improvements:** - Add indexes for efficient querying: - timestamp DESC for date range queries - client_id for user tracking - GIN indexes on meta JSON fields (email, organisation, download_type) - Composite index (timestamp DESC, client_id) for common filter patterns **Features:** - Paginated results with configurable page size (max 200) - Multiple filter options (date range, email, organisation, download type) - Aggregate statistics calculation - CSV export with dynamic filename - Graceful error handling with proper HTTP status codes Alembic Migration: 2025_11_14_b7dd3a950ed5_add_download_audit_indexes
…y data **New Files:** 1. scripts/populate_download_audit.py - Python script for generating realistic dummy download audit records - Uses Faker library for realistic names, emails, organisations - Generates 100 records spread over 30 days (configurable) - Simulates 95% success rate, 5% failure rate - Creates mix of single record and ZIP bundle downloads - Supports --count and --days command line arguments - Prints statistics after insertion 2. sql/download/populate_dummy_data.sql - Pre-written SQL script with 20 hand-crafted records - Quick option for manual database population - Uses PostgreSQL JSONB functions - Includes realistic South African organisations - Mix of download types and dates - Can be run directly with psql or from admin tools 3. DOWNLOAD_AUDIT_README.md - Complete documentation for the download audit system - Instructions for both population methods - API endpoint documentation with examples - Testing guide for reporting features - Performance notes and database schema - Data clearing instructions - Sample organisations and user data **Features:** - Generates realistic user data (names, emails from various organisations) - South African IP addresses and real organisation names - Proper file size distributions - Complete metadata for both single and batch downloads - JSONB meta field structure matching production data - Timezone-aware timestamps distributed across date range
**New Files:** 1. scripts/populate_simple.py - Simplified Python script for populating download_audit table - NO external dependencies required (no Faker) - Generates 50 realistic dummy records - Spreads over last 30 days - Mix of single record and ZIP bundle downloads - 95% success rate, 5% failure rate - Shows statistics after insertion - Easy to run: python scripts/populate_simple.py 2. POPULATE_DATA_GUIDE.md - Complete guide for populating download audit table - Three methods: Python (simple), SQL, or Python (with Faker) - Step-by-step instructions - Database connection troubleshooting - Verification commands - Data clearing instructions - Sample data details and format - Expected output examples **Usage:** ``` cd odp-server python scripts/populate_simple.py ``` This will insert 50 realistic test records into download_audit table with proper metadata matching production format.
- Implement POST /catalog/download/bundle endpoint - Streams ZIP file containing metadata PDFs for multiple records - Validates record existence and enforces 2GB size limit - Logs downloads to download_audit table with bundle metadata - Uses built-in zipfile module (no external dependencies) - Returns StreamingResponse for memory efficiency This replaces client-side ZIP generation which caused browser memory issues.
… odp-ui - Move build_metadata_pdf() function to odp-server - Add ReportLab to odp-server dependencies - Update /catalog/download/bundle to generate PDFs internally - Eliminate HTTP calls to odp-ui during bundle creation - Faster bundle creation with no inter-service latency - Single service responsible for all file operations - Simplifies error handling and monitoring Benefits: - Eliminates network roundtrip for PDF generation - Faster bundle creation and streaming - More reliable error handling - Better separation: odp-ui handles UI, odp-server handles data processing
- Changed final_size calculation from zip_buffer.seek() to len(getvalue()) - Ensures accurate file size is captured after ZIP compression - Previously was recording NULL values for bundle downloads - Now properly logs bundle file size to download_audit table for reporting
- Add file size validation and debug logging - Check if record_data is empty before PDF generation - Check if PDF blob is empty after generation - Add bundle_size_bytes to meta field for audit trail - Only set file_size if > 0 to catch empty bundles - Better error messages for troubleshooting This requires restarting the odp-server container to take effect.
- Add 'files_added' tracking for each file added to bundle - Record filename, size, DOI, and type for each file - Include 'files_in_bundle' in audit metadata with complete file list - Add 'total_files' count to audit metadata - Include 'zip_file_size' in audit metadata for clarity This provides complete visibility into what was downloaded and file sizes.
Changed Session.begin() to Session.begin() as session to properly use the session context when adding audit records. This ensures file_size and other metadata are actually persisted to the database.
Add doi and record_id to the list of fields copied from the audit payload to the meta JSONB field. This enables the admin interface to generate direct links to catalog records for both single record and bundle downloads.
Add the complete meta JSONB field to the response so that the admin interface can access DOI, record_id, and other metadata for generating record links in the download logs table.
Use `with Session.begin() as session:` pattern instead of just `with Session.begin():` to properly capture the session instance. Call session.add() and session.flush() on the session instance rather than the Session class. This aligns with the pattern used in catalog.py and ensures proper transaction handling.
Use the standard `with Session() as session:` pattern instead of `Session.begin()` context manager for both download audit and bundle endpoints. The latter returns a SessionTransaction object which doesn't have the add() method. Changes: - download.py: Create_download_audit endpoint - catalog.py: Create_download_bundle endpoint Both now use session.commit() instead of session.flush() for explicit transaction control. ored-By: Claude <noreply@anthropic.com>
Document expected payload fields for POST /download/audit endpoint, including the optional but important 'doi' and 'record_id' fields that external systems (like MIMS) should include when logging downloads. This clarifies what information external systems need to send to properly track record downloads.
Enhanced download statistics with: - Top 20 organisations by download count and unique users - Top 10 most downloaded records (by DOI) with user engagement - Daily downloads time-series with successful/failed breakdown
- Replace incorrect func.filter() usage with CASE/SUM pattern - PostgreSQL filter() function requires correct aggregate syntax - Use CASE WHEN ... THEN 1 ELSE 0 wrapped in SUM() for conditional aggregates - Add missing 'case' import from sqlalchemy - Fixes: "function filter(boolean, integer) does not exist" error This resolves the 500 error in GET /download/stats endpoint. Co-Authored-By: nando <n.bingani@saeon.nrf.ac.za>
…irements.txt added dependemcies
Replace raw download URLs in CSV export with actual MIMS catalog links. Single record downloads now include the DOI link, and ZIP bundles include the subset query with all DOIs. Uses MIMS_CATALOG_URL from environment. Authored-By: Nando Bingani <n.bingani@saeon.nrf.ac.za>
- Add metadata_adapters.py: Converts DataCite4 and ISO19115 formats to unified RecordMetadata - Add metadata_pdf.py: Core PDF generation using ReportLab with schema-agnostic approach - Add POST /catalog/metadata/generate-pdf: API endpoint for PDF generation - Migrate bundle creation to use unified module with fallback to legacy function - Include comprehensive error handling and download audit logging - All changes follow existing patterns in codebase (download-audit endpoint style) This consolidates PDF generation to single source of truth (ODP Server). Authored by nando-bingani <n.bingani@saeon.nrf.ac.za>
The /catalog/metadata/generate-pdf and /catalog/{catalog_id}/records/{record_id}/metadata.pdf
endpoints were logging audit entries every time they were called.
However, these endpoints are typically called internally by other endpoints
(MIMS downloads, ZIP bundle generation) which already handle their own audit
logging. This was causing duplicate audit entries.
Solution: Remove audit logging from internal PDF generation endpoints.
The calling endpoints handle audit logging appropriately.
This fixes the issue where downloading a ZIP bundle with 2 records would create
3 audit entries instead of 1.
Authored by nando-bingani <n.bingani@saeon.nrf.ac.za>
Remove all debug print() statements from catalog.py: - Remove 'Empty record data' warnings - Remove 'Falling back to legacy PDF generation' info logs - Remove 'Generated empty PDF' warnings - Remove 'Added to ZIP' debug logs - Remove error processing messages - Remove ZIP buffer empty warnings These debug statements are not needed in production and clutter the code. Error handling is maintained via exceptions. Authored by nando-bingani <n.bingani@saeon.nrf.ac.za>
| # output_encoding = utf-8 | ||
|
|
||
| # sqlalchemy.url = driver://user:pass@localhost/dbname | ||
| sqlalchemy.url = postgresql://odp_user:pass@localhost:5432/odp_db |
There was a problem hiding this comment.
Is this a change for testing?
odp/api/routers/catalog.py
Outdated
| ) | ||
|
|
||
| @router.post('/download/bundle') | ||
| async def create_download_bundle( |
There was a problem hiding this comment.
- The auth dependency is not included here.
- Two blank lines above the function. Use formatting shortcu
odp/api/routers/catalog.py
Outdated
|
|
||
|
|
||
| # ============================================================================ | ||
| # PDF Generation API Endpoints (Unified Module) |
There was a problem hiding this comment.
We don't generally eave comments like this. Nae a function in such a way that the user knows what it does
odp/api/routers/catalog.py
Outdated
|
|
||
|
|
||
| # ============================================================================ | ||
| # PDF Generation Utility (Legacy - Kept for backward compatibility) |
There was a problem hiding this comment.
Legacy? This is a completely new part of the system?
odp/api/routers/catalog.py
Outdated
| # PDF Generation Utility (Legacy - Kept for backward compatibility) | ||
| # ============================================================================ | ||
|
|
||
| def build_metadata_pdf(record_data: dict) -> BytesIO: |
There was a problem hiding this comment.
This should probably live outside the router so as not to clutter it
New features: - New library module: odp/lib/zip_generator.py for server-side ZIP generation - ZIP bundles include metadata PDFs and data files - Support for both single record and bulk downloads API Changes: - New endpoint: POST /catalog/generate-zip-bundle - UserData model for audit logging (name, email, organisation) - Authentication required: ODPScope.CATALOG_READ Bug Fixes: - Fixed download.py to use record_ids instead of dois in zip_bundle audit export - Cleaner audit logging structure with proper metadata tracking This enables the unified download flow across both search results and detail pages, supporting both single record and bulk batch downloads with consistent audit logging. Author: nando-bingani <n.bingani@saeon.nrf.ac.za>
…erdata extraction
dylanpivo
left a comment
There was a problem hiding this comment.
On top of the queries. Please look at things like comments, naming and formatting.
In general, have a look at the way things are already being done in the system and use the same methods.
odp/lib/zip_generator.py
Outdated
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def fetch_record_data(id_or_doi: str) -> Optional[Dict[str, Any]]: |
There was a problem hiding this comment.
I don't think you should fetch by either ID or DOI. Fetch only by id. Also consider fetching the metadata from the record. It will simplify the possibility of having more than one metadata record.
odp/lib/zip_generator.py
Outdated
| } | ||
| except Exception as e: | ||
| logger.error(f"Error fetching record {id_or_doi}: {str(e)}") | ||
| return None |
odp/lib/zip_generator.py
Outdated
| for record_id in record_ids: | ||
| try: | ||
| # 1. Fetch | ||
| data_package = fetch_record_data(record_id) |
There was a problem hiding this comment.
Fetching from the DB is an expensive operation. Why not rather fetch all the metadata records in one DB call and iterate through those rather than doing a fetch for each id?
odp/lib/zip_generator.py
Outdated
| } | ||
|
|
||
| # Switch logic based on count | ||
| if is_single_record: |
There was a problem hiding this comment.
Rather than having an is/else, why not set values that are dependent on it being a single or multiple records, and then set meta_data?
odp/lib/zip_generator.py
Outdated
There was a problem hiding this comment.
The naming of files and functions is very important. It tells us at a glance what the role of the files and functions are. zip_generator is very generic and sounds like it will generate a generic zip. This develops a zipped file based on a very specific context. Try name it something that gives us a better clue about what it's doing
There was a problem hiding this comment.
This router has a lot of internal logic. It builds csv's etc. This shouldn't be the case. We are looking for single responsibility where possible.
There was a problem hiding this comment.
Have a look at how we use factory boy to generate test data
DOWNLOAD_AUDIT_README.md
Outdated
There was a problem hiding this comment.
This new download code should not be so strikingly new and strange that it requires it's own readme?
POPULATE_DATA_GUIDE.md
Outdated
There was a problem hiding this comment.
Do we need a readme for this? We have a structure for populating test data. That same structure should be followed which will remove the need for a readme.
test_metadata_standalone.py
Outdated
There was a problem hiding this comment.
Only add tests in the test package.
|
|
||
| class UserData(BaseModel): | ||
| """User information for audit logging.""" | ||
| name: str = Field(..., description="Full name of the user", min_length=1) |
There was a problem hiding this comment.
When collecting info such as the user's name, does this not trigger POPIA issues? Do we need to store the user's name in the database or perhaps we have to obfuscate this.
odp/api/routers/catalog.py
Outdated
| user_agent = request.headers.get('user-agent') | ||
|
|
||
| # Delegate to library function for ZIP generation | ||
| from odp.lib.zip_generator import create_zip_bundle |
There was a problem hiding this comment.
I recommend doing this import at the beginning of the file for better performance (since this feature might be used a lot) and better readability.
…iting
This commit addresses feedback by refactoring key components to ensure
architectural consistency, improve performance, and align with the existing
system's design patterns.
Key Changes:
- Refactored ZIP Generation:
- Renamed 'odp/lib/zip_generator.py' to 'odp/lib/bundle_generator.py' to
better reflect its specific context of bundling metadata PDFs and data
files.
- Optimized database performance by replacing iterative individual record
fetches with a bulk fetch of catalog records in a single query.
- Standardized record lookups to use IDs exclusively for internal metadata
retrieval.
- Improved Metadata Adaptation & PDF Generation:
- Renamed the metadata PDF module to 'odp/lib/pdf_generator.py' for
clarity.
- Unified the internal representation of record metadata into a
'RecordMetadata' dataclass to streamline processing across various
schemas.
- Extracted person information parsing into a shared helper to eliminate
code duplication between DataCite and ISO19115 adapters.
- Refactored 'AutoDetectAdapter' from a class into a factory function
'adapt_metadata' to maintain clean separation of concerns.
- Service-Router Separation:
- Decoupled business logic from the API layer by moving CSV generation,
statistics calculation, and heavy filtering into a new dedicated
'odp/lib/download_service.py' module.
- Updated 'odp/api/routers/download.py' to delegate all core logic to the
service layer, adhering to the single responsibility principle.
- Database & Models Consistency:
- Standardized auditing field names by renaming 'meta_data' to 'meta'
across audit models for cross-system consistency.
- Utilized config-based URL management instead of direct environment
lookups and removed hardcoded dev defaults.
- Testing & Cleanup:
- Integrated 'factory_boy' and 'faker' into 'test/factories.py' for robust,
dynamic test data generation across the suite.
- Removed redundant standalone documentation and test scripts,
consolidating tests into the standard project structure (e.g.,
'test/lib/test_metadata_pdf.py').
- Applied PEP 8 formatting standards and removed redundant block comments
as requested during review.
Implement comprehensive download audit tracking system with PDF metadata generation, analytics endpoints, and CSV export functionality.
Key Features:
/download/logs,/download/stats,/download/export/csvNew Endpoints:
POST /download/audit- Log download events with user metadataGET /download/logs- Paginated logs with filtering (date, email, org, type)GET /download/stats- Aggregate statistics (total, unique users, top records, trends)GET /download/export/csv- Export filtered logs as CSVPOST /catalog/metadata/generate-pdf- Generate PDF from metadataPOST /catalog/download/bundle- Create ZIP bundle with PDFsDatabase Changes:
download_audittable with JSONBmetafieldFiles Modified: ~15 files (API routers, models, PDF generation modules, migrations)
Bug Fixes:
Testing: