Features-api: Add download audit system with PDF generation, analytics, and CSV export by nando-bingani · Pull Request #45 · SAEON/odp-server

nando-bingani · 2026-02-04T13:14:58Z

Implement comprehensive download audit tracking system with PDF metadata generation, analytics endpoints, and CSV export functionality.

Key Features:

Download Audit API - Track all downloads with flexible JSONB metadata storage
Analytics Endpoints - /download/logs, /download/stats, /download/export/csv
Unified PDF Generation - Schema-agnostic PDF creation (DataCite4 + ISO19115)
ZIP Bundle API - Server-side multi-record bundle creation with metadata PDFs
Catalog View Links - CSV export includes "View Record" and "View Bundle" links

New Endpoints:

POST /download/audit - Log download events with user metadata
GET /download/logs - Paginated logs with filtering (date, email, org, type)
GET /download/stats - Aggregate statistics (total, unique users, top records, trends)
GET /download/export/csv - Export filtered logs as CSV
POST /catalog/metadata/generate-pdf - Generate PDF from metadata
POST /catalog/download/bundle - Create ZIP bundle with PDFs

Database Changes:

download_audit table with JSONB meta field
7 indexes (timestamp, email, organisation, download_type, doi)

Files Modified: ~15 files (API routers, models, PDF generation modules, migrations)

Bug Fixes:

Fixed duplicate audit logging (3 entries → 1 entry per download)
Fixed SQLAlchemy session usage
Fixed SQL aggregate function errors
Removed debug print statements

Testing:

Database migrations tested
API endpoints validated with 200+ test audit records
PDF generation tested with both DataCite4 and ISO19115 schemas

**Backend API Endpoints:**
- GET /download/logs - Paginated download logs with filtering (date range, email, organisation, download_type)
- GET /download/stats - Download statistics (total downloads, unique users, data volume, success rate)
- GET /download/export/csv - CSV export of download logs

**Database Improvements:**
- Add indexes for efficient querying:
  - timestamp DESC for date range queries
  - client_id for user tracking
  - GIN indexes on meta JSON fields (email, organisation, download_type)
  - Composite index (timestamp DESC, client_id) for common filter patterns

**Features:**
- Paginated results with configurable page size (max 200)
- Multiple filter options (date range, email, organisation, download type)
- Aggregate statistics calculation
- CSV export with dynamic filename
- Graceful error handling with proper HTTP status codes

Alembic Migration: 2025_11_14_b7dd3a950ed5_add_download_audit_indexes

…y data

**New Files:**
1. scripts/populate_download_audit.py
   - Python script for generating realistic dummy download audit records
   - Uses Faker library for realistic names, emails, organisations
   - Generates 100 records spread over 30 days (configurable)
   - Simulates 95% success rate, 5% failure rate
   - Creates mix of single record and ZIP bundle downloads
   - Supports --count and --days command line arguments
   - Prints statistics after insertion
2. sql/download/populate_dummy_data.sql
   - Pre-written SQL script with 20 hand-crafted records
   - Quick option for manual database population
   - Uses PostgreSQL JSONB functions
   - Includes realistic South African organisations
   - Mix of download types and dates
   - Can be run directly with psql or from admin tools
3. DOWNLOAD_AUDIT_README.md
   - Complete documentation for the download audit system
   - Instructions for both population methods
   - API endpoint documentation with examples
   - Testing guide for reporting features
   - Performance notes and database schema
   - Data clearing instructions
   - Sample organisations and user data

**Features:**
- Generates realistic user data (names, emails from various organisations)
- South African IP addresses and real organisation names
- Proper file size distributions
- Complete metadata for both single and batch downloads
- JSONB meta field structure matching production data
- Timezone-aware timestamps distributed across date range

**New Files:**
1. scripts/populate_simple.py
   - Simplified Python script for populating download_audit table
   - NO external dependencies required (no Faker)
   - Generates 50 realistic dummy records
   - Spreads over last 30 days
   - Mix of single record and ZIP bundle downloads
   - 95% success rate, 5% failure rate
   - Shows statistics after insertion
   - Easy to run: python scripts/populate_simple.py
2. POPULATE_DATA_GUIDE.md
   - Complete guide for populating download audit table
   - Three methods: Python (simple), SQL, or Python (with Faker)
   - Step-by-step instructions
   - Database connection troubleshooting
   - Verification commands
   - Data clearing instructions
   - Sample data details and format
   - Expected output examples

**Usage:**
```
cd odp-server
python scripts/populate_simple.py
```

This will insert 50 realistic test records into download_audit table with proper metadata matching production format.

- Implement POST /catalog/download/bundle endpoint
- Streams ZIP file containing metadata PDFs for multiple records
- Validates record existence and enforces 2GB size limit
- Logs downloads to download_audit table with bundle metadata
- Uses built-in zipfile module (no external dependencies)
- Returns StreamingResponse for memory efficiency

This replaces client-side ZIP generation which caused browser memory issues.

… odp-ui

- Move build_metadata_pdf() function to odp-server
- Add ReportLab to odp-server dependencies
- Update /catalog/download/bundle to generate PDFs internally
- Eliminate HTTP calls to odp-ui during bundle creation
- Faster bundle creation with no inter-service latency
- Single service responsible for all file operations
- Simplifies error handling and monitoring

Benefits:
- Eliminates network roundtrip for PDF generation
- Faster bundle creation and streaming
- More reliable error handling
- Better separation: odp-ui handles UI, odp-server handles data processing

- Changed final_size calculation from zip_buffer.seek() to len(getvalue())
- Ensures accurate file size is captured after ZIP compression
- Previously was recording NULL values for bundle downloads
- Now properly logs bundle file size to download_audit table for reporting
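To illustrate the size fix described above, here is a minimal sketch of measuring an in-memory ZIP with `len(getvalue())`. The helper name `build_zip` and the sample file name are hypothetical, and the exact way the original code used `seek()` is not shown in this PR, so the `seek(0)` line only demonstrates one way a seek-based size can go wrong.

```python
import io
import zipfile


def build_zip(files: dict[str, bytes]) -> io.BytesIO:
    """Write a name -> bytes mapping into an in-memory ZIP archive (hypothetical helper)."""
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, payload in files.items():
            archive.writestr(name, payload)
    return zip_buffer


zip_buffer = build_zip({"record-1/metadata.pdf": b"%PDF-1.4 placeholder" * 100})

# len(getvalue()) measures the whole archive, regardless of the stream position.
final_size = len(zip_buffer.getvalue())

# seek(0) returns the new stream position (0), so its result is not a size.
rewound_position = zip_buffer.seek(0)
```

Taking the size only after the `ZipFile` context has closed matters as well, because the central directory is written on close.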

- Add file size validation and debug logging
- Check if record_data is empty before PDF generation
- Check if PDF blob is empty after generation
- Add bundle_size_bytes to meta field for audit trail
- Only set file_size if > 0 to catch empty bundles
- Better error messages for troubleshooting

This requires restarting the odp-server container to take effect.

- Add 'files_added' tracking for each file added to bundle
- Record filename, size, DOI, and type for each file
- Include 'files_in_bundle' in audit metadata with complete file list
- Add 'total_files' count to audit metadata
- Include 'zip_file_size' in audit metadata for clarity

This provides complete visibility into what was downloaded and file sizes.
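The per-file tracking described above could be sketched roughly as follows. The function name `bundle_with_manifest`, the file naming scheme, the `"metadata_pdf"` type label, and the example DOIs are assumptions for illustration; only the metadata keys `files_in_bundle`, `total_files`, and `zip_file_size` come from the commit message.

```python
import io
import zipfile


def bundle_with_manifest(pdfs: dict[str, bytes]) -> tuple[bytes, dict]:
    """Zip PDFs keyed by DOI and return the archive bytes plus audit metadata."""
    files_added = []
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for doi, pdf_bytes in pdfs.items():
            # Hypothetical naming scheme: make the DOI safe to use as a file name.
            filename = doi.replace("/", "_") + ".pdf"
            archive.writestr(filename, pdf_bytes)
            files_added.append(
                {"filename": filename, "size": len(pdf_bytes), "doi": doi, "type": "metadata_pdf"}
            )
    zip_bytes = buffer.getvalue()
    meta = {
        "files_in_bundle": files_added,
        "total_files": len(files_added),
        "zip_file_size": len(zip_bytes),
    }
    return zip_bytes, meta


zip_bytes, meta = bundle_with_manifest(
    {"10.1234/example-a": b"a" * 10, "10.1234/example-b": b"b" * 20}
)
```

Recording the uncompressed size per file and the compressed size for the archive keeps the two numbers distinguishable in reports.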

Changed Session.begin() to Session.begin() as session to properly use the session context when adding audit records. This ensures file_size and other metadata are actually persisted to the database.

Add doi and record_id to the list of fields copied from the audit payload to the meta JSONB field. This enables the admin interface to generate direct links to catalog records for both single record and bundle downloads.

Add the complete meta JSONB field to the response so that the admin interface can access DOI, record_id, and other metadata for generating record links in the download logs table.

Use `with Session.begin() as session:` pattern instead of just `with Session.begin():` to properly capture the session instance. Call session.add() and session.flush() on the session instance rather than the Session class. This aligns with the pattern used in catalog.py and ensures proper transaction handling.

Use the standard `with Session() as session:` pattern instead of `Session.begin()` context manager for both download audit and bundle endpoints. The latter returns a SessionTransaction object which doesn't have the add() method.

Changes:
- download.py: create_download_audit endpoint
- catalog.py: create_download_bundle endpoint

Both now use session.commit() instead of session.flush() for explicit transaction control.

Co-Authored-By: Claude <noreply@anthropic.com>

Document expected payload fields for POST /download/audit endpoint, including the optional but important 'doi' and 'record_id' fields that external systems (like MIMS) should include when logging downloads. This clarifies what information external systems need to send to properly track record downloads.

Enhanced download statistics with:
- Top 20 organisations by download count and unique users
- Top 10 most downloaded records (by DOI) with user engagement
- Daily downloads time-series with successful/failed breakdown

- Replace incorrect func.filter() usage with CASE/SUM pattern
- PostgreSQL filter() function requires correct aggregate syntax
- Use CASE WHEN ... THEN 1 ELSE 0 wrapped in SUM() for conditional aggregates
- Add missing 'case' import from sqlalchemy
- Fixes: "function filter(boolean, integer) does not exist" error

This resolves the 500 error in GET /download/stats endpoint.

Co-Authored-By: nando <n.bingani@saeon.nrf.ac.za>
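The CASE/SUM pattern named above can be tried in isolation. The sketch below uses the standard-library `sqlite3` module rather than PostgreSQL or SQLAlchemy, and a simplified two-column `download_audit` table that is an assumption for illustration (the real table stores details in a JSONB meta field).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical schema: one row per download attempt.
conn.execute("CREATE TABLE download_audit (day TEXT, success INTEGER)")
conn.executemany(
    "INSERT INTO download_audit VALUES (?, ?)",
    [("2026-02-01", 1), ("2026-02-01", 0), ("2026-02-02", 1), ("2026-02-02", 1)],
)

# Conditional aggregates: each CASE yields 1 or 0 per row, and SUM counts the 1s.
daily = conn.execute(
    """
    SELECT day,
           COUNT(*) AS total,
           SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) AS successful,
           SUM(CASE WHEN success = 0 THEN 1 ELSE 0 END) AS failed
    FROM download_audit
    GROUP BY day
    ORDER BY day
    """
).fetchall()
```

In SQLAlchemy the same expression is usually written as `func.sum(case((condition, 1), else_=0))`, which is presumably why the commit adds the `case` import.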

…irements.txt added dependencies

Replace raw download URLs in CSV export with actual MIMS catalog links. Single record downloads now include the DOI link, and ZIP bundles include the subset query with all DOIs. Uses MIMS_CATALOG_URL from environment. Authored-By: Nando Bingani <n.bingani@saeon.nrf.ac.za>
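A rough sketch of how such links might be built during CSV export follows. The URL paths (`/records/...`, `/subset?dois=...`), the row keys, the column names, and the placeholder base URL are all assumptions; the PR only states that single downloads get a DOI link, bundles get a subset query with all DOIs, and the base comes from MIMS_CATALOG_URL.

```python
import csv
import io
from urllib.parse import quote, urlencode

# Placeholder value; the PR reads MIMS_CATALOG_URL from the environment/config.
MIMS_CATALOG_URL = "https://catalog.example.org"


def catalog_link(row: dict) -> str:
    """Return a catalog link for one audit row (hypothetical URL layout)."""
    if row["download_type"] == "zip_bundle":
        query = urlencode({"dois": ",".join(row["dois"])})
        return f"{MIMS_CATALOG_URL}/subset?{query}"
    return f"{MIMS_CATALOG_URL}/records/{quote(row['doi'], safe='/')}"


def export_csv(rows: list[dict]) -> str:
    """Serialise audit rows to CSV text with a link column."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["timestamp", "email", "download_type", "link"])
    for row in rows:
        writer.writerow([row["timestamp"], row["email"], row["download_type"], catalog_link(row)])
    return out.getvalue()


rows = [
    {"timestamp": "2026-02-01T10:00:00Z", "email": "a@example.org",
     "download_type": "single_record", "doi": "10.1234/a"},
    {"timestamp": "2026-02-01T11:00:00Z", "email": "b@example.org",
     "download_type": "zip_bundle", "dois": ["10.1234/a", "10.1234/b"]},
]
csv_text = export_csv(rows)
```

Using `urlencode` for the bundle query keeps commas and slashes in DOIs from breaking the link.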

- Add metadata_adapters.py: Converts DataCite4 and ISO19115 formats to unified RecordMetadata
- Add metadata_pdf.py: Core PDF generation using ReportLab with schema-agnostic approach
- Add POST /catalog/metadata/generate-pdf: API endpoint for PDF generation
- Migrate bundle creation to use unified module with fallback to legacy function
- Include comprehensive error handling and download audit logging
- All changes follow existing patterns in codebase (download-audit endpoint style)

This consolidates PDF generation to single source of truth (ODP Server).

Authored by nando-bingani <n.bingani@saeon.nrf.ac.za>
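The adapter idea described above (several metadata schemas converted into one internal representation before PDF generation) might look something like the sketch below. The field selection, the ISO19115 key names, the schema identifiers, and the explicit `schema` argument are assumptions for illustration; the actual contents of metadata_adapters.py are not shown in this PR.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RecordMetadata:
    """Unified, schema-independent view of a record (hypothetical field set)."""
    title: str
    creators: list[str] = field(default_factory=list)
    doi: Optional[str] = None


def from_datacite4(md: dict) -> RecordMetadata:
    # DataCite JSON keeps titles and creators as lists of objects.
    return RecordMetadata(
        title=md["titles"][0]["title"],
        creators=[c["name"] for c in md.get("creators", [])],
        doi=md.get("doi"),
    )


def from_iso19115(md: dict) -> RecordMetadata:
    # Key names here are assumed; adjust to the actual ISO19115 JSON layout.
    return RecordMetadata(
        title=md["title"],
        creators=[p["individualName"] for p in md.get("responsibleParties", [])],
        doi=md.get("doi"),
    )


ADAPTERS = {"datacite4": from_datacite4, "iso19115": from_iso19115}


def adapt_metadata(schema: str, md: dict) -> RecordMetadata:
    """Pick the adapter for a schema and return the unified record."""
    try:
        adapter = ADAPTERS[schema]
    except KeyError:
        raise ValueError(f"Unsupported metadata schema: {schema}") from None
    return adapter(md)


record = adapt_metadata(
    "datacite4",
    {"titles": [{"title": "Sea surface temperature"}], "creators": [{"name": "Doe, J."}], "doi": "10.1234/a"},
)
```

With this shape, the PDF code only ever sees `RecordMetadata`, so supporting another schema means adding one adapter function and one dictionary entry.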

The /catalog/metadata/generate-pdf and /catalog/{catalog_id}/records/{record_id}/metadata.pdf endpoints were logging audit entries every time they were called. However, these endpoints are typically called internally by other endpoints (MIMS downloads, ZIP bundle generation) which already handle their own audit logging. This was causing duplicate audit entries. Solution: Remove audit logging from internal PDF generation endpoints. The calling endpoints handle audit logging appropriately. This fixes the issue where downloading a ZIP bundle with 2 records would create 3 audit entries instead of 1. Authored by nando-bingani <n.bingani@saeon.nrf.ac.za>

Remove all debug print() statements from catalog.py:
- Remove 'Empty record data' warnings
- Remove 'Falling back to legacy PDF generation' info logs
- Remove 'Generated empty PDF' warnings
- Remove 'Added to ZIP' debug logs
- Remove error processing messages
- Remove ZIP buffer empty warnings

These debug statements are not needed in production and clutter the code. Error handling is maintained via exceptions.

Authored by nando-bingani <n.bingani@saeon.nrf.ac.za>

dylanpivo · 2026-02-06T09:20:28Z

migrate/alembic.ini

 # output_encoding = utf-8

-# sqlalchemy.url = driver://user:pass@localhost/dbname
+sqlalchemy.url = postgresql://odp_user:pass@localhost:5432/odp_db


Is this a change for testing?

dylanpivo · 2026-02-06T09:23:37Z

odp/api/routers/catalog.py

+    )
+
+@router.post('/download/bundle')
+async def create_download_bundle(


The auth dependency is not included here.

Two blank lines above the function. Use formatting shortcu

dylanpivo · 2026-02-06T09:33:39Z

odp/api/routers/catalog.py

+
+
+# ============================================================================
+# PDF Generation API Endpoints (Unified Module)


We don't generally leave comments like this. Name a function in such a way that the user knows what it does.

dylanpivo · 2026-02-06T09:34:27Z

odp/api/routers/catalog.py

+
+
+# ============================================================================
+# PDF Generation Utility (Legacy - Kept for backward compatibility)


Legacy? This is a completely new part of the system?

dylanpivo · 2026-02-06T09:39:55Z

odp/api/routers/catalog.py

+# PDF Generation Utility (Legacy - Kept for backward compatibility)
+# ============================================================================
+
+def build_metadata_pdf(record_data: dict) -> BytesIO:


This should probably live outside the router so as not to clutter it

New features:
- New library module: odp/lib/zip_generator.py for server-side ZIP generation
- ZIP bundles include metadata PDFs and data files
- Support for both single record and bulk downloads

API Changes:
- New endpoint: POST /catalog/generate-zip-bundle
- UserData model for audit logging (name, email, organisation)
- Authentication required: ODPScope.CATALOG_READ

Bug Fixes:
- Fixed download.py to use record_ids instead of dois in zip_bundle audit export
- Cleaner audit logging structure with proper metadata tracking

This enables the unified download flow across both search results and detail pages, supporting both single record and bulk batch downloads with consistent audit logging.

Author: nando-bingani <n.bingani@saeon.nrf.ac.za>

…erdata extraction

dylanpivo

On top of the queries, please look at things like comments, naming and formatting.
In general, have a look at the way things are already being done in the system and use the same methods.

dylanpivo · 2026-02-13T07:54:18Z

odp/lib/zip_generator.py

+logger = logging.getLogger(__name__)
+
+
+def fetch_record_data(id_or_doi: str) -> Optional[Dict[str, Any]]:


I don't think you should fetch by either ID or DOI. Fetch only by id. Also consider fetching the metadata from the record. It will simplify the possibility of having more than one metadata record.

dylanpivo · 2026-02-13T07:56:00Z

odp/lib/zip_generator.py

+            }
+    except Exception as e:
+        logger.error(f"Error fetching record {id_or_doi}: {str(e)}")
+        return None


Look at formatting

dylanpivo · 2026-02-13T08:02:32Z

odp/lib/zip_generator.py

+        for record_id in record_ids:
+            try:
+                # 1. Fetch
+                data_package = fetch_record_data(record_id)


Fetching from the DB is an expensive operation. Why not rather fetch all the metadata records in one DB call and iterate through those rather than doing a fetch for each id?
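The suggestion above could be sketched like this: one query with an `IN` clause, then iteration over the results in memory. The example uses the standard-library `sqlite3` module and a hypothetical `catalog_record` table with two columns; the real code would go through the project's SQLAlchemy models instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table standing in for the catalog record model.
conn.execute("CREATE TABLE catalog_record (record_id TEXT PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO catalog_record VALUES (?, ?)",
    [("a", "Record A"), ("b", "Record B"), ("c", "Record C")],
)


def fetch_records(conn: sqlite3.Connection, record_ids: list[str]) -> dict[str, str]:
    """Fetch all requested records in a single query, keyed by record id."""
    if not record_ids:
        return {}
    placeholders = ", ".join("?" for _ in record_ids)
    rows = conn.execute(
        f"SELECT record_id, title FROM catalog_record WHERE record_id IN ({placeholders})",
        record_ids,
    ).fetchall()
    return dict(rows)


requested = ["a", "c", "missing"]
records = fetch_records(conn, requested)

# Iterate in request order; ids absent from the result were not found.
not_found = [record_id for record_id in requested if record_id not in records]
```

Keying the result by id keeps the per-record loop unchanged apart from replacing the fetch with a dictionary lookup.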

dylanpivo · 2026-02-13T08:07:10Z

odp/lib/zip_generator.py

+        }
+
+        # Switch logic based on count
+        if is_single_record:


Rather than having an if/else, why not set the values that depend on whether it is a single record or multiple records, and then set meta_data once?
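One way to read this suggestion is sketched below: compute the values that differ between the two cases first, then build the metadata dictionary in a single place. The function name, the key names, and the `"single_record"` / `"zip_bundle"` labels are assumptions for illustration.

```python
def build_audit_meta(record_ids: list[str], zip_size: int) -> dict:
    """Build audit metadata once, from values that depend on the record count."""
    is_single_record = len(record_ids) == 1

    # Only these values differ between the single-record and bundle cases.
    download_type = "single_record" if is_single_record else "zip_bundle"
    record_key = "record_id" if is_single_record else "record_ids"
    record_value = record_ids[0] if is_single_record else record_ids

    return {
        "download_type": download_type,
        record_key: record_value,
        "total_files": len(record_ids),
        "zip_file_size": zip_size,
    }


single = build_audit_meta(["abc"], 1024)
bundle = build_audit_meta(["abc", "def"], 4096)
```

The shared keys are then written once, so a new field cannot be added to one branch and forgotten in the other.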

dylanpivo · 2026-02-13T08:12:02Z

odp/lib/zip_generator.py

The naming of files and functions is very important. It tells us at a glance what the roles of the files and functions are. zip_generator is very generic and sounds like it will generate a generic zip. This builds a zipped file based on a very specific context. Try to name it something that gives us a better clue about what it's doing.

dylanpivo · 2026-02-13T08:33:45Z

odp/api/routers/download.py

This router has a lot of internal logic. It builds CSVs, etc. This shouldn't be the case. We are looking for single responsibility where possible.

dylanpivo · 2026-02-13T08:35:42Z

test/lib/test_metadata_pdf.py

Have a look at how we use factory boy to generate test data

dylanpivo · 2026-02-13T08:37:11Z

DOWNLOAD_AUDIT_README.md

This new download code should not be so strikingly new and strange that it requires its own readme?

dylanpivo · 2026-02-13T08:38:17Z

POPULATE_DATA_GUIDE.md

Do we need a readme for this? We have a structure for populating test data. That same structure should be followed which will remove the need for a readme.

dylanpivo · 2026-02-13T08:39:06Z

test_metadata_standalone.py

Only add tests in the test package.

gracezhou-tech · 2026-02-16T15:20:08Z

odp/api/routers/catalog.py


+class UserData(BaseModel):
+    """User information for audit logging."""
+    name: str = Field(..., description="Full name of the user", min_length=1)


When collecting info such as the user's name, does this not trigger POPIA issues? Do we need to store the user's name in the database or perhaps we have to obfuscate this.

gracezhou-tech · 2026-02-16T15:47:30Z

odp/api/routers/catalog.py

+        user_agent = request.headers.get('user-agent')
+
+        # Delegate to library function for ZIP generation
+        from odp.lib.zip_generator import create_zip_bundle


I recommend doing this import at the beginning of the file for better performance (since this feature might be used a lot) and better readability.

…iting This commit addresses feedback by refactoring key components to ensure architectural consistency, improve performance, and align with the existing system's design patterns.

Key Changes:
- Refactored ZIP Generation:
  - Renamed 'odp/lib/zip_generator.py' to 'odp/lib/bundle_generator.py' to better reflect its specific context of bundling metadata PDFs and data files.
  - Optimized database performance by replacing iterative individual record fetches with a bulk fetch of catalog records in a single query.
  - Standardized record lookups to use IDs exclusively for internal metadata retrieval.
- Improved Metadata Adaptation & PDF Generation:
  - Renamed the metadata PDF module to 'odp/lib/pdf_generator.py' for clarity.
  - Unified the internal representation of record metadata into a 'RecordMetadata' dataclass to streamline processing across various schemas.
  - Extracted person information parsing into a shared helper to eliminate code duplication between DataCite and ISO19115 adapters.
  - Refactored 'AutoDetectAdapter' from a class into a factory function 'adapt_metadata' to maintain clean separation of concerns.
- Service-Router Separation:
  - Decoupled business logic from the API layer by moving CSV generation, statistics calculation, and heavy filtering into a new dedicated 'odp/lib/download_service.py' module.
  - Updated 'odp/api/routers/download.py' to delegate all core logic to the service layer, adhering to the single responsibility principle.
- Database & Models Consistency:
  - Standardized auditing field names by renaming 'meta_data' to 'meta' across audit models for cross-system consistency.
  - Utilized config-based URL management instead of direct environment lookups and removed hardcoded dev defaults.
- Testing & Cleanup:
  - Integrated 'factory_boy' and 'faker' into 'test/factories.py' for robust, dynamic test data generation across the suite.
  - Removed redundant standalone documentation and test scripts, consolidating tests into the standard project structure (e.g., 'test/lib/test_metadata_pdf.py').
  - Applied PEP 8 formatting standards and removed redundant block comments as requested during review.

nando-bingani added 25 commits September 9, 2025 10:40

Catalog subset endpoint

5228008

added keywords logic on the catalog search endpoint

0026212

download audit alembic update

6c704e1

Added download audit model and endpoint

ae281d5

Fix Session.add() issue in bundle download audit logging

000afe7

Store DOI and record_id in download audit meta

27a0ee1

Include full meta object in download logs API response

9c083bd

Add comprehensive analytics to /stats endpoint

da23ce8

removed report lab comments and added report lab via pip sync on requ…

28fb8d7

…irements.txt added dependencies

nando-bingani requested review from dylanpivo and gracezhou-tech February 4, 2026 13:14

dylanpivo reviewed Feb 6, 2026

nando-bingani added 2 commits February 10, 2026 18:37

changed meta_data to audit_mata for consistency, removed redundant us…

b3a8474

…erdata extraction

dylanpivo requested changes Feb 13, 2026

gracezhou-tech reviewed Feb 16, 2026
