This application provides a web interface to upload a CSV file via URL, process its data, and retrieve various statistics and metrics.
- Features
- Project Structure
- Setup and Installation
- Running the Application
- Usage and Endpoints
- Data Definitions (Processing Statistics)
- Error Handling
- CSV Upload via URL: Easily provide a URL to your CSV file for processing.
- Data Cleaning: Handles blank rows, structural malformations, content-based malformations, and duplicate entries.
- Comprehensive Processing Statistics: Get detailed counts of rows at various stages of processing.
- Sales Metrics Calculation: Compute key sales metrics grouped by month or year.
- In-Memory Storage: Processed file data is stored temporarily in memory for quick access.
- FastAPI Backend: A robust and high-performance API.
- Jinja2 Templates: Simple web interface for file upload.
.
├── main.py
├── requirements.txt
└── app/
├── services/
│ ├── file_handler.py
│ └── metrics_calculator.py
└── templates/
└── form.html
main.py: The main FastAPI application file, handling routing and endpoint definitions.requirements.txt: Lists all Python dependencies required for the project.app/services/: Contains core business logic.file_handler.py: Responsible for downloading CSVs, performing detailed data analysis (counting various row types), and storing processed data in memory.metrics_calculator.py: Calculates sales-related metrics from the cleaned DataFrame.
app/templates/: Stores Jinja2 HTML temp lates.form.html: The web form for CSV URL submission.
Follow these steps to set up and run the application locally.
- Python 3.13+
pip(Python package installer)
It's highly recommended to use a virtual environment to manage project dependencies.
-
Navigate to your project's root directory (where
main.pyis located):cd your_project_directory -
Create a virtual environment (e.g., named
.venv):python -m venv .venv
-
Activate the virtual environment:
- On Windows:
.\.venv\Scripts\activate
- On macOS/Linux:
source ./.venv/bin/activate
You should see
(.venv)or similar prefix in your terminal prompt, indicating the virtual environment is active. - On Windows:
With your virtual environment activated, install all required Python packages:
pip install -r requirements.txtrequirements.txt content:
(Please ensure your requirements.txt contains these. If you need to generate one, run pip freeze > requirements.txt after manually installing fastapi, uvicorn, pandas, requests, jinja2):
fastapi==0.111.0 # Or your preferred version
uvicorn==0.30.1 # Or your preferred version
pandas==2.2.2 # Or your preferred version
requests==2.32.3 # Or your preferred version
Jinja2==3.1.4 # Or your preferred version
python-multipart==0.0.9 # Required for form parsing in FastAPI
(Adjust versions to match your exact setup if you have specific requirements.)
Once dependencies are installed, you can start the Uvicorn server:
uvicorn app.main:app --reloadmain: Refers to themain.pyfile.app: Refers to theFastAPI()instance namedappinsidemain.py.--reload: (Optional) Automatically reloads the server on code changes, useful for development.
You should see output similar to this, indicating the server is running:
INFO: Will watch for changes in these directories: ['/path/to/your_project_directory']
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO: Started reloader process [PID]
INFO: Started server process [PID]
INFO: Waiting for application startup.
INFO: Application startup complete.
The application will be accessible at http://127.0.0.1:8000.
Open your web browser and navigate to the application's root URL:
- URL:
http://127.0.0.1:8000/
You will see a simple form where you can enter the URL of your CSV file. Enter the URL and click "Submit".
Example URL (replace with your actual CSV file URL):
https://example.com/path/to/your_data.csv
The response will provide a file_id which you will use for subsequent API calls.
{
"message": "File processed successfully",
"file_id": "a1b2c3d4-e5f6-7890-1234-567890abcdef"
}You can also use a tool like curl or Postman to upload a CSV URL directly to the API endpoint:
- Endpoint:
POST /upload - Method:
POST - Content-Type:
application/x-www-form-urlencoded - Form Field:
csv_url(string)
Example curl command:
curl -X POST "http://127.0.0.1:8000/upload" \
-H "Content-Type: application/x-www-form-urlencoded" \
--data-urlencode "csv_url=https://your-csv-file-url.com/order_items_data.csv"The response will be similar to the web form's response, providing a file_id.
Once you have a file_id from the upload step, you can retrieve detailed processing statistics for that file.
- Endpoint:
GET /api/v1/order-items/uploads/{file_id}/processing-stats - Method:
GET - Path Parameter:
file_id(string, the ID returned from/upload)
Example curl command:
curl -X GET "http://127.0.0.1:8000/api/v1/order-items/uploads/ca7457af-24ff-4aee-90a5-e913da494ac8/processing-stats"Example JSON Response:
{
"uploaded_at": "2025-07-02T11:13:20.172913Z",
"durations": {
"download_seconds": 129,
"processing_seconds": 0,
"total_seconds": 129,
"formatted": {
"download": "0 days 00:02:09.158386",
"processing": "0 days 00:00:00"
}
},
"rows": {
"total": 8369150,
"blank": 624065,
"malformed": 115512,
"encoding_errors": 0,
"duplicated": 867428,
"sanitised": 3318926,
"valid": 3203414,
"usable": 2335986
},
"outcome": {
"accepted": 2335986,
"rejected": 6033164
}
}You can also retrieve sales metrics grouped by month or year.
- Endpoint:
GET /api/v1/order-items/uploads/{file_id}/metrics - Method:
GET - Path Parameter:
file_id(string, the ID returned from/upload) - Query Parameter:
groupby(string, eithermonthoryear)
Example curl command (grouped by month):
curl -X GET "http://127.0.0.1:8000/api/v1/order-items/uploads/ca7457af-24ff-4aee-90a5-e913da494ac8/metrics?groupby=month"Example JSON Response:
{
"group_by": "month",
"start_date": "2025-01-01",
"end_date": "2025-05-31",
"uploaded_at": "2025-07-02T11:13:20.172913Z",
"grand_totals": {
"total_orders": 3074347,
"gross_sales": 341859035.58000004,
"net_sales": 384614250.4,
"grand_total": 334420943.48,
"most_popular_product_sku": "uxm-68468202",
"least_popular_product_sku": "JnV-31918656Ú¿¡"
},
"metrics": [
{
"period": "2025-01",
"total_orders": 630656,
"gross_sales": 70146826.55,
"net_sales": 78919850.69,
"grand_total": 68618129.45,
"most_popular_product_sku": "nli-44565082",
"least_popular_product_sku": "ALv-61199172¬�þ"
},
{
"period": "2025-02",
"total_orders": 568043,
"gross_sales": 63195782.17,
"net_sales": 71103521.39,
"grand_total": 61817565.9,
"most_popular_product_sku": "jUp-18407667",
"least_popular_product_sku": "amB-95367729áúí"
},
{
"period": "2025-03",
"total_orders": 633044,
"gross_sales": 70436231.69,
"net_sales": 79242206.72,
"grand_total": 68917049.36,
"most_popular_product_sku": "amB-95367729",
"least_popular_product_sku": "uxm-68468202�ÁÖ"
},
{
"period": "2025-04",
"total_orders": 611383,
"gross_sales": 67925220.83,
"net_sales": 76424456.19,
"grand_total": 66457754.79,
"most_popular_product_sku": "uHc-41532924",
"least_popular_product_sku": "uxm-68468202åô�"
},
{
"period": "2025-05",
"total_orders": 631221,
"gross_sales": 70154974.34,
"net_sales": 78924215.41,
"grand_total": 68610443.98,
"most_popular_product_sku": "fDQ-79436990",
"least_popular_product_sku": "raO-21663955éÍÜ"
}
]
}(Note: Example metrics might vary based on the actual content of your CSV.)
The rows and outcome sections in the processing statistics provide a detailed breakdown of the data quality and processing results. Here are the precise definitions:
total: The total count of data lines identified in the original CSV file, excluding the header row.blank: The number of data lines that were entirely empty or consisted solely of whitespace characters.malformed: The number of rows that were successfully parsed into columns but contained invalid or unparseable data in critical fields (order_id,sku,item_price,item_tax) after basic cleaning (e.g., non-numeric values in numeric fields, or empty required text fields). This does not include rows that failed structural CSV parsing.encoding_errors: The count of different encoding attempts that failed to successfully decode the raw file content before one succeeded.duplicated: The number of duplicate rows identified based on theorder_item_idfield (ororder_id+skuiforder_item_idis unavailable). Only the first occurrence of a duplicate set is kept.sanitised: The number of data lines that were successfully parsed by the CSV reader into a structured DataFrame. This includes rows that might still be content-malformed or duplicated, but excludes blank lines and lines that failed structural parsing (e.g., incorrect number of fields).valid: The number ofsanitisedrows that passed all content validation rules (i.e.,sanitisedminusmalformedcontent rows).usable: The final number of rows that are considered clean and valid, after removingduplicatedrows from thevalidrows. These are the rows ready for further analytical processing.accepted: The same count asusablerows. These are the rows that the system successfully processed and accepted.rejected: The total count of rows that were ultimately excluded from thetotalset due to beingblank,malformed(content-based), orduplicated.
The application provides HTTP exceptions for various scenarios:
- 400 Bad Request: Invalid URL, invalid file ID format, missing required query parameters, or issues with data content (e.g., invalid
groupbyvalue). - 404 Not Found: File ID does not exist in the in-memory storage.
- 409 Conflict: (Currently not explicitly used, but could indicate data still processing or unavailable).