This project automates the process of extracting, transforming, and loading (ETL) data about the world's largest banks by market capitalization. It fetches data from a Wikipedia page, processes it, and stores the results in both CSV and SQLite database formats.
- Extracts bank data from a Wikipedia page
- Transforms market capitalization data from USD to GBP, EUR, and INR
- Loads processed data into a CSV file and SQLite database
- Logs each step of the ETL process
- Executes sample SQL queries on the resulting database
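
The currency conversion in the transform step can be sketched with the standard library alone, as below. This is a minimal illustration rather than the project's actual code (which relies on pandas). The `Currency`/`Rate` layout of `exchange_rate.csv`, the rates, the `MC_*_Billion` column names, and the sample bank row are all assumptions made for the example.

```python
import csv
import io

# Hypothetical contents of exchange_rate.csv; the real column names and rates may differ.
SAMPLE_RATES_CSV = """Currency,Rate
EUR,0.93
GBP,0.8
INR,82.95
"""


def read_rates(csv_text):
    """Parse 'Currency,Rate' rows into a {currency: rate} dictionary."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Currency"]: float(row["Rate"]) for row in reader}


def transform(rows, rates):
    """Return copies of the rows with one rounded market-cap column per target currency."""
    converted = []
    for row in rows:
        new_row = dict(row)
        for currency in ("GBP", "EUR", "INR"):
            # Column naming (MC_<currency>_Billion) is an assumption for this sketch.
            new_row[f"MC_{currency}_Billion"] = round(
                row["MC_USD_Billion"] * rates[currency], 2
            )
        converted.append(new_row)
    return converted


rates = read_rates(SAMPLE_RATES_CSV)
# Made-up bank row used only to exercise the function.
banks = [{"Name": "Example Bank", "MC_USD_Billion": 100.0}]
result = transform(banks, rates)
print(result[0])
```

Each converted value is the USD figure multiplied by the matching rate and rounded to two decimals, so the original USD column is preserved next to the new ones.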
- Python 3.x
- Required Python packages:
  - requests
  - pandas
  - beautifulsoup4
  - sqlite3 (part of the Python standard library; no separate installation needed)
- `main.py`: The main Python script that performs the ETL process
- `exchange_rate.csv`: CSV file containing currency exchange rates
- `Largest_banks_data.csv`: Output CSV file with processed bank data
- `code_log.txt`: Log file that records the progress of the ETL process
- `Banks.db`: SQLite database file storing the processed data
- `docs/`: Directory containing project documentation
  - `HLD.md`: High-Level Design document
  - `LLD.md`: Low-Level Design document
- Ensure all required Python packages are installed:

      pip install requests pandas beautifulsoup4

- Place the `exchange_rate.csv` file in the same directory as the script.
- Run the script:

      python main.py

- The script will:
  - Extract data from the specified Wikipedia page
  - Transform the data using exchange rates from `exchange_rate.csv`
  - Load the data into `Largest_banks_data.csv` and the SQLite database `Banks.db`
  - Log the progress in `code_log.txt`
  - Execute and display results of sample SQL queries
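
The logging step can be pictured with a short sketch like the one below. It is an assumption about how a timestamped logger might look, not a copy of the script: the optional `log_file` parameter, the timestamp format, and the ` : ` separator are all hypothetical, and the demo writes to a temporary directory instead of the real `code_log.txt`.

```python
import os
import tempfile
from datetime import datetime


def log_progress(message, log_file="code_log.txt"):
    """Append one timestamped line to the log file."""
    # Timestamp format is an assumption; the script may use a different one.
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_file, "a", encoding="utf-8") as handle:
        handle.write(f"{timestamp} : {message}\n")


# Demo: log two pipeline stages to a throwaway file and read them back.
log_path = os.path.join(tempfile.mkdtemp(), "code_log.txt")
log_progress("Data extraction complete", log_path)
log_progress("Data transformation complete", log_path)

with open(log_path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)
```

Opening the file in append mode keeps entries from earlier runs, which is usually what a progress log is for.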
- `log_progress(message)`: Logs messages with timestamps
- `extract(url, table_att)`: Extracts data from the Wikipedia page
- `transform(df_)`: Transforms the data using exchange rates
- `load_to_csv(df_, file_path)`: Saves data to a CSV file
- `load_to_db(df_)`: Saves data to the SQLite database
- `run_query(query_statement, conn_)`: Executes SQL queries and displays results
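
To make the load and query steps concrete, here is a rough sketch built on the standard-library `sqlite3` module. It is not the project's implementation: the real functions take a pandas DataFrame, whereas this version takes plain tuples, and the table name `Largest_banks`, the two-column schema, and the sample rows are assumptions. An in-memory database stands in for `Banks.db`.

```python
import sqlite3


def load_to_db(rows, conn, table_name="Largest_banks"):
    """Create the table if needed and insert (name, market cap) rows."""
    # table_name is a hypothetical default, interpolated only because it is fixed in code.
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table_name} (Name TEXT, MC_USD_Billion REAL)"
    )
    conn.executemany(f"INSERT INTO {table_name} VALUES (?, ?)", rows)
    conn.commit()


def run_query(query_statement, conn):
    """Print the query, then print and return the rows it produces."""
    print(query_statement)
    rows = conn.execute(query_statement).fetchall()
    for row in rows:
        print(row)
    return rows


# In-memory database so the demo leaves no file behind.
conn = sqlite3.connect(":memory:")
load_to_db([("Example Bank A", 100.0), ("Example Bank B", 50.0)], conn)
average = run_query("SELECT AVG(MC_USD_Billion) FROM Largest_banks", conn)
names = run_query("SELECT Name FROM Largest_banks ORDER BY Name LIMIT 1", conn)
conn.close()
```

Row values are passed as `?` parameters rather than formatted into the SQL string, which avoids quoting problems with bank names.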
The docs directory contains detailed design documents:
- `HLD.pdf`: High-Level Design document outlining the overall architecture and components of the project
- `LLD.pdf`: Low-Level Design document providing detailed specifications and function descriptions
Refer to these documents for a comprehensive understanding of the project's design and implementation.
- To modify the source URL, update the `url_data` variable
- To change the output file names or locations, update the respective variables at the beginning of the script
- The script uses a web archive version of the Wikipedia page to ensure consistency
- Ensure you have proper permissions to read/write files in the script's directory
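
Because the source is an archived snapshot, the table layout should stay fixed, so extraction comes down to reading known columns from an HTML table. The sketch below illustrates that idea with the standard-library `html.parser` instead of BeautifulSoup, and parses an inline HTML string instead of fetching the page. The helper name `extract_from_html`, the sample table, and its column order are hypothetical.

```python
from html.parser import HTMLParser

# Tiny stand-in for the archived table; the real page has different rows and may have more columns.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Market cap (US$ billion)</th></tr>
  <tr><td>1</td><td>Example Bank A</td><td>100.5</td></tr>
  <tr><td>2</td><td>Example Bank B</td><td>50.25</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current_row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current_row = []
        elif tag == "td" and self._current_row is not None:
            self._in_cell = True
            self._current_row.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._in_cell = False
            self._current_row[-1] = self._current_row[-1].strip()
        elif tag == "tr" and self._current_row is not None:
            # Header rows contain only <th> cells, so their row list stays empty.
            if self._current_row:
                self.rows.append(self._current_row)
            self._current_row = None

    def handle_data(self, data):
        if self._in_cell:
            self._current_row[-1] += data


def extract_from_html(html_text):
    """Return a list of {Name, MC_USD_Billion} dictionaries parsed from the table."""
    parser = TableRowParser()
    parser.feed(html_text)
    return [
        {"Name": cells[1], "MC_USD_Billion": float(cells[2])}
        for cells in parser.rows
    ]


banks = extract_from_html(SAMPLE_HTML)
print(banks)
```

In the actual script the HTML would come from the URL stored in `url_data`; the parsing step is kept separate here so it can be run without network access.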
- Add error handling for network issues or data inconsistencies
- Implement command-line arguments for flexible file paths and URLs
- Create a configuration file for easy customization of parameters
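
As one possible shape for the command-line idea above, the sketch below uses `argparse` with defaults taken from the file names listed in this README. The flag names (`--url`, `--rates`, `--csv-out`, `--db`, `--log`) are hypothetical; nothing like them exists in the script yet.

```python
import argparse


def build_parser():
    """Command-line options whose defaults mirror the file names used by the script."""
    parser = argparse.ArgumentParser(description="Run the largest-banks ETL pipeline.")
    parser.add_argument("--url", help="source page URL (no default in this sketch)")
    parser.add_argument("--rates", default="exchange_rate.csv", help="exchange-rate CSV path")
    parser.add_argument("--csv-out", default="Largest_banks_data.csv", help="output CSV path")
    parser.add_argument("--db", default="Banks.db", help="SQLite database path")
    parser.add_argument("--log", default="code_log.txt", help="log file path")
    return parser


# Parsing explicit lists keeps the demo independent of the real command line.
defaults = build_parser().parse_args([])
custom = build_parser().parse_args(
    ["--db", "out/Test.db", "--url", "https://example.org/banks"]
)
print(defaults.csv_out, custom.db)
```

With defaults in place, running the script with no arguments would behave as it does today, while individual paths could be overridden per run.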