This is a Data Analytics Project by a dedicated team.
The Query Squad Project is a web scraping and data analysis project aimed at extracting and analyzing used car data from two popular websites: TheAA and Cinch. This project utilizes R programming along with several libraries to scrape, clean, and analyze the data, producing meaningful visualizations and insights.
- Scrapes used car listings from TheAA and Cinch.
- Extracts essential details such as car name, price, year, mileage, fuel type, and transmission.
- Combines data from both sources into one dataset for comprehensive analysis.
- Generates insightful visualizations including histograms, scatter plots, box plots, and correlation heatmaps.
- Exports the cleaned dataset to a CSV file for further use.
- Installation
- Usage
- Dependencies
- Data Extraction Process
- Data Cleaning & Transformation
- Visualizations
- Output Files
- Contributing
- License
Query_Squad_Project/
├── README.md # Project documentation
├── LICENSE # License file (e.g., MIT License)
├── Query_Squad_Project.R # Main R script
├── data/ # Folder for raw and cleaned data
│ ├── scraped_cars_data.csv
│ └── cleaned_cars_data.csv
├── figures/ # Folder for generated visualizations
│ ├── correlation_heatmap.png
│ └── car_price_vs_mileage_plot.png
└── .gitignore # Ignore unnecessary files
To run this project locally, follow these steps:
- Clone the repository and change into the project directory:
  git clone https://github.com/Eldeewealth/Query_Squad_Project.git
  cd Query_Squad_Project
- Install the required R packages:
  install.packages(c("rvest", "httr", "dplyr", "stringr", "ggplot2", "ggcorrplot", "mice"))
- Run the script in your R environment:
  source("Query_Squad_Project.R")
This script performs the following tasks:
- Scrapes data from TheAA and Cinch across multiple pages (up to 15 pages for TheAA and up to 13 pages for Cinch).
- Cleans and transforms the scraped data, ensuring consistency between datasets.
- Generates summary statistics and visualizations.
- Saves the cleaned dataset and visualizations to files.
Run the script to generate the dataset (scraped_cars_data.csv) and visualizations.
The project relies on the following R libraries:
- rvest: For web scraping HTML content.
- httr: For making HTTP requests.
- dplyr: For data manipulation and transformation.
- stringr: For string manipulation.
- ggplot2: For creating visualizations.
- ggcorrplot: For generating correlation heatmaps.
- mice: For handling missing data and imputation.
Install these dependencies using the install.packages() function as shown in the Installation section.
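As a convenience, the following is a minimal sketch for checking which of the listed packages are not yet installed, so that only those need installing. The helper name `missing_packages` is hypothetical and is not part of the project script.

```r
# Packages listed in the Dependencies section.
required <- c("rvest", "httr", "dplyr", "stringr", "ggplot2", "ggcorrplot", "mice")

# Hypothetical helper: return the subset of `pkgs` that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install the missing packages
}
```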
- Scrapes data from the used cars section of TheAA's website.
- Extracts car details such as name, price, year, mileage, fuel type, and transmission.
- Loops through pages 1 to 15 to gather data.
- Scrapes data from Cinch's used cars section.
- Dynamically determines the total number of pages (up to 13) based on pagination links.
- Extracts similar car details as TheAA.
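One way the page count could be derived from pagination links is sketched below. The markup, the `.pagination a` selector, and the variable names are assumptions made for illustration; Cinch's actual page structure is not documented here and may differ.

```r
library(rvest)

# Hypothetical pagination markup; the real page structure may differ.
pagination_html <- paste0(
  '<nav class="pagination">',
  '<a href="?page=1">1</a><a href="?page=2">2</a>',
  '<a href="?page=27">27</a><a href="?page=2">Next</a>',
  '</nav>'
)
max_pages <- 13  # upper limit stated above

# Read the link labels and keep only those that are page numbers.
link_text <- read_html(pagination_html) %>%
  html_elements(".pagination a") %>%
  html_text2()
page_numbers <- suppressWarnings(as.integer(link_text))
page_numbers <- page_numbers[!is.na(page_numbers)]

# Fall back to a single page if no numbered links are found.
last_page <- if (length(page_numbers) > 0) max(page_numbers) else 1L
total_pages <- min(last_page, max_pages)
```

In this sketch the site advertises 27 pages, and `total_pages` is capped at the limit of 13.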
Each page's data is stored in a list and combined into a single dataset after all pages are processed.
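The list-then-combine pattern described above might look like the sketch below. To keep it runnable offline, it parses inline HTML strings rather than live pages; the `.car`, `.name`, and `.price` selectors and the `parse_page` helper are hypothetical, and a real run would pass each page URL to `read_html()` instead.

```r
library(rvest)
library(dplyr)

# Hypothetical stand-ins for two downloaded result pages.
pages_html <- c(
  paste0(
    '<div class="car"><h3 class="name">Ford Fiesta</h3>',
    '<span class="price">&pound;8,500</span></div>',
    '<div class="car"><h3 class="name">VW Golf</h3>',
    '<span class="price">&pound;12,250</span></div>'
  ),
  paste0(
    '<div class="car"><h3 class="name">Kia Ceed</h3>',
    '<span class="price">&pound;9,995</span></div>'
  )
)

# Hypothetical helper: turn one page into a data frame, one row per listing.
parse_page <- function(html) {
  cards <- read_html(html) %>% html_elements(".car")
  data.frame(
    Name = cards %>% html_element(".name") %>% html_text2(),
    Price = cards %>% html_element(".price") %>% html_text2(),
    stringsAsFactors = FALSE
  )
}

# Store each page's result in a pre-allocated list, then combine once.
page_results <- vector("list", length(pages_html))
for (i in seq_along(pages_html)) {
  page_results[[i]] <- parse_page(pages_html[i])
}
cars <- bind_rows(page_results)
```

Collecting results in a list and binding once at the end avoids growing a data frame inside the loop.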
After scraping, the data undergoes the following cleaning steps:
- Removes unnecessary characters (e.g., "£", ",", " miles").
- Converts numeric columns (Price, Mileage) to appropriate numeric types.
- Handles missing or inconsistent data by replacing them with NA.
- Adds a "Source" column to differentiate between TheAA and Cinch data.
The cleaned dataset is then ready for analysis and visualization.
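The cleaning steps above can be sketched as follows. The sample rows are made up, and the helpers `to_number` and `clean_listings` are hypothetical names; the sketch strips every character other than digits and the decimal point, which is one possible way to remove "£", "," and " miles".

```r
library(dplyr)
library(stringr)

# Made-up raw rows, shaped like scraped text.
raw_aa <- data.frame(
  Name = c("Ford Fiesta", "VW Golf", "Kia Ceed"),
  Price = c("£8,500", "£12,250", "POA"),
  Mileage = c("42,000 miles", "15,300 miles", ""),
  stringsAsFactors = FALSE
)
raw_cinch <- data.frame(
  Name = "Mini Cooper", Price = "£10,995", Mileage = "30,120 miles",
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep digits and the decimal point, then convert.
# Values with nothing left (e.g. "POA" or "") become NA.
to_number <- function(x) {
  as.numeric(str_remove_all(x, "[^0-9.]"))
}

# Hypothetical helper: clean one source and tag it with its origin.
clean_listings <- function(df, source) {
  df %>%
    mutate(
      Price = to_number(Price),
      Mileage = to_number(Mileage),
      Source = source
    )
}

cars <- bind_rows(
  clean_listings(raw_aa, "TheAA"),
  clean_listings(raw_cinch, "Cinch")
)
```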
The project generates several visualizations to provide insights into the data:
- Box Plots:
  - Compares Price and Mileage distributions across different years.
  - Highlights trends and outliers in pricing and mileage.
- Histogram:
  - Displays the distribution of car prices.
  - Helps identify common price ranges.
- Scatter Plot:
  - Shows the relationship between Mileage and Price.
  - Differentiates data points by source (TheAA vs. Cinch).
- Correlation Heatmap:
  - Visualizes correlations between Price, Mileage, and Year.
  - Provides insights into how these variables interact.
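As an illustration of the correlation heatmap, the sketch below computes a correlation matrix with base R's `cor()` and passes it to `ggcorrplot()`. The sample values are made up, and the project script may configure the plot differently.

```r
library(ggcorrplot)

# Made-up sample values, for illustration only.
cars <- data.frame(
  Price = c(8500, 12250, 9995, 15400, 6200),
  Mileage = c(42000, 15300, 38000, 9800, 71000),
  Year = c(2017, 2020, 2018, 2021, 2014)
)

# Pairwise Pearson correlations, using only rows with no missing values.
corr_matrix <- cor(cars[, c("Price", "Mileage", "Year")], use = "complete.obs")

# lab = TRUE prints each coefficient inside its tile.
heatmap_plot <- ggcorrplot(corr_matrix, lab = TRUE)
```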
All visualizations are saved as image files (*.png or *.pdf) in the working directory.
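A minimal sketch of building the Mileage vs. Price scatter plot and saving it with `ggsave()` is shown below. The sample values, plot labels, and image dimensions are assumptions, and the file is written to a temporary directory here rather than the working directory.

```r
library(ggplot2)

# Made-up sample values, for illustration only.
cars <- data.frame(
  Price = c(8500, 12250, 9995, 15400),
  Mileage = c(42000, 15300, 38000, 9800),
  Source = c("TheAA", "TheAA", "Cinch", "Cinch")
)

# Colour the points by source so the two sites can be told apart.
scatter_plot <- ggplot(cars, aes(x = Mileage, y = Price, colour = Source)) +
  geom_point() +
  labs(title = "Car price vs. mileage", x = "Mileage (miles)", y = "Price (GBP)")

# The file extension selects the output format (.png here, .pdf also works).
out_file <- file.path(tempdir(), "car_price_vs_mileage_plot.png")
ggsave(out_file, plot = scatter_plot, width = 7, height = 5, dpi = 150)
```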
The project generates the following output files:
- scraped_cars_data.csv: The cleaned and combined dataset containing car details from both TheAA and Cinch.
- Visualization files:
  - line_plot_year_price_mileage.png
  - car_price_vs_mileage_plot.png
  - correlation_heatmap.pdf
These files can be found in the working directory after running the script.
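For reference, exporting a cleaned data frame to CSV and reading it back can be sketched with base R as below. The sample rows are made up, and the file is written to a temporary directory here rather than the working directory.

```r
# Made-up cleaned rows, including one missing mileage value.
cars <- data.frame(
  Name = c("Ford Fiesta", "VW Golf"),
  Price = c(8500, 12250),
  Mileage = c(42000, NA),
  Source = c("TheAA", "Cinch"),
  stringsAsFactors = FALSE
)

# row.names = FALSE keeps R's row index out of the file.
out_csv <- file.path(tempdir(), "scraped_cars_data.csv")
write.csv(cars, out_csv, row.names = FALSE)

# Reading the file back confirms that NA values survive the round trip.
reloaded <- read.csv(out_csv, stringsAsFactors = FALSE)
```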
We welcome contributions to improve this project! To contribute:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Submit a pull request with a detailed description of your changes.
Please ensure your code adheres to best practices and includes appropriate documentation.
This project is licensed under the MIT License. Feel free to use, modify, and distribute the code as per the terms of the license.
For questions or feedback, please contact the project maintainers via GitHub issues or email.
Happy coding! 🚗📊