This project compiles a list of key statistics across all common car models and brands, for ease of comparison. It does this by web scraping the reliable https://caranddriver.com. Note that there is no front end for this yet, and that for the time being the `Data` folder is gitignored, as it is quite large.

Using Python's `selenium` module, this project is able to scrape all brands and their models. The process of extracting the information from each car's specs page is currently under construction. A method to iteratively scrape all models (with variants over all years each model is available) for each brand is currently on the docket.
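Conceptually, the scraping works along these lines. This is a minimal sketch only, assuming Chrome and an illustrative brand-page URL and link filter; it is not the project's actual code:

```python
# Minimal sketch of the Selenium-based scraping idea used in this project.
# NOTE: the URL and the "/acura/" filter below are illustrative assumptions,
# not the selectors the real scripts use.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a ChromeDriver is available on PATH
try:
    driver.get("https://www.caranddriver.com/acura")  # hypothetical brand page
    # Collect every link on the page; the real code would filter these
    # down to model pages before writing them out.
    links = [a.get_attribute("href")
             for a in driver.find_elements(By.TAG_NAME, "a")]
    model_links = [l for l in links if l and "/acura/" in l]
    print(model_links)
finally:
    driver.quit()
```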
Ideally, once all data is collected, it should be displayed properly, most likely in a table. A graphic of all specs being taken into consideration is coming soon; for now, see `Base.txt` or `Base.csv` in the `Docs` folder. See `Todo.txt` for planned work.
1. `cd src`
2. Obtain links to scrape by running `Targets.py` within the `src` directory. This writes `AllBrandsAndModels.json`.
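   The exact schema of this file isn't documented here, but it plausibly maps each brand to its model links, along the lines of this hypothetical excerpt:

   ```json
   {
     "acura": [
       "https://www.caranddriver.com/acura/integra",
       "https://www.caranddriver.com/acura/mdx"
     ]
   }
   ```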
3. Next, run `SummaryGenerator.py --summary`. This generates the `Links` directory; each brand has its links within a txt file here, i.e. `Links/${BRAND}.txt`.
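   For example, a hypothetical `Links/Acura.txt` would simply hold one link per line (the exact URLs here are illustrative):

   ```
   https://www.caranddriver.com/acura/integra
   https://www.caranddriver.com/acura/mdx
   https://www.caranddriver.com/acura/tlx
   ```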
4. Run `ErrLinkLogger.py --check` to go through the links in the `Links` directory and log all invalid links. Invalid links are logged to `Log/ErrorLinks.csv`.
5. To correct the invalid links found in step 4, copy `Log/ErrorLinks.csv` to `Log/ErrorLinks-Fix.csv`, appending the corrected link to the model page (main or specs, it doesn't really matter), if any, to the 3rd column. You may use the 4th column for notes (see the sketch below for one way to bootstrap the fix file). Unfortunately this is a manual process; a coded solution would not be much better. For example, typical notes for links that cannot be corrected:
   - future (car is not yet released, thus has no specs)
   - no specs page (car exists but does not have a specs page provided)
   - and more

   Then correct these links in the `Links` directory by running `ErrLinkLogger.py --fix`. This overwrites each bad link with the corrected link you found that works.
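   A minimal sketch for bootstrapping `Log/ErrorLinks-Fix.csv`; the column layout of `ErrorLinks.csv` is an assumption here, with only the roles of the 3rd (corrected link) and 4th (notes) columns taken from the step above:

   ```python
   # Copy Log/ErrorLinks.csv to Log/ErrorLinks-Fix.csv, padding each row to
   # four columns so the corrected link (3rd) and notes (4th) can be filled
   # in by hand. The layout of ErrorLinks.csv itself is assumed.
   import csv

   with open("Log/ErrorLinks.csv", newline="") as src_file, \
        open("Log/ErrorLinks-Fix.csv", "w", newline="") as dst_file:
       writer = csv.writer(dst_file)
       for row in csv.reader(src_file):
           writer.writerow(row + [""] * max(0, 4 - len(row)))
   ```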
6. (Optional) You may create `Docs/AllLinks.txt` by running `SummaryGenerator.py --all`.
7. Go up a directory into the main directory: `cd ..`
8. Run the data scraper, `DataScraper.py`:
   - Required: pass the brand you would like to scrape as the first argument.
   - Optional: pass a specific model within that brand to scrape.
   - Optional: pass a year, or `--latest`, to scrape a specific year or the latest year available.

   This writes data to `Data/YAML/*`.
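   For example (the brand, model, and year values are hypothetical, and the scripts are assumed to be invoked with `python`):

   ```
   python DataScraper.py Acura                    # whole brand
   python DataScraper.py Acura Integra            # one model
   python DataScraper.py Acura Integra 2023       # one model year
   python DataScraper.py Acura Integra --latest   # latest available year
   ```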
9. (Optional, but recommended) Run `Deduplicate.py` in the `src/data` directory to search for and remove duplicates from the YAML data.
10. Run `FileNameVerifyer.py` to find and fix files with bad names, i.e. files that have a "-{year}" in the name. Run with the `--detect` option to display all problematic files and with `--fix` to fix those files. Note: it may be a good idea to back up the `Data` folder to another place before doing this. It should be fine, but just in case.
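    For example (again assuming the scripts are invoked with `python`):

    ```
    python FileNameVerifyer.py --detect   # list problematic file names
    python FileNameVerifyer.py --fix      # rename them
    ```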
11. Run `Conversion.py --yaml-json` or `Conversion.py --yaml2json` to convert the YAML data into JSON data.
12. Run `Conversion.py --json-csv` or `Conversion.py --json2csv` to convert the JSON data into the final formatted CSV data. Note that this requires the creation of `Base.txt` and `Base.csv`, which can be done by running `MakeBase.py` with the `--txt` and `--csv` flags respectively. See the file for more details.

    Note: both steps 11 and 12 utilize the correction functionalities generated by `CorrectionMaker.py`. Running this file generates the `Corrections` directory and the files `Corrections_Template.py` and `Corrections.py`. The `Corrections` directory contains autogenerated template files for implementing data corrections/formats. WARNING: running this file again will overwrite manual corrections made to these template files. Use files like `CorrectionUpdater.py` to update the files if key names need to be changed. The file `Corrections_Template.py` serves as a template/superclass for the correction files; the `Corrections.py` file drives the correction files and connects them to the functionality in `Conversion.py`. Note: the file `CorrectionStatus.py` logs the implementation status of specification corrections, by brand, to an Excel (`.xlsx`) sheet.
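    To make the superclass/driver relationship concrete, here is a hedged sketch of how the pieces might fit together. The class and method names are hypothetical, not the actual contents of the generated files:

    ```python
    # Hypothetical shape of the corrections machinery (all names illustrative).

    class CorrectionsTemplate:
        """Plays the superclass role of Corrections_Template.py."""
        BRAND = None

        def correct(self, record: dict) -> dict:
            # Default: pass the scraped record through unchanged.
            return record

    class AcuraCorrections(CorrectionsTemplate):
        """Plays the role of one autogenerated file in the Corrections directory."""
        BRAND = "Acura"

        def correct(self, record: dict) -> dict:
            # Example fix-up: normalize a key name before CSV conversion.
            if "Horsepower" in record:
                record["horsepower"] = record.pop("Horsepower")
            return record

    # Corrections.py plays the driver role: it would collect the per-brand
    # correction classes and hand them to the logic in Conversion.py.
    CORRECTIONS = {cls.BRAND: cls() for cls in [AcuraCorrections]}
    ```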
13. Compile the data into one source by running `CompCSV.py`. This generates a CSV file for each brand, named `${BRAND}.csv`, in each brand's CSV data directory. It also creates the file `Data/CSV/AllData.csv`, which contains all the data obtained.
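In spirit, this final step is a concatenation of the per-brand CSVs. A minimal sketch, assuming the per-brand files share a header row and sit one directory deep under `Data/CSV/` (the layout is a guess from the paths above):

```python
# Concatenate every per-brand CSV under Data/CSV/ into AllData.csv,
# keeping only the first file's header row. Directory layout is assumed.
import csv
import glob

with open("Data/CSV/AllData.csv", "w", newline="") as out:
    writer = csv.writer(out)
    header_written = False
    for path in sorted(glob.glob("Data/CSV/*/*.csv")):
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        if not rows:
            continue
        if not header_written:
            writer.writerow(rows[0])
            header_written = True
        writer.writerows(rows[1:])
```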