This is a portfolio of my work applying fundamental data analytics and data science techniques. It includes several coding tasks, which are outlined in more detail below.
Organising and Filtering Datasets
Objective: Develop a Python script that filters and organises a dataset based on specific criteria.
Requirements:
- Pandas library
- Use dictionary, list or set
- Implement functions that filter data
Method: The file Data_Filtering_Organising includes three key user-defined functions:
- Reads the dataset and outputs the values associated with a specific key in the dataset
- Outputs a frequency table for a specified column name
- Creates a filtered dataset containing only name and age for a specific age range
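The three functions described above might look roughly like the sketch below. This is an illustration rather than the contents of Data_Filtering_Organising: the function names, the `city` column and the sample rows are assumptions, and only the `name` and `age` columns come from the description.

```python
import pandas as pd


def values_for_key(df: pd.DataFrame, key: str) -> list:
    """Return the values stored under one column (key) as a plain list."""
    if key not in df.columns:
        raise KeyError(f"Column {key!r} not found")
    return df[key].tolist()


def frequency_table(df: pd.DataFrame, column: str) -> dict:
    """Count how often each value appears in a column."""
    return df[column].value_counts().to_dict()


def filter_by_age(df: pd.DataFrame, min_age: int, max_age: int) -> pd.DataFrame:
    """Keep only name and age for rows whose age lies in [min_age, max_age]."""
    in_range = df["age"].between(min_age, max_age)  # inclusive on both ends
    return df.loc[in_range, ["name", "age"]].reset_index(drop=True)


# Made-up rows, purely for illustration
people = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cy", "Dee"],
        "age": [23, 35, 41, 35],
        "city": ["Leeds", "York", "Leeds", "Bath"],
    }
)

city_counts = frequency_table(people, "city")
thirties = filter_by_age(people, 30, 40)
```

With these sample rows, `city_counts` maps each city to its number of occurrences and `thirties` holds the name and age of Ben and Dee only.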
Objective:
- To create a programme that plays rock, paper, scissors, taking input from the user and a random choice from the computer to determine a winner.
Method:
- Import the random module
- Use exception handling to catch invalid input (value errors)
- Define three functions: user's choice, computer's choice, determine a winner
- Create final function: play the game based on previous three functions
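One possible shape for the functions listed above is sketched here. The function names and the `BEATS` lookup table are assumptions made for illustration, not necessarily how the portfolio script is organised.

```python
import random

CHOICES = ("rock", "paper", "scissors")
# Each key beats the value it maps to
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def parse_user_choice(raw: str) -> str:
    """Normalise the user's text and raise ValueError if it is not a valid move."""
    choice = raw.strip().lower()
    if choice not in CHOICES:
        raise ValueError(f"Invalid choice: {raw!r}")
    return choice


def computer_choice() -> str:
    """Pick the computer's move at random."""
    return random.choice(CHOICES)


def determine_winner(user: str, computer: str) -> str:
    """Return 'draw', 'user' or 'computer'."""
    if user == computer:
        return "draw"
    return "user" if BEATS[user] == computer else "computer"


def play_game() -> None:
    """Ask for input until it is valid, then play one round."""
    while True:
        try:
            user = parse_user_choice(input("Rock, paper or scissors? "))
            break
        except ValueError as err:
            print(f"{err}. Please try again.")
    computer = computer_choice()
    print(f"Computer chose {computer}. Result: {determine_winner(user, computer)}")
```

Calling `play_game()` starts a round. Keeping the validation in `parse_user_choice` means the exception handling lives in one place and the winner logic can be checked without any keyboard input.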
Objective:
- To create and interpret simple data visualisations for car manufacturer data
Requirements:
- Pandas library
- Matplotlib.pyplot library
- NumPy library
- Seaborn library
- Include boxplots, histograms, lineplots and barplots.
Results:
- The manufacturer with the highest revs per mile is the Geo, which has an average revs per mile of 3755
- The histograms indicate that MPG is generally higher on the highway than in the city: the highway distribution sits further to the right, while the city distribution is concentrated towards the lower end of the x axis, indicating that more cars have a lower city MPG.
- There is a roughly linear relationship between a car's turning circle and its wheelbase. As the wheelbase gets larger, the turning circle also gets larger, although there are a few outliers. For example, a car with a wheelbase of 115mm has a turning circle of 38, which is lower than the 43 degrees recorded for a car with a wheelbase of around 112.
- Vans have a lower average horsepower than sporty or midsize vehicles, while large cars have a higher average horsepower than both small and compact cars.
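Group-level results like the first and last points above typically come from a pandas `groupby`. The sketch below shows the idea on made-up numbers. The column names (`Manufacturer`, `Type`, `RevsPerMile`, `Horsepower`) and all values are assumptions for illustration and are not taken from the car dataset.

```python
import pandas as pd

# Made-up values for illustration only; not the portfolio's dataset
cars = pd.DataFrame(
    {
        "Manufacturer": ["A", "A", "B", "B", "C"],
        "Type": ["Small", "Van", "Small", "Large", "Van"],
        "RevsPerMile": [2000, 2200, 3000, 3400, 2500],
        "Horsepower": [100, 150, 120, 200, 130],
    }
)

# Average revs per mile for each manufacturer, then the highest one
mean_revs = cars.groupby("Manufacturer")["RevsPerMile"].mean()
top_manufacturer = mean_revs.idxmax()   # "B" for these sample rows
top_value = mean_revs.max()             # 3200.0 for these sample rows

# Average horsepower per vehicle type, sorted from lowest to highest
mean_hp_by_type = cars.groupby("Type")["Horsepower"].mean().sort_values()
```

A series such as `mean_hp_by_type` can be passed straight to a bar plot, for example with `mean_hp_by_type.plot(kind="bar")` when Matplotlib is installed.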
Objective:
- To analyse the SQLite Sakila database, which includes data on DVD rental stores globally.
Output: The SQL file implements queries to create and read the database in order to calculate information such as total payments and average payment information.
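As a rough illustration of the kind of aggregate query involved, the sketch below runs SQL through Python's built-in `sqlite3` module against a small in-memory table. The table is a simplified stand-in for Sakila's `payment` table, with only three columns and made-up rows, so the figures are not results from the real database.

```python
import sqlite3

# In-memory database with a simplified stand-in for Sakila's payment table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payment ("
    "payment_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO payment (customer_id, amount) VALUES (?, ?)",
    [(1, 2.99), (1, 4.99), (2, 0.99), (2, 6.99)],  # made-up rows
)

# Total and average payment across all rows
total, average = conn.execute(
    "SELECT ROUND(SUM(amount), 2), ROUND(AVG(amount), 2) FROM payment"
).fetchone()

# Total paid by each customer
per_customer = conn.execute(
    "SELECT customer_id, ROUND(SUM(amount), 2) "
    "FROM payment GROUP BY customer_id ORDER BY customer_id"
).fetchall()

conn.close()
```

The same `SELECT` statements can be run directly in a SQL file against the full database; the Python wrapper is only there to make the example self-contained.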
Objective: Carry out exploratory data analysis of CO2 emissions by country
Data Source: Kaggle CO2 emissions: 1960-2019
Notebook: Country CO2 Emissions EDA
Requirements:
- Use pandas, numpy, matplotlib and seaborn libraries
- Compute basic stats (mean, median, mode, sd, quartiles)
- Create visualisations (hists, boxplots, scatterplots)
- Apply basic inferential stats (e.g. confidence intervals)
This tool computes descriptive statistics for a dataset of CO2 emissions by country and visualises them through various charts and graphs. It helps in understanding the distribution, central tendency and variability of the data using time series analysis.
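The statistics named in the requirements can be sketched with the Python standard library alone, as below. This is not the notebook's code, which uses pandas and numpy: the `describe` function and the sample values are made up for illustration, and the confidence interval uses a normal approximation, which is a simplifying assumption (a t-distribution is more appropriate for small samples).

```python
import math
import statistics
from statistics import NormalDist


def describe(values, confidence=0.95):
    """Return basic descriptive statistics and a confidence interval for the mean."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    # Normal-approximation interval: mean +/- z * sd / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sd / math.sqrt(n)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "sd": sd,
        "q1": q1,
        "q3": q3,
        "ci": (mean - margin, mean + margin),
    }


# Made-up values, not taken from the Kaggle dataset
emissions = [2.0, 4.0, 4.0, 5.0, 10.0]
summary = describe(emissions)
```

For these sample values the mean is 5.0, the median and mode are 4.0 and the sample standard deviation is 3.0.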
Insights:
- The average CO2 emissions for 2019 across all countries in this dataset were 190,969.22 kilotons.
- The country with the highest CO2 emissions in 2019 was China, with 10,707,219.73 kilotons.
- The standard deviation of CO2 emissions in the 2019 country dataset is 892,755.94 kilotons. This is several times the mean, which indicates a large variation across countries.
- China's CO2 emissions have increased dramatically since the year 2000. In the year 2000, China's CO2 emissions were 3,346,530 kilotons, which rose to 10,707,220 in 2019.
Objective:
- To examine the relationship between specific variables and forest fires in Algeria
Requirements:
- Use pandas, numpy, matplotlib, sklearn and seaborn libraries
Method:
- This analysis involved cleaning a dataset of forest fires in Algeria to handle missing values and inconsistencies.
- Exploratory Data Analysis (EDA) was conducted to examine the relationships between meteorological variables (temperature, wind, and relative humidity) and the area burned by fires.
- Visualisations, including scatter plots and correlation analyses, were used to illustrate these relationships.
- Predictive modelling with Linear Regression and a Random Forest Regressor was applied to predict the burned area, revealing the complexity of fire prediction and the need for more sophisticated modelling.
Results: The analysis reveals weak correlations between meteorological factors (temperature, wind and relative humidity) and the area burned by forest fires. Despite applying Linear Regression models, the predictive accuracy was low, indicating that predicting the burned area is complex and may require more advanced models or additional data for better accuracy.
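A minimal sketch of the modelling step is shown below, assuming scikit-learn and a held-out test set scored with R². The data is synthetic and deliberately noisy to mimic weak relationships; the feature ranges, the target formula and the hyperparameters are all assumptions, so the scores say nothing about the Algerian dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_rows = 200

# Synthetic stand-ins for temperature, wind and relative humidity
X = np.column_stack(
    [
        rng.uniform(20, 40, n_rows),
        rng.uniform(5, 30, n_rows),
        rng.uniform(20, 90, n_rows),
    ]
)
# Target is mostly noise, so the features explain little of it
y = 0.05 * X[:, 0] + rng.exponential(5.0, n_rows)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model and record its R^2 on the held-out rows
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
```

An R² close to zero (or negative) on the test set is the usual sign that a model predicts little better than the mean of the target.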
K Means Clustering of Countries by Socio-Economic Features
Objective:
- To group countries using socio-economic and health factors to determine the development status of each country.
Requirements:
- Use pandas, numpy, matplotlib, sklearn and seaborn libraries
Method:
The Socioeconomic_Country_Clustering.ipynb notebook applies K-means clustering to group countries based on various socioeconomic indicators. The analysis involves:
- Data Preprocessing: Cleaning and preparing the dataset for analysis.
- Feature Selection: Selecting relevant socioeconomic indicators for clustering.
- K-means Clustering: Applying the K-means algorithm to identify distinct clusters of countries.
- Cluster Analysis: Interpreting the clusters and visualizing the results using various plots.
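The clustering step outlined above can be sketched as follows, assuming the features are standardised before K-means is applied. The two synthetic indicators (loosely standing in for something like income and child mortality), the choice of three clusters and the random seeds are assumptions for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-ins for two indicators across three groups of 20 countries
low = rng.normal(loc=[1_000, 90], scale=[200, 10], size=(20, 2))
mid = rng.normal(loc=[10_000, 30], scale=[1_500, 5], size=(20, 2))
high = rng.normal(loc=[40_000, 5], scale=[5_000, 1], size=(20, 2))
features = np.vstack([low, mid, high])

# Standardise so that no indicator dominates the distance calculation
scaled = StandardScaler().fit_transform(features)

# Fit K-means and assign each row to one of three clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(scaled)
```

Scaling matters here because K-means relies on Euclidean distance, so an indicator measured in thousands would otherwise outweigh one measured in tens.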