This project focuses on analyzing air quality data to assess various atmospheric measurements. By applying different data analysis techniques, including data imputation, statistical testing, and graphical visualizations, insights into air quality metrics are uncovered.
This project analyzes air quality measurements from a dataset containing observations of ozone concentration (Ozono), solar radiation (RadSol), wind speed (Vient), temperature (Temp), and temporal data: month (Mes) and day (Dia). The primary goal is to understand the relationship between these variables and their impact on air quality.
- Language: R
- Libraries:
micefactoextragridExtratidyversevisdatdlookrflextableinspectdfqqplotrggpmisccorrplot - IDE: RStudio
The data is loaded from a CSV file into a dataframe named air using the read.csv() function. This dataframe contains 100 observations and 6 numeric variables related to air quality measurements.
An initial review of the dataframe was conducted to identify any missing values (NA). The analysis revealed missing values in the Ozono and RadSol variables, which account for 3.7% of the data.
The mice package was utilized to effectively impute missing values, storing the result in a new dataframe called air_impt.
- Frequency histograms were created for the numeric variables (
Ozono,RadSol,Vient,Temp) with mean and median lines to visualize distributions (Figure 1). - Normal density and cumulative probability graphs were generated for each variable (Figure 2).
Figure 1: Frequency histograms
Figure 2: Normal density and cumulative probability graphs for RadSol and Vient
- Hypothesis Testing: Null and alternative hypotheses were defined to assess normality, applying the Shapiro-Wilk test.
- Skewness and Kurtosis: Skewness and kurtosis were calculated to understand the distribution characteristics of the variables.
Boxplots were utilized to identify outliers in the Ozono and Vient variables (Figure 3). Outlier significance was assessed using the flextable(diagnose_outlier()) function (Table 1).
The analysis highlighted significant deviations from normality in the Ozono and RadSol variables. The Shapiro-Wilk test confirmed these findings, emphasizing the need for potential transformations in further analysis.
Missing values were successfully imputed, and the month and day variables were transformed into categorical types. The analysis indicated skewness in some variables, suggesting the necessity for transformations or non-parametric modeling.
To enhance the findings, future efforts will include:
- Applying transformations (e.g., logarithmic or square root) to skewed variables.
- Conducting multivariate analysis to explore relationships between air quality metrics.




