Detailed teaching plan in this Google Doc
Forecasting air pollution via a data science approach
This two-day workshop is designed for students in Grades 8-12. Its goal is to give students an authentic experience of the data science process by building an air pollution forecasting model.
During the first day of the workshop, students will use the 2017 U.S. air quality and population data to model the relationship between population and air pollution. By working with this data set, students will learn how to use histograms to describe the distribution of one variable and use scatter plots to describe the relationship between two variables.
During the second day of the workshop, students will use two linear models to describe and predict the relationship between an air pollutant and population, and use a holdout data set to validate the linear models and evaluate their error. Once students are familiar with the data, tools, and relevant statistical concepts, they will use daily air quality, weather, and yearly population data for North Carolina from 2013 to 2017 to build a PM2.5 forecasting model using the decision tree algorithm. The workshop will conclude with small group presentations in which students showcase their prediction models.
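The holdout step described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the workshop notebook: the data are synthetic, and the names `fit_line`, `mean_squared_error`, `population`, and `pollutant`, along with the 75/25 split, are assumptions made for the example.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_squared_error(xs, ys, slope, intercept):
    """Average squared difference between observed and predicted y."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic stand-in data (NOT the workshop's EPA/Census data):
# a hypothetical linear trend plus random noise.
rng = random.Random(0)
population = [rng.uniform(1, 100) for _ in range(200)]
pollutant = [0.05 * p + 8 + rng.gauss(0, 1) for p in population]

# Shuffle the row indices, then keep 25% of the rows as a holdout set.
indices = list(range(len(population)))
rng.shuffle(indices)
split = int(0.75 * len(indices))
train_idx, holdout_idx = indices[:split], indices[split:]

train_x = [population[i] for i in train_idx]
train_y = [pollutant[i] for i in train_idx]
holdout_x = [population[i] for i in holdout_idx]
holdout_y = [pollutant[i] for i in holdout_idx]

# Fit on the training rows only, then measure error on both sets.
slope, intercept = fit_line(train_x, train_y)
train_mse = mean_squared_error(train_x, train_y, slope, intercept)
holdout_mse = mean_squared_error(holdout_x, holdout_y, slope, intercept)

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"training MSE={train_mse:.3f}, holdout MSE={holdout_mse:.3f}")
```

Because the model never sees the holdout rows during fitting, the holdout error is a fairer estimate of how well the line would predict new counties than the training error is.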
- The research topic “Human Impacts on Air Pollution” addresses one of the NGSS core ideas in Earth and Space Sciences: ESS3.C: Human Impacts on Earth Systems.
- The statistical concepts used in this workshop align well with the NC Mathematics State Core Standards, including “Summarize, represent, and interpret data on two categorical and quantitative variables” and “Interpret linear models.”
- This topic is complementary to other ongoing data science curriculum development efforts, such as the workshops developed by ODI.
- U.S. county air quality data (five pollutants) and population estimates: the air quality data come from the EPA Historical Air Quality database, and the population estimates come from the U.S. Census Bureau.
- Data exploration interface: CODAP with Shodor Interactivate Histogram as a plug-in.
- Data modeling interface: Jupyter notebook (Google Colab).
- Other materials for hands-on activities that explore air pollutants and the decision tree model.
- Google products to manage real-time feedback (Forms), collaboration (Docs and Sheets), data analysis tools (CODAP files), and presentations and demos.
Students explore parameters of a linear model and the resulting mean squared errors.
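
The kind of exploration described here can be illustrated with a short sketch: try a few candidate slope and intercept values and compare the mean squared error each one produces. The data points and candidate parameters below are made up for illustration, and the helper name `mse_for` is hypothetical rather than taken from the workshop materials.

```python
def mse_for(slope, intercept, xs, ys):
    """Mean squared error of the line y = slope * x + intercept on (xs, ys)."""
    squared_errors = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / len(squared_errors)

# Toy data that lie exactly on y = 2x + 1 (illustrative only, not workshop data).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

# Candidate (slope, intercept) pairs a student might try by hand.
candidates = [(1, 0), (2, 0), (2, 1), (3, 1)]

results = {params: mse_for(params[0], params[1], xs, ys) for params in candidates}
for (slope, intercept), error in results.items():
    print(f"slope={slope}, intercept={intercept}: MSE={error}")

# The pair with the smallest MSE is the best fit among the candidates.
best = min(results, key=results.get)
print("best candidate:", best)
```

Real data will not sit exactly on a line, so the smallest achievable error is usually above zero. The comparison still works the same way: a lower mean squared error means the line passes closer to the points overall.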
