32 changes: 0 additions & 32 deletions README.md
@@ -1,32 +0,0 @@
# United States Temperature and Climate Analysis
#### CS179G - Project in Databases
### Created by Josh Pennington, Yixuan Shang, Ojasvi Godha, Adreyan Distor and Salma Ahmed

In this project, we aim to analyze weather data across the United States using the Global Historical Climatology Network – Daily (GHCN-Daily) dataset. We want to determine which major region of the US has warmed most noticeably over the years.

Our dataset includes daily records from various weather stations, including maximum/minimum temperatures, precise geolocation metadata (latitude, longitude, and elevation) and timestamped entries dating back over a century.

We plan to group the weather stations by longitude and latitude, splitting the US into three regions: West, Central, and East. We will analyze the temperature trend in each region to determine which is warming fastest. Our hypothesis is that the West region will be the most affected.
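As a rough illustration of the regional split, the sketch below assigns a station to a region from its longitude alone. The two cut-off values and the function name `region_for_longitude` are assumptions made for this example; the project does not state its actual boundaries here.

```bash
#!/bin/bash
# Hypothetical longitude cut-offs in degrees east (assumed, not taken from the project).
WEST_MAX=-104
CENTRAL_MAX=-90

# Print West, Central, or East for a station longitude.
# awk is used because the shell has no built-in floating-point comparison.
region_for_longitude() {
    awk -v lon="$1" -v west="$WEST_MAX" -v central="$CENTRAL_MAX" 'BEGIN {
        if (lon < west)         print "West"
        else if (lon < central) print "Central"
        else                    print "East"
    }'
}

region_for_longitude -118.24   # a longitude on the Pacific coast
```

A latitude check could be added in the same way if the regions are not simple vertical bands.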

Additionally, building on this analysis, we plan to develop a linear regression model in Apache Spark that predicts daily weather measurements from three features: longitude, latitude, and date. This model will let us estimate the temperature at any location in the US on a given date, based on spatial and temporal patterns in the historical data.


## About the Dataset

Our main dataset is a subset of the Global Historical Climatology Network; specifically, we use only part of the United States weather data. The detailed README file for GHCN can be found [here](https://www.ncei.noaa.gov/pub/data/ghcn/daily/readme.txt). We reduced the raw dataset to four columns: ID, DATE, ELEMENT, and VALUE. Their formats and definitions are as follows:

ID 11 characters. The station identification code which can be linked to the station in "stations.txt".

DATE 8 characters. The date on which the observation was recorded (YYYYMMDD).

ELEMENT 4 characters. We use the following three element types:

PRCP = Precipitation (tenths of mm)
TMAX = Maximum temperature (tenths of degrees C)
TMIN = Minimum temperature (tenths of degrees C)

VALUE Integer. The recorded value of the element on this particular day.
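
Because VALUE is stored in tenths of a unit, it has to be divided by 10 before it reads as degrees Celsius or millimetres. The sketch below shows one way to do that for a single record. It assumes a comma-separated line in the column order listed above (ID, DATE, ELEMENT, VALUE); the function name `describe_record` and the sample record are made up for illustration.

```bash
#!/bin/bash
# Print one record in readable form, converting VALUE from tenths to whole units.
# Assumes the input is a single line of the form "ID,DATE,ELEMENT,VALUE".
describe_record() {
    printf '%s\n' "$1" | awk -F',' '{
        value = $4
        unit = ""
        if ($3 == "TMAX" || $3 == "TMIN") {
            value = sprintf("%.1f", $4 / 10)   # tenths of degrees C -> degrees C
            unit = " C"
        } else if ($3 == "PRCP") {
            value = sprintf("%.1f", $4 / 10)   # tenths of mm -> mm
            unit = " mm"
        }
        print $1, $2, $3, value unit
    }'
}

# Made-up sample record, not real station data.
describe_record "US000000001,20200715,TMAX,256"
```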

## How to run the code:
System requirements: (list)
spark-submit
73 changes: 73 additions & 0 deletions dbinterface.sh
@@ -0,0 +1,73 @@
#!/bin/bash

DATADIR="$HOME/mysql/data"
SOCKET="$HOME/mysql.sock"
SQLUSER="root"
DB_NAME="ustca"
TABLE_NAME="observation"
FOLDER="$HOME/data/observations"


case "$1" in
startdb)
echo 'startdb'
mysqld --datadir="$DATADIR" --socket="$SOCKET" --local-infile=1 --port=3307 --innodb-buffer-pool-size=1G &
;;
stopdb)
echo 'stopdb'
mysqladmin --socket="$SOCKET" -u "$SQLUSER" shutdown
;;
load)
echo 'load'
echo 'Create tables'
echo -n "Proceed? [y/n]: "
read -r ans
if [[ "$ans" == 'y' ]] || [[ "$ans" = 'Y' ]]; then
echo 'ARE YOU SURE?'
echo -n 'WARNING: DOING THIS WILL DELETE ALL EXISTING DATA [y/n]: '
read -r ans
if [[ "$ans" == 'y' ]] || [[ "$ans" = 'Y' ]]; then
mysql --local-infile=1 < sql/load_data.sql
echo 'created tables'
fi
fi

echo 'Load observations'
echo -n "Could take up to 1hr. Proceed? [y/n]: "
read -r ans
if [[ "$ans" == 'y' ]] || [[ "$ans" = 'Y' ]]; then
echo -n 'Confirm? [y/n]: '
read -r ans
if [[ "$ans" == 'y' ]] || [[ "$ans" = 'Y' ]]; then

for file in "$FOLDER"/*.csv; do
echo "Loading $file..."
mysql --local-infile=1 "$DB_NAME" -e "
LOAD DATA LOCAL INFILE '$file'
INTO TABLE $TABLE_NAME
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '\"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(@id, @element, @value, @date)
SET
id = @id,
element = @element,
data = NULLIF(@value, ''),
date = @date;
"
done

mysql "$DB_NAME" -e "
DELETE FROM $TABLE_NAME where date = 0;
"

echo 'loaded observations'
fi
fi
;;
*)
echo "Usage: source dbinterface.sh {startdb|stopdb|load}"
;;
esac
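
The `load` branch repeats the same prompt-and-compare logic four times. One possible way to factor it out is sketched below. The `confirm` helper is hypothetical and is not part of dbinterface.sh; it accepts only `y` or `Y`, matching the checks in the script.

```bash
#!/bin/bash
# Hypothetical helper (not in dbinterface.sh): ask a yes/no question.
# Returns 0 only when the answer read from stdin is "y" or "Y".
confirm() {
    printf '%s [y/n]: ' "$1"
    read -r ans
    [ "$ans" = "y" ] || [ "$ans" = "Y" ]
}

# Usage sketch: both prompts must be accepted before the destructive step runs.
# if confirm 'Create tables?' && confirm 'This deletes all existing data. Continue?'; then
#     mysql --local-infile=1 < sql/load_data.sql
# fi
```

Keeping the double confirmation as two chained calls preserves the script's current behavior while removing the nested `if` blocks.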

74 changes: 0 additions & 74 deletions ghcn/ghcnd-states.txt

This file was deleted.
