GitHub - CB-LBG/L5-DE-Hackathon-3

Hackathon Part A – Pipeline Design Objectives

In this exercise, you will:

Design and implement a modular Python data pipeline to process multiple related datasets.
Use Pandas and other tools to ingest, clean, transform, and merge real-world data.
Generate a minute-by-minute temperature estimate for 2012 using a model based on daily weather summaries.
Document your pipeline logic, assumptions, and outputs clearly for reuse by other developers.

Overview

You have been provided with two zipped files in the folder PartA_Data. The first contains data about the time of sunrise, sunset and the length of the day in Edinburgh. The second contains weather data from a weather station in Scotland.

Your goal is to build a Python data pipeline that merges these sources, models temperature changes over time, and outputs a synthetic dataset estimating temperature every minute for 2012. This data will later be used to analyse electronic equipment performance under environmental stress. Pipeline Structure

Your repository should contain the following:

/data-pipeline-hackathon/ │ ├── data/ │ ├── Edinburgh-daytime.xlsx │ └── Strathspey-weather.xlsx │ ├── pipeline/ │ ├── ingest.py │ ├── clean.py │ ├── transform.py │ ├── merge.py │ └── resample.py │ ├── output/ │ └── merged_summary.csv │ ├── main.py └── requirements.txt

Each module should perform a clear function:

ingest.py – reads Excel data, handling multiple worksheets dynamically.
clean.py – removes empty rows, fixes headers, and standardises column names.
transform.py – converts date fields and derives useful numerical values.
merge.py – joins astronomical and weather data on date.
resample.py – interpolates daily data to one-minute intervals using a sinusoidal temperature model.

Temperature Estimation Model

To estimate temperatures between the daily minimum and maximum, use a sinusoidal model:

[ T(t) = T_{avg} + A \cdot \sin\left(\frac{\pi (t - t_{peak})}{12}\right) ]

Where:

( T_{avg} ) is the average of daily min and max temperature.
( A ) is half the difference between max and min.
( t_{peak} ) is the hour of peak temperature, typically around 15:00.

This generates a smooth temperature curve that peaks mid-afternoon and reaches its minimum near dawn. Output

Your final dataset should:

Contain 527,040 rows (one per minute from midnight 31 December 2011 to 23:59 on 31 December 2012).
Include columns for datetime, estimated_temp, and optionally metadata such as Location.

Deliverables

Working pipeline – all stages functional and reproducible.
Test process – confirm row counts and data consistency.
Code quality – meaningful variable names and clear comments.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
config		config
data		data
pipeline		pipeline
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages