diff --git a/lessons/07_AI_agents/05_github_copilot.md b/lessons/07_AI_agents/05_github_copilot.md new file mode 100644 index 0000000..9711b3c --- /dev/null +++ b/lessons/07_AI_agents/05_github_copilot.md @@ -0,0 +1,65 @@ +# Pair Programming Demo with GitHub Copilot +In this practical demo, you will use GitHub Copilot to help you evaluate and improve a code project. + +## What is GitHub Copilot? +GitHub Copilot is a pair-programming agent that can look over an entire code base (not just single files) and help you write, fix, and improve code. It works with VS Code via extensions and can be surprisingly effective at debugging and "understanding" a code base, making it a powerful tool for developers. This is one practical demonstration of agentic AI in action, and a direction in which development tools will likely continue to move. + +There are other similar tools out there, but GitHub Copilot integrates nicely with VS Code, is easy to set up and use, and has a free tier that works well for demonstration purposes. + +## Our project +For our demo we'll work with a simple package that includes just a few functions and corresponding tests. The project, called `mini-etl` (located in `resources/`), performs basic ETL (Extract, Transform, Load) operations on CSV files. More specifically, given a CSV file with a timestamp column and a sales amount column, it has some simple column cleanup functions, a function to aggregate sales by date, and a "pipeline" function that ties everything together. + +The details of the project aren't that important; what is important is that the code is *broken*, and we need some help fixing it. + +There are five tests that are supposed to validate the functionality of the code (in the project's `tests/` directory), but currently all five tests are failing. Your task is to use GitHub Copilot to help you fix the code so that all tests pass. 
Additionally, you will ask GitHub Copilot to generate a Jupyter notebook that demonstrates the functionality of the `mini-etl` project. + +> We recommend creating a copy of the `mini-etl` project before taking the following steps, as GitHub Copilot will be editing the code directly, and you might want to have the original code for reference later. You can always revert changes via git if you are using version control, but making a copy is often easier for demos like this. + +### First steps: setup and get tests passing +1. Sign up for GitHub Copilot online using your GitHub account (there is a free version that gives you a limited number of requests each month, which will be plenty for this demo). +2. In your VS Code IDE, install and enable the following extensions: `GitHub Copilot` and `GitHub Copilot Chat`. +3. Open your terminal, make sure your virtual environment for Python 200 is activated, and navigate to the `mini-etl` project directory. Try running the main script: `python mini_etl.py`. It will generate errors. Also, run the test suite: `python -m pytest -q`. This package is a mess! +4. Hit `Ctrl+Alt+I` to open the interactive Copilot chat window (it will pop up on the right) and set it to "agent" mode. Then, at the bottom, you can select which LLM to use (e.g., Claude Haiku 4.5 is excellent). There will be a message box at the bottom that says "Describe what to build next". See the attached screenshot. + +![copilot_agent_mode](resources/copilot_agent_mode.jpg) + +5. Using your prompt-engineering skills, ask GitHub Copilot to fix the Python package. Something like: + +``` + You are in a small Python repo for creating simple ETL operations, and there are failing tests. + + Your task: make `python -m pytest -q` pass by editing `mini_etl.py` only. + Rules: + - Do NOT change anything in `tests/`. + - After each change, re-run tests and continue until all pass. 
+``` + +Once you hit enter, GitHub Copilot will start analyzing your code and making suggestions. Along the way it may ask for permission to edit files and take other actions: it is quite interactive! You can accept or reject the suggestions as they come in; continue this process until all tests pass. +6. It may take a while to finish. Once it says it has finished, go ahead and run the tests to confirm they all pass: `python -m pytest -q`. Also, run the main script to see that it works: `python mini_etl.py`. + +### Next, have it write a demo +Once the tests are passing, you can ask GitHub Copilot to generate a Jupyter notebook that demonstrates the functionality of the `mini-etl` package. In the same chat window, you can type something like: + +``` + Now, please create a Jupyter notebook called `mini_etl_demo.ipynb` that demonstrates how to use the `mini-etl` package. The notebook should include: + - A gentle introduction to the package that explains its purpose + - Examples of how to use each function in the package + - A demonstration of the full ETL pipeline using sample data +``` +Try opening the generated notebook in Jupyter and see if the code cells run. Are the explanations helpful and clear? If there are problems, does the agent fix them when you tell it what went wrong? + +Feel free to explore the package, and continue tweaking it with GitHub Copilot's help, adding and improving functionality as you see fit! + +## Discussion +This demo is meant to showcase the potential of AI agents like GitHub Copilot in assisting with software development tasks. It moves well beyond one-line code completion or asking an LLM questions about isolated code snippets. Rather, the agent is given an entire *project* as context and is able to run tests and rewrite code to make meaningful improvements to that project. 
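To make this concrete: much of the repair in `mini-etl` comes down to pandas' `errors="coerce"` behavior, which turns unparseable values into missing values instead of raising. A standalone sketch of that idea (not necessarily the exact fix Copilot will produce):

```python
import pandas as pd

# Messy currency strings, like the "amount" column in data_sample.csv
s = pd.Series([" $1,200.50 ", "9", "bad"])

cleaned = pd.to_numeric(
    s.str.strip().str.replace(r"[$,]", "", regex=True),  # drop "$" and ","
    errors="coerce",  # unparseable values become NaN instead of raising
)
print(cleaned.tolist())  # [1200.5, 9.0, nan]
```

The same `errors="coerce"` idiom applies to `pd.to_datetime`, where bad timestamps become `NaT`.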
In addition to helping debug or develop new tools, it can also help with generating documentation and demo materials, as we saw with the Jupyter notebook generation. + +One thing to consider is that the mini-etl project was intentionally kept *very* small and simple, partly so we would stay well within the limitations of GitHub Copilot's free tier. + +Things may not be as neat and tidy when working on a huge, sprawling code base with multiple subcomponents. Also, consider just how easy it would be to simply accept every change that GitHub Copilot suggests without really understanding what is going on. If some of those changes cause problems downstream, you will be poorly equipped to diagnose and fix them. + +Especially when working with large-scale, project-wide code changes, it is extremely important to review and understand AI-generated code before accepting it. *Always treat it as a **first draft** from a junior associate that needs to be carefully reviewed and tested.* This is doubly true for code in production settings where security, performance, and reliability are critical. + +There is a reason that we went through this demo nearly last in the Python data engineering sequence: at this point you have a strong foundation in Python, debugging, and general software development practices, so you can critically evaluate the code that tools like GitHub Copilot generate. As AI agents become more prevalent in software development, the ability to critically evaluate AI-generated code will be increasingly important. 
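Since we recommended keeping the original code for reference, it is worth knowing the minimal git workflow for reviewing, and if necessary rejecting, an agent's edits. A sketch in a throwaway repo (file contents and paths here are purely illustrative):

```shell
# Set up a throwaway repo standing in for mini-etl
cd "$(mktemp -d)"
git init -q
printf 'original\n' > mini_etl.py
git add mini_etl.py
git -c user.name=demo -c user.email=demo@example.com \
    commit -qm "baseline before agent session"

# ...the agent rewrites the file...
printf 'agent rewrite\n' > mini_etl.py

git diff --stat              # which files changed, and by how much
git diff mini_etl.py         # review the edits line by line
git checkout -- mini_etl.py  # reject them: restore the committed version
cat mini_etl.py              # prints "original"
```

For this demo a plain copy of the project works just as well; git simply makes the before/after comparison explicit and lets you accept or reject edits file by file.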
+ diff --git a/lessons/07_AI_agents/resources/copilot_agent_mode.jpg b/lessons/07_AI_agents/resources/copilot_agent_mode.jpg new file mode 100644 index 0000000..d3cdf48 Binary files /dev/null and b/lessons/07_AI_agents/resources/copilot_agent_mode.jpg differ diff --git a/lessons/07_AI_agents/resources/mini-etl/README.md b/lessons/07_AI_agents/resources/mini-etl/README.md new file mode 100644 index 0000000..15b50e1 --- /dev/null +++ b/lessons/07_AI_agents/resources/mini-etl/README.md @@ -0,0 +1,22 @@ +# mini-etl module + +This is a tiny module for experimenting with GitHub Copilot. + +The goal is to make the tests pass. + +## Run tests +From this folder: + + python -m pytest -q + +Or if you prefer very quiet tests: + + python -m pytest -q --disable-warnings --tb=no + +## Run the script +From your terminal, inside the folder: + + python mini_etl.py + +This will run the mini ETL pipeline on the sample CSV file `data_sample.csv` and print the daily summary to the console. The first time you run it, you should see errors. The goal is to use GitHub Copilot to help you fix things. See the README for the lesson for more details. + diff --git a/lessons/07_AI_agents/resources/mini-etl/data_sample.csv b/lessons/07_AI_agents/resources/mini-etl/data_sample.csv new file mode 100644 index 0000000..3c10602 --- /dev/null +++ b/lessons/07_AI_agents/resources/mini-etl/data_sample.csv @@ -0,0 +1,7 @@ +timestamp,amount,notes +2024-01-01 09:00:00," $1,200.50 ",big order +2024-01-01 10:00:00,"9",small order +2024-01-01 11:00:00,"-5.00",refund +not-a-date,"$7.00",bad timestamp should be dropped +2024-01-02 08:30:00,"3.00",ok +2024-01-02 12:00:00,"$2.00",ok diff --git a/lessons/07_AI_agents/resources/mini-etl/mini_etl.py b/lessons/07_AI_agents/resources/mini-etl/mini_etl.py new file mode 100644 index 0000000..0651661 --- /dev/null +++ b/lessons/07_AI_agents/resources/mini-etl/mini_etl.py @@ -0,0 +1,81 @@ +"""mini_etl.py + +A small ETL module. 
+""" +from dataclasses import dataclass +from pathlib import Path +import pandas as pd + + +def clean_amount_series(s: pd.Series) -> pd.Series: + """Clean a currency amount column. + + - strip whitespace + - remove "$" and "," characters + - convert to float + - invalid values become NaN + """ + s = s.astype("string").str.strip() + s = s.str.replace("$", "", regex=False) + + return pd.to_numeric(s, errors="coerce") + + +def parse_timestamp_series(s: pd.Series) -> pd.Series: + """Parse a timestamp column into pandas datetimes. + + - pd.to_datetime with errors="coerce" + - invalid values become NaT + """ + return pd.to_datetime(s, errors="ignore") + + +def summarize_daily(df: pd.DataFrame) -> pd.DataFrame: + """Summarize data by day. + + Expected behavior + - input df has columns "timestamp" (datetime) and "amount" (float) + - create "date" from timestamp (date only) + - group by date + - output columns: date, total_amount, num_rows + - total_amount: sum of amount (NaNs ignored) + - num_rows: number of rows in that date group (including rows with NaN amounts) + """ + out = df.copy() + out["date"] = out["timestamp"].dt.date + + grouped = out.groupby("date", as_index=False).agg( + total_amount=("amount", "sum"), + num_rows=("amount", "count"), # BUG: counts non-NaN only + ) + + return grouped.sort_values("date").reset_index(drop=True) + + +def run_pipeline(csv_path: str | Path) -> pd.DataFrame: + """ + Run all of our functions (mini pipeline) on a CSV: + First, read csv given by `csv_path`, then run our functions on the data: + + parse_time_stamp_series -> clean_amount_series -> summarize_daily + + """ + df = pd.read_csv(csv_path) + + # Normalize column names a tiny bit + df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns] + + df["timestamp"] = parse_timestamp_series(df["timestamp"]) + df["amount"] = clean_amount_series(df["amount"]) + + # Drop rows where timestamp is missing (keeps the summary simple) + df = df.dropna(subset=["timestamp"]) + + 
return summarize_daily(df) + + +if __name__ == "__main__": + # Simple playground: edit this path if you want to try a different file. + csv_path = "data_sample.csv" + summary = run_pipeline(csv_path) + print(summary.to_string(index=False)) # don't print index column diff --git a/lessons/07_AI_agents/resources/mini-etl/tests/test_mini_etl.py b/lessons/07_AI_agents/resources/mini-etl/tests/test_mini_etl.py new file mode 100644 index 0000000..2f74323 --- /dev/null +++ b/lessons/07_AI_agents/resources/mini-etl/tests/test_mini_etl.py @@ -0,0 +1,116 @@ +import math +from pathlib import Path + +import pandas as pd +import pytest + +import mini_etl + + +def test_clean_amount_series_strips_dollar_commas_whitespace(): + # Input: [" $1,200.50 ", "9", "bad"] + # Output should become: [1200.50, 9.0, NaN] + s = pd.Series([" $1,200.50 ", "9", "bad"]) + out = mini_etl.clean_amount_series(s) + + assert out.iloc[0] == pytest.approx(1200.50) + assert out.iloc[1] == pytest.approx(9.0) + assert math.isnan(out.iloc[2]) + + +def test_parse_timestamp_series_coerces_invalid_to_nat(): + """ + parse_timestamp_series should turn timestamp strings into real datetimes. + + Two things should be true: + 1) The result should be a datetime-typed Series (not a Series of strings). + 2) Bad timestamps should not crash the function. They should become "missing". + + Pandas represents a missing datetime value as NaT ("Not a Time"). + pd.isna(...) returns True for NaT. 
+ """ + s = pd.Series(["2024-01-01 10:00:00", "not-a-date"]) + out = mini_etl.parse_timestamp_series(s) + + # 1) The whole Series should be datetime dtype (so we can use .dt later) + assert str(out.dtype).startswith("datetime64"), f"Expected datetime dtype, got {out.dtype}" + + # 2) The invalid timestamp should become missing (NaT) + assert pd.isna(out.iloc[1]) + + +def test_summarize_daily_counts_rows_including_nan_amounts(): + # Input rows: + # 2024-01-01 10:00:00 amount=10.0 + # 2024-01-01 12:00:00 amount=NaN + # 2024-01-02 09:00:00 amount=5.0 + # + # Expected daily summary: + # 2024-01-01: total_amount=10.0, num_rows=2 (NaN ignored in sum, but row still counted) + # 2024-01-02: total_amount=5.0, num_rows=1 + df = pd.DataFrame( + { + "timestamp": pd.to_datetime( + ["2024-01-01 10:00:00", "2024-01-01 12:00:00", "2024-01-02 09:00:00"] + ), + "amount": [10.0, float("nan"), 5.0], + } + ) + + out = mini_etl.summarize_daily(df) + + row_0101 = out[out["date"] == pd.to_datetime("2024-01-01").date()].iloc[0] + assert row_0101["total_amount"] == pytest.approx(10.0) + assert row_0101["num_rows"] == 2 # counts rows, even if amount is NaN + + + +def test_run_pipeline_smoke(tmp_path): + # The pipeline should run end-to-end and return columns: + # ["date", "total_amount", "num_rows"] + here = Path(__file__).resolve().parent.parent + sample = here / "data_sample.csv" + dst = tmp_path / "data_sample.csv" + dst.write_text(sample.read_text(encoding="utf-8"), encoding="utf-8") + + out = mini_etl.run_pipeline(dst) + + # Expected columns + assert list(out.columns) == ["date", "total_amount", "num_rows"] + + # Expected dates (sorted) + assert list(out["date"]) == [ + pd.to_datetime("2024-01-01").date(), + pd.to_datetime("2024-01-02").date(), + ] + + # Expected number of summary rows + assert len(out) == 2 + + +def test_run_pipeline_expected_totals(tmp_path): + # Expected daily summary for the included sample CSV: + # + # 2024-01-01: 1200.50 + 9.00 - 5.00 = 1204.50 (3 rows) + # 
2024-01-02: 3.00 + 2.00 = 5.00 (2 rows) + # + # Note: the row with "not-a-date" should be dropped by the pipeline. + here = Path(__file__).resolve().parent.parent + sample = here / "data_sample.csv" + dst = tmp_path / "data_sample.csv" + dst.write_text(sample.read_text(encoding="utf-8"), encoding="utf-8") + + out = mini_etl.run_pipeline(dst) + + expected = pd.DataFrame( + { + "date": [ + pd.to_datetime("2024-01-01").date(), + pd.to_datetime("2024-01-02").date(), + ], + "total_amount": [1204.50, 5.00], + "num_rows": [3, 2], + } + ) + + pd.testing.assert_frame_equal(out, expected)