abhayra12/Docker-Demo


Docker and PostgreSQL: Data Engineering Workshop

In this workshop, we will explore Docker fundamentals and data engineering workflows using Docker containers.

We will cover:

  • Introduction to Docker and containerization
  • Running PostgreSQL in a Docker container
  • Data ingestion into PostgreSQL
  • Working with pgAdmin for database management
  • Docker networking and port mapping
  • Docker Compose for multi-container applications
  • Creating a data ingestion pipeline
  • SQL refresher with real-world data
  • Best practices for containerized data engineering workflows

Prerequisites

  • Basic understanding of Python
  • Basic SQL knowledge (helpful but not required)
  • Docker and Python installed on your machine
  • Git (optional)

Introduction to Docker

Docker is containerization software that isolates applications much like virtual machines do, but with far less overhead.

A Docker image is a snapshot of everything needed to run our software, or in this case our data pipelines. By pushing our Docker images to a registry, we can run the same containers on cloud providers such as Amazon Web Services or Google Cloud Platform.

Why Docker?

Docker provides the following advantages:

  • Reproducibility: Same environment everywhere
  • Isolation: Applications run independently
  • Portability: Run anywhere Docker is installed

Containers are used in many situations:

  • Integration tests: CI/CD pipelines
  • Running pipelines on the cloud: AWS Batch, Kubernetes jobs
  • Spark: Analytics engine for large-scale data processing
  • Serverless: AWS Lambda, Google Functions

Basic Docker Commands

Check Docker version:

docker --version

Run a simple container:

docker run hello-world

Run something more complex:

docker run ubuntu

Nothing happens. We need to run it in interactive mode with a terminal attached (-it):

docker run -it ubuntu

If Python isn't installed there, let's install it:

apt update && apt install python3
python3 -V

Docker image vs container:

  • Image: Blueprint for creating containers
  • Container: Running instance of an image

Important: Docker containers are stateless - any changes done inside a container will NOT be saved when the container is killed and started again.

When you exit the container and use it again, the changes are gone:

docker run -it ubuntu
python3 -V

This is good, because it doesn't affect your host system. Let's say you do something crazy like this:

docker run -it ubuntu
rm -rf / # don't run it on your computer! 

Next time we run it, all the files are back.

But this is not the whole story. The state of a stopped container is still kept on disk until the container is removed. We can see stopped containers:

docker ps -a

We can restart one of them, but we won't do it, because it's not a good practice. They take space, so let's delete them:

docker rm `docker ps -aq`

Next time we run something, we add --rm:

docker run -it --rm ubuntu

There are other base images besides hello-world and ubuntu. For example, Python:

docker run -it --rm python:3.13.11
# add -slim to get a smaller version

This one starts the Python interpreter. If we want bash, we need to override the entrypoint:

docker run -it \
    --rm \
    --entrypoint=bash \
    python:3.13.11-slim

So, we know that with Docker we can restore any container to its initial state in a reproducible manner. But what about data that should outlive the container? A common way to keep it is with volumes.

Let's create some data in test:

mkdir test
cd test
touch file1.txt file2.txt file3.txt
echo "Hello from host" > file1.txt
cd ..

Now let's create a simple script test/list_files.py that shows the files in the folder:

from pathlib import Path

current_dir = Path.cwd()
current_file = Path(__file__).name

print(f"Files in {current_dir}:")

for filepath in current_dir.iterdir():
    if filepath.name == current_file:
        continue

    print(f"  - {filepath.name}")

    if filepath.is_file():
        content = filepath.read_text(encoding='utf-8')
        print(f"    Content: {content}")

Now let's map this to a Python container:

docker run -it \
    --rm \
    -v $(pwd)/test:/app/test \
    --entrypoint=bash \
    python:3.13.11-slim

Inside the container, run:

cd /app/test
ls -la
cat file1.txt
python list_files.py

You'll see the files from your host machine are accessible in the container!
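
The -v flag takes a host_path:container_path pair, and the host side should be an absolute path, which is why the command uses $(pwd). As a rough sketch, the same command could be assembled from Python. The helper name build_run_command is hypothetical, and it only builds the argument list; it does not start Docker:

```python
from pathlib import Path


def build_run_command(host_dir: str, container_dir: str, image: str) -> list[str]:
    """Build (but do not run) a `docker run` command with a bind mount."""
    # Resolve to an absolute path, the Python counterpart of $(pwd)/test
    host_path = Path(host_dir).resolve()
    return [
        "docker", "run", "-it", "--rm",
        "-v", f"{host_path}:{container_dir}",
        "--entrypoint=bash",
        image,
    ]


cmd = build_run_command("test", "/app/test", "python:3.13.11-slim")
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run if you wanted to launch the container from a script.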

Virtual environment and Data Pipelines

A data pipeline is a service that receives data as input and outputs more data. For example, reading a CSV file, transforming the data somehow and storing it as a table in a PostgreSQL database.

graph LR
    A[CSV File] --> B[Data Pipeline]
    B --> C[Parquet File]
    B --> D[PostgreSQL Database]
    B --> E[Data Warehouse]
    style B fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff

We'll build pipelines that:

  • Download CSV data from the web
  • Transform and clean the data with pandas
  • Load it into PostgreSQL for querying
  • Process data in chunks to handle large files

Let's create an example pipeline. First, create a directory pipeline and inside, create a file pipeline.py:

import sys
print("arguments", sys.argv)

day = int(sys.argv[1])
print(f"Running pipeline for day {day}")
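
The script above fails with an IndexError or ValueError if the argument is missing or not a number. A minimal sketch of a more defensive version follows; the function name parse_day and the error messages are assumptions, not part of the workshop code:

```python
def parse_day(argv: list[str]) -> int:
    """Return the day passed as the first command-line argument."""
    if len(argv) < 2:
        # No argument after the script name
        raise SystemExit("usage: python pipeline.py <day>")
    try:
        return int(argv[1])
    except ValueError:
        raise SystemExit(f"day must be an integer, got {argv[1]!r}")


print(parse_day(["pipeline.py", "10"]))
```

In pipeline.py you would call parse_day(sys.argv) instead of indexing sys.argv directly.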

Now let's add pandas:

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
print(df.head())

df.to_parquet(f"output_day_{sys.argv[1]}.parquet")

We need pandas, but we don't have it. We want to test it before we run things in a container.

We can install it with pip:

pip install pandas pyarrow

But this installs it globally on your system. This can cause conflicts if different projects need different versions of the same package.

Instead, we want to use a virtual environment - an isolated Python environment that keeps dependencies for this project separate from other projects and from your system Python.

We'll use uv - a modern, fast Python package and project manager written in Rust. It's much faster than pip and handles virtual environments automatically.

pip install uv

Now initialize a Python project with uv:

uv init --python=3.13

This creates a pyproject.toml file for managing dependencies and a .python-version file.

Compare the Python versions:

uv run which python  # Python in the virtual environment
uv run python -V

which python        # System Python
python -V

You'll see they're different - uv run uses the isolated environment.
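
The same check can be made from inside Python. This small sketch (the function name in_virtualenv is just illustrative) relies on the fact that, in a virtual environment, sys.prefix differs from sys.base_prefix:

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a virtual environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("virtual environment:", in_virtualenv())
```

Saved to a file (for example a hypothetical check_env.py), it should report a virtual environment when started with uv run python, and typically not when started with the system python.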

Now let's add pandas:

uv add pandas pyarrow

This adds pandas and pyarrow to your pyproject.toml and installs them in the virtual environment.

Now we can execute this file:

uv run python pipeline.py 10

We will see the following lines, followed by the printed DataFrame:

  • arguments ['pipeline.py', '10']
  • Running pipeline for day 10

This script produces a binary (parquet) file, so let's make sure we don't accidentally commit it to git by adding parquet extensions to .gitignore:

*.parquet

Dockerizing the Pipeline

Now let's containerize the script. Create the following Dockerfile:

# base Docker image that we will build on
FROM python:3.13.11-slim

# set up our image by installing prerequisites; pandas in this case
RUN pip install pandas pyarrow

# set up the working directory inside the container
WORKDIR /app
# copy the script to the container. 1st name is source file, 2nd is destination
COPY pipeline.py pipeline.py

# define what to do first when the container runs
# in this example, we will just run the script
ENTRYPOINT ["python", "pipeline.py"]

Explanation:

  • FROM: Base image (Python 3.13)
  • RUN: Execute commands during build
  • WORKDIR: Set working directory
  • COPY: Copy files into the image
  • ENTRYPOINT: Default command to run

Let's build the image:

docker build -t test:pandas .

  • The image name will be test and its tag will be pandas. If the tag isn't specified it will default to latest.

We can now run the container and pass an argument to it, so that our pipeline will receive it:

docker run -it test:pandas 10

You should get the same output you did when you ran the pipeline script by itself.

Note: these instructions assume that pipeline.py and Dockerfile are in the same directory. The Docker commands should also be run from the same directory as these files.

What about uv? Let's use it instead of using pip:

# Start with slim Python 3.13 image
FROM python:3.13.11-slim

# Copy uv binary from official uv image (multi-stage build pattern)
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/

# Set working directory
WORKDIR /app

# Add virtual environment to PATH so we can use installed packages
ENV PATH="/app/.venv/bin:$PATH"

# Copy dependency files first (better layer caching)
COPY "pyproject.toml" "uv.lock" ".python-version" ./
# Install dependencies from lock file (ensures reproducible builds)
RUN uv sync --locked

# Copy application code
COPY pipeline.py pipeline.py

# Set entry point
ENTRYPOINT ["python", "pipeline.py"]

Running PostgreSQL with Docker

Now we want to do real data engineering. Let's use a Postgres database for that.

You can run a containerized version of Postgres that doesn't require any installation steps. You only need to provide a few environment variables to it as well as a volume for storing data.

Postgres needs a place to store its data. We will use a named volume called ny_taxi_postgres_data. Here's how to run the container:

docker run -it --rm \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v ny_taxi_postgres_data:/var/lib/postgresql \
  -p 5432:5432 \
  postgres:18

Explanation of parameters:

  • -e sets environment variables (user, password, database name)
  • -v ny_taxi_postgres_data:/var/lib/postgresql creates a named volume
    • Docker manages this volume automatically
    • Data persists even after container is removed
    • Volume is stored in Docker's internal storage
  • -p 5432:5432 maps port 5432 from container to host
  • postgres:18 uses PostgreSQL version 18 (latest as of Dec 2025)
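
Because -p 5432:5432 publishes the port on the host, a quick way to check that the mapping works is to try a TCP connection from Python. This is only a sketch: port_is_open is a hypothetical helper, and an open port does not guarantee that Postgres has finished starting up:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not reachable
        return False


print("Postgres port reachable:", port_is_open("localhost", 5432))
```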

Alternative approach - bind mount:

First create the directory, then map it:

mkdir ny_taxi_postgres_data

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql \
  -p 5432:5432 \
  postgres:18

When you create the directory first, it's owned by your user. If you let Docker create it, it will be owned by the Docker/root user, which can cause permission issues on Linux. On Windows and macOS with Docker Desktop, this is handled automatically.

Named volume vs Bind mount:

  • Named volume (name:/path): Managed by Docker, easier
  • Bind mount (/host/path:/container/path): Direct mapping to host filesystem, more control

Once the container is running, we can log into our database with pgcli.

Install pgcli:

uv add --dev pgcli

The --dev flag marks this as a development dependency (not needed in production). It will be added to the [dependency-groups] section of pyproject.toml instead of the main dependencies section.

Now use it to connect to Postgres:

uv run pgcli -h localhost -p 5432 -u root -d ny_taxi

  • uv run executes a command in the context of the virtual environment
  • -h is the host. Since we're running locally we can use localhost.
  • -p is the port.
  • -u is the username.
  • -d is the database name.
  • The password is not provided; it will be requested after running the command.

When prompted, enter the password: root

Try some SQL commands:

-- List tables
\dt

-- Create a test table
CREATE TABLE test (id INTEGER, name VARCHAR(50));

-- Insert data
INSERT INTO test VALUES (1, 'Hello Docker');

-- Query data
SELECT * FROM test;

-- Exit
\q

NY Taxi Dataset and Data Ingestion

We will now create a Jupyter Notebook notebook.ipynb file which we will use to read a CSV file and export it to Postgres.

Install Jupyter:

uv add --dev jupyter

Let's create a Jupyter notebook to explore the data:

uv run jupyter notebook

We will use data from the NYC TLC Trip Record Data website.

Specifically, we will use the Yellow taxi trip records CSV file for January 2021.

This data used to be published as CSV, but the source later switched to Parquet. We keep using the CSV version because it requires a bit of extra pre-processing, which is useful for learning.

A dictionary to understand each field is available here.

Note: The CSV data is stored as gzipped files. Pandas can read them directly.
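
Pandas infers the compression from the .gz extension, so no extra argument is needed. To illustrate what happens underneath, here is a sketch that reads the header of a gzip-compressed CSV using only the standard library. The sample bytes are made up and far smaller than the real file, and read_header is a hypothetical helper:

```python
import csv
import gzip
import io


def read_header(gz_bytes: bytes) -> list[str]:
    """Return the column names of a gzip-compressed CSV held in memory."""
    # gzip.open accepts a file-like object and can decode it as text
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", newline="") as handle:
        return next(csv.reader(handle))


# Tiny made-up sample; the real file has many more columns and rows
sample = gzip.compress(b"VendorID,trip_distance\n1,2.5\n")
print(read_header(sample))
```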

Explore the Data

Create a new notebook and run:

import pandas as pd

# Read a sample of the data
prefix = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow'
df = pd.read_csv(prefix + '/yellow_tripdata_2021-01.csv.gz', nrows=100)

# Display first rows
df.head()

# Check data types
df.dtypes

# Check data shape
df.shape

When the whole file is read (without nrows), pandas emits a warning:

/tmp/ipykernel_25483/2933316018.py:1: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.

So we need to specify the types:

dtype = {
    "VendorID": "Int64",
    "passenger_count": "Int64",
    "trip_distance": "float64",
    "RatecodeID": "Int64",
    "store_and_fwd_flag": "string",
    "PULocationID": "Int64",
    "DOLocationID": "Int64",
    "payment_type": "Int64",
    "fare_amount": "float64",
    "extra": "float64",
    "mta_tax": "float64",
    "tip_amount": "float64",
    "tolls_amount": "float64",
    "improvement_surcharge": "float64",
    "total_amount": "float64",
    "congestion_surcharge": "float64"
}

parse_dates = [
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime"
]

df = pd.read_csv(
    prefix + '/yellow_tripdata_2021-01.csv.gz',
    nrows=100,
    dtype=dtype,
    parse_dates=parse_dates
)

Ingesting Data into Postgres

In the Jupyter notebook, we create code to:

  1. Download the CSV file
  2. Read it in chunks with pandas
  3. Convert datetime columns
  4. Insert data into PostgreSQL using SQLAlchemy

First, install SQLAlchemy:

uv add sqlalchemy psycopg2-binary

Create engine

from sqlalchemy import create_engine
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
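
The connection string follows the pattern postgresql://user:password@host:port/database. If the password contained characters such as @ or /, they would need to be URL-escaped first. A minimal sketch, where build_pg_url is a hypothetical helper and not part of SQLAlchemy:

```python
from urllib.parse import quote_plus


def build_pg_url(user: str, password: str, host: str, port: int, db: str) -> str:
    """Assemble a PostgreSQL connection URL with escaped credentials."""
    # Escape characters such as "@" or "/" that would otherwise break the URL
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{db}"


print(build_pg_url("root", "root", "localhost", 5432, "ny_taxi"))
```

The returned string can be passed to create_engine in place of the hard-coded URL.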

Get DDL schema for the database:

print(pd.io.sql.get_schema(df, name='yellow_taxi_data', con=engine))

Output:

CREATE TABLE yellow_taxi_data (
    "VendorID" BIGINT, 
    tpep_pickup_datetime TIMESTAMP WITHOUT TIME ZONE, 
    tpep_dropoff_datetime TIMESTAMP WITHOUT TIME ZONE, 
    passenger_count BIGINT, 
    trip_distance FLOAT(53), 
    "RatecodeID" BIGINT, 
    store_and_fwd_flag TEXT, 
    "PULocationID" BIGINT, 
    "DOLocationID" BIGINT, 
    payment_type BIGINT, 
    fare_amount FLOAT(53), 
    extra FLOAT(53), 
    mta_tax FLOAT(53), 
    tip_amount FLOAT(53), 
    tolls_amount FLOAT(53), 
    improvement_surcharge FLOAT(53), 
    total_amount FLOAT(53), 
    congestion_surcharge FLOAT(53)
)

Create the table:

df.head(n=0).to_sql(name='yellow_taxi_data', con=engine, if_exists='replace')

head(n=0) makes sure we only create the table, without adding any data yet.

We don't want to insert all the data at once. Let's do it in batches and use an iterator for that:

df_iter = pd.read_csv(
    ...
    iterator=True,
    chunksize=100000
)

Iterate over it:

for df_chunk in df_iter:
    print(len(df_chunk))
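
Conceptually, chunksize turns one large read into a sequence of smaller batches, with the last batch holding whatever is left over. The following standard-library sketch shows the same idea on a plain range of numbers; it is an illustration of batching, not how pandas implements it:

```python
from itertools import islice
from typing import Iterable, Iterator


def chunks(rows: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


for chunk in chunks(range(10), 4):
    print(len(chunk))
```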

Inserting data:

df_chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')

Putting everything together:

first = True

for df_chunk in df_iter:

    if first:
        # Create table schema (no data)
        df_chunk.head(0).to_sql(
            name="yellow_taxi_data",
            con=engine,
            if_exists="replace"
        )
        first = False
        print("Table created")

    # Insert chunk
    df_chunk.to_sql(
        name="yellow_taxi_data",
        con=engine,
        if_exists="append"
    )

    print("Inserted:", len(df_chunk))

If you don't like using the first flag:

first_chunk = next(df_iter)

first_chunk.head(0).to_sql(
    name="yellow_taxi_data",
    con=engine,
    if_exists="replace"
)

print("Table created")

first_chunk.to_sql(
    name="yellow_taxi_data",
    con=engine,
    if_exists="append"
)

print("Inserted first chunk:", len(first_chunk))

for df_chunk in df_iter:
    df_chunk.to_sql(
        name="yellow_taxi_data",
        con=engine,
        if_exists="append"
    )
    print("Inserted chunk:", len(df_chunk))
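
The pattern in both versions is "recreate the table once, then append every batch". Here is a self-contained sketch of that pattern that swaps Postgres and pandas for the standard library's sqlite3, so it runs without a database server. The table layout and the ingest helper are made up for illustration:

```python
import sqlite3


def ingest(conn: sqlite3.Connection, table: str, batches) -> int:
    """Recreate `table`, then append every batch; return rows inserted."""
    # Counterpart of if_exists="replace" on an empty frame: start from a clean table
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} (id INTEGER, fare REAL)")
    total = 0
    for batch in batches:
        # Counterpart of if_exists="append": add rows to the existing table
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", batch)
        conn.commit()
        total += len(batch)
    return total


conn = sqlite3.connect(":memory:")
batches = [[(1, 9.5), (2, 12.0)], [(3, 7.25)]]
print(ingest(conn, "trips", batches))
print(conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0])
```

Running ingest a second time leaves the row count unchanged, because the table is dropped and recreated before any rows are appended.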

Add tqdm to see progress:

uv add tqdm

Put it around the iterable:

from tqdm.auto import tqdm

for df_chunk in tqdm(df_iter):
    ...

Connect to it using pgcli:

uv run pgcli -h localhost -p 5432 -u root -d ny_taxi

And explore the data.

Creating the Data Ingestion Script

Now let's convert the notebook to a Python script:

uv run jupyter nbconvert --to=script notebook.ipynb
mv notebook.py ingest_data.py

Then clean up the script:

#!/usr/bin/env python
# coding: utf-8

import pandas as pd
from sqlalchemy import create_engine
from tqdm.auto import tqdm

dtype = {
    "VendorID": "Int64",
    "passenger_count": "Int64",
    "trip_distance": "float64",
    "RatecodeID": "Int64",
    "store_and_fwd_flag": "string",
    "PULocationID": "Int64",
    "DOLocationID": "Int64",
    "payment_type": "Int64",
    "fare_amount": "float64",
    "extra": "float64",
    "mta_tax": "float64",
    "tip_amount": "float64",
    "tolls_amount": "float64",
    "improvement_surcharge": "float64",
    "total_amount": "float64",
    "congestion_surcharge": "float64"
}

parse_dates = [
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime"
]


def ingest_data(
        url: str,
        engine,
        target_table: str,
        chunksize: int = 100000,
) -> None:
    df_iter = pd.read_csv(
        url,
        dtype=dtype,
        parse_dates=parse_dates,
        iterator=True,
        chunksize=chunksize
    )

    first_chunk = next(df_iter)

    first_chunk.head(0).to_sql(
        name=target_table,
        con=engine,
        if_exists="replace"
    )

    print(f"Table {target_table} created")

    first_chunk.to_sql(
        name=target_table,
        con=engine,
        if_exists="append"
    )

    print(f"Inserted first chunk: {len(first_chunk)}")

    for df_chunk in tqdm(df_iter):
        df_chunk.to_sql(
            name=target_table,
            con=engine,
            if_exists="append"
        )
        print(f"Inserted chunk: {len(df_chunk)}")

    print(f'done ingesting to {target_table}')

def main():
    pg_user = 'root'
    pg_pass = 'root'
    pg_host = 'localhost'
    pg_port = '5432'
    pg_db = 'ny_taxi'
    year = 2021
    month = 1
    chunksize = 100000
    target_table = 'yellow_taxi_data'

    engine = create_engine(f'postgresql://{pg_user}:{pg_pass}@{pg_host}:{pg_port}/{pg_db}')
    url_prefix = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow'

    url = f'{url_prefix}/yellow_tripdata_{year:04d}-{month:02d}.csv.gz'

    ingest_data(
        url=url,
        engine=engine,
        target_table=target_table,
        chunksize=chunksize
    )

if __name__ == '__main__':
    main()
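
The URL is built from the year and month with zero-padded format specifiers. As a small sketch, the same logic can be isolated in a function with a basic sanity check; build_url and the month validation are assumptions added for illustration:

```python
URL_PREFIX = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow"


def build_url(year: int, month: int) -> str:
    """Return the download URL for one month of yellow taxi data."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be between 1 and 12, got {month}")
    # :04d and :02d pad with zeros, e.g. 2021 and 1 become "2021-01"
    return f"{URL_PREFIX}/yellow_tripdata_{year:04d}-{month:02d}.csv.gz"


print(build_url(2021, 1))
```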

Then use click to parse the command-line arguments. Install it with uv:

uv add click

The result:

import click

...


@click.command()
@click.option('--pg-user', default='root', help='PostgreSQL username')
@click.option('--pg-pass', default='root', help='PostgreSQL password')
@click.option('--pg-host', default='localhost', help='PostgreSQL host')
@click.option('--pg-port', default='5432', help='PostgreSQL port')
@click.option('--pg-db', default='ny_taxi', help='PostgreSQL database name')
@click.option('--year', default=2021, type=int, help='Year of the data')
@click.option('--month', default=1, type=int, help='Month of the data')
@click.option('--chunksize', default=100000, type=int, help='Chunk size for ingestion')
@click.option('--target-table', default='yellow_taxi_data', help='Target table name')
def main(pg_user, pg_pass, pg_host, pg_port, pg_db, year, month, chunksize, target_table):
    ...
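
For comparison only, here is a sketch of a few of the same options expressed with the standard library's argparse. This is not the click version used in the workshop; it just shows how option names such as --target-table map to Python names such as target_table:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create a parser for a subset of the ingestion options."""
    parser = argparse.ArgumentParser(description="Ingest NY taxi data into Postgres")
    parser.add_argument("--pg-user", default="root", help="PostgreSQL username")
    parser.add_argument("--pg-host", default="localhost", help="PostgreSQL host")
    parser.add_argument("--year", default=2021, type=int, help="Year of the data")
    parser.add_argument("--month", default=1, type=int, help="Month of the data")
    parser.add_argument("--target-table", default="yellow_taxi_data", help="Target table name")
    return parser


# Dashes in option names become underscores in the parsed attributes
args = build_parser().parse_args(["--month=2", "--target-table=yellow_taxi_trips"])
print(args.month, args.target_table, args.pg_host)
```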

The script reads data in chunks (100,000 rows at a time) to handle large files efficiently without running out of memory.

Make sure PostgreSQL is running, then execute the ingestion script:

uv run python ingest_data.py \
  --pg-user=root \
  --pg-pass=root \
  --pg-host=localhost \
  --pg-port=5432 \
  --pg-db=ny_taxi \
  --target-table=yellow_taxi_trips \
  --year=2021 \
  --month=1 \
  --chunksize=100000

This will download and ingest the data into your PostgreSQL database.

Verify Data

Connect with pgcli and query the data:

uv run pgcli -h localhost -p 5432 -u root -d ny_taxi

-- Count records (should return 1,369,765 rows)
SELECT COUNT(*) FROM yellow_taxi_trips;

-- View sample data
SELECT * FROM yellow_taxi_trips LIMIT 10;

-- Basic analytics
SELECT 
    DATE(tpep_pickup_datetime) AS pickup_date,
    COUNT(*) AS trips_count,
    AVG(total_amount) AS avg_amount
FROM yellow_taxi_trips
GROUP BY DATE(tpep_pickup_datetime)
ORDER BY pickup_date;
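
To see what the daily aggregation returns without a running Postgres instance, the sketch below runs the same query shape against an in-memory SQLite database from the standard library. The three rows are made up, and SQLite's DATE() is used here as a stand-in for the Postgres function:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (tpep_pickup_datetime TEXT, total_amount REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [
        ("2021-01-01 00:30:10", 10.0),
        ("2021-01-01 18:05:00", 20.0),
        ("2021-01-02 09:15:42", 8.0),
    ],
)

# One output row per pickup date, with the trip count and average amount
daily = conn.execute(
    """
    SELECT DATE(tpep_pickup_datetime) AS pickup_date,
           COUNT(*) AS trips_count,
           AVG(total_amount) AS avg_amount
    FROM trips
    GROUP BY DATE(tpep_pickup_datetime)
    ORDER BY pickup_date
    """
).fetchall()
print(daily)
```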

pgAdmin - Database Management Tool

pgcli is a handy tool but it's cumbersome to use for complex queries and database management. pgAdmin is a web-based tool that makes it more convenient to access and manage our databases.

It's possible to run pgAdmin as a container along with the Postgres container, but both containers will have to be in the same virtual network so that they can find each other.

Run pgAdmin Container

docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -v pgadmin_data:/var/lib/pgadmin \
  -p 8085:80 \
  dpage/pgadmin4

The -v pgadmin_data:/var/lib/pgadmin volume mapping saves pgAdmin settings (server connections, preferences) so you don't have to reconfigure it every time you restart the container.

  • The container needs 2 environment variables: a login email and a password. We use admin@admin.com and root in this example.
  • pgAdmin is a web app and its default port is 80; we map it to 8085 in our localhost to avoid any possible conflicts.
  • The actual image name is dpage/pgadmin4.

Note: This won't work yet because pgAdmin can't see the PostgreSQL container. They need to be on the same Docker network!

Docker Networks

Let's create a virtual Docker network called pg-network:

docker network create pg-network

You can remove the network later with the command docker network rm pg-network. You can look at the existing networks with docker network ls.

Stop both containers and re-run them with the network configuration:

# Run PostgreSQL on the network
docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v ny_taxi_postgres_data:/var/lib/postgresql \
  -p 5432:5432 \
  --network=pg-network \
  --name pgdatabase \
  postgres:18

# In another terminal, run pgAdmin on the same network
docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -v pgadmin_data:/var/lib/pgadmin \
  -p 8085:80 \
  --network=pg-network \
  --name pgadmin \
  dpage/pgadmin4

  • Just like with the Postgres container, we specify a network and a name for pgAdmin.
  • The container names (pgdatabase and pgadmin) allow the containers to find each other within the network.

Connect pgAdmin to PostgreSQL

You should now be able to load pgAdmin on a web browser by browsing to http://localhost:8085. Use the same email and password you used for running the container to log in.

  1. Open browser and go to http://localhost:8085
  2. Login with email: admin@admin.com, password: root
  3. Right-click "Servers" → Register → Server
  4. Configure:
    • General tab: Name: Local Docker
    • Connection tab:
      • Host: pgdatabase (the container name)
      • Port: 5432
      • Username: root
      • Password: root
  5. Save

Now you can explore the database using the pgAdmin interface!

Dockerizing the Ingestion Script

Let's modify the Dockerfile we created before to include our ingest_data.py script:

# Start with slim Python 3.13 image for smaller size
FROM python:3.13.11-slim

# Copy uv binary from official uv image
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/

# Set working directory inside container
WORKDIR /app

# Add virtual environment to PATH
ENV PATH="/app/.venv/bin:$PATH"

# Copy dependency files first (better caching)
COPY "pyproject.toml" "uv.lock" ".python-version" ./
# Install all dependencies (pandas, sqlalchemy, psycopg2)
RUN uv sync --locked

# Copy ingestion script
COPY ingest_data.py ingest_data.py 

# Set entry point to run the ingestion script
ENTRYPOINT [ "python", "ingest_data.py" ]

Explanation:

  • uv sync --locked installs exact versions from uv.lock for reproducibility
  • Dependencies (pandas, sqlalchemy, psycopg2) are already in pyproject.toml
  • Multi-stage build pattern copies uv from official image
  • Copying dependency files before code improves Docker layer caching

Build the Docker Image

docker build -t taxi_ingest:v001 .

Run the Containerized Ingestion

You can drop the table in pgAdmin beforehand if you want, but the script will automatically replace the pre-existing table.

docker run -it \
  --network=pg-network \
  taxi_ingest:v001 \
    --pg-user=root \
    --pg-pass=root \
    --pg-host=pgdatabase \
    --pg-port=5432 \
    --pg-db=ny_taxi \
    --target-table=yellow_taxi_trips_2021_2 \
    --year=2021 \
    --month=2 \
    --chunksize=100000

Important notes:

  • We need to provide the network for Docker to find the Postgres container. It goes before the name of the image.
  • Since Postgres is running on a separate container, the host argument will have to point to the container name of Postgres (pgdatabase).

Docker Compose

docker-compose allows us to launch multiple containers using a single configuration file, so that we don't have to run multiple complex docker run commands separately.

Docker Compose uses YAML files. Here's the docker-compose.yaml file for running the Postgres and pgAdmin containers:

services:
  pgdatabase:
    image: postgres:18
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_taxi
    volumes:
      - "ny_taxi_postgres_data:/var/lib/postgresql:rw"
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    volumes:
      - "pgadmin_data:/var/lib/pgadmin"
    ports:
      - "8085:80"

volumes:
  ny_taxi_postgres_data:
  pgadmin_data:

  • We don't have to specify a network because docker-compose takes care of it: every single container (or "service", as the file states) will run within the same network and will be able to find each other according to their names (pgdatabase and pgadmin in this example).
  • All other details from the docker run commands (environment variables, volumes and ports) are mentioned accordingly in the file following YAML syntax.

Start Services with Docker Compose

We can now run Docker compose by running the following command from the same directory where docker-compose.yaml is found. Make sure that all previous containers aren't running anymore:

docker-compose up

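Note: ny_taxi_postgres_data and pgadmin_data are named volumes declared at the bottom of docker-compose.yaml, so no local directory is required. Docker Compose prefixes volume names with the project name (by default, the name of the directory containing the file), so these are new volumes rather than the ones created by the earlier docker run commands.

Because pgAdmin starts with a fresh volume, you will have to re-create the server connection by following the steps in the pgAdmin section.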

Pressing Ctrl+C stops the containers. To stop and remove them cleanly, use this command:

docker-compose down

And if you want to run the containers again in the background rather than in the foreground (thus freeing up your terminal), you can run them in detached mode:

docker-compose up -d

Other useful commands:

# View logs
docker-compose logs

# Stop and remove volumes
docker-compose down -v

Benefits of Docker Compose:

  • Single command to start all services
  • Automatic network creation
  • Easy configuration management
  • Declarative infrastructure

If you want to re-run the dockerized ingest script when you run Postgres and pgAdmin with docker-compose, you will have to find the name of the virtual network that Docker compose created for the containers. You can use the command docker network ls to find it and then change the docker run command for the dockerized script to include the network name.

# list the networks to find the one Compose created:
docker network ls

# here it's pipeline_default (Compose names it <project directory>_default)
# now run the script:
docker run -it \
  --network=pipeline_default \
  taxi_ingest:v001 \
    --pg-user=root \
    --pg-pass=root \
    --pg-host=pgdatabase \
    --pg-port=5432 \
    --pg-db=ny_taxi \
    --target-table=yellow_taxi_trips_2021_2 \
    --year=2021 \
    --month=2 \
    --chunksize=100000

Cleanup

When you're done with the workshop, clean up Docker resources to free up disk space:

Stop and remove the Compose containers:

docker-compose down

Remove specific containers:

# List all containers
docker ps -a

# Remove specific container
docker rm <container_id>

# Remove all stopped containers
docker container prune

Remove Docker images:

# List all images
docker images

# Remove specific image
docker rmi taxi_ingest:v001
docker rmi test:pandas

# Remove all unused images
docker image prune -a

Remove Docker volumes:

# List volumes
docker volume ls

# Remove specific volumes
docker volume rm ny_taxi_postgres_data
docker volume rm pgadmin_data

# Remove all unused volumes
docker volume prune

Remove Docker networks:

# List networks
docker network ls

# Remove specific network
docker network rm pg-network

# Remove all unused networks
docker network prune

Complete cleanup (removes everything):

# ⚠️ Warning: This removes ALL unused Docker resources on your machine, not just the ones from this workshop!
docker system prune -a --volumes

Clean up local files:

# Remove parquet files
rm *.parquet

# Remove Python cache
rm -rf __pycache__ .pytest_cache

# Remove virtual environment (if using venv)
rm -rf .venv

Happy learning! 🐳📊
