In this workshop, we will explore Docker fundamentals and data engineering workflows using Docker containers.
We will cover:
- Introduction to Docker and containerization
- Running PostgreSQL in a Docker container
- Data ingestion into PostgreSQL
- Working with pgAdmin for database management
- Docker networking and port mapping
- Docker Compose for multi-container applications
- Creating a data ingestion pipeline
- SQL refresher with real-world data
- Best practices for containerized data engineering workflows
Prerequisites:
- Basic understanding of Python
- Basic SQL knowledge (helpful but not required)
- Docker and Python installed on your machine
- Git (optional)
Docker is a containerization software that allows us to isolate software in a similar way to virtual machines but in a much leaner way.
A Docker image is a snapshot of a container: a template that defines everything needed to run our software, or in this case our data pipelines. By pushing our Docker images to cloud providers such as Amazon Web Services or Google Cloud Platform, we can run our containers there.
Why Docker?
Docker provides the following advantages:
- Reproducibility: Same environment everywhere
- Isolation: Applications run independently
- Portability: Run anywhere Docker is installed
They are used in many situations:
- Integration tests: CI/CD pipelines
- Running pipelines on the cloud: AWS Batch, Kubernetes jobs
- Spark: Analytics engine for large-scale data processing
- Serverless: AWS Lambda, Google Functions
Basic Docker Commands
Check Docker version:

```bash
docker --version
```

Run a simple container:

```bash
docker run hello-world
```

Run something more complex:

```bash
docker run ubuntu
```

Nothing happens: the container starts and exits immediately. We need to run it in `-it` mode (interactive terminal):

```bash
docker run -it ubuntu
```

If we don't have Python there, let's install it:

```bash
apt update && apt install python3
python3 -V
```

Docker image vs container:
- Image: Blueprint for creating containers
- Container: Running instance of an image
Important: Docker containers are stateless - any changes done inside a container will NOT be saved when the container is killed and started again.
When you exit the container and start a new one from the same image, the changes are gone:

```bash
docker run -it ubuntu
python3 -V
```

This is good, because it doesn't affect your host system. Let's say you do something crazy like this:

```bash
docker run -it ubuntu
rm -rf /  # don't run this on your host computer!
```

Next time we run the image, all the files are back.

But this is not completely correct: some state is saved. We can see stopped containers:

```bash
docker ps -a
```

We can restart one of them, but we won't, because it's not good practice. Stopped containers take up space, so let's delete them:

```bash
docker rm `docker ps -aq`
```

Next time we run something, we add `--rm` so the container is removed automatically on exit:

```bash
docker run -it --rm ubuntu
```

There are other base images besides `hello-world` and `ubuntu`. For example, Python:
```bash
docker run -it --rm python:3.13.11
# add -slim to the tag to get a smaller version
```

This one starts a Python interpreter. If we want bash instead, we need to override the entrypoint:

```bash
docker run -it \
  --rm \
  --entrypoint=bash \
  python:3.13.11-slim
```

So, we know that with Docker we can restore any container to its initial state in a reproducible manner. But what about data? A common way to persist it is with volumes.
Let's create some data in a `test` folder:

```bash
mkdir test
cd test
touch file1.txt file2.txt file3.txt
echo "Hello from host" > file1.txt
cd ..
```

Now let's create a simple script `test/list_files.py` that shows the files in the folder:
```python
from pathlib import Path

current_dir = Path.cwd()
current_file = Path(__file__).name

print(f"Files in {current_dir}:")

for filepath in current_dir.iterdir():
    if filepath.name == current_file:
        continue
    print(f" - {filepath.name}")
    if filepath.is_file():
        content = filepath.read_text(encoding='utf-8')
        print(f"   Content: {content}")
```

Now let's map this to a Python container:
```bash
docker run -it \
  --rm \
  -v $(pwd)/test:/app/test \
  --entrypoint=bash \
  python:3.13.11-slim
```

Inside the container, run:

```bash
cd /app/test
ls -la
cat file1.txt
python list_files.py
```

You'll see that the files from your host machine are accessible in the container!
A data pipeline is a service that receives data as input and outputs more data. For example, reading a CSV file, transforming the data somehow and storing it as a table in a PostgreSQL database.
```mermaid
graph LR
    A[CSV File] --> B[Data Pipeline]
    B --> C[Parquet File]
    B --> D[PostgreSQL Database]
    B --> E[Data Warehouse]
    style B fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
```
We'll build pipelines that:
- Download CSV data from the web
- Transform and clean the data with pandas
- Load it into PostgreSQL for querying
- Process data in chunks to handle large files
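As a toy illustration of the input → transform → output idea, here is a minimal sketch using only the standard library (the file names and data are made up for the example; the real pipelines below use pandas):

```python
import csv

# Create a tiny input CSV (a stand-in for downloaded data)
with open("trips_in.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["trip_id", "distance"])
    writer.writerows([["1", "2.5"], ["2", "0.0"], ["3", "1.2"]])

# Transform: keep only trips with a positive distance
with open("trips_in.csv", newline="") as src, \
     open("trips_out.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        if float(row["distance"]) > 0:
            writer.writerow(row)
            kept += 1

print("kept", kept, "rows")  # kept 2 rows
```

Input, transform, output: that's the whole pattern, regardless of whether the sink is a file or a database.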
Let's create an example pipeline. First, create a directory `pipeline` and, inside it, a file `pipeline.py`:

```python
import sys

print("arguments", sys.argv)

day = int(sys.argv[1])
print(f"Running pipeline for day {day}")
```

Now let's add pandas:
```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
print(df.head())

df.to_parquet(f"output_day_{sys.argv[1]}.parquet")
```

We need pandas, but we don't have it installed. We want to test the script locally before we run it in a container.
We can install it with pip:

```bash
pip install pandas pyarrow
```

But this installs it globally on your system, which can cause conflicts if different projects need different versions of the same package.
Instead, we want to use a virtual environment - an isolated Python environment that keeps dependencies for this project separate from other projects and from your system Python.
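A quick way to check whether Python is currently running inside a virtual environment, using only the standard library (a small sketch; in a venv, `sys.prefix` points at the environment while `sys.base_prefix` still points at the base installation):

```python
import sys

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix differs from sys.base_prefix
    return sys.prefix != sys.base_prefix

print("virtual environment active:", in_virtualenv())
```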
We'll use uv - a modern, fast Python package and project manager written in Rust. It's much faster than pip and handles virtual environments automatically.
```bash
pip install uv
```

Now initialize a Python project with uv:

```bash
uv init --python=3.13
```

This creates a `pyproject.toml` file for managing dependencies and a `.python-version` file.
Compare the Python versions:

```bash
uv run which python  # Python in the virtual environment
uv run python -V
which python         # System Python
python -V
```

You'll see they're different: `uv run` uses the isolated environment.
Now let's add pandas:

```bash
uv add pandas pyarrow
```

This adds pandas to your `pyproject.toml` and installs it in the virtual environment.
Now we can execute the script:

```bash
uv run python pipeline.py 10
```

We will see:

```
arguments ['pipeline.py', '10']
Running pipeline for day 10
```
This script produces a binary (parquet) file, so let's make sure we don't accidentally commit it to git by adding parquet extensions to .gitignore:
```
*.parquet
```
Now let's containerize the script. Create the following `Dockerfile`:

```dockerfile
# base Docker image that we will build on
FROM python:3.13.11-slim

# set up our image by installing prerequisites; pandas in this case
RUN pip install pandas pyarrow

# set up the working directory inside the container
WORKDIR /app

# copy the script to the container. 1st name is source file, 2nd is destination
COPY pipeline.py pipeline.py

# define what to do first when the container runs
# in this example, we will just run the script
ENTRYPOINT ["python", "pipeline.py"]
```

Explanation:
- `FROM`: base image (Python 3.13)
- `RUN`: execute commands during build
- `WORKDIR`: set the working directory
- `COPY`: copy files into the image
- `ENTRYPOINT`: default command to run
Let's build the image:

```bash
docker build -t test:pandas .
```

The image name will be `test` and its tag will be `pandas`. If the tag isn't specified, it defaults to `latest`.

We can now run the container and pass an argument to it, so that our pipeline will receive it:

```bash
docker run -it test:pandas 10
```

You should get the same output you did when you ran the pipeline script by itself.

Note: these instructions assume that `pipeline.py` and `Dockerfile` are in the same directory. The Docker commands should also be run from that directory.
What about uv? Let's use it instead of pip:

```dockerfile
# Start with slim Python 3.13 image
FROM python:3.13.11-slim

# Copy uv binary from official uv image (multi-stage build pattern)
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/

# Set working directory
WORKDIR /app

# Add virtual environment to PATH so we can use installed packages
ENV PATH="/app/.venv/bin:$PATH"

# Copy dependency files first (better layer caching)
COPY "pyproject.toml" "uv.lock" ".python-version" ./

# Install dependencies from lock file (ensures reproducible builds)
RUN uv sync --locked

# Copy application code
COPY pipeline.py pipeline.py

# Set entry point
ENTRYPOINT ["python", "pipeline.py"]
```

Now we want to do real data engineering. Let's use a Postgres database for that.
You can run a containerized version of Postgres that doesn't require any installation steps. You only need to provide a few environment variables to it as well as a volume for storing data.
We will use a named volume called `ny_taxi_postgres_data` to store the data. Here's how to run the container:

```bash
docker run -it --rm \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v ny_taxi_postgres_data:/var/lib/postgresql \
  -p 5432:5432 \
  postgres:18
```

Explanation of parameters:
- `-e` sets environment variables (user, password, database name)
- `-v ny_taxi_postgres_data:/var/lib/postgresql` creates a named volume:
  - Docker manages this volume automatically
  - Data persists even after the container is removed
  - The volume is stored in Docker's internal storage
- `-p 5432:5432` maps port 5432 from the container to the host
- `postgres:18` uses PostgreSQL version 18 (latest as of Dec 2025)
Alternative approach: a bind mount. First create the directory, then map it:

```bash
mkdir ny_taxi_postgres_data

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql \
  -p 5432:5432 \
  postgres:18
```

When you create the directory first, it's owned by your user. If you let Docker create it, it will be owned by the Docker/root user, which can cause permission issues on Linux. On Windows and macOS with Docker Desktop, this is handled automatically.
Named volume vs bind mount:
- Named volume (`name:/path`): managed by Docker, easier
- Bind mount (`/host/path:/container/path`): direct mapping to the host filesystem, more control
Once the container is running, we can log into our database with pgcli.
Install pgcli:

```bash
uv add --dev pgcli
```

The `--dev` flag marks this as a development dependency (not needed in production). It will be added to the `[dependency-groups]` section of `pyproject.toml` instead of the main dependencies section.

Now use it to connect to Postgres:

```bash
uv run pgcli -h localhost -p 5432 -u root -d ny_taxi
```

- `uv run` executes a command in the context of the virtual environment
- `-h` is the host. Since we're running locally, we can use `localhost`.
- `-p` is the port.
- `-u` is the username.
- `-d` is the database name.
- The password is not provided; it will be requested after running the command.
When prompted, enter the password: root
Try some SQL commands:

```sql
-- List tables
\dt

-- Create a test table
CREATE TABLE test (id INTEGER, name VARCHAR(50));

-- Insert data
INSERT INTO test VALUES (1, 'Hello Docker');

-- Query data
SELECT * FROM test;

-- Exit
\q
```

We will now create a Jupyter notebook, `notebook.ipynb`, which we will use to read a CSV file and export it to Postgres.
Install Jupyter:

```bash
uv add --dev jupyter
```

Let's start Jupyter to explore the data:

```bash
uv run jupyter notebook
```

We will use data from the NYC TLC Trip Record Data website.
Specifically, we will use the Yellow taxi trip records CSV file for January 2021.
This data used to be distributed as CSV, but later the TLC switched to parquet. We will keep using CSV because it requires a bit of extra pre-processing, which is useful for learning.
A dictionary to understand each field is available here.
Note: The CSV data is stored as gzipped files. Pandas can read them directly.
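To see this in action without downloading anything, here's a small self-contained sketch: we write a tiny gzipped CSV (a made-up stand-in for the taxi data) and let pandas read it directly. pandas infers gzip compression from the `.gz` extension (`compression='infer'` is the default):

```python
import gzip

import pandas as pd

# Write a tiny gzipped CSV (a stand-in for yellow_tripdata_2021-01.csv.gz)
with gzip.open("sample.csv.gz", "wt") as f:
    f.write("a,b\n1,2\n3,4\n")

# read_csv decompresses transparently based on the file extension
df = pd.read_csv("sample.csv.gz")
print(df.shape)  # (2, 2)
```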
Explore the Data
Create a new notebook and run:

```python
import pandas as pd

# Read a sample of the data
prefix = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow'
df = pd.read_csv(prefix + '/yellow_tripdata_2021-01.csv.gz', nrows=100)

# Display first rows
df.head()

# Check data types
df.dtypes

# Check data shape
df.shape
```

We get a warning:
```
/tmp/ipykernel_25483/2933316018.py:1: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
```

So we need to specify the types:
```python
dtype = {
    "VendorID": "Int64",
    "passenger_count": "Int64",
    "trip_distance": "float64",
    "RatecodeID": "Int64",
    "store_and_fwd_flag": "string",
    "PULocationID": "Int64",
    "DOLocationID": "Int64",
    "payment_type": "Int64",
    "fare_amount": "float64",
    "extra": "float64",
    "mta_tax": "float64",
    "tip_amount": "float64",
    "tolls_amount": "float64",
    "improvement_surcharge": "float64",
    "total_amount": "float64",
    "congestion_surcharge": "float64"
}

parse_dates = [
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime"
]

df = pd.read_csv(
    prefix + '/yellow_tripdata_2021-01.csv.gz',
    nrows=100,
    dtype=dtype,
    parse_dates=parse_dates
)
```

In the Jupyter notebook, we write code to:
- Download the CSV file
- Read it in chunks with pandas
- Convert datetime columns
- Insert data into PostgreSQL using SQLAlchemy
First, install SQLAlchemy and the Postgres driver:

```bash
uv add sqlalchemy psycopg2-binary
```

Create the engine:

```python
from sqlalchemy import create_engine

engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
```

Get the DDL schema for the table:

```python
print(pd.io.sql.get_schema(df, name='yellow_taxi_data', con=engine))
```

Output:
```sql
CREATE TABLE yellow_taxi_data (
    "VendorID" BIGINT,
    tpep_pickup_datetime TIMESTAMP WITHOUT TIME ZONE,
    tpep_dropoff_datetime TIMESTAMP WITHOUT TIME ZONE,
    passenger_count BIGINT,
    trip_distance FLOAT(53),
    "RatecodeID" BIGINT,
    store_and_fwd_flag TEXT,
    "PULocationID" BIGINT,
    "DOLocationID" BIGINT,
    payment_type BIGINT,
    fare_amount FLOAT(53),
    extra FLOAT(53),
    mta_tax FLOAT(53),
    tip_amount FLOAT(53),
    tolls_amount FLOAT(53),
    improvement_surcharge FLOAT(53),
    total_amount FLOAT(53),
    congestion_surcharge FLOAT(53)
)
```

Create the table:

```python
df.head(n=0).to_sql(name='yellow_taxi_data', con=engine, if_exists='replace')
```

`head(n=0)` makes sure we only create the table; we don't add any data yet.
We don't want to insert all the data at once. Let's do it in batches, using an iterator:

```python
df_iter = pd.read_csv(
    ...
    iterator=True,
    chunksize=100000
)
```

Iterate over it:

```python
for df_chunk in df_iter:
    print(len(df_chunk))
```

Inserting data:

```python
df_chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
```

Putting everything together:
```python
first = True

for df_chunk in df_iter:
    if first:
        # Create table schema (no data)
        df_chunk.head(0).to_sql(
            name="yellow_taxi_data",
            con=engine,
            if_exists="replace"
        )
        first = False
        print("Table created")

    # Insert chunk
    df_chunk.to_sql(
        name="yellow_taxi_data",
        con=engine,
        if_exists="append"
    )
    print("Inserted:", len(df_chunk))
```

If you don't like using the `first` flag:
```python
first_chunk = next(df_iter)

first_chunk.head(0).to_sql(
    name="yellow_taxi_data",
    con=engine,
    if_exists="replace"
)
print("Table created")

first_chunk.to_sql(
    name="yellow_taxi_data",
    con=engine,
    if_exists="append"
)
print("Inserted first chunk:", len(first_chunk))

for df_chunk in df_iter:
    df_chunk.to_sql(
        name="yellow_taxi_data",
        con=engine,
        if_exists="append"
    )
    print("Inserted chunk:", len(df_chunk))
```

Add tqdm to see progress:

```bash
uv add tqdm
```

Wrap it around the iterable:

```python
from tqdm.auto import tqdm

for df_chunk in tqdm(df_iter):
    ...
```

Connect to the database using pgcli:
```bash
uv run pgcli -h localhost -p 5432 -u root -d ny_taxi
```

And explore the data.
Now let's convert the notebook to a Python script:
```bash
uv run jupyter nbconvert --to=script notebook.ipynb
mv notebook.py ingest_data.py
```

Then clean up the script:
```python
#!/usr/bin/env python
# coding: utf-8

import pandas as pd
from sqlalchemy import create_engine
from tqdm.auto import tqdm

dtype = {
    "VendorID": "Int64",
    "passenger_count": "Int64",
    "trip_distance": "float64",
    "RatecodeID": "Int64",
    "store_and_fwd_flag": "string",
    "PULocationID": "Int64",
    "DOLocationID": "Int64",
    "payment_type": "Int64",
    "fare_amount": "float64",
    "extra": "float64",
    "mta_tax": "float64",
    "tip_amount": "float64",
    "tolls_amount": "float64",
    "improvement_surcharge": "float64",
    "total_amount": "float64",
    "congestion_surcharge": "float64"
}

parse_dates = [
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime"
]


def ingest_data(
    url: str,
    engine,
    target_table: str,
    chunksize: int = 100000,
) -> None:
    df_iter = pd.read_csv(
        url,
        dtype=dtype,
        parse_dates=parse_dates,
        iterator=True,
        chunksize=chunksize
    )

    first_chunk = next(df_iter)

    first_chunk.head(0).to_sql(
        name=target_table,
        con=engine,
        if_exists="replace"
    )
    print(f"Table {target_table} created")

    first_chunk.to_sql(
        name=target_table,
        con=engine,
        if_exists="append"
    )
    print(f"Inserted first chunk: {len(first_chunk)}")

    for df_chunk in tqdm(df_iter):
        df_chunk.to_sql(
            name=target_table,
            con=engine,
            if_exists="append"
        )
        print(f"Inserted chunk: {len(df_chunk)}")

    print(f'done ingesting to {target_table}')


def main():
    pg_user = 'root'
    pg_pass = 'root'
    pg_host = 'localhost'
    pg_port = '5432'
    pg_db = 'ny_taxi'

    year = 2021
    month = 1
    chunksize = 100000
    target_table = 'yellow_taxi_data'

    engine = create_engine(f'postgresql://{pg_user}:{pg_pass}@{pg_host}:{pg_port}/{pg_db}')

    url_prefix = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow'
    url = f'{url_prefix}/yellow_tripdata_{year:04d}-{month:02d}.csv.gz'

    ingest_data(
        url=url,
        engine=engine,
        target_table=target_table,
        chunksize=chunksize
    )


if __name__ == '__main__':
    main()
```
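A quick note on the format specs used to build the URL: `:04d` and `:02d` zero-pad the year and month so the file name matches the release naming scheme:

```python
year, month = 2021, 1

# :04d pads to 4 digits, :02d pads to 2 digits
url_suffix = f"yellow_tripdata_{year:04d}-{month:02d}.csv.gz"
print(url_suffix)  # yellow_tripdata_2021-01.csv.gz
```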
We'll use click to parse the command-line arguments. Install it with uv:

```bash
uv add click
```

The result:
```python
import click

...

@click.command()
@click.option('--pg-user', default='root', help='PostgreSQL username')
@click.option('--pg-pass', default='root', help='PostgreSQL password')
@click.option('--pg-host', default='localhost', help='PostgreSQL host')
@click.option('--pg-port', default='5432', help='PostgreSQL port')
@click.option('--pg-db', default='ny_taxi', help='PostgreSQL database name')
@click.option('--year', default=2021, type=int, help='Year of the data')
@click.option('--month', default=1, type=int, help='Month of the data')
@click.option('--chunksize', default=100000, type=int, help='Chunk size for ingestion')
@click.option('--target-table', default='yellow_taxi_data', help='Target table name')
def main(pg_user, pg_pass, pg_host, pg_port, pg_db, year, month, chunksize, target_table):
    ...
```

The script reads data in chunks (100,000 rows at a time) to handle large files efficiently without running out of memory.
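To see why chunking keeps memory bounded, here's a small self-contained sketch (the file name and sizes are made up for the example). Setting `chunksize` turns `read_csv` into an iterator of DataFrames, so only one chunk is in memory at a time:

```python
import pandas as pd

# Build a small local CSV so the example is self-contained
pd.DataFrame({"id": range(10)}).to_csv("demo.csv", index=False)

total_rows = 0
# With chunksize=4, read_csv yields DataFrames of at most 4 rows each
for chunk in pd.read_csv("demo.csv", chunksize=4):
    total_rows += len(chunk)

print(total_rows)  # 10
```

The same pattern scales to the 1.3M-row taxi file: peak memory is roughly one chunk, not the whole dataset.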
Make sure PostgreSQL is running, then execute the ingestion script:
```bash
uv run python ingest_data.py \
  --pg-user=root \
  --pg-pass=root \
  --pg-host=localhost \
  --pg-port=5432 \
  --pg-db=ny_taxi \
  --target-table=yellow_taxi_trips \
  --year=2021 \
  --month=1 \
  --chunksize=100000
```

This will download the data and ingest it into your PostgreSQL database.
Verify Data
Connect with pgcli and query the data:
```bash
uv run pgcli -h localhost -p 5432 -u root -d ny_taxi
```

```sql
-- Count records (should return 1,369,765 rows)
SELECT COUNT(*) FROM yellow_taxi_trips;

-- View sample data
SELECT * FROM yellow_taxi_trips LIMIT 10;

-- Basic analytics
SELECT
    DATE(tpep_pickup_datetime) AS pickup_date,
    COUNT(*) AS trips_count,
    AVG(total_amount) AS avg_amount
FROM yellow_taxi_trips
GROUP BY DATE(tpep_pickup_datetime)
ORDER BY pickup_date;
```

pgcli is a handy tool, but it's cumbersome for complex queries and database management. pgAdmin is a web-based tool that makes it more convenient to access and manage our databases.
It's possible to run pgAdmin as a container along with the Postgres container, but both containers will have to be in the same virtual network so that they can find each other.
Run pgAdmin Container
```bash
docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -v pgadmin_data:/var/lib/pgadmin \
  -p 8085:80 \
  dpage/pgadmin4
```

The `-v pgadmin_data:/var/lib/pgadmin` volume mapping saves pgAdmin settings (server connections, preferences) so you don't have to reconfigure it every time you restart the container.

- The container needs two environment variables: a login email and a password. We use `admin@admin.com` and `root` in this example.
- pgAdmin is a web app and its default port is 80; we map it to 8085 on our localhost to avoid any possible conflicts.
- The actual image name is `dpage/pgadmin4`.
Note: This won't work yet because pgAdmin can't see the PostgreSQL container. They need to be on the same Docker network!
Docker Networks
Let's create a virtual Docker network called `pg-network`:

```bash
docker network create pg-network
```

You can remove the network later with `docker network rm pg-network`, and list existing networks with `docker network ls`.
Stop both containers and re-run them with the network configuration:
```bash
# Run PostgreSQL on the network
docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v ny_taxi_postgres_data:/var/lib/postgresql \
  -p 5432:5432 \
  --network=pg-network \
  --name pgdatabase \
  postgres:18

# In another terminal, run pgAdmin on the same network
docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -v pgadmin_data:/var/lib/pgadmin \
  -p 8085:80 \
  --network=pg-network \
  --name pgadmin \
  dpage/pgadmin4
```

- Just like with the Postgres container, we specify a network and a name for pgAdmin.
- The container names (`pgdatabase` and `pgadmin`) allow the containers to find each other within the network.
Connect pgAdmin to PostgreSQL
You should now be able to load pgAdmin in a web browser by going to http://localhost:8085. Use the same email and password you used when running the container to log in.

- Open a browser and go to `http://localhost:8085`
- Log in with email `admin@admin.com` and password `root`
- Right-click "Servers" → Register → Server
- Configure:
  - General tab: Name: `Local Docker`
  - Connection tab:
    - Host: `pgdatabase` (the container name)
    - Port: `5432`
    - Username: `root`
    - Password: `root`
- Save
Now you can explore the database using the pgAdmin interface!
Let's modify the Dockerfile we created before to include our `ingest_data.py` script:

```dockerfile
# Start with slim Python 3.13 image for smaller size
FROM python:3.13.11-slim

# Copy uv binary from official uv image
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/

# Set working directory inside container
WORKDIR /app

# Add virtual environment to PATH
ENV PATH="/app/.venv/bin:$PATH"

# Copy dependency files first (better caching)
COPY "pyproject.toml" "uv.lock" ".python-version" ./

# Install all dependencies (pandas, sqlalchemy, psycopg2)
RUN uv sync --locked

# Copy ingestion script
COPY ingest_data.py ingest_data.py

# Set entry point to run the ingestion script
ENTRYPOINT [ "python", "ingest_data.py" ]
```

Explanation:
- `uv sync --locked` installs exact versions from `uv.lock` for reproducibility
- Dependencies (pandas, sqlalchemy, psycopg2) are already in `pyproject.toml`
- The multi-stage build pattern copies uv from the official image
- Copying dependency files before code improves Docker layer caching

Build the Docker Image

```bash
docker build -t taxi_ingest:v001 .
```

Run the Containerized Ingestion
You can drop the table in pgAdmin beforehand if you want, but the script will automatically replace any pre-existing table.

```bash
docker run -it \
  --network=pg-network \
  taxi_ingest:v001 \
    --pg-user=root \
    --pg-pass=root \
    --pg-host=pgdatabase \
    --pg-port=5432 \
    --pg-db=ny_taxi \
    --target-table=yellow_taxi_trips_2021_2 \
    --year=2021 \
    --month=2 \
    --chunksize=100000
```

Important notes:
- We need to provide the network so Docker can find the Postgres container. It goes before the name of the image.
- Since Postgres is running in a separate container, the host argument has to point to the container name of Postgres (`pgdatabase`).
Docker Compose allows us to launch multiple containers from a single configuration file, so that we don't have to run several complex `docker run` commands separately.

Docker Compose uses YAML files. Here's the `docker-compose.yaml` file for running the Postgres and pgAdmin containers:
```yaml
services:
  pgdatabase:
    image: postgres:18
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_taxi
    volumes:
      - "ny_taxi_postgres_data:/var/lib/postgresql:rw"
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    volumes:
      - "pgadmin_data:/var/lib/pgadmin"
    ports:
      - "8085:80"

volumes:
  ny_taxi_postgres_data:
  pgadmin_data:
```

- We don't have to specify a network because docker-compose takes care of it: every single container (or "service", as the file states) will run within the same network, and the containers will be able to find each other by their names (`pgdatabase` and `pgadmin` in this example).
- All other details from the `docker run` commands (environment variables, volumes and ports) are specified in the file following YAML syntax.
Start Services with Docker Compose
We can now run Docker Compose with the following command from the same directory where `docker-compose.yaml` is found. Make sure that no previous containers are still running:

```bash
docker-compose up
```

Note: if you use the bind-mount variant instead of named volumes, the `ny_taxi_postgres_data` folder must be in the same directory as `docker-compose.yaml`.
If the pgAdmin settings volume is new (or you skipped the volume earlier), you will have to re-create the server connection by following the steps in the pgAdmin section.
You can press Ctrl+C to shut down the containers, but the proper way of shutting them down is with this command:

```bash
docker-compose down
```

If you want to run the containers in the background rather than in the foreground (thus freeing up your terminal), run them in detached mode:

```bash
docker-compose up -d
```

Other useful commands:

```bash
# View logs
docker-compose logs

# Stop and remove volumes
docker-compose down -v
```

Benefits of Docker Compose:
- Single command to start all services
- Automatic network creation
- Easy configuration management
- Declarative infrastructure
If you want to re-run the dockerized ingestion script while Postgres and pgAdmin are running under docker-compose, you have to find the name of the virtual network that Docker Compose created for the containers. Use `docker network ls` to find it, then add it to the `docker run` command for the dockerized script:

```bash
# check the network name:
docker network ls
# in this example it's pipeline_default

# now run the script:
docker run -it \
  --network=pipeline_default \
  taxi_ingest:v001 \
    --pg-user=root \
    --pg-pass=root \
    --pg-host=pgdatabase \
    --pg-port=5432 \
    --pg-db=ny_taxi \
    --target-table=yellow_taxi_trips_2021_2 \
    --year=2021 \
    --month=2 \
    --chunksize=100000
```

When you're done with the workshop, clean up Docker resources to free up disk space.
Stop all running containers:
```bash
docker-compose down
```

Remove specific containers:

```bash
# List all containers
docker ps -a

# Remove a specific container
docker rm <container_id>

# Remove all stopped containers
docker container prune
```

Remove Docker images:

```bash
# List all images
docker images

# Remove specific images
docker rmi taxi_ingest:v001
docker rmi test:pandas

# Remove all unused images
docker image prune -a
```

Remove Docker volumes:

```bash
# List volumes
docker volume ls

# Remove specific volumes
docker volume rm ny_taxi_postgres_data
docker volume rm pgadmin_data

# Remove all unused volumes
docker volume prune
```

Remove Docker networks:

```bash
# List networks
docker network ls

# Remove a specific network
docker network rm pg-network

# Remove all unused networks
docker network prune
```

Complete cleanup (removes everything):

```bash
# ⚠️ Warning: This removes ALL Docker resources!
docker system prune -a --volumes
```

Clean up local files:

```bash
# Remove parquet files
rm *.parquet

# Remove Python cache
rm -rf __pycache__ .pytest_cache

# Remove virtual environment (if using venv)
rm -rf .venv
```

Happy learning! 🐳📊