
Adds CAD database tables and related ETLs #2013

Open

johnclary wants to merge 49 commits into main from john/27728-init-cad-table

Conversation

Member

@johnclary johnclary commented Apr 3, 2026

Associated issues

This PR brings CAD data into the database! We currently have a live daily extract of CAD files accumulating in our shared network drive at /mnt/vision_zero_cad on our 02 server. We've also been provided with backfill files containing ~200 MB of records going back to 2015. We'll process those mega files post-go-live.

Todos outside the scope of this PR:

  • Create Airflow DAG for daily processing
  • Backfill incident records

Testing

URL to test: Local


Tests involve adding and removing files in S3, so it would be good to give folks a heads-up in Slack to avoid conflicts.

  1. Start your local stack and apply migrations and metadata.

  2. Set up your env file by copying env_template and filling in the blanks. If you already have a .env file for the afd_ems_import ETL, you can use that (just make sure it uses dev variables, not prod).

  3. Acquire local files: in AWS S3, navigate to atd-vision-zero/dev/cad_incidents/archive and download all of the files from this archive directory, saving them to the ./test_data directory in this repo.

  4. Now we will test the incidents_to_s3.py script, starting with a dry run. This command uses the docker-compose.local.yml override file to mount your ./test_data directory into the container's network drive mount point.

docker compose -f docker-compose.yml -f docker-compose.local.yml run import incidents_to_s3.py --dry-run

Confirm that the output lists files oldest to newest based on the timestamp in the filename, with the WithGroupID files always processed after the file without that string in the filename (see the sketch after the sample output). The output should look something like this:

2026-05-07 17:30:37,071 INFO DRY RUN enabled — no files will be uploaded or deleted.
2026-05-07 17:30:37,076 INFO Found 6 file(s) to process.
2026-05-07 17:30:37,076 INFO [DRY RUN] Would upload s3://atd-vision-zero/dev/cad_incidents/inbox/TPWCADTrafficSafetyDaily_20260505.CSV
2026-05-07 17:30:37,076 INFO [DRY RUN] Would upload s3://atd-vision-zero/dev/cad_incidents/inbox/TPWCADTrafficSafetyWithGroupIDDaily_20260505.CSV
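For reference, this ordering can be sketched in Python like so (a hypothetical illustration, not the script's actual code; the filename pattern is inferred from the samples above):

import re

def sort_key(filename):
    # Order primarily by the YYYYMMDD stamp in the filename; within the same
    # date the plain daily file sorts first, because False sorts before True.
    match = re.search(r"_(\d{8})\.CSV$", filename, re.IGNORECASE)
    date_stamp = match.group(1) if match else ""
    return (date_stamp, "WithGroupID" in filename)

files = [
    "TPWCADTrafficSafetyWithGroupIDDaily_20260505.CSV",
    "TPWCADTrafficSafetyDaily_20260505.CSV",
    "TPWCADTrafficSafetyDaily_20260504.CSV",
]
print(sorted(files, key=sort_key))
# ['TPWCADTrafficSafetyDaily_20260504.CSV',
#  'TPWCADTrafficSafetyDaily_20260505.CSV',
#  'TPWCADTrafficSafetyWithGroupIDDaily_20260505.CSV']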
  5. Remove the --dry-run flag and run the script again:
docker compose -f docker-compose.yml -f docker-compose.local.yml run import incidents_to_s3.py

The output should look similar to the previous step and display the name of each file that was uploaded to S3. In the AWS console, navigate to the CAD incidents inbox (atd-vision-zero/dev/cad_incidents/inbox) and observe that files have been uploaded there. The timestamps shown in the Last modified column should all be very recent.

  6. Run the script again, adding the --remove flag. This will cause the processed files to be deleted from your filesystem.
docker compose -f docker-compose.yml -f docker-compose.local.yml run import incidents_to_s3.py --remove

Confirm the files have been deleted from your ./test_data directory.

  7. Moving on to incidents_import.py, we will now process the files from the S3 inbox into our local database. Start with a dry run:
docker compose -f docker-compose.yml -f docker-compose.local.yml run import incidents_import.py --dry-run

Confirm that the output lists files oldest to newest based on the timestamp in the filename, with the WithGroupID files always processed after the file without that string in the filename. It should look something like this:

INFO:root:Running CAD incident import
INFO:root:Getting list of files in S3 inbox
INFO:root:6 S3 files to process
INFO:root:Downloading: dev/cad_incidents/inbox/TPWCADTrafficSafetyDaily_20260505.CSV
INFO:root:1,247 total records to upsert
INFO:root:Would upsert 1000
INFO:root:Would upsert 247
INFO:root:Downloading: dev/cad_incidents/inbox/TPWCADTrafficSafetyWithGroupIDDaily_20260505.CSV
INFO:root:1,034 total records to upsert
INFO:root:Would upsert 1000
INFO:root:Would upsert 34
  8. Run the script again without the --dry-run flag:
docker compose -f docker-compose.yml -f docker-compose.local.yml run import incidents_import.py

Once the script completes, use your SQL client to inspect the records in our new tables. Verify that the location_id and in_austin_full_purpose columns are populating on the cad_incidents table.

select * from cad_incidents;
select * from cad_incident_groups;
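If you want a quick spot-check of how many rows are missing those values, something like this works (hypothetical queries, not part of the test plan):

select count(*) from cad_incidents where location_id is null;
select count(*) from cad_incidents where in_austin_full_purpose is null;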
  9. Lastly, run the script again using the --archive flag.
docker compose -f docker-compose.yml -f docker-compose.local.yml run import incidents_import.py --archive

Use your SQL client to confirm that the records in the cad_incidents table now have different timestamps in the created_at and updated_at columns.
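One quick way to check (a hypothetical query, assuming updated_at is bumped on upsert while created_at is preserved):

select count(*) from cad_incidents where updated_at > created_at;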

Head to the CAD incidents inbox (atd-vision-zero/dev/cad_incidents/inbox) and confirm that files have been removed from this directory, and that the files in the /archive directory have recent timestamps.


Ship list

  • Check migrations for any conflicts with latest migrations in main branch
  • Confirm Hasura role permissions for necessary access
  • Code reviewed
  • Product manager approved


netlify Bot commented Apr 3, 2026

Deploy Preview for atd-vze-staging canceled.

🔨 Latest commit: 95f43bc
🔍 Latest deploy log: https://app.netlify.com/projects/atd-vze-staging/deploys/69de722d5af967000831e558


# Query the view definition and append to the file
- run_psql -v ON_ERROR_STOP=1 -A -t -c "SELECT 'CREATE OR REPLACE VIEW ' || '$VIEW_NAME' || ' AS ' || pg_get_viewdef('$VIEW_NAME'::regclass, true);" >> database/views/$VIEW_NAME.sql
+ run_psql -v ON_ERROR_STOP=1 -A -t -c "SELECT 'CREATE OR REPLACE VIEW ' || '$VIEW_NAME' || ' AS' || chr(10) || pg_get_viewdef('$VIEW_NAME'::regclass, true);" >> database/views/$VIEW_NAME.sql
Member Author

These changes ensure that we always have a newline character after the AS portion of the view output.

As far as I can tell, sqruff does not have a setting controlling this, and it was leading to the bot making repeated changes to view files.

To follow along, see:

  • 19198ca, the bot's first commit to this PR, in which it undid changes to views that only it had touched.
  • 95f43bc, the bot reverting most of those changes after I introduced the newline (chr(10)) change above

My hope is that after this commit goes through we will stop seeing so much noise from the bot 🤞
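For illustration, with a hypothetical view name, the chr(10) version emits view files shaped like this, with a newline guaranteed after the AS:

CREATE OR REPLACE VIEW my_view AS
 SELECT ...
   FROM ...;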


netlify Bot commented May 1, 2026

Deploy Preview for atd-vze-staging canceled.

🔨 Latest commit: bb857e7
🔍 Latest deploy log: https://app.netlify.com/projects/atd-vze-staging/deploys/69fe028d5a78440008c0830f

@johnclary johnclary changed the title Adds cad_incidents table to the DB Adds CAD data and related ETLs May 4, 2026
@johnclary johnclary changed the title Adds CAD data and related ETLs Adds CAD database tables and related ETLs May 5, 2026
UNIQUE (incident_group_id, master_incident_id)
);

comment on table cad_incident_groups is 'Table showing linkages of CAD incident groups, which are a poorly understood grouping of related incidents generated by the CAD system. Not all incident IDs in this table will have a corresponding record in the cad_incidents table, because some of those incidents may fall outside the scope of our CAD data query (which includes crash related records only)';
Member Author
@johnclary johnclary May 7, 2026

TBD how useful this table will be. I still don't understand how, when, or why the CAD system generates these incident groups.

It's further complicated by the fact that some master_incident_ids referenced here will not be in our database, because they relate to a non-crash incident category. This is why I did not establish a foreign key relationship between cad_incident_groups and cad_incidents.
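As an illustration of that gap, a query along these lines (hypothetical; it assumes cad_incidents is keyed by master_incident_id, which I'm inferring from the unique constraint above) would surface group members that have no incident record:

select g.incident_group_id, g.master_incident_id
from cad_incident_groups g
left join cad_incidents i on i.master_incident_id = g.master_incident_id
where i.master_incident_id is null;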

Member

Interesting. At first I thought the comment on the table was also lifted from the APD public dataset and I thought "poorly understood" was oddly honest. I can't really picture what this table is, but it sounds like I am not alone in that.


CREATE INDEX idx_cad_incidents_response_date ON cad_incidents (response_date);

comment ON TABLE cad_incidents IS 'This dataset contains information on both 911 calls (usually referred to as Calls for Service or Dispatched Incidents) and officer-initiated incidents related to traffic crashes as recorded in the Austin public safety Computer Aided Dispatch (CAD) system. Data is provided by the public safety enterprise data team after approval by AFD, ATCEMS, and APD';


@Charlie-Henry Charlie-Henry left a comment


On step 8:

Verify that the location_id and in_austin_full_purpose columns are populating on the cad_incidents table.

I did see some 129 records missing a location_id; I dunno if that is expected.

Everything else is pretty minor stuff/questions. I did have a small fix for the env_template for anyone testing this themselves. After that, everything worked well.

Comment thread etl/cad_incidents_import/env_template Outdated
Contributor
@Charlie-Henry Charlie-Henry May 7, 2026

I'm curious why we aren't retrieving the data directly from the warehouse? Sorry, I probably lost some context along the way.

List: List of S3 object keys, sorted oldest to newest
"""
prefix = f"{BUCKET_ENV}/cad_incidents/{subdir}"
response = s3_client.list_objects(
Contributor

A thing I'm always reminded of: list_objects has a limit of 1,000 objects returned per call. Maybe something to be aware of when doing the backfill of data.
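For what it's worth, boto3's paginator handles that cap transparently. A minimal sketch (the bucket and prefix are assumed from the test plan above, not taken from this script):

import boto3

s3_client = boto3.client("s3")
paginator = s3_client.get_paginator("list_objects_v2")

keys = []
for page in paginator.paginate(
    Bucket="atd-vision-zero", Prefix="dev/cad_incidents/inbox"
):
    # Each page holds at most 1,000 keys; the paginator follows the
    # continuation token until the listing is exhausted.
    keys.extend(obj["Key"] for obj in page.get("Contents", []))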

Comment on lines +32 to +34
We expect object keys like:
- some/path/TPWCADTrafficSafetyWithGroupIDDaily_20260410.CSV
- some/path/TPWCADTrafficSafetyDaily_20260410.CSV
Contributor

Would it be possible that we'd end up with files in our inbox that don't follow this format?

Member Author

Unintentionally, for sure. I noticed I forgot to uncomment the file name check in is_file_to_process(), which is looking out for this.

Member

We check is_file_to_process() when getting files locally; should we also check the file names from the S3 bucket?

"""
for row in data:
for source_key, target_key in cols_to_rename.items():
row[target_key] = row.pop(source_key)
Contributor

If the CSV file had a new column added that we were not expecting, I think this would not drop it?
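One possible way to make that explicit (a sketch, not the script's current behavior) is to rebuild each row from the mapping, so any column outside cols_to_rename is dropped:

def rename_and_filter(data, cols_to_rename):
    # Keep only the mapped columns; unexpected columns are discarded
    # rather than carried through to the upsert.
    return [
        {target: row[source] for source, target in cols_to_rename.items()}
        for row in data
    ]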


We process two distinct files on a daily basis:

- The CAD incident incident file, in which each row is a crash-related CAD incident responded to by AFD, EMS, or APD. Sample filename: `TPWCADTrafficSafetyDaily_20260410.CSV`
Contributor

double incident

Member
@chiaberry chiaberry left a comment

Tested and saw the files move the way I expected. When do you anticipate we would import the files directly from local vs uploading to S3 and then importing from there?

Comment on lines +135 to +136
csv_content = download_file_s3(file_obj_key_or_path)
csv_content = download_file_s3(file_obj_key_or_path)
Member

Why is this called twice?


date_field_names (str[]): list of field names which hold date values to update
date_format (str): the format of the input date string, which will be use to parse the string
into a datetime object
tz (string): The IANA time zone name of the input time value. Defaluts to America/Chicago
Member

Two tiny typos: line 64 sting/string, and Defaluts/defaults.

)

if not files_todo:
    raise Exception("No CAD files found in S3 inbox")
Member

I think this would only happen if there were no files in local_files_to_process, since get_s3_files_todo throws an error if there are no files.
