GL-Data 1071: Stale data check dedup logic update#12876

Open
sb-2011 wants to merge 11 commits into main from 1071-stale-data-check-dedup-logic-update

Conversation

@sb-2011 (Contributor) commented Feb 17, 2026

🎫 Ticket

Issue-1071

🛠 Summary of changes

  • Created a Rails job to track duplicates by the hour for the log groups of interest.
  • Added a spec file for test coverage.

📜 Testing Plan

Steps to confirm the changes:

  • Step 1 - Deploy to personal env
  • Step 2 - Confirm the new job runs as expected

Tested in Sandbox

Run the new hourly job: CloudwatchDuplicateLogCounterJob

  • Processes two log groups (production.log and events.log).
  • Generates two CSV files with hourly duplicate counts; the files are stored in S3.
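The per-group CSV keys visible in the console output below follow a date-partitioned layout. A hypothetical helper sketching that layout (the method name and the `.log` → `_log` mapping are inferred from the output, not the job's actual API):

```ruby
require "date"

# Illustrative only: builds an S3 key matching the pattern seen in the
# job's log output, e.g.
# table_summary_stats/cw_log_duplicate_counts/events_log/2026/2026-04-16.csv
def duplicate_counts_s3_key(log_group:, date:)
  group_dir = log_group.sub(/\.log\z/, "_log") # "events.log" -> "events_log"
  "table_summary_stats/cw_log_duplicate_counts/#{group_dir}/#{date.year}/#{date.iso8601}.csv"
end

puts duplicate_counts_s3_key(log_group: "events.log", date: Date.new(2026, 4, 16))
# => table_summary_stats/cw_log_duplicate_counts/events_log/2026/2026-04-16.csv
```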
```
identity(spalathura):002:0> DataWarehouse::CloudwatchDuplicateLogCounterJob.perform_now(Time.zone.now)
Processing log group spalathura_/srv/idp/shared/log/events.log, checking for duplicates.
Reading existing duplicate counts from s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/cw_log_duplicate_counts/events_log/2026/2026-04-16.csv
Checking for duplicates at hour 2.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 3.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 4.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 5.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 6.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 7.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 8.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 9.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 10.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 11.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 12.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 13.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Uploading duplicate counts to s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/cw_log_duplicate_counts/events_log/2026/2026-04-16.csv
DataWarehouse::CloudwatchDuplicateLogCounterJob: uploading to s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/cw_log_duplicate_counts/events_log/2026/2026-04-16.csv
Successfully updated duplicate counts for hour bucket(s): [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13].
Processing log group spalathura_/srv/idp/shared/log/production.log, checking for duplicates.
Reading existing duplicate counts from s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/cw_log_duplicate_counts/production_log/2026/2026-04-16.csv
Checking for duplicates at hour 2.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 3.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 4.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 5.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 6.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 7.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 8.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 9.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 10.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 11.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 12.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 13.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Uploading duplicate counts to s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/cw_log_duplicate_counts/production_log/2026/2026-04-16.csv
DataWarehouse::CloudwatchDuplicateLogCounterJob: uploading to s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/cw_log_duplicate_counts/production_log/2026/2026-04-16.csv
Successfully updated duplicate counts for hour bucket(s): [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13].
nil
```

Run the existing modified daily job: TableSummaryStatsExportJob

```
identity(spalathura):003:0> DataWarehouse::TableSummaryStatsExportJob.perform_now(Time.zone.today)
Querying log slices 100% [1/1] |====================================================================================================================| Time: 00:00:03
Reading existing duplicate counts from s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/cw_log_duplicate_counts/events_log/2026/2026-04-16.csv
Querying log slices 100% [1/1] |====================================================================================================================| Time: 00:00:03
Reading existing duplicate counts from s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/cw_log_duplicate_counts/production_log/2026/2026-04-16.csv
DataWarehouse::TableSummaryStatsExportJob: uploading to s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/2026/2026-04-16_table_summary_stats.json
"s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/2026/2026-04-16_table_summary_stats.json"
```

Verifying the results

```
fields @timestamp, jsonParse(@message) as @messageJson
  | filter @logStream like 'worker-i-' or @logStream like 'idp-i-'
  | filter isPresent(@messageJson)
  | filter @message like 'DUPLICATE_TEST'
  | stats count() as raw_count
```

Returns:
[screenshots of the query results]

Verifying the data in S3

@sb-2011 sb-2011 force-pushed the 1071-stale-data-check-dedup-logic-update branch from 03271ae to 43289c9 Compare February 17, 2026 17:13
Comment thread app/jobs/data_warehouse/cloudwatch_duplicate_log_counter_job.rb Outdated
Comment thread app/jobs/data_warehouse/shared/stale_data_utils.rb Outdated
```ruby
# Data warehouse duplicate log count check
cloudwatch_duplicate_log_counter_job: {
  class: 'DataWarehouse::CloudwatchDuplicateLogCounterJob',
  cron: '5,25,45 * * * *', # run 3x per hour, at 5, 25, and 45 minutes past the hour
```
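The minute field above lists three fire times, evenly spaced 20 minutes apart. A quick hand-rolled sanity check of that field (illustrative only, not a real cron parser):

```ruby
# The minute field of '5,25,45 * * * *' fires at three minutes past the
# hour, 20 minutes apart. This just parses that one field by hand.
minute_field = "5,25,45"
fire_minutes = minute_field.split(",").map(&:to_i)
gaps = fire_minutes.each_cons(2).map { |a, b| b - a }

puts fire_minutes.inspect  # => [5, 25, 45]
puts gaps.inspect          # => [20, 20]
```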
Contributor:
Is the only reason this is running 3x/hour to prevent CloudWatch log bloat and timeout issues?

Contributor (author):
I'm giving the job a chance to process a specific hour set more than once, just in case the first and/or second runs fail for some reason we cannot control, such as a network issue or an AWS outage.

Contributor:
I think GoodJob has a built-in retry mechanism, but I'm not sure whether it would catch those failures.

If the retry mechanism does catch them, then this would be unnecessary.

@MrNagoo can confirm

Contributor:
I can confirm, but I still needed to re-read the docs 😆 for syntax.

https://guides.rubyonrails.org/active_job_basics.html#retrying-or-discarding-failed-jobs

Since we're already using begin/rescue, we know we're capturing errors from the client, so something like `retry_on StandardError, wait: 10.minutes, attempts: 3` could replace the forced interval and free up significant resources.
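For context, `retry_on` comes from ActiveJob. Outside Rails, the behavior it provides is roughly the following (a minimal pure-Ruby sketch; the helper name is made up and the wait is shortened for illustration):

```ruby
# Minimal sketch of "retry up to N attempts with a fixed wait between
# tries" -- roughly what `retry_on StandardError, wait: 10.minutes,
# attempts: 3` gives an ActiveJob class. Hypothetical helper, tiny wait.
def with_retries(attempts: 3, wait: 0.01)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue StandardError
    if tries < attempts
      sleep(wait)
      retry
    end
    raise # out of attempts: surface the error
  end
end

# Succeeds on the third try:
result = with_retries { |n| raise "transient" if n < 3; "ok after #{n} tries" }
puts result  # => "ok after 3 tries"
```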

Comment thread app/jobs/data_warehouse/cloudwatch_duplicate_log_counter_job.rb
@astrogeco (Contributor):

Reminder: tag at least one of the commits (and the final squashed one) with the full GitLab issue URL to preserve linking.

@sb-2011 sb-2011 force-pushed the 1071-stale-data-check-dedup-logic-update branch from ba60462 to d5518c8 Compare February 19, 2026 19:05
@astrogeco astrogeco changed the title 1071 stale data check dedup logic update GL-Data 1071: Stale data check dedup logic update Apr 13, 2026
Comment thread app/jobs/data_warehouse/shared/stale_data_utils.rb Outdated
Comment thread app/jobs/data_warehouse/shared/stale_data_utils.rb Outdated
Comment thread app/jobs/data_warehouse/shared/stale_data_utils.rb Outdated
Comment thread app/jobs/data_warehouse/shared/stale_data_utils.rb
…ndling in duplicate_row_count_file_path method
@MrNagoo (Contributor) left a comment:
There are a few things we may need to adjust, but I'm not seeing anything that won't work.


```ruby
module DataWarehouse
  module Shared
    module StaleDataUtils
      NUM_THREADS = 6
```
Contributor:

Why are we using 6 threads? I ask because we have a hard ceiling of 5 requests per second on CloudWatch, and we have battled RateLimitExceeded errors.

Contributor (author):

Six threads was supposed to cover 6 × 10 minutes = 60 minutes (one hour). Testing in lower environments did not produce errors with 6 threads.
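The thread-per-slice idea can be sketched in plain Ruby. Everything here is hypothetical scaffolding: the slice boundaries mirror the 6 × 10-minute arithmetic above, and a stub stands in for the real CloudWatch query.

```ruby
NUM_THREADS = 6 # one thread per 10-minute slice of the hour

# Stub standing in for the real CloudWatch query; returns the duplicate
# count for one 10-minute slice. Purely illustrative.
def count_duplicates(slice_start, slice_end)
  0 # pretend no duplicates in this slice
end

hour_start = Time.utc(2026, 4, 16, 2)
slices = NUM_THREADS.times.map do |i|
  [hour_start + i * 600, hour_start + (i + 1) * 600] # 600 s = 10 min
end

# One thread per slice. With a 5 req/s CloudWatch ceiling, 6 concurrent
# queries can brush the limit -- hence the review question above.
counts = slices.map { |s, e| Thread.new { count_duplicates(s, e) } }.map(&:value)
puts counts.sum
```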
