GL-Data 1071: Stale data check dedup logic update#12876

Open
sb-2011 wants to merge 11 commits into main from 1071-stale-data-check-dedup-logic-update

Conversation

@sb-2011 (Contributor) commented Feb 17, 2026

🎫 Ticket

Issue-1071

🛠 Summary of changes

  • Created a Rails job to track duplicates by the hour for the log groups of interest.
  • Added a spec file for test coverage.

📜 Testing Plan

Steps to confirm the changes:

  • Step 1 - Deploy to personal env
  • Step 2 - Confirm the new job runs as expected

Tested in Sandbox

Run the new hourly job: CloudwatchDuplicateLogCounterJob

  • Processes two log groups (production.log and events.log).
  • Generates two CSV files with hourly duplicate counts; the files are stored in S3.
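The per-group CSV keys visible in the console output below follow a date-partitioned layout. A hypothetical helper sketching that layout (the method name and the `.log` → `_log` mapping are inferred from the output, not the job's actual API):

```ruby
require "date"

# Illustrative only: builds an S3 key matching the pattern seen in the
# job's log output, e.g.
# table_summary_stats/cw_log_duplicate_counts/events_log/2026/2026-04-16.csv
def duplicate_counts_s3_key(log_group:, date:)
  group_dir = log_group.sub(/\.log\z/, "_log") # "events.log" -> "events_log"
  "table_summary_stats/cw_log_duplicate_counts/#{group_dir}/#{date.year}/#{date.iso8601}.csv"
end

puts duplicate_counts_s3_key(log_group: "events.log", date: Date.new(2026, 4, 16))
# => table_summary_stats/cw_log_duplicate_counts/events_log/2026/2026-04-16.csv
```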
```
identity(spalathura):002:0> DataWarehouse::CloudwatchDuplicateLogCounterJob.perform_now(Time.zone.now)
Processing log group spalathura_/srv/idp/shared/log/events.log, checking for duplicates.
Reading existing duplicate counts from s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/cw_log_duplicate_counts/events_log/2026/2026-04-16.csv
Checking for duplicates at hour 2.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 3.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 4.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 5.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 6.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 7.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 8.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 9.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 10.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 11.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 12.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 13.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Uploading duplicate counts to s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/cw_log_duplicate_counts/events_log/2026/2026-04-16.csv
DataWarehouse::CloudwatchDuplicateLogCounterJob: uploading to s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/cw_log_duplicate_counts/events_log/2026/2026-04-16.csv
Successfully updated duplicate counts for hour bucket(s): [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13].
Processing log group spalathura_/srv/idp/shared/log/production.log, checking for duplicates.
Reading existing duplicate counts from s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/cw_log_duplicate_counts/production_log/2026/2026-04-16.csv
Checking for duplicates at hour 2.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 3.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 4.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 5.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 6.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 7.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 8.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 9.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 10.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 11.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 12.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Checking for duplicates at hour 13.
Querying log slices 100% [6/6] |====================================================================================================================| Time: 00:00:03
Uploading duplicate counts to s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/cw_log_duplicate_counts/production_log/2026/2026-04-16.csv
DataWarehouse::CloudwatchDuplicateLogCounterJob: uploading to s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/cw_log_duplicate_counts/production_log/2026/2026-04-16.csv
Successfully updated duplicate counts for hour bucket(s): [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13].
nil
```

Run the existing modified daily job: TableSummaryStatsExportJob

```
identity(spalathura):003:0> DataWarehouse::TableSummaryStatsExportJob.perform_now(Time.zone.today)
Querying log slices 100% [1/1] |====================================================================================================================| Time: 00:00:03
Reading existing duplicate counts from s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/cw_log_duplicate_counts/events_log/2026/2026-04-16.csv
Querying log slices 100% [1/1] |====================================================================================================================| Time: 00:00:03
Reading existing duplicate counts from s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/cw_log_duplicate_counts/production_log/2026/2026-04-16.csv
DataWarehouse::TableSummaryStatsExportJob: uploading to s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/2026/2026-04-16_table_summary_stats.json
"s3://login-gov-analytics-export-spalathura-894947205914-us-west-2/table_summary_stats/2026/2026-04-16_table_summary_stats.json"
```

Verifying the results

```
fields @timestamp, jsonParse(@message) as @messageJson
  | filter @logStream like 'worker-i-' or @logStream like 'idp-i-'
  | filter isPresent(@messageJson)
  | filter @message like 'DUPLICATE_TEST'
  | stats count() as raw_count
```

Returns:
[screenshots of the query results]

Verifying the data in S3

@sb-2011 sb-2011 force-pushed the 1071-stale-data-check-dedup-logic-update branch from 03271ae to 43289c9 Compare February 17, 2026 17:13
Comment thread app/jobs/data_warehouse/cloudwatch_duplicate_log_counter_job.rb Outdated
Comment thread app/jobs/data_warehouse/shared/stale_data_utils.rb Outdated
```ruby
# Data warehouse duplicate log count check
cloudwatch_duplicate_log_counter_job: {
  class: 'DataWarehouse::CloudwatchDuplicateLogCounterJob',
  cron: '5,25,45 * * * *', # run 3x per hour, at 5, 25, and 45 minutes past the hour
```
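The minute field above lists three fire times, evenly spaced 20 minutes apart. A quick hand-rolled sanity check of that field (illustrative only, not a real cron parser):

```ruby
# The minute field of '5,25,45 * * * *' fires at three minutes past the
# hour, 20 minutes apart. This just parses that one field by hand.
minute_field = "5,25,45"
fire_minutes = minute_field.split(",").map(&:to_i)
gaps = fire_minutes.each_cons(2).map { |a, b| b - a }

puts fire_minutes.inspect  # => [5, 25, 45]
puts gaps.inspect          # => [20, 20]
```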
Contributor:
Is the only reason this is running 3x/hour to prevent CloudWatch log bloat and timeout issues?

Contributor (author):
I'm giving the job a chance to process a specific hour set more than once, just in case the first and/or second runs fail for some reason we cannot control, such as a network issue or an AWS outage.

Contributor:
I think GoodJob has a built-in retry mechanism, but I'm not sure whether it would catch those failures.

If the retry mechanism does catch them, then this would be unnecessary.

@MrNagoo can confirm

Contributor:
I can confirm, but I still needed to re-read the docs 😆 for syntax.

https://guides.rubyonrails.org/active_job_basics.html#retrying-or-discarding-failed-jobs

Since we're already using begin/rescue, we know we're capturing errors from the client, so something like `retry_on StandardError, wait: 10.minutes, attempts: 3` could replace the forced interval and free up significant resources.
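For context, `retry_on` comes from ActiveJob. Outside Rails, the behavior it provides is roughly the following (a minimal pure-Ruby sketch; the helper name is made up and the wait is shortened for illustration):

```ruby
# Minimal sketch of "retry up to N attempts with a fixed wait between
# tries" -- roughly what `retry_on StandardError, wait: 10.minutes,
# attempts: 3` gives an ActiveJob class. Hypothetical helper, tiny wait.
def with_retries(attempts: 3, wait: 0.01)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue StandardError
    if tries < attempts
      sleep(wait)
      retry
    end
    raise # out of attempts: surface the error
  end
end

# Succeeds on the third try:
result = with_retries { |n| raise "transient" if n < 3; "ok after #{n} tries" }
puts result  # => "ok after 3 tries"
```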

Comment thread app/jobs/data_warehouse/cloudwatch_duplicate_log_counter_job.rb
@astrogeco (Contributor):

Reminder: tag at least one of the commits (and the final squashed one) with the full GitLab issue URL to preserve linking.

@sb-2011 sb-2011 force-pushed the 1071-stale-data-check-dedup-logic-update branch from ba60462 to d5518c8 Compare February 19, 2026 19:05
@astrogeco astrogeco changed the title 1071 stale data check dedup logic update GL-Data 1071: Stale data check dedup logic update Apr 13, 2026
Comment thread app/jobs/data_warehouse/shared/stale_data_utils.rb Outdated
Comment thread app/jobs/data_warehouse/shared/stale_data_utils.rb Outdated
Comment thread app/jobs/data_warehouse/shared/stale_data_utils.rb Outdated
Comment thread app/jobs/data_warehouse/shared/stale_data_utils.rb
…ndling in duplicate_row_count_file_path method
@MrNagoo (Contributor) left a comment:
There are a few things we may need to adjust, but I'm not seeing anything that won't work.


```ruby
module DataWarehouse
  module Shared
    module StaleDataUtils
      NUM_THREADS = 6
```
Contributor:

Why are we using 6 threads? I ask because we have a hard ceiling of 5 requests per second on CloudWatch, and we have battled RateLimitExceeded errors.

Contributor (author):

Six threads was supposed to cover 6 × 10 minutes = 60 minutes (one hour). Testing in lower environments did not produce errors with 6 threads.
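The thread-per-slice idea can be sketched in plain Ruby. Everything here is hypothetical scaffolding: the slice boundaries mirror the 6 × 10-minute arithmetic above, and a stub stands in for the real CloudWatch query.

```ruby
NUM_THREADS = 6 # one thread per 10-minute slice of the hour

# Stub standing in for the real CloudWatch query; returns the duplicate
# count for one 10-minute slice. Purely illustrative.
def count_duplicates(slice_start, slice_end)
  0 # pretend no duplicates in this slice
end

hour_start = Time.utc(2026, 4, 16, 2)
slices = NUM_THREADS.times.map do |i|
  [hour_start + i * 600, hour_start + (i + 1) * 600] # 600 s = 10 min
end

# One thread per slice. With a 5 req/s CloudWatch ceiling, 6 concurrent
# queries can brush the limit -- hence the review question above.
counts = slices.map { |s, e| Thread.new { count_duplicates(s, e) } }.map(&:value)
puts counts.sum
```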
