Skip to content

Update in_cloudwatch_logs.rb for better throttling handling #264

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

hd40910
Copy link

@hd40910 hd40910 commented Apr 3, 2025

If we look at this ruby code , there are 2 potential issues:

  1. The current implementation has a flaw in it retry mechanism. When AWS throttles the get_log_events call, the code attempts to retry, but there's no limit to how many times it will retry
  2. The retry mechanism is inefficient because it's retrying the entire get_events method when throttling occurs if there are more than x events

If we look at this ruby code , there are 2 potential issues:

1.	The current implementation has a flaw in it retry mechanism. When AWS throttles the get_log_events call, the code attempts to retry, but there's no limit to how many times it will retry
2.	The retry mechanism is inefficient because it's retrying the entire get_events method when throttling occurs if there are more than x events
@hd40910
Copy link
Author

hd40910 commented Apr 3, 2025

@cosmo0920 for your approval, please do inform if you see issues

@cosmo0920
Copy link
Member

Currently, I have no cycles to take a look on it.
Could y'all take a look this, @kenhys or @daipom ?

@daipom daipom self-requested a review April 10, 2025 10:52
Copy link
Contributor

@daipom daipom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hd40910 Thanks for this improvement! Sorry for the delay.
This direction looks good to me.

I have commented on some points.
Please check them.

In addition, please consider the following points?

  • Remove unnecessary spaces.
    • Some new blank lines contain needless whitespaces.
  • Update README.
    • Add description for the new parameter.
    • Change description of throttling_retry_seconds about the exponential interval.
  • Fix failing tests on CI.
    • Looks like they fail because the log message has changed.
    • Since the wait time will be random, it would be good to loosen the assert condition. It would be unnecessary to assert an exact match for the entire message.)

@@ -43,6 +43,7 @@ class CloudwatchLogsInput < Input
config_param :end_time, :string, default: nil
config_param :time_range_format, :string, default: "%Y-%m-%d %H:%M:%S"
config_param :throttling_retry_seconds, :time, default: nil
config_param :max_retry_count, :integer, default: 999 #TODO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config_param :max_retry_count, :integer, default: 999 #TODO

Please consider the appropriate default value and remove the TODO comment.

I'm not familiar with CloudWatch.
Can there be a case where a user wants to retry an unlimited number of times, like the current version?

If so, retry should be unlimited when this value is nil.
And the default value might be better to be nil for compatibility.

request[:next_token] = log_next_token if !log_next_token.nil? && !log_next_token.empty?
request[:start_from_head] = true if read_from_head?(log_next_token)

# Only apply throttling retry to the API call, not the whole method
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Only apply throttling retry to the API call, not the whole method

Information that can be found in the code is not required for comments.
(Maybe those comments are for PR. Thanks. I understand the code. Let's remove them now.)

request[:next_token] = next_token if next_token
request[:log_stream_name_prefix] = log_stream_name_prefix if log_stream_name_prefix

# Only apply throttling retry to the API call
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Only apply throttling retry to the API call

end

def throttling_handler(method_name)
# New method to handle API calls with throttling retry with exponential backoff
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# New method to handle API calls with throttling retry with exponential backoff

if @throttling_retry_seconds && retry_count < @max_retry_count
# Calculate backoff with jitter: base_time * (2^retry_count) + random_jitter
wait_time = @throttling_retry_seconds * (2 ** retry_count) * (0.9 + 0.2 * rand)
log.warn "Haia - ThrottlingException on #{method_name}. Retry #{retry_count+1}/#{@max_retry_count}. Waiting #{wait_time.round(2)} seconds."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you tell me what Haia means?

log.warn "Haia - ThrottlingException on #{method_name}. Retry #{retry_count+1}/#{@max_retry_count}. Waiting #{wait_time.round(2)} seconds."
sleep wait_time

# Only retry the API call itself, not recursively
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Only retry the API call itself, not recursively

@@ -341,7 +358,12 @@ def describe_log_groups(log_group_name_prefix, log_groups = nil, next_token = ni
log_group_name_prefix: log_group_name_prefix
}
request[:next_token] = next_token if next_token
response = @logs.describe_log_groups(request)

# Apply throttling handling to describe_log_groups too
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Apply throttling handling to describe_log_groups too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants