Skip to content

Conversation

@cliu587
Copy link
Contributor

@cliu587 cliu587 commented Jun 25, 2019

PTAL @dguo-coursera, @amory-coursera - I think middlewares should be created per request rather than once on instantiation, as they hold execution state. This along with another change should unbreak the datadog tracing.

@cliu587
Copy link
Contributor Author

cliu587 commented Jun 25, 2019

not necessary :) thanks for the 2nd pair of eyes on this, @amory-coursera

@cliu587 cliu587 closed this Jun 25, 2019
@cliu587 cliu587 reopened this Jun 26, 2019
@cliu587
Copy link
Contributor Author

cliu587 commented Jun 26, 2019

Reviving this - while middlewares are indeed stateless, reusing the same middleware for every request seemed to have resulted in a memory leak due to the thread-local scope data not getting cleaned up.

Here's the heap dump for the newest deploy that resulted in java heap shooting up rapidly: https://imgur.com/a/t5SH0CV (~11GB heap)

Datadog link of deploy to ~40min after deploy: https://app.datadoghq.com/dashboard/a4f-p8g-3mx/service-dash?from_ts=1561553831162&live=true&tile_size=m&to_ts=1561568231162&tpl_var_service=assembler&fullscreen_widget=58099565&fullscreen_section=overview

Here's the heap dump for the last known good deploy (which still has tracing enabled):
https://imgur.com/a/cWpNnzu (~700MB heap)

Datadog link of deploy to ~40min after deploy: https://app.datadoghq.com/dashboard/a4f-p8g-3mx/service-dash?from_ts=1561553831162&live=true&tile_size=m&to_ts=1561568231162&tpl_var_service=assembler&fullscreen_widget=58099565&fullscreen_section=overview

My best guess is because we create a singleton tracing middleware, every thread-local created has a strong reference that's never cleaned up at the end of the request.

EDIT: also added datadog links to correlate

@cliu587
Copy link
Contributor Author

cliu587 commented Jun 27, 2019

Gentle ping on this - would like to resolve by end of week if at all possible. Thanks in advance!

@amory-coursera amory-coursera requested review from amory-coursera and removed request for amory-coursera June 27, 2019 18:20
Copy link
Contributor

@amory-coursera amory-coursera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like something we could fix in another way, but if this is the path of least resistance I'm ok with the approach.

@cliu587
Copy link
Contributor Author

cliu587 commented Jun 27, 2019

This feels like something we could fix in another way, but if this is the path of least resistance I'm ok with the approach.

I think the existence of thread-locals in middlewares along with async execution makes all other solutions very non-obvious. Assuming this fix indeed does the job, I think it is the path of least resistance and the most obviously correct solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants