Conversation
…locean/docker-distribution into ndas/DOCR-1663/connection-timeout
```go
server := &http.Server{
	Handler:           handler,
	ReadHeaderTimeout: readHeaderTimeout,
	ReadTimeout:       60 * time.Minute,
}
```
Supporting 20 GB uploads via a connection timeout seems like a short-term workaround.
What if a user has a bigger file and a good internet connection: will they be able to upload/download a file larger than 20 GB, and vice versa?
Also, is it possible to create a DDoS attack by opening multiple 1-hour-long connections? Do we know how we can thwart it?
Have we explored options for supporting multipart upload while limiting the file size to maybe 1 GB? 5 GB?
> What if the user has bigger file and good internet connection will they be able to download/upload >20G file and vice versa?

> Have we explored option around how can we support multi part upload? while limiting the file size to maybe 1 gigs? 5Gigs?
The Docker client already chunks uploads into roughly 5 GB parts (per HTTP POST/PATCH/PUT). Customers were previously able to push images as large as 60 GB this way. Recently, though, the Docker client's chunks have grown slightly beyond that, and the registry.digitalocean.com Cloudflare zone had a 5 GB request-body limit, so customer pushes started failing. We will raise a Cloudflare request to increase this limit, and we are setting a 1-hour connection timeout to prevent slow-connection attacks such as R.U.D.Y. We are going to document only 20 GB in our product docs because we want to try 20 GB uploads first and perhaps raise the limit after some months of customer testing.
> Also is it possible to create a DDoS attach by opening multiple 1hr long connection? Do we know how can we thwart it?
Cloudflare has shielded our origins from such attacks. After we lift the 5 GB limit, Cloudflare will no longer shield us, so yes, multiple 1-hour-long connections could be used to slow our servers down. We do not have any mechanism to thwart this at the moment; however, we can add Grafana dashboards and alerting for long-running connections like these.
> But recently docker clients chunks it to little bit more than that

What will that number be? I am assuming "a little bit" is not 15 GB.
What was the timeout value before?

We have updated the timeout to 60 min, but what was this value before? I am trying to understand: if the chunk size has not increased by a lot, are we increasing the timeout proportionally, or how did we land on 60 min?
Document 20 GB in product docs

I would be worried about slow connections here: customers unable to upload 20 GB in 1 hour will not like us failing to keep our promise. Also, once one customer figures out the limit is not really 20 GB, word can spread easily; we should think about how we can put technical guardrails around it.
> After we lift this 5GB limit, cloudflare would not shield us anymore

Is this limit applicable when we are doing chunked uploads of <= 5 GB?
> we can add grafana dashboards and alerting for such kind of long running connections.

Please add those and link them in this PR description.
> What will that no be? I am assuming little bit is not 15G

I will find out and post it here.
> we should think about how can we have technical guardrails around it.

That's a valid point. I will explore how we can do this, either via nginx's conf or from our own code.
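One way to add such a guardrail from our own code, sketched with the standard library (the 20 GiB cap and the names here are hypothetical, not this PR's implementation; nginx's `client_max_body_size` would be the config-side equivalent):

```go
package main

import (
	"fmt"
	"net/http"
)

// maxChunkBytes is a hypothetical per-request body cap.
const maxChunkBytes = 20 << 30 // 20 GiB

// limitBody rejects request bodies over the cap with 413 as soon as the
// limit is crossed, instead of letting an oversized upload run for an hour.
func limitBody(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, maxChunkBytes)
		next.ServeHTTP(w, r)
	})
}

func main() {
	fmt.Println(maxChunkBytes / (1 << 30)) // cap in GiB
}
```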
> Is this limit applicable when we are doing chunk uploads of <= 5G?

I am not sure; we need to confirm with Cloudflare. We will only be able to do that when we raise the request.
> Please add those and put them in this PR description
Will do.
@NayanJD I'm assuming these huge timeouts are required only for a few endpoints. Can we explore creating a different server for those particular endpoints? It might help us a bit as a security measure. I'm okay with it as-is, since a customer is blocked on this.
@gane5hvarma In the end, if the dedicated per-endpoint servers are run by the same Go process, the bottleneck would still exist. It would make sense if we ran separate pods for each server and then load-balanced by route using nginx. Also, we cannot use the same port for different servers in one process; we could use the OS port-reuse primitive (SO_REUSEPORT), but I am not sure how that works in Docker containers. I think it would be overkill for this, but I'm open to suggestions.
Yes, we can add metrics for this. We can then visualise in Grafana the p50 or p99 connection time for HTTP requests.
You are right, we can't create separate servers while reusing the same port in the same process. Different processes would be needed for the different servers, plus a route-forwarding service like nginx. But yeah, as I said, for now we can go ahead and add a metric to observe.
We can reduce it, but then actual customers with spotty connections might have trouble uploading 20 GB layers. Image pushes are generally done by CI/CD workers and sometimes by developers from their machines. Pushes from CI/CD workers should be fast, but pushes from developer machines might be slow. So 20 GB / 3600 s gives us ~6 MB/s; that means we are assuming a transfer rate of more than 6 MB/s on the client side, which I believe is reasonable even for home WiFi. BTW, is there any timeout value you have in mind, @gane5hvarma?
This PR adds HTTP server timeouts to prevent R.U.D.Y. attacks.
Jira: