Current capacity issues
A short partial outage on Mar 9th1 and a rebalancing of servers2 that took 12 hours have led me to conclude we're too close to capacity in Europe3 during its weekday peaks. The rebalancing has improved matters, but we're still nearing capacity at peak times.
Details of server balancing
The tile stores function similarly to a cache, so we have what amounts to a two-level cache in front of rendering. The second level, the tile store, caches metatiles rather than tiles, so it's a different cache object that we need to split across servers. Because rendering new tiles is expensive if a tile store (and its server) becomes unavailable, we want each metatile to be warm in two servers. This also has to work across two regions, where we prefer sending requests out of region over sending them to a server that is guaranteed not to have the tiles in its store.
We then have to do all this with servers of varying capacity, and make sure the load is distributed in proportion to capacity if any one server goes down. That's why this takes a large amount of custom VCL to distribute requests unevenly to groups and handle fallback.
None of this matters very much in the end, since we're close enough to the limit that it would be an issue under any means of distributing load. This is also why sending part of a region's traffic to a different region with more capacity isn't a long-term solution.
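To make the tile/metatile distinction concrete, here is a minimal Python sketch (not our actual code); the 8x8 metatile size matches the divisions by 8 in the VCL below.

# Minimal sketch (not production code): the tile store caches whole
# metatiles, so all 8x8 = 64 tiles of a block share one cache object.
# The metatile size matches the divisions by 8 in choose-backend.vcl below.
METATILE_SIZE = 8

def metatile_key(z: int, x: int, y: int) -> tuple:
    """Tiles in the same 8x8 block share a metatile key (one cache object)."""
    return (z, x // METATILE_SIZE, y // METATILE_SIZE)

# Neighbouring tiles hit the same cached metatile...
assert metatile_key(12, 2048, 1362) == metatile_key(12, 2055, 1367)
# ...while the next block over needs a different metatile rendered.
assert metatile_key(12, 2056, 1362) != metatile_key(12, 2048, 1362)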
init/define-directors.vcl
/*
Load distribution among tile rendering servers of differing capacities is a
difficult problem. If we distribute requests randomly across the backends we
encounter duplication where different servers are rendering the same metatile.
If we want to avoid all duplication we could use a single chash director with
appropriate weights. The problem with this is that when a server goes down
100% of the requests it was serving will be for metatiles that the other
servers do not have in storage.
We want each metatile to go to at least two servers in normal operation to
keep the storage warm. If we had all identical servers this would be easy.
With identical servers A, B, C you make groups AB, AC, BC and send a third
of the requests to each group. Then if A goes down, B and C take equal
portions of the load.
It can be shown that if you have servers A, B, C with relative capacities
a, b, c and groups AB, AC, BC, then within each group the weights should be
the members' relative capacities, and the weights across the groups should
be the sums of their members' capacities.
Our situation is more complicated because we (a) want to avoid sending traffic
to distant backends, while at the same time we (b) want to fall back to a
distant backend in preference to sending traffic to servers that will not have
the tiles in storage. Cross-ocean latency is better than the time to render a
new metatile.
If (a) was the only requirement we could use a chash of the AB, AC, BC groups.
Instead we have to have a fallback director in each region for each group. This
way requests going to the AB North America group fall back to the AB Europe
group.
We can't use a chash at the top level because we want the selection within the
groups to be sticky to client IP. This prevents us from changing
client.identity to the metatile number.
*/
/*
Our current servers and relative capacity (within region) are approximately
Europe:
A: nidhogg 6
B: culebre 6
C: odin 2
D: ysera 2
E: wawel 2
North America:
F: palulukon 2
G: piasa 1
H: orm 1
Asia only has two servers so we ignore it for the math.
This gives groups with relative capacities of
Europe
AB 3
AC 2
AD 2
AE 2
BC 2
BD 2
BE 2
CD 1
CE 1
DE 1
To avoid needing even *more* directors we fudge the NA numbers so they total
the same as Europe.
North America
FG 7
FH 7
GH 4
When inputting the within-group weights we multiply by a constant just so it's
easier to scale servers in/out of the groups
*/
director europe_AB client {
{ .backend = F_nidhogg; .weight = 60; }
{ .backend = F_culebre; .weight = 60; }
}
/* ... */
director na_FG client {
{ .backend = F_palulukon; .weight = 100; }
{ .backend = F_piasa; .weight = 50; }
}
/*
Now that we have all the director groups set we need to set fallbacks
We group them up as
0 AB GH
1 AB GH
2 AB GH
3 AC FG
4 AC FG
...
*/
director eu_0 fallback {
{ .backend = europe_AB; }
{ .backend = na_GH; }
{ .backend = asia; }
}
director na_0 fallback {
{ .backend = na_GH; }
{ .backend = europe_AB; }
{ .backend = asia; }
}
director asia_0 fallback {
{ .backend = asia; }
{ .backend = na_GH; }
{ .backend = europe_AB; }
}
director eu_3 fallback {
{ .backend = europe_AC; }
{ .backend = na_FG; }
{ .backend = asia; }
}
director na_3 fallback {
{ .backend = na_FG; }
{ .backend = europe_AC; }
{ .backend = asia; }
}
director asia_3 fallback {
{ .backend = asia; }
{ .backend = na_FG; }
{ .backend = europe_AC; }
}
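The weighting scheme from the comment above can be sanity-checked numerically. Here is a small Python model (illustration only; the production logic is the VCL above) using the Europe capacities listed in the comment. It confirms that in normal operation each server's share of traffic exactly matches its share of capacity, and shows how a failed server's traffic spreads over its group partners.

from itertools import combinations

# Illustration only -- a Python model of the weighting scheme described in
# the comment above, using the Europe capacities listed there.
capacity = {"A": 6, "B": 6, "C": 2, "D": 2, "E": 2}

# Across groups the weight is the sum of member capacities; within a group
# each member is weighted by its own capacity.
groups = {pair: capacity[pair[0]] + capacity[pair[1]]
          for pair in combinations(sorted(capacity), 2)}

def shares(down=None):
    """Fraction of traffic each server gets; a down server's share of each
    of its groups goes to the group's other member."""
    load = dict.fromkeys(capacity, 0.0)
    for (s1, s2), weight in groups.items():
        if down in (s1, s2):
            load[s2 if s1 == down else s1] += weight
        else:
            pair_total = capacity[s1] + capacity[s2]
            load[s1] += weight * capacity[s1] / pair_total
            load[s2] += weight * capacity[s2] / pair_total
    total = sum(load.values())
    return {s: round(l / total, 3) for s, l in load.items()}

print(shares())          # normal: A/B get 1/3 each, C/D/E 1/9 -- matching 6:6:2:2:2
print(shares(down="A"))  # A's groups fall back to B, C, D, E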
recv/choose-backend.vcl
declare local var.z INTEGER;
declare local var.x_mt INTEGER;
declare local var.y_mt INTEGER;
declare local var.tileid INTEGER;
declare local var.director INTEGER;
/*
See define-directors.vcl for a lengthy discussion of how the servers are split
up into directors
*/
/* Requests for tiles get split on the basis of metatile. All the work in
define-directors assumes we can distribute metatiles without any correlation
with views. Getting a tileid then hashing it works.*/
set var.z = 0;
set var.x_mt = 0;
set var.y_mt = 0;
set var.tileid = 1;
if (req.url.path ~ "^/(1?[0-9])/([0-9]+)/([0-9]+)\.png") {
// Compute a tileid of x_mt * 2^z + y_mt from the metatile coordinates
set var.z = std.atoi(re.group.1);
set var.x_mt = std.atoi(re.group.2);
set var.x_mt /= 8;
set var.y_mt = std.atoi(re.group.3);
set var.y_mt /= 8;
set var.tileid <<= var.z;
# After this step the maximum is 2^19*2^19, which is under the max int
set var.tileid *= var.x_mt;
set var.tileid += var.y_mt;
// Strip query string
set req.url = req.url.path;
}
/* Turn the tileid into a number from 0 to 17 to pick a director */
set var.director = fastly.hash(std.itoa(var.tileid), 0, 0, 17);
if (/* conditions on server.region */) {
if (var.director >= 17) {
set req.backend = na_17;
} else if (var.director >= 16) {
set req.backend = na_16;
/* ... */
}
} else if (/* conditions on server.region */) {
if (var.director >= 17) {
set req.backend = asia_17;
/* ... */
}
} else {
if (var.director >= 17) {
set req.backend = eu_17;
/* ... */
}
}
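To make the routing concrete, here is a Python walkthrough (illustration only) of the same computation. The md5-based hash is a stand-in of my own: the actual fastly.hash() algorithm isn't shown above, and only the even spread over 0..17 matters for the example.

import hashlib
import re

def pick_director(path: str) -> int:
    """Mirror of choose-backend.vcl: map a tile URL to a director 0..17."""
    tileid = 1  # same default as the VCL
    m = re.match(r"^/(1?[0-9])/([0-9]+)/([0-9]+)\.png", path)
    if m:
        z, x, y = (int(g) for g in m.groups())
        x_mt, y_mt = x // 8, y // 8        # metatile coordinates
        tileid = (1 << z) * x_mt + y_mt    # x_mt * 2^z + y_mt
    # Stand-in for fastly.hash(std.itoa(tileid), 0, 0, 17); the real
    # algorithm differs, but any even hash onto 0..17 illustrates the idea.
    return int(hashlib.md5(str(tileid).encode()).hexdigest(), 16) % 18

# Every tile of a metatile hashes to the same director, so within a region
# only one two-server group ever renders a given metatile:
assert pick_director("/12/2048/1362.png") == pick_director("/12/2055/1367.png")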
Increasing capacity
We would appreciate more donated tile rendering nodes. If you can help, please reach out to us. Thanks for the tile rendering servers supplied by:
Thanks for the space and bandwidth for OSMF-owned tile rendering servers provided by:
Special thanks to Fastly's Fast Forward program which has been essential in running our service for the last five years.
Current hardware prices and the supply shortage make it a bad time to buy more capacity. If we had known prices would rise like this4, we would have purchased more capacity before the crunch.
Reducing traffic
Even with a donation of a new high-capacity European node, we would still need to maintain enough redundancy to survive a server going down. This means we need to reduce load on the Standard Tile Layer.5
Our priority for the service is maps on openstreetmap.org and supporting OpenStreetMap editing in general6. Providing maps to open-source, open data, and public good projects7 is secondary. Other uses are tertiary. To meet our core needs we must limit other usage.
The math says we should run at most 60-70% of absolute maximum capacity if we want redundancy. I'd be happy with reducing traffic by 15% right now.
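To show where a figure like that comes from, here is my back-of-the-envelope arithmetic (in Python, for illustration) using the Europe capacities listed earlier:

# Rough headroom arithmetic using the Europe capacities listed above.
europe = {"nidhogg": 6, "culebre": 6, "odin": 2, "ysera": 2, "wawel": 2}

total = sum(europe.values())               # 18 units of capacity
survivors = total - max(europe.values())   # 12 units left if the largest server fails
print(survivors / total)                   # 0.667 -> run at roughly 2/3 of maximum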
Blocking scrapers pretending to be openstreetmap.org traffic
We have scrapers pretending to be traffic from OSM.org. Scrapers are a particular problem, as our architecture is designed around normal users. The exact means of detection and blocking are confidential, but we are much better positioned here with the logging changes and new Fastly features. Because openstreetmap.org is a site we run, we have additional means of detecting fake traffic.
Some legitimate traffic may be temporarily blocked by mistake as we work to identify abusive traffic.
Reducing QGIS Traffic
QGIS is the heaviest single user of the Standard Tile Layer by a large margin because some QGIS users use it to bulk-download tiles. We have sent the QGIS PSC an email notifying them they need to reduce excessive QGIS traffic. Rate-limiting may be of help here. We've got much improved rate-limiting abilities that don't involve sending error responses.
Blocking scrapers
The same improved detection abilities and rate-limiting mentioned above apply here.
Rate-limiting all non-OSM traffic
We have options other than error tiles now. This allows us to decrease the load from secondary priorities. We can also look into putting rate-limits on just cache misses with some larger VCL changes.
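As a conceptual sketch only (this is not Fastly's rate-limiting API and not our VCL), limiting on cache misses rather than all requests looks roughly like a per-client token bucket that is charged only when a request misses the cache:

import time
from collections import defaultdict

# Conceptual sketch only -- not Fastly's API and not our VCL. A per-client
# token bucket charged only on cache misses: clients fetching already-cached
# tiles are barely affected, while cold-tile scrapers run out of tokens.
RATE = 10.0    # misses refilled per second (illustrative number)
BURST = 100.0  # bucket size (illustrative number)

buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow_miss(client_ip: str) -> bool:
    """Charge one token for a cache miss; False means throttle this client."""
    b = buckets[client_ip]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False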
Telling large low priority users to switch
From when we moved to Fastly until this year, we never had to ask a user who was following the tile usage policy to switch to a different service. Our terms allow us to block users if they cause problems for others, or at our discretion. However, this year, for the first time, we had to tell a user to switch providers.
We try to contact users and give them the time they need to switch service providers. Based on our past experience, we know that some users may have invalid or non-existent contact email addresses. Our main concern is the peak load on our busiest servers, so looking at our published worldwide daily average data will not indicate why we are asking certain users to switch.
There are many third-party tile providers, some of which are a drop-in replacement for the standard tile layer.
Does this impact me?
Gaining the needed capacity will likely involve using a combination of these methods, as well as possibly others we haven't thought of.
If you use any of our services, make sure to set a suitable user-agent and follow our usage policies. If you're following the policies and using only a few tens of thousands of tiles, the goal is that none of this should impact you.
We don't offer support for third-party library integrations, but if you need to contact us, you must include the user-agent and referer headers you are sending, as well as the response you received. We cannot begin to help anyone without this information. In cases of mistaken blocks we will generally need an IP and timestamp to find a request in our logs.
Footnotes
The leading theory is that one server went down briefly, shedding load to the others, which started a rolling outage due to the excess load. I'm not 100% convinced this is what happened, but it resolved too quickly to fully determine. ↩
Private repo: https://github.com/openstreetmap/opentofu-fastly/blob/5e31aef42bdf7eb7d900e3ed05d0e48204f50811/snippets/tile.openstreetmap.org/init/define-directors.vcl#L1-L35 ↩
The servers in the Americas were over capacity before the balancing changes, but the efficiency gains had a larger impact in America, so we're not as close to the edge there. ↩
If we had a reliable way of predicting future prices we'd be rich commodity traders. ↩
Improving OpenStreetMap Carto's efficiency is also an option, but it is outside operations' scope. If someone intends to work on this and we can supply data, please reach out. ↩
Both editors like JOSM and iD, and services like osmcha, hdyc, etc. that support editing. ↩
E.g. apps created in immediate response to disasters. ↩