
Commit db65860

sabine, Khady, and davesnx authored
(data) Update the Ahrefs success story (#2863)
* new Ahrefs success story
* fmt
* Update data/success_stories/ahrefs.md
  Co-authored-by: Louis <mail+github@louisroche.net>
* Update data/success_stories/ahrefs.md
  Co-authored-by: Louis <mail+github@louisroche.net>
* Update data/success_stories/ahrefs.md
  Co-authored-by: Louis <mail+github@louisroche.net>
* clarification
* Update data/success_stories/ahrefs.md
  Co-authored-by: Louis <mail+github@louisroche.net>
* Update data/success_stories/ahrefs.md
  Co-authored-by: Louis <mail+github@louisroche.net>
* be more vague on number of requests frontend/backend
* devkit / bindings
* Update data/success_stories/ahrefs.md
  Co-authored-by: Louis <mail+github@louisroche.net>
* Update src/ocamlorg_web/lib/redirection.ml
  Co-authored-by: Louis <mail+github@louisroche.net>
* rewrite taking into account feedback, reframe around always being an OCaml company
* add relevant BuckleScript -> ReScript context
* edits
* two success stories
* new image for full stack story
* Update data/success_stories/ahrefs-full-stack-web.md
  Co-authored-by: David Sancho <dsnxmoreno@gmail.com>
* addressing @davesnx review, thanks Dave
* shorten list of why reasons
* remove redirect bc it's two stories
* redirect for title change of old ahrefs story
* Apply suggestions from code review @Khady
  Co-authored-by: Louis <mail+github@louisroche.net>
* Apply suggestions from code review
* editing
* remove full stack web success story (moved to another PR)

---------

Co-authored-by: Louis <mail+github@louisroche.net>
Co-authored-by: David Sancho <dsnxmoreno@gmail.com>
1 parent ff4f18b commit db65860

3 files changed: +58 -30 lines changed

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+---
+title: Petabyte-Scale Web Crawling and Data Processing
+logo: success-stories/ahrefs.svg
+card_logo: success-stories/white/ahrefs.svg
+background: /success-stories/ahrefs-bg.jpg
+theme: blue
+synopsis: "Ahrefs built the world's third-largest web crawler using OCaml, indexing petabytes of web data with a lean, efficient team."
+url: https://ahrefs.com/
+priority: 2
+why_ocaml_reasons:
+- Performance
+- Reliability
+- Expressiveness
+- Scalability
+- Maintainability
+---
+
+## Challenge
+
+[Ahrefs](https://ahrefs.com/) is a Singapore-based SaaS company that provides comprehensive SEO tools and marketing intelligence powered by big data. Since 2011, they've been crawling the entire web daily to maintain extensive databases of backlinks, keywords, and website analytics that help businesses with SEO strategy, competitor analysis, and content optimization. Today, they're trusted by 44% of Fortune 500 companies.
+
+Building and operating a web crawler at internet scale presents extraordinary challenges. Ahrefs needs to index billions of web pages continuously, process petabytes of data in real time, and turn this massive dataset into actionable insights for thousands of customers worldwide. The technical demands are staggering: their systems must handle **500 billion backend requests per day** while maintaining **over 100PB of storage**.
+
+As a self-funded company, Ahrefs couldn't solve these challenges by throwing unlimited resources at the problem. They needed maximum efficiency from a small team: systems that could run reliably for months without intervention, code that could be understood and maintained by a lean engineering organization, and performance that could compete with tech giants despite having a fraction of their headcount.
+
+The question wasn't just whether they could build a web-scale crawler, but whether they could do it sustainably with the constraints of a bootstrapped company.
+
+## Result
+
+Over a decade later, Ahrefs operates one of the world's most sophisticated web crawling operations. Their OCaml-powered systems maintain an index of **492.7 billion pages** across **500.4 million domains**.
+
+This technical achievement translates directly to business success. Ahrefs has grown into a **$100M+ ARR company** with **150 employees** managing **4000+ servers**, all while maintaining their original philosophy of operational efficiency. They've become the sector leader in SEO tools, proving that the right technology choices can create sustainable competitive advantages.
+
+The reliability of their OCaml systems is perhaps most impressive: programs written years ago continue running without surprises, requiring minimal maintenance from their engineering team. This "boring" reliability has allowed Ahrefs to focus engineering effort on building new features and capabilities rather than fighting infrastructure fires.
+
+Their success demonstrates that OCaml can power not just technical excellence at massive scale, but sustainable business growth in highly competitive markets.
+
+## Solution
+
+Ahrefs built their crawling infrastructure around OCaml's strengths, creating a distributed system that balances performance, reliability, and maintainability. **[OCaml](https://ocaml.org/)** serves as the primary language for all crawling and data processing systems, compiled natively for maximum performance across their **4000+ servers**.
+
+Their architecture treats data consistency as paramount. By defining shared data structures (using **[ATD (Adjustable Type Definitions)](https://github.com/ahrefs/atd)**, and now moving to [melange-json](https://github.com/melange-community/melange-json)), they ensure type safety throughout their processing pipeline, from initial web crawling to final data storage. This approach catches schema mismatches at compile time rather than at runtime, which is crucial when processing billions of pages daily.
+
+Their storage layer combines **[ClickHouse](https://clickhouse.com/)**, **[MySQL](https://www.mysql.com/)**, and **[Elasticsearch](https://www.elastic.co/)**. The key insight was designing these systems to work together seamlessly through shared OCaml types rather than complex API layers.
+
+Ahrefs maintains their own libraries and frameworks rather than relying on generic solutions. This "build it ourselves" philosophy requires more initial investment but delivers systems perfectly tailored to web crawling demands. Their **1.5 million lines of OCaml code** represent years of accumulated domain expertise encoded in reliable, maintainable software.
+
+The result is a unified system where improvements to crawling algorithms, data processing pipelines, or storage efficiency can be implemented quickly and deployed confidently across their entire infrastructure.
+
+## Why OCaml
+
+* **Low maintenance burden**: OCaml systems built years ago continue running without intervention, allowing engineers to focus on new development rather than troubleshooting production issues.
+* **Static typing catches errors**: At petabyte scale, compile-time type checking prevents data format inconsistencies and runtime failures that would be expensive to debug in production environments processing large volumes of web data.
+* **Language expressiveness reduces development time**: OCaml's abstractions enabled building domain-specific systems efficiently rather than adapting existing frameworks. Small teams could develop complex crawling and data processing systems with relatively few lines of code.
+* **Performance**: Native compilation provides the throughput needed for processing billions of daily requests while maintaining code readability for long-term maintenance.
+* **Cost-effective specialized tooling**: OCaml made it practical to build custom systems tailored to specific requirements rather than using general-purpose solutions, which aligned with their business constraints of limited engineering resources.
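
The "shared types" idea from the Solution section of the story above can be illustrated with a minimal sketch. It is not Ahrefs' code: the `Schema` module, its fields, and the `crawl` and `to_row` functions are hypothetical names made up for illustration, and the types are written by hand here, whereas ATD or melange-json would derive them (and their JSON codecs) from one definition.

```ocaml
(* Hypothetical shared schema, written by hand for this sketch. *)
module Schema = struct
  type fetch_status =
    | Fetched of int (* HTTP status code *)
    | Timed_out
    | Blocked_by_robots

  type page = {
    url : string;
    status : fetch_status;
    outlinks : string list;
  }
end

(* Producer side (a stand-in for a crawler): builds values of the shared type. *)
let crawl (url : string) : Schema.page =
  { Schema.url = url;
    status = Schema.Fetched 200;
    outlinks = [ "https://example.com/a" ] }

(* Consumer side (a stand-in for a storage writer): reads the same type. *)
let to_row (p : Schema.page) : string =
  let status =
    match p.Schema.status with
    | Schema.Fetched code -> string_of_int code
    | Schema.Timed_out -> "timeout"
    | Schema.Blocked_by_robots -> "robots"
  in
  Printf.sprintf "%s\t%s\t%d" p.Schema.url status
    (List.length p.Schema.outlinks)

let () = print_endline (to_row (crawl "https://example.com/"))
```

If a field is added to `Schema.page`, `crawl` stops compiling until it supplies the new field. If a constructor is added to `fetch_status`, the `match` in `to_row` triggers a non-exhaustive pattern warning, which strict build settings typically promote to an error.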

data/success_stories/ahrefs.md

Lines changed: 0 additions & 30 deletions
This file was deleted.

src/ocamlorg_web/lib/redirection.ml

Lines changed: 2 additions & 0 deletions
@@ -252,6 +252,8 @@ let from_v2 =
     ("/docs/platform-users", Url.tool_page "platform-users");
     ("/docs/platform-roadmap", Url.tool_page "platform-roadmap");
     ("/docs/configuring-your-editor", Url.tutorial "set-up-editor");
+    ( "/success-stories/peta-byte-scale-web-crawler",
+      Url.success_story "peta-byte-scale-web-crawling-and-data-processing" );
   ]
 
 let make ?(permanent = false) t =
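
The hunk above adds one (old path, new path) pair to the `from_v2` list. As a rough sketch of how such an association list can be consulted, the following uses the standard library's `List.assoc_opt`. The `Url` module here is a hypothetical stand-in (the real `Url.success_story` may build paths differently), and `resolve` is an illustrative helper, not the repository's `make` function, whose body is not shown in this diff.

```ocaml
(* Hypothetical stand-in for the site's Url module; the path layout is an
   assumption made for this sketch. *)
module Url = struct
  let success_story slug = "/success-stories/" ^ slug
end

(* A one-entry redirect table mirroring the pair added in the diff. *)
let redirects =
  [
    ( "/success-stories/peta-byte-scale-web-crawler",
      Url.success_story "peta-byte-scale-web-crawling-and-data-processing" );
  ]

(* Illustrative lookup: Some target when the old path is known, None otherwise. *)
let resolve (path : string) : string option = List.assoc_opt path redirects

let () =
  match resolve "/success-stories/peta-byte-scale-web-crawler" with
  | Some target -> print_endline ("redirect to " ^ target)
  | None -> print_endline "no redirect"
```

An association list keeps the table easy to read and extend in a diff like this one; for a large table, a `Hashtbl` or `Map` would give faster lookups.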
