State-of-the-art web crawler 🔱
Zeno is a web crawler designed to run wide crawls or to simply archive a single web page. Zeno's key concepts are portability, performance, and simplicity, with an emphasis on performance.
It heavily relies on the gowarc module for traffic recording into WARC files.
The name Zeno comes from Zenodotus (Ζηνόδοτος), a Greek grammarian, literary critic, Homeric scholar, and the first librarian of the Library of Alexandria.
- Go 1.25+ - As specified in go.mod
- If CGO_ENABLED=1 (enabled by default):
  - GCC 12+ - Required for building C++ dependencies with C++20 constexpr support for the WHATWG URL parser (github.com/ada-url/goada).
- If CGO_ENABLED=0:
  - No additional requirements, as the CGO-free WebAssembly wrapper of goada (goada-wasm) will be used. (1x slower than CGO version on amd64 and arm64, and 10x or more slower on other CPU architectures! Check https://wazero.io/docs/#compiler for details)
Note: GCC 11 and earlier versions do not support the C++20 constexpr features required by the ada-url/goada dependency. On Ubuntu 22.04 LTS and earlier, you may need to install a newer GCC version or disable CGO.
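The requirements above can be turned into a small pre-install check. The following is a minimal sketch and not part of Zeno: `pick_cgo` is a hypothetical helper, and the script assumes that `gcc -dumpversion` prints the GCC major version first. On systems where `gcc` is an alias for another compiler, the result will be misleading.

```shell
#!/bin/sh
# pick_cgo MAJOR: print 1 if the given GCC major version is 12 or newer,
# otherwise print 0 (also for an empty or non-numeric argument).
# Hypothetical helper for illustration only.
pick_cgo() {
    major="$1"
    if [ -n "$major" ] && [ "$major" -ge 12 ] 2>/dev/null; then
        echo 1
    else
        echo 0
    fi
}

# Extract the major version; empty if gcc is not installed.
gcc_major=$(gcc -dumpversion 2>/dev/null | cut -d. -f1)
CGO_ENABLED=$(pick_cgo "$gcc_major")
echo "Suggested setting: CGO_ENABLED=$CGO_ENABLED"

# To install with the suggested setting, uncomment:
# CGO_ENABLED=$CGO_ENABLED go install github.com/internetarchive/Zeno@latest
```

With GCC 11 or no GCC at all, the sketch suggests CGO_ENABLED=0, which selects the slower goada-wasm path described above.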
go install github.com/internetarchive/Zeno@latest
Alternatively, use our pre-built release binaries, but do note that we are mainly focused on linux/amd64 support at this time.
To archive a single web page:
Zeno get url https://www.france.fr
Zeno is highly configurable, with many parameters that can be customized. To see all available configuration options, use Zeno -h and/or Zeno get -h.
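To archive a handful of pages with the single-page command shown above, one option is a small shell wrapper. This is a sketch under assumptions, not a documented Zeno workflow: `archive_urls`, the `ZENO_BIN` override, and the `urls.txt` file name are all hypothetical, and Zeno may offer its own way to pass several URLs (check Zeno get -h).

```shell
#!/bin/sh
# archive_urls FILE: run `Zeno get url` once per line of FILE,
# skipping blank lines and lines starting with '#'.
# ZENO_BIN is a hypothetical override for the binary name (default: Zeno).
archive_urls() {
    file="$1"
    while IFS= read -r url || [ -n "$url" ]; do
        case "$url" in
            ''|'#'*) continue ;;
        esac
        # Redirect stdin so the command cannot consume the URL list.
        ${ZENO_BIN:-Zeno} get url "$url" < /dev/null \
            || echo "failed: $url" >&2
    done < "$file"
}

# Example (assumes a file named urls.txt with one URL per line):
# archive_urls urls.txt
```

Each URL is handled by a separate Zeno process, which keeps the sketch simple at the cost of startup overhead per page.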
Contributions are welcome! Please feel free to submit a Pull Request & open issues!
Zeno is being developed and maintained by the Internet Archive and awesome contributors. The project has evolved into what it is today thanks to the invaluable contributions from the community. While we can't list everyone, special thanks to:
- Corentin Barreau, former Wayback Machine Software Engineer at the Internet Archive, for his initial work on the project.
- Jake LaFountain, Wayback Machine Software Engineer at the Internet Archive.
- Thomas Foubert, former Wayback Machine Platform Engineer at the Internet Archive.
- yzqzss, Lead Developer of the Save The Web Project.
- Will Howes, Wayback Machine Software Engineer at the Internet Archive.
- Vangelis Banos, Wayback Machine Software Engineer at the Internet Archive.
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file for details.