Sharing interesting and noteworthy Data Engineering content - namely blogs, podcasts, repos, books, videos, and MOOCs. This was mostly curated by and for Fellows in the Insight Data Engineering Fellows Program, and inspired by the repo of one of our Fellows, Igor Barinov.
If you have ideas or other interesting resources, feel free to open an Issue or Pull Request.
All technologies are listed alphabetically in their given section.
- Excellent summary of the history of Hadoop by Marko Bonaći. This post is also read as a podcast by Software Engineering Daily.
- Jepsen - Kyle Kingsbury's (Aphyr) guide on distributed systems and databases, and how they fail.
- Nice post about using `CLUSTERING ORDER BY` in Cassandra
- Post by DataStax about the basics of data modeling in Cassandra
- Excellent summary of the history of Hadoop by Marko Bonaći. This post is also read as a podcast by Software Engineering Daily.
- Strata Talk by Kostas Tzoumas on Flink Streaming's capabilities.
- Streaming Benchmark talk by Jamie Grier on extending Yahoo's Benchmark, based off this blog
- Asynchronous Snapshots Blog by Data Artisans, and a summary in The Morning Paper
- MillWheel Paper, which discusses Low Watermarks for Exactly-Once Semantics
- Asynchronous Snapshots Barrier Paper describing Flink's snapshot algorithm
- Chandy-Lamport Paper on Distributed Snapshots, and a summary in The Morning Paper
- Part 1 of a series of 3 blogs on how Datadog monitors Kafka. Part 1 is an especially good intro to Kafka's architecture.
- [Video](https://www.youtube.com/watch?v=aJuo_bLSW6s&feature=youtu.be) by Jay Kreps on logs, stream processing, and Kafka
- Interview with Maxime Beauchemin on Airflow, Airpal, and Caravel on Software Engineering Daily.
- List of 100 Seminal Data Engineering Papers from Anil Madan
- General Notes from Kyle Kingsbury (Aphyr) on Distributed Systems
- Visualization of Paxos with explanation
- [Paper](https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf) on using z-values for implementing approximate k-nearest neighbors in a MapReduce framework. There is also a Background paper on the topic, describing the non-distributed version.
- [Paper](http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf) on sPCA -- Scalable Principal Component Analysis
- Chandy-Lamport Paper on Distributed Snapshots, and a summary in The Morning Paper
- Blog on nuances of the CAP theorem by Nicolas Liochon
- Repo of awesome computer science courses.
- Excellent post from TripleByte on preparing for interviews, both technically and strategically
- Cracking the Coding Interview, with solutions in many languages here
- Jennifer Widom's self-paced MOOC from first principles, based off her Stanford course.
- Repo of many system design studies, resources, and strategies.
- Excellent Review of Fair Scheduling in Linux from The Morning Paper.
- Blog from the Localytics engineering team on the impact of saving CPU cycles when processing billions of records, and the effects of CPU tuning.