Skip to content
Ashwin Madavan edited this page Mar 24, 2018 · 11 revisions

Why not SQL?

While SQL has enjoyed remarkable popularity over the past four decades, the language is inherently flawed. In his letter entitled "Some Principles of Good Language Design," CJ Date, one of the original designers of SQL at IBM and a pioneer in relational database theory, outlined a number of important qualities of well-designed languages, including a consistent syntax, a canonical implementation, and a terse grammar, and he systematically showed how SQL fails to satisfy all of these fundamental requirements.

In my personal experience (and hopefully yours), I've found SQL to be far better suited to tasks involving data retrieval and not data manipulation. Therefore, I'm forced to write two programs whenever I use a relational database - one in SQL that interacts with my storage system and another in a language like Java or Python that manipulates the returned data. While SQL is intuitive to use and easy to pick up, it falls well short of providing the functionality required to build anything meaningful by itself.

In a particularly revealing Stack Overflow post, Ken Bloom admits that "there aren't any alternatives to SQL for speaking to relational databases" but suggests a variety of "language specific frontends that hide the SQL in our regular every-day programming languages, and use SQL as the protocol for talking to relational databases." If SQL is just a transport layer, then why use it at all? Why would programmers prefer using language-specific frontends if not because of deficiencies in the SQL language?

How is cas implemented?

A cas is a nontrivial operation in a distributed database. In general, transactions may involve keys that are physically located on different machines, and so these machines must coordinate with each other in order to safely commit the transaction. Databases typically use one of four methods to implement cas.

Given a set of potentially conflicting transactions, correct implementations of cas will find a schedule such that no two conflicting transactions are ever executed simultaneously and optimal implementations will find a schedule with minimal makespan. We show below, by a reduction to graph coloring, that finding an optimal schedule is NP-Hard. Because the optimal schedule problem is intractable, most systems should be concerned with solving the much simpler correct schedule problem. Even then, designing an efficient, fault-tolerant, and decentralized solution is extremely challenging.

Proof: Construct a disjunctive graph G in which vertices represent put operations and undirected edges connect conflicting operations. The schedule of put operations with minimal makespan, can be found by finding an acyclic orientation of G with the minimal longest path. Because all edges of G are initially undirected by construction, finding the acyclic orientation with the minimal longest path is equivalent to finding an optimal graph coloring of G by the Gallai–Hasse–Roy–Vitaver Theorem. Because graph coloring is NP-Complete, finding the schedule of operations with minimal makespan is NP-Hard.

Has this been done before?

Software transactional memory (STM) systems have existed for a while. Caustic differentiates itself by being more programmable, performant, and interoperable.

  • Programmability: Unlike most other STM systems, Caustic features its own programming language. Therefore, it does not suffer from the same syntactic limitations that plague STM systems. Caustic provides a number of features like conditional branching, static types, functions, variables, and objects that are useful primitives for building complex applications. It's a little subjective, but Caustic programs generally tend to be much shorter and more maintainable then those written in comparable systems.
  • Interoperability: STM systems traditionally impose significant requirements on the underlying storage system. In contrast, Caustic requires the underlying key-value store to support just two operations get and cput. Because databases typically have their own incompatible query languages, a choice of database is effectively permanent. Caustic provides a robustness that is uncommon in most STM systems.
  • Performance: Caustic performs a number of optimizations to significantly improve the performance of transaction execution, that are absent in many comparable systems. For example, Caustic batches as many as possible reads and all writes within a transaction. Consider a simple social network application. Whenever a user logs on, the system will need to fetch their information, friends, news feed, etc. With Caustic, all this data will be retrieved simultaneously instead of sequentially.

The most similar competitor to Caustic was SQLite4, which attempted to recover a transactional SQL interface over arbitrary key-value stores. However, the authors of SQLite4 admitted defeat. Development of SQLite4 has been terminated, due in large part to significant performance regressions.

Clone this wiki locally