Composable Data Systems: Lessons from Apache Calcite Success

Written by

Vladimir Ozerov

April 1, 2024

Abstract

"Composable data systems are eating the world!" "Database engines are becoming a commodity!" You may have heard claims like these recently, thanks to the active adoption of projects such as Apache Arrow, Velox, Substrait, and others. I wholeheartedly support this trend. In fact, our company, Querify Labs, exists solely to adopt these technologies across different database vendors. However, I hardly see this new stack as a revolution that, in some magical way, opens a plethora of shiny new perspectives for the database market. Instead, this trend started many decades ago, gradually gaining adoption through a series of painful and costly development efforts. And while a "composable" stack may help companies bootstrap their new projects faster, it barely solves any of the fundamental problems vendors face when competing with mature and well-funded rivals.

In this blog post, I would like to share our experience with Apache Calcite — a powerful composable toolset for building query optimizers. Apache Calcite achieved tremendous success, powering query optimization in many popular systems, such as Apache Hive and Apache Flink. But even though such a great library has existed for more than ten years, query optimization development is still remarkably complicated and hardly "commoditized." Why is it so? We will discuss which exact technical decisions contributed to Apache Calcite's success, what role the community plays in such projects, why it is still so difficult to integrate "composable" libraries into real products, and why I personally do not believe that the "composable data systems" trend will fundamentally change the competition dynamics in the market, as predicted by influential engineers and researchers.

First, we give a definition of composable systems. Then, we briefly discuss the technology aspects of composable optimizers, and of Apache Calcite in particular. Finally, we discuss the lessons learned from Apache Calcite usage across multiple projects, analyzing which exact problems Apache Calcite can and cannot solve, how proper design, APIs, and community culture contributed to the product's success, and how all these insights can help us build better composable data management frameworks.

What Are Composable Data Systems?

Composable data systems are technologies that help us build data processing applications from reusable components rather than creating everything from scratch. These systems comprise a broad range of technologies, from highly specialized libraries to complete products, including but not limited to data formats (Apache Parquet, Apache Arrow) and data processing (Apache Arrow DataFusion, Velox, Substrait, DuckDB, Polars).

This trend recently gained momentum thanks to the active growth of the Apache Arrow ecosystem and DuckDB. However, since the technology stack is diverse, technologies are gaining adoption at varying velocities. While reusable data formats like Apache Parquet already dominate their categories, composable data processing is still a largely unsolved problem. Throughout this post, we will look at composable data processing through the lens of Apache Calcite, a major framework that allows you to quickly build powerful query optimizers from independent components.

Technology

Background

Query optimization is a complicated task because, to find a good plan, one must solve several computationally hard problems. Examples include join order planning, cardinality estimation, interesting-property planning (ordering, distribution), materialized view selection, etc.

Not surprisingly, the community started exploring ways to keep optimizer complexity under control and establish proper boundaries between its intrinsic components. Some of the most influential papers were published in the 1990s by Goetz Graefe et al. as part of the EXODUS/Volcano/Cascades research projects, proposing composable evaluation of relational operators (Volcano: An Extensible and Parallel Query Evaluation System) and an extensible query optimizer that could be assembled from pluggable optimization rules (The Volcano Optimizer Generator: Extensibility and Efficient Search).

Graefe's ideas gained rapid adoption in the field in the form of extensible rule-based optimizers with a relational IR and Volcano-style (aka "pull-based") query evaluation. The database community gradually came to understand that a robust optimizer can be reasonably split into isolated components with more or less clean interfaces, such as syntax and semantic analyzers, the type system, relational operators, the function registry, metadata management, statistics management, optimization drivers, and rules.

For many years, the implementation details and design considerations of query optimization remained largely opaque, hidden behind complex papers and the office walls of commercial vendors. But the proliferation of open-source software allowed more people to analyze the practical aspects of production-grade query optimizers, ultimately crystallizing in several open-source projects, of which Apache Calcite is arguably the most successful.

Apache Calcite

Apache Calcite provides a set of components to build query optimizers. We now discuss where the "composability" comes from.

When a query string is submitted, Apache Calcite parses it into an abstract syntax tree (AST). The JavaCC parser generator and the FreeMarker template engine are used under the hood. Users may extend the SQL syntax by providing their own JavaCC/FreeMarker files and re-generating the parser. The new parser can be seamlessly injected into the optimization process.

Then, semantic analysis is performed on the AST: it resolves database objects from the schema registry, resolves function definitions from the function registry, and deduces data types using the type system. The schema registry, function registry, and type system are customizable and can be either extended or entirely replaced by the user. The semantic analyzer produces an AST enriched with the information derived during analysis.

Fig. 1: Validated AST

With all the context information in place, the analyzed AST is then converted into a relational tree by a special component called the "sql-to-rel converter". The result is a tree of relational operators that we will optimize further on.
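To make the conversion concrete, here is a deliberately simplified, self-contained sketch of what a sql-to-rel converter does, assuming a toy AST for a single-table query. All names here (MiniSelect, Scan, Filter, Project) are hypothetical stand-ins; in Apache Calcite the real work is done by SqlToRelConverter, which turns SqlNode trees into RelNode trees.

```java
// A toy sketch of AST-to-relational-tree conversion. NOT Calcite code:
// all classes below are hypothetical, simplified illustrations.
public class SqlToRelSketch {
    // Minimal "AST": SELECT <column> FROM <table> WHERE <column> > <literal>
    record MiniSelect(String column, String table, int greaterThan) {}

    // Minimal relational operators, analogous to Calcite's RelNode subclasses.
    interface Rel { String explain(); }
    record Scan(String table) implements Rel {
        public String explain() { return "Scan(" + table + ")"; }
    }
    record Filter(Rel input, String predicate) implements Rel {
        public String explain() { return "Filter(" + predicate + ", " + input.explain() + ")"; }
    }
    record Project(Rel input, String column) implements Rel {
        public String explain() { return "Project(" + column + ", " + input.explain() + ")"; }
    }

    // The "converter": walks the AST and emits a bottom-up operator tree.
    public static Rel convert(MiniSelect ast) {
        Rel rel = new Scan(ast.table());
        rel = new Filter(rel, ast.column() + " > " + ast.greaterThan());
        return new Project(rel, ast.column());
    }

    public static void main(String[] args) {
        Rel plan = convert(new MiniSelect("a", "t", 10));
        System.out.println(plan.explain()); // Project(a, Filter(a > 10, Scan(t)))
    }
}
```

The point of the sketch is the shape of the output: whatever the surface syntax looks like, the converter always produces a tree of relational operators that the later optimization phases understand.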

Fig. 2: AST to relational tree conversion

Then, the query optimizer kicks in to find the optimal plan. Contrary to what one may expect, query optimization is not a single step. Instead, it is a series of phases during which the non-optimized logical tree is gradually converted into an optimized physical tree. Several core interfaces support the optimization process:

  • RelDataType — defines a data type.
  • RelNode — defines a relational operator, such as a Scan, Filter, Join, etc.
  • RexNode — defines a scalar expression within an operator, e.g., the concrete predicate inside a Filter.
  • RelVisitor — a classical visitor pattern that allows you to traverse the relational tree.
  • RelOptPlanner — a rule-based driver abstraction that accepts a plan and a set of optimization rules and produces an optimized plan. There are two main planners: HepPlanner, a simple and fast iterative planner, and VolcanoPlanner, a cost-based planner that uses memoization and property propagation (including a top-down, Cascades-style mode) to find the best plan.
  • RelOptRule — a rule abstraction for rule-based optimization that provides a pattern and a transformation. RelOptPlanner uses rules to find equivalent plans. Apache Calcite has a rich library of rules for operator simplification, operator reordering, Join planning, etc.
  • RelOptCost — provides an operator cost abstraction.
  • RelTrait — represents a property of a node. A property is a value that describes the relational operator's output in some way but doesn't change the output itself. Examples are RelCollation, which describes how the operator's output is sorted, and RelDistribution, which describes how the operator's data is distributed across the engine's execution units. Operator properties can be changed (aka "enforced") by placing another operator on top of the node. For example, to change the sortedness of a Scan's output, you can add a Sort operator.
  • Metadata — an interface to calculate and propagate metadata across nodes. Metadata could be any value that nodes might want to request from each other, such as cost, cardinality, selectivity, constraints, etc.
  • Program — a generic interface accepting one plan (RelNode) and producing another. Production-grade optimizers wrap multiple visitors and rule-based planner invocations into Programs to conveniently chain optimization phases one after the other.
Fig. 3: Apache Calcite extension points

Looks pretty involved, huh? True, and this is exactly where the Apache Calcite superpower comes from. Each and every one of these components can be extended in virtually any way:

  • Need to support JSON? Add a new RelDataType.
  • Want to add a special type of Join? Feel free to create a new type of RelNode.
  • Miss some important optimization rule? Implement the RelOptRule interface and inject the new rule into the RelOptPlanner of your choice.
  • Want to estimate cardinalities using HyperLogLog sketches from your engine? No worries, define a new metadata type and align it with your custom cost function.

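As an illustration of the rule-driver idea behind RelOptRule and RelOptPlanner, here is a toy, HepPlanner-flavored sketch: a rule matches a pattern and returns a rewritten node, and the driver applies rules until a fixpoint. Everything here (MiniHepPlanner, Rule, REMOVE_TRIVIAL_FILTER) is hypothetical and vastly simplified; the real Calcite interfaces also manage rule queues, a graph-based plan representation, and much more.

```java
import java.util.List;

// A toy, HepPlanner-style rule driver. NOT Calcite code: all names are
// hypothetical, simplified stand-ins for RelOptRule / HepPlanner.
public class MiniHepPlanner {
    interface Rel {}
    record Scan(String table) implements Rel {}
    record Filter(Rel input, String predicate) implements Rel {}

    // Analogous to RelOptRule: match a pattern, return a transformed node
    // (or null if the rule does not apply).
    interface Rule { Rel apply(Rel node); }

    // Example rule: a Filter whose predicate is literally "true" is a no-op.
    static final Rule REMOVE_TRIVIAL_FILTER = node ->
        (node instanceof Filter f && f.predicate().equals("true")) ? f.input() : null;

    // The "planner": repeatedly applies rules at each node until a fixpoint.
    public static Rel optimize(Rel node, List<Rule> rules) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Rule rule : rules) {
                Rel transformed = rule.apply(node);
                if (transformed != null) { node = transformed; changed = true; }
            }
        }
        // Recurse into children so rules fire anywhere in the tree.
        if (node instanceof Filter f) {
            return new Filter(optimize(f.input(), rules), f.predicate());
        }
        return node;
    }

    public static void main(String[] args) {
        Rel plan = new Filter(new Filter(new Scan("t"), "a > 10"), "true");
        Rel optimized = optimize(plan, List.of(REMOVE_TRIVIAL_FILTER));
        System.out.println(optimized); // Filter[input=Scan[table=t], predicate=a > 10]
    }
}
```

The key design point survives the simplification: the driver knows nothing about any particular rule, so injecting a custom rule into the pipeline is as simple as adding it to the list.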
Once optimization is finished and you have the optimized plan, you are expected to feed it to your engine somehow.

Additionally, Apache Calcite provides two more interesting components:

  • Avatica — a framework for building JDBC and ODBC drivers. It includes the protocol and a reference JDBC implementation. It competes with Arrow Flight SQL and ADBC.
  • Enumerable — a compiled query execution engine that can execute plans (i.e., RelNode). Competes with Velox and DataFusion.

Neither subproject ever gained significant traction; we will briefly discuss why this happened later.

The great flexibility of Apache Calcite allowed it to become one of the most important and widely used query optimization frameworks. It is used in big data (Apache Hive), streaming (Apache Flink), in-memory data grids (Apache Ignite), hardware-accelerated databases, and many other startups and engines. However, even though this is definitely a success story for the composable data systems community, it also has some serious problems. And we, engineers, should learn from both the good and the bad to make our future products better. Let's jump to the lessons learned!

Lessons

Lesson 1: API is the whole code base

When building products and libraries, we often separate user-facing interfaces from internal implementation details. However, as the complexity grows, the boundaries between the two become blurred. The end user may want to change virtually any aspect of the product.

Consider join order planning. You may want to pick a specific Join enumeration algorithm (e.g., exhaustive vs. heuristic), choose a specific infrastructure (e.g., dynamic programming vs. memoization), provide your own cardinality estimator, provide custom join types (e.g., ASOF JOIN), etc. These numerous design decisions are tightly coupled to your engine's capabilities, design, and product strategy. Therefore, you may want to modify virtually any building block. Since the quality of query optimization tremendously affects the future of your product, you can trust only query optimizer toolsets with an extreme level of flexibility.
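To show how entangled these design decisions are, here is a minimal, Selinger-style dynamic-programming join enumerator over subset bitmasks. It is a sketch, not Calcite code: the cost model (sum of intermediate cardinalities) and the single global join selectivity are deliberately naive assumptions, and swapping either one changes which join order wins.

```java
import java.util.Arrays;

// A toy Selinger-style join order enumerator: dynamic programming over
// bitmask-encoded subsets of relations. Hypothetical, simplified sketch.
public class JoinOrderDp {
    // Toy cardinality model: |A JOIN B| = |A| * |B| * selectivity,
    // with one global selectivity for every join (a naive assumption).
    public static double bestCost(double[] card, double selectivity) {
        int n = card.length;
        double[] size = new double[1 << n]; // estimated cardinality per subset
        double[] cost = new double[1 << n]; // cheapest cumulative cost per subset
        Arrays.fill(cost, Double.POSITIVE_INFINITY);
        for (int i = 0; i < n; i++) {
            size[1 << i] = card[i];
            cost[1 << i] = 0; // scanning a base table is "free" in this model
        }
        for (int s = 1; s < (1 << n); s++) {
            if (Integer.bitCount(s) < 2) continue;
            // Try every split of subset s into two non-empty halves.
            for (int left = (s - 1) & s; left > 0; left = (left - 1) & s) {
                int right = s ^ left;
                if (cost[left] == Double.POSITIVE_INFINITY
                        || cost[right] == Double.POSITIVE_INFINITY) continue;
                double joined = size[left] * size[right] * selectivity;
                double c = cost[left] + cost[right] + joined;
                if (c < cost[s]) { cost[s] = c; size[s] = joined; }
            }
        }
        return cost[(1 << n) - 1];
    }

    public static void main(String[] args) {
        // Three tables of sizes 1000, 10, 100; every join keeps 1% of the cross product.
        // Joining the two small tables first is cheapest.
        System.out.println(bestCost(new double[]{1000, 10, 100}, 0.01));
    }
}
```

Notice that the enumeration strategy, the cost function, and the cardinality model all meet in a handful of lines; in a real optimizer each of them is a component you may want to replace independently, which is exactly why a toolset must expose them all.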

To make things worse, each and every "composable" library or product will always have bugs and design flaws. Decomposition is hard, and you never make things ideal. Waiting for new releases or submitting patches is not always an option because you do not have any control over when the patched version will be delivered. For this reason, you want not only customizable components but also the ability to patch the code without forks or dependency on upstream updates.

Apache Calcite is extremely successful in these aspects. First, there is no notion of an API at all: the whole product is your API. Second, you can work around almost any bug without forking the product by resorting to some unusual hacks, such as method overrides or copy-pasting pieces of the Apache Calcite sources.
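A sketch of the patching-by-override pattern mentioned above, with purely hypothetical class names: a "library" class ships with a bug, and the downstream project fixes it by subclassing instead of forking or waiting for an upstream release.

```java
// Hypothetical illustration of patching a library bug via method override.
public class OverridePatch {
    // Imagine this class ships inside the library and contains a bug:
    // it forgets to trim whitespace before comparing identifiers.
    static class LibraryNameMatcher {
        public boolean matches(String a, String b) {
            return a.equals(b); // the "bug": no normalization
        }
    }

    // Downstream fix: subclass and override. No fork, no upstream wait.
    static class PatchedNameMatcher extends LibraryNameMatcher {
        @Override
        public boolean matches(String a, String b) {
            return super.matches(a.trim(), b.trim());
        }
    }

    public static void main(String[] args) {
        System.out.println(new LibraryNameMatcher().matches(" t ", "t")); // false
        System.out.println(new PatchedNameMatcher().matches(" t ", "t")); // true
    }
}
```

This only works when the library leaves its classes open and its methods overridable, which is precisely the design stance Apache Calcite takes.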
Note that composable query engines, such as Velox or DataFusion, face exactly the same user demands.

Fig. 4: (1) traditional library; (2) Apache Calcite

The lesson: when building composable query optimizers or query engines, assume that your users will want to change or override each and every piece of your library. Without this in mind, the adoption of your library will be limited.

Lesson 2: Break compatibility to evolve quickly

This observation follows naturally from the previous point about APIs. If your API is the whole code base, each and every change may break a downstream project. This is precisely what currently happens when you upgrade to a newer version of Apache Calcite, Velox, etc. And this is ... ok! The pain you experience during upgrades is what allows library maintainers to evolve quickly, providing better abstractions, more utility methods, etc.

As a maintainer of a query engine or optimizer library, one of your worst decisions would be to establish strict compatibility guarantees early on. Instead, it is better to "cool down" breaking changes gradually as more users arrive and the community gains a better understanding of how exactly the product is used and what the common pain points are. Ultimately, you want to reach a state where version upgrades are somewhat painful but do not prevent your users from upgrading. Establishing this balance requires a thoughtful engineering process and the proper community culture, leadership, and vision.

The lesson: do not be afraid to break things, but do not break too much.

Lesson 3: Community is king

Apache Calcite is not the only composable query optimizer library. Another such library is ORCA — a modular cost-based optimizer. Originally implemented for Greenplum, it was later published as a separate open-source library. It is superior to Apache Calcite in some aspects. For example, it has sophisticated interesting-property propagation mechanics that were missing from Apache Calcite for many years (ironically, in the end, they were added to Apache Calcite by the same team that built ORCA). But have you heard of anybody building query optimizers with ORCA? Most likely not. Its open-source incarnation is dead.

Why did Apache Calcite succeed where ORCA didn't? As usual, there are multiple reasons. However, I would argue that one of ORCA's biggest problems was the lack of community. Without a community, you do not know where to seek help when something goes wrong. If a project is governed by a single commercial vendor, as in the ORCA case, the main motivation is to make money rather than to create a public good. And even if the vendor dedicates a team to support the community, it is still a commercial effort that cannot provide sufficient diversity, which is crucial for the project's stability.

Apache Calcite is unique in this regard. It was very fortunate to have been created and led for many years by Julian Hyde, who relentlessly projected proper vision, leadership, and dedication. Over the years, this attracted more people from different companies with similar mindsets. And while most of them have employers, the community remains vibrant and healthy, pushing the project further. There are no dominating commercial vendors behind Apache Calcite. No single point of failure. Just a group of enthusiastic people. Sometimes they face rough edges, but this is exactly the culture I would expect from a true open-source project. The Apache Arrow and DuckDB communities exhibit similar properties, which is great. Time will tell.

The lesson for users: if you trust a project to be the backbone of some critical infrastructure, such as a query engine or optimizer, ensure you understand its community dynamics.

Lesson 4: Invest in documentation

We love to code but do not like writing docs. Docs are often an annoying afterthought rather than an integral part of the product. If you want to increase the adoption of your library, which ultimately boosts user feedback and drives innovation, you MUST teach your users how to use your product. Since the API surface is enormous, perhaps you do not want to document each and every method, variable, etc., rigorously. Instead, you may want to focus on the design and usage scenarios. This will help users start with some simple prototypes and then gradually deepen their understanding of the product.

I'll be straight — Apache Calcite fails miserably here. New users have a very hard time understanding the product's capabilities. What is worse, the Apache Calcite documentation targets the wrong use cases: the end-to-end engine and JDBC, instead of composable query optimization. As a result, many users abandon Apache Calcite early because they either do not understand what the product does or use it incorrectly. We observed this multiple times in our consulting practice. Unfortunately, we never found the time to contribute docs to the community, and I still feel very bad about it.

The lesson: documentation must be a first-class citizen.

Lesson 5: Keep focus

When building a composable data systems library, one may want to address as many use cases as possible, thinking that this will drive project adoption. In reality, too many features lead to a loss of focus and a large body of code that nobody knows how to maintain.

For example, both Apache Calcite and Apache Arrow DataFusion provide query optimization and query execution capabilities. Apache Calcite has a very advanced query optimizer but weak execution. DataFusion is the opposite: a more advanced execution engine and a basic query optimizer. One of the reasons for this is that, believe it or not, in real products there are no clear boundaries between query optimization and query execution. For example, a good optimizer would like to simplify trivial expressions, which requires an executor. Likewise, the query engine may want to re-optimize the plan during execution as it gathers stats from the real data. However, maintaining a good query engine and a good query optimizer in the same project is twice as hard as maintaining one of them. Does your community have sufficient dedication to deal with this?
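A tiny sketch of the first example above: constant folding is an optimization, yet it requires evaluating (i.e., executing) expressions at planning time. All names here are hypothetical; in Apache Calcite the corresponding machinery lives around RexSimplify and the RexExecutor interface.

```java
// Hypothetical sketch: the optimizer "executes" constant subtrees at plan
// time, showing why optimization and execution are hard to separate.
public class ConstantFolder {
    interface Expr {}
    record Lit(int value) implements Expr {}
    record Add(Expr left, Expr right) implements Expr {}
    record Col(String name) implements Expr {} // value unknown at plan time

    public static Expr fold(Expr e) {
        if (e instanceof Add a) {
            Expr l = fold(a.left()), r = fold(a.right());
            if (l instanceof Lit ll && r instanceof Lit rr) {
                // This is evaluation, not rewriting: a tiny executor
                // embedded inside the optimizer.
                return new Lit(ll.value() + rr.value());
            }
            return new Add(l, r);
        }
        return e; // columns and literals fold to themselves
    }

    public static void main(String[] args) {
        // x + (2 + 3)  →  x + 5: the constant part is executed, the rest kept.
        System.out.println(fold(new Add(new Col("x"), new Add(new Lit(2), new Lit(3)))));
    }
}
```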

Another example is external contributions. Your users may propose solutions they created while integrating your library into their systems. These could be new join order planning algorithms, new concurrency primitives, memory managers, connectors to other data sources, or whatever. The problem is that if such a solution does not gain sufficient traction from other users, it will likely become abandoned: an aging piece of the code base that constantly consumes community time but brings very little value.

In my opinion, which is based on multiple real Apache Calcite integrations, the main advantage of composable query optimizers and query engines is the ability to quickly bootstrap simple, even mediocre, if I may, infrastructure. Not super sophisticated algorithms, not complete solutions, just infrastructure. E.g., an optimizer that produces some reasonable plans, or an engine that outputs some results but doesn't care about memory management and resource queues. I want to ramp them up quickly to start doing customer demos, unblock engineers, etc. Just the 80/20 principle in action. And once the basic infrastructure is in place, you may want to gradually make your product more sophisticated, catching up with your competitors step by step.

I strongly believe that this basic infrastructure, not the complete solutions, is the main value behind composable query optimizers and query engines. Real complex products, such as databases, are not about composability but about the very precise integration of different components - optimizer, executor, resource manager, etc. - that work together as a single mechanism. This intimate coupling (not decoupling!) is what helps products become leaders in their categories.

We can more or less easily agree on data formats, such as Apache Parquet or Apache Arrow. However, data processing is a much more complicated area with lots of moving pieces. This is why I personally think that, even with libraries such as Calcite or Velox, we will not witness the "commoditization" of query optimizers or query engines, or a shift of competition away from performance, in the foreseeable future.

The lesson: focus on building strong basic infrastructure and let your users build custom solutions on top of it.

Summary

Composable data systems are a great trend that started many decades ago and is now actively reaching the mainstream. Apache Calcite, a query optimizer library, is a prominent member of this movement. Apache Calcite's success is based on (1) a vendor-agnostic community with proper values, and (2) a thorough decomposition of query optimization aspects into individual components. Analyzing how Apache Calcite reached its current state provides useful insights into how to build great products that drive innovation in the field.

But while libraries like Apache Calcite, Velox, or DataFusion lower the bar for entering the database market, we should not overestimate their impact. Query processing is a very complicated topic with tons of unsolved problems and a constant influx of new ideas, so we are still pretty far from true "commoditization".

We are always ready to help you with your query engine design. Just let us know.