Apache Calcite is a dynamic data management framework with a SQL parser, an optimizer, an executor, and a JDBC driver.
Many examples of Apache Calcite usage demonstrate the end-to-end execution of queries using the JDBC driver, some built-in optimization rules, and the `Enumerable` executor. Our customers often have their own execution engines and JDBC drivers. So how do you use Apache Calcite for query optimization only, without its JDBC driver and `Enumerable` executor?
In this tutorial, we create a simple query optimizer using internal Apache Calcite classes.
First, we need to define the schema. We start with a custom table implementation. To create a table, you should extend Apache Calcite's `AbstractTable`. We pass two pieces of information to our table:

- the column names and types, which we use to construct the table's row type;
- a `Statistic` object that provides table statistics to the query planner.
Our statistic class exposes only row count information.
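A minimal sketch of such a class might look like the following (the `SimpleTableStatistic` name is ours, and we assume a Calcite version where the remaining `Statistic` methods have default implementations):

```java
import org.apache.calcite.schema.Statistic;

/** Table statistic that exposes only the row count to the planner. */
public class SimpleTableStatistic implements Statistic {

    private final long rowCount;

    public SimpleTableStatistic(long rowCount) {
        this.rowCount = rowCount;
    }

    @Override
    public Double getRowCount() {
        return (double) rowCount;
    }
}
```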
We pass column names and types to our table class to construct the row type, which Apache Calcite uses to derive data types of expressions.
Our table also implements Apache Calcite's `ScannableTable` interface. We do this only for demonstration purposes because we will use a certain `Enumerable` optimization rule in our example that will fail without this interface. You do not need to implement this interface if you are not going to use the Apache Calcite `Enumerable` execution backend.
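Putting this together, a table sketch could look like the following (`SimpleTable` is just our name for it; the `scan` method is a stub because we only optimize queries and never execute them):

```java
import java.util.List;

import org.apache.calcite.DataContext;
import org.apache.calcite.linq4j.Enumerable;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.ScannableTable;
import org.apache.calcite.schema.Statistic;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;

/** Table with a fixed row type and statistics. */
public class SimpleTable extends AbstractTable implements ScannableTable {

    private final List<String> fieldNames;
    private final List<SqlTypeName> fieldTypes;
    private final Statistic statistic;

    public SimpleTable(List<String> fieldNames, List<SqlTypeName> fieldTypes, Statistic statistic) {
        this.fieldNames = fieldNames;
        this.fieldTypes = fieldTypes;
        this.statistic = statistic;
    }

    @Override
    public RelDataType getRowType(RelDataTypeFactory typeFactory) {
        // Construct the row type from the column names and types.
        RelDataTypeFactory.Builder builder = typeFactory.builder();
        for (int i = 0; i < fieldNames.size(); i++) {
            builder.add(fieldNames.get(i), fieldTypes.get(i));
        }
        return builder.build();
    }

    @Override
    public Statistic getStatistic() {
        return statistic;
    }

    @Override
    public Enumerable<Object[]> scan(DataContext root) {
        // Required by ScannableTable; never called because we only optimize, not execute.
        throw new UnsupportedOperationException("Execution is not supported");
    }
}
```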
Finally, we extend Apache Calcite's `AbstractSchema` class to define our own schema. We pass a map from a table name to a table. Apache Calcite uses this map to resolve tables during semantic validation.
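A possible sketch of such a schema (`SimpleSchema` is our own name):

```java
import java.util.Map;

import org.apache.calcite.schema.Table;
import org.apache.calcite.schema.impl.AbstractSchema;

/** Schema that resolves table names through a simple map. */
public class SimpleSchema extends AbstractSchema {

    private final Map<String, Table> tableMap;

    public SimpleSchema(Map<String, Table> tableMap) {
        this.tableMap = tableMap;
    }

    @Override
    protected Map<String, Table> getTableMap() {
        return tableMap;
    }
}
```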
We are ready to start the optimization.
The optimization process consists of the following phases:

- configuration of the common optimizer infrastructure;
- syntax analysis: parsing the query string into an abstract syntax tree (AST);
- semantic analysis: validating the AST;
- conversion of the AST into a tree of relational operators;
- optimization of the relational tree.
Many Apache Calcite classes that we use for query optimization require configuration. However, there is no common configuration class in Apache Calcite that could be used by all objects. For this reason, we store the common configuration in a single object and then copy configuration values into other objects when needed.
In this specific example, we instruct Apache Calcite on how to process object identifiers: do not change identifier casing, use case-sensitive name resolution.
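A sketch of this common configuration, assuming the standard Calcite connection properties are sufficient for our needs:

```java
Properties configProperties = new Properties();

// Keep identifiers exactly as written and resolve them case-sensitively.
configProperties.put(CalciteConnectionProperty.CASE_SENSITIVE.camelName(), Boolean.TRUE.toString());
configProperties.put(CalciteConnectionProperty.UNQUOTED_CASING.camelName(), Casing.UNCHANGED.toString());
configProperties.put(CalciteConnectionProperty.QUOTED_CASING.camelName(), Casing.UNCHANGED.toString());

CalciteConnectionConfig config = new CalciteConnectionConfigImpl(configProperties);
```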
First of all, we should parse the query string. The result of parsing is an abstract syntax tree, with every node being a subclass of `SqlNode`.
We pass a part of our common configuration to the parser configuration, then instantiate `SqlParser`, and finally perform the parsing. If you have a custom SQL syntax, you may pass a custom parser factory class to the configuration.
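A parsing sketch, assuming a recent Calcite version where `SqlParser.config()` is available and `sqlString` holds the query text:

```java
// Copy the identifier handling from the common configuration into the parser configuration.
SqlParser.Config parserConfig = SqlParser.config()
    .withCaseSensitive(config.caseSensitive())
    .withUnquotedCasing(config.unquotedCasing())
    .withQuotedCasing(config.quotedCasing())
    .withConformance(config.conformance());

SqlParser parser = SqlParser.create(sqlString, parserConfig);

// The result of parsing is an abstract syntax tree of SqlNode objects.
SqlNode sqlNode = parser.parseStmt();
```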
The goal of semantic analysis is to ensure that the produced abstract syntax tree is valid. Semantic analysis includes the resolution of object and function identifiers, data type inference, and checking the correctness of certain SQL constructs (e.g., the group key in a `GROUP BY` statement).
The validation is performed by the `SqlValidatorImpl` class, one of the most complex classes in Apache Calcite. This class requires several supporting objects. First, we create an instance of `RelDataTypeFactory`, which provides SQL type definitions. We use the built-in type factory, but you may also provide your custom implementation if needed.
Then, we create a `Prepare.CatalogReader` object that provides access to database objects. This is where our previously defined schema comes into play. The catalog reader consumes our common configuration object so that object name resolution behaves consistently with what we used during parsing.
Then, we define a `SqlOperatorTable`, the library of SQL functions and operators. We use the built-in library. You may also provide your implementation with custom functions.
We have created all the required supporting objects. Now we instantiate the built-in `SqlValidatorImpl`. As usual, you may extend it if you need custom validation behavior (such as custom error messages).
Finally, we perform validation. Keep the validator instance because we will need it for AST conversion to a relational tree.
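Putting the supporting objects together, validation might look roughly like this (the `TPCH` schema name is arbitrary, `schema` is an instance of the `SimpleSchema` sketched above, and the exact `SqlValidatorUtil.newValidator` signature differs between Calcite versions):

```java
RelDataTypeFactory typeFactory = new JavaTypeFactoryImpl();

// Register our schema in a root schema so that the catalog reader can resolve table names.
CalciteSchema rootSchema = CalciteSchema.createRootSchema(false, false);
rootSchema.add("TPCH", schema);

Prepare.CatalogReader catalogReader = new CalciteCatalogReader(
    rootSchema,
    Collections.singletonList("TPCH"),
    typeFactory,
    config);

// Library of SQL functions and operators; replace with your own table to add custom functions.
SqlOperatorTable operatorTable = SqlStdOperatorTable.instance();

SqlValidator validator = SqlValidatorUtil.newValidator(
    operatorTable,
    catalogReader,
    typeFactory,
    SqlValidator.Config.DEFAULT);

// Keep the validator instance: we need it later for the AST-to-relational conversion.
SqlNode validatedSqlNode = validator.validate(sqlNode);
```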
AST is not convenient for query optimization because the semantics of its nodes is too complicated. It is much more convenient to perform query optimization on a tree of relational operators, defined by the `RelNode` subclasses, such as `Scan`, `Project`, `Filter`, `Join`, etc. We use `SqlToRelConverter`, another monstrous class of Apache Calcite, to convert the original AST into a relational tree.
Interestingly, to create a converter, we must create an instance of a cost-based planner `VolcanoPlanner` first. This is one of Apache Calcite's abstraction leaks.
To create the `VolcanoPlanner`, we again pass the common configuration and the `RelOptCostFactory` that the planner will use to calculate costs. In a production-grade optimizer, you are likely to define a custom cost factory, because the built-in factories take into account only the cardinality of relations, which is often insufficient for proper cost estimation.
You should also specify which physical operator properties the `VolcanoPlanner` should track. Every property has a descriptor that extends Apache Calcite's `RelTraitDef` class. In our example, we register only the `ConventionTraitDef`, which defines the execution backend for a relational node.
We then create a `RelOptCluster`, a common context object used during conversion and optimization.
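A sketch of the planner and cluster setup:

```java
// Cost-based planner with a built-in cost factory; a real optimizer would likely plug in its own.
VolcanoPlanner planner = new VolcanoPlanner(RelOptCostImpl.FACTORY, Contexts.of(config));

// Track only the convention trait, i.e., the execution backend of every relational node.
planner.addRelTraitDef(ConventionTraitDef.INSTANCE);

// Common context object used during conversion and optimization.
RelOptCluster cluster = RelOptCluster.create(planner, new RexBuilder(typeFactory));
```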
We can create the converter now. Here we set a couple of configuration properties related to subquery unnesting, which is out of the scope of this post.
Once we have the converter, we can create the relational tree.
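A conversion sketch; we pass `null` for the view expander because this example does not expand views:

```java
SqlToRelConverter.Config converterConfig = SqlToRelConverter.config()
    .withTrimUnusedFields(true)
    .withExpand(false); // subquery unnesting behavior, out of scope of this post

SqlToRelConverter converter = new SqlToRelConverter(
    null,                              // no view expansion in this example
    validator,
    catalogReader,
    cluster,
    StandardConvertletTable.INSTANCE,
    converterConfig);

RelRoot relRoot = converter.convertQuery(validatedSqlNode, false, true);
RelNode logicalPlan = relRoot.rel;
```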
During the conversion, Apache Calcite produces a tree of logical relational operators, which are abstract and do not target any specific execution backend. For this reason, logical operators always have the convention trait set to `Convention.NONE`. It is expected that you will convert them into physical operators during the optimization. Physical operators have a specific convention different from `Convention.NONE`.
Optimization is the process of converting one relational tree into another. You may do rule-based optimization with the heuristic or cost-based planners, `HepPlanner` and `VolcanoPlanner` respectively. You may also do any manual rewrite of the tree without rules. Apache Calcite comes with several powerful rewriting tools, such as `RelDecorrelator` and `RelFieldTrimmer`.
Typically, to optimize a relational tree, you will perform multiple optimization passes using rule-based optimizers and manual rewrites. Take a look at the default optimization program used by the Apache Calcite JDBC driver or the multi-phase query optimization in Apache Flink.
In our example, we will use `VolcanoPlanner` to perform cost-based optimization. We already instantiated the `VolcanoPlanner` before. Our inputs are a relational tree to optimize, a set of optimization rules, and traits that the optimized tree's parent node must satisfy.
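A single cost-based pass might look roughly like this, assuming `rules` is a `RuleSet` similar to the one assembled in the example below:

```java
// Wrap the rule set into an optimization program.
Program program = Programs.of(rules);

RelNode optimizedPlan = program.run(
    planner,
    logicalPlan,
    // Required output traits: here we ask for the Enumerable convention.
    logicalPlan.getTraitSet().replace(EnumerableConvention.INSTANCE),
    Collections.emptyList(),   // materializations
    Collections.emptyList());  // lattices
```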
In this example, we will optimize the TPC-H query №6. The full source code is available here. Run the `OptimizerTest` to see it in action.
We define the `Optimizer` class that encapsulates the created configuration, `SqlValidator`, `SqlToRelConverter` and `VolcanoPlanner`.
Next, we create the schema with the `lineitem` table.
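A sketch of the schema setup, listing only the columns the query needs (the row count is an arbitrary illustrative value):

```java
SimpleTable lineitem = new SimpleTable(
    Arrays.asList("l_quantity", "l_extendedprice", "l_discount", "l_shipdate"),
    Arrays.asList(SqlTypeName.DECIMAL, SqlTypeName.DECIMAL, SqlTypeName.DECIMAL, SqlTypeName.DATE),
    new SimpleTableStatistic(60_000L));

SimpleSchema schema = new SimpleSchema(Collections.singletonMap("lineitem", lineitem));
```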
Now we use our optimizer to parse, validate, and convert the query.
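Assuming the `Optimizer` facade exposes thin wrappers around the steps shown above (the method names are ours), the usage might look like this:

```java
SqlNode sqlNode = optimizer.parse(sqlString);            // syntax analysis
SqlNode validatedNode = optimizer.validate(sqlNode);     // semantic analysis
RelNode logicalPlan = optimizer.convert(validatedNode);  // AST -> relational tree
```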
The produced logical tree looks like this.
Finally, we optimize the relational tree and convert it into the `Enumerable` convention. We use logical rules that convert and merge `LogicalProject` and `LogicalFilter` into compound `LogicalCalc`, and physical rules that convert logical nodes into `Enumerable` nodes.
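A possible rule set for this pass, assuming a Calcite version where `CoreRules` and `EnumerableRules` expose these rules (`optimize` is the facade wrapper around the `Program` run shown earlier):

```java
RuleSet rules = RuleSets.ofList(
    // Logical rules: convert Project/Filter into Calc and merge adjacent Calcs.
    CoreRules.PROJECT_TO_CALC,
    CoreRules.FILTER_TO_CALC,
    CoreRules.PROJECT_CALC_MERGE,
    CoreRules.FILTER_CALC_MERGE,
    // Physical rules: convert logical nodes into Enumerable nodes.
    EnumerableRules.ENUMERABLE_TABLE_SCAN_RULE,
    EnumerableRules.ENUMERABLE_CALC_RULE,
    EnumerableRules.ENUMERABLE_AGGREGATE_RULE);

RelNode physicalPlan = optimizer.optimize(
    logicalPlan,
    logicalPlan.getTraitSet().replace(EnumerableConvention.INSTANCE),
    rules);
```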
The produced physical tree looks like this. Notice that all nodes are `Enumerable`, and that `Project` and `Filter` nodes have been replaced with `Calc`.
Apache Calcite is a flexible framework for query optimization. In this blog post, we demonstrated how to optimize SQL queries with Apache Calcite parser, validator, converter, and rule-based optimizer. In future posts, we will dig into individual components of Apache Calcite. Stay tuned!
We are always ready to help you with your SQL query optimizer design. Just let us know.