Physical properties are an essential part of the optimization process that allows you to explore more alternative plans.
Apache Calcite comes with convention and collation (sort order) properties. Many query engines require custom properties. For example, distributed and heterogeneous engines that we often see in our daily practice need to carefully plan the movement of data between machines and devices, which requires a custom property to describe data location.
In this blog post, we will explore how to define, register and enforce a custom property, also known as trait, with Apache Calcite cost-based optimizer.
We start our journey by looking at the example of common physical property - sort order.
Query optimizers work with relational operators, such as `Scan`, `Project`, `Filter`, and `Join`. During the optimization, an operator may require it's input to satisfy a specific condition. To check whether the condition is satisfied, operators may expose physical properties - plain values associated with an operator. Operators may compare the desired and actual properties of their inputs and enforce the desired property by injecting a special enforcer operator on top of the input.
Consider the join operator `t1 JOIN t2 ON t1.a = t2.b`. We could use a merge join if both inputs are sorted on their join attributes, `t1.a` and `t2.b`, respectively. We may define the collation property for every operator, describing the sort order of produced rows:
The merge join operator may enforce the sorting on `t1.a` and `t2.b` on its inputs. Since the first input is already sorted on `t1.a`, it remains unchanged. The second input is not sorted, so the enforcer `Sort` operator is injected, making a merge join possible:
In Apache Calcite, properties are defined by the `RelTrait` and `RelTraitDef` classes. `RelTrait` is a concrete value of the property. `RelTraitDef` is a property definition, which describes the property name, expected Java class of the property, the default value of the property, and how to enforce the property. Property definitions are registered in the planner via the `RelOptPlanner.addRelTraitDef` method. The planner will ensure that every operator has a specific value for every registered property definition, whether the default or not.
All properties of a node are organized in an immutable data structure `RelTraitSet`. This class has convenient methods to add and update properties with copying semantics. You may access the properties of a concrete operator using the `RelOptNode.getTraitSet` method.
To enforce a specific property on the operator during planning, you should do the following from within the rule:
Finally, before invoking the planner program, you may define the desired properties of the root operator of the optimized relational tree. After the optimization, the planner will either return the operator that satisfies these properties or throw an exception.
Internally, the Apache Calcite enforces properties by adding a special `AbstractConverter` operator with the desired traits on top of the target operator.
To transform the `AbstractConverter` into a real enforcer node, such as `Sort`, you should add the built-in `ExpandConversionRule` rule to your optimization program. This rule will attempt to expand the `AbstractConverter` into a sequence of enforcers to satisfy the desired traits consulting to the trait definitions that we already discussed. We have only one unsatisfied property in our example, so the converter expands into a single `Sort` operator.
You may use your custom expansion rule if needed. See Apache Flink custom rule as an example.
As we understand the purpose of properties and which Apache Calcite API to use, we will define, register, and enforce our custom property.
Consider that we have a distributed database, where every relational operator might be distributed between nodes in one of two ways:
In our example, we would like to ensure that the top operator always has a `SINGLETON` distribution, simulating the results' delivery to a single node.
First, we define the enforcer operator. To ensure the `SINGLETON` distribution, we need to move from all nodes to a single node. In distributed databases, data movement operators are often called `Exchange`. The minimal requirement for a custom operator in Apache Calcite is to define the constructor and the `copy` method.
Next, we define our custom trait and trait definition. Our implementation must adhere to the following rules:
Below is the source code of our trait. We define two concrete values, `PARTITIONED` and `SINGLETON`. We also define the special value `ANY`, which we use as the default. We say that both `PARTITIONED` and `SINGLETON` satisfy `ANY` but `PARTITIONED` and `SINGLETON` do not satisfy each other.
Our trait definition defines the `convert` function, which injects the `ExchangeRel` enforcer if the current property doesn't satisfy the target one.
You would likely have more distribution types, dedicated distribution columns, and different exchange types in production implementations. You may refer to Apache Flink as an example of a real distribution trait.
Let's see the new trait in action. The complete source code is available here.
First, we create a schema with a couple of tables - one with `PARTITIONED` distribution and another with `SINGLETON` distribution. We use custom table and schema implementation, similar to the ones we used in the previous blog post.
Then we create a planner instance and register our custom trait definition in it.
Finally, we create a table scan operator for each of our tables and enforce the `SINGLETON` distribution. Notice that we use the aforementioned `ExpandConversionRule` in our optimization program. Otherwise, the enforcement will not work.
Now we run the TraitTest from the sample project to see this in action. For the `PARTITIONED` table, the planner has added the `ExchangeRel` to enforce the `SINGLETON` distribution.
But the table with the `SINGLETON` distribution remains unchanged because it already has the desired distribution.
Congratulations! Our custom property is ready.
Physical properties are an important concept in query optimization that allows you to explore more alternative plans.
In this blog post, we demonstrated how to define the custom physical property in Apache Calcite. We created a custom `RelTraitDef` and `RelTrait` classes, registered them in the planner, and used the custom operator to enforce the desired value of the property.
However, we omitted one crucial question - how to propagate properties between operators? It turns out, Apache Calcite cannot do this well, and you will have to make a tough decision choosing between several non-ideal solutions. We will discuss property propagation in detail in future posts. Stay tuned!
We are always ready to help you with your SQL query optimizer design. Just let us know.