Learned Systems

Cardinality Estimation Benchmark

Authors: Parimarjan Negi, Ryan Marcus, Andreas Kipf

In this blog post, we want to go over the motivations and applications of the Cardinality Estimation Benchmark (CEB), which was a part of the VLDB 2021 Flow-Loss paper.

There has been a lot of interest in using ML for cardinality estimation. The motivating application is often query optimization: when searching for the best execution plan, a query optimizer needs to estimate intermediate result sizes. In the most simplified setting, a better query plan may need to process smaller sized intermediate results, thereby utilizing fewer resources, and executing faster.

Several approaches have shown that one can consistently outperform DBMS estimators, often by orders of magnitude in terms of average estimation accuracy. However, improving estimation accuracy may not necessarily improve an optimizer’s final query plan, as highlighted in the following simple example1.

Plan Cost Intuition

In order to build large workloads to evaluate the impact of cardinality estimators, we introduce a novel programmatic templating scheme. We use it to generate over 15K challenging queries on two databases (IMDb and StackExchange). More crucially, we provide a clean API to evaluate cardinality estimations on all of a query’s subplans with respect to their impact on query plans.


Consider query 1a66 from CEB-IMDb, shown below. A cardinality estimator provides the size estimate for each of its subplans (every connected subgraph of the join graph is a potential subplan we consider). CEB contains true cardinalities for all these subplans.

CEB query example

The subplan cardinality estimates can be fed into PostgreSQL, which gives us the best plan for these estimates. The cost of this plan using true cardinalities is the Postgres Plan Cost2. Below, we visualize these plans when using the PostgreSQL cardinality estimates (left), which is almost 7x worse than using the true cardinalities (right).

CEB query plan example

Each node is a scan or join operator. The colorbar for the nodes goes from green to red, i.e., cheap operations to expensive operations. Thus, looking for the red nodes immediately shows us why the estimates messed up: PostgreSQL underestimated cardinalities of two key nodes (highlighted in the left figure), and thus, using a nested-loop join was a bad choice — since the true cardinalities of these nodes were large, and would therefore require a lot more processing. This is a common pattern of PostgreSQL cardinality underestimates resulting in bad plans. That being said, we also see examples in CEB where PostgreSQL gets a worse plan due to overestimates, or other subtler motifs.

You can test your own cardinality estimator using our Python framework in the CEB GitHub repository. You just need to provide the estimates for all the subplans, and it should evaluate those estimates based on Q-Error, or different Plan Costs (e.g., using the PostgreSQL backend, or a simpler cost model).

We provide a simple featurization scheme for queries, and data loaders for PyTorch, which we use to train several known supervised learning models in the CEB repo. One of the key results we find is that when the query workload matches the training workload, learned models tend to do very well in terms of query plan performance, but when the workload changes somewhat, the learned models can be surprisingly brittle:

CEB key result

The figure on the left has the training / test set queries split equally on each template (only test set results are shown). We see even the mean of almost 6K queries shows a large gap between PostgreSQL estimates and learned models — this translates to several hours faster total runtime of the workload. The figure on the right splits training / test queries such that queries from half the templates are in the training set, and the rest are in the test set. Since we have only a few templates, the performance can be very sensitive to the particular seed used to do the split, therefore, we show the results across ten such splits (seeds = 1-10). Observe that there are some extreme splits where the learned model performance can degrade dramatically.

These trends are also reflected when we use Postgres Plan Cost as the evaluation metric:

CEB key result2

Computing runtimes is significantly expensive — and this experiment shows that we can often rely on trusting the Plan Cost evaluation metric instead, although the exact relationship between the runtimes and the plan costs should be an open research problem.

Why Is This Benchmark Needed?

The Join Order Benchmark (JOB) did a great job of highlighting why TPC-style synthetic benchmarks may not be enough for evaluating query optimizers, in particular, the impact of cardinality estimation. In the CEB repo, we also provide cardinality data for JOB, and other derived workloads, such as JOB-M, or JOB-light, so they can be as easily evaluated with the tools described so far. However, even though JOB illustrates query optimization challenges, it contains too few queries for a cardinality estimation benchmark suited to the deep learning-style models often used today. The table below shows key properties of CEB compared to JOB.3

There are several reasons why a larger benchmark is useful. Two critical motivations for us were:

Next Steps

We would love to get your contributions to further developing CEB, or exploring research questions using it. Here are a few ideas:

We envision CEB to serve as a foundation for benchmarking (learned) cardinality estimators. We hope to add additional database backends besides PostgreSQL, and several other databases and query workloads, in order to build a much more challenging set of milestones for building robust and reliable ML models for cardinality estimation in a query optimizer. Please contribute!


  1. Intuitively, this is because to get the best plan, you only need the cost of the best plan to be the cheapest. So for instance, large estimation errors on subplans that are not great, would not affect it. There are more such scenarios in the Flow-Loss paper

  2. Postgres Plan Cost (PPC) is based on the abstract Plan Cost defined in the excellent ten year-old paper, Preventing Bad Plans by Bounding the Impact of Cardinality Estimation Errors by Moerkotte et al. They also introduced Q-Error in the paper, which has been commonly used as the evaluation metric of choice in recent cardinality estimation papers. PPC is a useful proxy for query execution latencies in PostgreSQL, based on its cost model, but it is not DBMS specific. For instance, we have the basic ingredients for a hacky MySQL implementation here. PPC is useful because executing queries can be very resource intensive, noisy, and so on. Meanwhile, PPC can be computed almost as easily as Q-Error, and it is more closely aligned with the goals of query optimization. And these don’t always agree. For instance, we have seen scenarios where an estimator has lower average Q-Error, but higher Postgres Plan Cost. We show its correlation with runtimes, and further discuss the use of the Plan Costs in the Flow-Loss paper

  3. A new benchmark was released last week, which has similar motivations as CEB. We have not yet had the time to look at and compare it with our benchmark.