<h1 id="lsi-a-learned-secondary-index-structure">LSI: A Learned Secondary Index Structure</h1>
<p><em>July 22, 2022</em></p>
<p><em>Authors: <a href="https://www.linkedin.com/in/dominik-horn-9b9187220/">Dominik Horn</a>,
<a href="https://people.csail.mit.edu/kipf/">Andreas Kipf</a>,
<a href="https://www.linkedin.com/in/pascalpfeil/">Pascal Pfeil</a></em></p>
<p>We are happy to present <a href="https://doi.org/10.1145/3533702.3534912">LSI: A Learned Secondary Index
Structure</a> and our accompanying
<a href="https://github.com/learnedsystems/LearnedSecondaryIndex"><strong>open-source C++
implementation</strong></a> that
can easily be included in other projects using <a href="https://cmake.org/cmake/help/latest/module/FetchContent.html">CMake
FetchContent</a>.
LSI is a learned secondary index that offers competitive lookup performance on
real-world datasets while reducing space usage by up to 6x compared to
state-of-the-art secondary index structures.</p>
<h3 id="motivation">Motivation</h3>
<p>Most <a href="https://dl.acm.org/doi/pdf/10.1145/3183713.3196909">learned index
structures</a> focus on
primary indexing where base data is stored in sorted order. Since a table can
only be sorted by a single key, it is of high interest to study the
feasibility of applying learned techniques to indexing unsorted, secondary
columns. Specifically, we aimed to design an off-the-shelf usable index
structure that retains the <a href="https://learnedsystems.github.io/SOSDLeaderboard/leaderboard/">proven benefits of learned primary
indexes</a> as well
as possible.</p>
<h2 id="lsi-overview">LSI Overview</h2>
<p>LSI consists of two parts: a pre-existing learned index structure acting as an
approximate index (e.g., <a href="https://arxiv.org/abs/2108.05117">PLEX</a>) and a
permutation vector. The learned index is trained over a temporary sorted copy
of the base data’s keys. The permutation vector retains a mapping from the
sorted keys’ positions to offsets (tuple IDs) into the base data. The figure in
the next section gives a schematic illustration of LSI. A more detailed
explanation, especially concerning how we achieved LSI’s space efficiency, can
be found in our <a href="https://doi.org/10.1145/3533702.3534912">paper</a>.</p>
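<p>Conceptually, construction boils down to a sort, a model fit, and keeping the sort’s permutation. Below is a minimal Python sketch of the idea (the actual implementation is C++; <code class="language-plaintext highlighter-rouge">make_approx_index</code> is a hypothetical stand-in for training a model such as PLEX):</p>
<pre><code class="language-python">
import numpy as np

def build_lsi(keys, make_approx_index):
    """Build LSI over an unsorted column of keys.

    make_approx_index is a hypothetical stand-in for a learned index
    such as PLEX: it learns positions from a sorted key array and must
    expose predict(key) with a known error bound.
    """
    # Permutation vector: position in sort order -> offset (tuple ID)
    # into the unsorted base data.
    perm = np.argsort(keys, kind="stable")
    # Temporary sorted copy, used only for training and then discarded.
    sorted_keys = np.asarray(keys)[perm]
    model = make_approx_index(sorted_keys)
    return model, perm
</code></pre>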
<h3 id="lookups">Lookups</h3>
<p>Querying LSI works in three steps, as illustrated below:</p>
<p><img src="/assets/lsi/overview.png" alt="LSI Architecture Overview" /></p>
<ol>
<li>Query the learned index for an approximate position in the sorted base data.
Again, note that the sorted representation is not actually stored and only
exists temporarily during construction.</li>
<li>Obtain search bounds based on the approximate position and the learned
index’s error bounds. These can either be guaranteed by construction, as is
the case with <a href="https://arxiv.org/abs/2108.05117">PLEX</a>, or could be measured
and retained during training.</li>
<li>Conduct a local search over the permutation vector to locate the key.
<ul>
<li><strong>Lower-Bound Lookups</strong> perform a binary search. Note that this requires
<code class="language-plaintext highlighter-rouge">O(log k)</code> probes into the base data for a search interval of size <code class="language-plaintext highlighter-rouge">k</code>. This
could potentially be prohibitively costly in practice, e.g., when base
data is not fully loaded into main memory.</li>
<li><strong>Equality Lookups</strong> can be sped up by retaining fingerprint bits for each
key and performing a linear scan over the search interval. This requires
<code class="language-plaintext highlighter-rouge">O(k)</code> base data probes for a search interval of size <code class="language-plaintext highlighter-rouge">k</code>. However, each
additional fingerprint bit roughly halves the number of required accesses.</li>
</ul>
</li>
</ol>
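<p>The three steps above map directly onto code. Here is a minimal Python sketch of both lookup variants (the actual implementation is C++; <code class="language-plaintext highlighter-rouge">model.predict</code>, <code class="language-plaintext highlighter-rouge">model.err</code>, and the fingerprint layout are illustrative stand-ins):</p>
<pre><code class="language-python">
def lower_bound_lookup(model, perm, base_data, key):
    """Lower-bound lookup sketch: O(log k) probes into base_data."""
    pos = model.predict(key)                # step 1: approximate position
    lo = max(pos - model.err, 0)            # step 2: error bounds
    hi = min(pos + model.err, len(perm) - 1)
    while lo < hi:                          # step 3: binary search; every
        mid = (lo + hi) // 2                # comparison dereferences perm
        if base_data[perm[mid]] < key:      # into the unsorted base data
            lo = mid + 1
        else:
            hi = mid
    return perm[lo]  # tuple ID of the first key >= the search key

def equality_lookup(model, perm, fingerprints, base_data, key, fp_bits=8):
    """Equality lookup sketch: linear scan, but base data is only probed
    on fingerprint matches, so each extra bit roughly halves the probes."""
    pos = model.predict(key)
    lo = max(pos - model.err, 0)
    hi = min(pos + model.err + 1, len(perm))
    fp = hash(key) & ((1 << fp_bits) - 1)   # same hash as at build time
    for i in range(lo, hi):
        if fingerprints[i] == fp and base_data[perm[i]] == key:
            return perm[i]
    return None
</code></pre>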
<h2 id="evaluation">Evaluation</h2>
<p>Evaluations were conducted on a <code class="language-plaintext highlighter-rouge">c5.9xlarge</code> AWS machine with 36 vCPUs and 72 GiB
of RAM using four real-world datasets from
<a href="https://arxiv.org/abs/1911.13014">SOSD</a>, each with 200 million 64-bit unsigned
integer keys:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">amzn</code> book popularity data</li>
<li><code class="language-plaintext highlighter-rouge">fb</code> randomly sampled Facebook user IDs</li>
<li><code class="language-plaintext highlighter-rouge">osm</code> cell IDs from Open Street Map</li>
<li><code class="language-plaintext highlighter-rouge">wiki</code> timestamps of edits from Wikipedia</li>
</ul>
<p>All experiments ran single-threaded. To prevent skew from potential
out-of-order execution, we insert a full memory fence using the compiler
built-in <code class="language-plaintext highlighter-rouge">__sync_synchronize()</code> after each lookup. A more detailed writeup of
our findings can be found in our <a href="https://doi.org/10.1145/3533702.3534912">paper</a>.</p>
<h3 id="lower-bound-lookups">Lower-Bound Lookups</h3>
<p>To benchmark lower-bound lookups, we randomly remove 10% of the
elements from the datasets before building each index. We only query for
non-keys.
LSI with <a href="https://arxiv.org/abs/2108.05117">PLEX</a> as its approximate index is
competitive with <a href="https://db.in.tum.de/~leis/papers/ART.pdf">ART</a>’s latency,
while consuming up to 6x less space. It is twice as fast as
<a href="https://github.com/bingmann/stx-btree">BTree</a> while reducing space consumption
by up to 3x.</p>
<p><img src="/assets/lsi/lowerbounds.png" style="width:66%;display:block;margin-left:auto;margin-right:auto;" alt="lowerbounds benchmark on all four datasets" /></p>
<p>Note that BTree was constructed by first sorting the data and bulk loading it
afterwards. This enables denser nodes and therefore faster lookups. Inserting
keys in random order one after another increased space usage by roughly 33% and
slowed down lookups by more than 20%. The text annotations denote the LSI
model’s error bound.</p>
<h3 id="equality-lookups">Equality Lookups</h3>
<p>Although LSI with <a href="https://arxiv.org/abs/2108.05117">PLEX</a> as its approximate
index is around 1.5x slower than the Robin Hood hash table implementation
<a href="https://github.com/Tessil/robin-map">RobinMap</a>, it consumes less than a
quarter of the space. The text annotations again denote the LSI model’s error
bound. Note that we used the <code class="language-plaintext highlighter-rouge">amzn</code> dataset for this test, only to later
discover that RobinMap’s hash function appears to be unfavorably biased in
this case. On <code class="language-plaintext highlighter-rouge">fb</code> and <code class="language-plaintext highlighter-rouge">osm</code>, RobinMap’s lookup only takes 340ns compared to
the 440ns on <code class="language-plaintext highlighter-rouge">amzn</code>. Space consumption does not change.</p>
<p><img src="/assets/lsi/equality.png" style="width:45%;display:block;margin-left:auto;margin-right:auto;" alt="equality benchmark on amzn" /></p>
<h3 id="build-time">Build Time</h3>
<p>We measured the true end-to-end build time for each index given an unsorted,
contiguous in-memory representation of each dataset’s 200 million keys. LSI is
competitive in terms of build time. It performs slightly worse than BTree, but
surpasses <a href="https://db.in.tum.de/~leis/papers/ART.pdf">ART</a> on every dataset we
tested. As expected, tighter error bounds on the approximate index increase build
time.</p>
<p><a href="https://github.com/bingmann/stx-btree">BTree</a> is constructed by first sorting
and then bulk loading all keys. This is about 8x faster than inserting keys
in unsorted order one by one. ART does not have a bulk-loading interface;
however, its construction time is still cut in half when keys are first sorted
and then inserted one after another. We suspect this effect is primarily caused
by better cache behavior during construction.</p>
<p><img src="/assets/lsi/build_throughput.png" style="width:66%;display:block;margin-left:auto;margin-right:auto;" alt="build time comparison on all four datasets" /></p>
<p>Note that we would have expected
<a href="https://github.com/Tessil/robin-map">RobinMap</a> to exhibit equal build
throughput on <code class="language-plaintext highlighter-rouge">amzn</code>, <code class="language-plaintext highlighter-rouge">fb</code> and <code class="language-plaintext highlighter-rouge">osm</code>. Further investigation is required to
determine why it performs so poorly on <code class="language-plaintext highlighter-rouge">amzn</code>. We suspect that the hash function,
which we have no control over, might be unfavorably biased in this case. More
than half of all elements from <code class="language-plaintext highlighter-rouge">wiki</code> are duplicates. For this reason, build
times for RobinMap on this dataset appear faster.</p>
<h3 id="fingerprint-configuration">Fingerprint Configuration</h3>
<p>We conducted a micro experiment to study the trade-off between binary search
and linear search with fingerprints. For this purpose, we trained LSI with
<a href="https://arxiv.org/abs/2108.05117">PLEX</a> as an approximate index on <code class="language-plaintext highlighter-rouge">amzn</code>
using various errors and fingerprint bit sizes (denoted in brackets), and
compared lookup latencies:</p>
<p><img src="/assets/lsi/binary_vs_linear.png" style="width:47%;" alt="micro experiment as line plot" />
<img src="/assets/lsi/error_fingerprint_study_cpu_time.png" style="width:52%;" alt="micro experiment as heat map" /></p>
<p>For a search bound of size <code class="language-plaintext highlighter-rouge">k</code>, binary search always performs <code class="language-plaintext highlighter-rouge">O(log k)</code> probes
into the base data, while linear search requires <code class="language-plaintext highlighter-rouge">O(k)</code>. Each additional
fingerprint bit roughly halves the number of base data accesses. For example,
for <code class="language-plaintext highlighter-rouge">k = 256</code>, binary search performs about 8 probes, and a linear scan needs
5 fingerprint bits to match that in expectation (256 / 2<sup>5</sup> = 8).</p>
<h2 id="future-work">Future Work</h2>
<p>We are excited about our promising first results of approaching secondary
indexing using learned techniques. For production readiness, we will need to
add support for inserts. This will require two main adjustments. First, the
approximate index has to be updateable. We could either attempt to extend
<a href="https://arxiv.org/abs/2108.05117">PLEX</a> with support for inserts, or use an
existing off-the-shelf model that supports inserts like
<a href="https://dl.acm.org/doi/abs/10.14778/3389133.3389135">PGM</a>. Additionally, we
will have to find a better way to deal with dynamic updates to the permutation
vector. As currently implemented, each update requires shifting and
incrementing values, i.e., <code class="language-plaintext highlighter-rouge">O(n)</code> time.</p>
<p>Another interesting direction is to try and reduce base data accesses during
lower-bound lookups. While we could just store a copy of all keys in the
permutation vector, similar to BTree’s leaf layer, this will negate most of
LSI’s space savings. However, there might be an opportunity to store just
enough information about the keys to enable
efficient (binary) searching. Stay tuned!</p>
<h1 id="cardinality-estimation-benchmark">Cardinality Estimation Benchmark</h1>
<p><em>September 22, 2021</em></p>
<p><em>Authors: <a href="https://parimarjan.github.io">Parimarjan Negi</a>,
<a href="https://rmarcus.info/blog/">Ryan Marcus</a>, <a href="https://people.csail.mit.edu/kipf/">Andreas Kipf</a></em></p>
<p>In this blog post, we want to go over the motivations and applications of the <a href="https://github.com/learnedsystems/CEB">Cardinality Estimation Benchmark (CEB)</a>, which was a part of the VLDB 2021 <a href="http://vldb.org/pvldb/vol14/p2019-negi.pdf">Flow-Loss paper</a>.</p>
<p>There has been a lot of interest in using ML for cardinality estimation. The motivating application is often query optimization: when searching for the best execution plan, a query optimizer needs to estimate intermediate result sizes. In the most simplified setting, a better query plan may need to process smaller intermediate results, thereby utilizing fewer resources and executing faster.</p>
<p>Several approaches have shown that one can consistently outperform DBMS estimators, often by orders of magnitude in terms of average estimation accuracy. However, improving estimation accuracy may not necessarily improve an optimizer’s final query plan, as highlighted in the following simple example<sup id="fnref:estimation_plan_quality" role="doc-noteref"><a href="#fn:estimation_plan_quality" class="footnote" rel="footnote">1</a></sup>.</p>
<p><img src="/assets/ceb/ceb-blog-intuition.jpeg" alt="Plan Cost Intuition" /></p>
<p>In order to build large workloads to evaluate the impact of cardinality estimators, we introduce a novel <a href="https://github.com/learnedsystems/CEB/blob/main/TEMPLATES.md">programmatic templating scheme</a>.
We use it to generate over 15K challenging queries on two databases (IMDb and StackExchange). More crucially, we provide a clean API to evaluate cardinality estimations on all of a query’s subplans with respect to their impact on query plans.</p>
<h2 id="example">Example</h2>
<p>Consider query 1a66 from CEB-IMDb, shown below. A cardinality estimator provides the size estimate for each of its subplans (every connected subgraph of the join graph is a potential subplan we consider). CEB contains true cardinalities for all these subplans.</p>
<p><img src="/assets/ceb/CEB-blog-eg1.jpeg" alt="CEB query example" /></p>
<p>The subplan cardinality estimates can be fed into PostgreSQL, which gives us the best plan for these <i>estimates</i>. The cost of this plan using <i>true cardinalities</i> is the Postgres Plan Cost<sup id="fnref:ppc" role="doc-noteref"><a href="#fn:ppc" class="footnote" rel="footnote">2</a></sup>. Below, we visualize these plans when using the PostgreSQL cardinality estimates (left), which is almost 7x worse than using the true cardinalities (right).</p>
<p><img src="/assets/ceb/1a66-plans.jpeg" alt="CEB query plan example" /></p>
<p>Each node is a scan or join operator. The colorbar for the nodes goes from green to red, i.e., cheap operations to expensive operations. Thus, looking for the red nodes immediately shows us why the estimates messed up: PostgreSQL underestimated cardinalities of two key nodes (highlighted in the left figure), and thus, using a nested-loop join was a bad choice — since the true cardinalities of these nodes were large, and would therefore require a lot more processing. This is a common pattern of PostgreSQL cardinality underestimates resulting in bad plans. That being said, we also see examples in CEB where PostgreSQL gets a worse plan due to overestimates, or other subtler motifs.</p>
<p>You can test your own cardinality estimator using our Python framework in the <a href="https://github.com/learnedsystems/CEB">CEB GitHub repository</a>. You just need to provide the estimates for all the subplans, and it should evaluate those estimates based on Q-Error, or different Plan Costs (e.g., using the PostgreSQL backend, or a simpler cost model).</p>
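<p>For illustration, here is a minimal Python sketch of the Q-Error computation over a set of subplan estimates (this mirrors the metric CEB reports; the function names are ours, not the actual CEB API):</p>
<pre><code class="language-python">
def q_error(estimate, true_card):
    """Q-Error (Moerkotte et al.): multiplicative deviation, always >= 1."""
    est, act = max(estimate, 1.0), max(true_card, 1.0)  # guard against zeros
    return max(est / act, act / est)

def evaluate_estimates(estimates, true_cards):
    """Both arguments map a subplan identifier (e.g., a frozenset of table
    aliases) to a cardinality; returns mean and worst-case Q-Error."""
    errs = [q_error(estimates[sp], true_cards[sp]) for sp in true_cards]
    return sum(errs) / len(errs), max(errs)
</code></pre>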
<p>We provide a simple featurization scheme for queries, and data loaders for PyTorch, which we use to train several known supervised learning models in the <a href="http://github.com/learnedsystems/CEB">CEB repo</a>. One of the key results we find is that when the query workload matches the training workload, learned models tend to do very well in terms of query plan performance, but when the workload changes somewhat, the learned models can be surprisingly brittle:</p>
<p><img src="/assets/ceb/CEB-blog-runtimes.jpeg" alt="CEB key result" /></p>
<p>The figure on the left has the training / test set queries split equally on each template (only test set results are shown). We see that even the mean over almost 6K queries shows a large gap between PostgreSQL estimates and learned models — this translates to several hours faster total runtime of the workload. The figure on the right splits training / test queries such that queries from half the templates are in the training set, and the rest are in the test set. Since we have only a few templates, the performance can be very sensitive to the particular seed used to do the split; therefore, we show the results across ten such splits (seeds = 1-10). Observe that there are some extreme splits where the learned model performance can degrade dramatically.</p>
<p>These trends are also reflected when we use Postgres Plan Cost as the evaluation metric:</p>
<p><img src="/assets/ceb/CEB-blog-ppc.jpeg" alt="CEB key result2" /></p>
<p>Computing runtimes is expensive — this experiment shows that
we can often rely on the Plan Cost evaluation metric instead, although
the exact relationship between runtimes and plan costs remains an
open research problem.</p>
<h1 id="why-is-this-benchmark-needed">Why Is This Benchmark Needed?</h1>
<p>The <a href="https://github.com/gregrahn/join-order-benchmark">Join Order Benchmark (JOB)</a> did a great job of highlighting why TPC-style synthetic benchmarks may not be enough for evaluating query optimizers, in particular, the impact of cardinality estimation.
In the <a href="http://github.com/learnedsystems/CEB">CEB repo</a>, we also provide cardinality data for JOB, and other derived workloads, such as <a href="https://github.com/neurocard/neurocard">JOB-M</a>, or <a href="https://github.com/andreaskipf/learnedcardinalities/blob/master/workloads/job-light.sql">JOB-light</a>, so they can be as easily evaluated with the tools described so far.
However, even though JOB illustrates query optimization challenges, it contains too few queries for a cardinality estimation benchmark suited to the deep learning-style models often used today. The table below shows key properties of CEB compared to JOB.<sup id="fnref:ceb-stats" role="doc-noteref"><a href="#fn:ceb-stats" class="footnote" rel="footnote">3</a></sup></p>
<p align="center">
<img width="360" height="360" src="/assets/ceb/CEB-benchmark-comparison.jpeg" />
</p>
<p>There are several reasons why a larger benchmark is useful. Two critical motivations for us were:</p>
<ul>
<li>
<p><b>Query-driven models require large training sets.</b>
Models, such as <a href="https://github.com/andreaskipf/learnedcardinalities">MSCN</a>, <a href="https://dl.acm.org/doi/10.14778/3329772.3329780">Fully Connected Neural Nets (FCNN) / XGBoost</a>, or <a href="http://pasalabs.org/papers/2021/VLDB21_Fauce.pdf">Fauce</a> learn from representative workloads.
Therefore, having <= 4 queries per template, such as in JOB, is not great for training such models.
Note that there is no shortage of representative SQL queries in industry settings — thus, the potential benefits of query-driven approaches make it a compelling use case to study.
Under appropriate circumstances, benefits over data-driven models include: significantly smaller sizes (e.g., <a href="https://arxiv.org/abs/2006.08109">NeuroCard</a> models for a restricted subset of IMDb takes 100s of MBs, while MSCN-style models are < 2MB), flexibility to support all kinds of queries (e.g., self-joins or regex filters), <a href="http://pasalabs.org/papers/2021/VLDB21_Fauce.pdf">uncertainty estimates (Fauce)</a>, incorporating runtime information as features — such as estimates from heuristic estimators (proposed <a href="https://dl.acm.org/doi/10.14778/3329772.3329780">here</a>), sampling (e.g., see sample bitmaps proposed in <a href="https://arxiv.org/abs/1809.00677">MSCN</a>).</p>
</li>
<li>
<p><b>Gaining confidence in deep learning models through rich execution scenarios and edge cases.</b>
For traditional cardinality estimation models, which were based on analytical formulas, we could be confident of their functioning, including shortcomings, based on intuitive analysis. A lot of the newer models are based on deep learning-based black box techniques. Even when a model does well on a particular task — it does not guarantee that it will have a predictably good performance in another scenario. At the same time, there is a promise of huge performance improvements. One way to gain confidence in these models is to show that they work well across significantly different, and challenging scenarios. Thus, the techniques used for developing CEB aim to create queries with a lot of edge cases, and challenges, which should be useful to study when these models work well, and when they don’t. This should let us study their performance in various contexts — changing workloads, changing data, different training set sizes, changing model parameters, and so on.</p>
</li>
</ul>
<h1 id="next-steps">Next Steps</h1>
<p>We would love to get your contributions to further developing CEB, or exploring research questions using it. Here are a few ideas:</p>
<ul>
<li>
<p><b>Adding more DBs, and query workloads.</b> There are a lot of learned models for cardinality estimation, and very few challenging evaluation scenarios. Thus, it is hard to compare these methods, and to reliably distinguish between the strengths and weaknesses of these approaches. There are two steps to expanding on the evaluation scenarios in CEB. First, we need new databases — for instance, if you have an interesting real-world database in PostgreSQL, then providing a <code class="language-plaintext highlighter-rouge">pg_dump</code> of it should allow us to easily integrate it into CEB. We provide IMDb and StackExchange, and plan to add a couple of others in the future. Second, once we have a new database, we require query workloads for it. We have provided some tools for automated query generation, but at its core, all such methods would require some representative queries thought through by people familiar with the schema. This is the most challenging step for expanding such benchmarks, and we hope that open sourcing these tools can bring people together to collectively build larger workloads.</p>
</li>
<li>
<p><b>Limits of the learning models.</b> The paper <a href="http://vldb.org/pvldb/vol14/p1640-wang.pdf">Are We Ready For Learned Cardinality Estimation?</a> won a Best Paper award in VLDB 2021. The authors ask several important questions about learned cardinality estimation models, but their experiments are restricted to single-table estimation, i.e., without joins. CEB should provide the tools to ask similar questions in the more complex query optimization use case of these estimators.</p>
</li>
<li>
<p><b>Data-driven models.</b> Comparing unsupervised learning models (e.g., <a href="https://github.com/neurocard/neurocard">NeuroCard</a>, <a href="https://github.com/DataManagementLab/deepdb-public">DeepDB</a>, <a href="https://arxiv.org/abs/2012.14743">BayesCard</a>) with the supervised learning models. Some of these approaches are harder to adapt to our full set of queries — this involves modeling self-joins, regex queries, and so on. Working to extend the data-driven approaches to these common use cases should be interesting. But we have also converted other, simpler query workloads, like JOB-M as used in NeuroCard, to our format, and provide scripts to run them and generate the plan costs etc.</p>
</li>
<li>
<p><b>Different execution environments.</b> In the <a href="http://vldb.org/pvldb/vol14/p2019-negi.pdf">Flow-Loss paper</a>, we mainly focused on one particular execution scenario: single-threaded execution on NVMe drives, which ensured minimum additional noise and variance. We have explored executing these queries in other scenarios, and we notice the runtime latencies fluctuate wildly. For instance, when executing on AWS EBS storage, even the same query plan latencies can fluctuate due to the I/O bursts. On slower SATA hard disks, we find all the query plans get significantly slower, thus potentially causing many more challenging scenarios for the cardinality estimators. Similar effects are seen when we don’t use indexes. These effects can also be somewhat modeled by different configuration settings — which would allow Postgres Plan Cost to serve as a viable proxy for latencies in these situations. These queries and the evaluation framework provide many interesting opportunities to analyze these impacts.</p>
</li>
<li>
<p><b>Alternative approaches to cardinality estimation.</b> Another interesting line of research suggests that query optimizers should not need to rely on precise cardinality estimates when searching for the best plan. This includes <a href="https://dl.acm.org/doi/10.1145/2588555.2588566">plan bouquets</a>, <a href="https://waltercai.github.io/assets/pessimistic-query-optimization.pdf">pessimistic query optimization</a>, <a href="http://www.vldb.org/pvldb/vol11/p1360-wolf.pdf">robust query optimization</a>, and so on. For instance, pessimistic QO approaches have done quite well on JOB. It is interesting to see if they can do equally well on a larger workload, potentially with more edge cases, such as CEB.</p>
</li>
<li>
<p><b>Featurization schemes.</b> Our featurization makes simplifying assumptions, such as already knowing the exact templates that will be used in the queries and so on. This may be reasonable in some practical cases, like templated dashboard queries, but will not support ad-hoc queries. Similarly, models should be able to support self joins without knowing the templates beforehand, and so on.</p>
</li>
</ul>
<p>We envision <a href="https://github.com/learnedsystems/CEB">CEB</a> to serve as a foundation for benchmarking (learned) cardinality estimators. We hope to add additional database backends besides PostgreSQL, and several other databases and query workloads, in order to build a much more challenging set of milestones for building robust and reliable ML models for cardinality estimation in a query optimizer. Please contribute!</p>
<h1 id="notes">Notes</h1>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:estimation_plan_quality" role="doc-endnote">
<p>Intuitively, this is because to get the best plan, you only need the cost of the best plan to be the cheapest. So, for instance, large estimation errors on subplans that are never part of a competitive plan would not affect it. There are more such scenarios in the <a href="http://vldb.org/pvldb/vol14/p2019-negi.pdf">Flow-Loss paper</a>. <a href="#fnref:estimation_plan_quality" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:ppc" role="doc-endnote">
<p>Postgres Plan Cost (PPC) is based on the abstract Plan Cost defined in the excellent ten-year-old paper, <a href="http://www.vldb.org/pvldb/vol2/vldb09-657.pdf">Preventing Bad Plans by Bounding the Impact of Cardinality Estimation Errors</a> by Moerkotte et al. They also introduced Q-Error in the paper, which has been commonly used as the evaluation metric of choice in recent cardinality estimation papers. PPC is a useful proxy for query execution latencies in PostgreSQL, based on its cost model, but the approach is not DBMS-specific. For instance, we have the basic ingredients for a hacky MySQL implementation <a href="https://github.com/parimarjan/mysql-server">here</a>. PPC is useful because executing queries can be very resource intensive, noisy, and so on. Meanwhile, PPC can be computed almost as easily as Q-Error, and it is more closely aligned with the goals of query optimization. And the two metrics don’t always agree. For instance, we have seen scenarios where an estimator has lower average Q-Error, but higher Postgres Plan Cost. We show its correlation with runtimes, and further discuss the use of the Plan Costs in the <a href="http://vldb.org/pvldb/vol14/p2019-negi.pdf">Flow-Loss paper</a>. <a href="#fnref:ppc" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:ceb-stats" role="doc-endnote">
<p>A new <a href="https://github.com/Nathaniel-Han/End-to-End-CardEst-Benchmark">benchmark</a> was released last week, which has similar motivations as CEB. We have not yet had the time to look at and compare it with our benchmark. <a href="#fnref:ceb-stats" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="defeating-duplicates-a-re-design-of-the-learnedsort-algorithm">Defeating Duplicates: A Re-Design of the LearnedSort Algorithm</h1>
<p><em>July 8, 2021</em></p>
<p><em>Author: <a href="https://anikristo.com">Ani Kristo</a></em></p>
<p><a href="https://github.com/learnedsystems/LearnedSort">LearnedSort</a> is a novel sorting algorithm that uses fast ML models to boost the sorting speed. We introduced the algorithm in <a href="https://dl.acm.org/doi/10.1145/3318464.3389752">SIGMOD 2020</a> together with a large set of benchmarks that showed outstanding performance as compared to state-of-the-art sorting algorithms.</p>
<p>However, given the nature of the underlying model, its performance was affected on high-duplicate inputs. In this post we introduce <a href="https://arxiv.org/abs/2107.03290"><strong>LearnedSort 2.0</strong></a>: a re-design of the algorithm that maintains the leading edge even for high-duplicate inputs. Extensive benchmarks demonstrate that it is on average 4.78× faster than the original LearnedSort for high-duplicate datasets, and 1.60× for low-duplicate datasets.</p>
<h2 id="background-on-learnedsort">Background on LearnedSort</h2>
<p style="text-align: center;"><img src="/assets/learnedsort/original_diag.png" alt="original_diag" style="max-height: 12em" /></p>
<p>The core idea of the LearnedSort algorithm was simple: we used ML techniques to train a model that estimated the distribution of the data from a sample of the keys; then used it to predict the order of keys in the sorted output. The model acts as an estimator of the <em>scaled empirical CDF</em> of the input.</p>
<p>In order to ensure extremely fast and accurate distribution modeling, as well as effective cache utilization, we designed LearnedSort such that it:</p>
<ol>
<li>Partitioned input keys into fixed-capacity buckets based on the predicted values.</li>
<li>Repeated the partitioning process until the buckets get small enough to fit in the cache.</li>
<li>Used a model-based, Counting Sort routine to sort the elements inside the small buckets.</li>
<li>Performed a clean-up step for minor imperfections using Insertion Sort.</li>
<li>Used a <em>spill bucket</em> that collected all the overflowing keys from the fixed-capacity buckets. This was sorted separately and merged back to the output array at the end.</li>
</ol>
<p>Eventually, LearnedSort resulted in a very fast, cache-efficient, ML-enhanced sorting algorithm that showed impressive results in our benchmarks. LearnedSort achieved an average of 30% better performance than the next-best sorting algorithm (I1S⁴o), 49% over Radix Sort, and an impressive 238% better than std::sort, the default sorting algorithm in C++. These exciting results also landed LearnedSort a feature in <a href="https://blog.acolyer.org/2020/10/19/the-case-for-a-learned-sorting-algorithm/">the morning paper</a>.</p>
<h2 id="the-achilles-heel-of-learned-sort">The Achilles’ heel of Learned Sort</h2>
<p style="text-align: center;"><img src="/assets/learnedsort/original_zipf.png" alt="original_zipf" style="max-height: 12em" /></p>
<p>One of the biggest challenges for LearnedSort was the impact of inputs having too many duplicate keys. In such scenarios, the eCDF values for any two equal keys would also be equal. Therefore, the eCDF model makes the same prediction for equal keys and places them into the same bucket.</p>
<p>LearnedSort could tolerate a certain degree of duplicate keys; however, this became an issue when the input contained a substantial amount of such keys (usually more than 50%). In this case, certain buckets would quickly reach full capacity and overflow onto the spill bucket, while others would be mostly empty. In turn, progressively more keys needed to be sorted using an external algorithm (i.e., std::sort, which is slower).</p>
<p>Thus, the spill bucket sorting step became the bottleneck, and the performance of LearnedSort on high-duplicate inputs deteriorated compared to the average case. The figure above shows that LearnedSort’s sorting rate decreases by 22% when the input data has a Zipfian distribution with skew 0.9, which corresponds to 72% duplicates.</p>
<h2 id="learned-sort-20">Learned Sort 2.0</h2>
<p>It was clear that to improve LearnedSort’s performance on high-duplicate inputs, we had to eliminate the spill bucket, and the possibility of bucket overflows. To achieve this, we had to allocate precisely as much space as each bucket needed. One approach is to scan the input and calculate bucket sizes based on model predictions. However, the overhead for this additional step is not insignificant. To mitigate that cost, we could use the training sample for estimating each bucket’s size before allocating. In that case, however, we have to make room for estimation errors by over-allocating, and additional memory acquisition also results in a slowdown.</p>
<p>Instead, what worked best was to think of buckets as a collection of smaller, fixed-capacity fragments. The eCDF prediction step would be the same; however, instead of placing the key directly into its predicted bucket, we place it into the corresponding pre-allocated fragment owned by the predicted bucket.</p>
<p style="text-align: center;"><img src="/assets/learnedsort/step1.png" alt="step1" style="max-width: 36em" /></p>
<p>Once a fragment reaches its full capacity (i.e., 100 elements), it is copied back to the input at a given write-head. The write-head is updated, and the copied fragment is cleared to make space for more incoming elements. This procedure continues until the entire input has been processed.</p>
<p style="text-align: center;"><img src="/assets/learnedsort/step2.png" alt="step2" style="max-width: 36em" /></p>
<p>Using this method, we managed to avoid the spill bucket altogether and logically give each bucket as many slots as it needs, at the cost of bucket fragmentation. After the first partitioning step has finished, the fragments owned by the same bucket (sibling fragments) will most probably not be contiguous, but rather scattered throughout the input array. Therefore, it is necessary to perform an additional step that combines sibling fragments to form a contiguous space that represents a whole bucket.</p>
<p style="text-align: center;"><img src="/assets/learnedsort/step3.png" alt="step3" style="max-width: 36em" /></p>
<p>After the buckets have been defragmented, the algorithm will re-partition the keys inside each bucket to further refine the sortedness of the input. This step is similar to the original LearnedSort, and it is done in two steps as a way to maximize data locality and cache utilization. While iterating through the formed buckets, the new algorithm performs a quick check to see if all the elements in the bucket are equal, in which case, it may skip the re-partitioning and jump to the next bucket.</p>
<p>The next step is to do an in-bucket sort using the same model-enhanced Counting Sort subroutine as in the original LearnedSort. This subroutine is fast and has linear time complexity.</p>
<p style="text-align: center;"><img src="/assets/learnedsort/step4.png" alt="step4" style="max-width: 36em" /></p>
<p>Finally, LearnedSort 2.0 still uses Insertion Sort as a final and deterministic touch-up subroutine, which, in almost linear time, guarantees that the input has been monotonically ordered.</p>
<h2 id="benchmarks">Benchmarks</h2>
<p>The new design of LearnedSort 2.0 resulted in remarkable performance improvements as compared to the original one. It is on average 4.33× faster for high-duplicate <em>real</em> datasets and 6.57× faster for high-duplicate <em>synthetic</em> datasets. For all other cases, LearnedSort 2.0 achieves a 1.60× speed-up over the original algorithm.</p>
<p>We first evaluated LearnedSort 2.0 against the original algorithm on ten different datasets. The figure below shows the sorting rate of LearnedSort and LearnedSort 2.0 on a mix of real and synthetic datasets where the majority of keys are duplicates. In all of the cases, LearnedSort 2.0 gets a major performance boost: it is <em>at least</em> 60% faster than the original LearnedSort (in the case of TwoDups), and <em>on average</em> 378% faster for all these datasets.</p>
<p style="text-align: center;"><img src="/assets/learnedsort/hi_dups.png" alt="step4" style="max-width: 36em" /></p>
<p>In order to further analyze the improvements in LearnedSort 2.0, it is also important to see how the algorithm performs on datasets that LearnedSort was already very good at. We have demonstrated that the original LearnedSort algorithm was the best-performing one in datasets that contained a low degree of duplicates, and this re-design was not aimed at improving in those aspects. Nonetheless, LearnedSort 2.0 still outperforms the original algorithm by an average of 60%. The figure below shows this benchmark’s results.</p>
<p style="text-align: center;"><img src="/assets/learnedsort/lo_dup.png" alt="step4" style="max-width: 36em" /></p>
<p>Finally, we show the performance of LearnedSort 2.0 in comparison with other state-of-the-art sorting algorithms on Zipfian datasets with progressively increasing skew – which corresponds to a higher proportion of duplicates. As shown in the figure below, the sorting rate of LearnedSort 2.0 remains fairly unchanged by the increasing ratio of duplicates, dropping only 3% below the reference line in the case of the highest skew parameter. At the same time, LearnedSort 2.0 leads with the best performance overall, remaining highly competitive with respect to other algorithms even for datasets with a large degree of duplicates.</p>
<p style="text-align: center;"><img src="/assets/learnedsort/zipf.png" alt="step4" style="max-width: 36em" /></p>
<p>For more in-depth analysis of LearnedSort 2.0, please check out our <a href="https://github.com/learnedsystems/LearnedSort">Github repo</a> and our <a href="https://arxiv.org/abs/2107.03290">new paper</a>, where we give a detailed description of the benchmark setup and show additional experiments on datasets with diverse distributions. We also showed micro-benchmarks to explain its cache efficiency and the improvements from the new design.</p>
<h2 id="conclusion">Conclusion</h2>
<p>LearnedSort 2.0 is a major revision of the original LearnedSort algorithm that builds on its strongest points while fixing the side effects of large spill buckets on datasets containing a large portion of duplicates. Here we described the new algorithmic changes and showed benchmark highlights to evaluate its performance. Future work remains on extending LearnedSort to parallel execution models and to disk-scale data.</p>
<h1 id="lea-a-learned-encoding-advisor-for-column-stores">LEA: A Learned Encoding Advisor for Column Stores</h1>
<p><em>June 21, 2021</em></p>
<p><em>Authors: Lujing Cen, <a href="https://people.csail.mit.edu/kipf/">Andreas Kipf</a></em></p>
<p>We are presenting LEA, our new learned encoding advisor, at <a href="http://www.aidm-conf.org/">aiDM @ SIGMOD 2021</a>. Check out our <a href="https://youtu.be/9jaJLrAdiPQ">presentation</a> and <a href="https://arxiv.org/pdf/2105.08830.pdf">paper</a>.</p>
<p>In this blog post, we give a high-level overview of LEA. LEA helps the database choose the best encoding for each column. At the moment, it can optimize for compressed size or query speed. On TPC-H, LEA achieves 19% lower query latency while using 26% less space compared to the encoding advisor of a commercial column store.</p>
<h2 id="motivation">Motivation</h2>
<p style="text-align: center;"><img src="/assets/lea/background.png" alt="Background" style="max-height: 12em" /></p>
<p>Modern databases support different encodings (lossless compression schemes) for storing columnar data. These encodings can reduce the overall storage footprint and improve query performance. A few attributes to consider when selecting a good encoding scheme are compressed size, random access capabilities, and decompression speed.</p>
<p>As an example, we compressed a 1 GiB CSV file using <a href="https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)">LZ4</a> and <a href="https://en.wikipedia.org/wiki/Gzip">Gzip</a> on a <strong>5d.2xlarge</strong> EC2 instance with a network-attached general purpose SSD. Decompression time is measured in the cold-cache scenario (i.e., the file system cache is cleared).</p>
<div align="center">
<table>
<thead>
<tr>
<th></th>
<th>LZ4 (level 1)</th>
<th>Gzip</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>Compression Speed</b></td>
<td>703 MiB</td>
<td>428 MiB</td>
</tr>
<tr>
<td><b>Decompression Time</b></td>
<td>3.85 s</td>
<td>9.45 s</td>
</tr>
</tbody>
</table>
</div>
<p>We see that Gzip achieves a much better compression ratio while LZ4 has a much faster decompression speed. This leads us to the observation that compression schemes inherently represent a tradeoff between I/O operations and CPU operations. The speeds of the underlying storage device and CPU are what determine the overall decompression speed. An encoding advisor should incorporate the tradeoffs of different encodings when selecting the best one for each column.</p>
<p style="text-align: center;"><img src="/assets/lea/related.png" alt="Related" style="max-height: 12em" /></p>
<p>There is some work in the area of encoding selection. Existing column advisors usually try to minimize size because they assume that the storage device is much slower than the CPU, which is no longer always true. They either use heuristics based on column statistics or extract a sample from a column and try all encodings. A project known as <a href="https://vks.ai/2019-12-05-shrynk-using-machine-learning-to-learn-how-to-compress">Shrynk</a> uses statistics from a Pandas DataFrame to build a classification model which predicts the best compression scheme to use. Concurrent work on <a href="http://people.cs.uchicago.edu/~hajiang/paper/codecdb.pdf">CodecDB</a> presents a neural network for ranking different encodings based on compressed size. It also has specialized operators which can operate on encoded columns without fully decoding the data.</p>
<h2 id="goal">Goal</h2>
<p style="text-align: center;"><img src="/assets/lea/goal.png" alt="Goal" style="max-height: 12em" /></p>
<p>Our goal is to build an encoding advisor that takes into account the following when determining what encoding to apply for each column:</p>
<ul>
<li>Underlying hardware (e.g., the speed of the storage device and CPU characteristics).</li>
<li>User’s data (e.g., how data is distributed in each column).</li>
<li>User’s objective (e.g., minimizing compressed size or minimizing query latency).</li>
</ul>
<h2 id="overview">Overview</h2>
<p style="text-align: center;"><img src="/assets/lea/overview.png" alt="Overview" style="max-height: 12em" /></p>
<p>LEA operates on <em>slices</em>. We define a slice to be 1 million rows of a column. Each slice needs to be large enough that we can meaningfully perform experiments involving time, but small enough that we can efficiently produce many of them for training. Given a slice and an encoding, LEA predicts three properties – the encoded size of the slice, the in-memory scan speed of the slice, and the from-storage scan speed of the slice.</p>
<p>Here are the steps that LEA uses to obtain its predictions:</p>
<ol>
<li>Take a 1% contiguous sample from the slice and encode it. This will give us the encoded size of the sample.</li>
<li>Compute relevant statistics about the entire slice (discussed later).</li>
<li>Use the encoded size of the sample and the slice statistics to predict the encoded size of the entire slice.</li>
<li>Use the predicted encoded size and the slice statistics to predict the in-memory scan speed.</li>
<li>Use the predicted encoded size and the predicted in-memory scan speed to predict the from-storage scan speed.</li>
<li>Repeat steps 1-5 for each slice and encoding.</li>
<li>Apply an arbitrary objective to select the best encodings (e.g., compressed size, query speed, or a mix of the two).</li>
</ol>
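<p>Putting steps 1-7 together, the selection loop looks roughly like the following sketch (<code class="language-plaintext highlighter-rouge">compute_stats</code>, the three model objects, <code class="language-plaintext highlighter-rouge">enc.encode</code>, and <code class="language-plaintext highlighter-rouge">objective</code> are hypothetical stand-ins for LEA’s internals):</p>
<pre><code class="language-python">
def advise(slices, encodings, size_model, mem_model, storage_model, objective):
    """Sketch of LEA's selection loop; helper names are illustrative."""
    choices = {}
    for col, slice_ in slices.items():
        stats = compute_stats(slice_)         # step 2: one pass over the slice
        sample = slice_[:len(slice_) // 100]  # step 1: 1% contiguous sample
        best_score, best_enc = None, None
        for enc in encodings:                 # step 6: try every encoding
            sample_size = len(enc.encode(sample))                    # step 1
            size = size_model.predict(sample_size, stats)            # step 3
            mem_speed = mem_model.predict(size, stats)               # step 4
            storage_speed = storage_model.predict(size, mem_speed)   # step 5
            score = objective(size, mem_speed, storage_speed)        # step 7
            if best_score is None or score < best_score:
                best_score, best_enc = score, enc
        choices[col] = best_enc
    return choices
</code></pre>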
<p style="text-align: center;"><img src="/assets/lea/statistics.png" alt="Statistics" style="max-height: 12em" /></p>
<p>The figure above shows the slice statistics that we collect for each data type. These statistics can be computed efficiently in one pass. In addition, we found that they work well for encoding selection.</p>
<div align="center">
<table>
<thead>
<tr>
<th></th>
<th>Integral & Short String</th>
<th>Long String</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>Encoded Size</b></td>
<td><a href="https://en.wikipedia.org/wiki/Random_forest">Random Forest</a></td>
<td>Linear</td>
</tr>
<tr>
<td><b>Memory Speed</b></td>
<td>Random Forest</td>
<td>Linear</td>
</tr>
<tr>
<td><b>Storage Speed</b></td>
<td>Constrained Linear</td>
<td>Constrained Linear</td>
</tr>
</tbody>
</table>
</div>
<p>Unlike other techniques, LEA uses regression instead of classification, which allows for arbitrary objectives that do not have to be defined during training. For integral and short strings, which we define as having a mean length of at most 64 characters, we use random forest regression for predicting the encoded size and in-memory scan speed. For long strings, we just use linear regression because random forests cannot extrapolate. For predicting the from-storage scan speed, we use a constrained linear regression to model the latency and throughput of the underlying storage device.</p>
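<p>As an illustration of the constrained model for from-storage speed, one can fit scan time as a fixed latency plus size over throughput using non-negative least squares. Below is a minimal sketch, assuming per-encoding measurements of slice size and cold-cache scan time (the actual feature set is richer):</p>
<pre><code class="language-python">
import numpy as np
from scipy.optimize import nnls

def fit_storage_model(sizes_mib, scan_times_s):
    """Fit scan_time ~ latency + size * (1 / throughput), coefficients >= 0."""
    A = np.column_stack([np.ones(len(sizes_mib)), np.asarray(sizes_mib)])
    coef, _residual = nnls(A, np.asarray(scan_times_s))
    latency_s, inv_throughput = coef
    throughput = 1.0 / inv_throughput if inv_throughput > 0 else float("inf")
    return latency_s, throughput  # seconds, MiB/s
</code></pre>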
<p style="text-align: center;"><img src="/assets/lea/training.png" alt="Training" style="max-height: 14em" /></p>
<p>For training, we use synthetic data generated from distributions. Although it is possible to use real-world datasets, they would need to be transferred to the target system since LEA needs to measure the scan speed of different encodings on many slices. Once a slice is generated, we perform additional post-processing steps like inserting null values at random locations and possibly sorting the data to more effectively explore the input space.</p>
<h2 id="evaluation">Evaluation</h2>
<p>To evaluate our technique, we compare four different strategies on a commercial column store (System-C) using the same <strong>5d.2xlarge</strong> EC2 setup as before. C-Default uses System C’s default encodings: delta for integral types and LZ4 for strings. C-Heuristic is System C’s encoding advisor, which optimizes for size. LEA-S and LEA-Q are two versions of LEA that minimize size and query latency, respectively.</p>
<p style="text-align: center;"><img src="/assets/lea/stackoverflow.png" alt="StackOverflow" style="max-height: 20em" /></p>
<p>The figure above shows results for the StackOverflow workload. It consists of a denormalized table with around 12 million rows and 4 queries. The y-axis represents the ratio over C-Default. Therefore, values less than 1 represent an improvement over C-Default. Note that query latency experiments are performed with cold cache.</p>
<p>In terms of encoded size, we see that LEA-S outperforms C-Heuristic on a few columns. LEA-Q, which does not optimize for size, still achieves a good overall compression ratio. In terms of query latency, we see that purely optimizing for size does not lead to better performance. In fact, it can significantly slow down queries compared to using the default encodings. However, LEA-Q successfully finds encodings that improve the performance of all four queries.</p>
<p style="text-align: center;"><img src="/assets/lea/tpch.png" alt="TPC-H" style="max-height: 20em" /></p>
<p>The figure above shows results for the TPC-H workload at scale factor 10. We run all 22 queries with default substitution parameters. For encoded size, C-Heuristic finds better encodings for around half of the columns, whereas LEA-S finds better encodings for around 80% of the columns. For query latency, C-Heuristic and LEA-S are both worse than the default for a majority of queries, whereas LEA-Q improves on all but one query. Overall, LEA-Q achieves 19% lower query latency while using 26% less space compared to the encoding advisor of System C.</p>
<p style="text-align: center;"><img src="/assets/lea/ablation.png" alt="Ablation" style="max-height: 12em" /></p>
<p>We also compared LEA against ablated versions to demonstrate the importance of using both the sample encoded size as well as slice statistics. For this experiment, we measured the symmetric mean absolute percentage error (SMAPE) of predicted values for different encodings. We see that LEA outperforms its ablated versions for all encodings. This is reasonable because dictionary encoding and frame-of-reference encoding depend on the cardinality and range, respectively, both of which are hard to estimate from a sample. However, slice statistics are less useful for more complex compression schemes like Zstandard.</p>
<h2 id="future-work">Future Work</h2>
<p>In future work, we hope to make LEA query-aware. So far, we’ve treated all columns equally. However, in an actual workload, some columns will be queried more frequently than others. In addition, the type of access will vary (e.g., sequential scan versus random access). These statistics can be extracted from the query logs.</p>
<h1 id="more-bao-results-learned-distributed-query-optimization-on-vertica-redshift-and-azure-synapse">More Bao Results: Learned Distributed Query Optimization on Vertica, Redshift, and Azure Synapse</h1>
<p><em>June 17, 2021</em></p>
<p><em>Author: <a href="https://rmarcus.info">Ryan Marcus</a></em></p>
<p>Next week, we’ll present our new system for learned query optimization, <a href="https://rm.cab/bao">Bao</a>, at <a href="https://2021.sigmod.org/">SIGMOD21</a>, where we are thrilled to receive a <a href="https://2021.sigmod.org/sigmod_best_papers.shtml">best paper award</a>.</p>
<p>In our paper, we show how Bao can be applied to the open-source <a href="https://www.postgresql.org/">PostgreSQL DBMS</a>, as well as an <a href="https://en.wikipedia.org/wiki/David_DeWitt">unnamed</a> commercial system. Both DBMSes ran in a traditional, single-node environment. Here, we’ll give a brief overview of the Bao system and then walk through our early attempts at applying Bao to commercial, cloud-based, distributed database management systems.</p>
<p>For more information on Bao in the traditional, single-node context, check out:</p>
<ul>
<li><a href="https://rm.cab/bao">Our research paper</a>, published in SIGMOD 2021.</li>
<li>For a video overview, the recording of Ryan’s SIGMOD talk (<a href="https://www.youtube.com/watch?v=nEy90-WNkjo">20 minute version</a> or <a href="https://www.youtube.com/watch?v=-tvt8QzZcXM">3 minute version</a>).</li>
<li>For an overview of tree convolution as applied to query plans, Ryan’s <a href="https://www.youtube.com/watch?v=g4iiDVWtQZo">AIDB 2019 talk</a>.</li>
</ul>
<p>However, since we wrote the paper almost a year ago, much has happened. First, we worked together with Microsoft to explore how Bao can help with Big Data workloads.
This work will also be presented at SIGMOD as part of the industry session, and received an honorable mention for the industry best paper award.</p>
<p>Second, after the discussion we had with Mike Stonebraker around the potential impact Bao could have on distributed data warehouse systems (obviously, he was very skeptical), we ran a whole range of additional experiments on Vertica, Redshift, and Azure Synapse using a real-world dataset we received from an anonymous corporation.</p>
<h1 id="how-bao-works">How Bao works</h1>
<p>Previous approaches<sup id="fnref:prev" role="doc-noteref"><a href="#fn:prev" class="footnote" rel="footnote">1</a></sup> to learned query optimization attempted to replace large parts of traditional query optimizers. In contrast, Bao sits on top of a traditional query optimizer (called the <em>underlying optimizer</em>) and learns to steer the underlying optimizer in the right direction.</p>
<p><img src="/assets/bao/bao_blog_diag.svg" alt="Overview of the Bao process. Text in image repeated below." /></p>
<p>The image above illustrates the process Bao uses to steer the underlying query optimizer.</p>
<ol>
<li>When a query arrives, the underlying optimizer is used to generate a number of plan variants for the query. For example, we might have the optimizer generate a plan using index nested-loop joins, another plan using sort-merge joins, and a third plan using hash joins.<sup id="fnref:arms" role="doc-noteref"><a href="#fn:arms" class="footnote" rel="footnote">2</a></sup></li>
<li>Once these plan variants are constructed, a predictive model (a deep neural network) predicts the latency of each plan.</li>
<li>The plan with the best predicted latency is selected and executed. The result is sent to the user, and the actual latency is recorded.</li>
<li>The actual latency of the executed query is added to Bao’s experience set, which is used to further refine the predictive model using <a href="https://en.wikipedia.org/wiki/Thompson_sampling">Thompson sampling</a>.<sup id="fnref:nosup" role="doc-noteref"><a href="#fn:nosup" class="footnote" rel="footnote">3</a></sup></li>
</ol>
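<p>In code, a single Bao decision looks roughly like the sketch below (<code class="language-plaintext highlighter-rouge">optimizer.plan</code>, <code class="language-plaintext highlighter-rouge">model.predict</code>, and <code class="language-plaintext highlighter-rouge">execute</code> are illustrative stand-ins, not Bao’s actual API):</p>
<pre><code class="language-python">
def bao_step(query, hint_sets, optimizer, model, experience):
    """One steering decision: plan variants -> predicted latencies -> pick."""
    variants = [optimizer.plan(query, hints) for hints in hint_sets]  # step 1
    predicted = [model.predict(plan) for plan in variants]            # step 2
    best_plan = variants[predicted.index(min(predicted))]             # step 3
    latency = execute(best_plan)   # run the chosen plan, record its latency
    experience.append((best_plan, latency))                           # step 4
    # The experience set is later used to retrain the predictive model;
    # Bao approximates Thompson sampling by periodically retraining.
    return latency
</code></pre>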
<p>Over time, Bao’s predictive model learns from its mistakes, hopefully making increasingly accurate predictions. Experimentally, we’ve <a href="https://rm.cab/bao">shown</a> that Bao can learn in the presence of dynamic workloads, shifting data distributions, and schema modifications.</p>
<h1 id="from-single-node-to-distributed">From single-node to distributed</h1>
<p>In the original Bao paper, we evaluated Bao on a single-node database system (e.g., Oracle and PostgreSQL). However, many data warehouse databases are, in fact, distributed. Luckily, Bao is largely agnostic to the underlying execution engine or storage layout: as long as you have a set of hints, Bao can pick and choose from them and learn from its mistakes. In the paper, we discuss what good hints for a single-node DBMS look like: for example, forcing particular join types, or forcing particular access paths (i.e., index vs. table scan). These choices can impact the performance of the plan on their own, but can also impact the join order selected by the underlying query optimizer, <a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf">potentially resulting in drastically different run times</a>.</p>
<p><strong>Adapting Bao to a distributed DBMS is simply a matter of finding the right hints.</strong> While many aspects of the single-node case apply to the distributed case as well (operator choice matters, join order matters), distributed DBMSes bring about other important performance considerations:</p>
<p>Suppose a fact table is distributed across multiple nodes based on a key column. Should a join of that fact table with a dimension table be done by broadcasting the dimension table to all nodes and using a hash join? By partitioning the dimension table on the foreign key and using a merge join followed by a union? By collecting the matching rows of the fact table on a single node and then performing a non-distributed merge?</p>
<p>Depending on network costs, query selectivity, materialization strategy, what data is already present at each node, and a wide range of other factors, any of these strategies might be the fastest. Different DBMSes will lean towards different options depending on which code paths have been optimized. Most distributed DBMSes choose between these strategies using the same tools that non-distributed DBMSes use: heuristics, cost models, and cardinality estimation. These tools are already highly error-prone (and require significant tuning) in non-distributed settings, so you can imagine how tricky things get when entire clusters are involved!</p>
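<p>As a deliberately crude back-of-envelope illustration (our own; it ignores selectivity, materialization, and data placement), even network traffic alone is enough to flip the winner between two of these strategies:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Crude network-cost model for two distributed join strategies; the
# numbers only illustrate that the winner flips with the input sizes.
def broadcast_cost(dim_rows, nodes):
    # ship the whole dimension table to every node
    return dim_rows * nodes

def resegment_cost(fact_rows, dim_rows, nodes):
    # reshuffle both inputs so matching keys land on the same node;
    # roughly (nodes - 1) / nodes of each table crosses the network
    return (fact_rows + dim_rows) * (nodes - 1) / nodes

# Small dimension table: broadcasting moves far fewer rows.
print(broadcast_cost(10_000, 3))                      # 30,000
print(resegment_cost(1_000_000_000, 10_000, 3))       # ~667 million

# Large dimension table: resegmenting moves fewer rows.
print(broadcast_cost(500_000_000, 3))                 # 1.5 billion
print(resegment_cost(1_000_000_000, 500_000_000, 3))  # ~1 billion
</code></pre></div></div>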
<p>Next, we’ll walk through how Bao can be applied to three different state-of-the-art commercial cloud distributed DBMSes: Vertica, Amazon Redshift, and Azure Synapse (an analytics-focused offering of SQL Server). After discussing each system, we’ll show a small experiment highlighting potential gains (or lack thereof) from applying Bao. <em>We’ll be running each DBMS on different hardware, so please do not attempt to draw comparisons between these DBMSes</em>.</p>
<p>In each test, we’ll be executing an analytic dashboarding workload called <code class="language-plaintext highlighter-rouge">Corp</code>. The workload was donated to our research team by a large corporation under the condition of anonymity. It contains about 1 TB of data and 2000 unique queries that change over time. A large schema change (normalizing a fact table) happens in the middle of the workload – we do not count the time required to perform this modification. The workload makes use of analytic functions, (materialized) views, and other advanced SQL features.</p>
<h2 id="vertica">Vertica</h2>
<p>Vertica is a distributed columnar DBMS, and is the commercial adaptation of the <a href="https://dl.acm.org/doi/pdf/10.1145/3226595.3226638">C-Store paper</a>. Vertica’s optimizer is reasonably transparent, and the <a href="https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Hints/Hints.htm">documented set of hints</a> gives us a lot of control over what the optimizer considers.</p>
<p>For Vertica, we select query hints to (see the sketch after this list):</p>
<ul>
<li>Force a particular join operator (<code class="language-plaintext highlighter-rouge">JTYPE</code> hint),</li>
<li>force a particular group by operator (<code class="language-plaintext highlighter-rouge">GBYTYPE</code> hint),</li>
<li>force a particular distributed join algorithm (<code class="language-plaintext highlighter-rouge">DISTRIB</code> hint, with options <code class="language-plaintext highlighter-rouge">L</code>, <code class="language-plaintext highlighter-rouge">R</code>, <code class="language-plaintext highlighter-rouge">B</code>, or <code class="language-plaintext highlighter-rouge">A</code>, representing a local join, a resegment join, a broadcast join, or letting the optimizer pick, respectively).</li>
</ul>
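<p>For illustration, a hinted query can be produced with a small rewrite. The hint names below come from Vertica’s documentation, but the placement rules are simplified (some hints attach to individual joins rather than the whole query), so treat this as a sketch rather than production code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sketch: inject Vertica hints after the SELECT keyword.
# Hint names follow Vertica's documentation; real placement rules
# (e.g., per-join hints) are simplified away here.
def add_hints(query, jtype=None, gbytype=None, distrib=None):
    hints = []
    if jtype:
        hints.append(f"JTYPE({jtype})")      # H = hash join, M = merge join
    if gbytype:
        hints.append(f"GBYTYPE({gbytype})")  # HASH or PIPE
    if distrib:
        hints.append(f"DISTRIB({distrib})")  # L, R, B, or A per join input
    if not hints:
        return query
    hint_str = "".join(f"/*+{h}*/" for h in hints)
    return query.replace("SELECT", f"SELECT {hint_str}", 1)

print(add_hints("SELECT * FROM fact JOIN dim USING (k)",
                jtype="H", distrib="L,B"))
# SELECT /*+JTYPE(H)*//*+DISTRIB(L,B)*/ * FROM fact JOIN dim USING (k)
</code></pre></div></div>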
<p>We started up two identical 3-node clusters on AWS using the official Vertica image (which, at the time of writing, uses <code class="language-plaintext highlighter-rouge">r4.4xlarge</code> nodes by default). We loaded the data into both clusters, and used the <a href="https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/AdministratorsGuide/ConfiguringTheDB/PhysicalSchema/DBD/AboutDatabaseDesigner.htm">Vertica DBD tool</a> to create a good layout, which we manually verified through testing. One cluster ran Bao on top of Vertica (<code class="language-plaintext highlighter-rouge">Bao</code>), while the other cluster did not run Bao (<code class="language-plaintext highlighter-rouge">Vertica</code>). The resulting times and costs are plotted below:</p>
<p><img src="/assets/bao/bao_vert.svg" alt="Cost and latency data for Vertica and Vertica with Bao, described below" /></p>
<p>Bao was able to reduce the end-to-end processing time of this workload by over three hours, while reducing cost by over 25% (about $25, in our case). The time savings with Bao are more significant than the cost savings because Bao must periodically retrain its predictive model, which (temporarily) requires a GPU.</p>
<p>Around 45% of Bao’s gains (the plurality) come from forcing the Vertica optimizer to use a broadcast join instead of a resegment join when it wrongly over-estimates the cardinality of a particular subplan. The small subplan result can be cheaply materialized and sent to all nodes, which is faster than evenly reshuffling parts of the subplan to each node in the cluster (a resegment).</p>
<p>In the future, we intend to experiment with specific hints to include or exclude projections (<code class="language-plaintext highlighter-rouge">PROJS</code> and <code class="language-plaintext highlighter-rouge">SKIP_PROJS</code>) to allow Bao to further custom-tailor its strategy to the user’s data.</p>
<h2 id="azure-synapse-sql-server">Azure Synapse (SQL Server)</h2>
<p>Azure Synapse is an analytics offering from Microsoft, based on SQL Server. SQL Server is one of the most widespread commercial database systems. Unlike Vertica, SQL Server can store data in either a row-based or a column-based format. Here, we’ll focus on the column store format provided by the Azure Synapse Analytics cloud database. Unfortunately, the types of hints available for Synapse Analytics <a href="https://docs.microsoft.com/en-us/sql/t-sql/queries/option-clause-transact-sql">are quite limited</a> – so we had to get a bit creative.</p>
<p>For Synapse, we select query hints to:</p>
<ul>
<li>Disable or enable a particular type of join operator (hash, merge, and/or loop),</li>
<li>use replicated or partitioned versions of the dimension tables,</li>
<li>use an indexed or non-indexed version of the fact table.</li>
</ul>
<p>To implement the last two hints, we cheated a little: we created two versions of each dimension table, one replicated and one partitioned, along with two versions of the fact table (one indexed and one non-indexed). Based on the hint Bao selects, we rewrite the query to use the chosen version of each table.</p>
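<p>A sketch of that rewrite, with hypothetical table names and a naive string replacement standing in for a proper SQL rewriter:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sketch of the table-swapping trick: keep two physical
# copies of each table and substitute names based on Bao's chosen hint.
# Table names are invented; a real rewriter should parse the SQL.
TABLE_VARIANTS = {
    ("dim_date", "replicated"): "dim_date_rep",
    ("dim_date", "partitioned"): "dim_date_part",
    ("fact_sales", "indexed"): "fact_sales_idx",
    ("fact_sales", "nonindexed"): "fact_sales_noidx",
}

def rewrite(query, layout_choice):
    # layout_choice maps a logical table name to the chosen variant,
    # e.g. {"fact_sales": "nonindexed", "dim_date": "replicated"}
    for table, choice in layout_choice.items():
        query = query.replace(table, TABLE_VARIANTS[(table, choice)])
    return query

print(rewrite("SELECT sum(amt) FROM fact_sales JOIN dim_date USING (d_id)",
              {"fact_sales": "nonindexed", "dim_date": "replicated"}))
# SELECT sum(amt) FROM fact_sales_noidx JOIN dim_date_rep USING (d_id)
</code></pre></div></div>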
<p>We spun up two Azure Synapse Analytics instances, each with 1000 dedicated DWUs (“data warehouse units,” a measure of the query processing resources available to queries). We ran one instance with Bao and the other without. After loading the data, we added the specialized tables described above to the instance running Bao. The other instance, which would use the stock optimizer, had all dimension tables replicated and an indexed version of the fact table. The resulting times and costs are plotted below:</p>
<p><img src="/assets/bao/bao_ss.svg" alt="Cost and latency data for Azure and Azure with Bao, described below" /></p>
<p>Bao was able to reduce the end-to-end runtime of this workload by a little over two hours, while reducing the cost by around 10% (a little under $35). Again, the cost reduction is smaller than the latency reduction because of the cost of training the Bao model.</p>
<p>A plurality of Bao’s gains (40%) came from avoiding the index on the fact table (sometimes by removing loop join as an option, sometimes by rewriting the query to use the non-indexed table). Removing the index from the instance running the stock optimizer decreased its performance: the index helped on the whole, but it hurt some queries, and Bao was able to identify those queries automatically.</p>
<p>In the future, Bao could be applied to Azure Synapse in serverless mode, but this mode currently does not support query hints.</p>
<h2 id="redshift">Redshift</h2>
<p>Redshift is Amazon’s data analytics offering, and is often cited as one of the first “cloud native” analytics databases. As far as we can tell, Redshift only offers <em>two</em> query optimization hints, which we’ll use alongside the same trick we used for Synapse.<sup id="fnref:noindex" role="doc-noteref"><a href="#fn:noindex" class="footnote" rel="footnote">4</a></sup></p>
<p>For Redshift, we select query hints to (see the sketch after this list):</p>
<ul>
<li>Disable or enable AQUA, the Redshift query accelerator (<code class="language-plaintext highlighter-rouge">activate_aqua</code>),</li>
<li>disable or enable using (up-to-date) materialized views in query processing (<code class="language-plaintext highlighter-rouge">mv_enable_aqmv_for_session</code>),</li>
<li>use replicated or partitioned versions of the dimension tables (as with Synapse).</li>
</ul>
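<p>A sketch of how such a hint set translates into session settings. The parameter names are the ones listed above; the exact <code class="language-plaintext highlighter-rouge">SET</code> syntax and accepted values should be checked against Redshift’s documentation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: a Redshift "hint set" is a pair of session settings plus the
# table-swapping trick from the Synapse section. Treat the SET syntax
# as illustrative, not authoritative.
def session_setup(aqua_on, mv_on):
    return [
        f"SET activate_aqua TO {'on' if aqua_on else 'off'};",
        f"SET mv_enable_aqmv_for_session TO {'on' if mv_on else 'off'};",
    ]

for stmt in session_setup(aqua_on=True, mv_on=False):
    print(stmt)
# SET activate_aqua TO on;
# SET mv_enable_aqmv_for_session TO off;
</code></pre></div></div>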
<p>Unfortunately, Redshift offers by far the least visibility into and control over its query optimizer, leaving our Bao implementation (seemingly) limited. We started two 3-node clusters (<code class="language-plaintext highlighter-rouge">ra3.4xlarge</code> nodes), one running Bao and one running the stock optimizer. The results are plotted below:</p>
<p><img src="/assets/bao/bao_rs.svg" alt="Cost and latency data for Redshift and Redshift with Bao, described below" /></p>
<p>Bao reduces total workload latency by over an hour, and cuts costs by around 10% (in this case, about $10). Most of the gains (65%) came from disabling the usage of materialized views within subtree expressions, which the Redshift optimizer seems to do a little too aggressively (e.g., scanning the entire materialized view is slower than applying the predicates to the underlying relations and performing the join). Globally disabling the use of materialized views in query processing hurts overall performance, again showing that Bao is able to learn when the feature helps and when it hurts.</p>
<p>Redshift seems like a really cool system, but unfortunately it does not provide the visibility or control required by researchers for a deep study of query optimization. In the future, it would be awesome to see Amazon build a few more windows and knobs (with sane defaults!) into the optimizer for scientists to use.</p>
<h1 id="discussion">Discussion</h1>
<p>Looking over the results, we find ourselves returning to the same set of takeaways:</p>
<ul>
<li>Don’t spend too much time comparing results between systems. The experiments were run on different clouds, using different hardware, at different times of day. While we followed best-practice guidelines to tune the systems, an expert might still do better, in particular for Redshift and Azure Synapse, as we have less experience with them.</li>
<li>Just because Bao produces large gains on system X but smaller gains on system Y doesn’t mean the default optimizer of system X is better than the default optimizer of system Y. In our opinion, it is more likely that system X provides more visibility into the optimizer, whereas system Y keeps things pretty opaque, limiting Bao’s possible gains.</li>
<li>Everyone wants a “knob-free” query optimizer, but nobody tells you that it comes at the cost of decreased query performance. Without incorporating feedback from query execution in some way, heuristic optimizers are doomed to have edge cases that make them fall flat. “Knob-free” sometimes just means “not sophisticated enough to give you the tools you need to fix the mistakes.” A truly “knob-free” query optimizer should take query latency feedback into account.</li>
<li>The idea behind Bao – use a deep neural network as a predictive model to choose between a number of different variants, then retrain that model progressively using Thompson sampling – seems widely applicable. We focused on database systems, but maybe there are other applications as well?</li>
</ul>
<p>So is there a downside to Bao? Certainly! Bao causes query optimization to take a little more time (~300 ms) and requires quite a bit more computation. We studied this overhead in our SIGMOD paper. For data warehouse workloads, which largely consist of long-running, resource-intensive queries, Bao’s overhead is hardly noticeable. However, for workloads with many short-running queries, like OLTP workloads, this might not be the case. We are currently working on new approaches to mitigate this problem – so stay tuned!</p>
<p>If you feel so inclined, you can <a href="https://rm.cab/bao">read the Bao paper</a>, check out our <a href="https://learned.systems/bao">open source prototype</a> for PostgreSQL, or take a look at the <a href="http://dsail.csail.mit.edu/index.php/publications-new/">other publications</a> from our group.</p>
<h1 id="notes">Notes</h1>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:prev" role="doc-endnote">
<p>Approaches such as <a href="https://rm.cab/rejoin">ReJOIN</a>, <a href="https://arxiv.org/abs/1808.03196">DQ</a>, and <a href="https://rm.cab/neo">Neo</a> (in chronological order) fall into this category. <a href="#fnref:prev" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:arms" role="doc-endnote">
<p>The choice of which plan variants to generate is important. Generally, one variant is always just what the underlying query optimizer would have chosen with no intervention, and the other variants are generated by forcing the optimizer to ignore particular optimization paths. See the paper for more details. <a href="#fnref:arms" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:nosup" role="doc-endnote">
<p>Note that training the predictive model using standard supervised learning techniques will not balance the exploration of new policies (i.e., trying something new to see how well it works) against the exploitation of existing knowledge (i.e., doing what we know works well). Thus, <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">reinforcement learning</a> techniques are preferred. It is easy to think that, to avoid regressions, you would want to maximize exploitation. But this can get you trapped in a local minimum: if you mis-predict the value of the optimal plan, you will never select it. In other words, without exploration, you might initially have fewer regressions, but you’ll never recover from them! <a href="#fnref:nosup" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:noindex" role="doc-endnote">
<p>Redshift, like Vertica, does not support traditional indexes on tables like Synapse does, so we do not add those indexed variants here. <a href="#fnref:noindex" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>MIT DSGAuthor: Ryan Marcus Next week, we’ll present our new system for learned query optimization, Bao, at SIGMOD21, where we are thrilled to receive a best paper award. In our paper, we show how Bao can be applied to the open-source PostgreSQL DBMS, as well as an unnamed commercial system. Both DBMSes ran in a traditional, single-node environment. Here, we’ll give a brief overview of the Bao system and then walk through our early attempts at applying Bao to commercial, cloud-based, distributed database management systems.Announcing the Learned Indexing Leaderboard2021-06-14T00:00:00+00:002021-06-14T00:00:00+00:00http://learnedsystems.mit.edu/announcing-the-learned-indexing-leaderboard<p><em>Authors: Allen Huang, <a href="https://people.csail.mit.edu/kipf/">Andreas Kipf</a>, <a href="https://rmarcus.info/blog/">Ryan Marcus</a>, and <a href="https://people.csail.mit.edu/kraska/">Tim Kraska</a></em></p>
<p><a href="https://dl.acm.org/doi/pdf/10.1145/3183713.3196909">Learned indexes</a> have received a lot of attention over the past few years. The idea is to replace existing index structures, like B-trees, with learned models. In recent a <a href="https://vldb.org/pvldb/vol14/p1-marcus.pdf">paper</a>, which we did in collaboration with TU Munich and are going to present at <a href="https://vldb.org/2021/">VLDB 2021</a>, we compared learned index structures against various highly tuned traditional index structures for in-memory read-only workloads. The benchmark, which we published as <a href="https://github.com/learnedsystems/SOSD">open source</a> including all datasets and implementations, confirmed that learned indexes are indeed significantly smaller while providing similar or better performance than their traditional counterparts on real-world datasets.</p>
<p>Since the initial release of SOSD, we’ve made a few additions to the framework:</p>
<ul>
<li>Added new competitors (<a href="https://github.com/microsoft/ALEX">ALEX</a> and a <a href="https://github.com/stoianmihail/CHT">C++ implementation</a> of <a href="http://cidrdb.org/cidr2021/papers/cidr2021_paper20.pdf">HistTree</a> contributed by Mihail Stoian).</li>
<li>Added synthetic datasets as well as smaller (50M rows) datasets.</li>
<li>Reduced overhead of benchmarking framework further.</li>
</ul>
<p>While SOSD certainly filled the gap of a standardized benchmark, we feel that, due to the sheer number of papers in this area, there are still many discrepancies among experimental evaluations. This is mainly because many implementations need to be tuned for the datasets and hardware at hand.</p>
<p>Today, we’re happy to announce the <a href="https://learnedsystems.github.io/SOSDLeaderboard/leaderboard/">Learned Indexing Leaderboard</a>, an ongoing indexing benchmark on various synthetic and real-world datasets based on <a href="https://github.com/learnedsystems/SOSD">SOSD</a>. For each dataset, there are different size categories (e.g., M stands for an index size of up to 1% of the dataset size). We’ll be using the <strong>m5zn.metal</strong> AWS instance type for the leaderboard to ensure a common playing field.</p>
<p>We hope that our benchmark will receive contributions from the community (index implementations, datasets, and workloads) and can serve as a common benchmarking testbed.</p>
<p><img src="/assets/sosd/screenshot.png" alt="SOSD Leaderboard" /></p>MIT DSGAuthors: Allen Huang, Andreas Kipf, Ryan Marcus, and Tim Kraska Learned indexes have received a lot of attention over the past few years. The idea is to replace existing index structures, like B-trees, with learned models. In recent a paper, which we did in collaboration with TU Munich and are going to present at VLDB 2021, we compared learned index structures against various highly tuned traditional index structures for in-memory read-only workloads. The benchmark, which we published as open source including all datasets and implementations, confirmed that learned indexes are indeed significantly smaller while providing similar or better performance than their traditional counterparts on real-world datasets.Welcome2021-06-07T00:00:00+00:002021-06-07T00:00:00+00:00http://learnedsystems.mit.edu/welcome<p>In the coming weeks, we’ll start sharing regular research updates on learned systems by <a href="http://dsg.csail.mit.edu/mlforsystems/">MIT DSG</a>. Stay tuned!</p>
<p>To receive email notifications about new posts, you can <a href="https://forms.gle/YVYXRi729aJPA3tg8">subscribe here</a>.</p>MIT DSGIn the coming weeks, we’ll start sharing regular research updates on learned systems by MIT DSG. Stay tuned! To receive email notifications about new posts, you can subscribe here.