LSI: A Learned Secondary Index Structure

July 22, 2022

Authors: Dominik Horn, Andreas Kipf, Pascal Pfeil

We are happy to present LSI: A Learned Secondary Index Structure, together with an open-source C++ implementation that can easily be included in other projects using CMake FetchContent. LSI offers competitive lookup performance on real-world datasets while reducing space usage by up to 6x compared to state-of-the-art secondary index structures.

Cardinality Estimation Benchmark

September 22, 2021

Authors: Parimarjan Negi, Ryan Marcus, Andreas Kipf

In this blog post, we want to go over the motivations and applications of the Cardinality Estimation Benchmark (CEB), which was a part of the VLDB 2021 Flow-Loss paper.

There has been a lot of interest in using ML for cardinality estimation. The motivating application is often query optimization: when searching for the best execution plan, a query optimizer needs to estimate intermediate result sizes. In the simplest setting, a better query plan processes smaller intermediate results, thereby using fewer resources and executing faster.
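To make the link between estimates and plan quality concrete, here is a minimal Python sketch. It is a toy model, not the benchmark's code or the Flow-Loss method: the table names, the cardinalities, and the cost function (the sum of the sizes of all join outputs in a left-deep plan) are assumptions made up for illustration.

```python
from itertools import permutations


def plan_cost(order, card):
    """Toy cost: sum of the sizes of every join output in a left-deep plan."""
    cost = 0
    joined = frozenset()
    for i, table in enumerate(order):
        joined = joined | {table}
        if i >= 1:  # a join output exists once two or more tables are combined
            cost += card[joined]
    return cost


def best_plan(tables, card):
    """Pick the join order that looks cheapest under the given cardinalities."""
    return min(permutations(tables), key=lambda order: plan_cost(order, card))


tables = ("A", "B", "C")

# Hypothetical true result sizes (rows) for each set of joined tables.
true_card = {
    frozenset("AB"): 1_000_000,
    frozenset("AC"): 100,
    frozenset("BC"): 50_000,
    frozenset("ABC"): 500,
}

# Hypothetical estimates that badly underestimate the A-B join.
est_card = {
    frozenset("AB"): 10,
    frozenset("AC"): 10_000,
    frozenset("BC"): 50_000,
    frozenset("ABC"): 500,
}

ideal = best_plan(tables, true_card)   # plan chosen with perfect estimates
chosen = best_plan(tables, est_card)   # plan chosen with the bad estimates

# Both plans are charged their TRUE cost, since that is what execution pays.
print(ideal, plan_cost(ideal, true_card))
print(chosen, plan_cost(chosen, true_card))
```

In this toy setting, the plan picked from the bad estimates joins A and B first and therefore materializes the large intermediate result that the ideal plan avoids.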

Defeating Duplicates: A Re-Design of the LearnedSort Algorithm

July 08, 2021

Author: Ani Kristo

LearnedSort is a novel sorting algorithm that uses fast ML models to boost sorting speed. We introduced the algorithm at SIGMOD 2020 together with a large set of benchmarks that showed outstanding performance compared to state-of-the-art sorting algorithms.

However, given the nature of the underlying model, its performance degraded on high-duplicate inputs. In this post we introduce LearnedSort 2.0: a re-design of the algorithm that maintains the leading edge even for high-duplicate inputs. Extensive benchmarks demonstrate that it is on average 4.78× faster than the original LearnedSort on high-duplicate datasets, and 1.60× faster on low-duplicate datasets.

LEA: A Learned Encoding Advisor for Column Stores

June 21, 2021

Authors: Lujing Cen, Andreas Kipf

We are presenting LEA, our new learned encoding advisor, at aiDM @ SIGMOD 2021. Check out our presentation and paper.

In this blog post, we give a high-level overview of LEA. LEA helps the database choose the best encoding for each column. At the moment, it can optimize for compressed size or query speed. On TPC-H, LEA achieves 19% lower query latency while using 26% less space compared to the encoding advisor of a commercial column store.

More Bao Results: Learned Distributed Query Optimization on Vertica, Redshift, and Azure Synapse

June 17, 2021

Author: Ryan Marcus

Next week, we’ll present our new system for learned query optimization, Bao, at SIGMOD 2021, where we are thrilled to receive a best paper award.

In our paper, we show how Bao can be applied to the open-source PostgreSQL DBMS, as well as an unnamed commercial system. Both DBMSes ran in a traditional, single-node environment. Here, we’ll give a brief overview of the Bao system and then walk through our early attempts at applying Bao to commercial, cloud-based, distributed database management systems.

Announcing the Learned Indexing Leaderboard

June 14, 2021

Authors: Allen Huang, Andreas Kipf, Ryan Marcus, and Tim Kraska

Learned indexes have received a lot of attention over the past few years. The idea is to replace existing index structures, like B-trees, with learned models. In a recent paper, written in collaboration with TU Munich and to be presented at VLDB 2021, we compared learned index structures against various highly tuned traditional index structures for in-memory read-only workloads. The benchmark, which we published as open source including all datasets and implementations, confirmed that learned indexes are indeed significantly smaller while providing similar or better performance than their traditional counterparts on real-world datasets.
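The core idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not one of the index structures from the benchmark: the class name `LinearLearnedIndex` is hypothetical, and a single least-squares line stands in for the learned model. The model predicts a key's position in a sorted array, and the largest prediction error seen while building bounds the range a lookup has to search.

```python
import bisect


class LinearLearnedIndex:
    """Toy read-only index: one linear model plus a bounded binary search."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("need at least one key")
        self.keys = sorted(keys)
        n = len(self.keys)

        # Least-squares fit of position ~ slope * key + intercept.
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (i - mean_pos) for i, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_pos - self.slope * mean_key

        # Worst-case distance between predicted and actual position.
        self.max_err = max(abs(i - self._predict(k)) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        n = len(self.keys)
        pos = self._predict(key)
        # Only the window [pos - max_err, pos + max_err] can contain the key.
        lo = min(max(0, pos - self.max_err), n)
        hi = max(lo, min(n, pos + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None


keys = [3, 8, 15, 16, 23, 42, 108, 1000]
index = LinearLearnedIndex(keys)
print(index.lookup(42), index.lookup(4))
```

The space saving comes from storing only the model parameters and the error bound instead of inner tree nodes. How well this works depends on how closely the model fits the key distribution: a poor fit widens the search window.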

Welcome

June 07, 2021

In the coming weeks, we’ll start sharing regular research updates on learned systems by MIT DSG. Stay tuned!

To receive email notifications about new posts, you can subscribe here.