The Data Quarry

An evolving hardware landscape#

In September 2023, Wes McKinney, creator of the Pandas library, published an illuminating blog post reflecting on 15 years of evolution in data management systems. In particular, he highlighted the impact of modern hardware in pushing the industry towards a more modular, composable data stack. It’s no wonder, then, that his friend, colleague and Pandas co-author, Chang She, is the founder of LanceDB, a developer-friendly embedded vector database that aligns very well with this vision and is the focus of the third post in this series.

LanceDB is an open-source, embedded, and developer-friendly vector database. Some of the key features that set it apart from many existing solutions are listed below, with a minimal usage sketch after the list:

  • Incredibly lightweight (no DB servers to manage), because it is entirely in-process
  • Extremely scalable from development to production
  • Ability to perform full-text search (FTS), SQL search and vector search
  • Multi-modal data support (images, text, video, audio, point-clouds, etc.)
  • Zero-copy (via Arrow) with automatic versioning of data
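
To make the “embedded” part concrete, the snippet below is a minimal sketch of what working with LanceDB looks like from Python: the database is just a directory on disk, so there is no server to spin up. The table name, data and vector dimensionality are purely illustrative, and the exact API surface may vary slightly between lancedb versions.

import lancedb

# No server to manage: "connecting" simply points at a directory on disk
db = lancedb.connect("./wine_lancedb")

# Create a table from plain Python dicts (Arrow, Pandas and Polars data also work)
table = db.create_table(
    "wines",
    data=[
        {"id": 1, "description": "soft plum and cracked pepper", "vector": [0.1, 0.2, 0.3]},
        {"id": 2, "description": "bright citrus acidity", "vector": [0.3, 0.1, 0.0]},
    ],
)

# Vector search: pass a query vector with the same dimensionality as the table
results = table.search([0.1, 0.2, 0.25]).limit(2).to_pandas()
print(results)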

The aim of this post is to understand the internals of LanceDB, and to showcase its performance on full-text and vector search via two realistic integration modes: direct async clients and end-to-end REST API calls.

Towards “deconstructed databases”#

The term deconstructed database[1] was coined in 2019 by Julien Le Dem, one of the original designers of the Parquet file format and Arrow specification. These database systems deviate from the well-known vertically-integrated systems that have dominated the DB landscape for decades. Instead, they are built from a collection of modular, reusable components, each of which can be developed and optimized by entirely separate groups of people, typically as open-source projects.

In his blog[2], Wes points out that today’s compute hardware is radically different from what it was in 2013, when Parquet was conceived. In particular, the availability of blazing-fast SSDs and NVMe drives has led to a shift in thinking toward designing systems that can store and query specialized data types like vectors and time series at scale. This explains the development of Lance, a new data format for the age of AI.

The composition of LanceDB#

LanceDB implements its own vector and full-text search index on top of the underlying Lance data format. DataFusion, an embeddable SQL query engine, is used to power the full-text/vector search queries via a SQL interface. The Apache Arrow format is used to allow a smooth transition between in-memory and on-disk data storage, as well as for seamless interoperability with other data formats from systems like DuckDB, Pandas, or Polars.
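
Because the data is Arrow-backed, handing a LanceDB table to other tools is cheap. The sketch below (assuming the duckdb and polars packages, and the illustrative "wines" table from the earlier snippet) shows the same data being queried by DuckDB and wrapped by Polars via Arrow.

import duckdb
import lancedb
import polars as pl

db = lancedb.connect("./wine_lancedb")
wines = db.open_table("wines").to_arrow()  # materialize the table as an Arrow table

# DuckDB can query the in-memory Arrow table in place (replacement scan)
print(duckdb.sql("SELECT id, description FROM wines WHERE id = 1").df())

# Polars wraps the same Arrow data without copying the underlying buffers
print(pl.from_arrow(wines))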

Lance#

Lance is a modern, columnar, multimodal lakehouse format that’s the foundation of LanceDB’s capabilities. It’s optimized for fast random access and scans on both traditional data and binary data (blobs). Lance is both a file and table format that stores traditional tabular data, blobs, nested data, and embeddings in a single file (with a table layer on top). Adding new columns and backfilling them is inexpensive, and the format supports versioning and time travel.
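
The sketch below, using the pylance package (import lance), illustrates the versioning and time-travel behavior on a toy dataset; the file path and data are illustrative.

import lance
import pyarrow as pa

tbl = pa.table({"id": [1, 2], "points": [90, 88]})
lance.write_dataset(tbl, "reviews.lance")                    # initial version

more = pa.table({"id": [3], "points": [92]})
lance.write_dataset(more, "reviews.lance", mode="append")    # creates a new version

ds = lance.dataset("reviews.lance")
print(ds.version)                                            # latest version number
print(lance.dataset("reviews.lance", version=1).to_table())  # time travel to version 1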

Arrow#

It’s impossible to overstate the impact that Apache Arrow[2] has had on the consolidation of analytical tooling in the industry, due to the way it connects disparate portions of the data stack via an in-memory, language-agnostic specification. The following key features of Arrow are relevant to Lance and LanceDB:

  • Column-oriented in-memory operation optimized for fast analytical processing
  • Zero-copy, chunk-oriented data layer designed for moving and accessing large amounts of data from disparate storage layers
  • Extensible type metadata for describing a wide variety of flat and nested data types occurring in real-world systems, with support for user-defined types

Lance’s type system is based on the Rust implementation of Arrow’s type system, arrow-rs, and Lance was itself rewritten from the ground up in Rust in 2023[3] (having originally been written in C++).
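
To make the connection to embeddings concrete: in Arrow’s type system, a 256-dimensional embedding column is naturally expressed as a fixed-size list of float32 values, which is roughly how a vector column in a Lance table is described. A small pyarrow sketch with an illustrative schema:

import pyarrow as pa

# An embedding column is just a fixed-size list of float32 in Arrow's type system
schema = pa.schema(
    [
        pa.field("id", pa.int64()),
        pa.field("description", pa.string()),
        pa.field("vector", pa.list_(pa.float32(), 256)),  # fixed-size list of length 256
    ]
)
print(schema)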

DataFusion#

DataFusion is a fast, extensible, embeddable query engine that supports a SQL API and can be used to query data stored in Arrow and Lance. In recent years, it has become a de facto standard for building domain-specific query engines that are decoupled from the storage layer, and because it’s written in Rust, it delivers excellent performance even without being natively integrated with that layer. Due to its extensibility, DataFusion has been repurposed in LanceDB to generate optimized query plans for full-text and vector search via a unified SQL interface.
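
In practice, this surfaces in LanceDB as SQL-style predicates that can be combined with vector (or FTS) queries. A hedged sketch is shown below; the column names follow the wine dataset and are assumed to exist in the table.

import lancedb

db = lancedb.connect("./wine_lancedb")
table = db.open_table("wines")  # assumes columns: country, price, vector, ...

# A SQL-style filter combined with a vector query; the filter string is parsed
# and executed by the embedded query engine
results = (
    table.search([0.1, 0.2, 0.25])
    .where("country = 'Italy' AND price < 50")
    .limit(5)
    .to_pandas()
)
print(results)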

Benchmarking against Elasticsearch#

Benchmarks are typically most useful when they answer a comparative question. Here, the question is: what kind of performance, latency profile, and operational shape can we expect in practice from an embedded engine like LanceDB? Elasticsearch is a useful reference point, since it’s widely used for exactly these use cases. To answer this question, two benchmark modes can be run:

  1. Direct async client calls (no HTTP/server overhead).
  2. FastAPI endpoints (end-to-end service behavior over HTTP).

In both modes, the benchmark driver uses the async Python clients for LanceDB and Elasticsearch with a fixed maximum concurrency.
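
The shape of the driver is roughly the same in both modes: fire the query suite through an async callable, capped by a semaphore at the maximum concurrency. A simplified sketch is shown below; run_query is a stand-in for either a direct client call or an HTTP request, and the real scripts live in the benchmark repo.

import asyncio
import time

async def run_suite(queries, run_query, max_concurrency=16):
    """Run all queries through `run_query` with bounded concurrency,
    returning per-query latencies in milliseconds."""
    sem = asyncio.Semaphore(max_concurrency)
    latencies = []

    async def one(query):
        async with sem:
            start = time.perf_counter()
            await run_query(query)
            latencies.append((time.perf_counter() - start) * 1000)

    await asyncio.gather(*(one(q) for q in queries))
    return latencies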

Dataset#

The dataset is the Wine Reviews dataset from Kaggle. It consists of 129,971 wine reviews from Wine Enthusiast magazine, made available as newline-delimited JSON. Refer to the Kaggle source for more information on the dataset and how it was scraped.

An example JSON line containing a wine review and its metadata is shown below.

{
    "id": 40825,
    "points": "90",
    "title": "Castello San Donato in Perano 2009 Riserva  (Chianti Classico)",
    "description": "Made from a blend of 85% Sangiovese and 15% Merlot, this ripe wine delivers soft plum, black currants, clove and cracked pepper sensations accented with coffee and espresso notes. A backbone of firm tannins give structure. Drink now through 2019.",
    "taster_name": "Kerin O'Keefe",
    "taster_twitter_handle": "@kerinokeefe",
    "price": "30.0",
    "designation": "Riserva",
    "variety": "Red Blend",
    "region_1": "null",
    "region_2": null,
    "province": "Tuscany",
    "country": "Italy",
    "winery": "Castello San Donato in Perano"
}
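
As a reference, the newline-delimited JSON can be loaded in a couple of lines with Pandas; the file name below is illustrative.

import pandas as pd

# One review per line, so lines=True is required
df = pd.read_json("winemag_reviews.jsonl", lines=True)
print(len(df))               # expect 129,971 rows
print(df.columns.tolist())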

Embedding model#

For vector search, queries are embedded using the nomic-ai/modernbert-embed-base model (256 dimensions).
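
A hedged sketch of how the query embeddings might be generated with sentence-transformers is shown below. The 256-dimensional output assumes Matryoshka-style truncation via the truncate_dim argument, and nomic embedding models generally expect a "search_query: " prefix on queries; check the model card for the exact conventions.

from sentence_transformers import SentenceTransformer

# Truncate the embeddings to 256 dimensions (Matryoshka-style truncation)
model = SentenceTransformer("nomic-ai/modernbert-embed-base", truncate_dim=256)

queries = ["search_query: vanilla and a hint of smokiness"]
vectors = model.encode(queries, normalize_embeddings=True)
print(vectors.shape)  # (1, 256)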

Shared benchmark queries#

Both systems use the same FTS and vector query lists, shown below. Each benchmark run samples 1000 queries at random (with replacement) from the corresponding file to form the query suite for that run.

Full-text search (FTS) queries#

The current benchmark uses simple term/phrase queries (one per line), rather than Lucene/Tantivy-style boolean symbols. These are used to search the description field of the dataset.

apple pear
tropical fruit
citrus almond
orange grapefruit
full bodied
citrus acidity
blueberry mint
beef lamb
shellfish seafood
vegetable fish

Vector search queries#

Ten vector search queries are defined as shown below. The first query below searches for documents whose description fields are similar to the vector representation of vanilla and a hint of smokiness. The second query searches for descriptions similar to dessert, sweetness and tartness in vector space. You get the idea!

vanilla and a hint of smokiness
rich and sweet dessert wine with balanced tartness
cherry and plum aromas
right balance of citrus acidity
grassy aroma with apple and tropical fruit
bitter with a dry aftertaste
sweet with a hint of chocolate and berry flavor
acidic on the palate with oak aromas
balanced tannins and dry and fruity composition
peppery undertones that pairs with steak or barbecued meat

Benchmark protocol#

Both benchmark modes use the same run protocol, averaged over 3 trials per search type:

  • Fixed query count: 1000 queries per search type (fts, vector) per trial
  • Fixed trial count: 3 trials per search type
  • Reported metrics: QPS, plus P50/P95/P99 latencies

There is intentionally no separate “serial vs concurrent” split here. Instead, we run each suite with a fixed maximum concurrency (shown in the result tables) and compare the two composition boundaries: direct client vs HTTP API.

At a high level, each system follows the same lifecycle: ingest the dataset, generate embeddings for the text fields, build both an FTS index and a vector index, and then execute the fixed-size query suites in the two modes above. The exact scripts and configuration live in the benchmark repo.
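
For reference, the reported metrics fall out of the per-query latencies in a few lines. A sketch is shown below (the seed and query count match the result tables; the real implementation lives in the benchmark repo).

import random
import statistics

def sample_queries(all_queries, k=1000, seed=37):
    # Sample the query suite with replacement, seeded for reproducibility
    return random.Random(seed).choices(all_queries, k=k)

def summarize(latencies_ms, elapsed_s):
    # statistics.quantiles with n=100 returns the 1st..99th percentile cut points
    q = statistics.quantiles(latencies_ms, n=100)
    return {
        "qps": len(latencies_ms) / elapsed_s,
        "p50_ms": q[49],
        "p95_ms": q[94],
        "p99_ms": q[98],
    }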

Results#

The results below are averaged over 3 runs for 1000 queries per search type (fts, vector).

Result 1: Direct async client#

This mode runs direct async client calls against LanceDB and Elasticsearch, isolating the search engine and embedding/runtime behavior without any FastAPI overhead.

LanceDB

search | queries | runs | success_avg | elapsed_s_avg | qps_avg | p50_ms_avg | p95_ms_avg | p99_ms_avg | max_concurrency | seed | warmup_queries
fts    | 1000    | 3    | 1000.00     | 0.6522        | 1534.34 | 10.18      | 14.28      | 15.92      | 16              | 37   | 10
vector | 1000    | 3    | 1000.00     | 10.3106       | 96.99   | 134.55     | 170.47     | 181.94     | 16              | 37   | 10

Elasticsearch

search | queries | runs | success_avg | elapsed_s_avg | qps_avg | p50_ms_avg | p95_ms_avg | p99_ms_avg | max_concurrency | seed | warmup_queries
fts    | 1000    | 3    | 1000.00     | 0.1681        | 5948.88 | 2.59       | 4.07       | 5.25       | 16              | 37   | 10
vector | 1000    | 3    | 1000.00     | 10.1893       | 98.14   | 110.37     | 212.83     | 222.35     | 16              | 37   | 10

Result 2: Through FastAPI endpoints#

This mode runs requests through FastAPI endpoints over HTTP, mimicking a scenario in the real world where we would typically integrate the search engine as part of a larger stack. In such cases, it makes more sense to measure end-to-end service behavior, even if it introduces a small additional overhead due to the REST API.
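
For context, the FastAPI layer is deliberately thin. A hedged sketch of what an FTS endpoint could look like is shown below; the route, table name and response shape are illustrative, and the actual service code is in the benchmark repo.

from fastapi import FastAPI
import lancedb

app = FastAPI()
db = lancedb.connect("./wine_lancedb")
table = db.open_table("wines")  # assumes an FTS index exists on the description field

@app.get("/search/fts")
def search_fts(q: str, limit: int = 10):
    # Full-text search over the indexed description field
    results = table.search(q, query_type="fts").limit(limit).to_list()
    return {"query": q, "results": results}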

LanceDB

search | queries | runs | success_avg | elapsed_s_avg | qps_avg | p50_ms_avg | p95_ms_avg | p99_ms_avg | max_concurrency | seed | warmup_queries
fts    | 1000    | 3    | 1000.00     | 0.7470        | 1338.69 | 11.49      | 16.29      | 18.40      | 16              | 37   | 10
vector | 1000    | 3    | 1000.00     | 10.4593       | 95.61   | 158.97     | 219.66     | 235.47     | 16              | 37   | 10

Elasticsearch

search | queries | runs | success_avg | elapsed_s_avg | qps_avg | p50_ms_avg | p95_ms_avg | p99_ms_avg | max_concurrency | seed | warmup_queries
fts    | 1000    | 3    | 1000.00     | 0.2899        | 3452.31 | 4.33       | 6.05       | 7.70       | 16              | 37   | 10
vector | 1000    | 3    | 1000.00     | 10.6418       | 93.97   | 203.52     | 224.66     | 256.44     | 16              | 37   | 10

Takeaways#

The main takeaway is that these numbers are best treated as directional, not absolute. Both LanceDB and Elasticsearch have many knobs that change the performance envelope (index types and hyperparameters, refresh/merge behavior, batch sizes, caching, and the embedding/runtime cost for vector search). In a composable stack, those knobs often live at different layers, so it’s important to interpret results as properties of the full system you deploy—not just the search algorithm in isolation.

It’s also worth zooming out from raw latency. In many practical applications, the difference between a 10 ms response and a 20 ms response isn’t user-visible—especially in RAG, where LLM latency is orders of magnitude larger. Reliability, operability, cost, and overall system complexity are often more significant to the developer experience. This is one of the main reasons systems like LanceDB are compelling: as an embedded engine, they can reduce operational complexity and let improvements in the underlying storage and query execution layers propagate up to the database layer.

The “propagation” effect in LanceDB becomes apparent when you inspect its source code. Apache Arrow (via arrow-rs) is at its core, providing a fast, interoperable in-memory columnar foundation; the Lance format builds on Arrow to deliver a storage format optimized for scans and random access on AI data; and DataFusion continues to improve its SQL query planning and execution under the hood. LanceDB then composes these well-established building blocks into an embedded search system.

Future developments in the Lance ecosystem#

As described in this post, LanceDB is a powerful addition to the search and retrieval landscape because of its refreshingly different internals and its approach to composability. It’s built on strong open-source foundations and offers both open-source and commercial products for a range of use cases.

As of writing this post, there are numerous ongoing developments in the open-source Lance format, in LanceDB itself, and in LanceDB Enterprise. It’s worth reading some recent posts and case studies on the LanceDB blog to get a sense of what’s happening in the ecosystem. The Lance format in particular continues to integrate with ever more parts of the data stack and is being adopted by a growing number of companies and projects.

Give LanceDB a try on your own data, and consider starring ⭐️ both the Lance format and LanceDB on GitHub.

And check out the two other posts in this series on embedded databases.

Code#

All the code required to reproduce the results from the benchmark is available here.


Footnotes#

  1. How Apache Arrow Is Changing the Big Data Ecosystem, thenewstack.io

  2. The Road to Composable Data Systems: Thoughts on the Last 15 Years and the Future, wesmckinney.com

  3. Please pardon our appearance during renovations, by Chang She, LanceDB blog

Embedded databases (3): LanceDB and the modular data stack
https://thedataquarry.com/blog/embedded-db-3
Author: Prashanth Rao
Published: November 20, 2023