The Data Quarry

Almost every data or AI application starts with a stretch of unglamorous work: turning raw data into something usable. That’s data processing, the computation and transformation that sits between a source and whatever consumes it downstream. Parsing a PDF, cleaning a messy table, chunking a document, embedding text, enriching records with a model call, reshaping JSON into rows: it’s the layer underneath almost everything, and it’s also where a surprising share of engineering time quietly disappears!

There’s no shortage of frameworks for this kind of work, but most of them share a familiar set of default assumptions about how it should be done. Run in batches. Flatten everything into DataFrames for faster compute. Wire up workflows at a regular cadence. While these choices aren’t exactly wrong, they quietly shape how an engineer thinks about building and maintaining reliable pipelines over months.

That’s why my eyes immediately lit up when I discovered CocoIndex, an open source, incremental data processing framework that makes a different set of choices. Like most tools I know and love, its core engine is written in Rust 🦀, and you can drive it from a Python SDK. It’s lightweight, local and runs in-process, with no separate server.

But what really won me over is CocoIndex’s incredible design philosophy and how it works under the hood. CocoIndex holds a handful of opinions about how data processing should work that, taken together, feel unusually well matched to where AI engineering is heading.

This post is a high-level take on what clicked for me and which aspects I like the most. A follow-up post will get into a concrete pipeline with real code, so stay tuned!

1. Incremental-first

The most obvious design choice that becomes apparent when you use CocoIndex is that it’s incremental by nature. Say you build a pipeline that loads documents, chunks them, embeds them, and pushes the results into a vector store. The first run is satisfying: data flows end to end, and searching the index returns relevant results. However, the data never sits still. New documents or records arrive, and existing ones get modified or disappear entirely, so the results get stale. Worse, data changes tend to occur sporadically, at no particular time of day.

The tried-and-tested way of the data engineer is to run a batch job on a fixed cadence: every night, every hour, every few minutes. A batch workload can usually tell which items changed, since it can identify them by their IDs and recompute just those. What it can’t escape is its own clock. Anything that arrives right after a batch job sits untouched until the next one runs. So, for data that trickles in unpredictably, your index spends long stretches quietly lagging reality. The gap stays invisible until someone queries and gets a stale result. Running the job more frequently to address this only means more compute and more time spent on bookkeeping to track what changed since the last run.

CocoIndex rethinks this by processing data the moment it comes in. When a source changes, the affected part of the pipeline reruns right then, instead of waiting for the next scheduled job. Freshness becomes continuous rather than discrete.

Batch vs. incremental data freshness over time
Data arrives sporadically (gold lines). Batch jobs result in a stale-data window until the next run; incremental pipelines immediately run as the data arrives.
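
To put rough numbers on that stale-data window, here is a minimal sketch. The arrival times and the hourly interval are made up for illustration, and `batch_staleness` is a hypothetical helper, not part of CocoIndex.

```python
import math

def batch_staleness(arrivals, interval):
    """Minutes each item waits until the next scheduled batch run picks it up."""
    waits = []
    for t in arrivals:
        # First scheduled run at or after the arrival time
        next_run = math.ceil(t / interval) * interval
        waits.append(next_run - t)
    return waits

# Hypothetical arrival times, in minutes after midnight
arrivals = [5, 61, 62, 170]
hourly = batch_staleness(arrivals, interval=60)
print(hourly)                     # [55, 59, 58, 10]
print(sum(hourly) / len(hourly))  # 45.5
```

With an incremental pipeline, the equivalent number is just the processing latency of each change, since nothing waits for a clock.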

There’s a nice side effect to working this way. Because you’re handling small changes as they happen instead of bulk-reprocessing the data on a schedule, each update touches fewer records and tends to need less compute at any given moment. The real win with incremental processing is continuous freshness; the lighter compute footprint is a bonus that falls out of doing the work incrementally.

This is a deliberate design choice in CocoIndex. Once incremental updates are the default instead of something you reach for with extra scheduling and bookkeeping, it shapes everything else about how the framework behaves. And it only pays off if the system can reliably tell exactly what changed in the first place - that’s the next principle.

2. Change detection for messy AI data

So how does CocoIndex know what actually changed? The database world has had an answer for decades: change data capture (CDC). A database tails an internal transaction log and emits a tidy stream of row-level inserts, updates, and deletes. While CDC is elegant, it’s built for a very particular world: structured rows, all inside one system.

AI data rarely follows that pattern. When building a typical AI application, the data can be a mix of a folder of images and PDFs, a pile of markdown, and JSON blobs from a flaky API. Even more challenging: downstream of the source data, you often have a tangle of transformations and computations that derive new outputs from the inputs (e.g., embeddings of text or images). A change to one source object might affect only some outputs but not others, and it’s not always obvious which ones.

CocoIndex brings a “CDC mindset” to this messy world, built on one guiding idea: it identifies content by what it is, not where it is. Its default mechanism is content-based hashing. As it scans a source, it computes a compact fingerprint of each object (a 128-bit BLAKE2b hash of the serialized contents) and compares it against the fingerprint it stored on the previous run. If they match, the object is untouched and the cached result is reused; if they differ, only that object and whatever derives from it get recomputed and written to the target.

How the content hash decides what reaches the target
CocoIndex compares each object's content hash to the previous run. doc-a is untouched and notes.md is doc-a renamed, so neither is written; only doc-b, whose content changed, is recomputed.
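
The idea in the figure can be sketched in a few lines of plain Python. This is a simplified illustration using the standard library's `hashlib.blake2b`, not CocoIndex's internal code: the `fingerprint` and `changed_objects` helpers and the document contents are assumptions made up for the example.

```python
import hashlib

def fingerprint(content: bytes) -> str:
    # 128-bit BLAKE2b digest (digest_size is in bytes, so 16 bytes = 128 bits)
    return hashlib.blake2b(content, digest_size=16).hexdigest()

def changed_objects(previous: set[str], current: dict[str, bytes]) -> list[str]:
    """Return the names whose content fingerprint was not seen on the previous run."""
    return [name for name, content in current.items()
            if fingerprint(content) not in previous]

# Previous run: doc-a and doc-b (hypothetical contents)
previous = {fingerprint(b"alpha"), fingerprint(b"beta v1")}

# Current run: doc-a untouched, notes.md is doc-a renamed, doc-b edited
current = {"doc-a": b"alpha", "notes.md": b"alpha", "doc-b": b"beta v2"}
print(changed_objects(previous, current))  # ['doc-b']
```

Because the lookup is keyed on content rather than on path, the rename costs nothing: notes.md hashes to a fingerprint that is already known.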

Content hashing only watches your data, not your logic. If you change how you chunk a document, or swap your embedding model for another one, every source object is still byte-for-byte identical, so content hashing sees no change, yet every derived output is now stale. CocoIndex handles this with a second fingerprint that tracks processing logic. It hashes each transformation from its parsed syntax tree and checks that hash alongside the content hash when deciding whether to reuse a cached result. Change the data and the affected outputs rebuild; change the logic in the code and CocoIndex recomputes the derived outputs even though nothing in the source moved.
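
To get a feel for why hashing the syntax tree is a sensible choice, here is a rough sketch with Python's built-in `ast` module. It is an assumption about the general technique, not CocoIndex's actual implementation, and `logic_fingerprint` is a hypothetical helper.

```python
import ast
import hashlib

def logic_fingerprint(source: str) -> str:
    # Parse to an AST so comments and whitespace do not affect the hash
    tree = ast.parse(source)
    return hashlib.blake2b(ast.dump(tree).encode(), digest_size=16).hexdigest()

v1 = "def chunk(text):\n    return text[:500]\n"
v1_commented = "def chunk(text):\n    # first 500 characters\n    return text[:500]\n"
v2 = "def chunk(text):\n    return text[:1000]\n"

print(logic_fingerprint(v1) == logic_fingerprint(v1_commented))  # True
print(logic_fingerprint(v1) == logic_fingerprint(v2))            # False
```

Adding a comment leaves the fingerprint alone, while changing the chunk size produces a new one, which is exactly the signal needed to invalidate cached results.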

And because CocoIndex tracks lineage (which outputs came from which source object), it knows exactly what to retire when something is deleted or replaced. Stale rows get cleaned out as part of the same update instead of quietly accumulating.
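
A toy version of that bookkeeping might look like the following. The `Lineage` class and the row IDs are hypothetical and only meant to show the shape of the idea.

```python
from collections import defaultdict

class Lineage:
    """Remembers which target rows were derived from which source object."""

    def __init__(self):
        self.outputs_by_source = defaultdict(set)

    def record(self, source_id: str, output_ids: set[str]) -> set[str]:
        """Store the new outputs for a source; return outputs that are now stale."""
        stale = self.outputs_by_source[source_id] - output_ids
        self.outputs_by_source[source_id] = set(output_ids)
        return stale

    def delete(self, source_id: str) -> set[str]:
        """A deleted source retires everything derived from it."""
        return self.outputs_by_source.pop(source_id, set())

lineage = Lineage()
lineage.record("doc-b", {"doc-b#0", "doc-b#1", "doc-b#2"})
# doc-b shrinks to two chunks: the third row must be removed from the target
print(lineage.record("doc-b", {"doc-b#0", "doc-b#1"}))  # {'doc-b#2'}
print(sorted(lineage.delete("doc-b")))                  # ['doc-b#0', 'doc-b#1']
```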

3. Declarative and reactive

CocoIndex is a very clever blend of declarative and reactive design principles.

The declarative part is all about saving developer effort. You declare upfront how the target should look once the pipeline finishes, and write the functions that do the actual work. At any point, your focus is on how the data should look - the system handles everything to do with change capture, recomputation, and performance.

More concretely, what you actually declare in CocoIndex are processing components: the unit of incremental execution that syncs sources to the target as a whole. You decide how big each component is, and it’s a real design lever. For a folder of PDFs, the boundary can be coarse (one component for all the files), medium (one per file), or fine (one per chunk). Finer components push each piece to the target the moment it’s ready and isolate a failure to that piece; coarser components sync more state together as a single unit.

Processing component granularity: coarse, medium, and fine
A processing component is both the unit of execution and the sync boundary, and you choose its size: fine (per chunk) syncs each piece as soon as it's ready; coarse (all files) syncs everything together.
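
One way to picture the granularity choice is as a grouping of chunk IDs into sync units. The sketch below is purely illustrative: the `components` function and the file names are hypothetical, and it says nothing about how CocoIndex declares components in practice.

```python
def components(files: dict[str, list[str]], granularity: str) -> list[list[str]]:
    """Group chunk IDs into sync units; each inner list reaches the target together."""
    if granularity == "coarse":   # one component for all files
        return [[c for chunks in files.values() for c in chunks]]
    if granularity == "medium":   # one component per file
        return [list(chunks) for chunks in files.values()]
    if granularity == "fine":     # one component per chunk
        return [[c] for chunks in files.values() for c in chunks]
    raise ValueError(f"unknown granularity: {granularity}")

files = {"a.pdf": ["a#0", "a#1"], "b.pdf": ["b#0"]}
for level in ("coarse", "medium", "fine"):
    print(level, len(components(files, level)))
# coarse 1
# medium 2
# fine 3
```

A failure inside one unit only holds back the chunks in that unit, which is why finer boundaries isolate problems better.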

The reactive part is a pattern most developers already know from spreadsheets: you write a formula, and when an upstream cell changes, the dependent cells recompute on their own. CocoIndex implements that same idea in a stateful manner for arbitrary data transformations (akin to how Kubernetes manages infrastructure). It’s constantly comparing the desired state to the actual state and doing whatever work closes the gap. Your transformations form a dependency graph, so paired with change detection and lineage, a source change traces to exactly the outputs downstream of it, while everything unrelated stays put.
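
The spreadsheet analogy boils down to a graph walk. Here is a minimal sketch, assuming a hand-written dependency graph; the node names and the `affected` helper are made up for illustration.

```python
from collections import deque

def affected(graph: dict[str, list[str]], changed: str) -> set[str]:
    """Breadth-first walk from a changed node to everything derived from it."""
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Hypothetical graph: edges point from an input to what is derived from it
graph = {
    "doc-a": ["doc-a/chunks"],
    "doc-a/chunks": ["doc-a/embeddings"],
    "doc-b": ["doc-b/chunks"],
    "doc-b/chunks": ["doc-b/embeddings"],
}
print(sorted(affected(graph, "doc-b")))  # ['doc-b/chunks', 'doc-b/embeddings']
```

Nothing under doc-a appears in the result, so its chunks and embeddings stay put.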

Expressing a pipeline as “what should be true?” rather than “how to make it true?” turns it into a statement of intent: easier for humans and coding agents to reason about.

4. Nested data is first-class

Real-world data is rarely born as a clean table. A product has variants and a list of description sections; a document has nested headings; a conversation has turns; an image carries its own metadata. Yet a lot of tooling tends to flatten all of that into DataFrame-like layouts before you can do anything with it. Take a single product with two variants and two description sections:

{
  "product_id": "A123",
  "title": "Trail Runner",
  "image": { "url": "img/a123.jpg", "alt": "Trail Runner, side view" },
  "variants": [
    { "sku": "A123-S-RED", "size": "S", "color": "red" },
    { "sku": "A123-M-RED", "size": "M", "color": "red" }
  ],
  "sections": [
    { "heading": "Overview", "text": "A lightweight trail shoe..." },
    { "heading": "Specs",    "text": "Weight: 240g. Drop: 8mm..." }
  ]
}

To force that into one flat table, you take the cross product of variants and sections: the row count explodes and the top-level fields repeat on every line.

product_id | title        | sku        | size | color | heading  | text
A123       | Trail Runner | A123-S-RED | S    | red   | Overview | A lightweight...
A123       | Trail Runner | A123-S-RED | S    | red   | Specs    | Weight: 240g...
A123       | Trail Runner | A123-M-RED | M    | red   | Overview | A lightweight...
A123       | Trail Runner | A123-M-RED | M    | red   | Specs    | Weight: 240g...

Two variants and two sections already become four rows for one product, and every extra variant or section multiplies it further.
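
The arithmetic is easy to check with `itertools.product`. The `flat_rows` helper is a hypothetical one-liner added only to show the growth.

```python
from itertools import product

variants = ["A123-S-RED", "A123-M-RED"]
sections = ["Overview", "Specs"]

# Flattening pairs every variant with every section
rows = [("A123", "Trail Runner", sku, heading)
        for sku, heading in product(variants, sections)]
print(len(rows))  # 4

# Row count is |variants| x |sections|, so it grows multiplicatively
def flat_rows(n_variants: int, n_sections: int) -> int:
    return n_variants * n_sections

print(flat_rows(6, 5))  # 30
```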

Flattening data like this has its place, especially for analytical workloads in columnar stores: aggregating or scanning a single column across millions of rows is cheap when the values sit contiguously and the work can be vectorized. But plenty of AI workloads aren’t analytical. Agentic search and retrieval are dominated by point-access queries: “fetch this product with its variants and sections,” or a semantic search that returns specific chunks of a document.

The shape mismatch between nested and tabular structures also shows up at the compute boundary, where your mental model is object-shaped: you embed each image, chunk each section, enrich each variant. CocoIndex lets you keep the nested structure all the way through your transforms, so you can reason about the data in its natural shape, without worrying about performance implications.

The snippet below shows how CocoIndex lets Python developers use familiar dataclasses and lists. These are just the declarative surface: CocoIndex’s Rust core maps them to an efficient internal representation and runs the actual work, applying batching, concurrency, memoization, and incremental execution directly over the nested data. So the code and logic stay readable, and the nesting costs you nothing at runtime.

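from dataclasses import dataclass
from typing import Annotated

from numpy.typing import NDArray

# EMBEDDER (the embedding annotation used below) is assumed to be defined elsewhere in the pipeline.

@dataclass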
class Variant:
    sku: str
    size: str
    color: str

@dataclass
class Section:
    heading: str
    text: str

@dataclass
class Product:
    product_id: str
    title: str
    variants: list[Variant]                       # stays nested
    sections: list[Section]                        # stays nested
    image_embedding: Annotated[NDArray, EMBEDDER]  # vector, typed

As a bonus, CocoIndex quietly hands you a convenient Pythonic “map-reduce” abstraction. The function is the only primitive you need: coco.map fans a function out across the items of a nested field, running them concurrently and memoizing each one as required, then gives you back an ordinary list of results. The “reduce” is just plain Python that does the required task (e.g., join the text).

# map: format each section concurrently (CocoIndex parallelizes + caches each)
section_texts = await coco.map(format_section, product.sections)

# reduce: ordinary Python — join into one document
document = "\n\n".join(section_texts)

5. Extensible by design

AI pipelines are almost never contained inside one neat system. The sources are a mixed bag: local files, an object store, a Postgres table, a SaaS API, a message queue. The targets are just as varied: a vector store, a shared drive, a relational database, a graph. CocoIndex ships with a healthy set of built-in connectors for these (local files, Postgres, LanceDB, Google Drive, object storage, and more), so most common setups work out of the box.

What happens when your setup isn’t common? At no point does CocoIndex restrict what a source or target can be. A source is just something that can enumerate its items; a target is something you teach to reconcile its desired rows against what’s already there, via a small TargetHandler you implement. If the connector you need doesn’t exist yet, you write it against the very same primitives the built-ins use.

The payoff is that a custom connector is never a second-class citizen. The moment you plug it in, it inherits everything the engine already does: content-based change detection, incremental updates, concurrency, memoization, and lineage tracking. You implement the thin slice that’s specific to your system (how to list items, how to write a row) and get the hard part for free.

Any source, any target: custom connectors inherit the CocoIndex engine
The boundaries are open: a custom source implements items(), a custom target implements reconcile(). Both (gold dashed) inherit the same engine as the built-ins.
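
To make the two boundaries concrete, here is a toy sketch. The method names `items()` and `reconcile()` come from the figure, but the signatures, the `Source` and `Target` protocols, and the `ListSource`, `DictTarget`, and `sync` helpers are all assumptions made for illustration. They are not CocoIndex's actual connector or TargetHandler interfaces.

```python
from typing import Iterable, Protocol

class Source(Protocol):
    def items(self) -> Iterable[tuple[str, bytes]]: ...  # (key, content) pairs

class Target(Protocol):
    def reconcile(self, desired: dict[str, bytes]) -> None: ...

class ListSource:
    def __init__(self, records: dict[str, bytes]):
        self.records = records

    def items(self):
        return self.records.items()

class DictTarget:
    """Keeps an in-memory dict equal to the desired rows, counting writes."""

    def __init__(self):
        self.rows: dict[str, bytes] = {}
        self.writes = 0

    def reconcile(self, desired):
        for key in set(self.rows) - set(desired):  # retire rows no longer desired
            del self.rows[key]
        for key, value in desired.items():         # upsert only new or changed rows
            if self.rows.get(key) != value:
                self.rows[key] = value
                self.writes += 1

def sync(source: Source, target: Target) -> None:
    target.reconcile(dict(source.items()))

target = DictTarget()
sync(ListSource({"a": b"1", "b": b"2"}), target)
sync(ListSource({"a": b"1", "b": b"3"}), target)  # only "b" changed
print(target.writes)  # 3
```

The second sync writes a single row, because the target only touches what differs from its current state.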

This is what makes the “CDC for messy AI data” idea generalize. The change detection and compute logic were never tied to a particular database or file format: they work for anything you can enumerate and anything you can reconcile.

Bringing it all together

Step back, and the five design principles stop looking like separate features. They complement one another really well:

  1. Being an incremental-first framework keeps data continuously fresh, instead of letting it drift between batch runs.

  2. Capturing change through content and logic fingerprinting tells the engine exactly what to recompute, and what to leave alone.

  3. Exposing a declarative, reactive programming model lets you describe the target and have the runtime reconcile it for you.

  4. Treating nested data as first-class respects the real shape of your objects instead of flattening them too early.

  5. Staying extensible at the source and target boundaries lets the whole approach reach the systems you actually use.

This coherence matters more now than it would have a few years ago. AI agents are only as good as the context they’re handed, and that context is typically the result of a well-engineered pipeline: documents parsed, chunked, embedded, enriched, and kept in sync with sources that never stop changing. Agents don’t like to be kept waiting for data, and they need it to be fresh.

As these pipelines sprawl across more sources, more transforms, and more model calls, keeping their output fresh and correct by hand becomes the dominant cost. CocoIndex pushes that cost down, saving not only compute, but also developer time, which is precisely what context engineering needs in order to scale.

What you declare vs. what the CocoIndex runtime handles
You describe the target you want; the runtime handles the mechanics of getting there and keeping it fresh.

After several months of using CocoIndex, I’ve found it genuinely pleasant to adopt. It’s open source and lightweight, the core is all in Rust 🦀 (so performance is at the forefront), and it’s quite simple to get started writing apps in Python. It runs in-process with no separate server to stand up or operate, so you can just drop it into an existing environment without taking on a new piece of infrastructure.

That’s it for the design philosophy! In the next post, I’ll come back with some implementation details, showing a real pipeline, with a focus on the primitives that CocoIndex exposes. Till then, it’s worth reading the well-written CocoIndex docs and giving the project a star ⭐️ on GitHub.

5 reasons I love CocoIndex, and why you should try it
https://thedataquarry.com/blog/5-reasons-i-love-cocoindex-and-why-you-should-try-it
Author Prashanth Rao
Published at June 19, 2026