Why Pydantic just keeps getting better

2023-12-02

12x possible speedup over v1, with more to come#

Pydantic is a data parsing, transformation and validation library for Python that’s become integral to the PyData ecosystem. Version 2 of Pydantic, which came out in June 2023, had its core rewritten in Rust, which I described in some detail in my previous post.

It’s been fascinating to watch each successive release of Pydantic v2, as each one has shown incremental performance improvements over the last.

Figure: Performance comparison of each successive Pydantic v2 release

As can be seen, there’s not only a 12x speedup over v1.10 in the latest release (2.5.2 as of writing this post) – there’s also a 12% improvement over the first v2 release. These improvements are likely due to a variety of optimizations and new features at the lower levels in Rust.

The goal of this post is to describe the details of the benchmark whose results are shown above, and also to uncover the underlying components of pydantic-core (the part that’s written in Rust) that contribute to the overall performance improvement that we see at the Python level.

It’s assumed you know what the purpose of Pydantic is, and have some familiarity with its API – if not, check out the first post in this series.

Uncovering the Pydantic stack#

Before we go into the benchmark and the reasons for the performance gain, it’s worth spending a bit of time to really understand Pydantic’s internals.

As we know, v2 has a new design that splits it up into two Python packages:

  • pydantic: The user-facing package, written in Python
  • pydantic-core: A Python package containing the core serialization and validation functionality, implemented in Rust

The Python API that we, the end users, interface with is built directly on top of pydantic-core. However, it’s important to understand that pydantic-core itself is not a monolithic entity.

Figure: The multi-layer stack of Pydantic v2

The lowest layers of Pydantic are composed of a multitude of Rust crates that form what we know as pydantic-core. The top (user-facing) layer is the Python API. The intermediate layers, PyO3 and maturin, are explained in more detail below.

Rust crates#

The lowest layer is made up of several special-purpose crates:

  • speedate: a datetime and time duration parser that operates at the Rust level
  • jiter: an iterable JSON parser that offers multiple interfaces for handling JSON data (enums, iterators and strings) in Rust
  • regex: the regex engine maintained under the rust-lang organization

There are many other base crates as well, coming together to compose what we know as pydantic-core.
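For example, when Pydantic validates a datetime or a duration from a string, it’s speedate doing the parsing underneath. A minimal sketch of this at the Python level (the example values are illustrative):

from datetime import datetime, timedelta

from pydantic import TypeAdapter

# Both strings below are parsed by speedate at the Rust level
dt = TypeAdapter(datetime).validate_python("2023-12-02T10:30:00Z")
td = TypeAdapter(timedelta).validate_python("P3DT4H")  # ISO 8601 duration

print(dt)  # 2023-12-02 10:30:00+00:00
print(td)  # 3 days, 4:00:00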

Earlier versions of Pydantic v2 relied on the battle-tested serde_json Rust crate for JSON parsing (the same crate that powers orjson, a popular Python JSON library). serde_json is among the most downloaded Rust crates, with almost 200 million downloads as of 2023, so the fact that the Pydantic team decided to move away from such a mature package in the Rust ecosystem to write their own JSON parser indicates how critical this part of the stack is to Pydantic’s performance.

As Samuel Colvin, the creator of Pydantic, explains:

orjson uses serde-json to deserialize JSON, then takes some serious liberties with the Python C API/PyO3 FFI interface for performance. jiter, an alternative to serde_json, is significantly faster, and also allows you to get the position of a value even if JSON parsing passes – so Pydantic validation errors can show the JSON file positions.

With JSON being the key serialization and deserialization component in Pydantic workflows, and with Pydantic being used to validate ever larger amounts of data, every ounce of performance matters!
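To see what this means at the Python level, consider a minimal sketch (the Item model here is a toy stand-in, not part of the benchmark). Handing raw JSON straight to pydantic-core lets it parse and validate in a single pass at the Rust level, instead of first building Python objects with json.loads:

import json

from pydantic import BaseModel

class Item(BaseModel):
    id: int
    name: str

raw = '{"id": 1, "name": "Chianti"}'

# Slower path: parse JSON in Python, then validate the resulting dict
item = Item.model_validate(json.loads(raw))

# Faster path: pydantic-core parses the raw JSON (via jiter) and
# validates it in a single pass, entirely at the Rust level
item = Item.model_validate_json(raw)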

PyO3#

PyO3 is a Rust crate that provides bindings between Python and Rust, allowing Rust code to be called from Python. It began as a fork of rust-cpython, which offers a Rust wrapper around the libpython C API. PyO3 is now a mature project with a large community, and is the most common way to write performant Python extensions in Rust.

Recent versions of Pydantic have begun applying profile-guided optimization (PGO) during pydantic-core compilation. As described in the docs, PGO collects data about the typical execution of a binary and uses this data to inform optimizations such as inlining, machine code layout, register allocation, and branch prediction. The result is that the pydantic-core bindings, even prior to being called in Python, are already optimized for the typical execution patterns of Pydantic.

Maturin#

Maturin is a utility that’s used to generate native Python modules (wheels) from Rust crates that were built with PyO3. The result of using Maturin is that the underlying Rust code in pydantic-core can be made available as a Python package that’s itself imported into pydantic, to be available to end users.

Pydantic#

The user-facing part of Pydantic, written in Python, is essentially a wrapper around pydantic-core. None of the heavy lifting is done at the Python level: all the parsing, validation and serialization happens at the Rust level, so Python users can go about their business as usual without worrying about Rust.
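This layering is visible from the interpreter: pydantic_core is just another importable Python module that pydantic sits on top of. A trivial check (version numbers will vary):

import pydantic
import pydantic_core

print(pydantic.VERSION)           # e.g. 2.5.2
print(pydantic_core.__version__)  # the Rust core's own version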


Performance benchmark#

This section covers the benchmark that was run to study the performance improvements in each major release of v2, with the reference version (for comparison) being v1.10. The benchmark code is available here.

Dataset#

The dataset used is a familiar one if you’ve read my previous blogs: A wine reviews dataset from Kaggle. It consists of 129,971 wine reviews from the Wine Enthusiast magazine, made available in newline-delimited JSON as shown below. Refer to the Kaggle source for more detailed information on the dataset and how it was scraped.

An example JSON line representing a single wine review, when read into Python, is shown below. Note how the comments in each line specify what we want, vs. what’s actually present in the data.

{
    "id": 40825,  # This field is compulsory (not nullable)
    "points": "90",  # Bad value: This should be an integer
    "title": "Castello San Donato in Perano 2009 Riserva  (Chianti Classico)",
    "description": "Made from a blend of 85% Sangiovese and 15% Merlot, this ripe wine delivers soft plum, black currants, clove and cracked pepper sensations accented with coffee and espresso notes. A backbone of firm tannins give structure. Drink now through 2019.",
    "taster_name": "Kerin O'Keefe",
    "taster_twitter_handle": "@kerinokeefe",
    "price": "30.0",  # Bad value: This should be a float
    "designation": "Riserva", # Rename this to "vineyard"
    "variety": "Red Blend",
    "region_1": "null", # Drop string 'null' fields
    "region_2": None, # Drop None fields
    "province": "Tuscany",
    "country": "Italy",  # Missing country is unacceptable: replace with "Unknown
    "winery": "Castello San Donato in Perano"
}

Simple validator#

The first benchmark makes use of a simple validator based on familiar Pydantic concepts – fields, validators and models. The validator is shown below.

schemas.py
from pydantic import BaseModel, ConfigDict, Field, model_validator

class Wine(BaseModel):
    model_config = ConfigDict(
        populate_by_name=True,
        str_strip_whitespace=True,
    )

    id: int
    points: int
    title: str
    description: str | None
    price: float | None
    variety: str | None
    winery: str | None
    designation: str | None = Field(None, alias="vineyard")
    country: str | None
    province: str | None
    region_1: str | None
    region_2: str | None
    taster_name: str | None
    taster_twitter_handle: str | None

    @model_validator(mode="before")
    def _remove_unknowns(cls, values):
        "Set other fields that have the value 'null' as None so that we can throw it away"
        for field in ["region_1", "region_2"]:
            if not values.get(field) or values.get(field) == "null":
                values[field] = None
        return values

    @model_validator(mode="before")
    def _fill_country_unknowns(cls, values):
        "Fill in missing country values with 'Unknown', as we always want this field to be queryable"
        country = values.get("country")
        if not country or country == "null":
            values["country"] = "Unknown"
        return values

The goal of the validator is to coerce bad data types to the types we want, and to drop fields that are unwanted or missing. In addition, the model_config is used to strip whitespace from all string fields, and to populate specific fields by an alias. The model_validator class methods modify the raw dict before validation runs.
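As a quick illustration (a sketch based on the sample record from earlier, with the description shortened), validating a single record coerces the bad types, applies the alias and fills the unknown country:

record = {
    "id": 40825,
    "points": "90",              # str, coerced to int
    "title": "Castello San Donato in Perano 2009 Riserva (Chianti Classico)",
    "description": "Soft plum, clove and cracked pepper sensations.",
    "taster_name": "Kerin O'Keefe",
    "taster_twitter_handle": "@kerinokeefe",
    "price": "30.0",             # str, coerced to float
    "designation": "Riserva",    # dumped as "vineyard" via the alias
    "variety": "Red Blend",
    "region_1": "null",          # set to None, then excluded from the dump
    "region_2": None,            # excluded from the dump
    "province": "Tuscany",
    "country": None,             # replaced with "Unknown"
    "winery": "Castello San Donato in Perano",
}

wine = Wine(**record)
print(wine.model_dump(exclude_none=True, by_alias=True))
# {'id': 40825, 'points': 90, ..., 'price': 30.0, 'vineyard': 'Riserva', 'country': 'Unknown', ...}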

Improved validator#

The improved validator makes use of some new features in Pydantic v2 as shown below.

schemas_improved.py
from pydantic import BeforeValidator, TypeAdapter, constr, field_validator
from pydantic_core import PydanticOmit
from typing_extensions import Annotated, NotRequired, TypedDict

not_required_fields = ["region_1", "region_2"]

def exclude_none(s: str | None) -> str:
    if s is None:
        # since we want `exclude_none=True` in the end,
        # just omit it if it's `None` during validation
        raise PydanticOmit
    else:
        return s

# Reusable annotated type: the field is omitted when its value is None
ExcludeNoneStr = Annotated[str, BeforeValidator(exclude_none)]

class Wine(TypedDict):
    id: int
    points: int
    title: str
    description: NotRequired[str]
    price: NotRequired[float]
    variety: NotRequired[str]
    winery: NotRequired[str]
    designation: NotRequired[constr(strip_whitespace=True)]
    country: NotRequired[str]
    province: NotRequired[str]
    region_1: NotRequired[str]
    region_2: NotRequired[str]
    taster_name: NotRequired[str]
    taster_twitter_handle: NotRequired[str]

    @field_validator(*not_required_fields, mode="before")
    def omit_null_none(cls, v):  # type: ignore
        "Omit the field entirely if its value is None or the string 'null'"
        if v is None or v == "null":
            raise PydanticOmit
        else:
            return v

    @field_validator("country", mode="before")
    def country_unknown(cls, s: str | None) -> str:  # type: ignore
        "Replace a missing or 'null' country with 'Unknown'"
        if s is None or s == "null":
            return "Unknown"
        else:
            return s

WinesTypeAdapter = TypeAdapter(list[Wine])

There are a number of differences between this validator and the previous one:

  • The Wine model in the improved version is a TypedDict instead of a BaseModel. This is a new feature in Pydantic v2 that allows us to define a schema using Python’s native TypedDict type (which, at runtime, is a plain dict)

  • field_validator is used instead of model_validator. These operate only on specific fields. We apply the NotRequired type to every field that we want to be able to omit from the final output when its value is either None or 'null', and the omission itself is triggered by raising the new PydanticOmit exception

  • A TypeAdapter is used, which exposes only some of the functionality of BaseModel. This is much more performant, as it avoids the overhead of the BaseModel class when simply performing validation on a dict.

Because these features listed above are unique to Pydantic v2, the improved validator can only be run on v2.x and not on v1.
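As a quick sketch of the improved validator in action (using a trimmed-down record), note that the result is a plain dict and that omitted fields are simply absent:

records = [
    {
        "id": 40825,
        "points": "90",       # coerced to int, as before
        "title": "Castello San Donato in Perano 2009 Riserva",
        "price": "30.0",      # coerced to float
        "region_1": "null",   # omitted entirely via PydanticOmit
        "country": None,      # replaced with "Unknown"
    },
]

validated = WinesTypeAdapter.validate_python(records)
print(validated[0])
# {'id': 40825, 'points': 90, 'title': '...', 'price': 30.0, 'country': 'Unknown'}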

Tip

As I learned from Samuel Colvin himself while making this benchmark, validation workflows such as this one largely spend their time in serialization, i.e., converting a dict to a Pydantic model and back to a dict for downstream use. The model here is also rather simple, with no nested fields or complex validation logic. In such cases, TypedDict coupled with TypeAdapter is a better choice for performance than BaseModel.

Benchmark code#

The benchmark is run using pytest-benchmark. The data is first loaded into a test fixture.

from pathlib import Path
from typing import Any

import pytest

from util import get_json_data  # A helper function to load data via srsly

@pytest.fixture(scope="session")
def data() -> list[dict[str, Any]]:
    """Load the data once per session"""
    DATA_DIR = Path("/path/to/data")
    FILENAME = "winemag-data-130k-v2.jsonl.gz"
    data = list(get_json_data(DATA_DIR, FILENAME))
    return data
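The get_json_data helper isn’t shown here; a minimal stand-in using only the standard library (the benchmark repo uses srsly) might look like this:

import gzip
import json
from pathlib import Path
from typing import Any, Iterator

def get_json_data(data_dir: Path, filename: str) -> Iterator[dict[str, Any]]:
    """Read a gzipped, newline-delimited JSON file one record at a time"""
    with gzip.open(data_dir / filename, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)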

The key-value pairs from each JSON record in the dataset are unpacked as kwargs and passed to the validator to produce a Pydantic model, as shown below. The model is then dumped back to a dict via model_dump for downstream use.

basic_validator
def validate(data):
    """Validate a list of JSON blobs against the Wine schema"""
    validated_data = [Wine(**item).model_dump(exclude_none=True, by_alias=True) for item in data]
    return validated_data

# Run validator
def test_validate(benchmark, data):
    """Validate the data"""
    result = benchmark(validate, data)
    assert len(result) == len(data)

The improved validator is run directly from the TypeAdapter we defined earlier. Note that we don’t need to unpack each dict into kwargs and construct a model instance, as TypeAdapter can operate directly on a list of dicts, which avoids unnecessary overhead.

improved_validator
def test_validate_improved(benchmark, data):
    """Validate a list of JSON blobs against the Wine schema"""
    result = benchmark(WinesTypeAdapter.validate_python, data)
    assert len(result) == len(data)

Note how the validate_python method is used instead of constructing a Pydantic model from each dict. This is much better for performance and, as a rule of thumb, should be used when the only goal is to validate a list of dicts against a schema.
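Relatedly, if the input were still raw JSON bytes rather than already-loaded dicts, TypeAdapter.validate_json would push the JSON parsing down to the Rust level as well. A small sketch:

raw = b'[{"id": 40825, "points": "90", "title": "Riserva"}]'
validated = WinesTypeAdapter.validate_json(raw)  # parsing + validation in one pass at the Rust level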

The benchmark is then run via benchmark_validator.py.

bash
pytest benchmark_validator.py --benchmark-sort=fullname --benchmark-warmup-iterations=5 --benchmark-min-rounds=10

Reasons for performance improvement#

The performance improvements can be attributed to multiple causes, broadly categorized as follows:

Improvements at the Rust level#

  • PGO applied at the PyO3 level during pydantic-core compilation
  • Usage of a fast, custom JSON parser, jiter, which replaces serde_json and is tuned to Pydantic’s JSON parsing requirements

Improvements at the Python level#

  • Using TypedDict and TypeAdapter instead of BaseModel, which avoids the overhead of constructing model instances when handing Python dicts to the Rust level (via the PyO3 bindings) for parsing and validation
  • Using field_validator instead of model_validator, so that validation logic runs only on the specific fields that need it
  • Using the NotRequired type and the new PydanticOmit exception in v2 to skip fields whose value is None or 'null'. This happens during the parsing stage, prior to validation, which is faster.

Towards a nogil future#

Going forward, there are even more improvements happening at a very low level, related to PyO3 and how to work around the Python Global Interpreter Lock (GIL).

In the near term, an upcoming release of pydantic-core aims to incorporate improvements to PyO3’s GIL handling. A further 20-30% performance improvement is expected from this, available to end users for free, with zero code changes required. 🤯

Over the longer term, Python 3.13 aims to make the GIL optional (per PEP 703), which would allow the Python code at the top layer of Pydantic to perform operations within pydantic-core in a way that’s completely free of the GIL. This is fondly referred to as the “nogil” future of Python, and it has the potential to be a game-changer for Pydantic, as well as a host of other Python libraries that can benefit from true multi-threaded code execution.

Conclusions#

Because of these low-level developments happening quietly under the hood, largely at the Rust level, end users can expect to see substantial performance improvements in their Pydantic code over time. Considering the billions (or trillions) of times a day that Pydantic code executes on servers all over the world, every ounce of efficient computation is a win for companies that run the code, and for the planet as a whole. 🌎

I hope this post has gotten you as excited as I am about the future of Pydantic, and, in general, the future of Rust in the PyData ecosystem. A lot of this is enabled by the work of David Hewitt, who recently joined Pydantic, and is simultaneously pushing the boundaries of what’s possible with PyO3 and Pydantic.

Have fun porting your Pydantic code to v2, and let’s keep tracking and celebrating the performance gains! 🚀

Code#

All code for the benchmarks shown in this post is available here.

Acknowledgements#

Special thanks to Samuel Colvin for taking the time to create a PR for the improved validation logic.

Open source is truly a wonderful thing, so if you’re a team that hugely benefits from Pydantic, consider sponsoring the project to help it thrive. 🫶🏼