Parse, don't validate

Type-driven design

This alternative approach to data wrangling is summed up in a slogan coined by Alexis King in 2019 in an influential blog post: "Parse, don't validate". King here argues for the use of "Type-driven design", also called "Type-driven development":

This definition of "Type-driven development" entails that each parser is defined at the outset to guarantee the production of a particular data type (also called "data model", "data structure", or "schema"). With an abundance of different precisely defined data types in the code base, transformations can be defined with precise syntax as a cascade of data type conversions, e.g.:

list[Any] ⇩ list[str] ⇩ list[conint(ge=0, le=1000)]*

* These data types are written in Python type hint notation, where conint(ge=0, le=1000) is a pydantic data type representing a non-negative integer less than or equal to 1000.
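The cascade can be sketched with the standard library alone, as one parser function per arrow. This is a minimal illustration under stated assumptions, not pydantic's implementation: `BoundedInt`, `parse_strings` and `parse_bounded_ints` are hypothetical names, and `BoundedInt` is a hand-rolled stand-in for conint(ge=0, le=1000). Because `NewType` exists only for the static type checker, the guarantee holds only as long as `parse_bounded_ints` is the sole place where a `BoundedInt` is created.

```python
from typing import Any, NewType

# Hypothetical stand-in for conint(ge=0, le=1000). NewType is visible only to
# the static type checker; at run time a BoundedInt is a plain int.
BoundedInt = NewType("BoundedInt", int)


def parse_strings(raw: list[Any]) -> list[str]:
    """First arrow: list[Any] -> list[str], rejecting non-string items."""
    for item in raw:
        if not isinstance(item, str):
            raise TypeError(f"expected str, got {type(item).__name__}: {item!r}")
    return list(raw)


def parse_bounded_ints(texts: list[str]) -> list[BoundedInt]:
    """Second arrow: list[str] -> list[BoundedInt], enforcing 0 <= n <= 1000."""
    result: list[BoundedInt] = []
    for text in texts:
        number = int(text)  # raises ValueError for non-numeric text
        if not 0 <= number <= 1000:
            raise ValueError(f"{number} is outside the range 0..1000")
        result.append(BoundedInt(number))
    return result


raw_input: list[Any] = ["0", "42", "1000"]
values = parse_bounded_ints(parse_strings(raw_input))
```

Each function signature states which conversion it performs, so the pipeline reads as the same chain of types as the cascade above.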

The data types remember

A main advantage of this approach is that once a variable is defined to follow a particular data type, e.g. "list of non-negative integers less than or equal to 1000", this restriction is preserved in the data type itself; code that later receives the variable never needs to parse or validate it again!
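A small sketch of what "remembering" can look like in plain Python, assuming a hypothetical `BoundedInts` class that checks its contents exactly once, on construction. The class is frozen and stores a tuple, so the checked values cannot be reassigned or mutated in place afterwards. The `mean` function is likewise only an illustrative consumer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundedInts:
    """Hypothetical type: a tuple of integers, each in the range 0..1000."""

    values: tuple[int, ...]

    def __post_init__(self) -> None:
        # The only place where the constraint is checked.
        if not isinstance(self.values, tuple):
            raise TypeError("values must be a tuple so it cannot change later")
        for value in self.values:
            if isinstance(value, bool) or not isinstance(value, int):
                raise TypeError(f"expected int, got {value!r}")
            if not 0 <= value <= 1000:
                raise ValueError(f"{value} is outside the range 0..1000")


def mean(numbers: BoundedInts) -> float:
    # No range check here: the parameter type already carries the guarantee.
    # An empty collection is still allowed by the type, so handle it explicitly.
    if not numbers.values:
        return 0.0
    return sum(numbers.values) / len(numbers.values)


parsed = BoundedInts((10, 20, 30))
average = mean(parsed)
```

Any function annotated to accept `BoundedInts` can rely on the range constraint without repeating the check, which is the point of letting the type carry the information.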

Requires static typing

The "Parse, don't validate" approach requires that the programming language is statically typed and also that the language supports complex data types, so that e.g. a full metadata schema can be expressed as a type. It is no surprise that Alexis King in the above-mentioned blog post demonstrated the concepts in Haskell, a statically typed and purely functional programming language.

What about Python?

Python is one of the most popular programming languages in bioinformatics and data science in general. Python is also one of the most famous examples of a duck-typed language: if something "walks like a duck and quacks like a duck, then it must be a duck". Unfortunately, in traditional Python code, if a variable looks like a "list of non-negative integers less than or equal to 1000", there is no way to know this for sure without validating the full list, and even then, there are no guarantees that the data will stay that way forever.

Fortunately, with the integration of type hints and static type checkers such as mypy, this is changing. Moreover, with the advent of run-time type checking with libraries like pydantic, the time is ripe to take advantage of type-driven design in Python as well.