Description
Is Your Feature Request Related to a Problem? Please Describe
I've been consistently frustrated with the hacky nature of defining and validating data frame schemas in Pandas. I'd like to be able to specify expected information about each column, such as its data type, which columns must be present, acceptable value ranges, etc. Pandera has some of this, but none of it feels especially slick.
Describe the Solution You'd Like
It appears Polars has native support for schema definitions, and there are several packages available for defining and validating data frame schemas in Polars.
See:
- https://www.reddit.com/r/dataengineering/comments/1k20jie/we_built_a_new_opensource_validation_library_for/
- https://posit-dev.github.io/pointblank/blog/validation-libs-2025/
There are other benefits to migrating to Polars which I've heard mentioned but not looked into - speed, interface and syntax, extensibility, etc.
Describe Alternatives You've Considered
The current dev state of the circe module uses a Pandera DataFrameModel to define the expected shape and attributes of PAQ data. I guess this is working, but I didn't really enjoy the Pandera experience.
```python
# soundscapy.satp.circe.py
# Imports are assumed here; they were not shown in the original snippet.
import pandera as pa
from pandera import Field
from pandera.typing import Series


class SATPSchema(pa.DataFrameModel):
    """
    Pandera schema for validating SATP (Soundscape Attributes Translation Project) data.

    This schema validates DataFrame columns containing PAQ ratings
    and participant identifiers. PAQ ratings must be between 0 and 100.
    """

    PAQ1: Series[float] = Field(ge=0, le=100)
    PAQ2: Series[float] = Field(ge=0, le=100)
    PAQ3: Series[float] = Field(ge=0, le=100)
    PAQ4: Series[float] = Field(ge=0, le=100)
    PAQ5: Series[float] = Field(ge=0, le=100)
    PAQ6: Series[float] = Field(ge=0, le=100)
    PAQ7: Series[float] = Field(ge=0, le=100)
    PAQ8: Series[float] = Field(ge=0, le=100)
    participant: Series[str]
```
Additional Context
No response