Migrate to Polars #126

@MitchellAcoustics

Description

Is Your Feature Request Related to a Problem? Please Describe

I've been consistently frustrated with the hacky nature of defining and validating data frame schemas in Pandas. I'd like to be able to specify expected information about each column, such as its data type, which columns must be present, acceptable value ranges, etc. Pandera offers some of this, but none of it feels especially slick.

Describe the Solution You'd Like

It appears Polars has native support for schema definitions, and there are several packages for defining and validating data frame schemas in Polars.

See:

There are other benefits to migrating to Polars which I've heard mentioned but not looked into - speed, interface and syntax, extensibility, etc.

Describe Alternatives You've Considered

The current dev state of the circe module uses a Pandera DataFrameModel to define the expected shape and attributes of PAQ data. I guess this is working, but I didn't really enjoy the Pandera experience.

# soundscapy.satp.circe.py

import pandera as pa
from pandera import Field
from pandera.typing import Series

class SATPSchema(pa.DataFrameModel):
    """
    Pandera schema for validating SATP (Soundscape Attributes Translation Project) data.

    This schema validates DataFrame columns containing PAQ ratings
    and participant identifiers. PAQ ratings must be between 0 and 100.
    """

    PAQ1: Series[float] = Field(ge=0, le=100)
    PAQ2: Series[float] = Field(ge=0, le=100)
    PAQ3: Series[float] = Field(ge=0, le=100)
    PAQ4: Series[float] = Field(ge=0, le=100)
    PAQ5: Series[float] = Field(ge=0, le=100)
    PAQ6: Series[float] = Field(ge=0, le=100)
    PAQ7: Series[float] = Field(ge=0, le=100)
    PAQ8: Series[float] = Field(ge=0, le=100)

    participant: Series[str]

Additional Context

No response

Metadata

Labels

enhancement: New feature or request

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
