human-readable, type safe serialization

**Is your feature request related to a problem? Please describe.**

I just chased a very mysterious bug for 4 hours. I couldn't believe my eyes because the results made zero sense, and I thought I had messed up some very basic functions. It turns out that I had serialized an `ScmRun` to CSV, and one of my metadata columns (containing IPCC 2006 categories) held only top-level categories, all of which can be parsed as integers. When deserializing this `ScmRun`, the inferred data type for this metadata column was integer, so selecting a category using a string did not turn up anything. This is hard to debug because neither pandas nor `ScmRun` representations show column data types by default.
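For anyone who wants to reproduce this, here is a minimal sketch of the failure mode. It uses a plain pandas `DataFrame` as a stand-in for the metadata of an `ScmRun` (that substitution is an assumption, and the column names and values are made up for illustration):

```python
import io

import pandas as pd

# Stand-in for ScmRun metadata: category codes are stored as strings,
# and after filtering only top-level codes ("1", "2", "3") are left.
original = pd.DataFrame(
    {"category": ["1", "2", "3"], "value": [0.5, 1.5, 2.5]}
)

# Round trip through CSV, which carries no type information.
csv_text = original.to_csv(index=False)
restored = pd.read_csv(io.StringIO(csv_text))

# pandas infers the column type from the values it sees.
print(restored.dtypes)

# The selection that worked before the round trip now matches no rows.
before = original[original["category"] == "1"]
matches = restored[restored["category"] == "1"]
print(len(before), len(matches))
```

Passing `dtype={"category": str}` to `pd.read_csv` avoids the problem, but only if the reader already knows which columns need it, which is exactly the information CSV does not store.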

**Describe the solution you'd like**

I want a human-readable, gitlab-diffable serialization of an ScmRun which is type safe.

My ideal serialization would be:
* data format close to CSV, with one line per row
* maybe even diffable with advanced diffing tools like csvdiff

Nothing built into pandas fully fits the bill. There is `df.to_json(orient="records", lines=True)`, which looks nice but isn't type safe for all-null columns (a string column with all-null values comes back as a float column after a round trip). There is `df.to_json(orient="table")`, which is fully type safe but produces one giant line. You can pretty-print that with any JSON pretty printer, but then every value ends up on its own line, not one line per row like CSV.
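A small sketch comparing the two JSON orientations on a frame with an all-null string column. The frame and its column names are invented for illustration, and the column is created with an explicit `object` dtype so the example does not depend on pandas' string inference:

```python
import io
import json

import pandas as pd

df = pd.DataFrame(
    {
        "unit": pd.Series(["Mt CO2", "Mt CO2"], dtype="object"),
        # A string column that happens to contain only nulls.
        "note": pd.Series([None, None], dtype="object"),
    }
)

# orient="records", lines=True: one JSON object per row, nicely diffable.
records = df.to_json(orient="records", lines=True)
print(records)

# Reading it back has to guess the type of the all-null column.
from_records = pd.read_json(io.StringIO(records), orient="records", lines=True)
print(from_records.dtypes)

# orient="table": the schema is stored alongside the data,
# but everything is written on a single line.
table = df.to_json(orient="table", index=False)
schema = json.loads(table)["schema"]
print(schema["fields"])
```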

An alternative might be something based on CSV but with an additional header that looks like the `schema` entry of `df.to_json(orient="table")`. However, it could no longer be deserialized with standard pandas functionality; you would have to use a custom wrapper.
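To make the CSV-plus-schema idea concrete, here is a rough sketch of what such a wrapper could look like. The function names `dumps_typed_csv` and `loads_typed_csv`, the `# schema: ` prefix, and the type mapping are all hypothetical choices for illustration, not an existing pandas or scmdata API. It only handles a few basic types and ignores the index:

```python
import io
import json

import pandas as pd

SCHEMA_PREFIX = "# schema: "  # hypothetical marker for the header line

# Table Schema field types mapped to dtypes understood by read_csv.
# Only a few basic types are covered in this sketch.
TYPE_MAP = {"string": "str", "integer": "int64", "number": "float64", "boolean": "bool"}


def dumps_typed_csv(df: pd.DataFrame) -> str:
    """Serialize df as CSV preceded by one line holding its schema."""
    schema = pd.io.json.build_table_schema(df, index=False)
    return SCHEMA_PREFIX + json.dumps(schema) + "\n" + df.to_csv(index=False)


def loads_typed_csv(text: str) -> pd.DataFrame:
    """Read the schema line first, then parse the CSV with explicit dtypes."""
    header, body = text.split("\n", 1)
    schema = json.loads(header[len(SCHEMA_PREFIX):])
    # Unknown field types fall back to strings rather than being inferred.
    dtypes = {f["name"]: TYPE_MAP.get(f["type"], "str") for f in schema["fields"]}
    return pd.read_csv(io.StringIO(body), dtype=dtypes)


df = pd.DataFrame(
    {
        "category": pd.Series(["1", "2", "3"], dtype="object"),
        "year": [2010, 2011, 2012],
    }
)
text = dumps_typed_csv(df)
restored = loads_typed_csv(text)
print(text)
print(restored[restored["category"] == "1"])
```

The body stays one line per row, so ordinary line-based diffs still work; only the first line is new. Tools that expect plain CSV, such as csvdiff, would presumably need the header line stripped or skipped first.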

**Describe alternatives you've considered**

Absolutely forbid CSV in any context and always use binary, type safe formats. Then, build infrastructure to make changes in the binary formats reviewable. Dunno, maybe a bot which adds visual diffs, or just a smallish CLI which takes two git branches and a file name and produces nice diffs.

**Additional context**

Maybe I'm biased right now because I wasted so much time on this, but I think we really can't tolerate not-type-safe serializations. If I have to worry about details like "oh no, my categories are only top-level after filtering this, now I can't serialize it any more" when building higher-level functionality, it will be an ongoing mental capacity cost. Not only when debugging, but every time I have to serialize or deserialize anything. I would probably be paranoid and at least start `assert`ing dtypes everywhere on my `ScmRun`s.


human-readable, type safe serialization #296
