Deserialization error for some querysets #34

@Peder2911

@jimdale and I found an issue where viewser raises a deserialization error
even though the response clearly contains at least partial Parquet data:

DeserializationError: DeserializationError:

  Description:
                Could not deserialize as parquet:             "b'PAR1\x15\x04\
                x15\xe0D\x15\xf8?L\x15\xcc\x08\x15\x04\x12\x00\x00\x1f\x8b\x08
                \x00\x00\x00\x00\x00\x00\x03-Wi8Vk\x1b5e\x1e\xdei\x8f\xafY*\x9
                1\xc21'..."

This only seems to happen with certain querysets. The queryset that led to this error was:

from viewser import Queryset, Column

# NOTE: thetacrit_tree is defined elsewhere in the original session; its value is not shown here.

queryset = (Queryset("jim_fatalities_conflict_history_lag_tdecay", "priogrid_month")

            # target variable
            .with_column(Column("ln_ged_sb", from_table="ged2_pgm", from_column="ged_sb_best_sum_nokgi")
                         .transform.missing.fill()
                         .transform.ops.ln()
                         )

            # spatial-tree-lagged d^-2 target variable
            .with_column(Column("ln_ged_sb_treelag_2_th1_0", from_table="ged2_pgm", from_column="ged_sb_best_sum_nokgi")
                         .transform.missing.fill()
                         .transform.ops.ln()
                         .transform.spatial.treelag(thetacrit_tree, 2)
                         )

            # spatial-tree-lagged d^-2 target variable, time-lagged by 1
            .with_column(Column("ln_ged_tlag_1_sb_treelag_2_th1_0", from_table="ged2_pgm", from_column="ged_sb_best_sum_nokgi")
                         .transform.missing.fill()
                         .transform.ops.ln()
                         .transform.spatial.treelag(thetacrit_tree, 2)
                         .transform.temporal.tlag(1)
                         .transform.missing.fill()
                         )

            # spatial-tree-lagged d^-1 target variable
            .with_column(Column("ln_ged_sb_treelag_1_th1_0", from_table="ged2_pgm", from_column="ged_sb_best_sum_nokgi")
                         .transform.missing.fill()
                         .transform.ops.ln()
                         .transform.spatial.treelag(thetacrit_tree, 1)
                         )

            # spatial-tree-lagged d^-1 target variable, time-lagged by 1
            .with_column(Column("ln_ged_tlag_1_sb_treelag_1_th1_0", from_table="ged2_pgm", from_column="ged_sb_best_sum_nokgi")
                         .transform.missing.fill()
                         .transform.ops.ln()
                         .transform.spatial.treelag(thetacrit_tree, 1)
                         .transform.temporal.tlag(1)
                         .transform.missing.fill()
                         )

            # spatial-tree-lagged ln(1+d) target variable
            .with_column(Column("ln_ged_sb_treelag_0_th1_0", from_table="ged2_pgm", from_column="ged_sb_best_sum_nokgi")
                         .transform.missing.fill()
                         .transform.ops.ln()
                         .transform.spatial.treelag(thetacrit_tree, 0)
                         )

            # spatial-tree-lagged ln(1+d) target variable, time-lagged by 1
            .with_column(Column("ln_ged_tlag_1_sb_treelag_0_th1_0", from_table="ged2_pgm", from_column="ged_sb_best_sum_nokgi")
                         .transform.missing.fill()
                         .transform.ops.ln()
                         .transform.spatial.treelag(thetacrit_tree, 0)
                         .transform.temporal.tlag(1)
                         .transform.missing.fill()
                         )
            )

To begin diagnosing this, we need tooling that dumps the raw bytes of a
failing response, so we can see exactly what was returned and why it cannot be
deserialized. That should tell us whether the problem originates upstream or
in the deserialization step itself.
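
As a starting point, a minimal sketch of such a dump helper could look like the following. It is not part of viewser: the function name `dump_and_inspect` and the default file name are hypothetical, and it assumes the raw response body is available as `bytes`. It relies only on the Parquet file layout, in which a file starts with the magic bytes `PAR1` and ends with a 4-byte little-endian footer length followed by `PAR1` again.

```python
import struct
from pathlib import Path

PARQUET_MAGIC = b"PAR1"


def dump_and_inspect(data: bytes, dump_path: str = "failed_response.bin") -> dict:
    """Write raw response bytes to disk and report basic Parquet framing checks.

    Hypothetical helper for debugging; not an existing viewser function.
    """
    # Keep the exact bytes so they can be examined later with other tools.
    Path(dump_path).write_bytes(data)

    report = {
        "size": len(data),
        "starts_with_magic": data[:4] == PARQUET_MAGIC,
        # A complete file is at least 12 bytes: magic + footer length + magic.
        "ends_with_magic": len(data) >= 12 and data[-4:] == PARQUET_MAGIC,
        "footer_length": None,
        "footer_fits": False,
    }
    if report["ends_with_magic"]:
        # The 4 bytes before the trailing magic hold the footer length.
        (footer_length,) = struct.unpack("<I", data[-8:-4])
        report["footer_length"] = footer_length
        # Leading magic (4) + footer + length field (4) + trailing magic (4).
        report["footer_fits"] = footer_length + 12 <= len(data)
    return report
```

If `starts_with_magic` is true but `ends_with_magic` is false, the payload was probably cut short or has extra bytes appended. If both are true and `footer_fits` holds, the bytes are at least framed like a complete Parquet file, which would make the deserialization step the more likely suspect.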

One clue is that no exception is raised upstream, which suggests that the
data is written to Parquet and sent without problems. This points towards the
fault being in viewser.
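
The absence of an upstream exception does not by itself rule out the body being cut short in transit. One way to check this is to compare the declared `Content-Length` with the number of bytes actually received. The sketch below assumes the response headers and raw body are accessible on the client side, which this issue does not confirm. The function name `missing_bytes` is hypothetical.

```python
from typing import Mapping, Optional


def missing_bytes(headers: Mapping[str, str], body: bytes) -> Optional[int]:
    """Return declared minus received byte count, or None when it cannot be judged.

    Hypothetical helper for debugging; not an existing viewser function.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {key.lower(): value for key, value in headers.items()}

    # With Content-Encoding set, Content-Length describes the encoded body,
    # so comparing it to an already-decoded body would be misleading.
    if normalized.get("content-encoding", "identity").lower() != "identity":
        return None

    declared = normalized.get("content-length")
    if declared is None or not declared.strip().isdigit():
        return None
    return int(declared) - len(body)
```

A positive result would mean fewer bytes arrived than were declared, pointing at transport or the upstream service. A result of `0` would be consistent with the fault being on the viewser side.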

Labels: bug (Something isn't working)
