Failed writing a dataframe to '.avro' file #58

@Anna050689

Prerequisites:

  • Python 3.10
  • pandavro==1.8.0
  • fastavro==1.9.7

Steps to reproduce the issue:

  • Create a dataframe with the following data:
import pandas as pd

data = {
    'id': [545, 539, 643, 615, 502, 599, 542, 587, 537, 518],
    'first_name': ['caallai', 'Xzaaen', 'olrie', 'Iaairl', 'hfreiio', 'yieri', 'hcninn', 'irannir', 'Cmrnnan', 'Mnaeail'],
    'last_name': ['kroaoe', 'trrot', 'haill', 'kolide', 'errhnd', 'aoaoet', 'yBorrd', 'evbceyd', 'Wcnoee', 'eMloen'],
    'created_date': ['12/22/1992', '06/02/1992', '09/23/1998', '01/01/1997', '03/26/1990', '06/01/1996', '08/08/1992', '01/14/1995', '06/16/1992', '06/24/1991'],
    'Active': [False, False, False, False, False, True, False, False, False, True]
}
df = pd.DataFrame(data=data).astype('object')
  • Attempt to save the dataframe to an '.avro' file using the following command:
import pandavro as pdx

path = 'output.avro'
pdx.to_avro(path, df, schema=None)

Expected behavior:

The dataframe should be saved to an '.avro' file without any errors.

Actual behavior:

The following error is raised:

  File "fastavro/_write.pyx", line 779, in fastavro._write.writer
  File "fastavro/_write.pyx", line 687, in fastavro._write.Writer.__init__
  File "fastavro/_schema.pyx", line 173, in fastavro._schema.parse_schema
  File "fastavro/_schema.pyx", line 407, in fastavro._schema._parse_schema
  File "fastavro/_schema.pyx", line 475, in fastavro._schema.parse_field
  File "fastavro/_schema.pyx", line 233, in fastavro._schema._parse_schema
  File "fastavro/_schema.pyx", line 263, in fastavro._schema._parse_schema
TypeError: argument of type 'NoneType' is not iterable

The inferred schema is:

{
    'fields': [
        {'name': 'id', 'type': ['null', None]},
        {'name': 'first_name', 'type': ['null', 'string']},
        {'name': 'last_name', 'type': ['null', 'string']},
        {'name': 'created_date', 'type': ['null', 'string']},
        {'name': 'Active', 'type': ['null', 'boolean']}
    ],
    'name': 'Root',
    'type': 'record'
}

Additional Information:

The issue occurs because, when the "id" column has the data type object, its schema type is inferred as ['null', None] instead of ['null', 'int']. fastavro then fails while parsing the None entry in the union.
When the "id" column has an integer data type, the '.avro' file is saved successfully.
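
To spot this kind of problem before the schema reaches fastavro, the inferred schema can be scanned for unresolved types. The following is a minimal sketch using only the standard library; the helper name `fields_with_unresolved_types` is hypothetical and is not part of pandavro or fastavro. It only inspects a schema dict shaped like the one shown above.

```python
def fields_with_unresolved_types(schema):
    """Return the names of record fields whose type, or any union member, is None."""
    bad = []
    for field in schema.get("fields", []):
        field_type = field.get("type")
        # A union is written as a list; treat a single type as a one-member union.
        members = field_type if isinstance(field_type, list) else [field_type]
        if any(member is None for member in members):
            bad.append(field["name"])
    return bad


# The inferred schema copied from this report.
inferred_schema = {
    "fields": [
        {"name": "id", "type": ["null", None]},
        {"name": "first_name", "type": ["null", "string"]},
        {"name": "last_name", "type": ["null", "string"]},
        {"name": "created_date", "type": ["null", "string"]},
        {"name": "Active", "type": ["null", "boolean"]},
    ],
    "name": "Root",
    "type": "record",
}

unresolved = fields_with_unresolved_types(inferred_schema)
print(unresolved)
```

For the schema in this report, only the "id" field is flagged, which matches the column that triggers the TypeError.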

Workaround:

As a temporary workaround, explicitly cast the "id" column to an integer data type before saving the dataframe to an '.avro' file:

df['id'] = df['id'].astype('int')
pdx.to_avro(path, df, schema=None)
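
When several object columns may be affected, casting each one by hand can be avoided. A possible variant of the workaround, sketched here under the assumption that the object columns hold plain Python values, is to let pandas re-infer the dtypes with `DataFrame.infer_objects()` before writing. The shortened dataframe below is illustrative, and the `pdx.to_avro` call is left as a comment because it has not been verified against pandavro 1.8.0 in this sketch.

```python
import pandas as pd

# A shortened version of the dataframe from the report, forced to object dtype.
df = pd.DataFrame({
    "id": [545, 539, 643],
    "first_name": ["caallai", "Xzaaen", "olrie"],
    "Active": [False, False, True],
}).astype("object")

# infer_objects() attempts a soft conversion of object columns to more
# specific dtypes; a column of Python ints becomes an integer column.
converted = df.infer_objects()
print(converted.dtypes)

# Assumption: with the integer dtype restored, schema inference should succeed.
# import pandavro as pdx
# pdx.to_avro("output.avro", converted, schema=None)
```

This keeps the original dataframe untouched and only changes the copy passed to the writer.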
