Possible speed up when reading .xml by using lxml instead of pandas #734

@pt-kkraemer

Description of the issue

I have always had the feeling that pd.read_xml is slow.

Ideas of solution

Currently we have this function in utils_write_to_database.py:

def read_xml_file(f: ZipFile, file_name: str) -> pd.DataFrame:
    """Read the xml file from the zip file and return it as a DataFrame."""
    with f.open(file_name) as xml_file:
        try:
            return pd.read_xml(xml_file, encoding="UTF-16", parser="etree")
        except lxml.etree.XMLSyntaxError as error:
            return handle_xml_syntax_error(xml_file.read().decode("utf-16"), error)

Replacing pd.read_xml with lxml.etree.fromstring, as shown here:

def read_xml_file(f: ZipFile, file_name: str) -> pd.DataFrame:
    """Read the xml file from the zip file and return it as a DataFrame."""
    with f.open(file_name) as xml_file:
        raw = xml_file.read()
    try:
        root = lxml.etree.fromstring(raw)
    except lxml.etree.XMLSyntaxError as error:
        return handle_xml_syntax_error(raw.decode("utf-16"), error)
    records = [{child.tag: child.text for child in row} for row in root]
    return pd.DataFrame(records)

led to a roughly 20% speed-up when running db.download(data=['wind', 'combustion', 'biomass']). We already import lxml, so the change adds no new dependency.
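For reproducing the comparison in isolation, here is a minimal micro-benchmark sketch. It uses a small synthetic flat XML payload, not one of the real MaStR export files, so the sizes, element names, and the repetition count are illustrative assumptions, not the setup behind the 20% figure:

```python
# Hypothetical micro-benchmark: pd.read_xml vs. lxml.etree.fromstring
# on a synthetic flat XML document (element names and sizes are made up).
import io
import timeit

import lxml.etree
import pandas as pd

# Build a small flat XML payload, UTF-16 encoded like the export files.
rows = "".join(
    f"<row><Id>{i}</Id><Name>unit_{i}</Name></row>" for i in range(1000)
)
raw = f'<?xml version="1.0" encoding="UTF-16"?><root>{rows}</root>'.encode("utf-16")


def via_pandas() -> pd.DataFrame:
    # Current approach: let pandas parse the XML stream directly.
    return pd.read_xml(io.BytesIO(raw), encoding="UTF-16", parser="etree")


def via_lxml() -> pd.DataFrame:
    # Proposed approach: parse with lxml, then build the DataFrame from dicts.
    root = lxml.etree.fromstring(raw)
    return pd.DataFrame([{child.tag: child.text for child in row} for row in root])


t_pandas = timeit.timeit(via_pandas, number=5)
t_lxml = timeit.timeit(via_lxml, number=5)
print(f"pd.read_xml: {t_pandas:.3f}s  lxml: {t_lxml:.3f}s")
```

Note that the two variants do not produce byte-identical DataFrames: pd.read_xml infers dtypes (e.g. Id becomes an integer column), while the lxml variant leaves everything as strings, so a dtype conversion step may need to be accounted for in the timing.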

Workflow checklist

  • I am aware of the workflow in CONTRIBUTING.md
  • I will carry out more elapsed time tests and document them here
  • I will further investigate whether lxml.etree.fromstring handles the "utf-16" encoding correctly
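As a first step on the last checklist item, a quick sanity check (again with a synthetic payload, not a real export file) suggests that lxml.etree.fromstring decodes UTF-16 correctly when given raw bytes, because it reads the BOM and the XML declaration itself:

```python
# Sanity check: lxml.etree.fromstring on UTF-16 bytes with BOM and
# XML declaration (synthetic payload with a non-ASCII value).
import lxml.etree

raw = (
    '<?xml version="1.0" encoding="UTF-16"?>'
    "<root><row><Name>Müllerstraße</Name></row></root>"
).encode("utf-16")

root = lxml.etree.fromstring(raw)
name = root.find("row/Name").text
print(name)  # → Müllerstraße
```

One caveat worth checking in the real code path: lxml refuses already-decoded str input that still carries an encoding declaration (it raises a ValueError), so the function should keep passing the raw bytes to fromstring and only decode for the error handler.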
