Description of the issue
I have always had the feeling that pd.read_xml is slow.
Ideas of solution
Currently we have this function in utils_write_to_database.py:
def read_xml_file(f: ZipFile, file_name: str) -> pd.DataFrame:
    """Read the xml file from the zip file and return it as a DataFrame."""
    with f.open(file_name) as xml_file:
        try:
            return pd.read_xml(xml_file, encoding="UTF-16", parser="etree")
        except lxml.etree.XMLSyntaxError as error:
            return handle_xml_syntax_error(xml_file.read().decode("utf-16"), error)
Exchanging pd.read_xml with lxml.etree.fromstring as shown here:
def read_xml_file(f: ZipFile, file_name: str) -> pd.DataFrame:
    """Read the xml file from the zip file and return it as a DataFrame."""
    with f.open(file_name) as xml_file:
        raw = xml_file.read()
    try:
        root = lxml.etree.fromstring(raw)
    except lxml.etree.XMLSyntaxError as error:
        return handle_xml_syntax_error(raw.decode("utf-16"), error)
    records = [{child.tag: child.text for child in row} for row in root]
    return pd.DataFrame(records)
led to a roughly 20% speed-up when running db.download(data=['wind', 'combustion', 'biomass']). We already import lxml anyway, so the change adds no new dependency.
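As a sanity check on the equivalence of the two paths, the sketch below (illustrative only; the document layout and column names are made up) parses the same in-memory UTF-16 document both ways. One caveat worth noting: pd.read_xml infers dtypes (e.g. integer columns), while the lxml replacement yields all-string columns, so downstream code relying on dtype inference may need an explicit conversion.

```python
import io

import lxml.etree
import pandas as pd

# Hypothetical UTF-16 document with the flat <rows><row>...</row></rows>
# layout the replacement assumes; real files come from the zip archive.
xml_text = (
    '<?xml version="1.0" encoding="UTF-16"?>'
    "<rows>"
    "<row><a>1</a><b>x</b></row>"
    "<row><a>2</a><b>y</b></row>"
    "</rows>"
)
raw = xml_text.encode("utf-16")  # includes a BOM, like real UTF-16 files

# Current path: pandas parses and infers dtypes (column "a" becomes integer).
df_pandas = pd.read_xml(io.BytesIO(raw), encoding="UTF-16", parser="etree")

# Proposed path: build records manually; every value stays a string.
root = lxml.etree.fromstring(raw)
records = [{child.tag: child.text for child in row} for row in root]
df_lxml = pd.DataFrame(records)

# The frames agree once dtypes are normalised to strings.
print(df_pandas.astype(str).equals(df_lxml))
```

If the database layer casts columns itself anyway, the all-string output may be harmless; otherwise an explicit astype on known numeric columns restores the old behaviour.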
Workflow checklist
- [ ] lxml.etree.fromstring handles the "utf-16" encoding correctly
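For the checklist item above, a quick check (a minimal sketch with a made-up document) is to feed lxml.etree.fromstring raw UTF-16 bytes directly: libxml2 detects the byte-order mark and the encoding declaration on its own, so no explicit decode is needed before parsing.

```python
import lxml.etree

# Made-up minimal document; real files come from the zip archive.
xml_text = '<?xml version="1.0" encoding="UTF-16"?><rows><row><a>1</a></row></rows>'
raw = xml_text.encode("utf-16")  # BOM-prefixed UTF-16 bytes

# fromstring must receive bytes here: passing an already-decoded str that
# still carries an encoding declaration raises a ValueError in lxml.
root = lxml.etree.fromstring(raw)
print(root.tag, root[0][0].text)
```

This only covers BOM-prefixed input; if some files in the archive lack a BOM or declaration, they would need a separate check.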