Possible speed up when reading .xml by using lxml instead of pandas #734

@pt-kkraemer

Description of the issue

I have always had the feeling that pd.read_xml is slow.

Ideas of solution

Currently we have this function in utils_write_to_database.py:

def read_xml_file(f: ZipFile, file_name: str) -> pd.DataFrame:
    """Read the xml file from the zip file and return it as a DataFrame."""
    with f.open(file_name) as xml_file:
        try:
            return pd.read_xml(xml_file, encoding="UTF-16", parser="etree")
        except lxml.etree.XMLSyntaxError as error:
            return handle_xml_syntax_error(xml_file.read().decode("utf-16"), error)

Replacing pd.read_xml with lxml.etree.fromstring, as shown here:

def read_xml_file(f: ZipFile, file_name: str) -> pd.DataFrame:
    """Read the xml file from the zip file and return it as a DataFrame."""
    with f.open(file_name) as xml_file:
        raw = xml_file.read()
    try:
        root = lxml.etree.fromstring(raw)
    except lxml.etree.XMLSyntaxError as error:
        return handle_xml_syntax_error(raw.decode("utf-16"), error)
    records = [{child.tag: child.text for child in row} for row in root]
    return pd.DataFrame(records)

led to a roughly 20% speed-up when running db.download(data=['wind', 'combustion', 'biomass']). We already import lxml, so the change adds no new dependency.
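For reproducing the comparison in isolation, here is a minimal micro-benchmark sketch. It uses a small synthetic flat XML payload, not one of the real MaStR export files, so the sizes, element names, and the repetition count are illustrative assumptions, not the setup behind the 20% figure:

```python
# Hypothetical micro-benchmark: pd.read_xml vs. lxml.etree.fromstring
# on a synthetic flat XML document (element names and sizes are made up).
import io
import timeit

import lxml.etree
import pandas as pd

# Build a small flat XML payload, UTF-16 encoded like the export files.
rows = "".join(
    f"<row><Id>{i}</Id><Name>unit_{i}</Name></row>" for i in range(1000)
)
raw = f'<?xml version="1.0" encoding="UTF-16"?><root>{rows}</root>'.encode("utf-16")


def via_pandas() -> pd.DataFrame:
    # Current approach: let pandas parse the XML stream directly.
    return pd.read_xml(io.BytesIO(raw), encoding="UTF-16", parser="etree")


def via_lxml() -> pd.DataFrame:
    # Proposed approach: parse with lxml, then build the DataFrame from dicts.
    root = lxml.etree.fromstring(raw)
    return pd.DataFrame([{child.tag: child.text for child in row} for row in root])


t_pandas = timeit.timeit(via_pandas, number=5)
t_lxml = timeit.timeit(via_lxml, number=5)
print(f"pd.read_xml: {t_pandas:.3f}s  lxml: {t_lxml:.3f}s")
```

Note that the two variants do not produce byte-identical DataFrames: pd.read_xml infers dtypes (e.g. Id becomes an integer column), while the lxml variant leaves everything as strings, so a dtype conversion step may need to be accounted for in the timing.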

Workflow checklist

  • I am aware of the workflow in CONTRIBUTING.md
  • I will carry out more elapsed time tests and document them here
  • I will further investigate whether lxml.etree.fromstring handles the "utf-16" encoding correctly
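As a first step on the last checklist item, a quick sanity check (again with a synthetic payload, not a real export file) suggests that lxml.etree.fromstring decodes UTF-16 correctly when given raw bytes, because it reads the BOM and the XML declaration itself:

```python
# Sanity check: lxml.etree.fromstring on UTF-16 bytes with BOM and
# XML declaration (synthetic payload with a non-ASCII value).
import lxml.etree

raw = (
    '<?xml version="1.0" encoding="UTF-16"?>'
    "<root><row><Name>Müllerstraße</Name></row></root>"
).encode("utf-16")

root = lxml.etree.fromstring(raw)
name = root.find("row/Name").text
print(name)  # → Müllerstraße
```

One caveat worth checking in the real code path: lxml refuses already-decoded str input that still carries an encoding declaration (it raises a ValueError), so the function should keep passing the raw bytes to fromstring and only decode for the error handler.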
