Allow Elements to be passed to parse_*() #11

Open
levic wants to merge 5 commits into Mimino666:master from levic:master

levic commented Feb 6, 2023

Addresses #10

@Mimino666 There's no documentation here yet (I wasn't going to add it until you're happy with what I've done)

Handling parse() turned up an unexpected quirk: given only an Element, there doesn't seem to be a way to tell whether the document was originally parsed as HTML or XML, so we don't know whether to use an XML or an HTML extractor.

We could guess based on the presence (or not) of a namespace on the Element, but XML snippets can still be parsed without a namespace, so that could still lead to unexpected results. It also has the side effect of casting the Element back to a string as part of the XML header snooping, which is what we were trying to avoid in the first place (although a check for this could be added).
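
To illustrate why the namespace guess is unreliable, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for the etree Elements discussed here, and guess_is_xml() is a hypothetical helper written for this example, not code from this PR.

```python
import xml.etree.ElementTree as ET


def guess_is_xml(element):
    """Hypothetical heuristic: treat a namespaced tag as a sign of XML."""
    # ElementTree stores a namespaced tag as "{uri}localname".
    return element.tag.startswith("{")


namespaced = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom"><entry/></feed>'
)
plain = ET.fromstring("<root><item/></root>")

print(guess_is_xml(namespaced))  # True
print(guess_is_xml(plain))       # False, although this snippet is XML too
```

The second snippet is perfectly valid XML, yet the heuristic would hand it to an HTML extractor.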

I've opted to force the caller to be explicit: if you want to pass an Element to parse() then you must use parse_html() or parse_xml() instead.
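
A rough sketch of what "being explicit" could look like from the caller's side. Everything here is an assumption made for illustration: the three functions are simplified stand-ins that only report which path was chosen, the TypeError is one possible way to reject an Element (this comment does not say how the PR enforces it), and the XML header check is a simplified version of the "header snooping" mentioned above.

```python
import xml.etree.ElementTree as ET


def parse_html(body):
    # Stand-in: a real implementation would build an HTML extractor.
    return ("html", body)


def parse_xml(body):
    # Stand-in: a real implementation would build an XML extractor.
    return ("xml", body)


def parse(body):
    # Assumption: auto-detection only makes sense for strings, because an
    # already-parsed Element carries no record of how it was parsed.
    if not isinstance(body, str):
        raise TypeError(
            "parse() only accepts strings; "
            "use parse_html() or parse_xml() for an Element"
        )
    # Simplified header snooping: an XML declaration selects the XML path.
    if body.lstrip().startswith("<?xml"):
        return parse_xml(body)
    return parse_html(body)


element = ET.fromstring("<div><span>hi</span></div>")
kind, _ = parse_html(element)  # explicit choice: works for an Element
print(kind)

try:
    parse(element)  # ambiguous: rejected in this sketch
except TypeError as exc:
    print("rejected:", exc)
```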

levic added 5 commits February 7, 2023 03:44
…hould be explicit and use parse_html() or parse_xml() instead]

If we don't do this then an etree Element that was originally parsed as XML will be treated as an HTML document.

levic commented Feb 6, 2023

Calling code would now look like:

    def test_element_as_parser(self):
        """
        we can pass an Element as the extractor to parse_*()
        """
        html = '''
            <div><span>Hello world!</span></div>
            <div></div>
            <div><span>Hello mars!</span></div>
        '''

        # take only the last container so we can verify that the correct descendant is chosen
        container = Element(css='div', count=3).parse(html)[2]

        val = Element(css='span', count=1).parse_html(container)
        self.assertEqual(val.tag, 'span')
        self.assertEqual(val.text, 'Hello mars!')

The important line is val = Element(css='span', count=1).parse_html(container). Instead of re-parsing the tree, the container Element passed to parse_html() is simply wrapped in a new HtmlXPathExtractor.
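
The "wrap instead of re-parse" idea can be sketched roughly as follows. SketchExtractor is a hypothetical stand-in for an extractor such as HtmlXPathExtractor, and the standard library's xml.etree.ElementTree is used in place of the real parser, so this shows the shape of the approach rather than the actual implementation.

```python
import xml.etree.ElementTree as ET


class SketchExtractor:
    """Hypothetical extractor that accepts a string or a parsed element."""

    def __init__(self, body):
        if isinstance(body, str):
            # A string still has to be parsed into a tree.
            self.root = ET.fromstring(body)
        else:
            # An already-parsed element is kept as-is: no serialising back
            # to a string and no second parse.
            self.root = body

    def select_text(self, path):
        # Search only below the wrapped element.
        return [el.text for el in self.root.findall(path)]


doc = ET.fromstring(
    "<body>"
    "<div><span>Hello world!</span></div>"
    "<div/>"
    "<div><span>Hello mars!</span></div>"
    "</body>"
)
container = doc.findall("div")[2]

extractor = SketchExtractor(container)
print(extractor.root is container)       # True
print(extractor.select_text(".//span"))  # ['Hello mars!']
```

Because the wrapped element is the same object, the search is scoped to the third container and only its span is found.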
