Allow Elements to be passed to parse_*() #11
Open
levic wants to merge 5 commits into Mimino666:master from
…hould be explicit and use parse_html() or parse_xml() instead] If we don't do this, then an etree Element that was originally parsed as XML will be treated as an HTML document.
Calling code would now look like:

def test_element_as_parser(self):
    """
    we can pass an Element as the extractor to parse_*()
    """
    html = '''
    <div><span>Hello world!</span></div>
    <div></div>
    <div><span>Hello mars!</span></div>
    '''
    # take only the last container so we can verify that the correct descendant is chosen
    container = Element(css='div', count=3).parse(html)[2]
    val = Element(css='span', count=1).parse_html(container)
    self.assertEqual(val.tag, 'span')
    self.assertEqual(val.text, 'Hello mars!')

The important line is …
Addresses #10
@Mimino666 There's no documentation here yet (I wasn't going to add it until you're happy with what I've done)
Handling parse() was an unexpected quirk: if we only have an Element, then it doesn't look like we can know whether the document was parsed as HTML or XML, so we don't know whether to use an XML or an HTML extractor.

We can guess based on the presence (or absence) of a namespace on the Element, but you can still parse XML snippets without a namespace, so that could still lead to unexpected results. It also has the side effect of casting the Element back to a string as part of the XML header snooping, which is what we were trying to avoid in the first place (although a check for this could be added).
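To illustrate why the namespace guess is unreliable, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than the parser this project uses, and guess_is_xml() is a hypothetical helper that is not part of this PR:

```python
import xml.etree.ElementTree as ET


def guess_is_xml(element):
    """Hypothetical heuristic: treat an Element as XML only if its tag is namespaced.

    ElementTree stores a namespaced tag as '{namespace-uri}localname'.
    """
    return element.tag.startswith('{')


# Both snippets are parsed as XML, but only the first declares a namespace.
namespaced = ET.fromstring('<feed xmlns="http://www.w3.org/2005/Atom"><title>t</title></feed>')
plain = ET.fromstring('<feed><title>t</title></feed>')

namespaced_guess = guess_is_xml(namespaced)  # the heuristic recognises this one
plain_guess = guess_is_xml(plain)            # the heuristic misses this one
```

The second Element came from an XML parse just like the first, yet the heuristic would hand it to an HTML extractor.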
I've opted to force the caller to be explicit: if you want to pass an Element to parse(), then you must use parse_html() or parse_xml() instead.
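As a rough sketch of that contract (not the code in this PR): SketchParser below is a hypothetical stand-in built on the standard library's xml.etree.ElementTree. Raising TypeError from parse() and the header sniffing shown are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


class SketchParser:
    """Hypothetical stand-in for a parser with parse(), parse_html() and parse_xml()."""

    def parse(self, body):
        # Assumption: an Element carries no record of how it was parsed,
        # so parse() refuses to guess and asks the caller to be explicit.
        if ET.iselement(body):
            raise TypeError('pass Elements to parse_html() or parse_xml(), not parse()')
        # For strings we can still sniff the XML header (simplified here).
        if body.lstrip().startswith('<?xml'):
            return self.parse_xml(body)
        return self.parse_html(body)

    def parse_html(self, body):
        return ('html', self._root(body))

    def parse_xml(self, body):
        return ('xml', self._root(body))

    @staticmethod
    def _root(body):
        # An Element is used as-is, so it is never cast back to a string.
        return body if ET.iselement(body) else ET.fromstring(body)


parser = SketchParser()
container = ET.fromstring('<div><span>Hello mars!</span></div>')
kind, root = parser.parse_html(container)
```

The caller states the document type once, at the call site, and the Element is reused without re-serialising it.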