Scrapemark fails to decode hex-encoded HTML entities #9

@arshaw

Reported by blejdf...@gmail.com, Aug 11, 2010

Scrape a page that contains an HTML entity encoded in hex (for example, the title of http://www.youtube.com/videos).

The entity should be decoded; instead, a ValueError is raised.

At the time of writing, the title of the above-mentioned YouTube page is (some whitespace removed for clarity):

<title>YouTube - &#x202a;Most viewed videos&#x202c;&lrm</title>

Test code below:

#!/usr/bin/env python
import scrapemark

url = "http://www.youtube.com/videos"
data = scrapemark.scrape("<title>{{title}}</title>", url = url)
print data['title']

I've attached a patch:

diff --git a/scrapemark.py b/scrapemark.py
index 7b4cf72..be0327c 100644
--- a/scrapemark.py
+++ b/scrapemark.py
@@ -530,7 +530,11 @@ def _decode_entities(s):
     def _substitute_entity(m):
    ent = m.group(2)
    if m.group(1) == "#":
-       return unichr(int(ent))
+       # Hex value
+       if ent[0] == 'x':
+           return unichr(int(ent[1:], 16))
+       else:
+           return unichr(int(ent))
    else:
        cp = name2codepoint.get(ent)
        if cp:
