Scrapemark fails to decode hex-encoded HTML entities

Reported by blejdf...@gmail.com, Aug 11, 2010

Scrape something that contains a hex-encoded HTML entity (e.g. the title of http://www.youtube.com/videos).

The entity should be decoded; instead, a ValueError is raised.

At the time of writing, the title of the above-mentioned YouTube page is (some whitespace removed for clarity):

```
<title>YouTube - &#x202a;Most viewed videos&#x202c;&lrm</title>
```
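For illustration, here is a minimal sketch of where the ValueError most likely comes from. It assumes, as the patch in this report suggests, that the entity body reaches the decoder with its leading "x" still attached; the variable names are hypothetical and not taken from scrapemark.

```python
# Body of the entity &#x202a; with the leading "x" included (an assumption
# based on the patch in this report, which checks ent[0] == 'x').
ent = "x202a"

try:
    int(ent)  # base-10 parse, as the unpatched code does
except ValueError as exc:
    error = exc  # "x202a" is not a valid base-10 literal

# Parsing the digits after the "x" as base 16 gives the intended code point.
code_point = int(ent[1:], 16)
```

Here `code_point` is 0x202A (LEFT-TO-RIGHT EMBEDDING), the character the title's first entity stands for.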

Test code below:

```
#!/usr/bin/env python
import scrapemark

url = "http://www.youtube.com/videos"
data = scrapemark.scrape("<title>{{title}}</title>", url = url)
print data['title']
```

I've attached a patch:

```
diff --git a/scrapemark.py b/scrapemark.py
index 7b4cf72..be0327c 100644
--- a/scrapemark.py
+++ b/scrapemark.py
@@ -530,7 +530,11 @@ def _decode_entities(s):
     def _substitute_entity(m):
         ent = m.group(2)
         if m.group(1) == "#":
-            return unichr(int(ent))
+            # Hex value
+            if ent[0] == 'x':
+                return unichr(int(ent[1:], 16))
+            else:
+                return unichr(int(ent))
         else:
             cp = name2codepoint.get(ent)
             if cp:
```
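The same idea as a standalone sketch, in case it helps with testing outside scrapemark. This is not scrapemark's actual code: the regular expression, the `decode_entities` name, and the fallback of leaving unknown or malformed entities untouched are all assumptions made for illustration. It also accepts an uppercase "X" prefix, which the patch above does not.

```python
import re

try:  # Python 2, which scrapemark targets
    from htmlentitydefs import name2codepoint
except ImportError:  # Python 3
    from html.entities import name2codepoint
    unichr = chr

# Hypothetical pattern: group 1 is "#" for numeric entities,
# group 2 is the entity body (e.g. "x202a", "8234" or "amp").
_ENTITY_RE = re.compile(r"&(#?)(\w+);")


def decode_entities(s):
    """Decode decimal, hex and named HTML entities in s."""
    def substitute(m):
        ent = m.group(2)
        if m.group(1) == "#":
            try:
                if ent[0] in "xX":
                    # Hex value: parse the digits after the prefix as base 16.
                    return unichr(int(ent[1:], 16))
                return unichr(int(ent))
            except (ValueError, OverflowError):
                # Malformed or out-of-range reference: leave it as written.
                return m.group(0)
        cp = name2codepoint.get(ent)
        return unichr(cp) if cp else m.group(0)

    return _ENTITY_RE.sub(substitute, s)
```

With this sketch, `decode_entities(u"&#x202a;Most viewed videos&#x202c;")` returns the text wrapped in U+202A and U+202C rather than raising. An entity without a terminating semicolon is not matched by the assumed pattern and passes through unchanged.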

