Support content encodings other than utf8 (support Swedish characters) #10

@arshaw

Reported by wigg...@gmail.com, Aug 13, 2010

When Scrapemark is used to get text from Swedish websites, or from other sites that do not use utf-8 as their content encoding (which is common here), it removes all special characters (åäö): the text "Hjälp" becomes "Hjlp".

What steps will reproduce the problem?

  1. Use the URL of a homepage with content-encoding iso-8859-1, for example this Swedish homepage: http://www.asciitabell.se/
  2. Scrape the <title>{{ }}</title>

What is the expected output? What do you see instead?
The output is "ASCII-tabellen (8 bitars utkad ASCII, enligt ISO 8859-1)"; the expected result is "ASCII-tabellen (8 bitars utökad ASCII, enligt ISO 8859-1)" (notice the o with two dots in the middle :)).
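One plausible way for the "ö" to vanish like this is that ISO 8859-1 bytes get decoded as UTF-8 with invalid bytes silently dropped. Whether Scrapemark does exactly this internally is an assumption, not something confirmed in this report; the sketch below only reproduces the symptom in plain Python 3.

```python
# The word as an ISO 8859-1 server would send it ("ö" is the single byte 0xF6).
raw = "ut\u00f6kad".encode("iso-8859-1")

# Assumed failure mode: treating the bytes as UTF-8 and ignoring invalid ones.
# 0xF6 is not valid UTF-8 here, so it is dropped.
lossy = raw.decode("utf-8", errors="ignore")

# Decoding with the charset the page actually uses keeps the character.
correct = raw.decode("iso-8859-1")

print(lossy)
print(correct)
```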

What version of the product are you using? On what operating system?
Version 0.9, Mac OS X Snow Leopard

Please provide any additional information below.
I wrote a patch that fixes this by simply looking at the response header: if the header includes iso-8859, the result is decoded and then encoded to utf8 before being sent to other functions. This could possibly be made more generic so that it works with content encodings other than iso-8859.
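A more generic version of that idea could read the charset from the Content-Type header instead of checking for iso-8859 only. The following is a minimal Python 3 sketch, not the patch described above: the function names `charset_from_content_type` and `decode_body` are hypothetical, and it assumes the caller already has the raw response bytes and the header value.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter, or None.
    return msg.get_content_charset() or default


def decode_body(raw, content_type, default="utf-8"):
    """Decode response bytes using the charset declared by the server."""
    charset = charset_from_content_type(content_type, default)
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: fall back, marking undecodable bytes
        # instead of silently dropping them.
        return raw.decode(default, errors="replace")


body = "ut\u00f6kad".encode("iso-8859-1")
print(decode_body(body, "text/html; charset=ISO-8859-1"))
```

Falling back with `errors="replace"` rather than `errors="ignore"` is a design choice in this sketch: a visible replacement character makes an encoding problem easier to notice than a missing letter.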

diff --git a/scrapemark.py b/scrapemark.py
index 7b4cf72..be0327c 100644
--- a/scrapemark.py
+++ b/scrapemark.py
@@ -530,7 +530,11 @@ def _decode_entities(s):
 def _substitute_entity(m):
    ent = m.group(2)
    if m.group(1) == "#":
-       return unichr(int(ent))
+       # Hex value
+       if ent[0] == 'x':
+           return unichr(int(ent[1:], 16))
+       else:
+           return unichr(int(ent))
    else:
        cp = name2codepoint.get(ent)
        if cp:
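The hunk above makes numeric character references work in hexadecimal form (such as `&#xF6;`) as well as decimal form (`&#246;`). As a standalone illustration, here is a Python 3 sketch of the same logic; it is not Scrapemark's actual code. The regular expression and the un-prefixed function names are assumptions, `chr` stands in for Python 2's `unichr`, and accepting an uppercase `X` prefix goes slightly beyond what the patch does.

```python
import re
from html.entities import name2codepoint

# Assumed pattern: group 1 is "#" for numeric references, group 2 is the body.
_ENTITY_RE = re.compile(r"&(#?)([0-9A-Za-z]+);")


def substitute_entity(m):
    ent = m.group(2)
    if m.group(1) == "#":
        try:
            if ent[0] in "xX":
                # Hexadecimal reference, e.g. &#xF6;
                return chr(int(ent[1:], 16))
            # Decimal reference, e.g. &#246;
            return chr(int(ent))
        except (ValueError, OverflowError):
            # Malformed or out-of-range reference: leave the text untouched.
            return m.group(0)
    cp = name2codepoint.get(ent)
    # Unknown named entities are also left untouched.
    return chr(cp) if cp is not None else m.group(0)


def decode_entities(s):
    return _ENTITY_RE.sub(substitute_entity, s)


print(decode_entities("ut&#xF6;kad, ut&#246;kad, ut&ouml;kad"))
```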
