@amw amw commented Mar 17, 2017

I encountered a JPEG file with an invalid Unicode string inside its "software" metadata tag. This caused a UnicodeDecodeError in filemagic's compatability.py:

    description = magic.id_buffer(chunk)
  File "/Users/amw/.virtualenvs/asd/lib/python3.6/site-packages/magic/identify.py", line 29, in wrapper
    return func(self, *args, **kwargs)
  File "/Users/amw/.virtualenvs/asd/lib/python3.6/site-packages/magic/compatability.py", line 30, in wrapper
    return func(*encoder(args), **kwargs)
  File "/Users/amw/.virtualenvs/asd/lib/python3.6/site-packages/magic/compatability.py", line 56, in wrapper
    return value.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 210: invalid continuation byte

In this PR I pass errors='replace' to the decode method, so that we return a safe string with the rest of the file description intact. The alternative is errors='ignore', which I deemed less safe.
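For illustration, here is a minimal sketch of how the three error handlers of `bytes.decode` behave on malformed UTF-8. The sample bytes and the `decode_description` helper are made up for this example and are not filemagic's actual code; only `bytes.decode` and its `errors` argument are standard Python.

```python
# Hypothetical sample, not the real file: 0xc7 opens a two-byte UTF-8
# sequence, but "P" (0x50) is not a valid continuation byte.
raw = b"software=\xc7Pro"


def decode_description(value: bytes, errors: str = "replace") -> str:
    """Decode a description without raising on malformed UTF-8."""
    return value.decode("utf-8", errors=errors)


# The default handler ("strict") raises, as in the traceback above.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD for each undecodable byte.
replaced = decode_description(raw)

# "ignore" silently drops the undecodable byte.
ignored = decode_description(raw, errors="ignore")

print(strict_failed)    # True
print(ascii(replaced))  # 'software=\ufffdPro'
print(ascii(ignored))   # 'software=Pro'
```

With `replace`, the reader can still see that something was undecodable at that position; with `ignore`, the bad byte disappears without a trace.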

When testing with my specific file, I noticed that the strings returned by replace and ignore are Unicode-equivalent when copied into some contexts (like this GitHub page), but they look different in Terminal. See the comparison text and screenshots below.

python replace

JPEG image data, JFIF standard 1.01, resolution (DPI), density 96x96, segment length 16, Exif Standard: [TIFF image data, little-endian, direntries=4, xresolution=62, yresolution=70, resolutionunit=2, software=ˮӡרҵ], baseline, precision 8, 560x372, frames 3

[Screenshot: replace]

python ignore

JPEG image data, JFIF standard 1.01, resolution (DPI), density 96x96, segment length 16, Exif Standard: [TIFF image data, little-endian, direntries=4, xresolution=62, yresolution=70, resolutionunit=2, software=ˮӡרҵ], baseline, precision 8, 560x372, frames 3

[Screenshot: ignore]

file shell command

JPEG image data, JFIF standard 1.01, resolution (DPI), density 96x96, segment length 16, Exif Standard: [TIFF image data, little-endian, direntries=4, xresolution=62, yresolution=70, resolutionunit=2, software=????ˮӡרҵ??], baseline, precision 8, 560x372, frames 3

[Screenshot: file-cmd]
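Because rendering can hide the difference between such strings, one way to compare them independently of the terminal or web page is to list their non-ASCII code points. This is only a sketch: the `non_ascii_code_points` helper and the two sample strings are hypothetical, not taken from the file above.

```python
import unicodedata


def non_ascii_code_points(text: str) -> list:
    """Return (code point, Unicode name) pairs for every non-ASCII character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


# Hypothetical results of decoding the same bytes with each handler.
replaced = "software=\ufffdPro"
ignored = "software=Pro"

print(non_ascii_code_points(replaced))  # [('U+FFFD', 'REPLACEMENT CHARACTER')]
print(non_ascii_code_points(ignored))   # []
```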

@coveralls

Coverage decreased (-3.0%) to 88.177% when pulling ae5ec7c on amw:replace-invalid-unicode into 1386490 on aliles:master.
