Adding ftfy to fix encoding issues #537

albarrentine · 2017-02-04T00:20:02Z

Hey all,

As mentioned in openaddresses/openaddresses#2392, this PR adds the supremely handy ftfy to machine to fix common encoding issues such as mojibake. It should also help with mixed encodings, for example when a source contains legacy data in a different encoding that was never converted after a migration.

As a dependency, it's a simple pure-Python library, so it doesn't add much weight, and it works on Python 2 and 3. I use it whenever I have to deal with text from the web, including on the Python/ingestion side of libpostal when importing data sets like OpenAddresses and OSM.

ftfy operates on Unicode strings (text that has already been decoded), so the change is essentially to wrap the result of every .decode call in ftfy.fix_encoding. Here's a (Python 2) usage example:

import ftfy
input = u"Р‘СЂРµСЃС‚СЃРєР°СЏ РѕР±Р»., РЎС‚РѕР»РёРЅСЃРєРёР№ СЂ-РЅ, Р‘РµР»РѕСѓС€СЃРєРёР№ СЃ/СЃ, Рґ. Р С‹Р±РЅРёРєРё, СѓР». РўРѕР»СЃС‚РѕРіРѕ, Рґ. 17"
ftfy.fix_encoding(input)
# returns: "Брестская обл., Столинский р-н, Белоушский с/с, д. Рыбники, ул. Толстого, д. 17"
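For anyone unfamiliar with mojibake, here is a minimal stdlib-only sketch (Python 3) of the kind of damage in the example above, assuming the text was UTF-8 that got decoded as Windows-1251. This is not ftfy's implementation: ftfy works out the wrong codec heuristically, while the helper names and the hard-coded cp1251 default below are purely illustrative assumptions.

```python
def simulate_mojibake(text, wrong_codec="cp1251"):
    """Corrupt text the way a bad import would: UTF-8 bytes read with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


def undo_mojibake(text, wrong_codec="cp1251"):
    """Reverse the corruption; return the input unchanged if it does not round-trip.

    Hypothetical helper for illustration only. It assumes the wrong codec is
    known in advance, which is exactly the guesswork ftfy takes care of.
    """
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "Брестская обл."
garbled = simulate_mojibake(original)   # looks like the garbled input above
repaired = undo_mojibake(garbled)       # back to the original Cyrillic
print(repaired == original)
```

Text that was never damaged fails the round trip and is returned as-is, which is roughly the behaviour you want from any fixer that runs on every decoded value.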

My dev setup for machine isn't fully baked yet, so I'm using this PR to run the tests.


migurski · 2017-02-04T20:13:18Z

Looking good, I’ll keep an eye on this PR!

albarrentine · 2017-02-05T21:34:58Z

Ran tests locally but saw the same test fail on master as on this branch. Is that a known issue?

migurski · 2017-02-05T23:17:32Z

The failing test was added here, to uncover GDAL 2.x related issues: #514

It looks like ftfy is returning a null value?

albarrentine added 2 commits February 1, 2017 14:05

add ftfy (https://github.com/LuminosoInsight/python-ftfy) to fix comm…

00539ad

…on encoding issues at te conform level

merging in master

c710edd
