This repository has been archived by the owner on May 5, 2022. It is now read-only.

Adding ftfy to fix encoding issues #537

Open: 2 commits to merge into base: master


albarrentine
Contributor

Hey all,

As mentioned in openaddresses/openaddresses#2392, this PR adds the supremely handy ftfy to machine to fix common encoding issues such as mojibake. It should also help with mixed encodings, for example when a source contains legacy data in a different encoding that was never converted after a migration.

As a dependency it's a simple pure-Python library, so it doesn't add much weight, and it works on Python 2 & 3. I use it whenever I have to deal with text from the web, including on the Python/ingestion side of libpostal when importing data sets like OpenAddresses and OSM.

It works on already-decoded Unicode strings, so essentially just wrap the result of every call to .decode in a call to ftfy.fix_encoding. Here's a (Python 2) usage example:

import ftfy
# "text" rather than "input", to avoid shadowing the builtin
text = u"Брестская обл., Столинский р-н, Белоушский с/с, д. Рыбники, ул. Толстого, д. 17"
ftfy.fix_encoding(text)
# returns: "Брестская обл., Столинский р-н, Белоушский с/с, д. Рыбники, ул. Толстого, д. 17"
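
To make the failure mode concrete, here is a minimal stdlib-only sketch (not part of this PR, and not using ftfy itself) of how mojibake arises and how it can be reversed by hand when the wrong codec is known. The sample string and the choice of latin-1 as the misapplied codec are assumptions for illustration; ftfy.fix_encoding is meant to detect and undo this sort of damage without being told which codec was misapplied.

```python
# -*- coding: utf-8 -*-
# Illustrative only: a hand-rolled round trip, not ftfy's implementation.
original = u"ул. Толстого, д. 17"

# Mojibake: correct UTF-8 bytes decoded with the wrong single-byte codec.
garbled = original.encode("utf-8").decode("latin-1")

# Manual repair: undo the wrong decode, then decode the bytes as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")

assert garbled != original
assert repaired == original
```

The manual repair only works because the wrong codec is known in advance. For sources with mixed or unknown legacy encodings that is usually not the case, which is the gap a heuristic library fills.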

At present my dev setup for machine is not fully baked, so I'm using this PR to run the tests.

Member

migurski commented Feb 4, 2017

Looking good, I’ll keep an eye on this PR!

@albarrentine
Contributor Author

Ran tests locally but saw the same test fail on master as on this branch. Is that a known issue?

Member

migurski commented Feb 5, 2017

The failing test was added here, to uncover GDAL 2.x related issues: #514

It looks like ftfy is returning a null value?
