
Address matching algorithm #23

Open
derekeder opened this issue Feb 13, 2013 · 11 comments

@derekeder (Collaborator)

Determine a quick, accurate way of matching addresses across multiple datasets.

@matthewgee (Collaborator)

I'll do it! Pick me!

@derekeder (Collaborator, Author)

Related to #22.

@mccc (Collaborator) commented Feb 13, 2013

Out of ignorance, I'll chime in and just note that there are a wide variety of commercial applications, bordering on cheap (I've seen anywhere from $100 to $1,000), for standardizing (and thereby matching) addresses. While I assume this kind of thing has never been an open-source project due to the boring maintenance aspects (keeping track of new construction, changed zip codes, etc.), I wonder if it wouldn't be possible to achieve a high degree of accuracy if we just limited ourselves to Chicago. This would of course depend on what address standardization tools the city departments are using internally -- does anyone know about this?

@jpvelez (Collaborator) commented Feb 13, 2013

When he first assembled edifice, Cory used a bag of tricks to get addresses to match. Getting these from him would save a ton of time.


@mccc (Collaborator) commented Feb 13, 2013

Yes, that is true; Cory's brain and/or command history is likely a rich resource here. I also assume he knows which datasets have the most reliable addresses (i.e. ones that have likely already gone through a CASS-certified converter) and which are really unreliable. This may affect the order in which we import data, as well (e.g. is the address data in building_footprints good or bad?).

@fgregg (Collaborator) commented Feb 13, 2013

From what I remember reading in this area, there is no better approach than using a gazetteer (if available). For Chicago, we know all the street names and their address ranges: https://data.cityofchicago.org/Transportation/Chicago-Street-Names/i6bp-fvbx

Taking that as the gazetteer, the task is to find the standardized street name that is most similar to our query address.

That would standardize the street name, and often the direction.

If we had a source of trusted addresses or finer-resolution address ranges (maybe the building footprints?), then matching against that gazetteer is the best way to go.

For comparing the similarity of a query address to a target address I would recommend the Levenshtein distance or a modification like the affine-gap distance we use for dedupe.

This is more flexible and will tend to be much more accurate than regexp or similar tricks.
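
To make the comparison step concrete, here is a minimal sketch (not from dedupe or any existing library) that matches a query against a gazetteer of standardized street names by Levenshtein distance; the three-entry GAZETTEER is a hypothetical stand-in for the full city dataset linked above:

```python
# Minimal sketch: pick the gazetteer entry with the smallest Levenshtein
# (edit) distance to a messy query string. GAZETTEER is a hypothetical
# three-entry stand-in for the full Chicago street-names dataset.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

GAZETTEER = ["N MILWAUKEE AVE", "W NORTH AVE", "N WESTERN AVE"]

def best_match(query: str) -> tuple[str, int]:
    """Return the closest standardized street name and its distance."""
    q = query.strip().upper()
    return min(((name, levenshtein(q, name)) for name in GAZETTEER),
               key=lambda pair: pair[1])

print(best_match("n. milwakee ave"))  # -> ('N MILWAUKEE AVE', 2)
```

An affine-gap variant would score a run of consecutive insertions or deletions more cheaply than the same number of scattered edits, which tends to work better for addresses with dropped or added words.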

@danxoneil

fwiw, the gazetteer approach was what we used at EveryBlock when we introduced address pages: http://blog.everyblock.com/2009/oct/14/addresspages/. I did a lot of user feedback monitoring/response, and I never had anybody complain of mismatching. It worked really well.

@jpvelez (Collaborator) commented Feb 25, 2013

Via @daguar: Code for America's New Orleans team built a site called BlightStatus last year. It lets you look up whether a NOLA property is blighted and what the city is doing about it. The data originally came from a mess of disparate, super-dirty spreadsheets, so they spent a ton of time writing scripts to do address normalization. Check them out. CfA fellow @eddietejeda knows more.

Also here's a Ruby port of a Perl library that's supposed to be pretty good at address normalization.

@fgregg (Collaborator) commented Mar 20, 2013

@danxoneil, do you know if any code from the EveryBlock gazetteer ever made it across? If not, do you know if the person who worked on it would be willing to chat with me about the approach they took?

@danxoneil

@fgregg https://github.com/paulsmith is yer man on dat.

@fgregg (Collaborator) commented Mar 21, 2013

@paulsmith gave us some pointers:

the EveryBlock code lives on in a project called OpenBlock. I would start with the geocoder we wrote:

https://github.com/openplans/openblock/tree/master/ebpub/ebpub/geocoder

especially the parser:

https://github.com/openplans/openblock/blob/master/ebpub/ebpub/geocoder/parser/parsing.py

The approach we took was roughly to parse the location into a set of possible matches, and then query the database to see which ones actually exist in the data.

https://github.com/openplans/openblock/blob/master/ebpub/ebpub/geocoder/base.py#L423

That should be a good starting point.
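
As a rough illustration of that parse-then-verify idea (not the OpenBlock code itself), the following sketch expands a raw string into candidate normalized forms and keeps only those that exist in the data; the abbreviation tables and the KNOWN_BLOCKS set are hypothetical stand-ins for a real database query:

```python
# Rough sketch of the parse-then-verify approach described above (not the
# OpenBlock code itself): expand a raw location string into candidate
# normalized forms, then keep only the candidates that actually exist in
# the data. KNOWN_BLOCKS and the abbreviation tables are hypothetical
# stand-ins for a real database query.

DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}
SUFFIXES = {"AVENUE": "AVE", "STREET": "ST", "BOULEVARD": "BLVD"}

KNOWN_BLOCKS = {"1800 N MILWAUKEE AVE"}  # stand-in for the address table

def candidates(raw: str) -> set[str]:
    """Generate plausible normalized spellings of a raw address string."""
    tokens = raw.upper().replace(".", "").split()
    variants = {" ".join(tokens)}  # the string as given, lightly cleaned
    # Swap in standard abbreviations for directions and street suffixes.
    normalized = [DIRECTIONS.get(t, SUFFIXES.get(t, t)) for t in tokens]
    variants.add(" ".join(normalized))
    return variants

def geocode(raw: str) -> list[str]:
    """Return the candidate parses that exist in the known data."""
    return [c for c in candidates(raw) if c in KNOWN_BLOCKS]

print(geocode("1800 North Milwaukee Avenue"))  # -> ['1800 N MILWAUKEE AVE']
```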

There's significant overlap between the parser and this Python port of http://search.cpan.org/~timb/Geo-StreetAddress-US-1.03/US.pm

I think it's pretty interesting that the EveryBlock code did not do any on-the-fly fuzzy matching (there's a table of street misspellings that it looks at instead). We should follow up on why. Maybe EveryBlock just wanted to be very conservative and was willing to trade off recall for precision.
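
For illustration, a conservative lookup along those lines might look like this minimal sketch; the misspelling entries are hypothetical examples:

```python
# The conservative alternative described above: instead of on-the-fly fuzzy
# matching, keep an explicit table of known misspellings and correct only
# what the table covers. These entries are hypothetical examples.

MISSPELLINGS = {
    "MILWAKEE": "MILWAUKEE",
    "WESTRN": "WESTERN",
}

def correct_street(name: str) -> str:
    """Fix a street name only if it appears in the misspelling table."""
    name = name.upper()
    return MISSPELLINGS.get(name, name)

print(correct_street("Milwakee"))   # -> 'MILWAUKEE' (known misspelling)
print(correct_street("Milwauke"))   # -> 'MILWAUKE' (unknown: left as-is)
```

The trade-off is explicit: unknown misspellings pass through uncorrected (hurting recall), but the table never introduces a false match (protecting precision).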
