
Address matching algorithm #23

Open
derekeder opened this issue Feb 13, 2013 · 11 comments

@derekeder (Collaborator)

Determine a quick, accurate way of matching addresses across multiple datasets.

@matthewgee (Collaborator)

I'll do it! Pick me!

@derekeder (Collaborator, Author)

Related to #22.

@mccc (Collaborator) commented Feb 13, 2013

Out of ignorance, I'll chime in and just note that there are a wide variety of commercial applications, bordering on cheap (I've seen anywhere from $100 to $1,000), for standardizing (and thereby matching) addresses. While I assume this kind of thing has never been an open-source project due to the boring maintenance aspects (keeping track of new construction, changed zip codes, etc.), I wonder if it wouldn't be possible to achieve a high degree of accuracy if we just limited ourselves to Chicago. This would of course depend on what address standardization tools the city departments are using internally -- does anyone know about this?

@jpvelez (Collaborator) commented Feb 13, 2013

When he first assembled edifice, Cory used a bag of tricks to get addresses to match. Getting these from him would save a ton of time.


@mccc (Collaborator) commented Feb 13, 2013

Yes, that is true; Cory's brain and/or command history is likely a rich resource here. I also assume he knows which datasets have the most reliable addresses (i.e. ones that have likely already gone through a CASS-certified converter) and which are really unreliable. This may affect the order in which we import data, as well (e.g. is the address data in building_footprints good or bad?).

@fgregg (Collaborator) commented Feb 13, 2013

From what I remember reading in this area, there is no better approach than using a gazetteer (if available). For Chicago, we know all the street names and their address ranges: https://data.cityofchicago.org/Transportation/Chicago-Street-Names/i6bp-fvbx

Taking that as the gazetteer, the task is to find the standardized street name that is most similar to our query address.

That would standardize the street name, and often the direction.

If we had a source of trusted addresses or finer-resolution address ranges (maybe the building footprints?), then matching against that gazetteer is the best way to go.

For comparing the similarity of a query address to a target address I would recommend the Levenshtein distance or a modification like the affine-gap distance we use for dedupe.

This is more flexible and will tend to be much more accurate than regexp or similar tricks.
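
To make the comparison step concrete, here is a minimal sketch (not from dedupe or any existing library) that matches a query against a gazetteer of standardized street names by Levenshtein distance; the three-entry GAZETTEER is a hypothetical stand-in for the full city dataset linked above:

```python
# Minimal sketch: pick the gazetteer entry with the smallest Levenshtein
# (edit) distance to a messy query string. GAZETTEER is a hypothetical
# three-entry stand-in for the full Chicago street-names dataset.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

GAZETTEER = ["N MILWAUKEE AVE", "W NORTH AVE", "N WESTERN AVE"]

def best_match(query: str) -> tuple[str, int]:
    """Return the closest standardized street name and its distance."""
    q = query.strip().upper()
    return min(((name, levenshtein(q, name)) for name in GAZETTEER),
               key=lambda pair: pair[1])

print(best_match("n. milwakee ave"))  # -> ('N MILWAUKEE AVE', 2)
```

An affine-gap variant would score a run of consecutive insertions or deletions more cheaply than the same number of scattered edits, which tends to work better for addresses with dropped or added words.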

@danxoneil

fwiw, the gazetteer approach was what we used at EveryBlock when we introduced address pages: http://blog.everyblock.com/2009/oct/14/addresspages/. I did a lot of user feedback monitoring/response, and I never had anybody complain of mismatching. It worked really well.

@jpvelez (Collaborator) commented Feb 25, 2013

Via @daguar: Code for America's New Orleans team built a site called BlightStatus last year. It lets you look up whether a NOLA property is blighted and what the city is doing about it. The data originally came from a mess of disparate, super-dirty spreadsheets, so they spent a ton of time writing scripts to do address normalization. Check them out. CfA fellow @eddietejeda knows more.

Also here's a Ruby port of a Perl library that's supposed to be pretty good at address normalization.

@fgregg (Collaborator) commented Mar 20, 2013

@danxoneil, do you know if any code from the EveryBlock gazetteer ever made it across? If not, do you know if the person who worked on it would be willing to chat with me about the approach they took?

@danxoneil

@fgregg https://github.com/paulsmith is yer man on dat.

@fgregg (Collaborator) commented Mar 21, 2013

@paulsmith gave us some pointers:

the EveryBlock code lives on in a project called OpenBlock. I would start with the geocoder we wrote:

https://github.com/openplans/openblock/tree/master/ebpub/ebpub/geocoder

especially the parser:

https://github.com/openplans/openblock/blob/master/ebpub/ebpub/geocoder/parser/parsing.py

The approach we took was roughly to parse the location into a set of possible matches, and then query the database to see which ones actually exist in the data.

https://github.com/openplans/openblock/blob/master/ebpub/ebpub/geocoder/base.py#L423

That should be a good starting point.
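
As a rough illustration of that parse-then-verify idea (not the OpenBlock code itself), the following sketch expands a raw string into candidate normalized forms and keeps only those that exist in the data; the abbreviation tables and the KNOWN_BLOCKS set are hypothetical stand-ins for a real database query:

```python
# Rough sketch of the parse-then-verify approach described above (not the
# OpenBlock code itself): expand a raw location string into candidate
# normalized forms, then keep only the candidates that actually exist in
# the data. KNOWN_BLOCKS and the abbreviation tables are hypothetical
# stand-ins for a real database query.

DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}
SUFFIXES = {"AVENUE": "AVE", "STREET": "ST", "BOULEVARD": "BLVD"}

KNOWN_BLOCKS = {"1800 N MILWAUKEE AVE"}  # stand-in for the address table

def candidates(raw: str) -> set[str]:
    """Generate plausible normalized spellings of a raw address string."""
    tokens = raw.upper().replace(".", "").split()
    variants = {" ".join(tokens)}  # the string as given, lightly cleaned
    # Swap in standard abbreviations for directions and street suffixes.
    normalized = [DIRECTIONS.get(t, SUFFIXES.get(t, t)) for t in tokens]
    variants.add(" ".join(normalized))
    return variants

def geocode(raw: str) -> list[str]:
    """Return the candidate parses that exist in the known data."""
    return [c for c in candidates(raw) if c in KNOWN_BLOCKS]

print(geocode("1800 North Milwaukee Avenue"))  # -> ['1800 N MILWAUKEE AVE']
```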

There's significant overlap between the parser and this Python port of http://search.cpan.org/~timb/Geo-StreetAddress-US-1.03/US.pm

I think it's pretty interesting that the EveryBlock code did not do any on-the-fly fuzzy matching (there's a table of street misspellings that it looks at instead). We should follow up on why. Maybe EveryBlock just wanted to be very conservative and was willing to trade off recall for precision.
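
For illustration, a conservative lookup along those lines might look like this minimal sketch; the misspelling entries are hypothetical examples:

```python
# The conservative alternative described above: instead of on-the-fly fuzzy
# matching, keep an explicit table of known misspellings and correct only
# what the table covers. These entries are hypothetical examples.

MISSPELLINGS = {
    "MILWAKEE": "MILWAUKEE",
    "WESTRN": "WESTERN",
}

def correct_street(name: str) -> str:
    """Fix a street name only if it appears in the misspelling table."""
    name = name.upper()
    return MISSPELLINGS.get(name, name)

print(correct_street("Milwakee"))   # -> 'MILWAUKEE' (known misspelling)
print(correct_street("Milwauke"))   # -> 'MILWAUKE' (unknown: left as-is)
```

The trade-off is explicit: unknown misspellings pass through uncorrected (hurting recall), but the table never introduces a false match (protecting precision).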
