Address matching algorithm #23
I'll do it! Pick me!
related to #22
Out of ignorance, I'll chime in and just note that there is a wide variety of commercial applications, bordering on cheap (I've seen anywhere from $100 to $1000), for standardizing (and thereby matching) addresses. While I assume this kind of thing has never been an open-source project due to boring maintenance aspects (keeping track of new construction, changed zip codes, etc.), I wonder if it wouldn't be possible to achieve a high degree of accuracy if we just limited ourselves to Chicago. This would of course depend on what address standardization tools the city departments are using internally -- does anyone know about this?
When he first assembled Edifice, Cory used a bag of tricks to get addresses.
Yes, that is true; it is likely that Cory's brain and/or command history is a rich resource here. I also assume that he knows which datasets have the most reliable addresses (i.e. something that has likely already gone through a CASS-certified converter) and which are really unreliable. This may have an effect on the order in which we import data, as well. (e.g. is the address data in building_footprints good or bad?)
From what I remember reading in this area, there is no better approach than using a gazetteer (if available). For Chicago, we know all the street names and their address ranges: https://data.cityofchicago.org/Transportation/Chicago-Street-Names/i6bp-fvbx Taking that as the gazetteer, the task is to find the standardized street name that is most similar to our query address. That would standardize the street name, and often the direction. If we had a source of trusted addresses or smaller-resolution address ranges (maybe the building footprints?), then matching against that gazetteer is the best way to go. For comparing the similarity of a query address to a target address, I would recommend the Levenshtein distance or a modification like the affine-gap distance we use for dedupe. This is more flexible and will tend to be much more accurate than regexp or similar tricks.
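To make the gazetteer idea concrete, here's a minimal sketch in Python. It uses plain Levenshtein distance (not the affine-gap variant used in dedupe) and picks the closest entry from a tiny hand-written stand-in gazetteer; the `GAZETTEER` list and the normalization are illustrative assumptions, not the actual Chicago Street Names schema.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, two-row version."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

# Hypothetical slice of the gazetteer; the real dataset would supply these.
GAZETTEER = ["N MILWAUKEE AVE", "N WESTERN AVE", "W DIVISION ST"]

def best_match(query_street):
    """Return the gazetteer entry with the smallest edit distance to the query."""
    query = query_street.upper().strip()
    return min(GAZETTEER, key=lambda name: levenshtein(query, name))

print(best_match("N Milwakee Ave"))  # -> "N MILWAUKEE AVE"
```

In practice you'd also want a distance threshold so a garbage query doesn't silently "match" its nearest neighbor.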
fwiw, the gazetteer approach was what we used at EveryBlock when we introduced address pages: http://blog.everyblock.com/2009/oct/14/addresspages/. I did a lot of user feedback monitoring/response, and I never had anybody complain of mis-matching. It worked really well.
Via @daguar: Code for America's New Orleans team built a site called BlightStatus last year. It lets you look up whether a NOLA property is blighted and what the city is doing about it. The data originally came from a mess of disparate, super-dirty spreadsheets, so they spent a ton of time writing scripts to do address normalization. Check them out. CfA fellow @eddietejeda knows more. Also here's a Ruby port of a Perl library that's supposed to be pretty good at address normalization.
@danxoneil, do you know if any code from the EveryBlock gazetteer ever made it across? If not, do you know if the person who worked on it would be willing to chat with me about the approach they took?
@fgregg https://github.com/paulsmith is yer man on dat. |
@paulsmith gave us some pointers:
There's significant overlap between the parser and this Python port of http://search.cpan.org/~timb/Geo-StreetAddress-US-1.03/US.pm I think it's pretty interesting that the EveryBlock code did not do any on-the-fly fuzzy matching (there's a table of street misspellings that it looks at). We should follow up on why. Maybe EveryBlock just wanted to be very conservative and was willing to trade off recall for precision.
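For contrast with fuzzy matching, here's a hedged sketch of that lookup-table style: a hand-maintained dict of known street misspellings (the entries below are made up for illustration, not taken from EveryBlock's actual table). Anything not in the table passes through unchanged, which is the conservative recall-for-precision trade-off described above.

```python
# Hypothetical misspellings table; EveryBlock's real table is not public here.
MISSPELLINGS = {
    "MILWAKEE": "MILWAUKEE",
    "WACKAR": "WACKER",
}

def correct_street(street):
    """Replace a known misspelling with its canonical form, else pass through."""
    token = street.upper().strip()
    return MISSPELLINGS.get(token, token)

print(correct_street("Milwakee"))  # -> "MILWAUKEE"
print(correct_street("Halsted"))   # -> "HALSTED" (no entry, left unchanged)
```

The appeal is predictability: a curated table can never produce a surprising match, whereas on-the-fly edit-distance matching occasionally will.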
Determine a quick, accurate way of matching addresses across multiple datasets.