De-dupe donor and vendor data #10

Open
waldoj opened this issue Jun 5, 2013 · 2 comments

Comments


waldoj commented Jun 5, 2013

There are a few good tools for this:

https://github.com/cjdd3b/fec-standardizer
https://github.com/huffpostdata/campfin-linker
https://github.com/open-city/dedupe


waldoj commented Oct 3, 2013

After researching all three, I'm inclined to go with @open-city's library. @cjdd3b's appeals to me, but I think the former is going to be simpler to get running.


waldoj commented Dec 29, 2013

A few months later, I'm not so sure. I'm keeping this data in ElasticSearch now, and I'm persuaded that it's the proper vehicle for manipulating it. But of course it remains essential to de-dupe donor and vendor records.

ElasticSearch's FuzzyLikeThis query is promising. Here's a blog entry on the topic.
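For illustration, here is a rough sketch of what a `fuzzy_like_this` request body might look like, built as a Python dict. The field names (`name`, `address`) and the sample text are made up for this example and are not taken from the project's actual index mapping. Note also that this query type was later deprecated and removed in Elasticsearch 2.0, so treat this as a sketch for the versions current at the time.

```python
import json


def fuzzy_like_this_query(like_text, fields, max_query_terms=12):
    """Build a request body for a fuzzy_like_this query.

    `fields` are hypothetical index field names; adjust them to the
    real mapping before using this against an actual index.
    """
    return {
        "query": {
            "fuzzy_like_this": {
                "fields": list(fields),
                "like_text": like_text,
                "max_query_terms": max_query_terms,
            }
        }
    }


# Made-up vendor name, purely for illustration.
body = fuzzy_like_this_query("Acme Widgets Inc", ["name", "address"])
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of a search request against whichever index holds the donor and vendor records.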

I'm also thinking that this problem could be offloaded by geocoding every address (basically farming out the problem to a more intelligent service), and then using the lat/lon pair combined with the name to figure out whether it's the same vendor / contributor. At 0.8¢/query via Yahoo, that could get expensive fast. (Over $900, by my math!) Google sells the same service, but it's apparently so expensive that they're not even naming the price. :-/
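To make the geocoding idea concrete, here is a minimal sketch of how a name plus lat/lon pair could be turned into a grouping key. Everything in it is an assumption for illustration: the record layout (`name`, `lat`, `lon` keys), the suffix list, and the choice of rounding coordinates to four decimal places are not from any of the tools linked above, and the sample records are invented.

```python
import re
from collections import defaultdict

# Hypothetical list of corporate suffixes to ignore when comparing names.
SUFFIXES = {"inc", "llc", "corp", "co", "ltd"}


def normalize_name(name):
    """Lowercase, strip punctuation, and drop common corporate suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in SUFFIXES]
    return " ".join(tokens)


def dedupe_key(name, lat, lon, precision=4):
    """Combine the normalized name with rounded coordinates.

    Rounding is a crude stand-in for a distance check: two points that
    are very close can still land on opposite sides of a rounding boundary.
    """
    return (normalize_name(name), round(lat, precision), round(lon, precision))


def group_records(records):
    """Group records that share the same dedupe key."""
    groups = defaultdict(list)
    for rec in records:
        groups[dedupe_key(rec["name"], rec["lat"], rec["lon"])].append(rec)
    return groups


# Invented sample records, as if already geocoded.
records = [
    {"name": "Acme Widgets, Inc.", "lat": 37.54071, "lon": -77.43605},
    {"name": "ACME WIDGETS INC", "lat": 37.54069, "lon": -77.43604},
    {"name": "Bolt Supply LLC", "lat": 38.02931, "lon": -78.47668},
]
groups = group_records(records)
```

With these sample records, the two Acme entries fall into one group and Bolt Supply into another. A real implementation would probably want a proper distance threshold and fuzzy name comparison rather than exact key equality.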
