Duplicates in processed data #324

Open
pnoll1 opened this issue Feb 19, 2023 · 2 comments
Labels
bug Something isn't working

Comments


pnoll1 commented Feb 19, 2023

Describe the bug
There are multiple copies of the same records in the processed data.

To Reproduce

  • View the us ri providence data from 1/15/23
  • View duplicates based on hash:
    SELECT hash, COUNT(*) FROM us_ri_providence_addresses_city GROUP BY hash HAVING COUNT(*) > 1;
    • 1aab47288c5ab757 is duplicated 21 times
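
For anyone who wants to try the check without loading the full dataset, here is a minimal sketch of the same query run against an in-memory SQLite table. The table name `addresses`, the `number` and `street` columns, and the sample rows are all made up for illustration; only the `hash` column and the GROUP BY / HAVING pattern come from the query above.

```python
import sqlite3

# Toy stand-in for the processed table. Every column except `hash`
# is an assumption, and the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (hash TEXT, number TEXT, street TEXT)")
conn.executemany(
    "INSERT INTO addresses VALUES (?, ?, ?)",
    [
        ("aaaa", "1", "MAIN ST"),
        ("aaaa", "1", "MAIN ST"),
        ("bbbb", "2", "OAK AVE"),
    ],
)


def duplicate_hashes(conn):
    """Return (hash, count) pairs for hashes that occur more than once."""
    query = """
        SELECT hash, COUNT(*) AS n
        FROM addresses
        GROUP BY hash
        HAVING COUNT(*) > 1
        ORDER BY n DESC, hash
    """
    return conn.execute(query).fetchall()


def distinct_rows_for(conn, hash_value):
    """Count distinct full rows sharing one hash (1 means exact duplicates)."""
    query = """
        SELECT COUNT(*) FROM (
            SELECT DISTINCT * FROM addresses WHERE hash = ?
        )
    """
    return conn.execute(query, (hash_value,)).fetchone()[0]


dupes = duplicate_hashes(conn)
print(dupes)
print(distinct_rows_for(conn, "aaaa"))
```

The second helper is there to separate exact duplicate rows from different rows that merely share a hash: a result of 1 means every row under that hash is identical.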

Expected behavior
Duplicates are removed after hashing.

Additional context
It's unclear whether this is an intended limitation, since I haven't seen any documentation on what guarantees are given for the data.

pnoll1 added the bug label Feb 19, 2023
Member

iandees commented Feb 20, 2023

The hash attribute is a hash of the other attributes in the row, so if the row is empty then the hash will be the same.
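
To illustrate why empty rows end up sharing a hash, here is a rough sketch. The hashing scheme is an assumption, not the project's actual implementation: `row_hash` is a hypothetical helper that joins the attribute values in key order, applies SHA-1, and keeps the first 16 hex characters (chosen only because the hash quoted in the issue is 16 characters long). The field names are also made up.

```python
import hashlib


def row_hash(attributes):
    """Hypothetical row hash: SHA-1 over the attribute values, truncated.

    Assumption: missing values (None or "") are treated as empty strings,
    and values are joined in sorted key order so the result is stable.
    """
    canonical = "|".join(str(attributes.get(key) or "") for key in sorted(attributes))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:16]


# Two rows with no data, and one fully populated row (sample values).
empty_a = {"number": "", "street": "", "city": ""}
empty_b = {"number": None, "street": None, "city": None}
filled = {"number": "1", "street": "MAIN ST", "city": "PROVIDENCE"}

print(row_hash(empty_a) == row_hash(empty_b))  # empty rows collide
print(row_hash(filled) == row_hash(empty_a))   # a populated row differs
```

Under any scheme of this shape, two rows with identical attribute values also produce identical hashes, so a repeated hash on a fully populated row points to repeated rows in the input rather than to empty data.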

Author

pnoll1 commented Feb 21, 2023

Thanks for the link, very helpful.

The records I’m talking about above are valid addresses with all the fields filled out. I didn’t expect duplicates because I thought addresses were deduplicated between files, which is why so many files contain only hashes.
