check-locations: Columns have mixed types #249

Open
ivan-aksamentov opened this issue Dec 10, 2021 · 0 comments
Labels
bug Something isn't working

Comments

@ivan-aksamentov
Member

ivan-aksamentov commented Dec 10, 2021

A snippet from the GISAID ingest log:

+ ./bin/check-locations data/gisaid/metadata.tsv data/gisaid/location_hierarchy.tsv gisaid_epi_isl
sys:1: DtypeWarning: Columns (9,28,37,39,43) have mixed types.Specify dtype option on import or set low_memory=False

This is probably caused by unexpected types in the data, in either metadata.tsv or location_hierarchy.tsv. For example, we might be assuming that a column contains numbers, while in reality it contains mostly numbers plus a few stray strings. But it might also be something more sophisticated.
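To check the "mostly numbers plus a few strings" hypothesis, one option is to tally what kinds of values the flagged columns actually hold. The sketch below is only an illustration: `classify` and `profile_columns` are hypothetical helpers, not part of the ingest scripts, and it assumes that the indices in the warning (9, 28, 37, 39, 43) are zero-based column positions. The bucketing is a rough approximation and does not replicate how pandas infers types.

```python
import csv
from collections import Counter


def classify(value):
    """Roughly bucket a raw TSV cell as 'empty', 'number' or 'text'."""
    if value.strip() == "":
        return "empty"
    try:
        float(value)  # note: also accepts strings such as 'nan' and '1e3'
        return "number"
    except ValueError:
        return "text"


def profile_columns(lines, column_indices, delimiter="\t"):
    """Count value kinds per column; `lines` is any iterable of TSV lines."""
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)
    counts = {i: Counter() for i in column_indices}
    for row in reader:
        for i in column_indices:
            if i < len(row):  # tolerate short rows
                counts[i][classify(row[i])] += 1
    # Report by column name so the result is easier to read
    return {header[i]: dict(counts[i]) for i in column_indices if i < len(header)}


# Possible usage (path taken from the log above):
# with open("data/gisaid/metadata.tsv", newline="") as f:
#     for name, kinds in profile_columns(f, [9, 28, 37, 39, 43]).items():
#         print(name, kinds)
```

A column that reports both 'number' and 'text' counts would be a good first candidate to look at.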

How to investigate:

This can be investigated in isolation from the pipeline, by running the ./bin/check-locations script.

  • download data/gisaid/metadata.tsv from S3
  • review input data files for obvious defects
  • review ./bin/check-locations for obvious defects
  • try a binary search on the input data files to find the rows that trigger the issue: delete half of the file; if the problem goes away, then the problem is in the other half; if not, remove half of what remains. Repeat until the minimal set of rows that reproduces the issue is found, then inspect these rows.
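The binary search step could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual procedure used in the pipeline: `bisect_rows` and `has_non_numeric` are hypothetical names, and the predicate is a stand-in for "the problem reproduces". In practice the predicate might instead write the subset to a temporary file and run ./bin/check-locations on it.

```python
def has_non_numeric(rows, column):
    """Stand-in predicate: True if any non-empty cell in `column` is not a number."""
    for row in rows:
        value = row[column].strip()
        if not value:
            continue  # empty cells are treated as missing, not as text
        try:
            float(value)
        except ValueError:
            return True
    return False


def bisect_rows(rows, reproduces):
    """Shrink `rows` by halving for as long as `reproduces(subset)` stays true."""
    current = list(rows)
    while len(current) > 1:
        mid = len(current) // 2
        first, second = current[:mid], current[mid:]
        if reproduces(first):
            current = first
        elif reproduces(second):
            current = second
        else:
            # Neither half reproduces alone: the issue needs rows from both
            # halves, so stop and inspect what is left by hand.
            break
    return current


# Possible usage with a toy data set: column 0 holds one stray string.
rows = [["1"], ["2"], ["x"], ["4"], ["5"]]
culprits = bisect_rows(rows, lambda subset: has_non_numeric(subset, 0))
```

One caveat: if the warning only appears when rows of different kinds are read together, neither half may reproduce it alone, which is why the sketch stops early in that case instead of discarding data.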

There are a few places throughout the ingest scripts where these warnings were silenced by setting low_memory=False, as proposed in the warning message. But this might not be what we need: it might just hide programming mistakes and generate bogus outputs. We might need to search for occurrences of low_memory in the codebase to see what's going on there.
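The search for low_memory can be done with any grep-like tool. As a minimal Python sketch (the `find_occurrences` helper is hypothetical, and the choice of directory to scan is an assumption):

```python
from pathlib import Path


def find_occurrences(root, needle="low_memory"):
    """Return (path, line_number, line) for each line under `root` containing `needle`."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), number, line.strip()))
    return hits


# Possible usage, assuming the scripts live under ./bin:
# for path, number, line in find_occurrences("bin"):
#     print(f"{path}:{number}: {line}")
```

It is probably best to point this at the script directories only, because it reads every file in full and the data directory contains large TSV files.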

@ivan-aksamentov ivan-aksamentov added the bug Something isn't working label Dec 10, 2021