Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run ./bin/check-annotations as CI (or similar) #39

Open
jameshadfield opened this issue May 1, 2020 · 11 comments · Fixed by #40
Open

Run ./bin/check-annotations as CI (or similar) #39

jameshadfield opened this issue May 1, 2020 · 11 comments · Fixed by #40
Assignees
Labels
enhancement New feature or request

Comments

@jameshadfield
Copy link
Member

#38 introduced ./bin/check-annotations which helps identify problems with the contents of annotations.tsv. This highlights issues that may only become evident stochastically after running the auspice build.

It would be nice to run this as a CI (or similar). For this to happen the script would have to be modified to exit appropriately which it doesn't currently do.

@jameshadfield jameshadfield added the enhancement New feature or request label May 1, 2020
@emmahodcroft
Copy link
Member

emmahodcroft commented May 1, 2020

This would be really good. Additionally - just realised another check we should do - if a sequence name listed in annotations.tsv is not found in the metadata.

This can indicate a sequence name has changed, and thus the annotations file needs to be updated. I'll see if I can add this if I have time.
Would clearly be most valuable if this is run automatically as James suggests above!

@jstoja
Copy link

jstoja commented May 1, 2020

Hey @emmahodcroft I just saw your suggestion about sequences checking.
I'm just wondering:

Thanks

@kairstenfay
Copy link
Contributor

Hey @emmahodcroft I just saw your suggestion about sequences checking.
I'm just wondering:

* Should it check the sequences against from [`ncov/data/metadata.tsv`](https://github.com/nextstrain/ncov/blob/master/data/metadata.tsv) or `ncov-ingest/data/metadata.tsv` (this repository) ?

* Since I don't have the `metadata.tsv` file from this repository, can I base myself from [`ncov/data/metadata.tsv`](https://github.com/nextstrain/ncov/blob/master/data/metadata.tsv) to implement this additional check in #40 ?

Thanks

Hi @jstoja. Please run the checks against the metadata.tsv file in this repo (ncov-ingest). You can base yourself from another copy of the metadata.tsv (e.g. one from the ncov repo) for development, because both files have the same structure.

@emmahodcroft
Copy link
Member

Sorry for the delay. Thanks for answering @kairstenfay ! Important that we check against the metadata.tsv file in this repo, as ncov/data/metadata.tsv no longer exists on Github. If you are a registered GISAID user, you can download this file from their website - if that's helpful let me know and I can guide you in how to do that.

@kairstenfay
Copy link
Contributor

Reopening this issue that was automatically closed by #40 because the additional idea proposed by @emmahodcroft of making sure strains listed in annotations.tsv appear in metadata.tsv has not yet been implemented.

source

@kairstenfay kairstenfay reopened this May 14, 2020
@jstoja
Copy link

jstoja commented May 14, 2020

Hello @kairstenfay In #40 I didn't add bin/check-annotations to the CI (I didn't add it back with our last discussions on where to call it).
I'll open another PR to add the suggestion from @emmahodcroft , and at the same time, add the check as CI.
Annotations are used by bin/transform, but metadata is generated by this step, therefore, there are 2 options:

  1. Check annotations before the transform, and then check them again after the metadata have been generated by bin/transform.
  2. Check annotations with and without metadata after the transform.

I think it'd make more sense to use option 1. (do 2 checks that are different), because we could prevent even fetching and uploading data/gisaid.ndjson if the annotations file has issues.

@kairstenfay
Copy link
Contributor

kairstenfay commented May 14, 2020

Hello @kairstenfay In #40 I didn't add bin/check-annotations to the CI (I didn't add it back with our last discussions on where to call it).

No worries, @jstoja. I went ahead and added it before merging.

I'll open another PR to add the suggestion from @emmahodcroft , and at the same time, add the check as CI.

That would be awesome if you wanted to open another PR for the final suggestion in this thread, but I think a little more discussion needs to happen about where the logic in check-annotations should happen. As you saw in #47, adding check-annotations to the CI caused builds to fail. This is no fault of yours. I neglected to rebase your PR onto master and catch that check-annotations became outdated compared to the logic in transform. I'm going to post either here or in #47 a summary of a discussion I had last night with fellow Nextstrain team members about what went wrong and why.

Annotations are used by bin/transform, but metadata is generated by this step, therefore, there are 2 options:

1. Check annotations before the transform, and then check them again after the metadata have been generated by `bin/transform`.

2. Check annotations with and without metadata after the transform.

I think it'd make more sense to use option 1. (do 2 checks that are different), because we could prevent even fetching and uploading data/gisaid.ndjson if the annotations file has issues.

Thank you for clearly stating these two options. I think a third option, checking annotations inside of transform, should be explored in addition to these.

@kairstenfay
Copy link
Contributor

kairstenfay commented May 14, 2020

A summary of the latest issue with ./bin/check-annotations:

check-annotations has become out of date compared to the logic in transform. This is evident by the fact that ./bin/check-annotations currently fails but builds succeed. The problem experienced by users when running this script, initially summarized in #47, proposes improved error handling describing exactly what a user may need to change. However, the real issue comes down to checking if the actual logic performed in check-annotations is still valid given transform or when and how often we run check-annotations (or the logic therein if we roll it into transform).

Further evidence that check-annotations and transform are not playing nicely together is found in these conditional statements in transform. For example, in #47, an annotation for France/CIBU200107/2020 failed that specified division and division_exposure but not country or country_exposure. check-annotations stopped failing on this strain when the following action was taken:

This issue was resolved by removing division_exposure annotations for France/CIBU200107/2020.

Note that in transform, France/CIBU200107/2020's division_exposure gets set to the same value as its division anyway. So, the error generated by check-annotations here can be considered a false positive.

@kairstenfay kairstenfay self-assigned this May 14, 2020
@kairstenfay
Copy link
Contributor

@jameshadfield after a discussion with @tsibley, I think I finally understand why you implemented several of the tests in check-annotations. For example, when you wrote the check that if a division_exposure is set, a country_exposure must also be set, was it to enforce that the resulting hierarchy is valid? I can imagine a situation where someone annotates an incorrect location but may forget to update the other parts of the location hierarchy (country, region, or division).

If this was your thinking, I believe location hierarchies could be validated against a "truth" we maintain in either in JSON, CSV, or PostgreSQL.

@jameshadfield
Copy link
Member Author

jameshadfield commented May 14, 2020

@kairstenfay yes - see explanation here: #47 (comment) and the original PR here: #38. Would like to reiterate that I'm always happy to explain in more detail if there's things you don't understand 😄

@kairstenfay
Copy link
Contributor

@jstoja It would be fine for now to open another PR checking that the annotation strains all exist in the metadata.tsv (as Emma suggsted). But don't worry about adding check-annotations back to the CI. The recent behavior of check-annotations against the latest annotations.tsv made @tsibley and I rethink where the logic inside check-annotations should be invoked. So, I'll be integrating those checks into transform soon. Your recent refactoring of the code will be super helpful here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants