Revisit fetch from GISAID #242

Open
joverlee521 opened this issue Dec 3, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

@joverlee521
Contributor

Context
On Dec 2, 2021, multiple fetch-and-ingest runs for GISAID failed. In each case the download ran for a while and then the transfer was closed before it completed. Subsequent fetch attempts hit a 503 error. We manually triggered fetch-and-ingest two more times and saw the same failure pattern.

Possible solution
The scheduled run today had no issues, so this may have just been unfortunate timing of our runs being interrupted by GISAID's reboots. We can revisit the following solutions in anticipation of similar future issues:

  1. Manual downloads from the same API endpoint completed successfully when done without streaming decompression. We can update fetch-from-gisaid to stop decompressing during streaming, which shortens the time the connection stays open. However, decompressing in a separate step would increase the total time to run fetch-and-ingest.
  2. Switch to an endpoint with xz, which has a better compression ratio and faster decompression than bzip2. Regardless of errors, this would be a huge improvement for us and dramatically decrease fetch-and-ingest runtime.
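
A minimal sketch of option 1, assuming the fetch can be split into two phases: first copy the raw compressed bytes to disk so the connection is held open only for the transfer, then decompress the local file afterwards. This is not the actual fetch-from-gisaid implementation. The function names, the file names, and the in-memory stand-in for the HTTP response body are all hypothetical and only illustrate the idea.

```python
import bz2
import io
import shutil
import tempfile
from pathlib import Path
from typing import BinaryIO

CHUNK_SIZE = 1024 * 1024  # copy in 1 MiB chunks


def download_compressed(response: BinaryIO, dest: Path) -> Path:
    """Phase 1: write the compressed bytes to disk without decompressing.

    `response` is any binary file-like object, e.g. the raw body of an
    HTTP response (hypothetical here; no real endpoint is contacted).
    """
    with dest.open("wb") as out:
        shutil.copyfileobj(response, out, CHUNK_SIZE)
    return dest


def decompress_bz2(src: Path, dest: Path) -> Path:
    """Phase 2: decompress the local archive after the transfer is done."""
    with bz2.open(src, "rb") as compressed, dest.open("wb") as out:
        shutil.copyfileobj(compressed, out, CHUNK_SIZE)
    return dest


if __name__ == "__main__":
    # Stand-in for the remote export: a small bzip2-compressed payload.
    payload = b'{"strain": "example"}\n' * 1000
    fake_response = io.BytesIO(bz2.compress(payload))

    with tempfile.TemporaryDirectory() as tmp:
        archive = download_compressed(fake_response, Path(tmp) / "export.bz2")
        result = decompress_bz2(archive, Path(tmp) / "export")
        assert result.read_bytes() == payload
```

The trade-off is the one noted in option 1: the connection closes sooner, but the decompression no longer overlaps with the download, so the total runtime is likely to grow.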
@joverlee521 joverlee521 added the enhancement New feature or request label Dec 3, 2021
ivan-aksamentov added a commit that referenced this issue Dec 10, 2021
I don't know if it's any faster, but why not.

The results are correct in my local testing.

Locally, it does use multiple threads, but not too many. We might be bound by download speed rather than decompression though.

Related: #242
@ivan-aksamentov
Member

@joverlee521

Switch to an endpoint with xz.

I did not know it exists. Do you know the URL? Does it have the same data in it?

In the meantime we could try parallel bzip also: #247

@tsibley
Member

tsibley commented Dec 17, 2021

I did not know it exists. Do you know the URL? Does it have the same data in it?

Ah, it does not exist, as far as we know. This would be asking GISAID to switch to xz for us for the current export we get.

Status: Prioritized