change db and cache path #71

Open
bergalu opened this issue Oct 26, 2023 · 5 comments · May be fixed by #72
Comments

@bergalu

bergalu commented Oct 26, 2023

Good afternoon to everybody,

I would need to change the location of the cache folder and I am wondering if there is a way to do that.

Moreover, I have to place the different databases in folders outside the cache folder. How can I do that?
I downloaded the database on another computer and copied it to the workstation where it will be used, but I am at a loss as to how to tell taxizedb where to find the database when needed.

I'm a newbie to R and taxizedb, so I apologise if my questions sound trivial.

Many thanks in advance,
Luca

@stitam
Collaborator

stitam commented Oct 26, 2023

Hi @bergalu, thanks for raising this issue. It is not trivial.

In taxizedb, caching is managed through the hoardr package (https://github.com/ropensci/hoardr). In short, you can get the current cache path using tdb_cache$cache_path_get() and set it using tdb_cache$cache_path_set(). You can access the help page with ?tdb_cache or visit the GitHub page for hoardr for more information. Does this help?

@bergalu
Author

bergalu commented Oct 30, 2023

Hi @stitam, many thanks for your prompt response.

By default, the cache path in the workstation where I need to run taxizedb is:

tdb_cache$cache_path_get()
[1] "~/.cache/R/taxizedb"

I want the cache folder to be:
/gscratch/databases/

I had a look at the hoardr documentation and I succeeded in changing the absolute path by executing:

tdb_cache$cache_path_set(full_path = '/gscratch/databases')
[1] "/gscratch/databases"

Even so, when I exit R, start it again later, and reload the taxizedb package, the cache path is reset to the default one.
1) Is there a way to make R retain the desired cache path (/gscratch/databases)?

Moreover, I tried to download the ncbi database with the db_download_ncbi command, but it fails:

db_download_ncbi(verbose = TRUE, overwrite = FALSE)
downloading...
Error in curl::curl_download(db_url, db_path_file, quiet = TRUE) :
Timeout was reached: [] Failed to connect to ftp.ncbi.nih.gov port 21 after 7984 ms: Connection timed out

I verified that I can connect to and download from (with wget):
https://ftp.ncbi.nih.gov/pub/taxonomy/
but not
ftp://ftp.ncbi.nih.gov/pub/taxonomy/
maybe this is the problem in my case.

Anyway, I tried to work around it by downloading the file myself:
https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip

but then,
2) which steps should I follow in order to build the sql database and link it properly to taxizedb?

Many thanks in advance,
Luca
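
For question 1 above, one possible approach is to set the cache path from an R startup file, so that it is applied at the start of every session. The snippet below is an untested sketch, not something confirmed in this thread: it assumes that setting the path once on the exported tdb_cache object is enough for the rest of the session, and it reuses the /gscratch/databases path from the question.

```r
# ~/.Rprofile (sketch; /gscratch/databases is the path from the question above)
# Assumption: taxizedb creates tdb_cache when its namespace is loaded, so
# setting the path once here should hold until the R session ends.
if (requireNamespace("taxizedb", quietly = TRUE)) {
  invisible(
    taxizedb::tdb_cache$cache_path_set(full_path = "/gscratch/databases")
  )
}
```

After restarting R, tdb_cache$cache_path_get() can be used to check whether the new path was picked up.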

@arendsee
Collaborator

@bergalu @stitam I've run into the same issue with the curl command timing out. So, like Luca, I downloaded the NCBI taxonomy dump myself and hit the same problem of figuring out how to make taxizedb process the zip file.

To solve the problem, I forked the repo and added a path option to each of the db_download_* functions (these include the db_download_ncbi function we are both using). With path we can specify our own input file and it will be passed into the same setup code as the file retrieved by default through curl.

So you can do the following:

taxizedb::db_download_ncbi(path="taxdmp.zip")

Where "taxdmp.zip" is your locally downloaded file. The zip file will be processed into an sqlite database and managed under hoardr.

My fork is at https://github.com/arendsee/taxizedb. If this looks good, I can make a PR.
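
The workflow described above (fetch the dump over HTTPS, then hand the local file to taxizedb) could be sketched as follows. This is a minimal sketch under assumptions: the path argument is the one added in the fork described in this comment and is assumed not to exist in other versions of taxizedb, and fetch_taxdump and build_ncbi_from_zip are hypothetical helper names introduced only for illustration.

```r
# Sketch only: `path` is the argument added in the arendsee/taxizedb fork
# described above; other versions of taxizedb are assumed not to accept it.

# HTTPS location of the NCBI taxonomy dump (the URL quoted earlier in the thread).
ncbi_taxdump_url <- "https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip"

# Hypothetical helper: download the dump over HTTPS instead of FTP.
fetch_taxdump <- function(dest = "taxdmp.zip", url = ncbi_taxdump_url) {
  # mode = "wb" keeps the zip archive intact on Windows
  status <- utils::download.file(url, destfile = dest, mode = "wb")
  if (status != 0) stop("download failed with status ", status)
  dest
}

# Hypothetical helper: pass the local zip file to the fork's db_download_ncbi().
build_ncbi_from_zip <- function(zip_path) {
  # Fail early if the file is missing, before taxizedb is touched
  stopifnot(file.exists(zip_path))
  taxizedb::db_download_ncbi(path = zip_path)
}
```

With the fork installed (for example via remotes::install_github("arendsee/taxizedb")), calling build_ncbi_from_zip(fetch_taxdump()) would download the archive and then build the database from it.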

@stitam
Collaborator

stitam commented Jan 23, 2024

Thanks @arendsee for working on this. I looked at your commit and it looks good. I was wondering whether it is good practice to (optionally) eliminate the "download" part from functions that have "download" in their names, but it's probably fine. This is also good for reproducibility: if someone wants to store the downloaded raw files as well, they can.

Can you please open the PR?

@bergalu
Author

bergalu commented Nov 6, 2024

Many thanks @arendsee and @stitam
