Ref db indexes and tmp files should not be stored in HOME directory by default #234

Open
ppericard opened this issue Apr 20, 2020 · 4 comments

@ppericard
Contributor

ppericard commented Apr 20, 2020

By default, SMR should not write to any location other than the current working directory or the directory pointed to by the --aligned or the --workdir parameter.
On most bioinformatics HPC systems, users need to work and write data in specific project directories, designed for fast access and with large amounts of storage. In these configurations, HOME directories usually don't have much storage, and running jobs there is forbidden because they are not designed for analysis.
The same could be said for bioinformaticians working on a personal computer who have a drive dedicated to analysis and storage, and a HOME directory on a different partition/drive with limited storage capacity.
In both situations, a program that stores data in the HOME directory by default is a real problem and could lead to crashes or job interruptions.

In another issue (#233) I've addressed how index files can be stored in specific locations and why users need complete control over their naming and storage location.

As for tmp files (the kvdb dir for example) the standard and expected way of dealing with them is to write them in the output directory (which would be CWD by default, or the directory set by the output parameter, --aligned for SMR).

Moreover, these tmp files should never prevent users from re-running a job. This has already been raised in #212, but I'm not sure the proposed solution of automatically removing these files on a new run would work either.

By definition, tmp files should be removed automatically at the end of a run that generated them.

The standard way to deal with tmp files (at least in bioinformatics) is to create them in the output directory using a unique name for each run. The name could be composed from the output basename and a unique id (from pid or random number, etc) and tmp files could be either files in this directory, or a sub-directory.
Something like:

sortmerna --ref ref.fa --reads reads.fq --aligned path/to/output/basename

and files in path/to/output:

basename.fq
basename.sam
basename.163596781325.kvdb/
basename.163596781325.tmp_0

These tmp files need to be cleaned up by SMR when a run completes. If a run crashes, a new job can still start without being blocked by the leftover tmp files, because each run uses its own unique names.
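To make the naming scheme above concrete, here is a minimal Python sketch of the idea (not SortMeRNA code). The helper name `run_scoped_tmpdir` is hypothetical; it only assumes that the `--aligned` value is a path prefix such as `path/to/output/basename`. It creates a uniquely named `basename.<unique>.kvdb` directory next to the outputs and removes it when the run finishes.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def run_scoped_tmpdir(aligned_prefix, suffix=".kvdb"):
    """Create <basename>.<unique><suffix> in the output directory.

    Hypothetical helper: the directory is removed when the block exits,
    whether the run succeeded or raised an exception.
    """
    out_dir = os.path.dirname(aligned_prefix) or "."
    basename = os.path.basename(aligned_prefix)
    os.makedirs(out_dir, exist_ok=True)

    # mkdtemp picks a name that does not exist yet, so two runs writing
    # to the same output directory never share a tmp directory.
    tmp_dir = tempfile.mkdtemp(prefix=basename + ".", suffix=suffix, dir=out_dir)
    try:
        yield tmp_dir
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

Used as `with run_scoped_tmpdir("path/to/output/basename") as kvdb: ...`, the tmp directory disappears at the end of the block. If the process is killed outright, the directory is left behind, but the next run gets a different name and is not blocked by it.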

@ppericard
Contributor Author

I just realized that SMR output files are also stored in the HOME directory by default. This is even worse. Everything should be written in the CWD by default, or the --workdir path manually set by the user.

@ppericard
Contributor Author

ppericard commented Apr 20, 2020

I suggest that the default --workdir be $(pwd)/sortmerna_wkdir_xxxxxxxxxxxx/
where xxxxxxxxxxxx is a unique string that sorts in the order of successive runs (for example the date and precise time, plus the pid in case several jobs start in the same second).

Then the default structure can be:

WORKDIR/
    kvdb/
    out/
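As a rough illustration of this suggestion, the sketch below builds such a default name in Python. The function names and the exact timestamp format are assumptions made for the example, not SortMeRNA behaviour; the timestamp is fixed-width so names sort chronologically, and the pid separates jobs that start at the same instant.

```python
import os
from datetime import datetime


def default_workdir(base=None, now=None, pid=None):
    """Return <base>/sortmerna_wkdir_<timestamp>_<pid> (hypothetical scheme)."""
    base = os.getcwd() if base is None else base
    now = datetime.now() if now is None else now
    pid = os.getpid() if pid is None else pid
    # Fixed-width date, time and microseconds: lexicographic order
    # matches chronological order.
    stamp = now.strftime("%Y%m%d_%H%M%S_%f")
    return os.path.join(base, f"sortmerna_wkdir_{stamp}_{pid}")


def create_workdir(path):
    """Create the kvdb/ and out/ sub-directories under path."""
    for sub in ("kvdb", "out"):
        # No exist_ok: an unexpected name collision fails loudly
        # instead of silently reusing another run's files.
        os.makedirs(os.path.join(path, sub))
    return path
```

Calling `create_workdir(default_workdir())` from the directory where the job runs would give each run its own `kvdb/` and `out/` under the current working directory.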

@mweberr

mweberr commented Apr 13, 2022

I had a similar issue when running multiple parallel SMR runs using Snakemake. It is important to specify a separate working directory for each run, otherwise runs that share the default working directory will fail:

    sortmerna -a {threads} -ref {input.silva_fasta} -reads {input.fastq} \
     -aligned {params.aligned} -other {params.depleted} -workdir {params.wdir} -fastx -v
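Outside Snakemake, the same idea can be sketched in plain Python: derive a distinct working directory per sample and pass it explicitly. The function name and the `<out_root>/<sample>/` layout below are assumptions for illustration; only the `--ref`, `--reads`, `--aligned` and `--workdir` options come from this thread. The function only builds the argument list and does not run anything.

```python
import os


def sortmerna_cmd(ref, reads, out_root, sample):
    """Build a sortmerna argument list with a per-sample working directory."""
    # Hypothetical layout: everything for one sample lives under its own
    # directory, so parallel runs never touch the same workdir.
    run_dir = os.path.join(out_root, sample)
    return [
        "sortmerna",
        "--ref", ref,
        "--reads", reads,
        "--aligned", os.path.join(run_dir, "aligned"),
        "--workdir", os.path.join(run_dir, "wkdir"),
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, one call per sample.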

@ppericard
Contributor Author

#372 is another example of how things can go wrong when the home directory is used by default to store tmp files. When SortMeRNA is integrated into a workflow manager such as Nextflow, the default location leads to cache problems and confusion between re-runs of the same SortMeRNA step with different inputs.

@cmith44 cmith44 added this to the v5.0.0 milestone Jun 22, 2023

3 participants