Ref db indexes and tmp files should not be stored in HOME directory by default #234

ppericard · 2020-04-20T08:18:16Z

By default, SMR should not write in any location other than the current working directory or the directory pointed to by the --aligned or the --workdir parameter.
On most bioinformatics HPC systems, users have to work and write data in dedicated project directories that are designed for fast access and provide large amounts of storage. On these systems, HOME directories usually have little storage, and running jobs in them is forbidden because they are not designed for analysis.
The same goes for bioinformaticians working on a personal computer who keep a drive dedicated to analysis and storage, with their HOME directory on a different partition/drive with limited capacity.
In both situations, a program that stores data in the HOME directory by default is a real problem and can lead to crashes or interrupted jobs.

In another issue (#233) I've described how index files could be stored in specific locations and why users need complete control over their naming and storage location.

As for tmp files (the kvdb dir, for example), the standard and expected way of dealing with them is to write them to the output directory (CWD by default, or the directory set by the output parameter, --aligned for SMR).

Moreover, these tmp files should never prevent users from re-running a job. This has already been discussed in #212, but I'm not sure the proposed solution of automatically removing these files at the start of a new run would work either.

By definition, tmp files should be removed automatically at the end of a run that generated them.

The standard way to deal with tmp files (at least in bioinformatics) is to create them in the output directory using a unique name for each run. The name could be composed from the output basename and a unique id (from pid or random number, etc) and tmp files could be either files in this directory, or a sub-directory.
Something like:

sortmerna --ref ref.fa --reads reads.fq --aligned path/to/output/basename

and files in path/to/output:

basename.fq
basename.sam
basename.163596781325.kvdb/
basename.163596781325.tmp_0

These tmp files need to be cleaned by SMR when completing a run, and in case the run crashes, a new job could be run without being blocked because of the tmp files.
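
To make the idea concrete, here is a minimal shell sketch of the scheme, not SortMeRNA's actual behavior. It assumes a POSIX shell and a `mktemp` that accepts `-d` with a template. The helper name `make_run_tmpdir`, the `out_prefix` variable and the choice to put `kvdb/` inside a single per-run directory are all illustrative:

```shell
#!/bin/sh
# Hypothetical helper, not part of SortMeRNA: create a per-run tmp directory
# next to the output files. $1 is the output prefix (the --aligned value).
make_run_tmpdir() {
    mkdir -p "$(dirname "$1")" || return 1
    # mktemp replaces the trailing X's with a unique string and creates the
    # directory itself, so two runs using the same prefix get different dirs.
    mktemp -d "$1.XXXXXXXXXX"
}

out_prefix=${1:-path/to/output/basename}
run_tmp=$(make_run_tmpdir "$out_prefix") || exit 1

# Remove the tmp directory when the script exits; INT/TERM are turned into
# a normal exit so that the EXIT trap runs for them as well.
trap 'rm -rf "$run_tmp"' EXIT
trap 'exit 1' INT TERM

mkdir "$run_tmp/kvdb"    # stand-in for the key-value store directory
echo "tmp files for this run live in: $run_tmp"
# ... the actual work would happen here ...
```

If the process is killed hard and the directory is left behind, the next run still gets a fresh unique name, so it is not blocked by the leftovers.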

ppericard · 2020-04-20T10:07:33Z

I just realized that SMR output files are also stored in the HOME directory by default. This is even worse. Everything should be written in the CWD by default, or the --workdir path manually set by the user.

ppericard · 2020-04-20T10:13:30Z

I suggest that the default --workdir be $(pwd)/sortmerna_wkdir_xxxxxxxxxxxx/
where xxxxxxxxxxxx is a unique string that sorts in the order of successive runs (for example the date and precise time, plus the pid in case several jobs start in the same second).

Then the default structure can be:

WORKDIR/
    kvdb/
    out/
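
A rough shell sketch of how such a default name could be built, assuming a POSIX shell. The helper name `default_workdir` and the exact timestamp format are my own choices for illustration, nothing SortMeRNA defines:

```shell
#!/bin/sh
# Hypothetical naming helper: build a sortable, per-run default workdir path.
default_workdir() {
    # The UTC timestamp comes first so names sort chronologically; the pid of
    # the shell ($$) separates jobs that start within the same second.
    printf '%s/sortmerna_wkdir_%s_%s\n' \
        "$(pwd)" "$(date -u +%Y%m%dT%H%M%S)" "$$"
}

workdir=$(default_workdir)
mkdir -p "$workdir/kvdb" "$workdir/out"
echo "$workdir"
```

Note that a pid is only unique on one machine, so on a shared filesystem with jobs starting on several nodes a random component (or the scheduler's job id) might be a safer tie-breaker.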

mweberr · 2022-04-13T13:37:47Z

I had a similar issue when running multiple parallel SMR runs using Snakemake. It is important to specify a separate working directory for each run, otherwise the runs all share the default working directory and fail.

    sortmerna -a {threads} -ref {input.silva_fasta} -reads {input.fastq} \
     -aligned {params.aligned} -other {params.depleted} -workdir {params.wdir} -fastx -v
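
Outside Snakemake, the same idea can be sketched in plain shell. This is only a dry run that prints the commands; the helper name `smr_command`, the sample names and the `out/` and `work/` layout are made up for illustration, and only the `--ref`, `--reads`, `--aligned` and `--workdir` options mentioned in this thread are used:

```shell
#!/bin/sh
# Hypothetical dry-run helper: print one sortmerna command per sample, each
# with its own --workdir, so parallel runs do not share a kvdb directory.
smr_command() {
    printf 'sortmerna --ref ref.fa --reads %s.fq --aligned out/%s/aligned --workdir work/%s\n' \
        "$1" "$1" "$1"
}

for s in sampleA sampleB; do
    smr_command "$s"
done
```

Deriving the working directory from the sample name (or any other per-job key) is what keeps the runs apart.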

ppericard · 2023-05-09T07:44:01Z

#372 is another example of things going wrong when the home directory is used by default to store tmp files. When SortMeRNA is integrated into a workflow manager such as Nextflow, the default location leads to cache problems and confusion between re-runs of the same SortMeRNA step with different inputs.

ppericard mentioned this issue Apr 20, 2020

Ref DB index management is not up to current bioinformatics standards and is not adapted to large scale analysis in production workflows on HPC systems #233

Open

cmith44 added this to the v5.0.0 milestone Jun 22, 2023

ppericard added the enhancement label Jul 11, 2023
