
Pipeline fails with "Cannot open file x.tsv for writing. Too many open files" #158


Open
imnuvi opened this issue Feb 28, 2025 · 8 comments


imnuvi commented Feb 28, 2025

Operating System

Other Linux (please specify below)

Other Linux

Red Hat Enterprise Linux 8.8 (Ootpa)

Workflow Version

v3.0.0

Workflow Execution

Command line (Cluster)

Other workflow execution

No response

EPI2ME Version

No response

CLI command run

nextflow run software_packages/wf-single-cell
--expected_cells 10000
--fastq gene_expression/library_2_reads_merged
--kit 3prime:v3
--ref_genome_dir data/RefGenome/
-profile standard
--out_dir runs/run5

Workflow Execution - CLI Execution Profile

singularity

What happened?

Setup:

kit: 3prime:v3
executor: singularity
environment: computing cluster
RAM: 128GB
cores: 12

Current Behaviour
I tried running the EPI2ME single-cell workflow on our sequencing data, and the pipeline fails at the cat_tags_by_chrom step. This step uses awk, which seems to open a large number of files and exceed the system's open-file limit.

Expected Behaviour
The pipeline runs without errors and produces the results matrix.

Relevant log output

executor >  local (128)
[ee/7dc26d] fastcat (1)                    | 1 of 1 ✔
[9d/89cc49] parse_kit_metadata (1)         | 1 of 1 ✔
[24/e1d96e] pipeline:getVersions           | 1 of 1 ✔
[5c/1fec01] pipeline:getParams             | 1 of 1 ✔
[3e/925f97] pip…e:preprocess:call_paftools | 1 of 1 ✔
[af/04a6bb] pip…rocess:build_minimap_index | 1 of 1 ✔
[0a/3568bc] pip…cess:call_adapter_scan (5) | 11 of 11 ✔
[c8/4b9212] pip…ummarize_adapter_table (1) | 1 of 1 ✔
[c0/bf5b0f] pip…s_bams:split_gtf_by_chroms | 1 of 1 ✔
[f0/1a15a2] pip…ams:generate_whitelist (1) | 1 of 1 ✔
[b4/4de3de] pip…_bams:assign_barcodes (11) | 11 of 11 ✔
[f9/5bcfab] pip…:merge_and_publish_tsv (1) | 1 of 1 ✔
[99/ddf8c2] pip…bams:cat_tags_by_chrom (1) | 1 of 1, failed: 1 ✘
[0d/ae8f94] pip…rocess_bams:merge_bams (1) | 1 of 1 ✔
[61/453413] pip…rocess_bams:stringtie (46) | 47 of 47 ✔
[22/f19512] pip…lign_to_transcriptome (47) | 47 of 47 ✔
[-        ] pip…ocess_bams:assign_features -
[-        ] pip…process_bams:create_matrix -
[-        ] pip…rocess_bams:process_matrix -
[-        ] pip…s_bams:merge_transcriptome -
[-        ] pip…e:process_bams:pack_images | 0 of 1
Plus 5 more processes waiting for tasks…
ERROR ~ Error executing process > 'pipeline:process_bams:cat_tags_by_chrom (1)'

Caused by:
  Process `pipeline:process_bams:cat_tags_by_chrom (1)` terminated with an error exit status (2)


Command executed:

  mkdir chr_tags
  # Find the chr column number
  files=(tags/*)
  chr_col=$(awk -v RS=' ' '/chr/{print NR; exit}' "${files[0]}")

  # merge the tags TSVs, keep header from first file and split entries by chromosome
  awk -F'       ' -v chr_col=$chr_col 'FNR==1{hdr=$0; next}     {if (!seen[$chr_col]++)         print hdr>"chr_tags/"$chr_col".tsv";         print>"chr_tags/"$chr_col".tsv"}' tags/*

Command exit status:
  2

Command output:
  (empty)

Command error:
  awk: cannot open "chr_tags/ENST00000575475.2.tsv" for output (Too many open files)

Application activity log entry

Were you able to successfully run the latest version of the workflow with the demo data?

yes

Other demo data information

@nrhorner
Contributor

nrhorner commented Mar 3, 2025

Hi @imnuvi

It appears that in the cat_tags_by_chrom process, the files are being aggregated by transcript ID, not by chromosome.

This is strange because the tags files should not contain any transcript info at this point
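To see why the choice of reference matters here: the awk command writes each row to a file named after the value in its chr column, and awk keeps every output file it has opened with `>` open until it exits or `close()` is called. The number of simultaneously open files is therefore the number of distinct reference names, which is small for a genome but one per transcript for a transcriptome. The following is a rough Python sketch for comparing that count against the per-process limit. It assumes tab-separated tags files whose header has a column named exactly `chr`; the function names and that column name are illustrative assumptions, not workflow code, and the `resource` module is Unix-only.

```python
import csv
import resource
from pathlib import Path


def count_distinct_refs(tag_files, column="chr"):
    """Count distinct values of `column` across tab-separated files with a header row."""
    seen = set()
    for path in tag_files:
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                seen.add(row[column])
    return len(seen)


def open_file_soft_limit():
    """Return the soft limit on open file descriptors for this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    # "tags" mirrors the directory name used in the failing command.
    files = sorted(Path("tags").glob("*"))
    n_refs = count_distinct_refs(files)
    limit = open_file_soft_limit()
    print(f"{n_refs} distinct reference names, soft open-file limit {limit}")
    if limit != resource.RLIM_INFINITY and n_refs >= limit:
        print("One open output file per reference name would exceed the limit.")
```

If the count is in the tens of thousands, raising the limit is unlikely to be the right fix; it points to the wrong kind of reference having been supplied.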

Is it possible that you have supplied a transcriptome sequence instead of a genome sequence?

What is the content of the following file?

<ref_genome_dir>/fasta/genome.fa

@imnuvi
Author

imnuvi commented Mar 3, 2025

Hi @nrhorner,

Thanks for your reply!
You could be right on the transcriptome part.

Attaching the first few lines from <refgenome_dir>/fasta/genome.fa:

>ENST00000448914.1 cdna chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T cell receptor delta diversity 3 [Source:HGNC Symbol;Acc:HGNC:12256]
ACTGGGGGATACG
>ENST00000631435.1 cdna chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC
>ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC
>ENST00000434970.2 cdna chromosome:GRCh38:14:22439007:22439015:1 gene:ENSG00000237235.2 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD2 description:T cell receptor delta diversity 2 [Source:HGNC Symbol;Acc:HGNC:12255]
CCTTCCTAC
>ENST00000415118.1 cdna chromosome:GRCh38:14:22438547:22438554:1 gene:ENSG00000223997.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD1 description:T cell receptor delta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12254]
GAAATAGT
>ENST00000633010.1 cdna chromosome:GRCh38:CHR_HSCHR14_3_CTG1:105895279:105895294:-1 gene:ENSG00000282274.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD4-17 description:immunoglobulin heavy diversity 4-17 [Source:HGNC Symbol;Acc:HGNC:5503]
TGACTACGGTGACTAC
>ENST00000632968.1 cdna chromosome:GRCh38:CHR_HSCHR14_3_CTG1:105891962:105891978:-1 gene:ENSG00000282592.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD1-20 description:immunoglobulin heavy diversity 1-20 [Source:HGNC Symbol;Acc:HGNC:5484]
GGTATAACTGGAACGAC
>ENST00000603693.1 cdna chromosome:GRCh38:15:21011451:21011469:-1 gene:ENSG00000270451.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD4OR15-4B description:immunoglobulin heavy diversity 4/OR15-4B (non-functional) [Source:HGNC Symbol;Acc:HGNC:5507]
TGACTATGGTGCTAACTAC
>ENST00000452198.1 cdna chromosome:GRCh38:14:105881539:105881556:-1 gene:ENSG00000225825.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD6-25 description:immunoglobulin heavy diversity 6-25 [Source:HGNC Symbol;Acc:HGNC:5516]
GGGTATAGCAGCGGCTAC
>ENST00000632609.1 cdna chromosome:GRCh38:CHR_HSCHR14_3_CTG1:105905268:105905298:-1 gene:ENSG00000282373.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD3-10 description:immunoglobulin heavy diversity 3-10 [Source:HGNC Symbol;Acc:HGNC:5495]
GTATTACTATGGTTCGGGGAGTTATTATAAC

@nrhorner
Contributor

nrhorner commented Mar 3, 2025

Hi @imnuvi

This is the issue. That file should be a genomic DNA sequence.
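A quick way to catch this before launching a run is to look at the FASTA headers. A genome FASTA has records named after chromosomes or contigs, whereas an Ensembl cDNA FASTA has one record per transcript, with an ID starting `ENST` (for human) and the word `cdna` as the second header field, as in the excerpt above. The sketch below is a heuristic only; the function names and the 50% threshold are illustrative assumptions and not part of wf-single-cell.

```python
import re

# Ensembl transcript IDs look like ENST00000448914 or ENST00000448914.1 (human);
# other species carry a species infix, e.g. ENSMUST for mouse.
TRANSCRIPT_ID = re.compile(r"^ENS[A-Z]*T\d+(\.\d+)?$")


def summarize_fasta_headers(lines):
    """Return (n_records, n_transcript_like) for an iterable of FASTA lines."""
    n_records = 0
    n_transcript_like = 0
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        n_records += 1
        fields = line[1:].split()
        seq_id = fields[0] if fields else ""
        # Count the record if its ID is transcript-shaped or field 2 is 'cdna'.
        if TRANSCRIPT_ID.match(seq_id) or "cdna" in fields[1:2]:
            n_transcript_like += 1
    return n_records, n_transcript_like


def looks_like_transcriptome(lines):
    """Heuristic: more than half of the headers look like transcript records."""
    n_records, n_transcript_like = summarize_fasta_headers(lines)
    return n_records > 0 and n_transcript_like / n_records > 0.5
```

Since an open file is an iterable of lines, it can be called as `looks_like_transcriptome(open(path))` on the file in question. This reads the whole file, so on a full genome it takes a little while.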

@imnuvi
Author

imnuvi commented Mar 5, 2025

Hi @nrhorner,

I ran the pipeline with the full genome sequence, and all steps ran except the process_matrix step. The error is attached below.

executor > local (225)
[c6/bd05c1] fastcat (1) | 1 of 1 ✔
[e5/6bb04a] parse_kit_metadata (1) | 1 of 1 ✔
[29/aa8e5f] pipeline:getVersions | 1 of 1 ✔
[99/f688b2] pipeline:getParams | 1 of 1 ✔
[3b/2cbd25] pip…e:preprocess:call_paftools | 1 of 1 ✔
[9f/ede70b] pip…rocess:build_minimap_index | 1 of 1 ✔
[be/e4cc24] pip…cess:call_adapter_scan (6) | 11 of 11 ✔
[3e/c6fcd3] pip…ummarize_adapter_table (1) | 1 of 1 ✔
[b0/18d39a] pip…s_bams:split_gtf_by_chroms | 1 of 1 ✔
[9a/8d9722] pip…ams:generate_whitelist (1) | 1 of 1 ✔
[0f/89d8fc] pip…s_bams:assign_barcodes (6) | 11 of 11 ✔
[b4/6f15bd] pip…:merge_and_publish_tsv (1) | 1 of 1 ✔
[12/3f40da] pip…bams:cat_tags_by_chrom (1) | 1 of 1 ✔
[42/df7c6a] pip…rocess_bams:merge_bams (1) | 1 of 1 ✔
[6f/077c91] pip…rocess_bams:stringtie (46) | 47 of 47 ✔
[37/845e70] pip…lign_to_transcriptome (47) | 47 of 47 ✔
[d5/03da76] pip…_bams:assign_features (10) | 45 of 45 ✔
[12/97f316] pip…ss_bams:create_matrix (45) | 45 of 45 ✔
[f1/eedee3] pip…ss_bams:process_matrix (1) | 1 of 2, failed: 1
[e6/28b9ea] pip…ms:merge_transcriptome (1) | 1 of 1 ✔
[38/b74138] pip…ombine_final_tag_files (1) | 1 of 1 ✔
[50/8c8bb2] pip…e:process_bams:tag_bam (1) | 0 of 1
[d5/f702b0] pip…ms:umi_gene_saturation (1) | 1 of 1 ✔
[a2/5e4f55] pip…ocess_bams:pack_images (1) | 1 of 1 ✔
Plus 2 more processes waiting for tasks…
ERROR ~ Error executing process > 'pipeline:process_bams:process_matrix (1)'

Caused by:
Process pipeline:process_bams:process_matrix (1) terminated with an error exit status (1)

Command executed:

export NUMBA_NUM_THREADS=1
workflow-glue process_matrix inputs/matrix*.hdf --feature gene --raw "89c43cbf223240f7a4c162d939ddbd7d.gene_raw_feature_bc_matrix" --processed "89c43cbf223240f7a4c162d939ddbd7d.gene_processed_feature_bc_matrix" --per_cell_mito "89c43cbf223240f7a4c162d939ddbd7d.gene_expression_mito_per_cell.tsv" --per_cell_expr "89c43cbf223240f7a4c162d939ddbd7d.gene_expression_mean_per_cell.tsv" --umap_tsv "89c43cbf223240f7a4c162d939ddbd7d.gene_expression_umap_REPEAT.tsv" --enable_filtering --min_features 200 --min_cells 3 --max_mito 20 --mito_prefixes MT- --norm_count 10000 --enable_umap --replicates 3

Command exit status:
1

Command output:
(empty)

Command error:
[18:14:26 - workflow_glue] Bootstrapping CLI.
/home/epi2melabs/conda/lib/python3.8/site-packages/umap/distances.py:1063: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@numba.jit()
/home/epi2melabs/conda/lib/python3.8/site-packages/umap/distances.py:1071: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@numba.jit()
/home/epi2melabs/conda/lib/python3.8/site-packages/umap/distances.py:1086: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@numba.jit()
Traceback (most recent call last):
  File "/home/nuvi/.nextflow/assets/epi2me-labs/wf-single-cell/bin/workflow-glue", line 7, in <module>
    cli()
  File "/home/nuvi/.nextflow/assets/epi2me-labs/wf-single-cell/bin/workflow_glue/__init__.py", line 66, in cli
    components = get_components(allowed_components=[sys.argv[1]])
  File "/home/nuvi/.nextflow/assets/epi2me-labs/wf-single-cell/bin/workflow_glue/__init__.py", line 29, in get_components
    mod = importlib.import_module(f"{_package_name}.{name}")
  File "/home/epi2melabs/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/nuvi/.nextflow/assets/epi2me-labs/wf-single-cell/bin/workflow_glue/process_matrix.py", line 8, in <module>
    import umap
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/umap/__init__.py", line 2, in <module>
    from .umap_ import UMAP
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/umap/umap_.py", line 41, in <module>
    from umap.layouts import (
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/umap/layouts.py", line 40, in <module>
    def rdist(x, y):
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/numba/core/decorators.py", line 234, in wrapper
    disp.enable_caching()
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/numba/core/dispatcher.py", line 863, in enable_caching
    self._cache = FunctionCache(self.py_func)
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/numba/core/caching.py", line 601, in __init__
    self._impl = self._impl_class(py_func)
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/numba/core/caching.py", line 337, in __init__
    raise RuntimeError("cannot cache function %r: no locator available "
RuntimeError: cannot cache function 'rdist': no locator available for file '/home/epi2melabs/conda/lib/python3.8/site-packages/umap/layouts.py'

@nrhorner
Contributor

nrhorner commented Mar 5, 2025

Hi @imnuvi, please see the troubleshooting section of the README that covers this error.

Thanks

@nrhorner
Contributor

@imnuvi In v3.1.0 the numba cache is set to the process directory, which should fix this issue.
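For background, the `no locator available` error means numba found no writable location for its on-disk function cache. Inside the container the installed package directory is read-only, and presumably no other default cache location was usable either. numba honours the `NUMBA_CACHE_DIR` environment variable, which has to be set before numba (or umap, which imports it) is loaded. The snippet below is a minimal sketch of that idea, assuming a writable folder is available; the helper name and the `numba_cache` folder name are hypothetical, and this is not necessarily how v3.1.0 implements the change.

```python
import os
from pathlib import Path


def point_numba_cache_at(directory):
    """Create `directory` if needed and export it as NUMBA_CACHE_DIR.

    Call this before `import numba` or `import umap`, because numba reads
    its configuration from the environment when it is imported.
    """
    cache_dir = Path(directory).resolve()
    cache_dir.mkdir(parents=True, exist_ok=True)
    if not os.access(cache_dir, os.W_OK):
        raise PermissionError(f"{cache_dir} is not writable")
    os.environ["NUMBA_CACHE_DIR"] = str(cache_dir)
    return cache_dir
```

For example, `point_numba_cache_at(Path.cwd() / "numba_cache")` at the top of a script would place the cache in the current working directory. Exporting `NUMBA_CACHE_DIR` in the shell before the Python process starts has the same effect.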

@nrhorner
Contributor

@imnuvi Did you get round to trying out the new version?

@imnuvi
Author

imnuvi commented May 16, 2025

Hi @nrhorner I haven't tried it out yet. Will try it out and update.
