hicBuildMatrix cannot take space-separated list for Sequence of the restriction site or Dangling sequence #6505

Open
bskubi opened this issue Oct 30, 2024 · 8 comments · Fixed by #6519

Comments

@bskubi

bskubi commented Oct 30, 2024

Describe the bug
I am using HiCExplorer on Galaxy at https://hicexplorer.usegalaxy.eu. I am running the hicBuildMatrix tool on Hi-C data prepped with two restriction enzymes, so I need to enter both sequences for "Sequence of the restriction site" and for "Dangling sequence". HiCExplorer's documentation says these arguments accept space-separated lists. However, when I enter a space-separated list in the Galaxy form while designing my workflow (prior to the run), the box turns red and says that non-numeric characters are not allowed, which prevents me from entering the list.

Galaxy Version and/or server at which you observed the bug
{"version_major":"24.1","version_minor":"2.dev0"}

Browser and Operating System
Operating System: Linux Jammy Jellyfish
Browser: Chrome

To Reproduce
Steps to reproduce the behavior:

  1. Go to hicBuildMatrix
  2. Enter "AAGCTT GATC" in the "Sequence of the restriction site" text box, or "AGCT GATC" in the "Dangling sequence" text box

Expected behavior
It should permit the above as a valid input to either of these boxes (the exact string content doesn't matter, as long as there's a space separating two strings).

Additional context
HiCExplorer's documentation states that space-separated lists are permitted, and I'm not able to run the workflow due to this bug, so I'm assuming the issue lies with Galaxy's input validation rather than with HiCExplorer.

@bernt-matthias bernt-matthias transferred this issue from galaxyproject/galaxy Nov 1, 2024
@bernt-matthias
Contributor

Thanks for the report. Do you have a link to the docs at hand?

I guess a space-separated list of strings consisting of [atcgATCG] (and maybe N) should be fine, right?
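The check suggested above can be sketched as a regular expression. This is only an illustration: the helper name `is_valid_site_list` is hypothetical, and the pattern assumes single spaces between sites and allows N in either case. It is not the validator that the Galaxy wrapper actually uses.

```python
import re

# Assumption: each site is a run of A/C/G/T/N (either case),
# and sites are separated by exactly one space.
SITE_LIST = re.compile(r"[ACGTNacgtn]+( [ACGTNacgtn]+)*")


def is_valid_site_list(text: str) -> bool:
    """Return True if text is one or more nucleotide strings separated by single spaces."""
    return SITE_LIST.fullmatch(text) is not None
```

With this pattern, both "AAGCTT GATC" and a single site such as "AGCT" pass, while an empty string or a comma-separated list is rejected.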

@bskubi
Author

bskubi commented Nov 3, 2024

Here's the relevant page from the docs.

See the first sentence of the --restrictionSequence and --danglingSequence arguments.

I didn't create the HiCExplorer tool, I'm just trying to benchmark against it, so unfortunately that is all the info I have!

@bskubi
Author

bskubi commented Nov 3, 2024

@bernt-matthias I would also note that the multi-bin feature on hicBuildMatrix appears to be broken, and possibly hicNormalize as well. Here's what I tried:

I specified multiple bin resolutions using a single hicBuildMatrix step (10kb, 20kb, 50kb, 100kb), expecting it to produce a single multi-res cooler file (.mcool) containing all 4 resolutions.

I then fed the output from this single hicBuildMatrix step into a hicNormalize step, expecting it to normalize all four resolutions to the 0-1 range.

Instead, the result was that hicBuildMatrix produced a single-resolution cooler file (I believe at 10kb resolution), and the hicNormalize function produced a 0-byte empty output.

I'm not sure if hicNormalize is broken or if it only failed because of the issue with hicBuildMatrix. I'm trying it another way calling hicNormalize separately on each individual resolution. However, I currently can't figure out a way to produce a multi-res .mcool matrix using Galaxy HiCExplorer (I know how to make them using other tools, I'm just trying to figure out if it's currently possible on Galaxy HiCExplorer specifically).

@bernt-matthias
Contributor

The original problem should be fixed in #6519

For the other problem: Can you check the produced command line(s)? Maybe it's an upstream problem?

@bernt-matthias
Contributor

@bskubi please feel free to reopen if needed or open a new issue. Thanks again for the report.

@bskubi
Author

bskubi commented Nov 14, 2024

@bernt-matthias

The command line generated by hicBuildMatrix when trying to build multiple bin sizes is:

mkdir ./QCfolder && mkdir '/data/jwd02f/main/075/495/75495779/outputs/dataset_1765acd4-17e5-4a32-92eb-a8c566c6b93e_files' && hicBuildMatrix --samFiles '/data/dnb10/galaxy_db/files/a/2/4/dataset_a24ca246-f55d-40ed-8d2b-b3dabe741ff4.dat' '/data/dnb10/galaxy_db/files/c/3/c/dataset_c3c4f136-6c80-4e5e-b775-fd62f08e84ef.dat'  --restrictionCutFile '/data/dnb10/galaxy_db/files/d/a/1/dataset_da1a0a55-ebfa-4e1d-ad03-7828b4bec739.dat'  --restrictionSequence 'AAGCTT' --danglingSequence 'AGCT'  --binSize '10000' '20000' '50000' '100000'  --chromosomeSizes '/data/dnb10/galaxy_db/files/7/4/9/dataset_749e98a5-04c7-4ba4-a385-8309db2d3053.dat' --genomeAssembly 'hg38'   --outFileName 'matrix.cool'       --minMappingQuality 30  --threads ${GALAXY_SLOTS:-4}  --QCfolder ./QCfolder && mv ./QCfolder/* /data/jwd02f/main/075/495/75495779/outputs/dataset_1765acd4-17e5-4a32-92eb-a8c566c6b93e_files/ && mv '/data/jwd02f/main/075/495/75495779/outputs/dataset_1765acd4-17e5-4a32-92eb-a8c566c6b93e_files/hicQC.html' '/data/jwd02f/main/075/495/75495779/outputs/dataset_1765acd4-17e5-4a32-92eb-a8c566c6b93e.dat' && mv "/data/jwd02f/main/075/495/75495779/outputs/dataset_1765acd4-17e5-4a32-92eb-a8c566c6b93e_files"/*.log raw_qc && mv matrix.cool matrix

The relevant bit is --binSize '10000' '20000' '50000' '100000'

According to the documentation for hicBuildMatrix:

--binSize, -bs
Size in bp for the bins. The bin size depends on the depth of sequencing. Use a larger bin size for libraries sequenced with lower depth. If not given, matrices of restriction site resolution will be built. Optionally for mcool file format: Define multiple resolutions which are all a multiple of the first value. Example: --binSize 10000 20000 50000 will create a mcool file formate containing the three defined resolutions.

So it seems like the --binSize argument is correctly formatted. I'm not sure whether this is an issue with hicBuildMatrix itself. It does produce a usable single-resolution matrix at the smallest bin size, and I'm also unsure why that matrix is not being successfully normalized by the subsequent hicNormalize step.
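The reasoning above (pull the --binSize values out of the generated command, then compare them with the documented rule that every resolution is a multiple of the first) can be checked with a short sketch. The helper names `option_values` and `all_multiples_of_first` are hypothetical, and the command string is a shortened fragment of the one quoted earlier.

```python
import shlex


def option_values(command: str, option: str) -> list[str]:
    """Collect the tokens that follow `option` up to the next `--flag`."""
    tokens = shlex.split(command)  # also strips the single quotes Galaxy adds
    if option not in tokens:
        return []
    values = []
    for token in tokens[tokens.index(option) + 1:]:
        if token.startswith("--"):
            break
        values.append(token)
    return values


def all_multiples_of_first(sizes: list[int]) -> bool:
    """Documented rule: every resolution is a multiple of the first value."""
    return bool(sizes) and all(size % sizes[0] == 0 for size in sizes)


# Shortened fragment of the command line generated by Galaxy.
fragment = (
    "hicBuildMatrix --binSize '10000' '20000' '50000' '100000' "
    "--genomeAssembly 'hg38' --outFileName 'matrix.cool'"
)
sizes = [int(value) for value in option_values(fragment, "--binSize")]
multiples_ok = all_multiples_of_first(sizes)
```

For this fragment the four bin sizes satisfy the rule, which is consistent with the conclusion that the argument itself is well formed.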

@bskubi
Author

bskubi commented Nov 16, 2024

@bernt-matthias

It appears that the reason that the attempt to build a multi-resolution .mcool file is not working is that Galaxy hardcodes the output file name as having a .cool extension rather than a .mcool extension. However, the hicBuildMatrix command-line utility infers whether a .cool or .mcool file ought to be built based on this extension. So even if multiple --binSize parameters are passed, only a .cool file will be built.

For this understanding, I am referring to the documentation here.

hicBuildMatrix supports building multicooler matrices which are for example needed for visualization with HiGlass. To do so, use as out file format either .cool or .mcool and define the desired resolutions as --binSize.

A potential solution would be to add an additional output file type '.mcool' in addition to '.cool' and '.h5' and select the filename extension accordingly.
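A minimal sketch of that proposal, assuming the wrapper offers three output formats and derives the file name from the selected one. The names `EXTENSIONS`, `out_file_name`, and `resolutions_may_be_dropped` are hypothetical and are not taken from the actual Galaxy wrapper; the second helper only encodes the behaviour observed in this thread (several bin sizes combined with a .cool name yielded a single-resolution file).

```python
# Hypothetical mapping from the selected output format to a file extension.
EXTENSIONS = {"h5": ".h5", "cool": ".cool", "mcool": ".mcool"}


def out_file_name(output_format: str, stem: str = "matrix") -> str:
    """Build the value passed to --outFileName for the selected format."""
    if output_format not in EXTENSIONS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return stem + EXTENSIONS[output_format]


def resolutions_may_be_dropped(output_format: str, bin_sizes: list[int]) -> bool:
    """Flag requests like the one in this thread: several bin sizes, non-mcool output."""
    return len(bin_sizes) > 1 and output_format != "mcool"
```

A wrapper could use the second helper to warn the user or to switch the format before the job is submitted.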

@bernt-matthias
Contributor

Seems to be the easiest solution (maybe plus some docs).
