Problem with collate function #9

Open
kitzcode opened this issue Sep 17, 2018 · 6 comments

@kitzcode

Hi! Thanks for this wonderful application, it helped me a lot.
However, I ran into an issue with the collate function that I can't figure out how to solve:

picardmetrics collate filename /path/filename/
picardmetrics version 0.2.4 2016-07-06
2018-09-17 10:47:47 START filename
2018-09-17 10:47:47 Collating 96 alignment_summary_metrics files
2018-09-17 10:47:53 Collating 96 quality_distribution_metrics files
2018-09-17 10:47:57 Collating 96 rnaseq_metrics files (summary)
2018-09-17 10:48:01 Collating 96 rnaseq_metrics files (coverage)
2018-09-17 10:48:05 Collating 96 gc_bias_metrics files
2018-09-17 10:48:07 Collating 96 gc_bias_histogram files
2018-09-17 10:48:12 Collating 96 duplicate_metrics files
2018-09-17 10:48:16 Collating 96 insert_size_metrics files
2018-09-17 10:48:21 Collating 96 insert_size_metrics files (histogram)
2018-09-17 10:48:25 Collating 96 base_distribution_by_cycle files
2018-09-17 10:48:32 Collating 96 library_complexity files
2018-09-17 10:48:36 Collating 96 library_complexity files (histogram)
2018-09-17 10:48:40 Collating 96 mapq_stats files
2018-09-17 10:48:41 Joining all files into 'filename-all-metrics.tsv'
Error: all(dat_align_metrics$SAMPLE == dat_duplicate_metrics$SAMPLE) is not TRUE
Execution halted
2018-09-17 10:48:42 DONE filename

I don't understand where the problem lies, as there are 96 files in each category. When I count the lines in the intermediate files, I get 97 for duplicate_metrics and 289 for alignment_metrics.

Thanks
Alex

@slowkow
Owner

slowkow commented Sep 20, 2018

Could I ask you to inspect the contents of the two files?

  • alignment_summary_metrics
  • duplicate_metrics

I wonder if one of the files is missing a result.

@kitzcode
Author

Hi! I inspected the files and they seem valid. Each sample has one line in duplicate_metrics and 3 lines in alignment_metrics. No empty lines, no duplicates... I even realigned my samples in case something went wrong there, but still get the same error.

Anything else I could check?

@slowkow
Owner

slowkow commented Oct 11, 2018

If you could share the output files, I might be able to fix the code to work with your files.

@kitzcode
Author

@slowkow
Owner

slowkow commented Oct 11, 2018

I was confused because you added .xlsx to the file names. These files are not Microsoft Excel spreadsheets; they're just plain text files.

You cannot open them with Excel:

[Screenshot taken 2018-10-11 at 1:02:37 pm]

You can read the contents anyway:

$ head 8Fat-alignment-metrics.xlsx | cut -f1-3 | column -t
SAMPLE                                                                   CATEGORY        TOTAL_READS
scratch60/ercc/8Fat/picardmetrics/8FTreg10_S57Aligned.sortedByCoord.out  FIRST_OF_PAIR   333
scratch60/ercc/8Fat/picardmetrics/8FTreg10_S57Aligned.sortedByCoord.out  SECOND_OF_PAIR  333
scratch60/ercc/8Fat/picardmetrics/8FTreg10_S57Aligned.sortedByCoord.out  PAIR            666
scratch60/ercc/8Fat/picardmetrics/8FTreg11_S82Aligned.sortedByCoord.out  FIRST_OF_PAIR   964252
scratch60/ercc/8Fat/picardmetrics/8FTreg11_S82Aligned.sortedByCoord.out  SECOND_OF_PAIR  964252
scratch60/ercc/8Fat/picardmetrics/8FTreg11_S82Aligned.sortedByCoord.out  PAIR            1928504
scratch60/ercc/8Fat/picardmetrics/8FTreg12_S84Aligned.sortedByCoord.out  FIRST_OF_PAIR   723032
scratch60/ercc/8Fat/picardmetrics/8FTreg12_S84Aligned.sortedByCoord.out  SECOND_OF_PAIR  723032
scratch60/ercc/8Fat/picardmetrics/8FTreg12_S84Aligned.sortedByCoord.out  PAIR            1446064

I tried running the R code in the picardmetrics script:

picardmetrics/picardmetrics

Lines 929 to 932 in 94cb651

if (!is.null(dat_duplicate_metrics)) {
  stopifnot( all(dat_align_metrics$SAMPLE == dat_duplicate_metrics$SAMPLE) )
  dat_align_metrics <- merge(dat_align_metrics, dat_duplicate_metrics, by = "SAMPLE")
}
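
A note on what that check requires: `==` on two character vectors compares them element by element, so the assertion holds only when both tables list the same samples in the same order. The sketch below uses made-up sample names (not taken from the files in this issue) to show how an order difference alone makes the check fail.

```r
# Hypothetical sample names, for illustration only.
align_samples     <- c("s1", "s2", "s3")
duplicate_samples <- c("s1", "s3", "s2")  # same set of names, different order

# Elementwise comparison: positions 2 and 3 disagree, so this is FALSE.
same_order <- all(align_samples == duplicate_samples)

# Order-insensitive comparison of the two sets of names.
same_set <- setequal(align_samples, duplicate_samples)

print(c(same_order = same_order, same_set = same_set))
```

Here `same_order` is FALSE while `same_set` is TRUE, which is the kind of situation where `stopifnot()` would halt even though no sample is missing.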

It looks like your files are ok:

setwd("~/Downloads/8Fat")
read_tsv <- function(filename, ...) {
  if (!file.exists(filename)) {
    warning("File does not exist: ", filename)
    return(NULL)
  }
  dat <- read.delim(filename, stringsAsFactors = FALSE, ...)
  return(dat)
}

dat_align_metrics <- read_tsv("8Fat-alignment-metrics.tsv")
dat_duplicate_metrics <- read_tsv("8Fat-duplicate-metrics.tsv")
# Drop the FIRST_OF_PAIR and SECOND_OF_PAIR rows, leaving one summary row (PAIR) per sample.
idx = !dat_align_metrics$CATEGORY %in% c("FIRST_OF_PAIR", "SECOND_OF_PAIR")
dat_align_metrics = dat_align_metrics[idx, ]
all(dat_align_metrics$SAMPLE == dat_duplicate_metrics$SAMPLE)
#> [1] TRUE

Created on 2018-10-11 by the reprex package (v0.2.0).
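
If the same comparison returns FALSE in another environment, one way to locate the disagreement is to compare the two SAMPLE columns directly. This is only a sketch: `compare_samples` is a hypothetical helper, not part of picardmetrics, and the sample names in the example are invented.

```r
# Hypothetical helper (not part of picardmetrics): summarise how two
# vectors of sample names differ.
compare_samples <- function(a, b) {
  same_length <- length(a) == length(b)
  list(
    only_in_a       = setdiff(a, b),
    only_in_b       = setdiff(b, a),
    duplicated_in_a = unique(a[duplicated(a)]),
    duplicated_in_b = unique(b[duplicated(b)]),
    same_length     = same_length,
    # Positions where the vectors disagree; only computed when lengths match.
    mismatched_positions = if (same_length) which(a != b) else integer(0)
  )
}

# Toy example with invented sample names.
res <- compare_samples(c("s1", "s2", "s3"), c("s1", "s3", "s4"))
str(res)
```

With the real tables, the call would be `compare_samples(dat_align_metrics$SAMPLE, dat_duplicate_metrics$SAMPLE)` after the CATEGORY filter shown above has been applied.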

Here are my suggestions:

  1. Ensure you are using the latest version of the picardmetrics script.

  2. Copy the R code in the picardmetrics script and paste it into a new file. Then run each line of R code by yourself in your own R session. You might find the problem that way.

  3. Don't bother collating all of the files into 1 file. Just work with the files you have.
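
For suggestions 2 and 3, the per-metric tables can also be joined by hand in an R session without the strict ordering check. The sketch below uses toy data frames as stand-ins for the real tables. Only the SAMPLE column name is taken from the thread, and the other column names and all values are illustrative.

```r
# Toy stand-ins for two per-metric tables; values are invented.
dat_align <- data.frame(
  SAMPLE = c("s1", "s2", "s3"),
  TOTAL_READS = c(100, 200, 300),
  stringsAsFactors = FALSE
)
dat_dup <- data.frame(
  SAMPLE = c("s3", "s1"),               # different order, and s2 is missing
  PERCENT_DUPLICATION = c(0.2, 0.1),
  stringsAsFactors = FALSE
)

# merge() matches rows by key, so row order does not matter. all = TRUE keeps
# samples that are missing from either table and fills the gaps with NA.
merged <- merge(dat_align, dat_dup, by = "SAMPLE", all = TRUE)
print(merged)

# Samples without a duplicate-metrics row show up as NA here.
missing_dup <- merged$SAMPLE[is.na(merged$PERCENT_DUPLICATION)]
```

Rows with NA after the merge point to the samples whose metrics file is absent or named differently, which may be easier to act on than a halted script.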

Good luck!

@kitzcode
Copy link
Author

Ok, I'll try! Thanks!
