Problem with collate function #9

kitzcode · 2018-09-17T15:06:11Z

Hi! Thanks for this wonderful application, it helped me a lot.
However, I ran into an issue with the collate function that I can't figure out how to solve:

picardmetrics collate filename /path/filename/
picardmetrics version 0.2.4 2016-07-06
2018-09-17 10:47:47 START filename
2018-09-17 10:47:47 Collating 96 alignment_summary_metrics files
2018-09-17 10:47:53 Collating 96 quality_distribution_metrics files
2018-09-17 10:47:57 Collating 96 rnaseq_metrics files (summary)
2018-09-17 10:48:01 Collating 96 rnaseq_metrics files (coverage)
2018-09-17 10:48:05 Collating 96 gc_bias_metrics files
2018-09-17 10:48:07 Collating 96 gc_bias_histogram files
2018-09-17 10:48:12 Collating 96 duplicate_metrics files
2018-09-17 10:48:16 Collating 96 insert_size_metrics files
2018-09-17 10:48:21 Collating 96 insert_size_metrics files (histogram)
2018-09-17 10:48:25 Collating 96 base_distribution_by_cycle files
2018-09-17 10:48:32 Collating 96 library_complexity files
2018-09-17 10:48:36 Collating 96 library_complexity files (histogram)
2018-09-17 10:48:40 Collating 96 mapq_stats files
2018-09-17 10:48:41 Joining all files into 'filename-all-metrics.tsv'
Error: all(dat_align_metrics$SAMPLE == dat_duplicate_metrics$SAMPLE) is not TRUE
Execution halted
2018-09-17 10:48:42 DONE filename

I don't understand where the problem lies, as there are 96 files in each category. When I count the lines in the intermediate files, I get 97 for duplicate_metrics and 289 for alignment_metrics.

Thanks
Alex
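
Editor's note: the two line counts reported above are what one would expect for 96 samples if each collated file has a single header line, duplicate_metrics has one row per sample, and alignment_metrics has three rows per sample (FIRST_OF_PAIR, SECOND_OF_PAIR and PAIR for paired-end data). A minimal sketch of that arithmetic follows; the function name `expected_lines` is hypothetical and is not part of picardmetrics.

```r
# Expected line count of a collated TSV file: one header line plus
# rows_per_sample data lines for each sample.
# NOTE: expected_lines() is an illustrative helper, not part of picardmetrics.
expected_lines <- function(n_samples, rows_per_sample) {
  n_samples * rows_per_sample + 1
}

expected_lines(96, 1)  # duplicate_metrics: one row per sample
#> [1] 97
expected_lines(96, 3)  # alignment_metrics: three rows per paired-end sample
#> [1] 289
```

So the counts alone do not point to a missing or extra sample.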

slowkow · 2018-09-20T01:03:42Z

Could I ask you to inspect the contents of the two files?

alignment_summary_metrics
duplicate_metrics

I wonder if one of the files is missing a result.

kitzcode · 2018-10-11T14:45:43Z

Hi! I inspected the files and they seem valid. Each sample has one line in duplicate_metrics and 3 lines in alignment_metrics. No empty lines, no duplicates... I even realigned my samples in case something went wrong there, but still get the same error.

Anything else I could check?
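
Editor's note: the manual inspection described above (one row per sample in duplicate_metrics, three in alignment_metrics, no duplicates) could also be done in R. The sketch below uses small made-up data frames with hypothetical sample names `s1` and `s2` in place of the real collated files, so it only illustrates the kind of check and is not the picardmetrics code.

```r
# Toy stand-ins for the two collated tables (hypothetical sample names).
align <- data.frame(
  SAMPLE = rep(c("s1", "s2"), each = 3),
  CATEGORY = rep(c("FIRST_OF_PAIR", "SECOND_OF_PAIR", "PAIR"), times = 2),
  stringsAsFactors = FALSE
)
dup <- data.frame(SAMPLE = c("s1", "s2"), stringsAsFactors = FALSE)

# Count rows per sample: expect 3 for every sample in align, 1 in dup.
rows_align <- table(align$SAMPLE)
rows_dup <- table(dup$SAMPLE)
all(rows_align == 3)
all(rows_dup == 1)

# Sample names that differ only by surrounding whitespace look identical
# by eye but are not equal under ==, so flag any that would change if trimmed.
has_padding <- any(align$SAMPLE != trimws(align$SAMPLE))
has_padding
```

With real data, `align` and `dup` would be read from the collated files instead of being built by hand.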

slowkow · 2018-10-11T14:57:51Z

If you could share the output files, I might be able to fix the code to work with your files.

kitzcode · 2018-10-11T16:56:56Z

8Fat-alignment-metrics.xlsx
8Fat-duplicate-metrics..xlsx

slowkow · 2018-10-11T17:22:00Z

I was confused because you added .xlsx to the file names. These files are not Microsoft Excel spreadsheets, they're just plain text files.

You cannot open them with Excel, but you can read the contents anyway:

$ head 8Fat-alignment-metrics.xlsx | cut -f1-3 | column -t
SAMPLE                                                                   CATEGORY        TOTAL_READS
scratch60/ercc/8Fat/picardmetrics/8FTreg10_S57Aligned.sortedByCoord.out  FIRST_OF_PAIR   333
scratch60/ercc/8Fat/picardmetrics/8FTreg10_S57Aligned.sortedByCoord.out  SECOND_OF_PAIR  333
scratch60/ercc/8Fat/picardmetrics/8FTreg10_S57Aligned.sortedByCoord.out  PAIR            666
scratch60/ercc/8Fat/picardmetrics/8FTreg11_S82Aligned.sortedByCoord.out  FIRST_OF_PAIR   964252
scratch60/ercc/8Fat/picardmetrics/8FTreg11_S82Aligned.sortedByCoord.out  SECOND_OF_PAIR  964252
scratch60/ercc/8Fat/picardmetrics/8FTreg11_S82Aligned.sortedByCoord.out  PAIR            1928504
scratch60/ercc/8Fat/picardmetrics/8FTreg12_S84Aligned.sortedByCoord.out  FIRST_OF_PAIR   723032
scratch60/ercc/8Fat/picardmetrics/8FTreg12_S84Aligned.sortedByCoord.out  SECOND_OF_PAIR  723032
scratch60/ercc/8Fat/picardmetrics/8FTreg12_S84Aligned.sortedByCoord.out  PAIR            1446064

I tried running the R code in the picardmetrics script:

picardmetrics/picardmetrics

Lines 929 to 932 in 94cb651

    
if (!is.null(dat_duplicate_metrics)) {
  stopifnot( all(dat_align_metrics$SAMPLE == dat_duplicate_metrics$SAMPLE) )
  dat_align_metrics <- merge(dat_align_metrics, dat_duplicate_metrics, by = "SAMPLE")
}

It looks like your files are ok:

setwd("~/Downloads/8Fat")
read_tsv <- function(filename, ...) {
  if (!file.exists(filename)) {
    warning("File does not exist: ", filename)
    return(NULL)
  }
  dat <- read.delim(filename, stringsAsFactors = FALSE, ...)
  return(dat)
}

dat_align_metrics <- read_tsv("8Fat-alignment-metrics.tsv")
dat_duplicate_metrics <- read_tsv("8Fat-duplicate-metrics.tsv")
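# Keep one row per sample by dropping the FIRST_OF_PAIR and SECOND_OF_PAIR rows.
idx = !dat_align_metrics$CATEGORY %in% c("FIRST_OF_PAIR", "SECOND_OF_PAIR")
dat_align_metrics = dat_align_metrics[idx, ]
# Same positional comparison that the script runs before merging.
all(dat_align_metrics$SAMPLE == dat_duplicate_metrics$SAMPLE)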
#> [1] TRUE

Created on 2018-10-11 by the reprex package (v0.2.0).
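
Editor's note: the `stopifnot()` check quoted from the script compares the two SAMPLE columns position by position, so it can fail even when both tables contain exactly the same samples, for example if the rows are in a different order. Whether that is what happened in this issue is not known. The sketch below uses made-up data frames `a` and `b` with hypothetical sample names to show the difference between a positional comparison and a set comparison, and how `setdiff()` can list samples that appear in only one table.

```r
# Two toy tables with the same samples in a different row order
# (hypothetical data, not taken from the issue).
a <- data.frame(SAMPLE = c("s1", "s2", "s3"), x = 1:3, stringsAsFactors = FALSE)
b <- data.frame(SAMPLE = c("s2", "s1", "s3"), y = 4:6, stringsAsFactors = FALSE)

# Positional comparison, as in the script: fails because the order differs.
same_order <- all(a$SAMPLE == b$SAMPLE)

# Set comparison: passes because both tables hold the same sample names.
same_set <- setequal(a$SAMPLE, b$SAMPLE)

# Samples present in one table but not the other (both empty here).
only_in_a <- setdiff(a$SAMPLE, b$SAMPLE)
only_in_b <- setdiff(b$SAMPLE, a$SAMPLE)

# merge() matches rows by key, so it does not depend on row order.
merged <- merge(a, b, by = "SAMPLE")
```

Printing `only_in_a` and `only_in_b` for the real tables would show directly which sample names, if any, fail to match.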

Here are my suggestions:

Ensure you are using the latest version of the picardmetrics script.
Copy the R code in the picardmetrics script and paste it into a new file. Then run each line of R code by yourself in your own R session. You might find the problem that way.
Don't bother collating all of the files into 1 file. Just work with the files you have.

Good luck!

kitzcode · 2018-10-11T17:36:20Z

Ok, I'll try! Thanks!
