Potential Bug: GDC Prepare does not work for breast cancer data #619

Open
fabianjkrueger opened this issue Feb 19, 2024 · 1 comment

@fabianjkrueger
Contributor

Hello!

There seems to be an issue with preparing certain data sets for analysis. It's weird, since it works for some of the projects but not for others. One of the projects causing issues is breast cancer ("BRCA"). I queried and downloaded the data for the different projects in a script as shown below.

mutationQueryBRCA <- GDCquery(project = "TCGA-BRCA",
                              data.category = "Simple Nucleotide Variation",
                              data.type = "Masked Somatic Mutation")

# this is the step that just won't work for breast cancer...
mutationDataBRCA <- GDCprepare(mutationQueryBRCA, # specify which query to use
                               save = TRUE, # save the output as a file
                               save.filename = file.path(prepared_path, "BRCA_SNVMSM.RData"),
                               directory = dl_path, # directory where downloaded files are stored
                               remove.files.prepared = FALSE)

All paths are stored in variables, so this is not the issue. This code works for almost all the other cancer types, for example colon adenocarcinoma (project "COAD").

This is the error message I get:

Error in `dplyr::bind_rows()`:
! Can't combine `..151$Tumor_Seq_Allele2` <character> and `..152$Tumor_Seq_Allele2` <logical>.
Backtrace:
 1. TCGAbiolinks::GDCprepare(...)
 2. TCGAbiolinks:::readSimpleNucleotideVariationMaf(files)
 3. purrr::map_dfr(...)
 4. dplyr::bind_rows(res, .id = .id)

To me, it looks like there is a problem with data types, but I don't know how to fix it.
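
For reference, the same kind of type conflict can be reproduced in isolation. This is a minimal sketch with made-up toy tables, not the actual TCGA files, and it assumes a dplyr version (1.0 or later) that checks column types strictly; how the logical column arises in the real data is not shown here.

```r
library(dplyr)

# Two toy tables standing in for per-file MAF data (hypothetical values).
# The same column is character in one table and logical in the other.
maf_a <- tibble(Tumor_Seq_Allele2 = c("A", "G"))
maf_b <- tibble(Tumor_Seq_Allele2 = TRUE)

# bind_rows() refuses to combine a character and a logical column,
# so the error message is captured here instead of stopping the script.
result <- tryCatch(
  bind_rows(maf_a, maf_b),
  error = function(e) conditionMessage(e)
)
print(result)
```

If the per-file tables are read with guessed column types, a single file whose column is guessed differently from the others would be enough to trigger this.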

Is there anything else I might be missing? Are there temporary files that depend on loading a specific library for reading them? If not, there might be a bug.

@DzmitryGB

I encountered a similar bug while preparing a query for TCGA-UCEC. It comes from the TCGAbiolinks:::readSimpleNucleotideVariationMaf call, where an empty table leads to an incompatible column type. My workaround uses data.table::fread instead of readr:

library(TCGAbiolinks)
library(data.table)

query <- GDCquery(
    project = "TCGA-UCEC",
    data.category = "Simple Nucleotide Variation",
    data.type = "Masked Somatic Mutation",
    data.format = "MAF"
)
GDCdownload(query)
# query_results <- GDCprepare(query) # this errors out

# rebuild the paths of the downloaded MAF files under the default "GDCdata" directory
files <- file.path(
    "GDCdata",
    query$results[[1]]$project,
    gsub(" ", "_", query$results[[1]]$data_category),
    gsub(" ", "_", query$results[[1]]$data_type),
    gsub(" ", "_", query$results[[1]]$file_id),
    gsub(" ", "_", query$results[[1]]$file_name)
)
# read each file with fread and stack the per-file tables
maf_data <- do.call(rbind, lapply(files, fread, header = TRUE, skip = "#", sep = "\t"))

TCGAbiolinks v2.32.0, readr v2.1.5, R version 4.4.1 (2024-06-14)
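
Another possible workaround, sketched here under assumptions rather than tested against the real GDC downloads, is to keep readr but force every column to character so that no per-file type guessing takes place. The helper name `read_maf_as_character` and the toy file contents are hypothetical and only serve the illustration; they are not part of TCGAbiolinks.

```r
library(readr)
library(dplyr)

# Hypothetical helper: read one MAF file with all columns as character.
# Lines starting with "#" are treated as comments and skipped.
read_maf_as_character <- function(path) {
  read_tsv(
    path,
    comment = "#",
    col_types = cols(.default = col_character())
  )
}

# Two toy MAF-like files (made-up contents) written to a temp directory.
tmp <- tempfile("maf_demo_")
dir.create(tmp)
file_a <- file.path(tmp, "a.maf")
file_b <- file.path(tmp, "b.maf")
writeLines(
  c("#version demo", "Hugo_Symbol\tTumor_Seq_Allele2", "TP53\tA", "PIK3CA\tG"),
  file_a
)
writeLines(
  c("#version demo", "Hugo_Symbol\tTumor_Seq_Allele2", "PTEN\tT"),
  file_b
)

# With identical column types in every table, bind_rows() can stack them.
maf_demo <- bind_rows(lapply(c(file_a, file_b), read_maf_as_character))
print(maf_demo)
```

With the real data, `files` from the snippet above would take the place of the two toy paths. The trade-off is that numeric columns such as positions come back as text and have to be converted afterwards.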
