different sample size between TCGA portal and TCGAbiolinks package #605

Open
tyasird opened this issue Oct 12, 2023 · 4 comments

Comments

tyasird commented Oct 12, 2023

I was retrieving mutation data from the TCGA portal with TCGAbiolinks and noticed that the sample sizes do not match.

For instance, for TCGA-OV the TCGA data portal shows 419 cases, whereas TCGAbiolinks shows 462 samples. The file count is the same for both: 482.

So why are they different?

This is my query in the TCGA data portal:

cases.project.project_id in ["TCGA-OV"] and files.analysis.workflow_type in ["Aliquot Ensemble Somatic Variant Merging and Masking"] and files.data_category in ["Simple Nucleotide Variation"] and files.data_type in ["Masked Somatic Mutation"]

This is the same query in the TCGAbiolinks package:

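library(TCGAbiolinks)
library(maftools)

# Query: open-access masked somatic mutations (MAF) for TCGA-OV
query <- GDCquery(
  project = "TCGA-OV",
  data.category = "Simple Nucleotide Variation",
  access = "open",
  data.type = "Masked Somatic Mutation",
  workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking"
)

# Download the files and read them into a single data frame
GDCdownload(query)
maf <- GDCprepare(query)

# Build a maftools MAF object and summarise it
mafr <- maftools::read.maf(maf)
mutations <- mafSummary(mafr)

# Number of samples reported in the maftools summary table
print(as.numeric(mafr@summary[mafr@summary$ID == "Samples"]$summary))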

tiagochst (Contributor) commented Oct 12, 2023 via email

tyasird (Author) commented Oct 13, 2023

@tiagochst I still don't understand. Aren't the counts supposed to be higher in the TCGA portal? Why are they higher in the TCGAbiolinks results? Put another way, how can I get the same sample size as the TCGA portal?

tiagochst (Contributor) commented Oct 13, 2023 via email

tyasird (Author) commented Oct 26, 2023

@tiagochst

Thanks for your answer.
I use maftools for that; the object returned by read.maf contains a summary table.
I just opened that table, and for TCGA-OV it shows 462 samples. I am sharing the screenshot with you.
This is the code I ran on the TCGAbiolinks query:

GDCdownload(query)
maf <- GDCprepare(query)
mafr = maftools::read.maf(maf)
print(as.numeric(mafr@summary[mafr@summary$ID=="Samples"]$summary))

[image: screenshot of the maftools summary table]
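
One way to check where the gap between 419 cases and 462 samples might come from is to count cases and samples separately on the same table. The sketch below is only an illustration, not a confirmed diagnosis of this issue: it assumes the count reported by maftools corresponds to distinct `Tumor_Sample_Barcode` values, and it relies on the fact that the first 12 characters of a TCGA barcode identify the patient (case). The toy data frame and its barcodes are made up; with real data, `maf` would be the output of `GDCprepare(query)`.

```r
# Toy stand-in for the data frame returned by GDCprepare(query).
# The barcodes below are hypothetical and only illustrate the idea.
maf <- data.frame(
  Hugo_Symbol = c("TP53", "BRCA1", "TP53", "NF1"),
  Tumor_Sample_Barcode = c(
    "TCGA-00-0001-01A-01D-0001-01",
    "TCGA-00-0001-01A-01D-0001-01",
    "TCGA-00-0001-01A-01W-0002-09",  # same patient, different aliquot
    "TCGA-00-0002-01A-01D-0003-01"
  ),
  stringsAsFactors = FALSE
)

# Samples: distinct full barcodes (assumed to be what maftools counts).
n_samples <- length(unique(maf$Tumor_Sample_Barcode))

# Cases: the first 12 characters of a TCGA barcode identify the patient.
case_id <- substr(maf$Tumor_Sample_Barcode, 1, 12)
n_cases <- length(unique(case_id))

# Cases that contribute more than one distinct barcode.
barcodes_per_case <- tapply(
  maf$Tumor_Sample_Barcode, case_id,
  function(x) length(unique(x))
)
multi <- names(barcodes_per_case[barcodes_per_case > 1])

cat("samples:", n_samples, "cases:", n_cases, "\n")
print(multi)
```

If `n_cases` on the real table comes out close to the portal's case count while `n_samples` stays at 462, the difference would be explained by patients with more than one aliquot in the MAF files; if not, the cause lies elsewhere.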
