Questions about MSstats file #1994

SofiaFarkona · 2025-01-24T01:30:56Z

Hello Fengchao and everybody here.
We have just started using FragPipe. We are using it to process phosphopeptide-enriched data and the corresponding non-enriched (proteome) data. The samples are NOT labeled, so we want to do label-free quantification. The important point is that I need FragPipe to create MSstatsPTM-compatible csv files so that we can analyze them with that package.
So, with FragPipe we

processed our NON phospho-enriched data (proteome data)
and then
processed our phospho-enriched data
…………………
While I do get the MSstats csv files after processing each set of data, those csv files seem to be missing necessary information. Luckily, I have some examples of appropriate MSstats csv files with all the information we need for further processing (so I can compare and also show them to you).
Here is what I did. I attached small versions of the MSstats csv files I got from our analysis (the first 100 rows of each csv). I have also attached small versions of the example MSstats csv files (again the first 100 rows). These small versions are enough to show that the csv files produced by our FragPipe run are missing a lot of information.

About the attached files:
msstats_from_our_PhoshoEnrichedrun_SMALL.csv
msstats_ptm_from_our_PhosphoEnrichedrun_SMALL.csv
both of these two above were produced when we processed our enriched data

msstats_ptm_from_our_proteome_run_small.csv
msstats_from_our_proteome_run_small.csv
both of these two above were produced when we processed our proteome data (non-enriched)

msstats_ptm_example_small.csv
msstats_proteome_example_small.csv
both of the above are the example msstats csv files. I assume the first one comes from running enriched samples and the second from running the respective non-enriched samples (proteome).

Why did this happen? Why are we missing so much valuable information?
To give some information on how we ran FragPipe, I attached the log files from processing the phospho-enriched run and from processing the non-enriched run.

I would really appreciate some help since we just switched to FragPipe for a smoother analysis of our data with MSstatsPTM.

Best,
Sofia

msstats_from_our_PhoshoEnrichedrun_SMALL.csv

msstats_ptm_from_our_PhosphoEnrichedrun_SMALL.csv

msstats_ptm_from_our_proteome_run_small.csv

msstats_from_our_proteome_run_small.csv

msstats_ptm_example_small.csv

msstats_proteome_example_small.csv

log_2025-01-23_13-03-41.txt

log_2025-01-21_11-04-49.txt

fcyu · 2025-01-24T01:38:35Z

Hi Sofia,

Thank you for reaching out.

Your data are from label-free quantification, while the example files you found are from a TMT analysis, which has a very different format.

I think an easy way to check whether FragPipe's msstats files are good is to run MSstats on them and see if it works.

Best,

Fengchao

SofiaFarkona · 2025-01-24T04:36:10Z

Fengchao! Thanks so much for the fast reply! I see your point. I will run it eventually, but I really need to understand what I should expect from my results. As I mentioned, we used to process these data with another software, and now that we have switched, I am trying to figure out where all this information is (and of course to make sure I have the appropriate MSstats files). So yes, I do understand that our data are NOT TMT; that is why we processed them with a default workflow from the dropdown menu that included LFQ. We also chose IonQuant in the Quant (MS1) tab and then LFQ, MaxLFQ.

So, my first question:
Were we supposed to have columns in the output MSstats csv indicating LFQ intensities (at least in the output MSstats csv from processing the Proteome data)? We always got this information when we processed data with our previous software. I pasted here a screenshot of a CSV produced by the previous software just to show what I mean.

Again, I am trying to understand the files we are getting back and what we should expect based on how we processed them with FragPipe.

Second question: Some extra columns/variables seem to be missing from our output msstats_ptm csv (from our enriched runs). I am including here the str() output of the msstats_ptm file, showing what we got from our analysis with FragPipe versus how the example msstats file looks.
The following is our file:
> str(msstats_ptm_from_our_PhosphoEnrichedrun)
'data.frame': 174312 obs. of 14 variables:
$ ProteinName : chr "contam_sp|P00761|TRYP_PIG" "contam_sp|P00761|TRYP_PIG" "contam_sp|P00761|TRYP_PIG" "contam_sp|P00761|TRYP_PIG" ...
$ PeptideSequence : chr "LSSPATLNSR" "LSSPATLNSR" "LSSPATLNSR" "LSSPATLNSR" ...
$ Protein.Start : int 98 98 98 98 98 98 98 98 98 98 ...
$ Protein.End : int 107 107 107 107 107 107 107 107 107 107 ...
$ PrecursorCharge : int 2 2 2 2 2 2 2 2 2 2 ...
$ FragmentIon : logi NA NA NA NA NA NA ...
$ ProductCharge : logi NA NA NA NA NA NA ...
$ IsotopeLabelType: chr "L" "L" "L" "L" ...
$ Condition : chr "siRNA_V" "siRNA_T" "NT_V" "NT_T" ...
$ BioReplicate : int 3 3 3 3 2 2 2 2 1 1 ...
$ Run : chr "TiO2enrichSept_1" "TiO2enrichSept_2" "TiO2enrichSept_3" "TiO2enrichSept_4" ...
$ Intensity : num 10205626 3246423 3884939 4841823 NA ...
$ STY.79.9663 : chr NA NA NA NA ...
$ M.15.9949 : chr NA NA NA NA ...

The next one is the example file:
> str(msstats_ptm_example)
'data.frame': 17177 obs. of 29 variables:
$ Spectrum.Name : chr "16CPTAC_CCRCC_P_JHU_20180326_LUMOS_f01.02743.02743.4" "16CPTAC_CCRCC_P_JHU_20180326_LUMOS_f01.02755.02755.4" "16CPTAC_CCRCC_P_JHU_20180326_LUMOS_f01.02812.02812.3" "16CPTAC_CCRCC_P_JHU_20180326_LUMOS_f01.02913.02913.3" ...
$ Spectrum.File : chr "16CPTAC_CCRCC_P_JHU_20180326_LUMOS_f01.mzML" "16CPTAC_CCRCC_P_JHU_20180326_LUMOS_f01.mzML" "16CPTAC_CCRCC_P_JHU_20180326_LUMOS_f01.mzML" "16CPTAC_CCRCC_P_JHU_20180326_LUMOS_f01.mzML" ...
$ Peptide.Sequence : chr "RRHSHSHSPMSTR" "RHSHSHSPMSTR" "HSHSHSPMSTR" "HTRDSEAQR" ...
$ Modified.Peptide.Sequence: chr "n[230]RRHSHS[167]HS[167]PM[147]STR" "n[230]RHSHS[167]HS[167]PM[147]STR" "n[230]HSHS[167]HS[167]PM[147]STR" "n[230]HTRDS[167]EAQR" ...
$ Probability : num 0.995 0.996 0.996 0.978 0.998 ...
$ Charge : int 4 4 3 3 5 3 3 3 3 3 ...
$ Protein.Start : int 92 93 94 6 123 636 1272 38 356 711 ...
$ Protein.End : int 104 104 104 14 134 645 1284 46 362 720 ...
$ Gene : chr "TRA2B" "TRA2B" "TRA2B" "STOM" ...
$ Mapped.Genes : chr "" "" "" "" ...
$ Protein : chr "sp|P62995|TRA2B_HUMAN" "sp|P62995|TRA2B_HUMAN" "sp|P62995|TRA2B_HUMAN" "sp|P27105|STOM_HUMAN" ...
$ Protein.ID : chr "P62995" "P62995" "P62995" "P27105" ...
$ Mapped.Proteins : chr "" "" "" "" ...
$ Protein.Description : chr "Transformer-2 protein homolog beta" "Transformer-2 protein homolog beta" "Transformer-2 protein homolog beta" "Stomatin" ...
$ Is.Unique : chr "true" "true" "true" "true" ...
$ Purity : num 0 1 1 0.53 0.84 1 0 1 1 1 ...
$ Intensity : num 0 93998240 58713048 0 32706532 ...
$ M.15.9949 : chr "RRHSHSHSPM(1.0000)STR" "RHSHSHSPM(1.0000)STR" "HSHSHSPM(1.0000)STR" "" ...
$ STY.79.966331 : chr "RRHS(0.1780)HS(0.8463)HS(0.8392)PMS(0.0709)T(0.0656)R" "RHS(0.1497)HS(0.8741)HS(0.8698)PMS(0.0523)T(0.0542)R" "HS(0.0582)HS(0.9325)HS(0.9327)PMS(0.0387)T(0.0380)R" "HT(0.3995)RDS(0.6005)EAQR" ...
$ Channel.126 : num 5579 19046 18499 13825 13346 ...
$ Channel.127N : num 8280 25292 24321 15934 24715 ...
$ Channel.127C : num 7035 38327 33518 8398 11790 ...
$ Channel.128N : num 10747 34385 31882 8001 18234 ...
$ Channel.128C : num 14872 42118 36766 12493 34780 ...
$ Channel.129N : num 17204 72898 60230 22852 12546 ...
$ Channel.129C : num 15444 33277 31367 11002 29433 ...
$ Channel.130N : num 11443 50291 41944 8395 18376 ...
$ Channel.130C : num 12986 49429 47435 10015 14490 ...
$ Channel.131N : num 11236 26750 20533 14174 16897 ...

If we exclude the columns referring to the TMT channels why is our file missing so many columns?
Why does our file not have variables such as "Modified.Peptide.Sequence", "Probability", "Charge", "Gene", "Protein.ID", "Is.Unique", "Protein.Description", etc.?

Thanks so much,
Best,
Sofia

fcyu · 2025-01-24T14:50:49Z

> Were we supposed to have columns in the output MSstats csv indicating LFQ intensities (at least in the output MSstats csv from processing the Proteome data)?

Of course, the msstats.csv file generated by FragPipe has the intensity. The difference from the one in your screenshot is that FragPipe's file is already in the format that can be processed directly by MSstats using dataProcess or other R functions, while the one in the screenshot first needs to be converted with one of the *toMSstatsFormat R functions.
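As a rough illustration of what "directly processed" means here, the sketch below checks whether a table carries the ten columns that, as far as I know, the native long MSstats format expects. The helper name `missing_msstats_cols` and the toy row are hypothetical; the toy row simply mirrors the str() output posted earlier in the thread. In a real session the table would come from reading the FragPipe msstats.csv, and the exact column requirements should be verified against the installed MSstats version.

```r
# Columns of the native long MSstats format (an assumption based on the
# MSstats documentation; verify against your installed version).
required_cols <- c("ProteinName", "PeptideSequence", "PrecursorCharge",
                   "FragmentIon", "ProductCharge", "IsotopeLabelType",
                   "Condition", "BioReplicate", "Run", "Intensity")

# Hypothetical helper: returns the required columns that are absent from df.
missing_msstats_cols <- function(df) setdiff(required_cols, names(df))

# Toy one-row table mirroring the str() output earlier in the thread.
toy <- data.frame(
  ProteinName = "contam_sp|P00761|TRYP_PIG", PeptideSequence = "LSSPATLNSR",
  Protein.Start = 98L, Protein.End = 107L, PrecursorCharge = 2L,
  FragmentIon = NA, ProductCharge = NA, IsotopeLabelType = "L",
  Condition = "siRNA_V", BioReplicate = 3L, Run = "TiO2enrichSept_1",
  Intensity = 10205626, stringsAsFactors = FALSE
)

# Nothing is missing from the full toy table.
print(missing_msstats_cols(toy))

# A table with only Run and Intensity lacks the other eight columns.
print(missing_msstats_cols(toy[, c("Run", "Intensity")]))
```

If the first call returns an empty character vector for the real file, the table at least has the expected shape; whether MSstats then runs cleanly still has to be tested on the full data.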

> If we exclude the columns referring to the TMT channels why is our file missing so many columns?

The reason is similar. The msstats_ptm_example file is from the TMT workflow; it is not in the native MSstats format and needs to be converted first (if I remember the TMT workflow correctly). Those additional columns are either not needed or have a different name in the native msstats.csv format, for example Charge vs PrecursorCharge and Modified.Peptide.Sequence vs PeptideSequence.

Best,

Fengchao

SofiaFarkona · 2025-01-24T21:08:13Z

OK I see.

So, from my understanding the LFQ intensity for each sample or run is summarized in this column (screenshot below) for our msstats_ptm_from_our_PhosphoEnrichedrun

and for our msstats_from_our_proteome_run (screenshot below)

is this right?

About the example files I shared: both the msstats_ptm_example file and the msstats_proteome_example file are ready to be run by the MSstats converter (through "FragPipetoMSstatsPTMFormat") (after we include the respective annotation files of course). No other change is required. I know because I have run the conversion, and it was successful w/o any issues.

And so I assumed that the MSstats csv we produced after running our data with FragPipe would be identical to the examples (minus the columns referring to the TMT files of course).

Anyhow I will use what I have and try to run the code for the conversion.

Best,
Sofia

fcyu · 2025-01-24T21:21:02Z

> is this right?

Not correct. One is in the "long" format: different samples are listed in different rows. And the other is in "wide" format: different samples are listed in different columns.
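To make the long/wide distinction concrete, here is a minimal base R sketch. The column names PeptideSequence, Run and Intensity are taken from the str() output posted earlier in the thread; the second peptide, the run names and all intensity values are made up for illustration. A table with a single Run column and a single Intensity column, like that str() output, is in long format.

```r
# Toy long-format table: one row per peptide and run (values are invented).
long <- data.frame(
  PeptideSequence = c("LSSPATLNSR", "LSSPATLNSR", "AAAK", "AAAK"),
  Run             = c("run_1", "run_2", "run_1", "run_2"),
  Intensity       = c(100, 200, 300, NA),
  stringsAsFactors = FALSE
)

# Long -> wide: one row per peptide, one intensity column per run.
wide <- reshape(long, idvar = "PeptideSequence", timevar = "Run",
                direction = "wide")

print(long)
print(wide)
```

The wide table holds the same numbers, with columns named Intensity.run_1 and Intensity.run_2, so neither layout contains more information than the other.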

> About the example files I shared: both the msstats_ptm_example file and the msstats_proteome_example file are ready to be run by the MSstats converter (through "FragPipetoMSstatsPTMFormat") (after we include the respective annotation files of course). No other change is required. I know because I have run the conversion, and it was successful w/o any issues.

This is exactly what I was talking about: files like these first need to be converted with one of the *toMSstatsFormat R functions. Here, you need to run the FragPipetoMSstatsPTMFormat R function before running the actual MSstats analysis function.

> And so I assumed that the MSstats csv we produced after running our data with FragPipe would be identical to the examples (minus the columns referring to the TMT files of course).

No, they are not. One is from the LFQ workflow whose msstats file is generated by IonQuant, and the other is from the TMT workflow whose msstats file is generated by Philosopher.

Best,

Fengchao

fcyu self-assigned this Jan 24, 2025

fcyu changed the title ~~MSstats files missing important information (for downstream analysis with MSstatsPTM)~~ Questions about MSstats file Jan 24, 2025
