Add match error explanation to docs #311

Merged
merged 2 commits into from
Jun 14, 2024
1 change: 1 addition & 0 deletions docs/_templates/globaltoc.html
@@ -11,6 +11,7 @@ <h3>Contents</h3>
<li><a href="{{ pathto('explanation/index') }}">Explanations</a></li>
<ul>
<li><a href="{{ pathto('explanation/plink2') }}">Why not use plink2?</a></li>
<li><a href="{{ pathto('explanation/match') }}">Match rate errors</a></li>
<li><a href="{{ pathto('explanation/geneticancestry') }}">Adjusting PGS with genetic ancestry</a></li>
<li><a href="{{ pathto('explanation/output') }}">Outputs & report</a></li>
</ul>
1 change: 1 addition & 0 deletions docs/explanation/index.rst
@@ -7,5 +7,6 @@ Explanation
:maxdepth: 1

output
match
geneticancestry
plink2
63 changes: 63 additions & 0 deletions docs/explanation/match.rst
@@ -0,0 +1,63 @@
.. _matchrates:

Why do I get match rate errors?
===============================

When you're running the PGS Catalog Calculator you might see errors like:

.. code-block:: console

    pgscatalog.core.lib.pgsexceptions.ZeroMatchesError: All scores fail to meet match threshold 0.75

You might also see that some scoring files in the report are coloured red and excluded from the output.

By default pgsc_calc will continue calculating if at least one score passes the **match rate threshold**, which is controlled by the ``--min_overlap`` parameter.

The default value is 0.75. We chose it based on our experience applying PGS to new cohorts, where most scores exceed this threshold.
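
As a rough illustration of how such a threshold behaves, here is a minimal Python sketch. It is not the calculator's implementation: the function name ``passing_scores``, the input dictionary, the toy counts, and the inclusive ``>=`` comparison are all assumptions made for this example.

.. code-block:: python

    def passing_scores(match_counts, min_overlap=0.75):
        """Return the scores whose match rate meets the threshold.

        match_counts maps a score ID to (matched_variants, total_variants).
        Hypothetical helper for illustration; not part of pgsc_calc.
        """
        passed = {}
        for score_id, (matched, total) in match_counts.items():
            rate = matched / total if total else 0.0
            if rate >= min_overlap:  # assumption: the threshold is inclusive
                passed[score_id] = rate
        if not passed:
            # Mirrors the situation where every score fails the threshold
            raise ValueError(
                f"All scores fail to meet match threshold {min_overlap}"
            )
        return passed


    # Toy counts: one score matches well, the other matches poorly
    counts = {"score_a": (90, 100), "score_b": (40, 100)}
    result = passing_scores(counts)

In this toy case only ``score_a`` is kept. If every score fell below the threshold, the sketch would raise an error instead of returning an empty result.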

If scores match your target genomes poorly, it's typically because of a problem with the input data (target genomes or scoring files).

What is matching?
-----------------

The calculator carefully checks that variants (rows) in a scoring file are present in your target genomes.

The matching procedure `is described in the preprint supplement <https://www.medrxiv.org/content/10.1101/2024.05.29.24307783v1.supplementary-material>`_.

The matching procedure never makes any changes to target genome data and only seeks to match variants in the scoring file to the genome.

Adjusting ``--min_overlap`` is a bad idea
------------------------------------------

The aim of the PGS Catalog Calculator is to faithfully recalculate scores submitted by authors to the PGS Catalog on new target genomes.

If few variants in a published scoring file are present in a target genome, then the calculated score isn't a good representation of the original published score.

When you evaluate the predictive performance of a score with a low match rate, it is less likely to reproduce the metrics reported in the PGS Catalog.

If you reduce ``--min_overlap`` then the calculator will output scores calculated with the remaining variants, **but these scores may not be representative of the original data submitted to the PGS Catalog.**
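
To see why missing variants matter, consider a toy example. A polygenic score is commonly computed as a weighted sum of effect allele dosages. The sketch below assumes that definition; the variant IDs, weights, and dosages are made up for illustration and do not come from any PGS Catalog score.

.. code-block:: python

    def polygenic_score(weights, dosages):
        """Sum effect weight x dosage over variants present in both inputs."""
        return sum(w * dosages[v] for v, w in weights.items() if v in dosages)


    # Hypothetical effect weights from a scoring file
    weights = {"var1": 0.5, "var2": -0.25, "var3": 1.0, "var4": 0.75}

    # Effect allele dosages for one sample, with all variants present
    full = {"var1": 2, "var2": 1, "var3": 0, "var4": 1}

    # The same sample when var3 and var4 are missing from the target genome
    partial = {"var1": 2, "var2": 1}

    full_score = polygenic_score(weights, full)
    partial_score = polygenic_score(weights, partial)
    match_rate = len(partial.keys() & weights.keys()) / len(weights)

Here half of the variants match, and the score calculated from the remaining variants differs from the score calculated with all four variants.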

Are your target genomes imputed? Are they WGS?
----------------------------------------------

The calculator assumes that target genotyping data were called from a limited number of markers on a genotyping array and imputed using a larger reference panel to increase variant density.

WGS data are not natively supported by the calculator (because homozygous REF sites are excluded from the variant sites). However, it's `possible to create compatible gVCFs from WGS data <https://github.com/PGScatalog/pgsc_calc/discussions/123#discussioncomment-6469422>`_.

In the future we plan to improve support for WGS.

Did you set the correct genome build?
-------------------------------------

The calculator will automatically grab scoring files in the correct genome build from the PGS Catalog. If match rates are low, you may have specified the wrong genome build. If you're using custom scoring files and the match rate is low, the ``--liftover`` parameter may have been omitted.

I'm still getting match rate errors. How do I figure out what's wrong?
----------------------------------------------------------------------

Problems with matching are normally because of problems with input data rather than the matching procedure.

If you're trying to reproduce a specific score and are experiencing problems, then some manual work is required.

Try checking the full variant matching log to see which variants are missing. The log will be in the work directory reported in the Nextflow error.

It can be a good idea to manually search your target genotypes for missing variants to see what's happening.
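
A manual search could look something like the following minimal sketch. It is not part of the calculator: the helper ``missing_variants``, the toy scoring file text, and the set of target positions are assumptions made for illustration. In practice you would read the positions from your own scoring file and target genotype files.

.. code-block:: python

    import csv
    import io


    def missing_variants(scoring_rows, target_positions):
        """Return scoring file rows whose (chromosome, position) is absent from the target."""
        return [
            row
            for row in scoring_rows
            if (row["chr_name"], int(row["chr_position"])) not in target_positions
        ]


    # Toy scoring file content (tab separated); values are made up
    scoring_text = (
        "chr_name\tchr_position\teffect_allele\tother_allele\teffect_weight\n"
        "1\t1000\tA\tG\t0.1\n"
        "1\t2000\tC\tT\t0.2\n"
        "2\t3000\tG\tA\t-0.3\n"
    )
    scoring_rows = list(csv.DictReader(io.StringIO(scoring_text), delimiter="\t"))

    # Toy target variants as (chromosome, position) pairs
    target_positions = {("1", 1000), ("2", 3000)}

    missing = missing_variants(scoring_rows, target_positions)
    for row in missing:
        print(row["chr_name"], row["chr_position"])

This sketch compares positions only. It does not check alleles, and it assumes chromosome names are written the same way in both inputs (for example ``1`` rather than ``chr1``).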