docs review
nebfield committed Jul 30, 2024
1 parent f5e507d commit 3453918
Showing 10 changed files with 162 additions and 54 deletions.
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -22,7 +22,7 @@

project = 'Polygenic Score (PGS) Catalog Calculator'
copyright = 'Polygenic Score (PGS) Catalog team (licensed under Apache License V2)'
# author = 'Polygenic Score (PGS) Catalog team'
author = 'Polygenic Score (PGS) Catalog team'


# -- General configuration ---------------------------------------------------
4 changes: 3 additions & 1 deletion docs/explanation/match.rst
@@ -37,6 +37,8 @@ When you evaluate the predictive performance of a score with low match rates it

If you reduce ``--min_overlap`` then the calculator will output scores calculated with the remaining variants, **but these scores may not be representative of the original data submitted to the PGS Catalog.**
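
As an illustration, a run with a relaxed threshold might look like the sketch below. The samplesheet, score accession, and genome build are placeholders; only ``--min_overlap`` (a proportion between 0 and 1) is the point of interest here:

.. code-block:: console

    $ nextflow run pgscatalog/pgsc_calc \
        -profile docker \
        --input samplesheet.csv \
        --pgs_id PGS001229 \
        --target_build GRCh38 \
        --min_overlap 0.5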

.. _wgs:

Are your target genomes imputed? Are they WGS?
----------------------------------------------

@@ -49,7 +51,7 @@ In the future we plan to improve support for WGS.
Did you set the correct genome build?
-------------------------------------

The calculator will automatically grab scoring files in the correct genome build from the PGS Catalog. If match rates are low it may be because you have specified the wrong genome build. If you're using custom scoring files and the match rate is low it is possible that the `--liftover` command may have been omitted.
The calculator will automatically grab scoring files in the correct genome build from the PGS Catalog. If match rates are low it may be because you have specified the wrong genome build. If you're using custom scoring files and the match rate is low, it's possible that the ``--liftover`` parameter was omitted.
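
For custom scoring files in a different build, a minimal sketch of a run with liftover enabled might look like this; the scoring file name and build are placeholders, and the full set of required parameters is covered in the custom scoring file how-to:

.. code-block:: console

    $ nextflow run pgscatalog/pgsc_calc \
        -profile docker \
        --input samplesheet.csv \
        --scorefile my_custom_score.txt \
        --target_build GRCh38 \
        --liftover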

I'm still getting match rate errors. How do I figure out what's wrong?
----------------------------------------------------------------------
2 changes: 2 additions & 0 deletions docs/explanation/output.rst
@@ -23,6 +23,7 @@ Calculated scores are stored in a gzipped-text space-delimited text file called
separate row (``length = n_samples*n_pgs``), and there will be at least four columns with the following headers:

- ``sampleset``: the name of the input sampleset, or ``reference`` for the panel.
- ``FID``: the family identifier of each sample within the dataset (may be the same as IID).
- ``IID``: the identifier of each sample within the dataset.
- ``PGS``: the accession ID of the PGS being reported.
- ``SUM``: reports the weighted sum of *effect_allele* dosages multiplied by their *effect_weight*
@@ -56,6 +57,7 @@ describing the analysis of the target samples in relation to the reference panel
following headers:

- ``sampleset``: the name of the input sampleset, or ``reference`` for the panel.
- ``FID``: the family identifier of each sample within the dataset (may be the same as IID).
- ``IID``: the identifier of each sample within the dataset.
- ``[PC1 ... PCN]``: The projection of the sample within the PCA space defined by the reference panel. There will be as
many PC columns as there are PCs calculated (default: 10).
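
As a quick sanity check you can preview these gzipped, space-delimited files on the command line. In the sketch below the file path is a placeholder rather than the pipeline's documented output name:

.. code-block:: console

    # preview the first few rows of the aggregated score file (path is a placeholder)
    $ zcat results/score/aggregated_scores.txt.gz | head -n 5 | column -t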
162 changes: 127 additions & 35 deletions docs/how-to/bigjob.rst
@@ -74,43 +74,132 @@ limits.
.. warning:: You'll probably want to use ``-profile singularity`` on a HPC. The
pipeline requires Singularity v3.7 minimum.

However, in general you will have to adjust the ``executor`` options and job resource
allocations (e.g. ``process_low``). Here's an example for an LSF cluster:
Here's an example configuration running about 100 scores in parallel
on UK Biobank with a SLURM cluster:

.. code-block:: text
process {
queue = 'short'
clusterOptions = ''
scratch = true
errorStrategy = 'retry'
maxRetries = 3
maxErrors = '-1'
executor = 'slurm'
withName: 'DOWNLOAD_SCOREFILES' {
cpus = 1
memory = { 1.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withLabel:process_low {
cpus = 2
memory = 8.GB
time = 1.h
withName: 'COMBINE_SCOREFILES' {
cpus = 1
memory = { 8.GB * task.attempt }
time = { 2.hour * task.attempt }
}
withLabel:process_medium {
cpus = 8
memory = 64.GB
time = 4.h
withName: 'PLINK2_MAKEBED' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}
}
executor {
name = 'lsf'
jobName = { "$task.hash" }
}
withName: 'RELABEL_IDS' {
cpus = 1
memory = { 16.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withName: 'PLINK2_ORIENT' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withName: 'DUMPSOFTWAREVERSIONS' {
cpus = 1
memory = { 1.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withName: 'ANCESTRY_ANALYSIS' {
cpus = { 1 * task.attempt }
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withName: 'SCORE_REPORT' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}
In SLURM, queue is equivalent to a partition. Specific cluster parameters can be
provided by modifying ``clusterOptions``. You should change ``cpus``,
``memory``, and ``time`` to match the amount of resources used. Assuming the
configuration file you set up is saved as ``my_custom.config`` in your current
working directory, you're ready to run pgsc_calc. Instead of running nextflow
directly on the shell, save a bash script (``run_pgscalc.sh``) to a file
instead:
withName: 'EXTRACT_DATABASE' {
cpus = 1
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withName: 'PLINK2_RELABELPVAR' {
cpus = 2
memory = { 16.GB * task.attempt }
time = { 2.hour * task.attempt }
}
withName: 'INTERSECT_VARIANTS' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withName: 'MATCH_VARIANTS' {
cpus = 2
memory = { 32.GB * task.attempt }
time = { 6.hour * task.attempt }
}
withName: 'FILTER_VARIANTS' {
cpus = 2
memory = { 16.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withName: 'MATCH_COMBINE' {
cpus = 4
memory = { 64.GB * task.attempt }
time = { 6.hour * task.attempt }
}
withName: 'FRAPOSA_PCA' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withName: 'PLINK2_SCORE' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 12.hour * task.attempt }
}
withName: 'SCORE_AGGREGATE' {
cpus = 2
memory = { 16.GB * task.attempt }
time = { 4.hour * task.attempt }
}
}
Assuming the configuration file you set up is saved as
``my_custom.config`` in your current working directory, you're ready
to run pgsc_calc. Instead of running nextflow directly on the shell,
save a bash script (``run_pgscalc.sh``) to a file instead:

.. code-block:: bash
#SBATCH -J ukbiobank_pgs
#SBATCH -c 1
#SBATCH -t 24:00:00
#SBATCH --mem=2G
export NXF_ANSI_LOG=false
export NXF_OPTS="-Xms500M -Xmx2G"
@@ -126,20 +215,23 @@
.. note:: The name of the nextflow and singularity modules will be different in
your local environment

.. warning:: Make sure to copy input data to fast storage, and run the pipeline
on the same fast storage area. You might include these steps in your
bash script. Ask your sysadmin for help if you're not sure what this
means.
.. warning:: Make sure to copy input data to fast storage, and run the
pipeline on the same fast storage area. You might include
these steps in your bash script. Ask your sysadmin for
help if you're not sure what this means.

.. code-block:: console
$ bsub -M 2GB -q short -o output.txt < run_pgscalc.sh
$ sbatch run_pgscalc.sh
This will submit a nextflow driver job, which will submit additional jobs for
each process in the workflow. The nextflow driver requires up to 4GB of RAM
(bsub's ``-M`` parameter) and 2 CPUs to use (see a guide for `HPC users`_ here).
each process in the workflow. The nextflow driver requires up to 4GB of RAM and 2 CPUs to use (see a guide for `HPC users`_ here).
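
For reference, the nextflow command inside such a bash script typically passes the custom configuration with nextflow's ``-c`` option. The sketch below is illustrative and the input parameters are placeholders:

.. code-block:: bash

    nextflow run pgscatalog/pgsc_calc \
        -profile singularity \
        -c my_custom.config \
        --input samplesheet.csv \
        --pgs_id PGS001229 \
        --target_build GRCh38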

.. _`LSF and PBS`: https://nextflow.io/docs/latest/executor.html#slurm
.. _`HPC users`: https://www.nextflow.io/blog/2021/5_tips_for_hpc_users.html
.. _`a nextflow profile`: https://github.com/nf-core/configs


Cloud deployments
-----------------

We've deployed the calculator to Google Cloud Batch but some :doc:`special configuration is required<cloud>`.
25 changes: 14 additions & 11 deletions docs/how-to/cache.rst
@@ -1,23 +1,26 @@
.. _cache:

How do I speed up `pgsc_calc` computation times and avoid re-running code?
==========================================================================
How do I speed up computation times and avoid re-running code?
==============================================================

If you intend to run `pgsc_calc` multiple times on the same target samples (e.g.
If you intend to run ``pgsc_calc`` multiple times on the same target samples (e.g.
on different sets of PGS, with different variant matching flags) it is worth caching
information on invariant steps of the pipeline:

- Genotype harmonization (variant relabeling steps)
- Steps of `--run_ancestry` that: match variants between the target and reference panel and
- Steps of ``--run_ancestry`` that match variants between the target and reference panel and
generate PCA loadings that can be used to adjust the PGS for ancestry.

To do this you must specify a directory that can store these information across runs using the
`--genotypes_cache` flag to the nextflow command (also see :ref:`param ref`). Future runs of the
pipeline that use the same cache directory should then skip these steps and proceed to run only the
steps needed to calculate new PGS. This is slightly different than using the `-resume command in
nextflow <https://www.nextflow.io/blog/2019/demystifying-nextflow-resume.html>`_ which mainly checks the
`work` directory and is more often used for restarting the pipeline when a specific step has failed
(e.g. for exceeding memory limits).
To do this you must specify a directory that can store this
information across runs using the ``--genotypes_cache`` flag to the
nextflow command (also see :ref:`param ref`). Future runs of the
pipeline that use the same cache directory should then skip these
steps and proceed to run only the steps needed to calculate new PGS.
This is slightly different than using the `-resume command in nextflow
<https://www.nextflow.io/blog/2019/demystifying-nextflow-resume.html>`_
which mainly checks the ``work`` directory and is more often used for
restarting the pipeline when a specific step has failed (e.g. for
exceeding memory limits).
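
A minimal sketch of two runs sharing a cache directory is shown below; the samplesheet, genome build, and score accessions are placeholders:

.. code-block:: console

    # first run: harmonised genotypes and ancestry data are written to the cache
    $ nextflow run pgscatalog/pgsc_calc -profile docker \
        --input samplesheet.csv --target_build GRCh38 \
        --pgs_id PGS001229 --genotypes_cache /path/to/cache

    # later run on the same sampleset: the cached steps are skipped
    $ nextflow run pgscatalog/pgsc_calc -profile docker \
        --input samplesheet.csv --target_build GRCh38 \
        --pgs_id PGS000018 --genotypes_cache /path/to/cache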

.. warning:: Always use a new cache directory for different samplesets, as redundant names may clash across runs.

2 changes: 1 addition & 1 deletion docs/how-to/calculate_custom.rst
@@ -26,7 +26,7 @@ minimal header in the following format:
Header::

#pgs_name=metaGRS_CAD
#pgs_name=metaGRS_CAD
#pgs_id=metaGRS_CAD
#trait_reported=Coronary artery disease
#genome_build=GRCh37
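
As an illustration only, a complete minimal custom scoring file could look like the sketch below. The data rows are invented and the column names shown are assumptions, so check the rest of this guide for the authoritative column requirements:

.. code-block:: text

    #pgs_name=metaGRS_CAD
    #pgs_id=metaGRS_CAD
    #trait_reported=Coronary artery disease
    #genome_build=GRCh37
    chr_name  chr_position  effect_allele  other_allele  effect_weight
    1         49298         T              C              0.101
    2         202881162     C              T              0.074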

8 changes: 6 additions & 2 deletions docs/how-to/offline.rst
@@ -127,8 +127,12 @@ panel too. See :ref:`norm`.
Download scoring files
----------------------

It's best to manually download scoring files from the PGS Catalog in the correct
genome build. Using PGS001229 as an example:
.. tip:: Use our CLI application ``pgscatalog-download`` to `download multiple scoring`_ files in parallel and in the correct genome build.

.. _download multiple scoring: https://pygscatalog.readthedocs.io/en/latest/how-to/guides/download.html
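
A hedged sketch of a typical invocation is shown below; the flag names and output directory are assumptions, so check the guide linked above for exact usage:

.. code-block:: console

    # flag names are assumptions; see the linked guide for authoritative usage
    $ pgscatalog-download -i PGS001229 -b GRCh38 -o scorefiles/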

You'll need to preload scoring files in the correct genome build.
Using PGS001229 as an example:

https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS001229/ScoringFiles/

2 changes: 2 additions & 0 deletions docs/how-to/prepare.rst
@@ -52,6 +52,8 @@ VCF from WGS
See https://github.com/PGScatalog/pgsc_calc/discussions/123 for discussion about tools
to convert the VCF files into ones suitable for calculating PGS.

If you input WGS data to the calculator without following the steps above, you will probably encounter match rate errors. For more information, see :ref:`wgs`.


``plink`` binary fileset (bfile)
--------------------------------
2 changes: 1 addition & 1 deletion docs/how-to/samplesheet.rst
@@ -27,7 +27,7 @@ download here <../../assets/examples/samplesheet.csv>`.

There are four mandatory columns:

- **sampleset**: A text string (no spaces, or reserved characters [ '.' or '_' ]) referring
- **sampleset**: A text string (no spaces, or reserved characters [ ``.`` or ``_`` ]) referring
to the name of a :term:`target dataset` of genotyping data containing at least one
sample/individual (however cohort datasets will often contain many individuals with
combined genotyped/imputed data). Data from a sampleset may be input as a single file,
7 changes: 5 additions & 2 deletions docs/index.rst
@@ -54,7 +54,7 @@ The workflow relies on open source scientific software, including:
A full description of included software is described in :ref:`containers`.

.. _PLINK 2: https://www.cog-genomics.org/plink/2.0/
.. _PGS Catalog Utilities: https://github.com/PGScatalog/pgscatalog_utils
.. _PGS Catalog Utilities: https://github.com/PGScatalog/pygscatalog
.. _FRAPOSA: https://github.com/PGScatalog/fraposa_pgsc


@@ -120,7 +120,7 @@ Documentation
Changelog
---------

The :doc:`Changelog page<changelog>` describes fixes and enhancements for each version.
The `Changelog page`_ describes fixes and enhancements for each version.

.. _`Changelog page`: https://github.com/PGScatalog/pgsc_calc/releases


Features under development
--------------------------
