docs review
nebfield committed Jul 30, 2024
1 parent f5e507d commit 3453918
Showing 10 changed files with 162 additions and 54 deletions.
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -22,7 +22,7 @@

project = 'Polygenic Score (PGS) Catalog Calculator'
copyright = 'Polygenic Score (PGS) Catalog team (licensed under Apache License V2)'
# author = 'Polygenic Score (PGS) Catalog team'
author = 'Polygenic Score (PGS) Catalog team'


# -- General configuration ---------------------------------------------------
4 changes: 3 additions & 1 deletion docs/explanation/match.rst
@@ -37,6 +37,8 @@ When you evaluate the predictive performance of a score with low match rates it

If you reduce ``--min_overlap`` then the calculator will output scores calculated with the remaining variants, **but these scores may not be representative of the original data submitted to the PGS Catalog.**
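
As an illustration, a run with a relaxed threshold might look like the sketch below. The samplesheet, score accession, and genome build are placeholders; only ``--min_overlap`` (a proportion between 0 and 1) is the point of interest here:

.. code-block:: console

    $ nextflow run pgscatalog/pgsc_calc \
        -profile docker \
        --input samplesheet.csv \
        --pgs_id PGS001229 \
        --target_build GRCh38 \
        --min_overlap 0.5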

.. _wgs:

Are your target genomes imputed? Are they WGS?
----------------------------------------------

@@ -49,7 +51,7 @@ In the future we plan to improve support for WGS.
Did you set the correct genome build?
-------------------------------------

The calculator will automatically grab scoring files in the correct genome build from the PGS Catalog. If match rates are low it may be because you have specified the wrong genome build. If you're using custom scoring files and the match rate is low it is possible that the `--liftover` command may have been omitted.
The calculator will automatically grab scoring files in the correct genome build from the PGS Catalog. If match rates are low it may be because you have specified the wrong genome build. If you're using custom scoring files and the match rate is low, it's possible that the ``--liftover`` parameter was omitted.
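
For custom scoring files in a different build, a minimal sketch of a run with liftover enabled might look like this; the scoring file name and build are placeholders, and the full set of required parameters is covered in the custom scoring file how-to:

.. code-block:: console

    $ nextflow run pgscatalog/pgsc_calc \
        -profile docker \
        --input samplesheet.csv \
        --scorefile my_custom_score.txt \
        --target_build GRCh38 \
        --liftover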

I'm still getting match rate errors. How do I figure out what's wrong?
----------------------------------------------------------------------
2 changes: 2 additions & 0 deletions docs/explanation/output.rst
@@ -23,6 +23,7 @@ Calculated scores are stored in a gzipped-text space-delimited text file called
separate row (``length = n_samples*n_pgs``), and there will be at least four columns with the following headers:

- ``sampleset``: the name of the input sampleset, or ``reference`` for the panel.
- ``FID``: the family identifier of each sample within the dataset (may be the same as IID).
- ``IID``: the identifier of each sample within the dataset.
- ``PGS``: the accession ID of the PGS being reported.
- ``SUM``: reports the weighted sum of *effect_allele* dosages multiplied by their *effect_weight*
@@ -56,6 +57,7 @@ describing the analysis of the target samples in relation to the reference panel
following headers:

- ``sampleset``: the name of the input sampleset, or ``reference`` for the panel.
- ``FID``: the family identifier of each sample within the dataset (may be the same as IID).
- ``IID``: the identifier of each sample within the dataset.
- ``[PC1 ... PCN]``: The projection of the sample within the PCA space defined by the reference panel. There will be as
many PC columns as there are PCs calculated (default: 10).
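
As a quick sanity check you can preview these gzipped, space-delimited files on the command line. In the sketch below the file path is a placeholder rather than the pipeline's documented output name:

.. code-block:: console

    # preview the first few rows of the aggregated score file (path is a placeholder)
    $ zcat results/score/aggregated_scores.txt.gz | head -n 5 | column -t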
162 changes: 127 additions & 35 deletions docs/how-to/bigjob.rst
@@ -74,43 +74,132 @@ limits.
.. warning:: You'll probably want to use ``-profile singularity`` on a HPC. The
pipeline requires Singularity v3.7 minimum.

However, in general you will have to adjust the ``executor`` options and job resource
allocations (e.g. ``process_low``). Here's an example for an LSF cluster:
Here's an example configuration running about 100 scores in parallel
on UK Biobank with a SLURM cluster:

.. code-block:: text
process {
queue = 'short'
clusterOptions = ''
scratch = true
errorStrategy = 'retry'
maxRetries = 3
maxErrors = '-1'
executor = 'slurm'
withName: 'DOWNLOAD_SCOREFILES' {
cpus = 1
memory = { 1.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withLabel:process_low {
cpus = 2
memory = 8.GB
time = 1.h
withName: 'COMBINE_SCOREFILES' {
cpus = 1
memory = { 8.GB * task.attempt }
time = { 2.hour * task.attempt }
}
withLabel:process_medium {
cpus = 8
memory = 64.GB
time = 4.h
withName: 'PLINK2_MAKEBED' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}
}
executor {
name = 'lsf'
jobName = { "$task.hash" }
}
withName: 'RELABEL_IDS' {
cpus = 1
memory = { 16.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withName: 'PLINK2_ORIENT' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withName: 'DUMPSOFTWAREVERSIONS' {
cpus = 1
memory = { 1.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withName: 'ANCESTRY_ANALYSIS' {
cpus = { 1 * task.attempt }
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withName: 'SCORE_REPORT' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}
In SLURM, queue is equivalent to a partition. Specific cluster parameters can be
provided by modifying ``clusterOptions``. You should change ``cpus``,
``memory``, and ``time`` to match the amount of resources used. Assuming the
configuration file you set up is saved as ``my_custom.config`` in your current
working directory, you're ready to run pgsc_calc. Instead of running nextflow
directly on the shell, save a bash script (``run_pgscalc.sh``) to a file
instead:
withName: 'EXTRACT_DATABASE' {
cpus = 1
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withName: 'PLINK2_RELABELPVAR' {
cpus = 2
memory = { 16.GB * task.attempt }
time = { 2.hour * task.attempt }
}
withName: 'INTERSECT_VARIANTS' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withName: 'MATCH_VARIANTS' {
cpus = 2
memory = { 32.GB * task.attempt }
time = { 6.hour * task.attempt }
}
withName: 'FILTER_VARIANTS' {
cpus = 2
memory = { 16.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withName: 'MATCH_COMBINE' {
cpus = 4
memory = { 64.GB * task.attempt }
time = { 6.hour * task.attempt }
}
withName: 'FRAPOSA_PCA' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}
withName: 'PLINK2_SCORE' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 12.hour * task.attempt }
}
withName: 'SCORE_AGGREGATE' {
cpus = 2
memory = { 16.GB * task.attempt }
time = { 4.hour * task.attempt }
}
}
Assuming the configuration file you set up is saved as
``my_custom.config`` in your current working directory, you're ready
to run pgsc_calc. Instead of running nextflow directly on the shell,
save a bash script (``run_pgscalc.sh``) to a file instead:

.. code-block:: bash
#SBATCH -J ukbiobank_pgs
#SBATCH -c 1
#SBATCH -t 24:00:00
#SBATCH --mem=2G
export NXF_ANSI_LOG=false
export NXF_OPTS="-Xms500M -Xmx2G"
@@ -126,20 +215,23 @@
.. note:: The name of the nextflow and singularity modules will be different in
your local environment

.. warning:: Make sure to copy input data to fast storage, and run the pipeline
on the same fast storage area. You might include these steps in your
bash script. Ask your sysadmin for help if you're not sure what this
means.
.. warning:: Make sure to copy input data to fast storage, and run the
pipeline on the same fast storage area. You might include
these steps in your bash script. Ask your sysadmin for
help if you're not sure what this means.

.. code-block:: console
$ bsub -M 2GB -q short -o output.txt < run_pgscalc.sh
$ sbatch run_pgscalc.sh
This will submit a nextflow driver job, which will submit additional jobs for
each process in the workflow. The nextflow driver requires up to 4GB of RAM
(bsub's ``-M`` parameter) and 2 CPUs to use (see a guide for `HPC users`_ here).
each process in the workflow. The nextflow driver requires up to 4GB of RAM and 2 CPUs to use (see a guide for `HPC users`_ here).
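
For reference, the nextflow command inside such a bash script typically passes the custom configuration with nextflow's ``-c`` option. The sketch below is illustrative and the input parameters are placeholders:

.. code-block:: bash

    nextflow run pgscatalog/pgsc_calc \
        -profile singularity \
        -c my_custom.config \
        --input samplesheet.csv \
        --pgs_id PGS001229 \
        --target_build GRCh38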

.. _`LSF and PBS`: https://nextflow.io/docs/latest/executor.html#slurm
.. _`HPC users`: https://www.nextflow.io/blog/2021/5_tips_for_hpc_users.html
.. _`a nextflow profile`: https://github.com/nf-core/configs


Cloud deployments
-----------------

We've deployed the calculator to Google Cloud Batch but some :doc:`special configuration is required<cloud>`.
25 changes: 14 additions & 11 deletions docs/how-to/cache.rst
@@ -1,23 +1,26 @@
.. _cache:

How do I speed up `pgsc_calc` computation times and avoid re-running code?
==========================================================================
How do I speed up computation times and avoid re-running code?
==============================================================

If you intend to run `pgsc_calc` multiple times on the same target samples (e.g.
If you intend to run ``pgsc_calc`` multiple times on the same target samples (e.g.
on different sets of PGS, with different variant matching flags) it is worth caching
information on invariant steps of the pipeline:

- Genotype harmonization (variant relabeling steps)
- Steps of `--run_ancestry` that: match variants between the target and reference panel and
- Steps of ``--run_ancestry`` that match variants between the target and reference panel and
generate PCA loadings that can be used to adjust the PGS for ancestry.

To do this you must specify a directory that can store these information across runs using the
`--genotypes_cache` flag to the nextflow command (also see :ref:`param ref`). Future runs of the
pipeline that use the same cache directory should then skip these steps and proceed to run only the
steps needed to calculate new PGS. This is slightly different than using the `-resume command in
nextflow <https://www.nextflow.io/blog/2019/demystifying-nextflow-resume.html>`_ which mainly checks the
`work` directory and is more often used for restarting the pipeline when a specific step has failed
(e.g. for exceeding memory limits).
To do this you must specify a directory that can store this
information across runs using the ``--genotypes_cache`` flag to the
nextflow command (also see :ref:`param ref`). Future runs of the
pipeline that use the same cache directory should then skip these
steps and proceed to run only the steps needed to calculate new PGS.
This is slightly different than using the `-resume command in nextflow
<https://www.nextflow.io/blog/2019/demystifying-nextflow-resume.html>`_
which mainly checks the ``work`` directory and is more often used for
restarting the pipeline when a specific step has failed (e.g. for
exceeding memory limits).
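
A minimal sketch of two runs sharing a cache directory is shown below; the samplesheet, genome build, and score accessions are placeholders:

.. code-block:: console

    # first run: harmonised genotypes and ancestry data are written to the cache
    $ nextflow run pgscatalog/pgsc_calc -profile docker \
        --input samplesheet.csv --target_build GRCh38 \
        --pgs_id PGS001229 --genotypes_cache /path/to/cache

    # later run on the same sampleset: the cached steps are skipped
    $ nextflow run pgscatalog/pgsc_calc -profile docker \
        --input samplesheet.csv --target_build GRCh38 \
        --pgs_id PGS000018 --genotypes_cache /path/to/cache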

.. warning:: Always use a new cache directory for different samplesets, as redundant names may clash across runs.

2 changes: 1 addition & 1 deletion docs/how-to/calculate_custom.rst
@@ -26,7 +26,7 @@ minimal header in the following format:
Header::

#pgs_name=metaGRS_CAD
#pgs_name=metaGRS_CAD
#pgs_id=metaGRS_CAD
#trait_reported=Coronary artery disease
#genome_build=GRCh37
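
As an illustration only, a complete minimal custom scoring file could look like the sketch below. The data rows are invented and the column names shown are assumptions, so check the rest of this guide for the authoritative column requirements:

.. code-block:: text

    #pgs_name=metaGRS_CAD
    #pgs_id=metaGRS_CAD
    #trait_reported=Coronary artery disease
    #genome_build=GRCh37
    chr_name  chr_position  effect_allele  other_allele  effect_weight
    1         49298         T              C              0.101
    2         202881162     C              T              0.074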

8 changes: 6 additions & 2 deletions docs/how-to/offline.rst
@@ -127,8 +127,12 @@ panel too. See :ref:`norm`.
Download scoring files
----------------------

It's best to manually download scoring files from the PGS Catalog in the correct
genome build. Using PGS001229 as an example:
.. tip:: Use our CLI application ``pgscatalog-download`` to `download multiple scoring`_ files in parallel and in the correct genome build.

.. _download multiple scoring: https://pygscatalog.readthedocs.io/en/latest/how-to/guides/download.html
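
A hedged sketch of a typical invocation is shown below; the flag names and output directory are assumptions, so check the guide linked above for exact usage:

.. code-block:: console

    # flag names are assumptions; see the linked guide for authoritative usage
    $ pgscatalog-download -i PGS001229 -b GRCh38 -o scorefiles/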

You'll need to preload scoring files in the correct genome build.
Using PGS001229 as an example:

https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS001229/ScoringFiles/

2 changes: 2 additions & 0 deletions docs/how-to/prepare.rst
@@ -52,6 +52,8 @@ VCF from WGS
See https://github.com/PGScatalog/pgsc_calc/discussions/123 for discussion about tools
to convert the VCF files into ones suitable for calculating PGS.

If you input WGS data to the calculator without following the steps above, you will probably encounter match rate errors. For more information, see :ref:`wgs`.


``plink`` binary fileset (bfile)
--------------------------------
2 changes: 1 addition & 1 deletion docs/how-to/samplesheet.rst
@@ -27,7 +27,7 @@ download here <../../assets/examples/samplesheet.csv>`.

There are four mandatory columns:

- **sampleset**: A text string (no spaces, or reserved characters [ '.' or '_' ]) referring
- **sampleset**: A text string (no spaces, or reserved characters [ ``.`` or ``_`` ]) referring
to the name of a :term:`target dataset` of genotyping data containing at least one
sample/individual (however cohort datasets will often contain many individuals with
combined genotyped/imputed data). Data from a sampleset may be input as a single file,
7 changes: 5 additions & 2 deletions docs/index.rst
@@ -54,7 +54,7 @@ The workflow relies on open source scientific software, including:
A full description of included software is described in :ref:`containers`.

.. _PLINK 2: https://www.cog-genomics.org/plink/2.0/
.. _PGS Catalog Utilities: https://github.com/PGScatalog/pgscatalog_utils
.. _PGS Catalog Utilities: https://github.com/PGScatalog/pygscatalog
.. _FRAPOSA: https://github.com/PGScatalog/fraposa_pgsc


@@ -120,7 +120,7 @@ Documentation
Changelog
---------

The :doc:`Changelog page<changelog>` describes fixes and enhancements for each version.
The `Changelog page`_ describes fixes and enhancements for each version.

.. _`Changelog page`: https://github.com/PGScatalog/pgsc_calc/releases


Features under development
--------------------------
