BioTEA analyze options

Luca Visentin edited this page May 31, 2022 · 3 revisions

The configuration file created by biotea initialize follows the YAML format.

The options are divided into sections. Each section controls different aspects of the analysis. The possible types of the various options are annotated in parentheses. The available options are:

  • general: The general section contains general options to control plot size and type, whether to include annotations, and whether to print extra data snippets during the analysis:
    • show_data_snippets (bool): If true, prints extra snippets of data during the analysis. This is useful to see if the analysis is doing anything wrong. The default is true.
    • annotation_database (bool or str): This value controls how the data is annotated. Annotating data allows for plots with gene names instead of probe IDs, and a more useful output. If true, the data is annotated with an internal annotation file covering most Agilent and Affymetrix Human chips. If false, the data is not annotated. Otherwise, a string can be passed, representing the complete name of any annotation package on Bioconductor, which will be downloaded and installed on the fly, and used to source annotations. Note: the package must contain annotations for SYMBOLs, ENSEMBL IDs, and GENENAMEs, or the annotation will fail. This might be addressed in a future version of bioTEA.
    • plots: Various options to control plots:
      • save_png (bool): If true, saves plots in .png format. Otherwise, saves in .pdf format.
      • plot_width (int): The width of the plots, in inches.
      • plot_height (int): The height of the plots, in inches.
      • png_resolution (int): The pixels per inch of the png plots.
      • enumerate_plots (bool): If true, each plot is marked with a number, in the order it is created.
  • switches: The switches section contains various parameters to turn parts of the analysis on or off:
    • dryrun (bool): Run the analysis, but do not save any output (with the exception of the log file). This can be useful to test out the analysis before committing to it, especially when used in combination with the slowmode and show_data_snippets options.
    • renormalize (bool): Run quantile-quantile normalization on the data. This can be useful to normalize "unruly" samples, for which the normalization steps in prepaffy and prepagil are not enough. Additional plots are saved to appreciate this extra normalization step.
    • limma (bool): Run DEA with limma.
    • rankproduct (bool): Run DEA with RankProduct.
    • convert_counts (bool): If the input data is count data (e.g. RNAseq data), set this to true to use voom to transform the data to continuous values before the analysis. Count values by themselves are unsupported. Defaults to false.
  • design: The design section contains crucial options to set the experimental design that will be used to steer the analysis.
    • experimental_design (str): The experimental design of the experiment. It must be a comma-delimited set of values, of the same length as the number of input samples, with each value being the label for the experimental variable of interest. Numbers specified at the end of each group can be used to represent sample pairings. For a detailed guide on how to define this parameter, see the "Design Strings" wiki page.
    • contrasts (list of str): A list of strings of the type "group1-group2", where each "group" is a level in the experimental design. Each value in the list specifies a (different) contrast of interest. The second group in each contrast is considered the "background" or "control" status (i.e. "group2"). For a detailed guide on how to define this parameter, see the "Design Strings" wiki page.
    • batches (null OR str): If null, no batch effect will be corrected, assuming all samples derive from the same batch. If str, it is treated similarly to the experimental_design string, and each level refers to a different batch. Note that RankProd cannot correct batch effects if, inside each batch, there are fewer than two samples for each experimental condition.
    • extra_limma_vars (null OR nested str): If null, nothing happens. If nested str, each string in the nested list is treated similarly to the experimental_design string, adding an additional variable to the limma analysis. This can be useful to control for additional confounding variables in the experiment. For an example, see the default configuration file.
    • group_colors (list of str): A list of strings that can be understood by R to be a colour. Each colour will be paired with a different condition type in the experimental design, so at least that many colours must be specified. A list of possible colours can be found here.
    • filters: The parameters used by the analysis to filter the data:
      • log2_expression (float): Filter out any genes that have lower Log2 expression than this value. This value depends a lot on the experiment, and is generally higher for Agilent arrays. The default of "4" is good for Affymetrix arrays.
      • fold_change (float): Filter out (mark as non-differentially-expressed) any genes that have lower absolute Fold Change than this value. This removes from the analysis all genes that, even if detected as DEGs, change too little in expression for additional validation with other methods (such as PCR) to be possible.
      • min_groupwise_presence (float): A value between 0 and 1, representing the proportion of samples in a single group of interest in which the fold_change filter threshold must be violated for a gene to be filtered out. If a gene passes the filter in at least one group, it is retained in the analysis. This allows for a more conservative filtering of the genes.
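
To make the options above easier to picture, here is a minimal sketch of what a complete options file might look like. The key names and their nesting are taken from the list above (including filters sitting under design, which is an assumption based on the indentation of that list). Every value is an illustrative placeholder for a hypothetical six-sample experiment with two groups, not a recommended setting or the actual default; the file generated by biotea initialize remains the authoritative template.

```yaml
# Illustrative sketch only: values are made-up placeholders, and the nesting
# is assumed from the option list above. Compare with the file generated by
# `biotea initialize` before using any of this.
general:
  show_data_snippets: true
  # true = internal annotations, false = none,
  # or a string naming a Bioconductor annotation package
  annotation_database: true
  plots:
    save_png: false        # false saves plots as .pdf
    plot_width: 16         # inches
    plot_height: 9         # inches
    png_resolution: 300    # pixels per inch, used for .png plots
    enumerate_plots: true

switches:
  dryrun: false
  renormalize: false
  limma: true
  rankproduct: false
  convert_counts: false    # set to true only for count data (e.g. RNAseq)

design:
  # One label per input sample, in sample order (hypothetical labels)
  experimental_design: "control,control,control,treated,treated,treated"
  # "group1-group2": the second group is the control
  contrasts:
    - "treated-control"
  batches: null            # no batch correction
  extra_limma_vars: null   # no extra variables in the limma model
  # At least one R colour per condition in the design
  group_colors:
    - "cornflowerblue"
    - "firebrick3"
  filters:
    log2_expression: 4
    fold_change: 0.5
    min_groupwise_presence: 0.5
```

In this sketch the contrast "treated-control" compares the treated samples against the control samples, and two colours are listed because the design has two conditions. Pairings, batches, and extra limma variables are left out for brevity; see the "Design Strings" wiki page for how to express them.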