-
Notifications
You must be signed in to change notification settings - Fork 1
Home
NOTE: This page is a static reproduction of the tutorial vignette bundled with the package, as a backup way to access the documentation when the normal means fail. We recommended you view the vignette instead of this wiki. Be advised that, unlike the actual vignette, the wiki cannot display the output of the commands, so you will have to run the examples yourself in order for the tutorial to fully make sense.
This is an R package that builds on the functionality of sleuth. It is
aimed at people analysing gene expression and transcript abundance, particularly with RNA-Seq type
of data in mind. This R Markdown vignette constitutes the main documentation for the rats
package.
rats
aims to identify situations where genes with multiple transcript isoforms show differences in which of
these isoforms are in use (Differential Transcript Usage -- DTU) between two conditions. This is supplementary
to identifying which transcripts show Differential Expression (DTE). It is possible to have DTE without DTU,
meaning that the expression of all the isoforms changes similarly and is equivalent to plain differential gene
expression (DGE). It is also possible to have DTU without DGE, when the relative abundance of isoforms changes but
the overall output of the gene remains the same. However, it is not possible to have DTU without DTE in at least
one of the transcript isoforms.
To find which genes show change in relative abundance of their transcripts, the G test is used. This method is
sensitive to small changes, thanks to looking at all the transcripts of the gene simultaneously. As a result,
it cannot attribute the change to specific transcripts, but it may be possible to identify manually by inspection
of the relevant transcript expression levels.
As a complementary method, rats
also uses the test of proportion to look for change in each transcript
individually. This test is less sensitive, but can highlight which transcripts change in relative abundance. The
two tests are carried out independently of each other, meaning that a positive result in one is not a pre-requisite
for running the other and, indeed, the two tests lead to conflicting verdicts for some genes.
The input format recognised by rats
is the output of sleuth. See the
introduction to sleuth pages for details on how
to load the transcript abundance estimate data from kallisto into a sleuth
object, and see the wasabi tool for how to load the transcript abundance
estimate data from Sailfish or Salmon
into a sleuth object.
NOTE: For the purposes of the tutorial, we will be using the bundled data simulators, so you don't need to worry about all these other tools at this stage.
First, load the package (assuming it has been installed, as instructed in the README). Loading data.table
is optional but recommended, as it facilitates exploration of the results by providing more powerful syntax.
The package should be available on your system already, since it is a crucial dependency for rats
.
library(rats)
library(data.table)
In a real usage case, the input data would be the output of sleuth. For details on what that looks like and how to use sleuth you should refer to its proper documentation.
Instead, for simplicity, we'll just generate some data using functionality provided by rats
. This functionality
was created with code-testing in mind, however the simulated data is convenient for demonstrating how the package
works. The data is completely artificial and chosen so as to test/demonstrate specific scenarios.
simdat <- sim_sleuth_data(cnames = c("foo", "bar")) # foo and bar are the names to use for the conditions.
# For convenience let's assign the contents of the list to separate variables.
slo <- simdat$slo # Simulated minimal sleuth object.
annot <- simdat$annot # Annotation for the above data.
The simulation returns a list of 3 objects, only the first two of which are relevant here:
-
slo
emulates an R object you would obtain from calling DTE using sleuth. The real sleuth objects are quite large and complex nested lists. The simulated one contains only the bare minimum parts that are relevant to calling DTU withrats
. It has a couple of bootstraps of transcript abundance estimates for a handful of made-up genes, in two conditions.
# Let's have a look at the simulated data.
summary(slo)
# Names of the samples and conditions.
print( slo$sample_to_covariates )
# Estimated fragment counts in bootstrap 1 of sample 1.
print( slo$kal[[1]]$bootstrap[[1]] )
The values under condition
are the ones we specified when we called the simulator. They are internally assigned
to specific samples. There's also an additional variable, batch
, which is added by the simulator and can be ignored for now.
Your data may have more/fewer/different variables in the sample_to_covariates
table. rats
uses this table as a guide to group samples together for comparison.
kal
is a list of samples (in this case there are 4). Each sample is a list that normally contains various
tables and information. rats
only uses the bootstrapped counts, so the simulated data only contains
the element bootstrap
. This in turn is a list of tables, one per bootstrap iteration (in this case just
2 or 3 iterations). Each table lists the estimated counts for each transcript id. The real data would also
list TPMs.
-
annot
is an auxiliarydata.frame
required byrats
. It simply matches transcript identification codes to gene identification codes. The data simulator constructs such an annotation corresponding to the dataset it generates. When working with your own data, you will also need to provide an appropriate such dataframe.
# And this is what the annotation looks like.
print(annot)
With our simulated data at hand, we can now call DTU using rats:
mydtu <- call_DTU(slo, annot, "foo", "bar")
call_DTU()
takes 4 mandatory arguments: a sleuth object, an annotation dataframe matching transcripts to
genes, and the names of two conditions to compare. By default, the two names are expected to be values found
in the "condition" column of the slo$sample_to_conditions
table. You can override this if the column is
named differently in your data, or if you want to compare by a different variable. More on this in the
example that deals with the input options.
By default, rats
will print its progress through the task and give you a summary at the end, but you can
also choose to suppress these reports:
mydtu <- call_DTU(slo, annot, "foo", "bar", verbose = FALSE)
For your convenience, rats
provides a couple of functions to give you a quick summary of your results.
However, we do recommend you become familiar with the actual results structure and content.
# A really simple tally of the outcome.
print( dtu_summary(mydtu) )
# Gene and transcript IDs corresponding to the tally above.
ids <- get_dtu_ids(mydtu)
print( ids[["dtu-genes"]] ) # Also "dtu-transc", "ndtu-genes", "ndtu-transc", "na-genes", "na-transc"
The results structure is a list with 3 elements:
print( summary(mydtu) )
-
Parameters
is a list that contains information about the data and the settings.
# See the elements.
print( names(mydtu$Parameters) )
# See the value of one of the elements (i.e. the P-value threshold).
print( mydtu$Parameters[["p_thresh"]] )
-
Genes
is a data.table that contains gene-level test information.
# See the fields in the table.
print( names(mydtu$Genes) )
# See details of genes that show DTU.
print( mydtu$Genes[(DTU), ] )
# Or if you only want their codes:
print( mydtu$Genes[(DTU), parent_id] )
# NA are valid values for genes that didn't meet some pre-requisites:
# (more on this in the next two examples)
print( mydtu$Genes[, .(parent_id, elig, elig_fx, DTU)] )
-
Transcripts
is a data.table that contains transcript-level test information.
# See the fields in the table.
print( names(mydtu$Transcripts) )
# See the details of Transcripts that show relative change.
print( mydtu$Transcripts[(DTU), ] )
# Or if you only want their codes, without details:
print( mydtu$Transcripts[(DTU), .(target_id, parent_id)] )
# Not all transcripts meet the prerequisites:
# (more on this in the next two examples)
print( mydtu$Transcripts[, .(target_id, elig_xp, elig, elig_fx)] )
Both tables are by default keyed to the gene codes:
print( mydtu$Genes["MIX6", ] )
print( mydtu$Transcripts["MIX6", ] )
# To search by any other criteria (like transcript code), you do it like this:
print( mydtu$Transcripts[target_id == "MIX6.c4", ] )
- The
DTU
column in theGenes
table marks which genes show significant change in the proportions of their transcripts. It takes into account all the transcripts of the gene, which makes it more sensitive to changes but prevents it from attributing the change to specific transcripts in the gene. An expanded version of the column is copied into theTranscripts
table asgene_DTU
. - The
DTU
column in theTranscripts
table marks the transcripts whose proportion changes significantly. It inspects each transcript in isolation, which makes it less sensitive. It is aggregated into theGenes
table astransc_DTU
where it indicates that at least one transcript gave a positive individual result. -
parent_id
lists the gene IDs.target_id
lists the transcript IDs. - Most other columns offer additional information to help understand the result. Refer to the next example for more details.
The Transcripts$DTU
and Genes$DTU
calls can differ from one another, as they are based on different tests
and subject to different restrictions. They contain complementary information.
As seen in the previous example, the output structure contains two large tables. This example will focus on explaining what all the fields are.
We'll use the same simulated dataset as the previous example, but we'll add bootstrapping to the call of DTU, so we have information in all the output fields.
library(data.table)
library(rats)
# Same data simulation as previous example.
sim <- sim_sleuth_data(cnames = c("foo", "bar"))
# Calling and bootstrapping DTU.
mydtu <- call_DTU(sim$slo, sim$annot, "foo", "bar", boots = "both", bootnum = 100)
The two extra arguments in the function-call specify that we want to bootstrap both the gene-level and transcript-level tests, and that we want 100 bootstrap iterations. More on bootstrapping when we address the input parameters in the next example.
Parameters
is a list that contains information about the data and the settings.
# Types of list's elements.
print( summary(mydtu$Parameters) )
-
var_name
The name of the variable by which the samples were grouped. This is a column name in theslo$sample_to_covariates
table. -
cond_A
andcond_B
The names of the two groups of samples to compare. These are values of the column specified above. -
num_replic_A
andnum_replic_B
The number of samples in each group. -
p_thresh
The significance level at which DTU is called, applicable to both tests. -
count_thresh
The minimum required fragment count per sample for each transcript. -
dprop_thresh
The minimum difference in proportion that is considered biologically significant. -
tests
Which of the two tests was carried out. -
bootstrap
Which of the two tests was bootstrapped. -
bootnum
The number of bootstrap iterations.
Genes
is a data.table with many fields, listing
results at the gene level. For your convenience, the transcript-level DTU calls are also summarised here.
The G-test is used for gene-level calls. As the test is designed for comparison of a set of counts against a theoretical set of proportions and, instead, we have two sets of counts, the test is run using in turn each condition as reference for the proportions. It is not possible to attribute the change to specific transcripts within the gene, but in return the test should be more sensitive to small changes than the transcript-level one.
# Types of table's elements.
print( lapply(mydtu$Genes, typeof) )
-
parent_id
The identification name/code of the gene. -
DTU
The gene collectively shows DTU, according to the gene-wise test. It is not possible to attribute this to specific transcripts. To be DTU, a gene needs both biological and statistical significance:DTU = (sig & elig_fx)
. -
transc_DTU
Aggregated fromTranscript$DTU
. Signifies that at least one transcript has changed significantly according to the transcript-wise test. Look up the gene in theTranscripts
table to see which transcripts are responsible. -
known_transc
The number of annotated transcripts for the gene. -
detect_transc
The number of transcripts with non-zero expression in at least one of the two conditions.known_transc >= detect_transc
. -
elig_transc
The number of transcripts eligible for DTU calling.detect_transc >= elig_transc
. See theTranscripts
table info below for details on how these are determined. -
elig
Eligible for testing: By definition, a gene needs at least 2 eligible transcripts in order for DTU to even be possible. -
elig_fx
Eligible effect size: At least one of the transcripts shows eligible effect size. (seeTranscripts
table explanations) -
pvalAB
andpvalBA
The raw P-values from the G tests, using in turn each condition as reference. -
pvalAB_corr
andpvalBA_corr
The above P-values adjusted for multiple testing (number of tested genes). -
sig
Statistically significant:pvalAB_corr < Parameters$p_thresh & pvalBA_corr < Parameters$p_thresh
. -
boot_freq
The fraction of bootstrap iterations in which the gene was called DTU. It can be used as a measure of confidence in the DTU call. Values closer to1
indicate strong tendency for DTU, whereas values nearer0
indicate lack of DTU. Values near0.5
indicate lack of confidence either way and should be treated carefully. -
boot_meanAB
andboot_meanBA
Mean P-values across the bootstraps. -
boot_stdevAB
andboot_stdevBA
Standard deviations of the mean P-values across the bootstraps. -
boot_minAB
andboot_minBA
The minimum (=most significant) P-values that occurred in the bootstraps. -
boot_maxAB
andboot_maxBA
The maximum (=least significant) P-values that occurred in the bootstraps. -
boot_na
The fraction of bootstrap iterations in which the gene was not eligible for DTU calling.
Note: The fields reporting on the bootstraps will not be shown when bootstrapping is disabled.
Transcripts
is a data.table with many fields, listing
results at the transcript level. For your convenience, the gene-level DTU calls are also included here.
The proportions test is used for the transcript-level calls. Changes can be attributed to specific transcripts, but in return the test is less sensitive as it uses less information that the gene-level one.
# Types of table's elements.
print( lapply(mydtu$Transcripts, typeof) )
-
target_id
andparent_id
The identification name/code of the transcript and gene, respectively. -
DTU
Whether the transcript's proportion changed significantly, according to the transcript-wise test. -
gene_DTU
Expanded fromGenes$DTU
. Indicates that the gene as a whole shows significant change in proportions. -
meanA
andmeanB
The mean of counts across samples, for each condition. -
stdevA
andstdevB
The standard deviation of the mean counts across samples, for each condition. -
sumA
andsumB
The sum of counts across samples for each condition. This is used for the tests, so that a higher number of samples leads to higher significance. -
totalA
andtotalB
The total counts for the gene.totalA = sum(sumA) # by gene
. -
elig_xp
Eligible expression level: A fragment count above the defined threshold in at least one of the two conditions is required.elig_xp = (meanA > Parameters$count_thresh | meanB > Parameters$count_thresh)
. -
elig
Eligible for testing: A transcript needs to meet the eligible expression level and its gene needs to have more than one eligible expression transcripts (by DTU definition), and finally the gene as a whole needs to have detectable expression in both conditions (otherwise the proportion cannot be defined).elig = (elig_xp & totalA != 0 & totalB != 0 & (sumA != totalA | sumB != totalB))
-
propA
andpropB
The proportion of the gene expression owed to this transcript, in each condition. -
Dprop
The difference in the proportion of the transcript between the two conditions. -
elig_fx
Eligible effect size: Proxy for biological significance. Meaninglessly small changes in the proportion can be statistically significant if transcript expression or sequencing depth are very high.elig_fx = (Dprop > Parameters$dprop_thresh)
. -
pval
The raw P-value of the proportions test for the transcript. -
pval_corr
The above P-value adjusted for multiple testing (number of transcripts). -
sig
Statistically significant:pval_corr < Parameters$p_thresh
. -
boot_freq
The fraction of bootstrap iterations in which the transcript was called DTU. It can be used as a measure of confidence in the DTU call. Values closer to1
indicate strong tendency for DTU, whereas values nearer0
indicate lack of DTU. Values near0.5
indicate lack of confidence either way and should be treated carefully. -
boot_mean
The mean P-value across the bootstraps. -
boot_stdev
The standard deviation of the mean P-value across the bootstraps. -
boot_min
The minimum (most significant) P-value that occurred in the bootstraps. -
boot_max
The maximum (most significant) P-value that occurred in the bootstraps. -
boot_na
The fraction of bootstrap iterations in which the transcripts were not eligible for DTU calling.
Note The fields reporting on the bootstraps will not be shown when bootstrapping is disabled.
Now that you know what all the fields are, let's revisit the output of the DTU call and look into some details of its behaviour and how to interpret the information.
# Let's check the info and settings.
print( mydtu$Parameters )
- We're comparing samples by
r mydtu$Parameters[["var_name"]]
, and the two values of that arer mydtu$Parameters[["cond_A"]]
(withr mydtu$Parameters[["num_replic_A"]]
samples) andr mydtu$Parameters[["cond_B"]]
(withr mydtu$Parameters[["num_replic_B"]]
samples). - The significance threshold is set to
r mydtu$Parameters[["p_thresh"]]
, the minimum count per sample is set tor mydtu$Parameters[["count_thresh"]]
fragments and the proportion has to change by at leastr mydtu$Parameters[["dprop_thresh"]]
to be considered biologically significant. - Are we doing gene-level or transcript-level tests?
r mydtu$Parameters[["tests"]]
- Are we bootstrapping the gene-level or transcript-level tests?
r mydtu$Parameters[["bootstrap"]]
. And we're doingr mydtu$Parameters[["bootnum"]]
iterations.
# Gene-level calls.
print( mydtu$Genes )
There are r dim(mydtu$Genes)[1]
genes in the annotation used. Here are some possible scenarios:
-
1A1N
has only one known transcript and is thus not eligible. If you look inslo$kal[[1]]$bootstrap[[1]]
, there are actually two recorded transcripts for this gene, but only one of them is recorded in the annotation.rats
uses the annotation as the basis for managing information. Any transcripts/genes present in the sleuth data, but missing from the annotation, will be ignored completely. -
1B1C
has two annotated transcripts. If you look inslo$kal[[1]]$bootstrap[[1]]
, only one of them is recorded in our data whereas the other is missing completely, so the gene does not have enough transcripts to be eligible. -
1D1C
has two transcripts, but only one was detected in our data, so it too is also not eligible. Both transcript are present in the sleuth data, but one of them has zero counts in both conditions. Thus this gene is also not eligible. -
CC
has two known transcripts, both of which are detected and eligible, therefore it is eligible for DTU. It is called as DTU on both the transcript and gene levels, with very high significances. The bootstraps show that the calls were not 100% positive, so there is a small level of uncertainty regarding this gene. -
LC
has two known transcripts, both of which are detected but only one of them is eligible, so it is not eligible for testing at the gene level. An inspection of theTranscripts
table shows that the reason is that one of the two transcripts does not meet the minimum fragment count requirement. -
MIX6
has six transcripts, 5 of which are detected and eligible. The gene is called as DTU with extremely high significance and all the bootstrap iterations were positive. Inspection of theTranscripts
table shows that the missing transcript is not detected in either condition. -
NIB
has a single transcript that is not detected. Looking intoslo$kal[[1]]$bootstrap[[1]]
it is evident that the entire gene is not recorded at all in the data. -
NN
has two transcripts, both of which are detected and eligible, but no significant change is detected, a result backed up by all the bootstrap iterations. -
ALLA
has one transcript and it is detected in only one of the conditions, but one transcript is not enough to be eligible. -
ALLB
has two transcripts, both detected, yet is still not eligible as both transcripts are not eligible. Inspection ofTranscripts
shows that neither transcript is detected in one of the conditions. Although the gene would likely be called as DTE and DGE by the appropriate tools, it is not possible to define proportions in that condition, so it is not possible to call DTU for it.
Note: The values shown in the bootstrap columns will differ in each run of the same command, especially for few iterations, due to random sampling.
# Transcript-level calls.
print( mydtu$Transcripts )
Several mentions of this table were made during analysis of the Genes
table. Here are some possible scenarios:
-
CC_a
andCC_b
both show counts (sumA
andsumB
) above the threshold (Parameters$counts_thresh
) so they are eligible. They are positive for DTU at both the gene and transcript level. The frequency of positive calls in the bootstraps, however, is not quite so good. Additionally, the change in proportion is only marginally above the already quite permissive threshold. So these transcripts are marginal and should be treated carefully. -
LC1
andLC2
both change in proportion above the threshold, butLC1
does not meet the counts threshold on average.LC2
is eligible for transcript-wise testing and comes out as non-DTU, with good support by the bootstraps. Gene-level DTU cannot be called sinceLC1
did not meet all the criteria. -
MIX.c1
,MIX.c2
andMIX.c4
all show significant individual change.MIX.c3
, despite statistical significance, does not meet the effect size threshold and is thus called non-DTU.MIX.nc
shows no significant change.MIX6.d
is not detected at all. -
1NN
and2NN
do not show any change. -
NIB.1
is not detected at all. In fact the entire gene is not detected (since this is the only transcript of the gene). Therefore it is not possible to calculate anything for this gene/transcript, leading toNA
in this row. -
ALLA1
andALLB1
/ALLAB2
are similar in that they are detected in one condition only. Proportions cannot be defined for the other condition (division by 0), so they are not eligible for testing.
Examples 1 and 2 mentioned the possibility that some transcripts/genes may be missing altogether from either the
annotation or the sleuth object. This is likely to be caused by use of different annotation versions during different
stages of the analysis. This is generally a bad idea, as there is no guarantee (at least for some annotations)
that the transcript identification codes will remain consistent between annotation versions. rats
will not provide
any warning about this, and will carry out the DTU calls in the most sensible way possible, assuming that the
intersection of transcript identifications between the sleuth data and annotation dataframe are consistent. All
internal operations and the output will be based on the annotation dataframe provided:
- Any transcripts/genes present in the counts data but missing from the annotation will be ignored completely and will not show up in the output.
- Any transcript/gene present in the annotation but missing from the count data will be padded with zero counts throughout and will be included in the output.
- If the samples appear to use different annotations from one another,
rats
will abort.
We'll still use the simulated data, as in the previous examples.
library(rats)
library(data.table)
sim <- sim_sleuth_data(cnames = c("foo", "bar"))
So far, we've always compared conditions (slo$sample_to_covariates[["condition"]]
). Your data could contain other
variables as well. As we saw in a previous example, our simulated data contains one called batch
.
# Compare by a different variable. In this case "batch".
mydtu <- call_DTU(sim$slo, sim$annot, "ba", "bb", varname= "batch", verbose = FALSE)
We already saw in the previous example how to activate bootstrapping of the DTU calls. It adds significant processing time, but we strongly recommend using this option with a decent number of iterations, as it provides a measure of confidence in the p-values. Current transcript quantification algorithms such as kallisto, salmon and sailfish use probabilistic approaches to estimate transcript abundance. Bootstrapping the DTU calls controls for the variability in these quantifications.
# Bootstrap everything.
mydtu <- call_DTU(sim$slo, sim$annot, "foo", "bar", boots = "both", bootnum = 100, verbose = FALSE)
# Only bootstrap transcript calls.
mydtu <- call_DTU(sim$slo, sim$annot, "foo", "bar", boots = "transc", verbose = FALSE)
# Only bootstrap gene calls.
mydtu <- call_DTU(sim$slo, sim$annot, "foo", "bar", boots = "genes", verbose = FALSE)
-
boots
Although we recommend bootstrapping both gene-level and transcript-level DTU calls, dropping one would reduce computation time, so the option is provided to allow flexibility for special use cases. -
bootnum
Generally, greater is better but it takes longer. There is also an upper limit that depends on the size of your annotation. This is due to R's limit on the maximum size of matrices.
Three thresholds can be set in rats
.
mydtu <- call_DTU(sim$slo, sim$annot, "foo", "bar", p_thresh = 0.01,
count_thresh = 10, dprop_thresh = 0.25, verbose = FALSE)
# You can view if/how this changes the result by using the commands shown in the
# previous two examples.
-
p_thresh
Significance level. P-values below this will be considered significant. 0.05 is a very permissive value. -
count_thresh
At least one of the two conditions needs to exceed this number of fragments per sample for the transcript to be considered. Low counts have high uncertainty and low significance. Removing such entries from the analysis when both conditions are low-count improves the confidence in the predictions of the remaining transcripts, and reduces computation time. -
dprop_thresh
This is a proxy for biological significance. A transcript's proportion must change by at least this much for it to be considered. The need for this is due to the fact that high counts will make even minute differences show up as statistically significant, however very small changes are likely to be biologically indifferent.
The default values for all three thresholds are rather lenient. You may want to set them higher.
To quickly check the results of higher thresholds AFTER a run of rats
,you can use normal table filtering operations,
but bear in mind that aggregated/expanded/derivative fields will not update and that some fields are determined by multiple others:
# Inspect higher statistical significance threshold.
new_thresh = 0.01
newgenecalls <- mydtu$Genes[, .(parent_id,
pvalAB_corr < new_thresh &
pvalBA_corr < new_thresh &
elig_fx)] # DTU requires both stats and biol significance
newtransccalls <- mydtu$Transcripts[, .(target_id, parent_id,
pval_corr < new_thresh & elig_fx)] # same
print( newgenecalls )
rats
runs gene-level calls (slower, more sensitive, less fine-grained) and transcript-level calls
(faster, less sensitive, more explicit report). They are complementary, and we recommend using them
together, but the option to skip either is provided for special use cases.
The fields of the skipped test will be filled with NA
.
# Transcripts only.
mydtu <- call_DTU(sim$slo, sim$annot, "foo", "bar", testmode="transc", verbose = FALSE)
# Genes only.
mydtu <- call_DTU(sim$slo, sim$annot, "foo", "bar", testmode="genes", verbose = FALSE)
Testing multiple null hypotheses increases the chance of one being falsely rejected. To keep the overall false rate at the
desired level, the raw p-values must be adjusted. This is achieved using R's build-in p.adjust()
. The default adjustment
method is BH
(Benjamini-Hoffmann). A full list of options is listed in R's p.adjust.methods
.
# Bonferroni correction.
mydtu <- call_DTU(sim$slo, sim$annot, "foo", "bar", correction = "bonferroni", verbose = FALSE)
rats
needs to pull information from different fields of the sleuth object and the annotation. For flexibility you can
change the names of these fields.
# Lets create some input with custom field names. The data is exactly the same as before.
sim <- sim_sleuth_data(varname="mouse", cnames=c("Splinter", "Mickey"), COUNTS_COL="the-counts",
TARGET_COL="transcript", PARENT_COL="gene", BS_TARGET_COL = "trscr")
print( sim$slo$sample_to_covariates )
print( sim$slo$kal[[1]]$bootstrap[[1]] )
print( sim$annot )
With the field names changed, we need to tell rats
where to find the data:
mydtu <- call_DTU(sim$slo, sim$annot, "Splinter", "Mickey", varname="mouse",
TARGET_COL="transcript", PARENT_COL="gene",
COUNTS_COL="the-counts", BS_TARGET_COL="trscr", verbose = FALSE)
# The output structure will always use the same field names, regardless of
# what the input names are.
print( names(mydtu$Transcripts) )
The dtu object's tables provide a host of information. The rats
package also includes some basic visualisation aids.
Let's start by setting up our usual simulated data.
library(rats)
sim <- sim_sleuth_data(cnames = c("foo", "bar"))
mydtu <- call_DTU(sim$slo, sim$annot, "foo", "bar", boots="both", bootnum=100, verbose = FALSE)
A table full of numbers may be heaven to some and hell to others. This function allows you to visualise what's going on in any particular gene.
# Proportion changes for all the transcripts of the "MIX6" gene.
# !!! The ERROR BARS represent the standard error for the estimated proportions.
plot_gene(mydtu, "MIX6")
A variation of the above is to plot the mean fragment counts instead of the proportions:
# Absolute expression changes for all the transcripts of the "MIX6" gene.
# !!! The ERROR BARS represent 2 standard deviations from the mean count across replicates.
plot_gene(mydtu, "MIX6", vals="counts")
This family of plots show the distribution of the calls. Our usual simulated dataset is too small to demonstrate what these plots typically would look like, but it is adequate to demonstrate the usage. A more illustrative dataset will be prepared and used in a future release. We are also planning to make these plots interactive.
Several of these plots are likely to display warnings about missing or non-finite values. These are due to the
presence of NA
in the tables, for reasons already discussed in the previous examples, and can be ignored.
# Proportion change VS significance. Probably the most common plot. Similar to volcano plot.
plot_overview(mydtu, type="dpropVsig")
# Proportion change VS transcript expression.
plot_overview(mydtu, type="dpropVcount")
# Proportion VS transcript expression.
plot_overview(mydtu, type="propVcount")
# Distribution of maximum proportion change.
plot_overview(mydtu, type="maxdprop")
# Distribution of bootstrapped confidence of transcript-level DTU.
plot_overview(mydtu, type="transc-conf")
# Distribution of bootstrapped confidence of gene-level DTU.
plot_overview(mydtu, type="gene-conf")
# Transcript-level confidence threshold VS. number of DTU positive calls.
plot_overview(mydtu, type="trconfVdtu")
# Gene-level confidence threshold VS. number of DTU positive calls.
plot_overview(mydtu, type="gconfVdtu")
You can save any of the plots as a ggplot2
object and use ggplot2 manipulations on it, such as changing the axis scales.
Other ggplot2
customisations include the axis tickmarks, axis values, labels, titles, colours... Consult the ggplot2
documentation for more
help on these.
library(ggplot2)
myplot <- plot_overview(mydtu, type="dpropVsig")
myplot # display
# Change scale of y axis.
myplot2 <- myplot + coord_trans(y="sqrt") # Or "log10", "log10", etc...
myplot2
# Revert to linear scale.
myplot2 <- myplot + coord_trans(y="identity")
myplot2