diff --git a/DESCRIPTION b/DESCRIPTION index 204cf7eb..59cccbcf 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,14 +1,14 @@ Package: immunarch Type: Package Title: Bioinformatics Analysis of T-Cell and B-Cell Immune Repertoires -Version: 0.6.9 +Version: 0.7.0 Authors@R: c( person("Vadim I.", "Nazarov", , "support@immunomind.io", c("aut", "cre")), person("Vasily O.", "Tsvetkov", , role = "aut"), person("Eugene", "Rumynskiy", , role = "aut"), person("Aleksandr A.", "Popov", , role = "aut"), person("Ivan", "Balashov", , role = "aut"), - person("Maria", "Volobueva", , role = "aut"), + person("Maria", "Samokhina", , role = "aut"), person("Anna", "Lorenc", , role = "ctb"), person("Daniel J.", "Moore", , role = "ctb"), person("Victor", "Greiff", , role = "ctb"), @@ -18,7 +18,7 @@ Contact: support@immunomind.io Description: A comprehensive framework for bioinformatics exploratory analysis of bulk and single-cell T-cell receptor and antibody repertoires. It provides seamless data loading, analysis and visualisation for AIRR (Adaptive Immune Receptor Repertoire) data, both bulk immunosequencing (RepSeq) - and single-cell sequencing (scRNAseq). It implements most of the widely used AIRR analysis methods, + and single-cell sequencing (scRNAseq). Immunarch implements most of the widely used AIRR analysis methods, such as: clonality analysis, estimation of repertoire similarities in distribution of clonotypes and gene segments, repertoire diversity analysis, annotation of clonotypes using external immune receptor databases and clonotype tracking in vaccination and cancer studies. A successor to our @@ -65,7 +65,8 @@ Imports: glue, phangorn, uuid, - stringi + stringi, + ggraph Depends: R (>= 4.0.0), ggplot2 (>= 3.1.0), diff --git a/NAMESPACE b/NAMESPACE index 94821559..930ad526 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -1,5 +1,7 @@ # Generated by roxygen2: do not edit by hand +S3method(vis,clonal_family) +S3method(vis,clonal_family_tree) S3method(vis,immunr_chao1) S3method(vis,immunr_clonal_prop) S3method(vis,immunr_dbscan) @@ -35,6 +37,7 @@ S3method(vis,immunr_spectr) S3method(vis,immunr_spectr_nogene) S3method(vis,immunr_top_prop) S3method(vis,immunr_tsne) +S3method(vis,step_failure_ignored) export(apply_asymm) export(apply_symm) export(bunch_translate) @@ -87,6 +90,7 @@ export(repOverlap) export(repOverlapAnalysis) export(repSample) export(repSave) +export(repSomaticHypermutation) export(select_barcodes) export(select_clusters) export(seqCluster) @@ -111,7 +115,7 @@ importFrom(Rcpp,sourceCpp) importFrom(UpSetR,fromExpression) importFrom(UpSetR,upset) importFrom(ape,as.DNAbin) -importFrom(ape,muscle) +importFrom(ape,clustal) importFrom(ape,read.tree) importFrom(circlize,chordDiagram) importFrom(data.table,":=") @@ -125,6 +129,7 @@ importFrom(data.table,setDT) importFrom(data.table,setcolorder) importFrom(data.table,setnames) importFrom(doParallel,registerDoParallel) +importFrom(doParallel,stopImplicitCluster) importFrom(dplyr,arrange) importFrom(dplyr,as_tibble) importFrom(dplyr,collect) @@ -166,6 +171,10 @@ importFrom(ggpubr,ggscatter) importFrom(ggpubr,rotate_x_text) importFrom(ggpubr,stat_compare_means) importFrom(ggpubr,theme_pubr) +importFrom(ggraph,geom_edge_diagonal) +importFrom(ggraph,geom_node_point) +importFrom(ggraph,ggraph) +importFrom(ggraph,theme_graph) importFrom(ggseqlogo,geom_logo) importFrom(ggseqlogo,theme_logo) importFrom(glue,glue) @@ -181,13 +190,18 @@ importFrom(magrittr,"%>%") importFrom(magrittr,extract2) importFrom(magrittr,set_attr) importFrom(methods,as) +importFrom(parallel,clusterExport) importFrom(parallel,detectCores) +importFrom(parallel,makeCluster) importFrom(parallel,mclapply) +importFrom(parallel,parApply) +importFrom(parallel,stopCluster) importFrom(patchwork,plot_annotation) importFrom(patchwork,wrap_plots) importFrom(phangorn,write.phyDat) importFrom(pheatmap,pheatmap) importFrom(plyr,.) +importFrom(plyr,adply) importFrom(plyr,dlply) importFrom(plyr,mapvalues) importFrom(purrr,imap) @@ -256,8 +270,10 @@ importFrom(stringdist,stringdistmatrix) importFrom(stringi,stri_replace_all_fixed) importFrom(stringr,boundary) importFrom(stringr,fixed) +importFrom(stringr,str_c) importFrom(stringr,str_count) importFrom(stringr,str_detect) +importFrom(stringr,str_extract) importFrom(stringr,str_extract_all) importFrom(stringr,str_length) importFrom(stringr,str_match) @@ -270,7 +286,9 @@ importFrom(stringr,str_sub) importFrom(stringr,str_trim) importFrom(tibble,rownames_to_column) importFrom(tibble,tibble) +importFrom(tidyr,drop_na) importFrom(tidyr,unite) +importFrom(tidyr,unnest) importFrom(tidyselect,starts_with) importFrom(utils,capture.output) importFrom(utils,packageVersion) diff --git a/R/align_lineage.R b/R/align_lineage.R index 33dfc08c..68af4c55 100644 --- a/R/align_lineage.R +++ b/R/align_lineage.R @@ -1,8 +1,4 @@ -#' This function aligns all sequences (incliding germline) that belong to one clonal lineage and one cluster. -#' After clustering and building the clonal lineage and germline, the next step is to analyze the degree of mutation -#' and maturity of each clonal lineage. This allows for finding high mature cells and cells with a large -#' number of offspring. The phylogenetic analysis will find mutations that increase the affinity of BCR. -#' Making alignment of the sequence is the first step towards sequence analysis including BCR. +#' Aligns all sequences incliding germline within each clonal lineage within each cluster #' #' @concept align_lineage #' @@ -14,11 +10,16 @@ #' @importFrom purrr map_dfr #' @importFrom rlist list.remove #' @importFrom utils str -#' @importFrom ape as.DNAbin muscle -#' @importFrom doParallel registerDoParallel +#' @importFrom ape as.DNAbin clustal +#' @importFrom doParallel registerDoParallel stopImplicitCluster #' @importFrom parallel mclapply -#' @description Aligns all sequences incliding germline within each clonal lineage within each cluster +#' @description This function aligns all sequences (incliding germline) that belong to one clonal +#' lineage and one cluster. After clustering and building the clonal lineage and germline, the next +#' step is to analyze the degree of mutation and maturity of each clonal lineage. This allows for +#' finding high mature cells and cells with a large number of offspring. The phylogenetic analysis +#' will find mutations that increase the affinity of BCR. Making alignment of the sequence +#' is the first step towards sequence analysis including BCR. #' #' @usage #' @@ -39,13 +40,13 @@ #' @param .align_threads Number of threads for lineage alignment. #' #' It must have columns in the immunarch compatible format \link{immunarch_data_format}, and also -#' must contain 'Cluster' column, which is added by seqCluster() function, and 'Sequence.germline' +#' must contain 'Cluster' column, which is added by seqCluster() function, and 'Germline.sequence' #' column, which is added by repGermline() function. #' #' @param .verbose_output If TRUE, all output dataframe columns will be included (see documentation about this #' function return), and unaligned clusters will be included in the output. Setting this to TRUE significantly -#' increases memory usage. If FALSE, only aligned clusters and columns required for repClonalFamily() calculation -#' will be included in the output. +#' increases memory usage. If FALSE, only aligned clusters and columns required for repClonalFamily() and +#' repSomaticHypermutation() calculation will be included in the output. #' #' @param .nofail Will return NA instead of stopping if Clustal W is not installed. #' Used to avoid raising errors in examples on computers where Clustal W is not installed. @@ -56,16 +57,21 @@ #' The dataframe has these columns: #' * Cluster: cluster name #' * Germline: germline sequence +#' * V.germline.nt: germline V gene sequence +#' * J.germline.nt: germline J gene sequence +#' * CDR3.germline.length: length of CDR3 in germline #' * Aligned (included if .verbose_output=TRUE): FALSE if this group of sequences was not aligned with lineage #' (.min_lineage_sequences is below the threshold); TRUE if it was aligned #' * Alignment: DNAbin object with alignment or DNAbin object with unaligned sequences (if Aligned=FALSE) -#' * V.length (included if .verbose_output=TRUE): shortest length of V gene part outside of CDR3 region in this +#' * V.length: shortest length of V gene part outside of CDR3 region in this #' group of sequences; longer V genes (including germline) are trimmed to this length before alignment -#' * J.length (included if .verbose_output=TRUE): shortest length of J gene part outside of CDR3 region in this +#' * J.length: shortest length of J gene part outside of CDR3 region in this #' group of sequences; longer J genes (including germline) are trimmed to this length before alignment -#' * Sequences (included if .verbose_output=TRUE): nested dataframe containing all sequences for this combination +#' * Sequences: nested dataframe containing all sequences for this combination #' of cluster and germline; it has columns -#' Sequence, V.end, J.start, CDR3.start, CDR3.end; all values taken from the input dataframe +#' Sequence, Clone.ID, Clones, CDR1.nt, CDR2.nt, CDR3.nt, FR1.nt, FR2.nt, FR3.nt, FR4.nt +#' and, if .verbose_output=TRUE, also V.end, J.start, CDR3.start, CDR3.end; +#' all values taken from the input dataframe #' #' @examples #' @@ -74,7 +80,7 @@ #' #' bcr_data %>% #' seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% -#' repGermline() %>% +#' repGermline(.threads = 1) %>% #' repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) #' @export repAlignLineage repAlignLineage <- function(.data, @@ -88,7 +94,7 @@ repAlignLineage <- function(.data, "Please download it from here: http://www.clustal.org/download/current/\n", "or install it with your system package manager (such as apt or dnf)." ), .nofail)) { - return(NA) + return(get_empty_object_with_class("step_failure_ignored")) } doParallel::registerDoParallel(cores = .prepare_threads) @@ -96,14 +102,22 @@ repAlignLineage <- function(.data, apply_to_sample_or_list( align_single_df, .min_lineage_sequences = .min_lineage_sequences, + .parallel_prepare = .prepare_threads > 1, .align_threads = .align_threads, .verbose_output = .verbose_output ) + doParallel::stopImplicitCluster() return(.data) } -align_single_df <- function(data, .min_lineage_sequences, .align_threads, .verbose_output) { - for (required_column in c("Cluster", "Germline.sequence")) { +align_single_df <- function(data, + .min_lineage_sequences, + .parallel_prepare, + .align_threads, + .verbose_output) { + for (required_column in c( + "Cluster", "Germline.sequence", "V.germline.nt", "J.germline.nt", "CDR3.germline.length" + )) { if (!(required_column %in% colnames(data))) { stop( "Found dataframe without required column ", @@ -120,7 +134,7 @@ align_single_df <- function(data, .min_lineage_sequences, .align_threads, .verbo .fun = prepare_results_row, .min_lineage_sequences = .min_lineage_sequences, .verbose_output = .verbose_output, - .parallel = TRUE + .parallel = .parallel_prepare ) %>% `[`(!is.na(.)) %>% unname() @@ -142,7 +156,7 @@ align_single_df <- function(data, .min_lineage_sequences, .align_threads, .verbo mc.cores = .align_threads ) - return(convert_results_to_df(results, alignments, .verbose_output)) + return(convert_results_to_df(results, alignments)) } # this function accepts dataframe subset containing rows only for current lineage @@ -150,6 +164,9 @@ align_single_df <- function(data, .min_lineage_sequences, .align_threads, .verbo prepare_results_row <- function(lineage_subset, .min_lineage_sequences, .verbose_output) { cluster_name <- lineage_subset[[1, "Cluster"]] germline_seq <- lineage_subset[[1, "Germline.sequence"]] + germline_v <- lineage_subset[[1, "V.germline.nt"]] + germline_j <- lineage_subset[[1, "J.germline.nt"]] + germline_cdr3_len <- lineage_subset[[1, "CDR3.germline.length"]] aligned <- nrow(lineage_subset) >= .min_lineage_sequences if (!aligned & !.verbose_output) { @@ -163,13 +180,19 @@ prepare_results_row <- function(lineage_subset, .min_lineage_sequences, .verbose lineage_subset[["Sequence"]], lineage_subset[["J.start"]], lineage_subset[["CDR3.end"]] ) + sequences_columns <- c( + "Sequence", "Clone.ID", "Clones", + "CDR1.nt", "CDR2.nt", "CDR3.nt", "FR1.nt", "FR2.nt", "FR3.nt", "FR4.nt" + ) if (.verbose_output) { - sequences <- lineage_subset[c("Sequence", "V.end", "J.start", "CDR3.start", "CDR3.end")] + sequences_columns %<>% c("V.end", "J.start", "CDR3.start", "CDR3.end") } + sequences <- lineage_subset[sequences_columns] + sequences[["Clone.ID"]] %<>% as.integer() + sequences[["Clones"]] %<>% as.integer() - germline_parts <- strsplit(germline_seq, "N")[[1]] - germline_v_len <- stringr::str_length(germline_parts[1]) - germline_j_len <- stringr::str_length(tail(germline_parts, 1)) + germline_v_len <- str_length(germline_v) + germline_j_len <- str_length(germline_j) v_min_len <- min(lineage_subset[["V.lengths"]], germline_v_len) j_min_len <- min(lineage_subset[["J.lengths"]], germline_j_len) @@ -181,12 +204,21 @@ prepare_results_row <- function(lineage_subset, .min_lineage_sequences, .verbose lineage_subset[["J.lengths"]], j_min_len ) - alignment <- convert_to_dnabin(germline_trimmed, clonotypes_trimmed) + + clonotypes_names <- sapply(lineage_subset[["Clone.ID"]], function(id) { + paste0("ID_", id) + }) + all_sequences_list <- c(list(germline_trimmed), as.list(clonotypes_trimmed)) + names(all_sequences_list) <- c("Germline", clonotypes_names) + alignment <- convert_seq_list_to_dnabin(all_sequences_list) if (.verbose_output) { return(list( Cluster = cluster_name, Germline = germline_seq, + V.germline.nt = germline_v, + J.germline.nt = germline_j, + CDR3.germline.length = germline_cdr3_len, Aligned = aligned, Alignment = alignment, V.length = v_min_len, @@ -197,43 +229,36 @@ prepare_results_row <- function(lineage_subset, .min_lineage_sequences, .verbose return(list( Cluster = cluster_name, Germline = germline_seq, - Alignment = alignment + V.germline.nt = germline_v, + J.germline.nt = germline_j, + CDR3.germline.length = germline_cdr3_len, + Alignment = alignment, + V.length = v_min_len, + J.length = j_min_len, + Sequences = sequences )) } } -convert_to_dnabin <- function(germline_seq, clonotypes) { - all_sequences_list <- c(list(germline = germline_seq), as.list(clonotypes)) - dnabin <- all_sequences_list %>% - lapply( - function(sequence) { - sequence %>% - stringr::str_extract_all(stringr::boundary("character")) %>% - unlist() - } - ) %>% - ape::as.DNAbin() - return(dnabin) -} - # trim V/J tails in sequence to the specified lenghts v_min, j_min trim_seq <- function(seq, v_len, v_min, j_len, j_min) { - stringr::str_sub(seq, v_len - v_min + 1, -(j_len - j_min + 1)) + str_sub(seq, v_len - v_min + 1, -(j_len - j_min + 1)) } -convert_results_to_df <- function(nested_results_list, nested_alignments_list, .verbose_output) { +convert_results_to_df <- function(nested_results_list, nested_alignments_list) { alignments <- nested_alignments_list %>% lapply(magrittr::extract2, "Alignment") %>% tibble(Alignment = .) + sequences <- nested_results_list %>% + lapply(magrittr::extract2, "Sequences") %>% + tibble(Sequences = .) df <- nested_results_list %>% lapply(rlist::list.remove, c("Alignment", "Sequences")) %>% purrr::map_dfr(~.) %>% - cbind(alignments) - if (.verbose_output) { - sequences <- nested_results_list %>% - lapply(magrittr::extract2, "Sequences") %>% - tibble(Sequences = .) - df %<>% cbind(sequences) + cbind(alignments, sequences) + # fix column types after dataframe rebuilding + for (column in c("CDR3.germline.length", "V.length", "J.length")) { + df[[column]] %<>% as.integer() } return(df) } diff --git a/R/distance.R b/R/distance.R index ec6c2f11..8fb316c9 100644 --- a/R/distance.R +++ b/R/distance.R @@ -1,5 +1,7 @@ #' Function for computing distance for sequences #' +#' @concept distance +#' #' @importFrom stringdist stringdistmatrix #' @importFrom purrr map pmap map2 #' @importFrom magrittr %>% %<>% set_attr diff --git a/R/germline.R b/R/germline.R index 487ec509..aa335523 100644 --- a/R/germline.R +++ b/R/germline.R @@ -1,80 +1,111 @@ -#' This function creates germlines for clonal lineages. B cell clonal lineage represents a set of B cells -#' that presumably have a common origin (arising from the same VDJ rearrangement event) and a common ancestor. -#' Each clonal lineage has its own germline sequence that represents the ancestral sequence -#' for each BCR in clonal lineage. In other words, germline sequence is a sequence of B-cells immediately -#' after VDJ recombination, before B-cell maturation and hypermutation process. Germline sequence is useful -#' for assessing the degree of mutation and maturity of the repertoire. +#' Creates germlines for clonal lineages #' #' @concept germline #' #' @aliases repGermline #' -#' @importFrom stringr str_sub str_length str_replace fixed -#' @importFrom purrr imap +#' @importFrom stringr str_sub str_length str_replace fixed str_extract_all str_extract boundary str_c +#' @importFrom purrr imap map_dfr #' @importFrom magrittr %>% %<>% extract2 #' @importFrom dplyr filter rowwise - -#' @description Creates germlines for clonal lineages +#' @importFrom parallel parApply detectCores makeCluster clusterExport stopCluster +#' @importFrom ape as.DNAbin clustal +#' +#' @description This function creates germlines for clonal lineages. B cell clonal lineage +#' represents a set of B cells that presumably have a common origin (arising from the same VDJ +#' rearrangement event) and a common ancestor. Each clonal lineage has its own germline sequence +#' that represents the ancestral sequence for each BCR in clonal lineage. In other words, +#' germline sequence is a sequence of B-cells immediately after VDJ recombination, before +#' B-cell maturation and hypermutation process. Germline sequence is useful for assessing +#' the degree of mutation and maturity of the repertoire. #' #' @usage #' -#' repGermline(.data, species, min_nuc_outside_cdr3, ref_only_first) +#' repGermline(.data, +#' .species, .align_j_gene, .min_nuc_outside_cdr3, .threads) #' #' @param .data The data to be processed. Can be \link{data.frame}, \link{data.table} #' or a list of these objects. #' #' It must have columns in the immunarch compatible format \link{immunarch_data_format}. #' -#' @param species Species from which the data was acquired. Available options: +#' @param .species Species from which the data was acquired. Available options: #' "HomoSapiens" (default), "MusMusculus", "BosTaurus", "CamelusDromedarius", #' "CanisLupusFamiliaris", "DanioRerio", "MacacaMulatta", "MusMusculusDomesticus", #' "MusMusculusCastaneus", "MusMusculusMolossinus", "MusMusculusMusculus", "MusSpretus", #' "OncorhynchusMykiss", "OrnithorhynchusAnatinus", "OryctolagusCuniculus", "RattusNorvegicus", #' "SusScrofa". #' -#' @param min_nuc_outside_cdr3 This parameter sets how many nucleotides should have V or J chain +#' @param .align_j_gene MiXCR provides the number of J indels only for 1 allele of J gene +#' in the output file, and a germline can contain another allele. Therefore, calculation of +#' J gene start in reference based on numbers from input file can be sometimes incorrect. +#' As result, J gene in the germline will be trimmed in the start or will contain some +#' nucleotides from CDR3. Setting this parameter to TRUE enables precise alignment of J genes +#' to detect the correct starting point, but it significantly reduces performance. +#' +#' @param .min_nuc_outside_cdr3 This parameter sets how many nucleotides should have V or J chain #' outside of CDR3 to be considered good for further alignment. #' -#' @param ref_only_first This parameter, if TRUE, means to take only first sequence from reference -#' for each allele name; if FALSE, all sequences will be taken, and the output table will -#' increase in size as a result. +#' @param .threads Number of threads to use. #' #' @return #' -#' Data with added columns V.first.allele, J.first.allele (with first alleles of V and J genes), -#' V.sequence, J.sequence (with V and J reference sequences), -#' Germline.sequence (with combined germline sequence) +#' Data with added columns: +#' * V.first.allele, J.first.allele (first alleles of V and J genes), +#' * V.ref.nt, J.ref.nt (V and J reference sequences), +#' * V.germline.nt, J.germline.nt (V and J germline sequences; they are references with +#' trimmed parts that are from CDR3), +#' * CDR3.germline.length (length of CDR3 in the germline), +#' * Germline.sequence (combined germline sequence) #' #' @examples #' #' data(bcrdata) #' #' bcrdata$data %>% -#' repGermline() +#' top(5) %>% +#' repGermline(.threads = 1) #' @export repGermline -repGermline <- function(.data, species = "HomoSapiens", min_nuc_outside_cdr3 = 5, ref_only_first = TRUE) { +repGermline <- function(.data, + .species = "HomoSapiens", + .align_j_gene = FALSE, + .min_nuc_outside_cdr3 = 5, + .threads = parallel::detectCores()) { + if (.align_j_gene) { + require_system_package("clustalw", error_message = paste0( + "repGermline with .align_j_gene = TRUE requires Clustal W app to be installed!\n", + "Please download it from here: http://www.clustal.org/download/current/\n", + "or install it with your system package manager (such as apt or dnf)." + )) + } + # prepare reference sequences for all alleles genesegments_env <- new.env() data("genesegments", envir = genesegments_env) - reference <- genesegments_env$GENE_SEGMENTS %>% filter(species == species) + reference <- genesegments_env$GENE_SEGMENTS %>% filter(species == .species) rm(genesegments_env) reference <- reference[c("sequence", "allele_id")] - if (ref_only_first) { - reference <- reference[!duplicated(reference$allele_id), ] - } .data %<>% apply_to_sample_or_list( germline_single_df, .with_names = TRUE, reference = reference, - species = species, - min_nuc_outside_cdr3 = min_nuc_outside_cdr3 + species = .species, + align_j_gene = .align_j_gene, + min_nuc_outside_cdr3 = .min_nuc_outside_cdr3, + threads = .threads ) return(.data) } -germline_single_df <- function(data, reference, species, min_nuc_outside_cdr3, sample_name = NA) { +germline_single_df <- function(data, + reference, + species, + align_j_gene, + min_nuc_outside_cdr3, + threads, + sample_name = NA) { data %<>% validate_genes_edges(sample_name) %>% add_column_with_first_gene( @@ -90,40 +121,78 @@ germline_single_df <- function(data, reference, species, min_nuc_outside_cdr3, s ) %>% merge_reference_sequences(reference, "J", species, sample_name) %>% validate_chains_length(min_nuc_outside_cdr3, sample_name) %>% - rowwise() %>% - mutate(Germline.sequence = generate_germline_sequence( - seq = get("Sequence"), - v_ref = get("V.sequence"), - j_ref = get("J.sequence"), - v_end = str_length(get("CDR1.nt")) + str_length(get("CDR2.nt")) - + str_length(get("FR1.nt")) + str_length(get("FR2.nt")) - + str_length(get("FR3.nt")), - cdr3_start = get("CDR3.start"), - cdr3_end = get("CDR3.end"), - j_start = get("J.start"), - j3_del = get("J3.Deletions"), - sample_name = sample_name - )) %>% + calculate_germlines_parallel(align_j_gene, threads, sample_name) %>% filter(!is.na(get("Germline.sequence"))) return(data) } -generate_germline_sequence <- function(seq, v_ref, j_ref, v_end, cdr3_start, cdr3_end, j_start, j3_del, sample_name) { - if (any(is.na(c(seq, v_ref, j_ref, v_end, cdr3_start, cdr3_end, j_start, j3_del))) || +calculate_germlines_parallel <- function(data, align_j_gene, threads, sample_name) { + cluster <- makeCluster(threads) + clusterExport(cluster, c("generate_germline_sequence", "align_and_find_j_start", "sample_name"), + envir = environment() + ) + + # rowwise parallel calculation of new columns that are added to data + data <- parApply(cluster, data, 1, function(row) { + generate_germline_sequence( + seq = row[["Sequence"]], + v_ref = row[["V.ref.nt"]], + j_ref = row[["J.ref.nt"]], + cdr1_nt = row[["CDR1.nt"]], + cdr2_nt = row[["CDR2.nt"]], + fr1_nt = row[["FR1.nt"]], + fr2_nt = row[["FR2.nt"]], + fr3_nt = row[["FR3.nt"]], + fr4_nt = row[["FR4.nt"]], + cdr3_start = as.integer(row[["CDR3.start"]]), + cdr3_end = as.integer(row[["CDR3.end"]]), + j_start = as.integer(row[["J.start"]]), + j3_del = as.integer(row[["J3.Deletions"]]), + align_j_gene = align_j_gene, + sample_name = sample_name + ) + }) %>% + map_dfr(~.) %>% + germline_handle_warnings() %>% + cbind(data, .) + + stopCluster(cluster) + return(data) +} + +generate_germline_sequence <- function(seq, v_ref, j_ref, cdr1_nt, cdr2_nt, + fr1_nt, fr2_nt, fr3_nt, fr4_nt, + cdr3_start, cdr3_end, j_start, j3_del, + align_j_gene, sample_name) { + if (any(is.na(c( + seq, v_ref, j_ref, cdr1_nt, cdr2_nt, fr1_nt, fr2_nt, fr3_nt, fr4_nt, + cdr3_start, cdr3_end, j_start, j3_del + ))) || (seq == "")) { - warning( + # warnings cannot be displayed from parApply; save them and display after finish + warn <- paste0( "Some of mandatory fields in a row ", optional_sample("from sample ", sample_name, " "), "contain unexpected NA or empty strings! Found values:\n", - "Sequence = \"", + "Sequence = ", seq, - "\",\nV.sequence = \"", + ",\nV.ref.nt = ", v_ref, - "\",\nJ.sequence = \"", + ",\nJ.ref.nt = ", j_ref, - "\",\nCalculated_V_end = ", - v_end, - ", CDR3.start = ", + ",\nCDR1.nt = ", + cdr1_nt, + ",\nCDR2.nt = ", + cdr2_nt, + ",\nFR1.nt = ", + fr1_nt, + ",\nFR2.nt = ", + fr2_nt, + ",\nFR3.nt = ", + fr3_nt, + ",\nFR4.nt = ", + fr4_nt, + ",\nCDR3.start = ", cdr3_start, ", CDR3.end = ", cdr3_end, @@ -133,28 +202,47 @@ generate_germline_sequence <- function(seq, v_ref, j_ref, v_end, cdr3_start, cdr j3_del, ".\nThe row will be dropped!" ) - return(NA) + return(list( + V.germline.nt = NA, + J.germline.nt = NA, + CDR3.germline.length = NA, + Germline.sequence = NA, + Warning = warn + )) } else { - cdr3_start %<>% as.numeric() - cdr3_end %<>% as.numeric() - j3_del %<>% as.numeric() + v_end <- str_length(cdr1_nt) + str_length(cdr2_nt) + + str_length(fr1_nt) + str_length(fr2_nt) + str_length(fr3_nt) + cdr3_length <- cdr3_end - cdr3_start # trim intersection of V and CDR3 from reference V gene v_part <- str_sub(v_ref, 1, v_end) - cdr3_part <- paste(rep("n", cdr3_end - cdr3_start), collapse = "") + cdr3_part <- paste(rep("n", cdr3_length), collapse = "") # trim intersection of J and CDR3 from reference J gene - j_part <- str_sub(j_ref, max(0, cdr3_end - j_start - j3_del + 1)) + if (align_j_gene) { + calculated_j_start <- align_and_find_j_start(j_ref, fr4_nt) + } else { + calculated_j_start <- max(0, cdr3_end - j_start - j3_del + 1) + } + j_part <- str_sub(j_ref, calculated_j_start) germline <- paste0(v_part, cdr3_part, j_part) %>% toupper() - return(germline) + + # return values for new calculated columns + return(list( + V.germline.nt = v_part, + J.germline.nt = j_part, + CDR3.germline.length = cdr3_length, + Germline.sequence = germline, + Warning = NA + )) } } merge_reference_sequences <- function(data, reference, chain_letter, species, sample_name) { - chain_seq_colname <- paste0(chain_letter, ".sequence") + chain_seq_colname <- paste0(chain_letter, ".ref.nt") chain_allele_colname <- paste0(chain_letter, ".first.allele") colnames(reference) <- c(chain_seq_colname, chain_allele_colname) @@ -284,3 +372,49 @@ validate_chains_length <- function(data, min_nuc_outside_cdr3, sample_name) { } return(data) } + +# align reference J gene and FR4 segment from clonotype to find start of J gene outside of CDR3 +align_and_find_j_start <- function(j_ref, fr4_seq, max_len_diff = 10) { + # max_len_diff is needed to prevent alignment of sequences that are very different in length; + # we are only interested in the left side of alignment + j_len <- str_length(j_ref) + fr4_len <- str_length(fr4_seq) + min_len <- min(j_len, fr4_len) + max_len <- max(j_len, fr4_len) + trim_len <- min(min_len + max_len_diff, max_len) + j_trimmed <- str_sub(j_ref, 1, trim_len) + fr4_trimmed <- str_sub(fr4_seq, 1, trim_len) + + # alignment will contain vector of 2 aligned strings where deletions are "-" characters + alignment <- list(J = j_trimmed, FR4 = fr4_trimmed) %>% + lapply( + function(sequence) { + sequence %>% + str_extract_all(boundary("character")) %>% + unlist() + } + ) %>% + ape::as.DNAbin() %>% + ape::clustal() %>% + as.character() %>% + apply(1, paste, collapse = "") %>% + lapply(toupper) + + num_deletions <- str_length(stringr::str_extract(alignment[["FR4"]], pattern = "^(-+)")) + if (is.na(num_deletions) | (str_length(alignment[["J"]]) <= num_deletions)) { + return(1) + } else { + return(num_deletions + 1) + } +} + +germline_handle_warnings <- function(df) { + warnings <- df$Warning + warnings <- warnings[!is.na(warnings)] + options(warning.length = 5000L) + for (warn in warnings) { + warning(warn) + } + df$Warning <- NULL + return(df) +} diff --git a/R/immunr_data_format.R b/R/immunr_data_format.R index c8e0b4a6..e3dc9c80 100644 --- a/R/immunr_data_format.R +++ b/R/immunr_data_format.R @@ -98,13 +98,15 @@ IMMCOL_EXT$fr4nt <- "FR4.nt" IMMCOL_EXT$fr4aa <- "FR4.aa" IMMCOL_EXT$v3del <- "V3.Deletions" IMMCOL_EXT$j3del <- "J3.Deletions" +IMMCOL_EXT$id <- "Clone.ID" IMMCOL_EXT$order <- c( IMMCOL_EXT$bestv, IMMCOL_EXT$bestj, IMMCOL_EXT$cdr3s, IMMCOL_EXT$cdr3e, IMMCOL_EXT$c, IMMCOL_EXT$cs, IMMCOL_EXT$ce, IMMCOL_EXT$cdr1nt, IMMCOL_EXT$cdr1aa, IMMCOL_EXT$cdr2nt, IMMCOL_EXT$cdr2aa, IMMCOL_EXT$fr1nt, IMMCOL_EXT$fr1aa, IMMCOL_EXT$fr2nt, IMMCOL_EXT$fr2aa, IMMCOL_EXT$fr3nt, IMMCOL_EXT$fr3aa, IMMCOL_EXT$fr4nt, IMMCOL_EXT$fr4aa, - IMMCOL_EXT$v3del, IMMCOL_EXT$j3del + IMMCOL_EXT$v3del, IMMCOL_EXT$j3del, + IMMCOL_EXT$id ) IMMCOL_EXT$type <- c( "character", "character", "integer", "integer", @@ -112,7 +114,8 @@ IMMCOL_EXT$type <- c( "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", - "integer", "integer" + "integer", "integer", + "integer" ) # Additional information diff --git a/R/io-parsers.R b/R/io-parsers.R index 0a6813cd..2d574741 100644 --- a/R/io-parsers.R +++ b/R/io-parsers.R @@ -26,7 +26,7 @@ parse_repertoire <- function(.filename, .mode, .nuc.seq, .aa.seq, .count, ) # IO_REFACTOR - suppressMessages(df <- readr::read_delim(.filename, + df <- quiet(readr::read_delim(.filename, col_names = TRUE, col_types = col.classes, delim = .sep, quote = "", escape_double = FALSE, @@ -401,6 +401,7 @@ parse_mitcr <- function(.filename, .mode) { parse_mixcr <- function(.filename, .mode) { .filename <- .filename + .id <- "cloneid" .count <- "clonecount" .sep <- "\t" .vd.insertions <- "VD.insertions" @@ -707,7 +708,8 @@ parse_mixcr <- function(.filename, .mode) { nuc_headers[[".nuc.seq.fr2"]], aa_headers[[".aa.seq.fr2"]], nuc_headers[[".nuc.seq.fr3"]], aa_headers[[".aa.seq.fr3"]], nuc_headers[[".nuc.seq.fr4"]], aa_headers[[".aa.seq.fr4"]], - pos_extra_headers[["v3del"]], pos_extra_headers[["j3del"]] + pos_extra_headers[["v3del"]], pos_extra_headers[["j3del"]], + .id ) df_ext_column_names <- IMMCOL_EXT$order diff --git a/R/io-utility.R b/R/io-utility.R index b3c51cb9..a43d4870 100644 --- a/R/io-utility.R +++ b/R/io-utility.R @@ -191,11 +191,13 @@ } } - for (col_i in seq_along(IMMCOL$order)) { - colname <- IMMCOL$order[col_i] - if (colname %in% colnames(.data)) { - if (!has_class(.data[[colname]], IMMCOL$type[col_i])) { - .data[[colname]] <- as(.data[[colname]], IMMCOL$type[col_i]) + for (immcol in c(IMMCOL, IMMCOL_EXT)) { + for (col_i in seq_along(immcol$order)) { + colname <- immcol$order[col_i] + if (colname %in% colnames(.data)) { + if (!has_class(.data[[colname]], immcol$type[col_i])) { + .data[[colname]] <- as(.data[[colname]], immcol$type[col_i]) + } } } } diff --git a/R/kmers.R b/R/kmers.R index 6dc65c46..6560565b 100644 --- a/R/kmers.R +++ b/R/kmers.R @@ -1,9 +1,9 @@ -#' Calculate the kmer statistics of immune repertoires +#' Calculate the k-mer statistics of immune repertoires #' #' @importFrom data.table data.table #' @importFrom dplyr as_tibble #' -#' @concept kmers +#' @concept k-mers #' #' @aliases getKmers get.kmers makeKmerTable #' @@ -19,13 +19,13 @@ #' #' Note: each connection must represent a separate repertoire. #' -#' @param .k Integer. Length of kmers. +#' @param .k Integer. Length of k-mers. #' @param .col Character. Which column to use, pass "aa" (by default) for CDR3 amino acid sequence, #' pass "nt" for CDR3 nucleotide sequences. #' @param .coding Logical. If TRUE (by default) then removes all non-coding sequences from input data first. #' #' @return -#' Data frame with two columns (kmers and their counts). +#' Data frame with two columns (k-mers and their counts). #' #' @examples #' data(immdata) @@ -85,7 +85,7 @@ getKmers <- function(.data, .k, .col = c("aa", "nt"), .coding = TRUE) { #' #' @importFrom dplyr as_tibble #' -#' @concept kmers +#' @concept k-mers #' #' @aliases split_to_kmers kmer_profile #' @@ -95,7 +95,7 @@ getKmers <- function(.data, .k, .col = c("aa", "nt"), .coding = TRUE) { #' kmer_profile(.data, .method = c("freq", "prob", "wei", "self"), .remove.stop = TRUE) #' #' @param .data Character vector or the output from \code{getKmers}. -#' @param .k Integer. Size of kmers. +#' @param .k Integer. Size of k-mers. #' @param .method Character vector of length one. If "freq" then returns a position frequency matrix (PFM) - #' a matrix with occurences of each amino acid in each position. #' @@ -110,7 +110,7 @@ getKmers <- function(.data, .k, .col = c("aa", "nt"), .coding = TRUE) { #' @param .remove.stop Logical. If TRUE (by default) remove stop codons. #' #' @return -#' \code{split_to_kmers} - Data frame with two columns (kmers and their counts). +#' \code{split_to_kmers} - Data frame with two columns (k-mers and their counts). #' #' \code{kmer_profile} - a matrix with per-position amino acid statistics. #' @@ -144,7 +144,7 @@ kmer_profile <- function(.data, .method = c("freq", "prob", "wei", "self"), .rem seq_vec <- .data$Kmer cnt_vec <- .data$Count } else if (length(table(nchar(.data))) > 1) { - stop("Not all kmers in the input data have the same length.") + stop("Not all k-mers in the input data have the same length.") } else { seq_vec <- .data cnt_vec <- rep.int(1, length(.data)) diff --git a/R/phylip.R b/R/phylip.R index f052ca73..bdd7c13b 100644 --- a/R/phylip.R +++ b/R/phylip.R @@ -1,5 +1,4 @@ -#' This function uses the PHYLIP package to make phylogenetic analysis. For making trees it uses -#' maximum parsimony methods. +#' Builds a phylogenetic tree using the sequences of a clonal lineage #' #' @concept phylip #' @@ -8,7 +7,7 @@ #' @importFrom magrittr %>% %<>% extract2 #' @importFrom purrr map_dfr #' @importFrom rlist list.remove -#' @importFrom stringr str_match str_count fixed +#' @importFrom stringr str_match str_count fixed str_extract_all str_length str_sub #' @importFrom stringi stri_replace_all_fixed #' @importFrom utils capture.output #' @importFrom parallel mclapply detectCores @@ -17,7 +16,8 @@ #' @importFrom uuid UUIDgenerate #' @importFrom data.table fread -#' @description This function builds a phylogenetic tree using the sequences of a clonal lineage +#' @description This function uses the PHYLIP package to make phylogenetic analysis. +#' For making trees it uses maximum parsimony methods. #' #' @usage #' @@ -37,6 +37,11 @@ #' The dataframe has these columns: #' * Cluster: cluster name #' * Germline.Input: germline sequence, like it was in the input; not trimmed and not aligned +#' * V.germline.nt: input germline V gene sequence +#' * J.germline.nt: input germline J gene sequence +#' * CDR3.germline.length: length of CDR3 in input germline +#' * V.length: length of V gene after trimming on repAlignLineage() step +#' * J.length: length of j gene after trimming on repAlignLineage() step #' * Germline.Output: germline sequence, parsed from PHYLIP dnapars function output; #' it contains difference of germline from the common ancestor; "." characters mean #' matching letters. It's usually shorter than Germline.Input, because germline and @@ -45,6 +50,10 @@ #' * Trunk.Length: mean trunk length, representing the distance between the most recent #' common ancestor and germline sequence as a measure of the maturity of a lineage #' * Tree: output tree in "phylo" format, loaded from by PHYLIP dnapars function output +#' * TreeStats: nested dataframe containing data about tree nodes, needed for visualization +#' * Sequences: nested dataframe containing all sequences for this combination of cluster +#' and germline; it contains regions from original sequences, saved for +#' repSomaticHypermutation() calculation, and also data needed for visualizations #' #' @examples #' @@ -53,30 +62,36 @@ #' #' bcr_data %>% #' seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% -#' repGermline() %>% +#' repGermline(.threads = 1) %>% #' repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) %>% -#' repClonalFamily(.threads = 2, .nofail = TRUE) +#' repClonalFamily(.threads = 1, .nofail = TRUE) #' @export repClonalFamily repClonalFamily <- function(.data, .threads = parallel::detectCores(), .nofail = FALSE) { if (!require_system_package("phylip", error_message = paste0( "repLineagePhylogeny requires PHYLIP app to be installed!\n", "Please install it as described here:\n", "https://evolution.genetics.washington.edu/phylip/install.html" - ), .nofail, is.na(.data))) { - return(NA) + ), .nofail, has_class(.data, "step_failure_ignored"))) { + return(get_empty_object_with_class("step_failure_ignored")) } - results <- .data %>% apply_to_sample_or_list( - process_dataframe, - .with_names = TRUE, - .validate = FALSE, - .threads = .threads - ) + results <- .data %>% + apply_to_sample_or_list( + process_dataframe, + .with_names = TRUE, + .validate = FALSE, + .threads = .threads + ) %>% + add_class("clonal_family") return(results) } process_dataframe <- function(df, .threads, sample_name = NA) { - for (column in c("Cluster", "Germline", "Alignment")) { + required_columns <- c( + "Cluster", "Germline", "V.germline.nt", "J.germline.nt", "CDR3.germline.length", + "V.length", "J.length", "Alignment", "Sequences" + ) + for (column in required_columns) { if (!(column %in% colnames(df))) { stop( "Unrecognized input dataframe format for repClonalFamily: missing \"", @@ -100,7 +115,7 @@ process_dataframe <- function(df, .threads, sample_name = NA) { ) } - df <- df[c("Cluster", "Germline", "Alignment")] + df <- df[required_columns] clusters_list <- split(df, seq(nrow(df))) results <- parallel::mclapply( @@ -114,62 +129,230 @@ process_dataframe <- function(df, .threads, sample_name = NA) { } process_cluster <- function(cluster_row) { - cluster_name <- cluster_row[["Cluster"]] - cluster_germline <- cluster_row[["Germline"]] - # alignment should be extracted from 1-element list because of Alignment column format + # alignment and sequences should be extracted from 1-element lists because of these columns format alignment <- cluster_row[["Alignment"]][[1]] + sequences <- cluster_row[["Sequences"]][[1]] + cluster_name <- cluster_row[["Cluster"]] + cdr3_germline_length <- cluster_row[["CDR3.germline.length"]] + v_trimmed_length <- cluster_row[["V.length"]] + j_trimmed_length <- cluster_row[["J.length"]] + + # positions of starting nucleotide for translation of sequences to amino acids + v_full_length <- str_length(cluster_row[["V.germline.nt"]]) + v_aa_frame_start <- (v_full_length - v_trimmed_length) %% 3 + 1 + j_aa_frame_start <- (v_full_length + cdr3_germline_length) %% 3 + 1 temp_dir <- file.path(tempdir(check = TRUE), uuid::UUIDgenerate(use.time = FALSE)) dir.create(temp_dir) - # bugfix for phylip: it shows "Unexpected end-of-file" for too short sequence labels + # workaround for phylip: it shows "Unexpected end-of-file" for too short sequence labels; + # these \t are also used to read outfile as table rownames(alignment) %<>% paste0("\t") phangorn::write.phyDat(alignment, file.path(temp_dir, "infile")) system( paste0("sh -c \"cd ", temp_dir, "; phylip dnapars infile\""), - input = (c("V", 1, 5, "Y")) + input = (c("V", 1, 5, ".", "Y")) ) %>% - quiet() + quiet(capture_output = TRUE) tree <- ape::read.tree(file.path(temp_dir, "outtree")) + outfile_path <- file.path(temp_dir, "outfile") - outfile_lines <- scan(outfile_path, what = "", sep = "\n", quiet = TRUE) - common_ancestor <- outfile_lines[grepl(" 1 ", outfile_lines)] - germline <- outfile_lines[grepl("1 germline", outfile_lines)] - - common_ancestor %<>% - stringr::str_match(" +1 +(.+) *") %>% - join_to_string() - germline %<>% - stringr::str_match(" +1 +germline\\s+[a-z]* *(.+) *") %>% - join_to_string() - trunk_length <- nchar(germline) - stringr::str_count(germline, stringr::fixed(".")) + outfile_table <- read.table(outfile_path, + sep = "\t", header = FALSE, na.strings = "", stringsAsFactors = FALSE, + fill = TRUE, blank.lines.skip = TRUE + ) + outfile_rows <- split(outfile_table, seq(nrow(outfile_table))) + rm(outfile_table) + header_line_1_idx <- which(grepl("From To", outfile_rows, fixed = TRUE))[1] + # remove start of file, keep only lines with sequences + outfile_rows <- outfile_rows[-seq(1, header_line_1_idx + 1)] + + # iterate over rows, assemble sequences, fill table values + tree_stats <- data.frame(matrix( + ncol = 7, nrow = 0, + dimnames = list(NULL, c( + "Name", "Type", "Clones", "Ancestor", "DistanceNT", "DistanceAA", "Sequence" + )) + )) %>% + add_class("clonal_family_tree") + for (row in outfile_rows) { + clones <- 1 # for all sequences except clonotypes + col1_strings <- str_extract_all(row[[1]], "[[^\\s]]+")[[1]] + if (is.na(row[[2]])) { + # if second element is not a number, then ancestor is NA + if (is.na(quiet(as.numeric(col1_strings[2])))) { + ancestor <- NA + seq_name <- col1_strings[1] + seq <- paste(col1_strings[-1], collapse = "") + } else { + ancestor <- col1_strings[1] + seq_name <- col1_strings[2] + col1_strings <- col1_strings[-c(1, 2)] + if (col1_strings[1] %in% c("yes", "no", "maybe")) { + col1_strings <- col1_strings[-1] + } + seq <- paste(col1_strings, collapse = "") + } + } else { + col2_strings <- str_extract_all(row[[2]], "[[^\\s]]+")[[1]] + ancestor <- col1_strings[1] + seq_name <- col1_strings[2] + seq <- paste(col2_strings[-1], collapse = "") + } + + if (seq_name == "Germline") { + seq_type <- "Germline" + } else if (startsWith(seq_name, "ID_")) { + seq_type <- "Clonotype" + # find clones value by sequence ID + clones <- sequences[ + which(sequences["Clone.ID"] == as.integer(substring(seq_name, 4))), + "Clones" + ] + } else if (identical(ancestor, "Germline")) { + seq_type <- "CommonAncestor" + } else { + seq_type <- "Presumable" + } + + # add sequence to table if it's new, otherwise append nucleotides to the end + if (nrow(tree_stats[which(tree_stats["Name"] == seq_name), ]) == 0) { + tree_stats[nrow(tree_stats) + 1, ] <- c(seq_name, seq_type, clones, ancestor, 0, 0, seq) + } else { + tree_stats[which(tree_stats["Name"] == seq_name), "Sequence"] %<>% paste0(seq) + } + } + + # force germline to be root if phylip had set "1" as its ancestor + germline_ancestor <- tree_stats[which(tree_stats["Type"] == "Germline"), ][1, "Ancestor"] + if (!is.na(germline_ancestor)) { + ancestor_of_ancestor <- + tree_stats[which(tree_stats["Name"] == germline_ancestor), ][1, "Ancestor"] + if (is.na(ancestor_of_ancestor)) { + tree_stats[which(tree_stats["Name"] == "Germline"), ][1, "Ancestor"] <- NA + tree_stats[which(tree_stats["Name"] == germline_ancestor), ][1, "Type"] <- "CommonAncestor" + tree_stats[which(tree_stats["Name"] == germline_ancestor), ][1, "Ancestor"] <- "Germline" + } else { + warning( + "Germline's ancestor is ", + germline_ancestor, + ", and its' ancestor is ", + ancestor_of_ancestor, + ": tree of cluster ", + cluster_name, + " can't be corrected!" + ) + } + } + + # remove ancestors that are not in names + names <- tree_stats[["Name"]] + ancestors <- tree_stats[["Ancestor"]] + for (missing_name in ancestors[!ancestors %in% names]) { + tree_stats["Ancestor"][tree_stats["Ancestor"] == missing_name] <- NA + } + # rename CommonAncestor and Presumable nodes, both in Name and Ancestor columns + renamed_rows <- tree_stats[which(tree_stats[["Type"]] %in% c("CommonAncestor", "Presumable")), ] + old_names <- renamed_rows[["Name"]] + new_names <- as.list(paste0(renamed_rows[["Type"]], "_", renamed_rows[["Name"]])) + names(new_names) <- old_names + for (column in c("Name", "Ancestor")) { + for (row in which(tree_stats[[column]] %in% old_names)) { + tree_stats[[row, column]] <- new_names[[tree_stats[[row, column]]]] + } + } + + germline <- tree_stats[which(tree_stats["Type"] == "Germline"), ][1, "Sequence"] + common_ancestor <- tree_stats[which(tree_stats["Type"] == "CommonAncestor"), ][1, "Sequence"] + + # calculate distances of all sequences from germline + germline_v <- str_sub(germline, 1, v_trimmed_length) + germline_j <- str_sub(germline, -j_trimmed_length) + germline_nt_chars <- strsplit(paste0(germline_v, germline_j), "")[[1]] + + germline_v_aa <- bunch_translate(substring(germline_v, v_aa_frame_start), + .two.way = FALSE, .ignore.n = TRUE + ) + germline_j_aa <- bunch_translate(substring(germline_j, j_aa_frame_start), + .two.way = FALSE, .ignore.n = TRUE + ) + germline_aa_chars <- strsplit(paste0(germline_v_aa, germline_j_aa), "")[[1]] + for (row in seq_len(nrow(tree_stats))) { + if (tree_stats[row, "Type"] != "Germline") { + seq <- tree_stats[row, "Sequence"] + seq_v <- str_sub(seq, 1, v_trimmed_length) + seq_j <- str_sub(seq, -j_trimmed_length) + seq_nt_chars <- strsplit(paste0(seq_v, seq_j), "")[[1]] + if (length(germline_nt_chars) != length(seq_nt_chars)) { + warning( + "Germline and sequence lengths are different; ", + "something is wrong in repClonalFamily calculation!\n", + "germline_nt_chars = ", germline_nt_chars, "\nseq_nt_chars = ", seq_nt_chars + ) + } + tree_stats[row, "DistanceNT"] <- sum(germline_nt_chars != seq_nt_chars) + seq_v_aa <- bunch_translate(substring(seq_v, v_aa_frame_start), + .two.way = FALSE, .ignore.n = TRUE + ) + seq_j_aa <- bunch_translate(substring(seq_j, j_aa_frame_start), + .two.way = FALSE, .ignore.n = TRUE + ) + seq_aa_chars <- strsplit(paste0(seq_v_aa, seq_j_aa), "")[[1]] + if (length(germline_aa_chars) != length(seq_aa_chars)) { + warning( + "Germline and sequence lengths are different; ", + "something is wrong in repClonalFamily calculation!\n", + "germline_aa_chars = ", germline_aa_chars, "\nseq_aa_chars = ", seq_aa_chars + ) + } + tree_stats[row, "DistanceAA"] <- sum(germline_aa_chars != seq_aa_chars) + } + } + + for (column in c("Clones", "DistanceNT", "DistanceAA")) { + tree_stats[[column]] %<>% as.integer() + } + + trunk_length <- tree_stats[which(tree_stats["Ancestor"] == "Germline"), ][1, "DistanceNT"] unlink(temp_dir, recursive = TRUE) # return row of output dataframe as named list return(list( Cluster = cluster_name, - Germline.Input = cluster_germline, + Germline.Input = cluster_row[["Germline"]], + V.germline.nt = cluster_row[["V.germline.nt"]], + J.germline.nt = cluster_row[["J.germline.nt"]], + CDR3.germline.length = cdr3_germline_length, + V.length = v_trimmed_length, + J.length = j_trimmed_length, Germline.Output = germline, Common.Ancestor = common_ancestor, Trunk.Length = trunk_length, - Tree = tree + Tree = tree, + TreeStats = tree_stats, + Sequences = sequences )) } -join_to_string <- function(matching_results) { - matching_results[, 2] %>% - paste(collapse = "") %>% - stringi::stri_replace_all_fixed(" ", "") -} - convert_nested_to_df <- function(nested_results_list) { tree <- nested_results_list %>% lapply(magrittr::extract2, "Tree") %>% tibble(Tree = .) + tree_stats <- nested_results_list %>% + lapply(magrittr::extract2, "TreeStats") %>% + tibble(TreeStats = .) + sequences <- nested_results_list %>% + lapply(magrittr::extract2, "Sequences") %>% + tibble(Sequences = .) df <- nested_results_list %>% - lapply(rlist::list.remove, "Tree") %>% + lapply(rlist::list.remove, c("Tree", "TreeStats", "Sequences")) %>% purrr::map_dfr(~.) %>% - cbind(tree) + cbind(tree, tree_stats, sequences) + # fix column types after dataframe rebuilding + for (column in c("CDR3.germline.length", "V.length", "J.length", "Trunk.Length")) { + df[[column]] %<>% as.integer() + } + df %<>% add_class("clonal_family_df") return(df) } diff --git a/R/seqCluster.R b/R/seqCluster.R index b2e84297..d3d43d1d 100644 --- a/R/seqCluster.R +++ b/R/seqCluster.R @@ -1,5 +1,7 @@ #' Function for assigning clusters based on sequences similarity #' +#' @concept seq_cluster +#' #' @importFrom purrr map map_lgl map_chr map2 map2_chr map_df map2_lgl #' @importFrom magrittr %>% %<>% #' @importFrom reshape2 melt diff --git a/R/somatic_hypermutation.R b/R/somatic_hypermutation.R new file mode 100644 index 00000000..d8a2f91a --- /dev/null +++ b/R/somatic_hypermutation.R @@ -0,0 +1,140 @@ +#' Calculates number of mutations against the germline for each clonotype +#' +#' @concept somatic_hypermutation +#' +#' @aliases repSomaticHypermutation +#' +#' @importFrom magrittr %>% %<>% +#' @importFrom tidyr unnest +#' @importFrom plyr adply +#' @importFrom parallel detectCores +#' @importFrom doParallel registerDoParallel stopImplicitCluster +#' @importFrom ape as.DNAbin clustal +#' +#' @description This function aligns V and J genes from the germline in each cluster +#' with corresponding genes in each clonotype, saves the alignments for purpose of visualization, +#' and calculates number of mutations for each clonotype. +#' +#' @usage +#' +#' repSomaticHypermutation(.data, .threads, .nofail) +#' +#' @param .data The data to be processed: an output of repClonalFamily(); +#' variants with one sample and list of samples are both supported. +#' +#' @param .threads Number of threads to use. +#' +#' @param .nofail Will return NA instead of stopping if Clustal W is not installed. +#' Used to avoid raising errors in examples on computers where Clustal W is not installed. +#' +#' @return +#' +#' Dataframe or list of dataframes (if input is a list with multiple samples). +#' The dataframe has all the columns from repClonalFamily() output dataframe, with Sequence +#' column unnested: the resulting dataframe has one line per clonotype. Clone.ID column +#' contains original IDs for clonotypes, and can be used as dataframe key. +#' New columns are added: +#' * Germline.Alignment.V: contains V gene alignment of current clonotype with the germline +#' * Germline.Alignment.J: contains J gene alignment of current clonotype with the germline +#' * Substitutions: contains number of substitutions in the alignment (summary for V and J) +#' * Insertions: contains number of insertions in the clonotype relative to germline +#' (summary for V and J) +#' * Deletions: contains number of deletions in the clonotype relative to germline +#' (summary for V and J) +#' * Mutations: contains total number of mutations in the alignment (summary for V and J) +#' +#' @examples +#' +#' data(bcrdata) +#' bcr_data <- bcrdata$data +#' +#' bcr_data %>% +#' seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% +#' repGermline(.threads = 1) %>% +#' repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) %>% +#' repClonalFamily(.threads = 1, .nofail = TRUE) %>% +#' repSomaticHypermutation(.threads = 1, .nofail = TRUE) +#' @export repSomaticHypermutation +repSomaticHypermutation <- function(.data, .threads = parallel::detectCores(), .nofail = FALSE) { + if (!require_system_package("clustalw", error_message = paste0( + "repSomaticHypermutation requires Clustal W app to be installed!\n", + "Please download it from here: http://www.clustal.org/download/current/\n", + "or install it with your system package manager (such as apt or dnf)." + ), .nofail, has_class(.data, "step_failure_ignored"))) { + return(get_empty_object_with_class("step_failure_ignored")) + } + + doParallel::registerDoParallel(cores = .threads) + results <- .data %>% apply_to_sample_or_list( + shm_process_dataframe, + .parallel = .threads > 1, + .validate = FALSE + ) + doParallel::stopImplicitCluster() + return(results) +} + +shm_process_dataframe <- function(df, .parallel) { + # convert dataframe from cluster-per-row to clonotype-per-row + # and add columns with alignment and number of mutations + df %<>% unnest("Sequences") %>% plyr::adply( + .fun = shm_process_clonotype_row, + .margins = 1, + .parallel = .parallel + ) + # fix column types after dataframe rebuilding + for (column in c( + "Clone.ID", "Clones", "CDR3.germline.length", "V.length", "J.length", "Trunk.Length", + "Substitutions", "Insertions", "Deletions", "Mutations" + )) { + df[[column]] %<>% as.integer() + } + return(df) +} + +shm_process_clonotype_row <- function(row) { + v_gene_germline <- row[["V.germline.nt"]] + j_gene_germline <- row[["J.germline.nt"]] + v_gene_clonotype <- paste0( + row[["FR1.nt"]], row[["CDR1.nt"]], row[["FR2.nt"]], row[["CDR2.nt"]], row[["FR3.nt"]] + ) + j_gene_clonotype <- row[["FR4.nt"]] + + alignments <- list( + V = list(v_gene_germline, v_gene_clonotype), + J = list(j_gene_germline, j_gene_clonotype) + ) + + substitutions <- 0 + insertions <- 0 + deletions <- 0 + for (gene in c("V", "J")) { + names(alignments[[gene]]) <- c("Germline", paste0("ID_", row[["Clone.ID"]])) + dnabin <- convert_seq_list_to_dnabin(alignments[[gene]]) + + # align gene from germline and gene from clonotype + alignments[[gene]] <- ape::clustal(dnabin) + + # calculate numbers of mutations for current gene, add them to the sums + alignment <- as.character(alignments[[gene]]) + germline <- alignment[1, ] + clonotype <- alignment[2, ] + substitutions <- substitutions + + sum((germline != clonotype) & (germline != "-") & (clonotype != "-")) + insertions <- insertions + sum(germline == "-") + deletions <- deletions + sum(clonotype == "-") + } + + # using list(NA) as workaround for "Assigned data must be compatible with existing data": + # writing list(NA), which is considered compatible by adply, then storing alignment in list + row[["Germline.Alignment.V"]] <- list(NA) + row[["Germline.Alignment.V"]][[1]] <- alignments[["V"]] + row[["Germline.Alignment.J"]] <- list(NA) + row[["Germline.Alignment.J"]][[1]] <- alignments[["J"]] + row[["Substitutions"]] <- substitutions + row[["Insertions"]] <- insertions + row[["Deletions"]] <- deletions + row[["Mutations"]] <- substitutions + insertions + deletions + + return(row) +} diff --git a/R/tools.R b/R/tools.R index 6e80d6dd..dfcda321 100644 --- a/R/tools.R +++ b/R/tools.R @@ -85,7 +85,6 @@ add_class <- function(.obj, .class) { .obj } - #' Check for the specific class #' #' @concept utility_private @@ -108,6 +107,9 @@ has_class <- function(.data, .class) { .class %in% class(.data) } +get_empty_object_with_class <- function(.class) { + add_class(NA, .class) +} #' Set and update progress bars #' @@ -204,6 +206,7 @@ matrixdiagcopy <- function(.mat) { #' #' @param .seq Vector or list of strings. #' @param .two.way Logical. If TRUE (default) then translate from the both ends (like MIXCR). +#' @param .ignore.n Logical. If FALSE (default) then return NA for sequences that have N, else parse triplets with N as ~ #' #' @return #' Character vector of translated input sequences. @@ -212,9 +215,11 @@ matrixdiagcopy <- function(.mat) { #' data(immdata) #' head(bunch_translate(immdata$data[[1]]$CDR3.nt)) #' @export -bunch_translate <- function(.seq, .two.way = TRUE) { +bunch_translate <- function(.seq, .two.way = TRUE, .ignore.n = FALSE) { .seq <- toupper(.seq) - .seq[grepl("N", .seq)] <- NA + if (!.ignore.n) { + .seq[grepl("N", .seq)] <- NA + } sapply(.seq, function(y) { if (!is.na(y)) { @@ -233,7 +238,9 @@ bunch_translate <- function(.seq, .two.way = TRUE) { } else { y <- substring(y, seq(1, nchar(y) - 2, 3), seq(3, nchar(y), 3)) } - paste0(AA_TABLE[unlist(strsplit(gsub("(...)", "\\1_", y), "_"))], collapse = "") + aa <- AA_TABLE[unlist(strsplit(gsub("(...)", "\\1_", y), "_"))] + aa %<>% replace(is.na(aa), "~") + paste0(aa, collapse = "") } else { NA } @@ -482,8 +489,15 @@ as_numeric_or_fail <- function(.string) { return(result) } +has_no_data <- function(.data) { + any(sapply(list(NA, NULL, NaN), identical, .data)) +} + # apply function to .data if it's a single sample or to each sample if .data is a list of samples apply_to_sample_or_list <- function(.data, .function, .with_names = FALSE, .validate = TRUE, ...) { + if (has_no_data(.data)) { + stop("Expected non-empty data; found: ", .data) + } if (inherits(.data, "list")) { if (.validate) { .validate_repertoires_data(.data) @@ -614,9 +628,31 @@ j_len_outside_cdr3 <- function(seq, j_start, cdr3_end) { stringr::str_length(seq) - pmax(j_start, as.numeric(cdr3_end)) } -quiet <- function(procedure) { - procedure %>% - capture.output() %>% - invisible() %>% - suppressMessages() +convert_seq_list_to_dnabin <- function(seq_list) { + dnabin <- seq_list %>% + lapply( + function(sequence) { + sequence %>% + stringr::str_extract_all(stringr::boundary("character")) %>% + unlist() + } + ) %>% + ape::as.DNAbin() + return(dnabin) +} + +# capture_output() suppresses stdout, but also drops returned value +quiet <- function(procedure, capture_output = FALSE) { + if (capture_output) { + procedure %>% + capture.output() %>% + invisible() %>% + suppressMessages() %>% + suppressWarnings() + } else { + procedure %>% + invisible() %>% + suppressMessages() %>% + suppressWarnings() + } } diff --git a/R/vis.R b/R/vis.R index 34bf3d98..4161dab1 100644 --- a/R/vis.R +++ b/R/vis.R @@ -115,6 +115,9 @@ theme_cleveland2 <- function(rotate = TRUE) { #' @import ggplot2 #' @importFrom factoextra fviz_cluster fviz_dend fviz_pca_ind #' @importFrom grDevices colorRampPalette +#' @importFrom tidyr drop_na +#' @importFrom igraph graph_from_data_frame +#' @importFrom ggraph ggraph geom_edge_diagonal geom_node_point theme_graph #' #' @description Output from every function in immunarch can be visualised with a #' single function - \code{vis}. The \code{vis} automatically detects @@ -157,6 +160,10 @@ theme_cleveland2 <- function(rotate = TRUE) { #' #' - Diversity estimations (from \link{repDiversity}) - see \link{vis.immunr_chao1}. #' +#' BCR analysis: +#' +#' - Clonal tree (from \link{repClonalFamily}) - see \link{vis.clonal_family} and \link{vis.clonal_family_tree}. +#' #' Advanced analysis: #' #' - Repertoire dynamics (from \link{trackClonotypes}) - see \link{vis.immunr_dynamics}; @@ -2955,13 +2962,104 @@ vis.immunr_dynamics <- function(.data, .plot = c("smooth", "area", "line"), .ord theme_pubr(legend = "right") + rotate_x_text(90) + theme_cleveland2() } +#' Visualise clonal family tree: wrapper for calling on the entire repClonalFamily output +#' +#' @concept phylip +#' +#' @param .data Clonal families from 1 or multiple samples: \code{\link{repClonalFamily}} output. +#' @param ... Not used here. +#' +#' @return +#' A ggraph object. +#' +#' @examples +#' data(bcrdata) +#' bcr_data <- bcrdata$data +#' +#' clonal_family <- bcr_data %>% +#' seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% +#' repGermline(.threads = 1) %>% +#' repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) %>% +#' repClonalFamily(.threads = 1, .nofail = TRUE) %>% +#' vis() +#' @export +vis.clonal_family <- function(.data, ...) { + if (inherits(.data, "clonal_family_df")) { + if (nrow(.data) > 1) { + warning( + "Warning! Data has more than 1 cluster!\n", + "Clonal tree will be drawn for the 1st cluster.\n", + "To draw a tree for a specific cluster, use:\n", + "vis(data[[\"TreeStats\"]][[cluster_number]])" + ) + } + df <- .data + } else { + if ((length(.data) > 1) | (nrow(.data[[1]]) > 1)) { + warning( + "Warning! Data has more than 1 cluster!\n", + "Clonal tree will be drawn for the 1st cluster in the 1st sample.\n", + "To draw a tree for a specific cluster, use:\n", + "vis(data[[\"sample_name\"]][[\"TreeStats\"]][[cluster_number]])" + ) + } + df <- .data[[1]] + } + vis(df[["TreeStats"]][[1]], ...) +} +#' Visualise clonal family tree +#' +#' @concept phylip +#' +#' @param .data Single clonal family tree data from 1 cluster: 1 element from TreeStats column from \code{\link{repClonalFamily}} output. +#' @param ... Not used here. +#' +#' @return +#' A ggraph object. +#' +#' @examples +#' data(bcrdata) +#' bcr_data <- bcrdata$data +#' +#' clonal_family <- bcr_data %>% +#' seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% +#' repGermline(.threads = 1) %>% +#' repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) %>% +#' repClonalFamily(.threads = 1, .nofail = TRUE) +#' +#' # This condition can be omitted; it prevents the example from crashing +#' # when ClustalW or PHYLIP are not installed +#' if (!("step_failure_ignored" %in% class(clonal_family))) { +#' vis(clonal_family[["full_clones"]][["TreeStats"]][[2]]) +#' } +#' @export +vis.clonal_family_tree <- function(.data, ...) { + links_df <- .data[c("Ancestor", "Name")] %>% + drop_na("Ancestor") + names(links_df) <- c("from", "to") + vertices_df <- .data[c("Name", "Type", "Clones")] + names(vertices_df)[1] <- "name" + + tree_graph <- graph_from_data_frame(links_df, vertices = vertices_df) %>% + ggraph("tree") + + geom_edge_diagonal() + + geom_node_point(aes(color = Type, size = Clones)) + + theme_graph(base_family = "sans") + + return(tree_graph) +} -# vis.immunr_mutation_network <- function (.data) { -# stop(IMMUNR_ERROR_NOT_IMPL) -# } - - -# vis.immunr_cdr_prop <- function (.data, .by = NA, .meta = NA, .plot = c("box", "hist")) { -# stop(IMMUNR_ERROR_NOT_IMPL) -# } +#' Handler for .nofail argument of pipeline steps that prevents examples from crashing +#' on computers where certain dependencies are not installed +#' +#' @param .data Not used here. +#' @param ... Not used here. +#' +#' @return +#' An empty object with "step_failure_ignored" class. +#' +#' @export +vis.step_failure_ignored <- function(.data, ...) { + return(get_empty_object_with_class("step_failure_ignored")) +} diff --git a/README.md b/README.md index 66b963a6..5929c6fd 100644 --- a/README.md +++ b/README.md @@ -40,18 +40,18 @@ repDiversity(immdata$data) %>% vis(.by = "Status", .meta = immdata$meta) # ## Stay connected!
-
-
- - - - -
- -
+
+
+ + + + +
+ +
- - + +
--- @@ -69,11 +69,11 @@ repDiversity(immdata$data) %>% vis(.by = "Status", .meta = immdata$meta) # ## Introduction -`immunarch` is an R package designed to analyse T-cell receptor (TCR) and B-cell receptor (BCR) repertoires, aimed at medical scientists and bioinformaticians. The mission of `immunarch` is to make immune sequencing data analysis as effortless as possible and help you focus on research instead of coding. +`immunarch` is an R package designed to analyse T-cell receptor (TCR) and B-cell receptor (BCR) repertoires, mainly tailored to medical scientists and bioinformaticians. The mission of `immunarch` is to make immune sequencing data analysis as effortless as possible and help you focus on research instead of coding. ## Contact -Create a ticket with a bug or question on [GitHub Issues](https://github.com/immunomind/immunarch/issues) to help the community help you and enrich it with your experience. If you need to send us a sensitive data, feel free to contact us via [support@immunomind.io](mailto:support@immunomind.io). +Create a ticket with a bug or question on [GitHub Issues](https://github.com/immunomind/immunarch/issues) to get help from the community and enrich it with your experience. If you need to send us sensitive data, feel free to contact us via [support@immunomind.io](mailto:support@immunomind.io). ## Installation @@ -85,7 +85,7 @@ In order to install `immunarch` execute the following command: install.packages("immunarch") ``` -That's it, you can start using `immunarch` now! See the [Quick Start](#quick-start) section below to dive into immune repertoire data analysis. If you run in any trouble with installation, take a look at the [Installation Troubleshooting](https://immunarch.com/articles/v1_introduction.html#installation-troubleshooting) section. +That's it, you can start using `immunarch` now! See the [Quick Start](#quick-start) section below to dive into immune repertoire data analysis. If you run in any trouble during installation, take a look at the [Installation Troubleshooting](https://immunarch.com/articles/v1_introduction.html#installation-troubleshooting) section. Note: there are quite a lot of dependencies to install with the package because it installs all the widely-used packages for data analysis and visualisation. You got both the AIRR data analysis framework and the full Data Science package ecosystem with only one command, making `immunarch` the entry-point for single-cell & immune repertoire Data Science. @@ -94,17 +94,19 @@ Note: there are quite a lot of dependencies to install with the package because If the above command doesn't work for any reason, try installing `immunarch` directly from its repository: ```r -install.packages("devtools") # skip this if you already installed devtools +install.packages(c("devtools", "pkgload")) # skip this if you already installed these packages devtools::install_github("immunomind/immunarch") +devtools::reload(pkgload::inst("immunarch")) ``` ### Latest pre-release on GitHub -Since releasing on CRAN is limited to one release per one-two months, you can install the latest pre-release version with bleeding edge features and optimisations directly from the code repository. In order to install the latest pre-release version, you need to execute only two commands: +Since releasing on CRAN is limited to one release per one or two months, you can install the latest pre-release version with all the bleeding edge and optimised features directly from the code repository. In order to install the latest pre-release version, you need to execute the following commands: ```r -install.packages("devtools") # skip this if you already installed devtools +install.packages(c("devtools", "pkgload")) # skip this if you already installed these packages devtools::install_github("immunomind/immunarch", ref="dev") +devtools::reload(pkgload::inst("immunarch")) ``` You can find the list of releases of `immunarch` here: https://github.com/immunomind/immunarch/releases @@ -124,7 +126,7 @@ You can find the list of releases of `immunarch` here: https://github.com/immuno 2. Beginner-friendly. Immune repertoire analysis made simple: - + Most methods are incorporated in a couple of main functions with clear naming---no more remembering tens and tens of functions with obscure names. For details see [link](https://immunarch.com/articles/web_only/v3_basic_analysis.html); + + Most methods are incorporated in a couple of main functions with clear naming---no more remembering dozens and dozens of functions with obscure names. For details see [link](https://immunarch.com/articles/web_only/v3_basic_analysis.html); + Repertoire overlap analysis *(common indices including overlap coefficient, Jaccard index and Morisita's overlap index)*. Tutorial is available [here](https://immunarch.com/articles/web_only/v4_overlap.html); @@ -134,7 +136,7 @@ You can find the list of releases of `immunarch` here: https://github.com/immuno + Tracking of clonotypes across time points, widely used in vaccination and cancer immunology domains. Tutorial is available [here](https://immunarch.com/articles/web_only/v8_tracking.html); - + Kmer distribution measures and statistics. Tutorial is available [here](https://immunarch.com/articles/web_only/v9_kmers.html); + + K-mer distribution measures and statistics. Tutorial is available [here](https://immunarch.com/articles/web_only/v9_kmers.html); + Coming in the next releases: CDR3 amino acid physical and chemical properties assessment, mutation networks. @@ -182,7 +184,7 @@ immdata <- repLoad("path/to/your/data") # Replace it with the path to your data ### Advanced methods -For advanced methods such as clonotype annotation, clonotype tracking, kmer analysis and public repertoire analysis see "Tutorials". +For advanced methods such as clonotype annotation, clonotype tracking, k-mer analysis and public repertoire analysis see "Tutorials". ## Bugs and Issues @@ -199,7 +201,7 @@ Bug reports must: ## Help the community -Have an aspiration to help the community build the ecosystem of scRNAseq & AIRR analysis tools? Found a bug? A typo? Would like to improve a documentation, add a method or optimise an algorithm? +Aspiring to help the community build the ecosystem of scRNAseq & AIRR analysis tools? Found a bug? A typo? Would like to improve documentation, add a method or optimise an algorithm? We are always open to contributions. There are two ways to contribute: diff --git a/_pkgdown.yml b/_pkgdown.yml index 812a8a7b..29b16ceb 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -19,7 +19,7 @@ authors: href: https://www.linkedin.com/in/aleksandr-popov-634878105/ Ivan Balashov: href: https://www.linkedin.com/in/ivan-balashov/ - Maria Volobueva: + Maria Samokhina: href: https://www.linkedin.com/in/maria-volobueva-b0856a223/ articles: - title: All tutorials @@ -41,6 +41,8 @@ articles: - '`web_only/community`' - '`web_only/repFilter_v3`' - '`web_only/v10_prop`' + - '`web_only/BCRpipeline`' + - '`web_only/clustering`' navbar: structure: left: [articles, reference, covid19] @@ -93,6 +95,12 @@ navbar: - text: Preparing to publication - text: Make your plots publication-ready with fixVis href: articles/web_only/v7_fixvis.html + - text: '---' + - text: BCR analysis + - text: BCR pipeline + href: articles/web_only/BCRpipeline.html + - text: Clustering + href: articles/web_only/clustering.html twitter: icon: fa-lg fa-twitter href: http://twitter.com/immunomind @@ -117,6 +125,16 @@ reference: - subtitle: Preprocessing contents: - has_concept("preprocessing") +- title: BCR data +- subtitle: Clustering + contents: + - has_concept("distance") + - has_concept("seq_cluster") +- subtitle: BCR pipeline + contents: + - has_concept("germline") + - has_concept("align_lineage") + - has_concept("phylip") - title: Basic immune repertoire statistics - subtitle: Exploratory data analysis contents: @@ -151,7 +169,7 @@ reference: - title: Advanced immune repertoire analysis - subtitle: Kmers analysis contents: - - has_concept("kmers") + - has_concept("k-mers") - subtitle: Post-analysis desc: Advanced methods for post-analysis of gene usage, overlap and other statistics. contents: diff --git a/data/bcrdata.rda b/data/bcrdata.rda index fd8cbdb7..d2377550 100644 Binary files a/data/bcrdata.rda and b/data/bcrdata.rda differ diff --git a/man/bunch_translate.Rd b/man/bunch_translate.Rd index 4a3aa8da..29acfaf9 100644 --- a/man/bunch_translate.Rd +++ b/man/bunch_translate.Rd @@ -5,12 +5,14 @@ \alias{translate_bunch} \title{Nucleotide to amino acid sequence translation} \usage{ -bunch_translate(.seq, .two.way = TRUE) +bunch_translate(.seq, .two.way = TRUE, .ignore.n = FALSE) } \arguments{ \item{.seq}{Vector or list of strings.} \item{.two.way}{Logical. If TRUE (default) then translate from the both ends (like MIXCR).} + +\item{.ignore.n}{Logical. If FALSE (default) then return NA for sequences that have N, else parse triplets with N as ~} } \value{ Character vector of translated input sequences. diff --git a/man/getKmers.Rd b/man/getKmers.Rd index 1239a0c1..bfa423d8 100644 --- a/man/getKmers.Rd +++ b/man/getKmers.Rd @@ -4,7 +4,7 @@ \alias{getKmers} \alias{get.kmers} \alias{makeKmerTable} -\title{Calculate the kmer statistics of immune repertoires} +\title{Calculate the k-mer statistics of immune repertoires} \usage{ getKmers(.data, .k, .col = c("aa", "nt"), .coding = TRUE) } @@ -21,7 +21,7 @@ of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire.} -\item{.k}{Integer. Length of kmers.} +\item{.k}{Integer. Length of k-mers.} \item{.col}{Character. Which column to use, pass "aa" (by default) for CDR3 amino acid sequence, pass "nt" for CDR3 nucleotide sequences.} @@ -29,14 +29,14 @@ pass "nt" for CDR3 nucleotide sequences.} \item{.coding}{Logical. If TRUE (by default) then removes all non-coding sequences from input data first.} } \value{ -Data frame with two columns (kmers and their counts). +Data frame with two columns (k-mers and their counts). } \description{ -Calculate the kmer statistics of immune repertoires +Calculate the k-mer statistics of immune repertoires } \examples{ data(immdata) kmers <- getKmers(immdata$data[[1]], 5) kmers \%>\% vis() } -\concept{kmers} +\concept{k-mers} diff --git a/man/repAlignLineage.Rd b/man/repAlignLineage.Rd index 39cf0d9f..2b39ba64 100644 --- a/man/repAlignLineage.Rd +++ b/man/repAlignLineage.Rd @@ -2,11 +2,7 @@ % Please edit documentation in R/align_lineage.R \name{repAlignLineage} \alias{repAlignLineage} -\title{This function aligns all sequences (incliding germline) that belong to one clonal lineage and one cluster. -After clustering and building the clonal lineage and germline, the next step is to analyze the degree of mutation -and maturity of each clonal lineage. This allows for finding high mature cells and cells with a large -number of offspring. The phylogenetic analysis will find mutations that increase the affinity of BCR. -Making alignment of the sequence is the first step towards sequence analysis including BCR.} +\title{Aligns all sequences incliding germline within each clonal lineage within each cluster} \usage{ repAlignLineage(.data, .min_lineage_sequences, .prepare_threads, .align_threads, .verbose_output, .nofail) @@ -26,13 +22,13 @@ Please note that high number can cause heavy memory usage!} \item{.align_threads}{Number of threads for lineage alignment. It must have columns in the immunarch compatible format \link{immunarch_data_format}, and also -must contain 'Cluster' column, which is added by seqCluster() function, and 'Sequence.germline' +must contain 'Cluster' column, which is added by seqCluster() function, and 'Germline.sequence' column, which is added by repGermline() function.} \item{.verbose_output}{If TRUE, all output dataframe columns will be included (see documentation about this function return), and unaligned clusters will be included in the output. Setting this to TRUE significantly -increases memory usage. If FALSE, only aligned clusters and columns required for repClonalFamily() calculation -will be included in the output.} +increases memory usage. If FALSE, only aligned clusters and columns required for repClonalFamily() and +repSomaticHypermutation() calculation will be included in the output.} \item{.nofail}{Will return NA instead of stopping if Clustal W is not installed. Used to avoid raising errors in examples on computers where Clustal W is not installed.} @@ -42,19 +38,29 @@ Dataframe or list of dataframes (if input is a list with multiple samples). The dataframe has these columns: * Cluster: cluster name * Germline: germline sequence +* V.germline.nt: germline V gene sequence +* J.germline.nt: germline J gene sequence +* CDR3.germline.length: length of CDR3 in germline * Aligned (included if .verbose_output=TRUE): FALSE if this group of sequences was not aligned with lineage (.min_lineage_sequences is below the threshold); TRUE if it was aligned * Alignment: DNAbin object with alignment or DNAbin object with unaligned sequences (if Aligned=FALSE) -* V.length (included if .verbose_output=TRUE): shortest length of V gene part outside of CDR3 region in this +* V.length: shortest length of V gene part outside of CDR3 region in this group of sequences; longer V genes (including germline) are trimmed to this length before alignment -* J.length (included if .verbose_output=TRUE): shortest length of J gene part outside of CDR3 region in this +* J.length: shortest length of J gene part outside of CDR3 region in this group of sequences; longer J genes (including germline) are trimmed to this length before alignment -* Sequences (included if .verbose_output=TRUE): nested dataframe containing all sequences for this combination +* Sequences: nested dataframe containing all sequences for this combination of cluster and germline; it has columns - Sequence, V.end, J.start, CDR3.start, CDR3.end; all values taken from the input dataframe + Sequence, Clone.ID, Clones, CDR1.nt, CDR2.nt, CDR3.nt, FR1.nt, FR2.nt, FR3.nt, FR4.nt + and, if .verbose_output=TRUE, also V.end, J.start, CDR3.start, CDR3.end; + all values taken from the input dataframe } \description{ -Aligns all sequences incliding germline within each clonal lineage within each cluster +This function aligns all sequences (incliding germline) that belong to one clonal +lineage and one cluster. After clustering and building the clonal lineage and germline, the next +step is to analyze the degree of mutation and maturity of each clonal lineage. This allows for +finding high mature cells and cells with a large number of offspring. The phylogenetic analysis +will find mutations that increase the affinity of BCR. Making alignment of the sequence +is the first step towards sequence analysis including BCR. } \examples{ @@ -63,7 +69,7 @@ bcr_data <- bcrdata$data bcr_data \%>\% seqCluster(seqDist(bcr_data), .fixed_threshold = 3) \%>\% - repGermline() \%>\% + repGermline(.threads = 1) \%>\% repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) } \concept{align_lineage} diff --git a/man/repClonalFamily.Rd b/man/repClonalFamily.Rd index c604dc76..3e769db3 100644 --- a/man/repClonalFamily.Rd +++ b/man/repClonalFamily.Rd @@ -2,8 +2,7 @@ % Please edit documentation in R/phylip.R \name{repClonalFamily} \alias{repClonalFamily} -\title{This function uses the PHYLIP package to make phylogenetic analysis. For making trees it uses -maximum parsimony methods.} +\title{Builds a phylogenetic tree using the sequences of a clonal lineage} \usage{ repClonalFamily(.data, .threads, .nofail) } @@ -21,6 +20,11 @@ Dataframe or list of dataframes (if input is a list with multiple samples). The dataframe has these columns: * Cluster: cluster name * Germline.Input: germline sequence, like it was in the input; not trimmed and not aligned +* V.germline.nt: input germline V gene sequence +* J.germline.nt: input germline J gene sequence +* CDR3.germline.length: length of CDR3 in input germline +* V.length: length of V gene after trimming on repAlignLineage() step +* J.length: length of j gene after trimming on repAlignLineage() step * Germline.Output: germline sequence, parsed from PHYLIP dnapars function output; it contains difference of germline from the common ancestor; "." characters mean matching letters. It's usually shorter than Germline.Input, because germline and @@ -29,9 +33,14 @@ The dataframe has these columns: * Trunk.Length: mean trunk length, representing the distance between the most recent common ancestor and germline sequence as a measure of the maturity of a lineage * Tree: output tree in "phylo" format, loaded from by PHYLIP dnapars function output +* TreeStats: nested dataframe containing data about tree nodes, needed for visualization +* Sequences: nested dataframe containing all sequences for this combination of cluster + and germline; it contains regions from original sequences, saved for + repSomaticHypermutation() calculation, and also data needed for visualizations } \description{ -This function builds a phylogenetic tree using the sequences of a clonal lineage +This function uses the PHYLIP package to make phylogenetic analysis. +For making trees it uses maximum parsimony methods. } \examples{ @@ -40,8 +49,8 @@ bcr_data <- bcrdata$data bcr_data \%>\% seqCluster(seqDist(bcr_data), .fixed_threshold = 3) \%>\% - repGermline() \%>\% + repGermline(.threads = 1) \%>\% repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) \%>\% - repClonalFamily(.threads = 2, .nofail = TRUE) + repClonalFamily(.threads = 1, .nofail = TRUE) } \concept{phylip} diff --git a/man/repGermline.Rd b/man/repGermline.Rd index ca646c45..244bf5c7 100644 --- a/man/repGermline.Rd +++ b/man/repGermline.Rd @@ -2,14 +2,10 @@ % Please edit documentation in R/germline.R \name{repGermline} \alias{repGermline} -\title{This function creates germlines for clonal lineages. B cell clonal lineage represents a set of B cells -that presumably have a common origin (arising from the same VDJ rearrangement event) and a common ancestor. -Each clonal lineage has its own germline sequence that represents the ancestral sequence -for each BCR in clonal lineage. In other words, germline sequence is a sequence of B-cells immediately -after VDJ recombination, before B-cell maturation and hypermutation process. Germline sequence is useful -for assessing the degree of mutation and maturity of the repertoire.} +\title{Creates germlines for clonal lineages} \usage{ -repGermline(.data, species, min_nuc_outside_cdr3, ref_only_first) +repGermline(.data, +.species, .align_j_gene, .min_nuc_outside_cdr3, .threads) } \arguments{ \item{.data}{The data to be processed. Can be \link{data.frame}, \link{data.table} @@ -17,33 +13,49 @@ or a list of these objects. It must have columns in the immunarch compatible format \link{immunarch_data_format}.} -\item{species}{Species from which the data was acquired. Available options: +\item{.species}{Species from which the data was acquired. Available options: "HomoSapiens" (default), "MusMusculus", "BosTaurus", "CamelusDromedarius", "CanisLupusFamiliaris", "DanioRerio", "MacacaMulatta", "MusMusculusDomesticus", "MusMusculusCastaneus", "MusMusculusMolossinus", "MusMusculusMusculus", "MusSpretus", "OncorhynchusMykiss", "OrnithorhynchusAnatinus", "OryctolagusCuniculus", "RattusNorvegicus", "SusScrofa".} -\item{min_nuc_outside_cdr3}{This parameter sets how many nucleotides should have V or J chain +\item{.align_j_gene}{MiXCR provides the number of J indels only for 1 allele of J gene +in the output file, and a germline can contain another allele. Therefore, calculation of +J gene start in reference based on numbers from input file can be sometimes incorrect. +As result, J gene in the germline will be trimmed in the start or will contain some +nucleotides from CDR3. Setting this parameter to TRUE enables precise alignment of J genes +to detect the correct starting point, but it significantly reduces performance.} + +\item{.min_nuc_outside_cdr3}{This parameter sets how many nucleotides should have V or J chain outside of CDR3 to be considered good for further alignment.} -\item{ref_only_first}{This parameter, if TRUE, means to take only first sequence from reference -for each allele name; if FALSE, all sequences will be taken, and the output table will -increase in size as a result.} +\item{.threads}{Number of threads to use.} } \value{ -Data with added columns V.first.allele, J.first.allele (with first alleles of V and J genes), -V.sequence, J.sequence (with V and J reference sequences), -Germline.sequence (with combined germline sequence) +Data with added columns: +* V.first.allele, J.first.allele (first alleles of V and J genes), +* V.ref.nt, J.ref.nt (V and J reference sequences), +* V.germline.nt, J.germline.nt (V and J germline sequences; they are references with + trimmed parts that are from CDR3), +* CDR3.germline.length (length of CDR3 in the germline), +* Germline.sequence (combined germline sequence) } \description{ -Creates germlines for clonal lineages +This function creates germlines for clonal lineages. B cell clonal lineage +represents a set of B cells that presumably have a common origin (arising from the same VDJ +rearrangement event) and a common ancestor. Each clonal lineage has its own germline sequence +that represents the ancestral sequence for each BCR in clonal lineage. In other words, +germline sequence is a sequence of B-cells immediately after VDJ recombination, before +B-cell maturation and hypermutation process. Germline sequence is useful for assessing +the degree of mutation and maturity of the repertoire. } \examples{ data(bcrdata) bcrdata$data \%>\% - repGermline() + top(5) \%>\% + repGermline(.threads = 1) } \concept{germline} diff --git a/man/repSomaticHypermutation.Rd b/man/repSomaticHypermutation.Rd new file mode 100644 index 00000000..e6d8e767 --- /dev/null +++ b/man/repSomaticHypermutation.Rd @@ -0,0 +1,50 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/somatic_hypermutation.R +\name{repSomaticHypermutation} +\alias{repSomaticHypermutation} +\title{Calculates number of mutations against the germline for each clonotype} +\usage{ +repSomaticHypermutation(.data, .threads, .nofail) +} +\arguments{ +\item{.data}{The data to be processed: an output of repClonalFamily(); +variants with one sample and list of samples are both supported.} + +\item{.threads}{Number of threads to use.} + +\item{.nofail}{Will return NA instead of stopping if Clustal W is not installed. +Used to avoid raising errors in examples on computers where Clustal W is not installed.} +} +\value{ +Dataframe or list of dataframes (if input is a list with multiple samples). +The dataframe has all the columns from repClonalFamily() output dataframe, with Sequence +column unnested: the resulting dataframe has one line per clonotype. Clone.ID column +contains original IDs for clonotypes, and can be used as dataframe key. +New columns are added: +* Germline.Alignment.V: contains V gene alignment of current clonotype with the germline +* Germline.Alignment.J: contains J gene alignment of current clonotype with the germline +* Substitutions: contains number of substitutions in the alignment (summary for V and J) +* Insertions: contains number of insertions in the clonotype relative to germline + (summary for V and J) +* Deletions: contains number of deletions in the clonotype relative to germline + (summary for V and J) +* Mutations: contains total number of mutations in the alignment (summary for V and J) +} +\description{ +This function aligns V and J genes from the germline in each cluster +with corresponding genes in each clonotype, saves the alignments for purpose of visualization, +and calculates number of mutations for each clonotype. +} +\examples{ + +data(bcrdata) +bcr_data <- bcrdata$data + +bcr_data \%>\% + seqCluster(seqDist(bcr_data), .fixed_threshold = 3) \%>\% + repGermline(.threads = 1) \%>\% + repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) \%>\% + repClonalFamily(.threads = 1, .nofail = TRUE) \%>\% + repSomaticHypermutation(.threads = 1, .nofail = TRUE) +} +\concept{somatic_hypermutation} diff --git a/man/seqCluster.Rd b/man/seqCluster.Rd index 92e092d4..e726ad0b 100644 --- a/man/seqCluster.Rd +++ b/man/seqCluster.Rd @@ -36,3 +36,4 @@ input_data <- lapply(immdata$data[1:2], head, 500) dist_result <- seqDist(input_data) cluster_result <- seqCluster(input_data, dist_result, .fixed_threshold = 1) } +\concept{seq_cluster} diff --git a/man/seqDist.Rd b/man/seqDist.Rd index 999bc3f5..d8923737 100644 --- a/man/seqDist.Rd +++ b/man/seqDist.Rd @@ -62,3 +62,4 @@ f <- function(x, y) { seqDist(immdata$data[1:2], .method = f, .group_by_seqLength = FALSE) } +\concept{distance} diff --git a/man/split_to_kmers.Rd b/man/split_to_kmers.Rd index c4e8e5cb..c456ce7b 100644 --- a/man/split_to_kmers.Rd +++ b/man/split_to_kmers.Rd @@ -12,7 +12,7 @@ kmer_profile(.data, .method = c("freq", "prob", "wei", "self"), .remove.stop = T \arguments{ \item{.data}{Character vector or the output from \code{getKmers}.} -\item{.k}{Integer. Size of kmers.} +\item{.k}{Integer. Size of k-mers.} \item{.method}{Character vector of length one. If "freq" then returns a position frequency matrix (PFM) - a matrix with occurences of each amino acid in each position. @@ -29,7 +29,7 @@ For more information see https://en.wikipedia.org/wiki/Position_weight_matrix.} \item{.remove.stop}{Logical. If TRUE (by default) remove stop codons.} } \value{ -\code{split_to_kmers} - Data frame with two columns (kmers and their counts). +\code{split_to_kmers} - Data frame with two columns (k-mers and their counts). \code{kmer_profile} - a matrix with per-position amino acid statistics. } @@ -41,4 +41,4 @@ data(immdata) kmers <- getKmers(immdata$data[[1]], 5) kmer_profile(kmers) \%>\% vis() } -\concept{kmers} +\concept{k-mers} diff --git a/man/vis.Rd b/man/vis.Rd index 134a95ab..eff1ffa0 100644 --- a/man/vis.Rd +++ b/man/vis.Rd @@ -53,6 +53,10 @@ Diversity estimation: - Diversity estimations (from \link{repDiversity}) - see \link{vis.immunr_chao1}. +BCR analysis: + +- Clonal tree (from \link{repClonalFamily}) - see \link{vis.clonal_family} and \link{vis.clonal_family_tree}. + Advanced analysis: - Repertoire dynamics (from \link{trackClonotypes}) - see \link{vis.immunr_dynamics}; diff --git a/man/vis.clonal_family.Rd b/man/vis.clonal_family.Rd new file mode 100644 index 00000000..79b71eff --- /dev/null +++ b/man/vis.clonal_family.Rd @@ -0,0 +1,31 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/vis.R +\name{vis.clonal_family} +\alias{vis.clonal_family} +\title{Visualise clonal family tree: wrapper for calling on the entire repClonalFamily output} +\usage{ +\method{vis}{clonal_family}(.data, ...) +} +\arguments{ +\item{.data}{Clonal families from 1 or multiple samples: \code{\link{repClonalFamily}} output.} + +\item{...}{Not used here.} +} +\value{ +A ggraph object. +} +\description{ +Visualise clonal family tree: wrapper for calling on the entire repClonalFamily output +} +\examples{ +data(bcrdata) +bcr_data <- bcrdata$data + +clonal_family <- bcr_data \%>\% + seqCluster(seqDist(bcr_data), .fixed_threshold = 3) \%>\% + repGermline(.threads = 1) \%>\% + repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) \%>\% + repClonalFamily(.threads = 1, .nofail = TRUE) \%>\% + vis() +} +\concept{phylip} diff --git a/man/vis.clonal_family_tree.Rd b/man/vis.clonal_family_tree.Rd new file mode 100644 index 00000000..7aba8f06 --- /dev/null +++ b/man/vis.clonal_family_tree.Rd @@ -0,0 +1,36 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/vis.R +\name{vis.clonal_family_tree} +\alias{vis.clonal_family_tree} +\title{Visualise clonal family tree} +\usage{ +\method{vis}{clonal_family_tree}(.data, ...) +} +\arguments{ +\item{.data}{Single clonal family tree data from 1 cluster: 1 element from TreeStats column from \code{\link{repClonalFamily}} output.} + +\item{...}{Not used here.} +} +\value{ +A ggraph object. +} +\description{ +Visualise clonal family tree +} +\examples{ +data(bcrdata) +bcr_data <- bcrdata$data + +clonal_family <- bcr_data \%>\% + seqCluster(seqDist(bcr_data), .fixed_threshold = 3) \%>\% + repGermline(.threads = 1) \%>\% + repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) \%>\% + repClonalFamily(.threads = 1, .nofail = TRUE) + +# This condition can be omitted; it prevents the example from crashing +# when ClustalW or PHYLIP are not installed +if (!("step_failure_ignored" \%in\% class(clonal_family))) { + vis(clonal_family[["full_clones"]][["TreeStats"]][[2]]) +} +} +\concept{phylip} diff --git a/man/vis.step_failure_ignored.Rd b/man/vis.step_failure_ignored.Rd new file mode 100644 index 00000000..2845a2fd --- /dev/null +++ b/man/vis.step_failure_ignored.Rd @@ -0,0 +1,21 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/vis.R +\name{vis.step_failure_ignored} +\alias{vis.step_failure_ignored} +\title{Handler for .nofail argument of pipeline steps that prevents examples from crashing +on computers where certain dependencies are not installed} +\usage{ +\method{vis}{step_failure_ignored}(.data, ...) +} +\arguments{ +\item{.data}{Not used here.} + +\item{...}{Not used here.} +} +\value{ +An empty object with "step_failure_ignored" class. +} +\description{ +Handler for .nofail argument of pipeline steps that prevents examples from crashing +on computers where certain dependencies are not installed +} diff --git a/vignettes/v1_introduction.Rmd b/vignettes/v1_introduction.Rmd index 3dd77a80..715bebe0 100644 --- a/vignettes/v1_introduction.Rmd +++ b/vignettes/v1_introduction.Rmd @@ -23,7 +23,7 @@ output: # Introduction -`immunarch` is an R package designed to analyse T-cell receptor (TCR) and B-cell receptor (BCR) repertoires, aimed at medical scientists and bioinformaticians. The mission of `immunarch` is to make immune sequencing data analysis as effortless as possible and help you focus on research instead of coding. Follow us on [Twitter](https://twitter.com/immunomind) for news and updates. +`immunarch` is an R package designed to analyse T-cell receptor (TCR) and B-cell receptor (BCR) repertoires, mainly tailored to medical scientists and bioinformaticians. The mission of `immunarch` is to make immune sequencing data analysis as effortless as possible and help you focus on research instead of coding. Follow us on [Twitter](https://twitter.com/immunomind) for news and updates. ## Installation @@ -35,7 +35,7 @@ In order to install `immunarch` execute the following command: install.packages("immunarch") ``` -That's it, you can start using `immunarch` now! See the [Quick Start](#quick-start) section below to dive into immune repertoire data analysis. If you run in any trouble with installation, take a look at the [Installation Troubleshooting](https://immunarch.com/articles/v1_introduction.html#installation-troubleshooting) section. +That's it, you can start using `immunarch` now! See the [Quick Start](#quick-start) section below to dive into immune repertoire data analysis. If you run in any trouble during installation, take a look at the [Installation Troubleshooting](https://immunarch.com/articles/v1_introduction.html#installation-troubleshooting) section. Note: there are quite a lot of dependencies to install with the package because it installs all the widely-used packages for data analysis and visualisation. You got both the AIRR data analysis framework and the full Data Science package ecosystem with only one command, making `immunarch` the entry-point for single-cell & immune repertoire Data Science. @@ -44,17 +44,19 @@ Note: there are quite a lot of dependencies to install with the package because If the above command doesn't work for any reason, try installing `immunarch` directly from its repository: ```r -install.packages("devtools") # skip this if you already installed devtools +install.packages(c("devtools", "pkgload")) # skip this if you already installed these packages devtools::install_github("immunomind/immunarch") +devtools::reload(pkgload::inst("immunarch")) ``` ### Latest pre-release on GitHub -Since releasing on CRAN is limited to one release per one-two months, you can install the latest pre-release version with bleeding edge features and optimisations directly from the code repository. In order to install the latest pre-release version, you need to execute only two commands: +Since releasing on CRAN is limited to one release per one or two months, you can install the latest pre-release version with all the bleeding edge and optimised features directly from the code repository. In order to install the latest pre-release version, you need to execute the following commands: ```r -install.packages("devtools") # skip this if you already installed devtools +install.packages(c("devtools", "pkgload")) # skip this if you already installed these packages devtools::install_github("immunomind/immunarch", ref="dev") +devtools::reload(pkgload::inst("immunarch")) ``` You can find the list of releases of `immunarch` here: https://github.com/immunomind/immunarch/releases @@ -95,16 +97,16 @@ immdata <- repLoad("path/to/your/data") # Replace it with the path to your data ### Advanced methods -For advanced methods such as clonotype annotation, clonotype tracking, kmer analysis and public repertoire analysis see "Tutorials". +For advanced methods such as clonotype annotation, clonotype tracking, k-mer analysis and public repertoire analysis see "Tutorials". ## Installation troubleshooting -If you can not install `devtools`, check sections 1 and 2 below. +If you cannot install `devtools`, check sections 1 and 2 below. -If you run in any other trouble, try the following steps: +If you run into any other trouble, try the following steps: -1. Check your R version. Run `version` command in the console to get your R versions. If the R version is below 3.5.0 (for example, `R version 3.1.0`), try updating your R version to the latest one. Note: if you try to install a package after the update and it still fails with the following message: +1. Check your R version. Run `version` command in the console to get your R versions. If the R version is below 3.5.0 (for example, `R version 3.1.0`), try updating your R version to the most recent one. Note: if you try to install a package after the update and it still fails with the following message: ``` ERROR: dependencies ‘httr’, ‘usethis’ are not available for package ‘devtools’ @@ -113,7 +115,7 @@ If you run in any other trouble, try the following steps: installation of package ‘devtools’ had non-zero exit status ``` - it means that you need to re-install packages that were built under the previous R version. In the above example it would be packages `httr` and `usethis`. In order to re-install a package you need to execute the command `install.packages("package_name")`, where `package_name` is the name of the package to update. To find packages that need to be re-installed after updating R, you need to look for installation messages like this in the installation process: + it means that you need to reinstall the packages that were built under the previous R versions. In the example above those would be the packages `httr` and `usethis`. In order to reinstall a package you need to execute the command `install.packages("package_name")`, where `package_name` is the name of the package to update. To find the packages that need to be reinstalled after updating R, you need to look for installation messages like this in the installation process: ``` ERROR: package ‘usethis’ was installed by an R version with different internals; it needs to be reinstalled for use with this R version @@ -139,7 +141,7 @@ Execution halted 3. For Mac users. Make sure to install XCode from App Store first and command line developers tools second by executing the following command in Terminal: `xcode-select –install` -4. For Mac users. If you have issues like old packages can't be updated, or error messages such as `ld: warning: directory not found for option` or `ld: library not found for -lgfortran`, [this link](https://thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks--lgfortran-and--lquadmath-error/) will help you to fix the issue. +4. For Mac users. If you have other issues, for instance some old packages can't be updated, or you see an error message such as `ld: warning: directory not found for option` or `ld: library not found for -lgfortran`, [this link](https://thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks--lgfortran-and--lquadmath-error/) will help you to fix the issue. 5. For Mac Mojave (1.14) users. If you run into the following error: @@ -206,11 +208,11 @@ ERROR: lazy loading failed for package 'immunarch' Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true") ``` -9. For Windows users. If you have issues with the package installation, or if you want to change the folder for R packages, feel free to check [this forum post](https://community.rstudio.com/t/help-regarding-package-installation-renviron-rprofile-r-libs-r-libs-site-and-r-libs-user-oh-my/13888/8). +9. For Windows users. If you encounter issues with the package installation, or if you want to change the folder for R packages, feel free to check [this forum post](https://community.rstudio.com/t/help-regarding-package-installation-renviron-rprofile-r-libs-r-libs-site-and-r-libs-user-oh-my/13888/8). -9. For Windows users. Make sure to install [Rtools](https://cran.r-project.org/bin/windows/Rtools/). Before installation close RStudio, install Rtools and re-open it afterwards. To check if Rtools installed correctly, run the `devtools::find_rtools()` command (after installing the devtools package). If you have an error, check [this link](https://github.com/r-lib/devtools/issues/1941) for help. +9. For Windows users. Make sure to install [Rtools](https://cran.r-project.org/bin/windows/Rtools/). Before installation close RStudio, install Rtools and reopen it afterwards. To check if Rtools installed correctly, run the `devtools::find_rtools()` command (after installing the devtools package). If you have an error, check [this link](https://github.com/r-lib/devtools/issues/1941) for help. -9. If you can not install dependencies for `immunarch`, please try manual installation of all dependencies by executing the following command in R console: +9. If you cannot install dependencies for `immunarch`, please consider manual installation of all dependencies by executing the following command in R console: ``` install.packages(c("rematch", "prettyunits", "forcats", "cellranger", "progress", "zip", "backports", "ellipsis", "zeallot", "SparseM", "MatrixModels", "sp", "haven", "curl", "readxl", "openxlsx", "minqa", "nloptr", "RcppEigen", "utf8", "vctrs", "carData", "pbkrtest", "quantreg", "maptools", "rio", "lme4", "labeling", "munsell", "cli", "fansi", "pillar", "viridis", "car", "ellipse", "flashClust", "leaps", "scatterplot3d", "modeltools", "DEoptimR", "digest", "gtable", "lazyeval", "rlang", "scales", "tibble", "viridisLite", "withr", "assertthat", "glue", "magrittr", "pkgconfig", "R6", "tidyselect", "BH", "plogr", "purrr", "ggsci", "cowplot", "ggsignif", "polynom", "fastcluster", "plyr", "abind", "dendextend", "FactoMineR", "mclust", "flexmix", "prabclus", "diptest", "robustbase", "kernlab", "GlobalOptions", "shape", "colorspace", "stringi", "hms", "clipr", "crayon", "httpuv", "mime", "jsonlite", "xtable", "htmltools", "sourcetools", "later", "promises", "gridBase", "RColorBrewer", "yaml", "ggplot2", "dplyr", "dtplyr", "data.table", "gridExtra", "ggpubr", "pheatma3", "ggrepel", "reshape2", "DBI", "factoextra", "fpc", "circlize", "tidyr", "Rtsne", "readr", "readxl", "shiny", "shinythemes", "treemap", "igraph", "airr", "ggseqlogo", "UpSetR", "stringr", "ggalluvial", "Rcpp")) @@ -229,4 +231,4 @@ install.packages(c("rematch", "prettyunits", "forcats", "cellranger", "progress" Check your path to the downloaded package archive file. It should not be "path/to/your/folder/with/immunarch.tar.gz", but a path on your PC to the downloaded file, e.g., "C:/Users/UserName/Downloads/immunarch.tar.gz" or "/Users/UserName/Downloads/immunarch.tar.gz". -9. If troubles still persist, let us know via [GitHub](https://github.com/immunomind/immunarch/issues) (preferably) or [support@immunomind.io](mailto:support@immunomind.io) (in case of private data). +9. If any of the troubles still persist, let us know via [GitHub](https://github.com/immunomind/immunarch/issues) (preferably) or [support@immunomind.io](mailto:support@immunomind.io) (in case of private data). diff --git a/vignettes/v2_data.Rmd b/vignettes/v2_data.Rmd index 1126af8e..57dd7e6b 100644 --- a/vignettes/v2_data.Rmd +++ b/vignettes/v2_data.Rmd @@ -42,13 +42,13 @@ data(immdata) # Input / output The package provides several IO functions: -- `repLoad` - to load the repertoires, having compatible format. +- `repLoad` - loads the repertoires having compatible format. -- `repSave` - to save changes and to write the repertoire data to a file in a specific format (`immunarch`, VDJtools). +- `repSave` - saves changes and writes the repertoire data to a file in a specific format (`immunarch`, VDJtools). `repLoad` detects the input file format automatically. `immunarch` currently support the following immune repertoire data formats: -- `"immunarch"` - current software tool, in case you forgot it :) +- `"immunarch"` - current software tool, in case you forgot :) - `"immunoseq"` - https://www.immunoseq.com @@ -78,7 +78,7 @@ The package provides several IO functions: - `"vidjil"` - http://www.vidjil.org/ -Please contact us if there are more file formats you want to be supported. +Please contact us if there are more file formats you want supported. For parsing IgBLAST results process the data with MigMap first. @@ -90,7 +90,7 @@ You can load the data from a single file, a list of repertoire file paths, or fr If you have your files, you should just specify a path to your file or to a folder with files. Then load data using `repLoad`: ```{r, eval=F} -#path argument is a path to the folder with your file or files including metadata file also(if you have one). +#path argument is a path to the folder with your file or files including the metadata file. immdata <- repLoad(path) ``` @@ -99,16 +99,16 @@ immdata <- repLoad(path) You could find a folder with example files [here](https://github.com/immunomind/immunarch/releases/tag/0.6.7) (download and extract test_data.zip or test_data.tar.gz) and use it to test data loading without your own files. -If you are not familiar with file paths, you can download example data to your working directory. -You can obtain you working directory with `getwd()` command +If you are not familiar with the file paths, you can download our mock data to your working directory. +You can obtain working directory with `getwd()` command -You could also download all files to the `'example'` folder in your working directory and load all of them by passing folder name to repLoad function in a quotes: +You could also download all files to the `'example'` folder in your working directory and load all of them by passing folder name to repLoad function in quotation marks: ```{r, eval=F} immdata <- repLoad('example') ``` -Also, example data are already downloaded with `immunarch` package. You can load all sample files using the following command: +The example data is already downloaded with `immunarch` package. You can load all sample files using the following command: ```{r, eval=F} #path to the folder with example data @@ -116,7 +116,7 @@ file_path = paste0(system.file(package="immunarch"), "/extdata/io/") immdata <- repLoad(file_path) ``` -In other cases you may want to provide a metadata file and locate it in the folder. It is necessary to name it exactly "metadata.txt". +In other cases you may want to provide a metadata file and locate it in the folder. It is necessary to name it "metadata.txt". ```{r, eval=F} # For instance you have a following structure in your folder: @@ -138,7 +138,7 @@ immdata <- repLoad(file_path) print(names(immdata)) # In order to do that your folder must contain metadata file named -# exactly "metadata.txt". +# "metadata.txt". # In R, when you load your data: # > immdata <- repLoad("path/to/your/folder/") @@ -152,13 +152,13 @@ print(names(immdata)) ``` -Dummy metadata data frame look like this: +Dummy metadata data frame looks like this: ```{r} as_tibble(data.frame(Sample = c("immunoseq1", "immunoseq2", "immunoseq3"), stringsAsFactors = F)) ``` -The metadata file "metadata.txt" has to be tab delimited file with first column named "Sample" and any number of additional columns with arbitrary names. The first column should contain base names of files without extensions in your folder. +The metadata file "metadata.txt" has to be a table with first column named "Sample" and any number of additional columns with any names. The first column should contain the base names of files without extensions in your folder. | **Sample** |**Sex**|**Age**|**Status**| |:----------:|:-----:|:-----:|:--------:| @@ -166,7 +166,7 @@ The metadata file "metadata.txt" has to be tab delimited file with first column |immunoseq\_2|M |2 |C | |immunoseq\_3|F |3 |A | -In order to import data from the external databases you have to create a connection to this database and then load the data. Make sure that the table format in your database matches the `immunarch`'s format. +In order to import data from the external databases you have to connect to this database and then load the data. Make sure that the table format in your database matches the `immunarch`'s format. To illustrate the use of external database, here is an example demonstrating data loading to the local MonetDB database: ```{r, eval=F} @@ -200,15 +200,15 @@ list(data = res_db, meta = META) ``` -`immunarch` is compatible with following sources: +`immunarch` is compatible with the following sources: -- R data frames (for common applications) +- R data frames (for most applications) -- R data tables (for speed, although requires lots of RAM) +- R data tables (for faster calculations, although they require a lot of RAM) -- MonetDB-like databases that support both DBI and dplyr (best choice, although you have to be a bit familiar with dplyr) +- MonetDB-like databases that support both DBI and dplyr (an optimal choice, although you have to be familiar with dplyr) -- Apache Spark (if you are expierienced and comfortable with it) +- Apache Spark (if you have experience with it) # Basic data manipulations with dplyr and immunarch You can find the introduction to `dplyr` here: https://CRAN.R-project.org/package=dplyr/vignettes/dplyr.html @@ -228,7 +228,7 @@ The next one operates in a similar fashion: ```{r, eval=FALSE} noncoding(immdata$data[[1]]) ``` -Now, the computation of the number of filtered sequences is straightforward: +Now, the computation of the number of filtered sequences is rather straightforward: ```{r, eval=FALSE} nrow(inframes(immdata$data[[1]])) ``` @@ -238,7 +238,7 @@ nrow(outofframes(immdata$data[[1]])) ``` ## Get subset of clonotypes with a specific V gene -It is simple to subset data frame according to labels in the specified index. In the example the resulting data frame contains only records with 'TRBV10-1' V gene: +It is simple to subset data frame according to labels in the specified index. In this example the resulting data frame contains only records with 'TRBV10-1' V gene: ```{r} filter(immdata$data[[1]], V.name == 'TRBV10-1') ``` diff --git a/vignettes/web_only/BCRpipeline.Rmd b/vignettes/web_only/BCRpipeline.Rmd new file mode 100644 index 00000000..5f2a8c8e --- /dev/null +++ b/vignettes/web_only/BCRpipeline.Rmd @@ -0,0 +1,324 @@ +--- +title: 'BCR' +author: 'ImmunoMind – improving design of T-cell therapies using multi-omics and AI. Research and biopharma partnerships, more details: immunomind.io' +date: "support@immunomind.io" +output: + html_document: + fig_height: 8 + fig_width: 10 + theme: spacelab + toc: yes + pdf_document: + toc: yes + word_document: + toc: yes +--- + + + + + +# Overview + +A **B-cell receptor (BCR)** consists of an immunoglobulin molecule and a СD79 molecule. BCR includes two identical heavy chains (generated by recombination of V, D, and J segments), and two identical light chains (generated by recombination of V and J segments). Rapid improvements in high-throughput sequencing provide an opportunity to study B-cell immunoglobulin receptors on the cell surface [1] via the process called **BCR repertoire sequencing**. This pipeline describes how to work with BCR repertoire data after preprocessing raw sequenced data with MiXCR ([https://mixcr.readthedocs.io/en/master/](https://mixcr.readthedocs.io/en/master/)) or any other programs. + +The pipeline involves five steps: + +1. **Data loading.** + + The step includes loading the data and checking whether there is enough information for the pipeline to work. + +2. **Reconstructing clonal lineages.** + + At this step, all BCR sequences are divided into clonal lineages — sets of sequences that presumably share a common ancestor. + +3. **Building germline.** + + At this step, the algorithm generates a sequence that represents the ancestral sequence for each BCR in a clonal lineage. + +4. **Aligning sequences within clonal lineages.** + + This step involves preparation for phylogenetic and somatic hypermutation analysis. + + +5. **Phylogenetic analysis.** + + This step provides phylogeny reconstruction and trunk length calculation (by running the PHYLIP package). + + +6. **Somatic hypermutation analysis.** + + At this step, we compare clonotype and germline sequences to detect and count the number of mutations in clonotype sequence. + + +```{r setup, include=FALSE, echo=FALSE} +# knitr::knit_hooks$set(optipng = knitr::hook_optipng) +# knitr::opts_chunk$set(optipng = '-o7') + +knitr::opts_chunk$set(echo = TRUE) +knitr::opts_chunk$set(fig.align = "center") +knitr::opts_chunk$set(fig.width = 12) +knitr::opts_chunk$set(fig.height = 6) + +library(immunarch) +``` + +# Data loading + +BCR data are loaded with `repLoad` functions, like any other data. + +Check that BCR data has the following: + +- information about full sequence read of each BCR, +- positions of CDR3 region start and end, +- number of mutations per CDR3 region. In MiXCR output files, this information is located in the column ‘`refPoints`’ + +Look at the sample BCR data implemented in Immunarch: + +```{r example 1, warning = FALSE} +#load the package into the R environment: +library(immunarch) + +data(bcrdata) +``` + +# Reconstructing clonal lineages + +B-cell **clonal lineage** represents a set of B cells that presumably have a common origin (arising from the same VDJ rearrangement event) and a common ancestor. BCR sequences are clustered according to their similarity, and sets of sequences in one cluster are named clonal lineages. Clustering involves two steps. + +- The first step is calculating the distance between sequences. +- The second step is clustering the sequences using the information about the distance between them. Check our tutorial for more information about clustering: [https://www.notion.so/immunomind/Clustering-1752b661f8794fb6aba19f25e9373488](https://www.notion.so/Clustering-1752b661f8794fb6aba19f25e9373488) + +An example of reconstructing clonal lineages using default Immunarch options: + +```{r example 2} +#calulate distance matrix +distBCR <- seqDist(bcrdata$data %>% top(500)) + +#find clusters +bcrdata$data <- seqCluster(bcrdata$data %>% top(500), distBCR, .perc_similarity = 0.9) +``` + +# Building germline + +Each clonal lineage has its own **germline sequence** that represents the ancestral sequence for each BCR in the clonal lineage. In other words, germline sequence is a sequence of B-cells immediately after VDJ recombination, before B-cell maturation and hypermutation process. Germline sequence is useful for assessing the degree of mutation and maturity of the repertoire. + +In Immunarch, `repGermline()` function generates germline for each sequence: + +```{r example 3, results = 'hide'} +#generate germline +bcrdata$data %>% + repGermline(.threads = 1) +``` + +A germline is represented via sequences of V gene - N...N (CDR3 length) - J gene: + +```{r example 4} +#germline example +bcrdata$data %>% + top(1) %>% + repGermline(.threads = 1) %>% .$full_clones %>% .$Germline.sequence +``` + +Оptions: + +`.data` - The data to be processed. Can be `data.frame`, `data.table`, or a list of these objects. + +`.species` - Specifies species from which reference V and J are taken. +Available species: "HomoSapiens" (default), "MusMusculus", "BosTaurus", "CamelusDromedarius", "CanisLupusFamiliaris", "DanioRerio", "MacacaMulatta", "MusMusculusDomesticus", "MusMusculusCastaneus", "MusMusculusMolossinus", "MusMusculusMusculus", "MusSpretus", "OncorhynchusMykiss", "OrnithorhynchusAnatinus", "OryctolagusCuniculus", "RattusNorvegicus", "SusScrofa". + +`.min_nuc_outside_cdr3` — this parameter sets how many nucleotides should have V or J chain outside of CDR3 to be considered good for further alignment. Reads with too short chains are filtered out + +`.align_j_gene` - if the germline sequence does not assemble correctly in the region of the J gene, then set this parameter to True. This will slow down the algorithm, but the assembly of the germline sequence will be more accurate. + +# Aligning sequences within a clonal lineage + +After building clonal lineage and germline, we can start analyzing the degree of mutation and maturity of each clonal lineage. This allows us to find cells with a large number of mutated clones. The phylogenetic analysis will find mutations that influence the affinity of BCRs. Aligning the sequence is the first step toward sequence phylogenetic analysis. + +In the Immunarch package, the function `repAlignLineage` aligns sequences within clonal lineages. This function requires `Clustal W` app to be installed. The app could be downloaded here: [http://www.clustal.org/download/current/](http://www.clustal.org/download/current/%5C%5Cn), or installed via your system package manager (such as apt or dnf). + +- For Ubuntu, check this guide: [https://www.howtoinstall.me/ubuntu/18-04/clustalw/](https://www.howtoinstall.me/ubuntu/18-04/clustalw/) + +```{r example 6, eval = FALSE} +sudo apt update +sudo apt install clustalw +``` + +- For Windows, download and run `clustalw-x.x-win.msi` + +`repAlignLineage` usage example: + +```{r example 7, results = 'hide'} +data(bcrdata) +bcr_data <- bcrdata$data %>% top(500) +bcr_data %>% + seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% + repGermline(.threads = 1) %>% + repAlignLineage(.min_lineage_sequences = 2) +``` + +The function has several parameters: + +- `.min_lineage_sequences` — Filters clusters (clonal lineages) with the number of clonotypes lower than the threshold. Aligning clonal lineages with few sequences is of little use. + +```{r example 8, results = 'hide'} +# take clusters that contain at least 1 sequence +align_dt <- bcr_data %>% + seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% + repGermline(.threads = 1) %>% + repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2) +``` + +- `.prepare_threads` — the number of threads to prepare results table. A high number can cause memory overload! +- `.align_threads` — the number of threads for lineage alignment. + +Requirements for the input table for `repAlignLineage()` + +- must have columns in the immunarch compatible format `immunarch_data_format`, +- must contain the 'Cluster' column generated by seqCluster() function, +- must contain the 'Sequence.germline' column generated by repGermline() function. + +Align sequences in a cluster can be visualized using standard functions: + +```{r align visualisation} +# A name of the first cluster +align_dt$full_clones$Cluster[[1]] + +# Alignment of sequences from the first cluster +image(align_dt$full_clones$Alignment[[1]], grid = TRUE) +``` + +# Phylogenetic analysis + +For BCR phylogenetic analysis, we use the **maximum parsimony method**. The maximum parsimony method reconstructs a phylogenetic tree for the lineage by minimizing the total number of mutation events [2]. + +This method also enables the reconstruction of intermediate sequences that may have existed between BCR and germline sequence. These simulated sequences contain mutations that were ancestral to the clonotype group selected with the maximum parsimony analysis. + +In Immunarch, function `repClonalFamily` builds a phylogeny of B-cells, generates a common ancestor, and calculates a trunk length. The mean trunk length represents the distance between the most recent common ancestor and germline sequence [3]. Mean trunk length serves as a measure of the maturity of a lineage. The trunk length of the lineage tree approximates the maturation state of the initiating B cell for each clone [4]. + +This function requires `PHYLIP` app to be installed. The app can be downloaded here: [https://evolution.genetics.washington.edu/phylip/getme-new1.html](https://evolution.genetics.washington.edu/phylip/getme-new1.html), or could be installed with your system package manager (such as apt or dnf). + +- For Ubuntu, follow the installation guide ([https://zoomadmin.com/HowToInstall/UbuntuPackage/phylip](https://zoomadmin.com/HowToInstall/UbuntuPackage/phylip) ): + +```{r example 9, eval = FALSE} +sudo apt-get update -y +sudo apt-get install -y phylip +``` + +- For Windows: + 1. Download a Zip archive ([https://evolution.genetics.washington.edu/phylip/install.html](https://evolution.genetics.washington.edu/phylip/install.html)). + 2. Extract the archive into a folder. + 3. Add the folder to the PATH ([https://www.architectryan.com/2018/03/17/add-to-the-path-on-windows-10/](https://www.architectryan.com/2018/03/17/add-to-the-path-on-windows-10/)) + +repClonalFamily usage example: + +```{r example 10, results = 'hide'} +library(ape) +bcr <- bcr_data %>% + seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% + repGermline(.threads = 1) %>% + repAlignLineage(.min_lineage_sequences = 2) %>% + repClonalFamily(.threads = 2) + +#plot visualization of the first tree +plot(bcr$full_clones$Tree[[1]]) +tiplabels() +nodelabels() +``` + +We have found 4 clusters: + +```{r example 11, warning = FALSE} +bcr$full_clones$Cluster %>% unique() +``` + +We have found mismatches between a germline and an ancestor sequence. Dots represent nucleotides matches between the sequences, letters represent mismatches between the sequences: + +```{r example 12} +# the example of common ancestor sequence +bcr$full_clones$Common.Ancestor[1] + +# the example of mismatches between a germline and an ancestor sequence +bcr$full_clones$Germline.Output[1] +``` + +We have calculated a trunk length for each cluster: + +```{r example 13} +bcr$full_clones[ , c('Cluster', 'Trunk.Length') ] +``` + +# Somatic hypermutation analysis + +The rate of somatic hypermutation allows us to estimate repertoire maturation and detect the type of mutation contributing to the emergence of high affinity antibodies. This makes somatic hypermutation analysis a valuable asset for B-cell repertoire analysis. + +In Immunarch, `repSomaticHypermutation()` function is designed for hypermutation analysis: + +```{r example 14, , warning = FALSE} +bcr_data <- bcrdata$data + +shm_data <- bcr_data %>% + seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>% + repGermline(.threads = 2) %>% + repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2, .nofail = TRUE) %>% + repClonalFamily(.threads = 2, .nofail = TRUE) %>% + repSomaticHypermutation(.threads = 2, .nofail = TRUE) +``` + +The function repSomaticHypermutation() takes V and J germline sequences and V and J clonotype sequences as an input. Then the function aligns germline and clonotype sequences to detect and calculate occurring mutations. + +Examples of germline and clonotype sequences: + +```{r example 15} +# the example of germline V sequence +shm_data$full_clones$V.germline.nt[1] + +# the example of germline J sequence +shm_data$full_clones$J.germline.nt[1] + +# the example of clonotype V sequence +cols <- c('FR1.nt', 'CDR1.nt', 'FR2.nt', 'CDR2.nt', 'FR3.nt') +apply(shm_data$full_clones[1,][ , cols ] , 1 , paste , collapse = "" )[[1]] + +# the example of clonotype J sequence +shm_data$full_clones$FR4.nt[1] +``` + +Example: aligning germline and clonotype V sequences: + +```{r example 16} +image(shm_data$full_clones$Germline.Alignment.V[[1]], grid = TRUE) +``` + +Example: aligning germline and clonotype J sequences: + +```{r example 17} +image(shm_data$full_clones$Germline.Alignment.J[[1]], grid = TRUE) +``` + +The number of mutations for each clonotype sequence: + +```{r example 18} +cols <- c('Clone.ID', 'Substitutions', 'Insertions', 'Deletions', 'Mutations') +shm_data$full_clones[ , cols ] +``` + +Then you could easily estimate the mutation rate: + +```{r example 19} +# estimate mutation rate +shm_data$full_clones %>% + mutate(Mutation.Rate = Mutations / (nchar(Common.Ancestor) - CDR3.germline.length)) %>% + select(Clone.ID, Mutation.Rate) +``` + +# References + +1. [https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-015-0243-2](https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-015-0243-2) +2. [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4754972/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4754972/) +3. [https://www.frontiersin.org/articles/10.3389/fimmu.2020.01734/full](https://www.frontiersin.org/articles/10.3389/fimmu.2020.01734/full) +4. [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4754972/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4754972/) diff --git a/vignettes/web_only/clustering.Rmd b/vignettes/web_only/clustering.Rmd new file mode 100644 index 00000000..88882dc7 --- /dev/null +++ b/vignettes/web_only/clustering.Rmd @@ -0,0 +1,248 @@ +--- +title: 'Clustering' +author: 'ImmunoMind – improving design of T-cell therapies using multi-omics and AI. Research and biopharma partnerships, more details: immunomind.io' +date: "support@immunomind.io" +output: + html_document: + fig_height: 8 + fig_width: 10 + theme: spacelab + toc: yes + pdf_document: + toc: yes + word_document: + toc: yes +--- + + + + + +Clustering B-cell receptors (BCRs) and T-cell receptors (TCRs) sequences is a common step in the analysis of B and T cells. + +In the B-cell repertoire, this step is required for making clonal lineages. B-cell **clonal lineage** represents a set of B cells that presumably share a common origin (arising from the same VDJ rearrangement event) and a common ancestor. Information about clonal lineages allows you to estimate somatic hypermutation levels and find binding affinity-changing mutations. + +T cells that are able to recognize identical epitopes were shown to have high similar TCR (Glanville et al. 2017; Dash et al. 2017). Clustering allows for finding high similar TCR in repertoires that are likely to recognize the same antigens. + +Clustering analysis involves two steps. The first step requires calculating the distance between sequences. The second step requires clustering the sequences using the information about the distance between them. + +Application of clustering enables you to expand the clonotype concept. A **clonotype** is a set of cells that supposedly share a common ancestor or properties. There are a few possible ways to define clonotypes. For example, some researchers define clonotypes as a set of sequences that have the same V gene and same nucleotide CDR3 sequence. Other researchers define clonotypes as cell sets that have the same V gene, the same J gene, and the same amino acid CDR3 sequence. Clustering provides you with a way to create new groups of sequences based on your vision. + +```{r setup, include=FALSE, echo=FALSE} +# knitr::knit_hooks$set(optipng = knitr::hook_optipng) +# knitr::opts_chunk$set(optipng = '-o7') + +knitr::opts_chunk$set(echo = TRUE) +knitr::opts_chunk$set(fig.align = "center") +knitr::opts_chunk$set(fig.width = 12) +knitr::opts_chunk$set(fig.height = 6) + +library(immunarch) +``` + +# Calculating the distance + +CDR3 region is the most variable region in BCR and TCR. This region is commonly used for calculating the distance between sequences. However, other regions could also be used for calculating this distance — like the full nucleotide reads. You can use any user-defined column of sequences to calculate the distance, since column selection depends solely on your research purpose. For example, the use of ‘nt’ sequences results in more conservative clustering. + +Example: + +```{r example 1, warning = FALSE} +#load the package into the R environment: +library(immunarch) + +data(immdata) +data(bcrdata) + +TCRdata <- repFilter(immdata, .method = "by.meta", .query = list(Sample = include("A2-i129")))$data +BCRdata <- bcrdata$data + +#using aa CDR3 sequences for calculating distance +distTCR <- seqDist( TCRdata, .col = 'CDR3.aa') + +#using nt CDR3 sequences for calculating distance +distTCR <- seqDist( TCRdata, .col = 'CDR3.nt') + +#using full sequences for calculating distance +distBCR <- seqDist( BCRdata, .col = 'Sequence') +``` + +Note that a similar CDR3 region could potentially arise from different VDJ rearrangement events. When running the analysis, group the clonotypes with similar CDR3 sequences by V gene, J gene, and/or sequence length to maximize the probability that similar CDR3 sequences arise from the same event (for BCRs) or recognize the same antigens (for TCRs). Such grouping also speeds up and simplifies the distance calculating process. + +Grouping by V gene is necessary; grouping by J gene or the sequence length is optional. By default, the sequences are grouped by ‘V.name’ column, ‘J. name’ column, and CDR3 length (since the comparison of CDR3 regions for sequences with different V and J genes is not a common case). + +Example: + +```{r example 2} +#group sequence by V.name and length CDR3 +distBCR <- seqDist( BCRdata, .group_by = c('V.name'), .group_by_seqLength = T) + +#group sequence by V.name and J.name +distBCR <- seqDist( BCRdata, .group_by = c('V.name', 'J.name'), .group_by_seqLength = F) +``` + +There are several possible approaches to estimating the distance between the sequences. + +The default approach is to use the Hamming distance. When sequences have different lengths the distance between them equals infinity. Use this approach if you want to take into consideration only the point mutations excluding the indels. + +Example: + +```{r example 3} +distBCR <- seqDist( BCRdata, .col = 'CDR3.aa', .methods = "hamming", .group_by_seqLength = T) +``` + +When you want to take into consideration not only single mutations but also indels, Levenshtein distance is a more appropriate method to use. + +```{r example 4} +distBCR <- seqDist( BCRdata, .col = 'CDR3.aa', .methods = "lv", .group_by_seqLength = F) +``` + +Or you could also use the longest common substring: + +```{r example 5} +distBCR <- seqDist( BCRdata, .col = 'CDR3.aa', .methods = "cls", .group_by_seqLength = F) +``` + +Another way is to write your own function for calculating distance. For example: + +```{r example 6, eval = FALSE} +distBCR <- seqDist(BCRdata, .col = 'CDR3.aa', .methods = {user-function}, .group_by_seqLength = F) +``` + +# Clustering + +After calculating the distance, it’s time to cluster your sequences and find TCRs and BCRs that have similar CDR3 sequences or other similar regions. It is possible to run the analysis based only on the similarity of CDR3 sequence, but we recommend not limiting the analysis to this sequence similarity. + +In mathematical terms, a clonotype is a vertex in graph. If two clonotypes have the same V and J genes and ‘CDR3.aa’. differs by less than a predefined threshold value, we draw an edge between the vertices. A cluster is a connected component — a set of clonotypes connected by edges. + +Clustering is commonly used to expand the concept of a clonotype. For example, you can consider clonotype as a group of sequences with the same amino acid CDR3 sequences, as these sequences are likely to have the same properties. + +Example: + +```{r example 7} + +#calculate distance +distTCR <- seqDist( TCRdata, .col = 'CDR3.aa') + +#clustering TCR by CDR3 regions +clustTCR <- seqCluster(TCRdata, distTCR, .perc_similarity = 0.9) + +#calculate number of unique clusters in column 'Cluster' +clustTCR$"A2-i129" %>% .$Cluster %>% unique() %>% length() +``` + +In general, the number of clusters you get depends on threshold you defined previously. There are 3 different ways to assign a threshold for clustering: + +1) *Percentage similarity* + +Requires defining the minimum percentage of similarity for the sequences in your cluster in advance. For example, if you want to find cluster of BCRs that not only have the same [V.name](http://V.name), [J.name](http://J.name), and CDR3 length — but also have more than 90 percent match in their CDR3. + +```{r example 8, warning = FALSE} +#clustering TCR +clustTCR <- seqCluster(TCRdata, distTCR, .perc_similarity = 0.75) + +#count cluster size +dt_perc_similarity <- clustTCR$"A2-i129" %>% group_by(Cluster) %>% summarise(cluster_size = n()) + +#calculate number of unique clusters in column 'Cluster' +clustTCR$"A2-i129" %>% .$Cluster %>% unique() %>% length() +``` + +2) *Fixed similarity* + +Requires directly determining the number of mismatched nucleotides. In case of TCR-recognising the same epitope, use 1 amino acid mismatch in CDR3 sequences. + +```{r example 9, warning = FALSE} +#clustering TCR +clustTCR <- seqCluster(TCRdata, distTCR, .fixed_threshold = 2) + +#count cluster size +dt_fixed_threshold <- clustTCR$"A2-i129" %>% group_by(Cluster) %>% summarise(cluster_size = n()) + +#calculate number of unique clusters in column 'Cluster' +clustTCR$"A2-i129" %>% .$Cluster %>% unique() %>% length() +``` + +3) *Similarity for every n letters in the sequence* + +If you want to guarantee that the matches in the sequences are at a set proximity, specify a distance threshold + +```{r example 10, warning = FALSE} +#clustering TCR +clustTCR <- seqCluster(TCRdata, distTCR, .nt_similarity = 10) + +#count cluster size +dt_nt_similarity <- clustTCR$"A2-i129" %>% group_by(Cluster) %>% summarise(cluster_size = n()) + +#calculate number of unique clusters in column 'Cluster' +clustTCR$"A2-i129" %>% .$Cluster %>% unique() %>% length() +``` + +# Clustering result + +If you made it to this part, then you have successfully clustered your sequences. Congratulations! Now you have a new column `“cluster”` in your data sheet. You can now generalize the concept of clonotype based on sequence similarity and consider clusters as clonotypes. + +Comparing cluster size distributions is a good way to analyse quality of clustering process and to choose the best parameters for functions: + +```{r } +#write small function for the visualization of cluster size destribution +mk_hist <- function(clust_dt, graph_name) { + return(ggplot(clust_dt, aes(x=cluster_size)) + geom_histogram( binwidth=1, fill="#69b3a2", color="#e9ecef", alpha=0.9) + + ggtitle(graph_name) + + theme_bw() + + theme( + plot.title = element_text(size=15) + )) +} + +# compare plots +mk_hist(dt_perc_similarity, '.perc_similarity=0.75') + mk_hist(dt_fixed_threshold, '.fixed_threshold=2') + + mk_hist(dt_nt_similarity, '.nt_similarity=10') +``` + +# Other ways to identify B cell clones + +There are other packages that provide computational framework for identification of B cell clones. One of the most popular is a `scoper` package ([https://scoper.readthedocs.io/en/stable/vignettes/Scoper-Vignette/#introduction](https://scoper.readthedocs.io/en/stable/vignettes/Scoper-Vignette/#introduction)) + +`Immunarch` enables users to integrate results from `scoper` package for further analysis: + +```{r example 11, warning = FALSE} +#load the package into the R environment +library(scoper) + +#generate germline sequence +bcr <- bcrdata$data$full_clones %>% + top(2000) %>% # reduce the dataset to save time on examples + repGermline() +``` + +Add columns ‘sequence_alignment’ required by `scoper` (https://scoper.readthedocs.io/en/stable/vignettes/Scoper-Vignette/): + +```{r example 12} +#generate `sequence_alignment` column +cols <- c('FR1.nt', 'CDR1.nt', 'FR2.nt', 'CDR2.nt', 'FR3.nt', 'CDR3.nt', 'FR4.nt') +bcr$sequence_alignment <- apply( bcr[ , cols ] , 1 , paste , collapse = "" ) +``` + +Rename columns in `scoper` format: + +```{r example 13} +ExampleDb <- bcr %>% select(J.first.allele, V.first.allele, CDR3.nt, Germline.sequence, sequence_alignment) %>% + rename(junction = CDR3.nt, v_call = V.first.allele, j_call = J.first.allele, germline_alignment_d_mask = Germline.sequence) +``` + +Cluster using `scoper` method: + +```{r example 14} +results <- hierarchicalClones(ExampleDb, threshold=0.15) +plot(results, binwidth=0.02) +``` + +Note that the column ‘clone_id' in `scoper` format has the same meaning as the ‘Cluster’ column in `immunarch`: + +```{r example 15} +glimpse(summary(results)) +``` diff --git a/vignettes/web_only/load_10x.Rmd b/vignettes/web_only/load_10x.Rmd index 6761833b..14451681 100644 --- a/vignettes/web_only/load_10x.Rmd +++ b/vignettes/web_only/load_10x.Rmd @@ -38,9 +38,9 @@ Their end-to-end pipeline also includes the here to use Cellranger on your data. Note that Cellranger is only supported by Linux operating systems currently. The `cellranger count` and `cellranger vdj` methods will be particularly useful. +Follow the instructions here to use Cellranger on your data. Note that Cellranger is currently only supported by Linux operating systems. The `cellranger count` and `cellranger vdj` methods will be particularly useful. -You can find sample data of full runs to download here. In this tutorial, I used this single cell mouse data. If you are only using `immunarch` you can download only the .csv files. +You can find sample data of full runs to download here. In this tutorial, we use this single cell mouse data. If you are using `immunarch` you can download only the .csv files. # Prepare 10x Data Upon processing the data, you will have a lot of output files. You should use the `filtered contigs` csv files because they contain barcode information. @@ -126,7 +126,7 @@ Congrats! Now your data is ready for exploration. Follow the steps [here](https: ## Note on barcodes -Another important note is that some of the contigs files lack a column for barcodes - a unique identified of a cell. +Another important note is that some of the contigs files lack a column for barcodes – a unique identifier of any cell. These files can be useful for analysis of single chain data (only alpha, or beta TCRs), but in order to analyze paired-chain data and fully utilize the full power of single-cell technologies, you should upload the file with barcodes to the Immunarch. diff --git a/vignettes/web_only/load_mixcr.Rmd b/vignettes/web_only/load_mixcr.Rmd index 44ae28f1..41fa4549 100644 --- a/vignettes/web_only/load_mixcr.Rmd +++ b/vignettes/web_only/load_mixcr.Rmd @@ -24,18 +24,18 @@ output: # Intro to MiXCR MiXCR is a universal software for fast and accurate extraction of T- and B- cell receptor repertoires from any type of sequencing data. It handles paired- and single-end reads, considers sequence quality, corrects PCR errors and identifies germline hypermutations. The software supports both partial- and full-length profiling and employs all available RNA or DNA information, including sequences upstream of V and downstream of J gene segments. -Some features include: +Some of its features include: -- Extracts both T- and B-cell receptor repertoires -- Extracts repertoire data even from regular RNA-Seq -- Successfully analyses full-length antibody data +- Extracting both T- and B-cell receptor repertoires +- Extracting repertoire data even from regular RNA-Seq +- Successfully analysing full-length antibody data -Find more info about MiXCR here +Find more info about MiXCR at here # Prepare MiXCR Data Follow these instructions here to install MiXCR and get started with processing data. **Important Note:** Currently supports MiXCR version 3 and above. -MiXCR supports the following formats of sequencing data: fasta, fastq, fastq.gz, paired-end fastq and fastq.gz. In this tutorial I used the real IGH data from here. +MiXCR supports the following formats of sequencing data: fasta, fastq, fastq.gz, paired-end fastq and fastq.gz. In this tutorial we use the real IGH data from here. You can choose to use the `analyze amplicon` method to process in one go: @@ -158,7 +158,7 @@ $meta ## Loading a folder -In this tutorial I used three identical samples just to show the output, but you should put all your output `.txt` clonotype files in this folder, along with your `metadata.txt` file. +In this tutorial we use three identical samples just to demonstrate the output, but you should put all your output `.txt` clonotype files in this folder, along with your `metadata.txt` file. ```{r, eval=F} # 1.1) Load the package into R: @@ -188,7 +188,7 @@ Processing "/path/to/your/mixcr/data/" ... Done! ``` -Now let's take a look at the data! Your output should look something like below. +Now let's take a look at the data! Your output should look something like that. ```{r, eval=F} r$> immdata_mixcr diff --git a/vignettes/web_only/repFilter_v3.Rmd b/vignettes/web_only/repFilter_v3.Rmd index 0d393f0d..404b6fbf 100644 --- a/vignettes/web_only/repFilter_v3.Rmd +++ b/vignettes/web_only/repFilter_v3.Rmd @@ -36,7 +36,7 @@ library(immunarch) ``` # Data filtering -In many research cases, you could want to filter your data by metadata, clonotypes parameters or genes, so for this purpose, you can use the `repFilter` function. +In many research cases, you would want to filter your data by metadata, clonotypes parameters or genes, so for this purpose, you can use the `repFilter` function. ## Methods for filtering data `repFilter` has 3 parameters: `.method`, `.query` and `.match`. @@ -165,7 +165,7 @@ names(repFilter(immdata, .method = "by.meta", .query = list(Age = interval(15, 2 #### Example 3 -Filter for repertoires containing more than 60000 clonotypes: +Filter for repertoires containing more than 6000 clonotypes: ```{r example-3.1} repFilter(immdata, .method = "by.repertoire", .query = list(n_clonotypes = morethan(6000)))$meta @@ -196,7 +196,7 @@ Filter out all noncoding clonotypes from `immdata`. As you see, `immdata` datase repFilter(immdata, .method = "by.clonotype", .query = list(CDR3.aa = exclude("partial", "out_of_frame")))$meta ``` -Note that filtering `by.clonotype` or `by.cl` works within repertoire(sample). Repertoire(sample) could be removed from your data only in case if all clonotypes in sample do not satisfy the condition: +Note that filtering `by.clonotype` or `by.cl` works within repertoire(sample). Repertoire(sample) could be removed from your data only in case if all clonotypes in sample do not meet the condition: ```{r example-4.1.1} names(repFilter(immdata, .method = "by.clonotype", .query = list(CDR3.aa = exclude("partial", "out_of_frame")))$data) @@ -214,7 +214,7 @@ names(repFilter(immdata, .method = "by.clonotype", .query = list(Clones = moreth In method `by.clonotype` or `by.cl`, there is an extra argument `.match`. The `.match` argument can has the following values: - `exact` - looks for exact match in gene names - `substring`- looks for substring in gene names - - `startswith` - looks for gene names starting with the same pattern + - `startswith` - looks for gene names starting with the chosen pattern Filter out all clonotypes within samples with V gene 'TRBV1' or 'TRGV11' @@ -250,7 +250,7 @@ scdata$meta names(scdata$data) ``` -`repFiter` can also work with single-cell data containing not only `meta` and 'data`, but also extra information, for example, about clusters: +`repFiter` can also work with single-cell data containing not only `meta` and 'data`, but also extra information, e.g. about clusters: ```{r, results = 'hide'} repFilter(scdata, .method = "by.clonotype", .query = list(CDR3.aa = exclude("partial", "out_of_frame"))) @@ -334,4 +334,4 @@ p1 <- vis(exp_vol) exp_vol <- repExplore(repFilter(scdata_cl, .method = "by.meta", .query = list(Cluster = include('Naive')))$data, .method = "volume") p2 <- vis(exp_vol) p1+p2 -``` \ No newline at end of file +``` diff --git a/vignettes/web_only/v10_prop.Rmd b/vignettes/web_only/v10_prop.Rmd index 0391ed5e..71d165d8 100644 --- a/vignettes/web_only/v10_prop.Rmd +++ b/vignettes/web_only/v10_prop.Rmd @@ -23,7 +23,7 @@ output: - ```{r setup, include=FALSE, echo=FALSE} +```{r setup, include=FALSE, echo=FALSE} # knitr::knit_hooks$set(optipng = knitr::hook_optipng) # knitr::opts_chunk$set(optipng = '-o7') @@ -40,4 +40,4 @@ data(immdata) # Get in contact with us -Can not find an important feature? Have a question or found a bug? Contact us at support@immunomind.io +Cannot find an important feature? Have a question or found a bug? Contact us at support@immunomind.io diff --git a/vignettes/web_only/v11_db.Rmd b/vignettes/web_only/v11_db.Rmd index e937296d..7d60d468 100644 --- a/vignettes/web_only/v11_db.Rmd +++ b/vignettes/web_only/v11_db.Rmd @@ -23,7 +23,7 @@ output: - ```{r setup, include=FALSE, echo=FALSE} +```{r setup, include=FALSE, echo=FALSE} # knitr::knit_hooks$set(optipng = knitr::hook_optipng) # knitr::opts_chunk$set(optipng = '-o7') @@ -38,9 +38,9 @@ data(immdata) # Introduction to immune receptor databases -Databases with aggregated information about immune receptor specificity provide a straightforward way to annotate your data and find condition-associated receptors. `immunarch` supports tools to annotate your data using the most popular AIRR databases - VDJDB, McPAS-TCR and PIRD TBAdb. +Databases with aggregated information about immune receptor specificity provide a straightforward way to annotate your data and find condition-associated receptors. `immunarch` supports the tools to annotate your data using the most popular AIRR databases - VDJDB, McPAS-TCR and PIRD TBAdb. -Database annotation is a two-step process. First, you need to download database files - either full database files or filtered data obtained from the web interface of databases. After that, you can use `immunarch` functions to annotate your data and visualise the results. Below you can find a guide to annotation covering both steps. +Database annotation is a two-step process. First, you need to download database files - either the full database or filtered data obtained from the web interface of the database. After that, you can use `immunarch` functions to annotate your data and visualise the results. Below you can find a guide for annotation covering both steps. # Downloading databases @@ -53,15 +53,15 @@ Citation: *Shugay M et al. VDJdb: a curated database of T-cell receptor sequence ### How to filter and download data -It can be useful to filter out immune receptors that are not relevant from the database before working with it. For instance, if you analyse human T-cell beta repertoires, it is not necessary to keep immune receptors from other species, as well as non-TRB data. Use the web interface to VDJDB located at https://vdjdb.cdr3.net/search to filter out data. Having filtered the data and pressed the "Refresh table" button, locate the "Export" button and select the "TSV" label inside. You will download the filtered database file with a name like "SearchTable-2019-10-17 12_36_11.989.tsv", which can be used for annotation with `immunarch`. +It can be useful to filter out immune receptors that are not relevant from the database before working with it. For instance, if you analyse human T-cell beta repertoires, it is not necessary to keep immune receptors from other species, as well as non-TRB data. Use the web interface to VDJDB located at https://vdjdb.cdr3.net/search to filter out data. Having filtered the data and pressed the "Refresh table" button, locate the "Export" button and select the "TSV" label inside. It will start the downloading of the filtered database file with a name like "SearchTable-2019-10-17 12_36_11.989.tsv", which can be used for annotation with `immunarch`. ### How to download full VDJDB -You can use the previous method to download the full database if you set all checkmarks in the "General" section of the "CDR3" tab. However, if you want to download the raw database files, here is the step-by-step guide to the sophisticated process of VDJDB downloading and unpacking. +You can use the previous method to download the full database if you set all check marks in the "General" section of the "CDR3" tab. However, if you want to download the raw database files, here is the step by step guide to the rather complicated process of VDJDB downloading and unpacking. 1. First, you need to install JDK 8 - Java Development Kit. If you already have it, skip this step. If you don't, just search for the proper installation instructions for your system. -2. Second, you need to install Groovy - a language that is used for processing VDJDB. Go to [this link](https://groovy.apache.org/download.html#distro) and download the distribution or windows installer depending on your system. For Windows users the best way is to download Windows installer. For Linux users the easiest way is to use OS package manager such as apt, dpkg, pacman, etc. For Mac users the most seamless way is to use [Homebrew](https://brew.sh). +2. Second, you need to install Groovy - a language that is used for processing VDJDB. Go to [this link](https://groovy.apache.org/download.html#distro) and download the distribution or windows installer depending on your system. For Windows users the best way is to download the Windows installer. For Linux users the easiest way is to use OS package manager such as apt, dpkg, pacman, etc. For Mac users the most seamless way is to use [Homebrew](https://brew.sh). 3. Download the VDJDB repository from GitHub via this link: https://github.com/antigenomics/vdjdb-db/archive/master.zip @@ -82,7 +82,7 @@ Citation: *Tickotsky N, Sagiv T, Prilusky J, Shifrut E, Friedman N (2017). McPAS ### How to filter and download data -The filtering feature of the database's web interface is located in the "Search Database" tab. After processing the data, press the "Download csv" button. The downloaded file named "McPAS-TCR_search.csv" can be used for annotation with `immunarch`. +The filtering feature of the database's web interface is located in the "Search Database" tab. After processing the data, press the "Download .csv" button. The downloaded file named "McPAS-TCR_search.csv" can be used for annotation with `immunarch`. ### How to download full McPAS-TCR @@ -98,19 +98,19 @@ Currently there is no direct way to download TBAdb. Citation: *ZHANG W, Wang L, Liu K, Wei X, Yang K, Du W, Wang S, Guo N, Ma C, Luo L, et al. PIRD: Pan immune repertoire database. Bioinformatics(2019)* -# Annotation of clonotypes +# Annotation of the clonotypes After downloading the database, we can proceed to the annotation part with R. To demonstrate the applicability of R and `immunarch`, we will use a common task of annotation of repertoires with Cytomegalovirus (CMV) infection. ## Preprocessing databases with R -For the start, we need to load databases into R and filter out non-human, non-TRB and non-CMV data from the input database. With databases, we follow the same philosophy as with `repLoad` and `vis` functions: the function `dbLoad` provides a singular interface to loading and basic quering for all supported databases. +As a start, we need to load databases into R and filter out non-human, non-TRB and non-CMV data from the input database. With databases, we follow the same philosophy as with `repLoad` and `vis` functions: the function `dbLoad` provides a single interface to loading and basic quering of all supported databases. For demonstration purposes, we will process each of the supported databases below. ### VDJDB -Download the VDJDB database following the instructions above. In the examples below, we use URLs to snippets of databases as file paths. In your own code you need to provide paths to your local database files, e.g., "/Users/yourname/Downloads/vdjdb-db-master/vdjdb.slim.txt". Do not use the links below since they are only for testing purposes and not the actual databases! +Download the VDJDB database following the instructions above. In the examples, we use URLs to snippets of databases as file paths. In your own code you need to provide paths to your local database files, e.g., "/Users/yourname/Downloads/vdjdb-db-master/vdjdb.slim.txt". Do not use the links below since they are only for testing purposes and do not contain the actual databases! Note that VDJDB data obtained from the web interface differs from VDJDB obtained from raw files. Check the next section for working with VDJDB search tables. @@ -151,7 +151,7 @@ tbadb ## Repertoire annotation -The key `immunarch` function for annotation is `dbAnnotate`. As an input it requires repertoires to search in, a database to lookup from, and columns of interest such as CDR3 amino acid sequence or V gene segment names columns. If you want to try it on the test data packaged with `immunarch`, execute the following line of code before proceeding further: +The key `immunarch` function for annotation is `dbAnnotate`. As an input it requires repertoires to search in, a database to look up, and columns of interest such as CDR3 amino acid sequence or V gene segment names columns. If you want to try it on the test data packaged with `immunarch`, execute the following line of code before proceeding further: ```{r eval=F} data(immdata) @@ -171,7 +171,7 @@ In the next example we will search the McPAS-TCR database for condition-associat dbAnnotate(immdata$data, mcpas, c("CDR3.aa", "V.name"), c("CDR3.beta.aa", "TRBV")) ``` -If you seek to search a database for a specific set of sequences, create a data frame with them and use it as a database file: +If you seek to search a database for a specific set of sequences, create a data frame containing them and use it as a database file: ```{r eval=T} local_db = data.frame(Seq = c("CASSDSSGGANEQFF", "CSARLAGGQETQYF"), V = c("TRBV6-4", "TRBV20-1"), stringsAsFactors = F) @@ -213,4 +213,4 @@ table(vdjdb$antigen.epitope) # Get in contact with us -Can not find an important feature? Have a question or found a bug? Contact us at support@immunomind.io +Cannot find an important feature? Have a question or found a bug? Contact us at support@immunomind.io diff --git a/vignettes/web_only/v21_singlecell.Rmd b/vignettes/web_only/v21_singlecell.Rmd index fa27cdaf..f73e8b6c 100644 --- a/vignettes/web_only/v21_singlecell.Rmd +++ b/vignettes/web_only/v21_singlecell.Rmd @@ -41,15 +41,15 @@ data(scdata) > This is a vignette dedicated to provide an overview on how to work with single-cell paired chain data in `immunarch` -> Single-cell support is currently in the development version. In order to access it, you need to install the latest development version of the package by executing the following command: +> Single-cell support is currently in the development version. In order to access it, you need to install the latest development version of the package by executing the following command: ```r install.packages("devtools"); devtools::install_github("immunomind/immunarch", ref="dev") ``` -> To read paired chain data into `immunarch` use the `repLoad` function with `.mode = "paired"`. Currently we support 10X Genomics only. +> To read paired chain data with `immunarch` use the `repLoad` function with `.mode = "paired"`. Currently we support 10X Genomics only. -> To subset immune repertoires by specific barcodes use the `select_barcodes` function. Output of `Seurat::Idents()` as a barcode vector works. +> To create subset immune repertoires with specific barcodes use the `select_barcodes` function. Output of `Seurat::Idents()` as a barcode vector works. > To create cluster-specific and patient-specific datasets using barcodes from the output of `Seurat::Idents()` use the `select_clusters` function. diff --git a/vignettes/web_only/v3_basic_analysis.Rmd b/vignettes/web_only/v3_basic_analysis.Rmd index d28c1853..c1ab201d 100644 --- a/vignettes/web_only/v3_basic_analysis.Rmd +++ b/vignettes/web_only/v3_basic_analysis.Rmd @@ -39,9 +39,9 @@ data(immdata) ``` # Basic analysis -For each task in this section `immunarch` includes separate functions that are generally self-explanatory and are written in camel-case. +For each task in this section `immunarch` includes separate functions that are generally self-explanatory and are written in CamelCase. -Note: all functions in immunarch require that the input immune repertoire data list have names. If you use the `repLoad` function, you will have no issues. If you compose your list by-hand, you must name elements in the list, e.g.: +Note: all functions in immunarch require that the input immune repertoire data list have names. If you use the `repLoad` function, you will have no issues. If you compose your list by hand, you must name elements in the list, e.g.: ```{r rename-list, eval=F} your_data # Your list with repertoires without names @@ -56,31 +56,31 @@ names(your_data) Basic analysis functions are: -- `repExplore` - to compute basic statistics, such as number of clones or distributions of lengths and counts. To explore them you need to pass the statistics, e.g. `count`, to the `.method`. +- `repExplore` - computes basic statistics, such as number of clones or distributions of lengths and counts. To explore them you need to pass the statistics, e.g. `count`, to the `.method`. -- `repClonality` - to compute the clonality of repertoires. +- `repClonality` - computes the clonality of repertoires. -- `repOverlap` - to compute the repertoire overlap. +- `repOverlap` - computes the repertoire overlap. -- `repOverlapAnalysis` - to analyse the repertoire overlap, including different clustering procedures and PCA. +- `repOverlapAnalysis` - analyses the repertoire overlap, including different clustering procedures and PCA. -- `geneUsage` - to compute the distributions of V or J genes. +- `geneUsage` - computes the distributions of V or J genes. -- `geneUsageAnalysis` - to analyse the distributions of V or J genes, including clustering and PCA. +- `geneUsageAnalysis` - analyses the distributions of V or J genes, including clustering and PCA. -- `repDiversity` - to estimate the diversity of repertoires. +- `repDiversity` - estimates the diversity of repertoires. -- `trackClonotypes` - to analyse the dynamics of repertoires across time points. +- `trackClonotypes` - analyses the dynamics of repertoires across time points. -- `spectratype` - to compute spectratype of clonotypes. +- `spectratype` - computes spectratype of clonotypes. -- `getKmers` and `kmer_profile` - to compute distributions of kmers and sequence profiles. +- `getKmers` and `kmer_profile` - computes distributions of kmers and sequence profiles. ## How to visualise analysis results -Output of each analysis function could be passed directly to the `vis` function - the general function for visualisation. Examples of usage are written below. -Almost all visualisations of analysis results are support grouping data by their respective properties from the metadata table or using user-supplied properties. Grouping is possible by passing either `.by` argument or by passing both `.by` and `.meta` arguments to the `vis` function. +Output of each analysis function could be passed directly to the `vis` function - the general function for visualisation. Examples of usage are given below. +Almost all visualisations of analysis involves grouping data by their respective properties from the metadata table or using user-supplied properties. Grouping is possible by passing either `.by` argument or by passing both `.by` and `.meta` arguments to the `vis` function. -1) You can pass `.by` as a character vector with one or several column names from the metadata table to group your data before plotting. In this case you should provide also the `.meta` argument with the metadata table. +1) You can pass `.by` as a character vector with one or several column names from the metadata table to group your data before plotting. In this case you should also provide the `.meta` argument with the metadata table. ```{r eda-by-meta, warning=F, fig.width=10, fig.height=4.5} exp_vol <- repExplore(immdata$data, .method = "volume") @@ -89,7 +89,7 @@ p2 <- vis(exp_vol, .by = c("Status", "Sex"), .meta = immdata$meta) p1 + p2 ``` -2) You can pass `.by` as a character vector that exactly matches the number of samples in your data, each value should correspond to a sample's property. It will be used to group data based on the values provided. Note that in this case you should pass NA to `.meta`. +2) You can pass `.by` as a character vector that matches the number of samples in your data, each value should correspond to a sample's property. It will be used to group data based on the values provided. Note that in this case you should pass NA to `.meta`. ```{r eda-by-only, warning=F, fig.width=6, fig.height=4.5} exp_vol <- repExplore(immdata$data, .method = "volume") @@ -98,7 +98,7 @@ p <- vis(exp_vol, .by = by_vec) p ``` -If data is grouped, than statistical tests for comparing means of groups will be performed, unless `.test = F` is supplied. +Once data is grouped, the statistical tests for comparing means of groups will be performed, unless `.test = F` is supplied. In case there are only two groups, the [Wilcoxon rank sum test](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test) is performed (R function `wilcox.test` with an argument `exact = F`) for testing if there is a difference in mean rank values between two groups. In case there more than two groups, the [Kruskal-Wallis test](https://en.wikipedia.org/wiki/Kruskal%E2%80%93Wallis_one-way_analysis_of_variance) is performed (R function `kruskal.test`), that is equivalent to ANOVA for ranks and it tests whether samples from different groups originated from the same distribution. A significant Kruskal-Wallis test indicates that at least one sample stochastically dominates one other sample. Adjusted for multiple comparisons P-values are plotted on the top of groups. P-value adjusting is done using the [Holm method](https://en.wikipedia.org/wiki/Holm%E2%80%93Bonferroni_method) (also known as Holm-Bonferroni correction). You can execute the command `?p.adjust` in the R console to see more. diff --git a/vignettes/web_only/v4_overlap.Rmd b/vignettes/web_only/v4_overlap.Rmd index d7d2409c..a3d350e1 100644 --- a/vignettes/web_only/v4_overlap.Rmd +++ b/vignettes/web_only/v4_overlap.Rmd @@ -44,7 +44,7 @@ Repertoire overlap is the most common approach to measure repertoire similarity. - overlap coefficient (`.method = "overlap"`) - a normalised measure of overlap similarity. It is defined as the size of the intersection divided by the smaller of the size of the two sets. -- Jaccard index (`.method = "jaccard"`) - it measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets. +- Jaccard index (`.method = "jaccard"`) - measures the similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets. - Tversky index (`.method = "tversky"`) - an asymmetric similarity measure on sets that compares a variant to a prototype. If using default arguments, it's similar to Dice's coefficient. @@ -54,7 +54,7 @@ Repertoire overlap is the most common approach to measure repertoire similarity. - incremental overlap - overlaps of the N most abundant clonotypes with incrementally growing N (`.method = "inc+METHOD"`, e.g., `"inc+public"` or `"inc+morisita"`). -The function that includes described methods is `repOverlap`. Again the output is easily visualised when passed to `vis()` function that does all the work: +The function that includes described methods is `repOverlap`. Again, the output is easily visualised when passed to `vis()` function that does all the work: ```{r overlap, message=F, warning=FALSE, fig.width=12, fig.height=7} imm_ov1 <- repOverlap(immdata$data, .method = "public", .verbose = F) diff --git a/vignettes/web_only/v5_gene_usage.Rmd b/vignettes/web_only/v5_gene_usage.Rmd index a9afa6c2..85def99f 100644 --- a/vignettes/web_only/v5_gene_usage.Rmd +++ b/vignettes/web_only/v5_gene_usage.Rmd @@ -46,7 +46,7 @@ data(immdata) gene_stats() ``` -To compute the distributions of genes, `immunarch` includes the `geneUsage` function. It receives a repertoire or a list of repertoires as input and genes and species for which you want to get the statistics. E.g., if you plan to use `TRBV` genes of `Homo Sapiens`, you need to use the `hs.trbv` string in the function, where `hs` comes from the `alias` column and `trbv` is the gene name. Of if you plan to use `IGHJ` genes of `Mus Musculus`, you need to use `musmus.ighj`: +To compute the distribution of genes, `immunarch` includes the `geneUsage` function. It receives a repertoire or a list of repertoires as input and genes and species for which you want to get the statistics. E.g., if you plan to use `TRBV` genes of `Homo Sapiens`, you need to use the `hs.trbv` string in the function, where `hs` comes from the `alias` column and `trbv` is the gene name. In case you plan to use `IGHJ` genes of `Mus Musculus`, you need to use `musmus.ighj`: ```{r} # Next four function calls are equal. "hs" is from the "alias" column. @@ -93,7 +93,7 @@ vis(imm_gu, .by = "Status", .meta = immdata$meta, .plot = "box") ## Ambiguity of gene segment names Due to the ambiguity of gene alignments for some clonotypes, `geneUsage` has the following options to deal with ambiguous data: -- `.ambig = "inc"` - includes all possible combinations of ambiguous gene alignments from the data. NOTE: ImmunoSEQ formats use non-standart gene segment names, so it is preferable to use this argument value with ImmunoSEQ formats. This argument is ON by default to ease the gene manipulation. Feel free to change it to `"exc"` in case of other data formats. It is ON by default, we recommend it to leave it that way. +- `.ambig = "inc"` - includes all possible combinations of ambiguous gene alignments from the data. NOTE: ImmunoSEQ formats use non-standart gene segment names, so it is preferable to use this argument value with ImmunoSEQ formats. This argument is ON by default to ease the gene manipulation. Feel free to change it to `"exc"` in case of other data formats. It is ON by default, and we recommend to leave it that way. - `.ambig = "exc"` - filters out all clonotypes with ambiguous gene alignments. diff --git a/vignettes/web_only/v6_diversity.Rmd b/vignettes/web_only/v6_diversity.Rmd index 4731bb56..42b0228e 100644 --- a/vignettes/web_only/v6_diversity.Rmd +++ b/vignettes/web_only/v6_diversity.Rmd @@ -38,13 +38,13 @@ library(immunarch) data(immdata) ``` -Sevral approaches to the estimation of repertoire diversity are implemented in the `repDiversity` function. The `.method` parameter similarly to abovementionned functions sets the means for diversity estimation. You can choose one of the following methods: +There are several approaches to the estimation of repertoire diversity implemented in the `repDiversity` function. The `.method` parameter similarly to above mentioned functions sets the means for diversity estimation. You can choose one of the following methods: -- `chao1` - Chao1 estimator is a nonparameteric asymptotic estimator of species richness (number of species in a population). +- `chao1` - is a nonparameteric asymptotic estimator of species richness (number of species in a population). - `hill` - Hill numbers are a mathematically unified family of diversity indices (differing only by an exponent q). -- `div` - True diversity, or the effective number of types, refers to the number of equally-abundant types needed for the average proportional abundance of the types to equal that observed in the dataset of interest where all types may not be equally abundant. +- `div` - true diversity, or the effective number of types, refers to the number of equally abundant types needed for the average proportional abundance of the types to equal that observed in the dataset of interest where all types may not be equally abundant. - `gini.simp` - The Gini-Simpson index is the probability of interspecific encounter, i.e., probability that two entities represent different types. @@ -54,7 +54,7 @@ Sevral approaches to the estimation of repertoire diversity are implemented in t - `raref` - Rarefaction is a technique to assess species richness from the results of sampling through extrapolation. -The `.col` parameter regulates what sequences and gene segments to choose. For example, if you want to estimate diversity on the nucleotide level, you need to supply `.col = "nt"`, on the amino acid level - `.col = "aa"`. If you want to estimate diversity of amino acid CDR3 sequences coupled with V gene segments, you need to provide `.col = "aa+v"`. By default `.col = "aa"`. +The `.col` parameter regulates what sequences and gene segments to choose. For example, if you want to estimate diversity on the nucleotide level, you need to supply `.col = "nt"`, in case you want to estimate the diversity on the amino acid level - `.col = "aa"`. If you want to estimate diversity of the amino acid CDR3 sequences coupled with V gene segments, you need to provide `.col = "aa+v"`. By default `.col = "aa"`. ```{r diversity, fig.width=10, fig.height=4, warning=FALSE, message=FALSE} # Compute statistics and visualise them diff --git a/vignettes/web_only/v7_fixvis.Rmd b/vignettes/web_only/v7_fixvis.Rmd index b6524d43..a3d5f4aa 100644 --- a/vignettes/web_only/v7_fixvis.Rmd +++ b/vignettes/web_only/v7_fixvis.Rmd @@ -32,7 +32,7 @@ knitr::opts_chunk$set(fig.height = 6) library(immunarch) ``` -In order to make the visualization process easier, we developed a Shiny-based tool called FixVis. The FixVis is distributed with `immunarch` package. To run FixVis just apply it to any ggplot2-based plot, including those from the `immunarch`'s `vis` function. It works perfectly for every plot. FixVis could work only with one plot at a time. +In order to make the visualization process easier, we developed a Shiny-based tool called FixVis. It is distributed with the `immunarch` package. To run FixVis just apply it to any ggplot2-based plot, including those from the `immunarch`'s `vis` function. It works perfectly for every plot. FixVis can work only with one plot at a time. Here is an example of how to run FixVis with `vis` function from `immunarch`. First, the plot is generated by `vis` function from `immunarch` and then FixVis is applied: @@ -50,7 +50,7 @@ If you want to test it, run `fixVis` with no arguments; it will use the diamonds fixVis() ``` -For convenience we published a FixVis instances that are available online [here](https://immunomind.shinyapps.io/fixvis/). Also a FixVis instance for quick overview is embedded in the cell below: +For your convenience we published a FixVis instances that are available online [here](https://immunomind.shinyapps.io/fixvis/). Also, a quick overview of FixVis is embedded below: ```{r fixvis3, eval=T} knitr::include_app("https://immunomind.shinyapps.io/fixvis/", height = "800px") diff --git a/vignettes/web_only/v8_tracking.Rmd b/vignettes/web_only/v8_tracking.Rmd index 250f4214..8918632e 100644 --- a/vignettes/web_only/v8_tracking.Rmd +++ b/vignettes/web_only/v8_tracking.Rmd @@ -23,7 +23,7 @@ output: - ```{r setup, include=FALSE, echo=FALSE} +```{r setup, include=FALSE, echo=FALSE} # knitr::knit_hooks$set(optipng = knitr::hook_optipng) # knitr::opts_chunk$set(optipng = '-o7') @@ -37,20 +37,20 @@ data(immdata) ``` # Tracking of clonotypes -Clonotype tracking is popular approach to monitor changes in frequency of clonotypes of interest in vaccination and cancer immunology. For example, a researcher can track a clonotype of across different time points in pre- and post-vaccination repertoires. Or analyse growth of malignant clonotypes in tumor sample. +Clonotype tracking is a popular approach to monitor changes in the frequency of clonotypes of interest in vaccination and cancer immunology. For example, a researcher can track a clonotype across different time points in pre- and post-vaccination repertoires, or analyse the growth of malignant clonotypes in a tumor sample. Various methods of clonotype tracking are integrated into the one `trackClonotypes` function. Currently, there are three methods to choose from. The output of `trackClonotypes` can be immediately visualised with the `vis` function. ## Tracking the most abundant clonotypes The simplest approach is to choose the most abundant clonotypes from one of the input immune repertoires and track across all immune repertoires in a batch. Arguments `.which` and `.col` are used to choose the immune repertoire, the number of clonotypes to take from it and which columns to use. -To choose the 10 most abundant clonotypes from the first repertoire and track them using their CDR3 nucleotide sequence: +To choose the top 10 most abundant clonotypes from the first repertoire and track them using their CDR3 nucleotide sequence use this code: ```{r warning=F} tc1 <- trackClonotypes(immdata$data, list(1, 5), .col = "nt") ``` -Value `list(1, 5)` of the `.which` argument (second argument) means to choose 10 clonotypes from the 1st repertoire in the input list of repertoires `immdata$data`. Value `"nt"` of the `.col` argument means that the function should take CDR3 nucleotide sequences only. +Value `list(1, 5)` of the `.which` argument (the second argument) means to choose 10 clonotypes from the 1st repertoire in the input list of repertoires `immdata$data`. Value `"nt"` of the `.col` argument means that the function should apply to CDR3 nucleotide sequences only. To choose the 10 most abundant amino acid clonotype sequences and their V genes from the "MS1" repertoire to track: @@ -69,7 +69,7 @@ p1 / p2 ``` ## Tracking clonotypes with specific nucleotide or amino acid sequences -In order to track specific clonotype sequences, you can provide nucleotide or amino acid sequences as the `.which` argument, along with the column `.col` specifying in which columns to search for sequences. For example, to track seven CDR3 amino acid sequences specifyed below you need to execute the following code: +In order to track specific clonotype sequences, you can provide nucleotide or amino acid sequences as the `.which` argument, along with the column `.col` specifying in which columns to search for sequences. For example, to track seven CDR3 amino acid sequences specified below you need to execute the following code: ```{r warning=F} target <- c("CASSLEETQYF", "CASSDSSGGANEQFF", "CASSDSSGSTDTQYF", "CASSLAGGYNEQFF", "CASSDSAGGTDTQYF", "CASSLDSYEQYF", "CASSSAGGYNEQFF") @@ -79,7 +79,7 @@ vis(tc) ## Tracking clonotypes with specific sequences and gene segments -An improvement over the previous approach, it is possible to track clonotypes using information about both sequences and gene segments. Support you have a data frame of sequences with specific CDR3 sequences and gene segments. We will simulate this by choosing the 10 most abundant clonotypes from the first repertoire in the batch: +An improvement upon the previous approach, it is possible to track clonotypes using information about both sequences and gene segments. For support you can use a data frame of sequences with specific CDR3 sequences and gene segments. We will simulate this below by choosing the 10 most abundant clonotypes from the first repertoire in the batch: ```{r} target <- immdata$data[[1]] %>% @@ -101,8 +101,8 @@ Note that you can use any columns in the `target` data frame, such as both CDR3 # Visualisation of tracking There are three ways to visualise clonotype tracking, depending on your research and aesthetic needs. To choose the type of plot, you need to provide the `".plot"` parameter to the `vis()` function, specifying one of three plot types: - `.plot = "smooth"` - used by default, a visualisation using smooth lines and stacked bar plots; - - `.plot = "area"` - visualise abundances using areas under the abundance lines; - - `.plot = "line"` - visualise only lines, connecting levels of abundances of a same clonotype between time points. + - `.plot = "area"` - visualises abundances using areas under the abundance lines; + - `.plot = "line"` - visualises only the lines, connecting levels of abundances of a same clonotype between time points. ```{r warning=F, fig.height=5} target <- c("CASSLEETQYF", "CASSDSSGGANEQFF", "CASSDSSGSTDTQYF", "CASSLAGGYNEQFF", "CASSDSAGGTDTQYF", "CASSLDSYEQYF", "CASSSAGGYNEQFF") @@ -129,7 +129,7 @@ vis(tc, .order = c("A2-i129", "A2-i133", "A4-i191")) If your metadata contains information about time such as timepoints for vaccination or tumor samples, you can use it to re-order samples accordingly. In our examples `immdata$meta` does not contain information about timepoints, so we will simulate this case. -First, we create an additional column in the metadata with randomly choosen timepoints: +First, we create an additional column in the metadata with randomly chosen time points: ```{r} immdata$meta$Timepoint <- sample(1:length(immdata$data)) immdata$meta @@ -172,4 +172,4 @@ Run `?scale_fill_brewer` in the R console to learn more about ColorBrewer and it # Get in contact with us -Can not find an important feature? Have a question or found a bug? Contact us at support@immunomind.io +Cannot find an important feature? Have a question or found a bug? Contact us at support@immunomind.io diff --git a/vignettes/web_only/v9_kmers.Rmd b/vignettes/web_only/v9_kmers.Rmd index d1861d6a..5fb7f494 100644 --- a/vignettes/web_only/v9_kmers.Rmd +++ b/vignettes/web_only/v9_kmers.Rmd @@ -17,13 +17,13 @@ output: - ```{r setup, include=FALSE, echo=FALSE} +```{r setup, include=FALSE, echo=FALSE} # knitr::knit_hooks$set(optipng = knitr::hook_optipng) # knitr::opts_chunk$set(optipng = '-o7') @@ -38,21 +38,21 @@ data(immdata) # Kmer statistics computation -Counting kmer occurrences in `immunarch` is very easy. All you need to do is to run the `getKmers()` function on your data: +Counting k-mer occurrences in `immunarch` is rather straightforward. All you need to do is to run the `getKmers()` function on your data: ```{r} kmers <- getKmers(immdata$data[[1]], 3) kmers ``` -If is possible to compute occurrences of kmers in a batch of immune repertoires. In order to do that, you just need to provide a list of immune repertoires to the function. NA means that there is no such kmer found in a sample, specifyed by the column name. +It is also possible to compute occurrences of k-mers in a batch of immune repertoires. In order to do that, you just need to provide a list of immune repertoires to the function. NA means that there is no such kmer found in a sample, specified by the column name. ```{r} kmers <- getKmers(immdata$data, 5) kmers ``` -Note that by default `getKmers()` filter out all non-coding sequences before counting the kmer statistics. You can use both coding and non-coding sequences by setting the `.coding` argument to FALSE: +Note that by default `getKmers()` filter out all non-coding sequences before counting the k-mer statistics. You can use both coding and non-coding sequences by setting the `.coding` argument to FALSE: ```{r} kmers <- getKmers(immdata$data[[1]], 3, .coding = F) @@ -61,14 +61,14 @@ kmers ## Kmer statistics visualisation -To visualise your kmer statistics, the `vis()` function comes to help: +To visualise your k-mer statistics, the `vis()` function comes to help: ```{r} kmers <- getKmers(immdata$data, 5) vis(kmers) ``` -The `vis()` function for kmers has a number of arguments to manipulate the plot. First, the `.head` argument specifies the number of the most abundant kmers to visualise. +The `vis()` function for k-mers has a number of arguments to manipulate the plot. First, the `.head` argument specifies the number of the most abundant k-mers to visualise. ```{r} p1 <- vis(kmers, .head = 5) p2 <- vis(kmers, .head = 10) @@ -87,9 +87,9 @@ p3 <- vis(kmers, .head = 10, .position = "dodge") (p1 + p2) / p3 ``` -Option "stack" stacks all bars on top of each other so you can see the full distribution of kmers. Option "fill" stack all bars on top of each other as well, but normalises it in a such way so you see distribution of counts per-kmer, i.e., you can clearly see which repertoire has more kmer counts than others for a specific kmer. Option "dodge" groups kmer bars of different samples so you can clearly see, which samples has more kmer occurrences overall. +Option "stack" stacks all the bars on top of each other so you can see the full distribution of k-mers. Option "fill" stack all bars on top of each other as well, but normalises it in such a way that you can see distribution of counts per k-mer, i.e., you can clearly see which repertoire has more k-mer counts than others for a specific k-mer. Option "dodge" groups k-mer bars of different samples so that you can clearly see which samples has more k-mer occurrences overall. -Additional argument is `.log` needed if your distribution of kmer counts is vastly imbalanced for some of repertoires. It permits to use the log-transformation of y-axis so you can see differences in orders of magnitude in kmer counts rather. +Additional argument is `.log` needed if your distribution of k-mer counts is vastly imbalanced for some of repertoires. It permits to use the log-transformation of y-axis so you can see differences in orders of magnitude in k-mer counts. ```{r} p1 <- vis(kmers, .head = 10, .position = "stack") @@ -107,14 +107,14 @@ p1 + p2 - position weight matrix (PWM) - a matrix with log likelihoods of PPM elements; - a matrix with self-information of elements in PWM. -To compute and visualise sequence motifs, first you need to compute kmer statistics for one of the input immune repertoires, and then apply the `kmer_profile()` function to compute sequence motif matrices: +To compute and visualise sequence motifs, first you need to compute k-mer statistics for one of the input immune repertoires, and then apply the `kmer_profile()` function to compute sequence motif matrices: ```{r} kmers <- getKmers(immdata$data[[1]], 5) kmer_profile(kmers) ``` -Currently we are not supporting sequence motifs analysis for more than one sample, but we are working on it. In order to compute and visualise sequence motif matrices for all of your samples you need to process them one-by-one, which can be easily done in for-loops or via the `lapply()` function. +Currently we do not support sequence motifs analysis for more than one sample, but we are working on including it into our following release. In order to compute and visualise sequence motif matrices for all of your samples you need to process them one by one, which can be easily done in for-loops or via the `lapply()` function. Argument `.method` specifies which matrix to compute: @@ -145,4 +145,4 @@ p1 + p2 # Get in contact with us -Can not find an important feature? Have a question or found a bug? Contact us at support@immunomind.io +Cannot find an important feature? Have a question or found a bug? Contact us at support@immunomind.io