| Title: | Geographically-Conscious Taxonomic Assignment for Metabarcoding |
|---|---|
| Description: | A bioinformatics pipeline for performing taxonomic assignment of DNA metabarcoding sequence data while considering geographic location. A detailed tutorial is available at <https://urodelan.github.io/LocaTT_Tutorial/>. A manuscript describing these methods is in preparation. |
| Authors: | Kenen B Goodwin [aut, cre] (ORCID: <https://orcid.org/0000-0002-9219-7693>), Christopher Cousins [ctb], Taal Levi [ctb], Tiffany S Garcia [ths] |
| Maintainer: | Kenen B Goodwin <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.2.7 |
| Built: | 2026-05-31 22:15:48 UTC |
| Source: | https://github.com/urodelan/locatt |
Performs adjustments to a taxonomy system according to a taxonomy edits file.
adjust_taxonomies(path_to_input_file, path_to_output_file, path_to_taxonomy_edits)
path_to_input_file |
String specifying path to list of species (in CSV format) whose taxonomies are to be adjusted. The file should contain the following fields: 'Common_Name', 'Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species'. There should be no |
path_to_output_file |
String specifying path to output species list with adjusted taxonomies. The output file will be in CSV format. |
path_to_taxonomy_edits |
String specifying path to taxonomy edits file in CSV format. The file must contain the following fields: 'Old_Taxonomy', 'New_Taxonomy', 'Notes'. Old taxonomies are replaced with new taxonomies in the order the records appear in the file. The taxonomic levels in the 'Old_Taxonomy' and 'New_Taxonomy' fields should be delimited by a semi-colon. |
No return value. Writes an output CSV file with adjusted taxonomies.
get_taxonomies.species_binomials for remotely fetching NCBI taxonomies from species binomials. get_taxonomies.IUCN for formatting taxonomies from the IUCN Red List.
# Get path to input file.
path_to_input_file<-system.file("extdata", "example_local_taxa_list.csv", package="LocaTT", mustWork=TRUE)
# Get path to taxonomy edits.
path_to_taxonomy_edits<-system.file("extdata", "example_taxonomy_edits.csv", package="LocaTT", mustWork=TRUE)
# Create temporary output file path.
path_to_output_file<-tempfile(fileext=".csv")
# Adjust taxonomies.
adjust_taxonomies(path_to_input_file=path_to_input_file, path_to_output_file=path_to_output_file, path_to_taxonomy_edits=path_to_taxonomy_edits)
Performs binomial tests.
binomial_test(k, n, p, alternative = "greater")
k |
A numeric vector of the number of successes. |
n |
A numeric vector of the number of trials. |
p |
A numeric vector of the hypothesized probabilities of success. |
alternative |
A string specifying the alternative hypothesis. Must be |
Calls on the pbinom function in the stats package to perform vectorized binomial tests. Arguments are recycled as in pbinom. Only one-sided tests are supported, and only p-values are returned.
A numeric vector of p-values from the binomial tests.
binomial_test(k=c(5,1,7,4), n=c(10,3,15,5), p=c(0.2,0.1,0.5,0.6), alternative="greater")
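The details above state that the test is built on `stats::pbinom` and is vectorized. A minimal base R sketch of how a one-sided "greater" p-value can be obtained that way is shown below. The helper name `binom_p_sketch` is hypothetical and not part of LocaTT, and the package's own handling of other alternatives or edge cases may differ.

```r
# Hypothetical helper (not part of LocaTT): vectorized upper-tail binomial p-values.
binom_p_sketch <- function(k, n, p) {
  # P(X >= k) equals P(X > k - 1), which is the upper tail of pbinom at k - 1.
  stats::pbinom(q = k - 1, size = n, prob = p, lower.tail = FALSE)
}

# Arguments are recycled by pbinom, so vectors of tests are handled in one call.
p_values <- binom_p_sketch(k = c(5, 1, 7, 4),
                           n = c(10, 3, 15, 5),
                           p = c(0.2, 0.1, 0.5, 0.6))
```

Each element of `p_values` should agree with `stats::binom.test(..., alternative = "greater")$p.value` for the matching `k`, `n`, and `p`.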
Checks whether a BLAST program can be found.
blast_command_found(blast_command)
blast_command |
String specifying the path to a BLAST program. |
Logical. Returns TRUE if the BLAST program could be found.
blast_command_found(blast_command="blastn")
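For readers who want to see how such a check can be done without the package, the sketch below uses base R's `Sys.which` to test whether a command resolves to an executable. This is an illustration under the assumption that a lookup on the system path is sufficient; `command_found_sketch` is a hypothetical name, and LocaTT may implement the check differently.

```r
# Hypothetical helper (not part of LocaTT): TRUE if the command resolves to an executable.
command_found_sketch <- function(command) {
  # Sys.which returns "" when the command cannot be located.
  unname(nzchar(Sys.which(command)))
}

# Check an executable name that is very unlikely to exist on any system.
missing_found <- command_found_sketch("locatt_sketch_nonexistent_command_12345")
```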
Gets the version of a BLAST program.
blast_version(blast_command = "blastn")
blast_command |
String specifying the path to a BLAST program. The default ( |
Returns a string of the version of the BLAST program.
blast_version()
Draws a circle polygon.
circle(r, v = 1000, ...)
r |
Numeric scalar of circle radius. |
v |
Numeric scalar of vertex count (default = |
... |
Additional arguments passed to |
Draws a circle polygon of a given radius. The circle is drawn about the origin (i.e., x = 0, y = 0). Intended for use with template to generate detection and proportion plots.
No return value.
sector for plotting sector polygons.
template(l=1)
circle(r=1)
Checks whether DNA sequences contain wildcard characters.
contains_wildcards(sequences)
sequences |
A character vector of DNA sequences. |
A logical vector indicating whether each DNA sequence contains wildcard characters.
contains_wildcards(sequences=c("TKCTAGGTGW","CATAATTAGG","ATYGGCTATG"))
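A rough base R equivalent of this check is sketched below. It assumes that any character other than A, C, G, or T counts as a wildcard, which is an assumption of the sketch and not a statement of the package's exact rule. The helper name `has_wildcards_sketch` is hypothetical.

```r
# Hypothetical helper (not part of LocaTT).
# Assumption: a wildcard is any character other than A, C, G, or T.
has_wildcards_sketch <- function(sequences) {
  grepl(pattern = "[^ACGT]", x = toupper(sequences))
}

flags <- has_wildcards_sketch(c("TKCTAGGTGW", "CATAATTAGG", "ATYGGCTATG"))
# flags is TRUE FALSE TRUE under the stated assumption.
```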
Generates coordinates along a circular path.
coordinates(a, r)
a |
Numeric vector of angles (degrees). |
r |
Numeric scalar of the circle radius. |
Calculates xy coordinates along a circular path given a vector of angles and a specified radius. The center of the circle is at the origin (i.e., x = 0, y = 0). This function is helpful for calculating the vertices of circle and sector polygons.
A numeric matrix of xy coordinates.
circle for plotting circle polygons. sector for plotting sector polygons.
coordinates(a=c(90,180,270,360),r=1)
Derives covariance matrix from correlation matrix and standard deviation vector.
cor2cov(sd, R)
sd |
Numeric vector of standard deviations. |
R |
Numeric correlation matrix. |
Given correlation matrix R and standard deviation vector sd, performs the operation diag(sd) %*% R %*% diag(sd) to derive the corresponding covariance matrix. This is a counterpart to stats::cov2cor, which scales a covariance matrix into the corresponding correlation matrix.
Returns a numeric covariance matrix.
stats::cov2cor for scaling a covariance matrix into the corresponding correlation matrix.
# Define standard deviation vector.
sd<-c(9.655,1.157,1.128,2.925)
# Define correlation matrix.
R<-matrix(data=c(1.000,-0.80,0.64,-0.512,
                 -0.800,1.00,-0.80,0.640,
                 0.640,-0.80,1.00,-0.800,
                 -0.512,0.64,-0.80,1.000),
          ncol=4,byrow=TRUE)
# Derive covariance matrix.
cor2cov(sd=sd,R=R)
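The operation named in the details, `diag(sd) %*% R %*% diag(sd)`, can be reproduced directly in base R. The sketch below does so with the example inputs and checks the round trip through `stats::cov2cor`. The variable names `sd_vec`, `S`, and `S_alt` are illustrative only.

```r
# Standard deviations and correlation matrix from the example above.
sd_vec <- c(9.655, 1.157, 1.128, 2.925)
R <- matrix(data = c( 1.000, -0.80,  0.64, -0.512,
                     -0.800,  1.00, -0.80,  0.640,
                      0.640, -0.80,  1.00, -0.800,
                     -0.512,  0.64, -0.80,  1.000),
            ncol = 4, byrow = TRUE)

# Covariance matrix via the matrix product given in the details.
S <- diag(sd_vec) %*% R %*% diag(sd_vec)

# Equivalent elementwise form: S[i, j] = sd[i] * sd[j] * R[i, j].
S_alt <- R * outer(sd_vec, sd_vec)

# Scaling back with cov2cor should recover the correlation matrix.
R_back <- stats::cov2cor(S)
```

The diagonal of `S` holds the variances (`sd_vec^2`), and `R_back` matches `R` up to floating-point error.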
Density function for the Gaussian copula.
dcopula(u, R, log = FALSE)
u |
Numeric vector or matrix of uniformly-distributed margins on the interval [ |
R |
Numeric correlation matrix. If |
log |
Logical scalar. If |
Computes the probability density of the Gaussian copula. Given uniformly-distributed margins u on the interval [0, 1], applies the inverse cumulative distribution function of the standard normal (i.e., stats::qnorm) to map uniform margins to normal scores. Then, uses equation 1 of Song (2000) with the normal scores to calculate the probability density of the Gaussian copula.
Numeric vector of probability densities.
Song P. 2000. Multivariate dispersion models generated from Gaussian copula. Scandinavian Journal of Statistics, 27(2): 305-320. DOI: 10.1111/1467-9469.00191
dmvlogis for density of the multivariate logistic distribution.
# Define uniform margins.
u<-c(0.324,0.383,0.917,0.015)
# Define correlation matrix.
R<-matrix(data=c(1.000,-0.80,0.64,-0.512,
                 -0.800,1.00,-0.80,0.640,
                 0.640,-0.80,1.00,-0.800,
                 -0.512,0.64,-0.80,1.000),
          ncol=4,byrow=TRUE)
# Compute log probability density.
dcopula(u=u,R=R,log=TRUE)
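The two steps in the details (map uniform margins to normal scores, then evaluate the Gaussian copula density) can be sketched in base R as follows. The sketch uses the standard closed form, log c(u) = -0.5 log|R| - 0.5 z'(R^-1 - I)z with z = qnorm(u), which is assumed here to correspond to the equation cited from Song (2000). It handles a single vector `u` only, and `copula_logdens_sketch` is a hypothetical name.

```r
# Hypothetical helper (not part of LocaTT): log density of the Gaussian copula
# for one vector of uniform margins.
copula_logdens_sketch <- function(u, R) {
  z <- stats::qnorm(u)                                   # normal scores
  log_det <- as.numeric(determinant(R, logarithm = TRUE)$modulus)
  quad <- as.numeric(t(z) %*% (solve(R) - diag(length(z))) %*% z)
  -0.5 * log_det - 0.5 * quad
}

u <- c(0.324, 0.383, 0.917, 0.015)
R <- matrix(data = c( 1.000, -0.80,  0.64, -0.512,
                     -0.800,  1.00, -0.80,  0.640,
                      0.640, -0.80,  1.00, -0.800,
                     -0.512,  0.64, -0.80,  1.000),
            ncol = 4, byrow = TRUE)
log_density <- copula_logdens_sketch(u = u, R = R)
```

With an identity correlation matrix the margins are independent, so the log density from this sketch is zero.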
Density function for the Dirichlet-multinomial distribution.
ddirmult(x, p, theta, alpha, log = FALSE)
x |
Numeric vector or matrix of counts. If matrix, then a vector of probability densities is returned with an element for each record of the matrix. Matrix records represent observations, and matrix fields represent dimensions. |
p |
Numeric vector or matrix of proportions. Matrix records represent observations, and matrix fields represent dimensions. If vector, then |
theta |
Numeric vector. Precision parameter with domain ( |
alpha |
Numeric vector or matrix of conventional alpha values. Matrix records represent observations, and matrix fields represent dimensions. If vector, then |
log |
Logical scalar. If |
Computes the probability mass of the Dirichlet-multinomial distribution. Under the proportion parameterization, the alpha parameters of the conventional Dirichlet-multinomial distribution are derived as the product of a proportion vector (p) and an exponentiated precision parameter (exp(theta)). The precision parameter controls the degree of overdispersion relative to the multinomial distribution, where higher values of theta are associated with reduced overdispersion. When theta = 0, the alpha parameters of the conventional Dirichlet-multinomial distribution are equal to the proportion vector (p). To ensure a simplex, the values of p (vector or matrix records) are internally normalized to sum to one. If alpha is provided, then the conventional alpha parameterization of the Dirichlet-multinomial distribution is used.
Numeric vector of probability densities.
Minka TP. 2000. Estimating a Dirichlet distribution.
waic for generic function to compute widely applicable information criterion. dmWAIC for computing widely applicable information criteria for Dirichlet-multinomial regression models.
# Compute log probability density.
ddirmult(x=c(33,115,95,359),
         p=c(0.075,0.201,0.175,0.549),
         theta=4.027,log=TRUE)
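To make the proportion parameterization concrete, the sketch below derives alpha as `p * exp(theta)` and evaluates the conventional Dirichlet-multinomial probability mass with `lgamma`. It covers a single count vector only and includes the multinomial coefficient, which is an assumption about the normalization used by the package. The name `ddirmult_sketch` is hypothetical.

```r
# Hypothetical helper (not part of LocaTT): Dirichlet-multinomial mass for one count vector
# under the proportion parameterization alpha = p * exp(theta).
ddirmult_sketch <- function(x, p, theta, log = FALSE) {
  p <- p / sum(p)              # normalize proportions to a simplex
  alpha <- p * exp(theta)      # conventional alpha parameters
  n <- sum(x)
  a0 <- sum(alpha)
  out <- lgamma(n + 1) - sum(lgamma(x + 1)) +   # multinomial coefficient
    lgamma(a0) - lgamma(n + a0) +
    sum(lgamma(x + alpha) - lgamma(alpha))
  if (log) out else exp(out)
}

log_mass <- ddirmult_sketch(x = c(33, 115, 95, 359),
                            p = c(0.075, 0.201, 0.175, 0.549),
                            theta = 4.027, log = TRUE)
```

Because `sum(alpha)` equals `exp(theta)`, larger values of theta concentrate the distribution around `p`, which is the reduced overdispersion described above.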
Decodes Phred quality scores in Sanger format from symbols to numeric values.
decode_quality_scores(symbols)
symbols |
A string containing quality scores encoded as symbols in Sanger format. |
A numeric vector of Phred quality scores.
decode_quality_scores(symbols="989!.C;F@\"")
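Sanger-format quality symbols use an ASCII offset of 33, so decoding amounts to converting each character to its code point and subtracting 33. A minimal base R sketch follows. The helper name `decode_phred_sketch` is hypothetical, and the package's own input validation is not reproduced.

```r
# Hypothetical helper (not part of LocaTT): decode Sanger (Phred+33) quality symbols.
decode_phred_sketch <- function(symbols) {
  utf8ToInt(symbols) - 33L
}

scores <- decode_phred_sketch("989!.C;F@\"")
# scores is 24 23 24 0 13 34 26 37 31 1
```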
Generates detection plots for multiple groups.
detection(
  x,
  r = 1,
  b = 0.025,
  v = 1000,
  w = 1,
  f = 0.5,
  c = "lightskyblue",
  m = 3,
  ...
)
x |
A list of vectors named |
r |
Numeric scalar. Radius of plot circle (default = |
b |
Numeric scalar. Plot radius buffer (proportion; default = |
v |
Numeric scalar. Vertex count of plot circle (default = |
w |
Numeric scalar. Line width of outer circle (default = |
f |
Numeric scalar. Line width of sectors as a proportion of |
c |
Character string. Fill color of sub-sector detections (default = |
m |
Numeric scalar. Maximum number of plot columns (default = |
... |
Additional arguments passed to |
Produces a pie-chart-like detection plot with grouping structure. Each circle represents a group. Each sector represents a sample, and each sub-sector represents a replicate. Filled replicates represent detections. Groups are sorted alphabetically (or inherit factor level ordering) and arranged from left to right and top to bottom. Samples are sorted alphabetically and arranged in a clockwise orientation (from angle zero). Samples are sorted independently for each group. This plot design is specialized for visualizing binary detection data.
No return value.
A manuscript describing this plot design is in preparation.
singular.detection for singular detection plots. proportion for grouped proportion plots.
set.seed(1234)
n.groups<-6
n.samples<-6
n.replicates<-3
data<-list(g=rep(x=LETTERS[1:n.groups],each=n.samples),
           s=rep(x=letters[1:n.samples],times=n.groups),
           r=rep(x=n.replicates,times=n.groups*n.samples),
           d=sample(x=0:n.replicates,size=n.groups*n.samples,
                    replace=TRUE))
detection(x=data)
Compute Bray-Curtis dissimilarity from proportional abundances.
dissimilarity(p1, p2)
p1 |
Numeric vector or matrix of proportional abundances for first community. See details. |
p2 |
Numeric vector or matrix of proportional abundances for second community. See details. |
Calculates Bray-Curtis dissimilarity from proportional abundances using the formula sum(abs(p1-p2))/2. This is equivalent to the Bray-Curtis dissimilarity formula in the vegdist function of the vegan package when both communities have the same total counts (e.g., rarefied counts). This function is primarily intended to provide a method to compute Bray-Curtis dissimilarity from the posterior predictions of Dirichlet-multinomial regression models, which generate predictions of proportional abundances.
The dimensions of p1 and p2 must match. If p1 and p2 are matrices, then each record represents paired replicates for the communities (e.g., predictions from the same posterior draw). The elements (if p1 and p2 are vectors) or fields (if p1 and p2 are matrices) represent dimensions (e.g., taxa or species). If p1 and p2 are vectors, then a numeric scalar is returned. If p1 and p2 are matrices, then a numeric vector is returned whose elements correspond to the paired records of p1 and p2.
Numeric scalar or vector of Bray-Curtis dissimilarity values.
Bray JR, and Curtis JT. 1957. An ordination of the upland forest communities of southern Wisconsin. Ecological Monographs, 27(4): 325-349. DOI: 10.2307/1942268
Legendre P, and Legendre L. 2012. Numerical Ecology: Third Edition. Elsevier.
Odum EP. 1950. Bird populations of the Highlands (North Carolina) Plateau in relation to plant succession and avian invasion. Ecology, 31(4): 587-605. DOI: 10.2307/1931577
dmreg for fitting Dirichlet-multinomial regression models. dmpredict for generating predictions from Dirichlet-multinomial regression models. diversity for computing Hill diversity from proportional abundances. richness for computing species richness from occupancy probabilities.
# Compute Bray-Curtis dissimilarity.
dissimilarity(p1=c(0.15,0.25,0.4,0.2),
              p2=c(0.25,0.35,0.1,0.3))
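The formula in the details, `sum(abs(p1-p2))/2`, extends naturally to matrices by summing across each record. The sketch below shows both cases in base R. The helper name `bray_curtis_sketch` and the second matrix row are hypothetical illustration values.

```r
# Hypothetical helper (not part of LocaTT): Bray-Curtis dissimilarity from proportions.
bray_curtis_sketch <- function(p1, p2) {
  if (is.matrix(p1)) {
    rowSums(abs(p1 - p2)) / 2   # one value per paired record
  } else {
    sum(abs(p1 - p2)) / 2       # single pair of communities
  }
}

# Vector inputs: absolute differences are 0.1, 0.1, 0.3, 0.1, so the result is 0.6 / 2 = 0.3.
d_vec <- bray_curtis_sketch(p1 = c(0.15, 0.25, 0.4, 0.2),
                            p2 = c(0.25, 0.35, 0.1, 0.3))

# Matrix inputs: each row is a paired replicate (second row is made up for illustration).
P1 <- rbind(c(0.15, 0.25, 0.4, 0.2), c(0.25, 0.25, 0.25, 0.25))
P2 <- rbind(c(0.25, 0.35, 0.1, 0.3), c(0.25, 0.25, 0.25, 0.25))
d_mat <- bray_curtis_sketch(p1 = P1, p2 = P2)
```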
Compute Hill diversity from proportional abundances.
diversity(p, alpha = 2)
p |
Numeric vector or matrix of proportional abundances. If vector, then Hill diversity is computed for the vector of proportions (and a scalar is returned). If matrix, then Hill diversity is computed independently for each record (and a vector is returned). |
alpha |
Numeric scalar or vector. Continuous positive alpha parameter of the Hill diversity formula. If scalar, then |
Calculates Hill diversity from proportional abundances as defined in Hill (1973), which provides a unifying theory for ecological diversity indices. When alpha = 0, Hill diversity is equal to species richness. When alpha = 1, Hill diversity is equal to the exponentiated Shannon's entropy. When alpha = 2 (the default), Hill diversity is equal to the inverse of Simpson's index. For any value of alpha, the Hill diversity of a community with uniform proportional abundances is equal to species richness. Hill diversity represents the effective number of species.
Numeric scalar or vector of Hill diversity values.
Hill MO. 1973. Diversity and evenness: A unifying notation and its consequences. Ecology, 54(2): 427-432. DOI: 10.2307/1934352
dissimilarity for computing Bray-Curtis dissimilarity from proportional abundances. richness for computing species richness from occupancy probabilities.
# Compute Hill diversity.
diversity(p=c(0.15,0.25,0.4,0.2))
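The special cases listed in the details follow from the Hill number formula, (sum of p^alpha)^(1/(1 - alpha)), with the alpha = 1 case taken as the limit exp(-sum(p log p)). The base R sketch below handles one proportion vector and a scalar alpha. `hill_sketch` is a hypothetical name, and dropping zero proportions is a choice made for this sketch.

```r
# Hypothetical helper (not part of LocaTT): Hill diversity for one vector of proportions.
hill_sketch <- function(p, alpha = 2) {
  p <- p[p > 0]                       # zero proportions contribute nothing
  if (isTRUE(all.equal(alpha, 1))) {
    exp(-sum(p * log(p)))             # limit as alpha approaches 1 (exponentiated Shannon entropy)
  } else {
    sum(p^alpha)^(1 / (1 - alpha))
  }
}

p <- c(0.15, 0.25, 0.4, 0.2)
hill_0 <- hill_sketch(p, alpha = 0)   # species richness: 4
hill_1 <- hill_sketch(p, alpha = 1)   # exponentiated Shannon entropy
hill_2 <- hill_sketch(p, alpha = 2)   # inverse Simpson: 1 / 0.285

# A uniform community returns its richness for any alpha.
hill_uniform <- sapply(c(0, 1, 2), function(a) hill_sketch(rep(0.25, 4), alpha = a))
```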
Density function for a joint species distribution model.
djsdm(x, psi, log = FALSE)
x |
Numeric vector or matrix. Binary values of species occurrence. If matrix, then a vector of probability densities is returned with an element for each record of the matrix. Matrix records represent sites, and matrix fields represent species. |
psi |
Numeric vector or matrix. Probabilities of site occupancy. Matrix records represent sites, and matrix fields represent species. If vector, then |
log |
Logical scalar. If |
Computes the probability density of a joint species distribution model. The probability of observing a community is calculated as the product of the probabilities of observing each species. Observations for each species are Bernoulli-distributed, and species-specific probability densities are computed with stats::dbinom.
Numeric vector of probability densities.
Wilkinson DP, Golding N, Guillera‐Arroita G, Tingley R, and McCarthy MA. 2021. Defining and evaluating predictions of joint species distribution models. Methods in Ecology and Evolution, 12(3): 394-404. DOI: 10.1111/2041-210X.13518
stats::dbinom for density of the binomial distribution. mlWAIC for computing widely applicable information criteria for joint species distribution models.
# Define species occurrence.
x<-c(1,0,0,1)
# Define occupancy probabilities.
psi<-c(0.886,0.391,0.139,0.991)
# Compute log probability density.
djsdm(x=x,psi=psi,log=TRUE)
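Since the community probability is described as a product of independent Bernoulli terms evaluated with `stats::dbinom`, the log density for one site is simply the sum of the species-level log masses. A base R sketch with the example inputs is given below. The variable names are illustrative.

```r
# Species occurrence and occupancy probabilities from the example above.
x <- c(1, 0, 0, 1)
psi <- c(0.886, 0.391, 0.139, 0.991)

# Bernoulli log mass per species (binomial with a single trial), summed over species.
log_density <- sum(stats::dbinom(x = x, size = 1, prob = psi, log = TRUE))

# The same quantity written out by hand: psi where present, 1 - psi where absent.
log_density_manual <- sum(log(ifelse(x == 1, psi, 1 - psi)))
```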
Generate predictions for Dirichlet-multinomial regression models.
dmpredict(X, H, fit, names)
X |
Numeric predictor matrix. Predictions are made for each record. Each field represents a predictor variable, and the predictor variables must match (in order) those used to fit the |
H |
Numeric vector or matrix (optional). If provided, then hierarchical effects are included in the predictions. Vector or matrix elements contain integer identifiers for values of hierarchical variables. If vector, then a single hierarchical variable is included, with each element corresponding to a record in |
fit |
A |
names |
Vector (optional). If provided, then field names in the matrices of the returned list will receive these values. If omitted, then the matrices in the returned list will lack field names. |
Generates posterior predictions for Dirichlet-multinomial regression models fit with the dmreg function. Predictions can either include or omit hierarchical effects, depending on whether argument H is provided. Returns a list where each element contains a matrix of posterior predictions for the respective record of X. Field names for the element matrices can optionally be provided with the names argument.
A list whose elements contain numeric matrices of posterior predictions. Within the list, one element is returned for each record of X. Element names are taken from the row names of X.
dmreg for fitting Dirichlet-multinomial regression models. dmWAIC for computing widely applicable information criteria for Dirichlet-multinomial regression models.
# Define example data file path.
path<-system.file("extdata", "example_regression_data.rds", package="LocaTT", mustWork=TRUE)
# Read in example regression data.
data<-readRDS(file=path)
# Predict with fitted Dirichlet-multinomial regression.
out<-dmpredict(X=data$X,fit=data$fit,names=colnames(data$Y))
Fit a Bayesian Dirichlet-multinomial regression model. Both fixed and hierarchical effects are supported. Installation of the rstan package is required to use this function.
dmreg(
  Y,
  X,
  H,
  ones = TRUE,
  priors = c(B.mu = 0, B.sd = 1, theta.mu = 0, theta.sd = 1, sigma2.alpha = 0.01,
    sigma2.beta = 0.01),
  control = list(adapt_delta = 0.95, max_treedepth = 20),
  ...
)
Y |
Numeric response matrix. Each record represents an observation, and each field represents a response dimension. Matrix cells contain integer counts. |
X |
Numeric predictor matrix. Each record represents an observation, and each field represents a predictor variable. Matrix cells contain predictor values. |
H |
Numeric vector or matrix (optional). If provided, then hierarchical effects are included in the model. Vector or matrix elements contain integer identifiers for values of hierarchical variables. If vector, then a single hierarchical variable is included, with each element representing an observation. If matrix, then each record represents an observation, and each field represents a hierarchical variable. Up to four hierarchical variables are supported (each with an arbitrary number of hierarchical levels). |
ones |
Logical scalar. If |
priors |
Named numeric vector. Elements represent the prior values of their respective named parameters. When predictors are centered and scaled, the defaults generally represent weakly informative priors. Regression coefficients ( |
control |
Named list of parameters which control the behavior of the Stan sampler. Passed to the |
... |
Additional arguments passed to the |
Fits the Bayesian Dirichlet-multinomial regression model of Goodwin et al. (2022) using the rstan interface to Stan (Carpenter et al. 2017). A stanfit object of the fitted model is returned, which can be used with standard rstan functions to evaluate model convergence (e.g., posterior trace plots, R-hat convergence diagnostics, and effective sample sizes). The model formulation is identical to that of Goodwin et al. (2022), except that the hard sum-to-zero constraint on hierarchical effects was removed to preserve the prior marginal variance of the final element. Up to four hierarchical variables are supported.
For each observation, counts are distributed according to the Dirichlet-multinomial distribution with alpha parameters defined as the product of an expected proportions vector and an exponentiated precision parameter. The precision parameter controls the degree of overdispersion relative to the multinomial distribution. The softmax function normalizes linear predictor combinations into expected proportions. For the model to be identifiable, the regression coefficients of the final dimension are set to zero. By default, weakly informative priors are used on the regression coefficients (B), precision parameter (theta), and hierarchical variances (sigma2). See the supplement of Goodwin et al. (2022) for details.
Returns a stanfit object of the fitted Bayesian Dirichlet-multinomial regression model.
Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, and Riddell A. 2017. Stan: A probabilistic programming language. Journal of Statistical Software, 76: 1-32. DOI: 10.18637/jss.v076.i01
Goodwin KB, Hutchinson JD, and Gompert Z. 2022. Spatiotemporal and ontogenetic variation, microbial selection, and predicted Bd-inhibitory function in the skin-associated microbiome of a Rocky Mountain amphibian. Frontiers in Microbiology, 13: 1020329. DOI: 10.3389/fmicb.2022.1020329
Harrison JG, Calder WJ, Shastry V, and Buerkle CA. 2020. Dirichlet-multinomial modelling outperforms alternatives for analysis of microbiome and other ecological count data. Molecular Ecology Resources, 20(2): 481-497. DOI: 10.1111/1755-0998.13128
dmpredict for generating predictions from Dirichlet-multinomial regression models. dmWAIC for computing widely applicable information criteria for Dirichlet-multinomial regression models.
# Define example data file path.
path<-system.file("extdata", "example_regression_data.rds", package="LocaTT", mustWork=TRUE)
# Read in example regression data.
data<-readRDS(file=path)
# Fit Dirichlet-multinomial regression.
out<-dmreg(Y=data$Y,X=data$X,H=data$H)
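The link between predictors and Dirichlet-multinomial alpha parameters described in the details (softmax of the linear predictors, final dimension fixed at zero, scaled by an exponentiated precision) can be illustrated without Stan. The sketch below uses made-up coefficient values for one observation with three response dimensions. All numbers and names are hypothetical, and hierarchical effects are omitted.

```r
# Hypothetical predictor record: a leading 1 (assumed intercept term) plus two predictors.
x <- c(1, 0.5, -1.2)

# Hypothetical coefficients: one column per response dimension.
# The final column is fixed at zero so that the model is identifiable.
B <- cbind(c(0.2, 1.0, -0.5),
           c(-0.3, 0.4, 0.8),
           0)

# Linear predictors, then softmax to obtain expected proportions.
eta <- as.numeric(x %*% B)
p <- exp(eta - max(eta)) / sum(exp(eta - max(eta)))   # subtracting max(eta) avoids overflow

# Alpha parameters: expected proportions times the exponentiated precision parameter.
theta <- 2
alpha <- p * exp(theta)
```

Fixing the final column at zero removes the redundancy of softmax, which is unchanged when the same constant is added to every linear predictor.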
Density function for the multivariate logistic distribution.
dmvlogis(x, location, scale, R, log = FALSE)
x |
Numeric vector or matrix. Values of logistically-distributed marginals. If matrix, then a vector of probability densities is returned with an element for each record of the matrix. Matrix records represent observations, and matrix fields represent dimensions. |
location |
Numeric vector. Location parameters of the logistic distribution. If |
scale |
Numeric vector. Scale parameters of the logistic distribution. If |
R |
Numeric correlation matrix. If |
log |
Logical scalar. If |
Computes the probability density of the multivariate logistic distribution. The multivariate logistic distribution is constructed using a Gaussian copula with logistic marginals. The probability density is the product of the densities of the logistic marginals, which is further multiplied by the density of a Gaussian copula of the transformed standard uniform margins (i.e., probability integral transformation of the logistic marginals with stats::plogis).
Numeric vector of probability densities.
Decani JS, and Stine RA. 1986. A note on deriving the information matrix for a logistic distribution. The American Statistician, 40(3): 220-222. DOI: 10.2307/2684541
Song P. 2000. Multivariate dispersion models generated from Gaussian copula. Scandinavian Journal of Statistics, 27(2): 305-320. DOI: 10.1111/1467-9469.00191
stats::dlogis for density of the logistic distribution. dcopula for density of the Gaussian copula.
# Define logistic margins.
x<-c(0.055,-1.625,0.329,-5.765)
# Define location parameters.
location<-c(0.477,-0.998,-0.776,0.064)
# Define scale parameters.
scale<-c(0.574,1.314,0.460,1.393)
# Define correlation matrix.
R<-matrix(data=c(1.000,-0.80,0.64,-0.512,
                 -0.800,1.00,-0.80,0.640,
                 0.640,-0.80,1.00,-0.800,
                 -0.512,0.64,-0.80,1.000),
          ncol=4,byrow=TRUE)
# Compute log probability density.
dmvlogis(x=x,location=location,
         scale=scale,R=R,
         log=TRUE)
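The construction in the details (logistic marginal densities multiplied by a Gaussian copula density of the probability-integral-transformed margins) can be sketched in base R for a single observation. The copula term uses the standard closed form, which is assumed to match the package's implementation, and `dmvlogis_sketch` is a hypothetical name.

```r
# Hypothetical helper (not part of LocaTT): log density for one observation.
dmvlogis_sketch <- function(x, location, scale, R) {
  # Log densities of the logistic marginals.
  log_marginals <- sum(stats::dlogis(x, location = location, scale = scale, log = TRUE))
  # Probability integral transform to uniform margins, then to normal scores.
  u <- stats::plogis(x, location = location, scale = scale)
  z <- stats::qnorm(u)
  # Log density of the Gaussian copula.
  log_det <- as.numeric(determinant(R, logarithm = TRUE)$modulus)
  quad <- as.numeric(t(z) %*% (solve(R) - diag(length(z))) %*% z)
  log_copula <- -0.5 * log_det - 0.5 * quad
  log_marginals + log_copula
}

x <- c(0.055, -1.625, 0.329, -5.765)
location <- c(0.477, -0.998, -0.776, 0.064)
scale <- c(0.574, 1.314, 0.460, 1.393)
R <- matrix(data = c( 1.000, -0.80,  0.64, -0.512,
                     -0.800,  1.00, -0.80,  0.640,
                      0.640, -0.80,  1.00, -0.800,
                     -0.512,  0.64, -0.80,  1.000),
            ncol = 4, byrow = TRUE)
log_density <- dmvlogis_sketch(x = x, location = location, scale = scale, R = R)
```

With an identity correlation matrix the copula term vanishes and the result reduces to the sum of independent logistic log densities.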
Computes the widely applicable information criterion (WAIC) for Dirichlet-multinomial regression models. Serves as a wrapper for dmreg, dmpredict, ddirmult, and waic for convenient WAIC calculations. Installation of the rstan package is required to use this function.
dmWAIC(
  Y,
  X,
  H,
  ones = TRUE,
  method = 2,
  priors = c(B.mu = 0, B.sd = 1, theta.mu = 0, theta.sd = 1, sigma2.alpha = 0.01,
    sigma2.beta = 0.01),
  control = list(adapt_delta = 0.95, max_treedepth = 20),
  ...
)
Y |
Numeric response matrix. Each record represents an observation, and each field represents a response dimension. Matrix cells contain integer counts. |
X |
Numeric predictor matrix. Each record represents an observation, and each field represents a predictor variable. Matrix cells contain predictor values. |
H |
Numeric vector or matrix (optional). If provided, then hierarchical effects are included in the model. Vector or matrix elements contain integer identifiers for values of hierarchical variables. If vector, then a single hierarchical variable is included, with each element representing an observation. If matrix, then each record represents an observation, and each field represents a hierarchical variable. Up to four hierarchical variables are supported (each with an arbitrary number of hierarchical levels). |
ones |
Logical scalar. If |
method |
Numeric scalar. Options are |
priors |
Named numeric vector. Elements represent the prior values of their respective named parameters. When predictors are centered and scaled, the defaults generally represent weakly informative priors. Regression coefficients ( |
control |
Named list of parameters which control the behavior of the Stan sampler. Passed to the |
... |
Additional arguments passed to the |
For convenience, wraps the steps involved in WAIC calculations for Bayesian Dirichlet-multinomial regression models. Begins by fitting a Bayesian Dirichlet-multinomial regression model with the dmreg function, then generates resubstitution posterior predictions using the dmpredict function. The pointwise log-likelihood is calculated with the ddirmult function given the response matrix, posterior predictions, and precision parameter. WAIC is calculated from the pointwise log-likelihood using the waic function.
Returns a numeric scalar of the widely applicable information criterion.
Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, and Riddell A. 2017. Stan: A probabilistic programming language. Journal of Statistical Software, 76: 1-32. DOI: 10.18637/jss.v076.i01
Gelman A, Hwang J, and Vehtari A. 2014. Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6): 997-1016. DOI: 10.1007/s11222-013-9416-2
Goodwin KB, Hutchinson JD, and Gompert Z. 2022. Spatiotemporal and ontogenetic variation, microbial selection, and predicted Bd-inhibitory function in the skin-associated microbiome of a Rocky Mountain amphibian. Frontiers in Microbiology, 13: 1020329. DOI: 10.3389/fmicb.2022.1020329
Harrison JG, Calder WJ, Shastry V, and Buerkle CA. 2020. Dirichlet-multinomial modelling outperforms alternatives for analysis of microbiome and other ecological count data. Molecular Ecology Resources, 20(2): 481-497. DOI: 10.1111/1755-0998.13128
Watanabe S. 2010. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11(116): 3571-3594.
dmreg for fitting Dirichlet-multinomial regression models. dmpredict for generating predictions from Dirichlet-multinomial regression models. ddirmult for probability mass function of the Dirichlet-multinomial distribution. waic for generic function to compute widely applicable information criterion.
# Define example data file path.
path<-system.file("extdata", "example_regression_data.rds", package="LocaTT", mustWork=TRUE)
# Read in example regression data.
data<-readRDS(file=path)
# Compute WAIC for Dirichlet-multinomial regression.
out<-dmWAIC(Y=data$Y,X=data$X,H=data$H)
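The last step of the workflow, turning a pointwise log-likelihood into WAIC, can be sketched in base R as below. The sketch assumes a matrix with one row per posterior draw and one column per observation, and it uses the variance-based penalty described by Gelman et al. (2014). Whether that penalty corresponds to a particular value of the `method` argument is not asserted here, and `waic_sketch` is a hypothetical name distinct from the package's `waic` function.

```r
# Hypothetical helper (not part of LocaTT): WAIC on the deviance scale.
# ll: matrix of pointwise log-likelihoods (rows = posterior draws, columns = observations).
waic_sketch <- function(ll) {
  # Numerically stable log of the mean of exponentiated values.
  log_mean_exp <- function(v) {
    m <- max(v)
    m + log(mean(exp(v - m)))
  }
  lppd <- sum(apply(ll, 2, log_mean_exp))   # log pointwise predictive density
  p_waic <- sum(apply(ll, 2, stats::var))   # variance-based effective parameter count
  -2 * (lppd - p_waic)
}

# Toy input: 10 draws, 3 observations, constant log-likelihood of -1.
ll_toy <- matrix(-1, nrow = 10, ncol = 3)
waic_toy <- waic_sketch(ll_toy)   # lppd = -3, penalty = 0, so WAIC = 6
```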
Extracts each taxonomic level from a vector of taxonomic strings.
expand_taxonomies(
  taxonomies,
  levels = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"),
  full_names = TRUE,
  delimiter = ";",
  ignore
)
taxonomies |
A character vector of taxonomic strings. |
levels |
A character vector of taxonomic level names. The length of |
full_names |
Logical. If |
delimiter |
A character string of the delimiter between taxonomic levels in the input taxonomies. The default is |
ignore |
An optional character vector of taxonomic strings for which taxonomic expansion will be skipped. In the returned data frame (see return value section), the record for each skipped taxonomic string will be filled with |
Returns a data frame of extracted taxonomic levels. One record for each element of taxonomies, and one field for each element of levels. Field names are inherited from levels. If a taxonomic level is not present in a taxonomic string, then the respective cell in the returned data frame will contain NA.
get_taxonomic_level for extracting a taxonomic level from taxonomic strings. get_consensus_taxonomy for generating a consensus taxonomy from taxonomic strings.
expand_taxonomies(taxonomies=
                    c("Eukaryota;Chordata;Amphibia;Caudata;Ambystomatidae;Ambystoma;Ambystoma mavortium",
                      "Eukaryota;Chordata;Amphibia;Anura;Bufonidae;Anaxyrus;Anaxyrus boreas",
                      "Eukaryota;Chordata;Amphibia;Anura;Ranidae;Rana;Rana luteiventris"),
                  full_names=FALSE,
                  delimiter=";")
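The expansion described above can be pictured with a minimal base R sketch. This is not the package's implementation: the helper name `expand_sketch` is hypothetical, the `full_names` and `ignore` arguments are left out, and the sketch simply assumes that levels are listed from coarsest to finest and that missing trailing levels should become NA.

```r
# Hypothetical helper (not part of LocaTT): split taxonomic strings into
# one column per taxonomic level using base R only.
expand_sketch <- function(taxonomies,
                          levels = c("Domain", "Phylum", "Class", "Order",
                                     "Family", "Genus", "Species"),
                          delimiter = ";") {
  # Split each string on the delimiter (fixed = TRUE avoids regex surprises).
  parts <- strsplit(taxonomies, split = delimiter, fixed = TRUE)
  # Pad or truncate each record to the number of requested levels;
  # indexing past the end of a character vector yields NA.
  rows <- lapply(parts, function(p) p[seq_along(levels)])
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- levels
  out
}

expanded <- expand_sketch(c(
  "Eukaryota;Chordata;Amphibia;Anura;Bufonidae;Anaxyrus;Anaxyrus boreas",
  "Eukaryota;Chordata;Amphibia;Anura;Ranidae"
))
```

The second input stops at the family level, so its Genus and Species cells come back as NA, matching the behaviour documented for missing levels.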
Filters DNA sequences by minimum read count within a PCR replicate, minimum proportion within a PCR replicate, and number of detections across PCR replicates.
filter_sequences(
  input_files,
  samples,
  PCR_replicates,
  output_file,
  minimum_reads.PCR_replicate = 1,
  minimum_reads.sequence = 1,
  minimum_proportion.sequence = 0.005,
  binomial_test.enabled = TRUE,
  binomial_test.p.adjust.method = "none",
  binomial_test.alpha_level = 0.05,
  minimum_PCR_replicates = 2,
  delimiter.read_counts = ": ",
  delimiter.PCR_replicates = ", "
)
input_files |
A character vector of file paths to input FASTA files. DNA sequences in the input FASTA files are assumed to be summarized by frequency of occurrence, with each FASTA header line beginning with "Frequency: " and followed by the sequence's read count. Output FASTA files from |
samples |
A character vector of sample identifiers, with one element for each element of |
PCR_replicates |
A character vector of PCR replicate identifiers, with one element for each element of |
output_file |
String specifying path to output file of filtered sequences in CSV format. |
minimum_reads.PCR_replicate |
Numeric. PCR replicates which contain fewer reads than this value are discarded and do not contribute detections to any sequence. The default is |
minimum_reads.sequence |
Numeric. For a sequence to be considered detected within a PCR replicate, the sequence's read count within the PCR replicate must match or exceed this value. The default is |
minimum_proportion.sequence |
Numeric. For a sequence to be considered detected within a PCR replicate, the proportion of reads in the PCR replicate comprised by the sequence must exceed this value. If |
binomial_test.enabled |
Logical. If |
binomial_test.p.adjust.method |
String specifying the p-value adjustment method for multiple hypothesis testing. p-value adjustments are performed within each PCR replicate for each sample. Passed to the |
binomial_test.alpha_level |
Numeric. The alpha level used in deciding whether the proportion of reads in a PCR replicate comprised by a sequence significantly exceeds a minimum threshold required for detection. See the |
minimum_PCR_replicates |
Numeric. The minimum number of PCR replicates in which a sequence must be detected in order to be considered present (i.e., not erroneous) in a sample. The default is |
delimiter.read_counts |
String specifying the delimiter between PCR replicate identifiers and sequence read counts in the Read_count_by_PCR_replicate field of the output CSV file (see details section). The default is |
delimiter.PCR_replicates |
String specifying the delimiter between PCR replicates in the Read_count_by_PCR_replicate field of the output CSV file (see details section). The default is |
For each set of input polymerase chain reaction (PCR) replicate FASTA files associated with a sample, writes out DNA sequences which are detected across a minimum number of PCR replicates (minimum_PCR_replicates argument). Detection within a PCR replicate is defined as a sequence having at least a minimum read count and exceeding a minimum proportion of reads (minimum_reads.sequence and minimum_proportion.sequence arguments, respectively). When binomial_test.enabled = TRUE, a sequence must significantly exceed the minimum proportion within a PCR replicate at the provided alpha level (binomial_test.alpha_level argument) based on a one-sided binomial test (i.e., binomial_test with alternative = "greater"). Within a PCR replicate, p-values can be adjusted for multiple hypothesis testing by setting the binomial_test.p.adjust.method argument (see stats::p.adjust.methods and p.adjust in the stats package). PCR replicates which contain fewer than a minimum number of reads are discarded (minimum_reads.PCR_replicate argument) and do not contribute detections to any sequence.
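The within-replicate detection rule can be illustrated with base R's `stats::binom.test`. The read counts below are invented, and the package uses its own vectorized `binomial_test` function, so this is only a sketch of the logic for a single sequence in a single PCR replicate. Treating "significant" as an adjusted p-value strictly below the alpha level is an assumption of the sketch.

```r
# Invented read counts: one sequence within one PCR replicate.
sequence_reads <- 50       # reads belonging to the sequence
replicate_reads <- 5000    # total reads in the PCR replicate
minimum_proportion <- 0.005
alpha_level <- 0.05

# One-sided exact binomial test: is the sequence's true proportion
# greater than the minimum proportion?
test <- stats::binom.test(x = sequence_reads, n = replicate_reads,
                          p = minimum_proportion, alternative = "greater")

# Adjust p-values within the replicate; method = "none" leaves them unchanged.
p_adjusted <- stats::p.adjust(test$p.value, method = "none")

# Assumption of this sketch: detection requires the adjusted p-value
# to fall strictly below the alpha level.
detected <- p_adjusted < alpha_level
```

Here the observed proportion (0.01) is twice the minimum proportion, so the test rejects the null hypothesis and the sequence counts as detected in this replicate.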
DNA sequences in the input FASTA files are assumed to be summarized by frequency of occurrence, with each FASTA header line beginning with "Frequency: " and followed by the sequence's read count. Output FASTA files from truncate_and_merge_pairs have this format and can be used directly with this function. Each input FASTA file is assumed to contain the DNA sequence reads for a single PCR replicate for a single sample.
For pipeline calibration purposes, a data frame containing unfiltered DNA sequences with their read counts, proportions, and p-values in each PCR replicate is invisibly returned (see return value section). While the primary output of this function is the written CSV file of filtered sequences (described below), the invisibly returned data frame of unfiltered sequences can be helpful when calibrating or troubleshooting filtering parameters. To aid in troubleshooting filtering parameters, the data frame is invisibly returned even if the error "Filtering removed all sequences" is received.
For the primary output, this function writes a CSV file of filtered DNA sequences with the following field definitions:
Sample: The sample name.
Sequence: The DNA sequence.
Detections_across_PCR_replicates: The number of PCR replicates the sequence was detected in.
Read_count_by_PCR_replicate: The sequence's read count in each PCR replicate the sequence was detected in.
Sequence_read_count: The sequence's total read count across the PCR replicates the sequence was detected in. Calculated as the sum of the read counts in the Read_count_by_PCR_replicate field.
Sample_read_count: The sample's total read count across all sequences detected in the PCR replicates. Calculated as the sum of the read counts in the Sequence_read_count field associated with the sample.
Proportion_of_sample: The proportion of sample reads comprised by the sequence. Calculated by dividing the Sequence_read_count field by the Sample_read_count field. Equivalent to the weighted average of the sequence's proportion in each PCR replicate, with weights given by the proportion of the sample's total reads contained in each PCR replicate.
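The equivalence stated for Proportion_of_sample can be checked with a few lines of arithmetic. The counts below are invented, and the sketch assumes that each replicate's total is the number of reads it contributes to Sample_read_count.

```r
# Invented read counts for one sequence across three PCR replicates.
sequence_reads <- c(P01 = 120, P02 = 80, P03 = 200)
# Invented totals: reads each replicate contributes to the sample total.
replicate_reads <- c(P01 = 1000, P02 = 500, P03 = 2500)

# Direct calculation: the sequence's reads divided by the sample's reads.
direct <- sum(sequence_reads) / sum(replicate_reads)

# Weighted average of per-replicate proportions, weighted by each
# replicate's share of the sample's reads.
per_replicate <- sequence_reads / replicate_reads
weights <- replicate_reads / sum(replicate_reads)
weighted <- sum(per_replicate * weights)
```

Both routes give the same value because each replicate's total cancels out of its weighted term, leaving the sequence's summed reads over the sample's summed reads.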
Invisibly returns a data frame containing unfiltered DNA sequences with their read counts, proportions, and p-values in each PCR replicate. While the primary output of this function is the written CSV file of filtered sequences described in the details section, the invisibly returned data frame of unfiltered sequences can be helpful when calibrating or troubleshooting filtering parameters. To aid in troubleshooting filtering parameters, the data frame is invisibly returned even if the error "Filtering removed all sequences" is received. Field definitions for the invisibly returned data frame of unfiltered sequences are:
Sample: The sample name.
PCR_replicate: The PCR replicate identifier.
Sequence: The DNA sequence.
Read_count.sequence: The sequence's read count within the PCR replicate.
Read_count.PCR_replicate: The number of reads in the PCR replicate.
Proportion_of_PCR_replicate.observed: The proportion of reads in the PCR replicate comprised by the sequence.
Proportion_of_PCR_replicate.null (Field only present if binomial_test.enabled = TRUE): The null hypothesis for a one-sided binomial test (inherited from the minimum_proportion.sequence argument). See the p.value field below.
p.value (Field only present if binomial_test.enabled = TRUE): The p-value from a one-sided binomial test of whether the proportion of reads in the PCR replicate comprised by the sequence exceeds the null hypothesis (i.e., binomial_test with alternative = "greater").
p.value.adjusted (Field only present if binomial_test.enabled = TRUE): The p-value from the one-sided binomial test adjusted for multiple comparisons within each PCR replicate for each sample. See the p.value_adjustment_method field below.
p.value_adjustment_method (Field only present if binomial_test.enabled = TRUE): The p-value adjustment method (inherited from the binomial_test.p.adjust.method argument).
A manuscript describing these methods is in preparation.
binomial_test for performing vectorized one-sided binomial tests. truncate_and_merge_pairs for truncating and merging read pairs prior to sequence filtering. local_taxa_tool for performing geographically-conscious taxonomic assignment of filtered sequences.
# Get example FASTA files.
input_files<-system.file("extdata",
                         paste0(rep(x=paste0("S0",1:3), each=3),
                                "P0",1:3,".fasta"),
                         package="LocaTT",
                         mustWork=TRUE)
# Create path for temporary output file.
output_file<-tempfile(fileext=".csv")
# Specify samples.
samples<-rep(x=paste0("S0",1:3),each=3)
# Specify replicates.
PCR_replicates<-rep(x=paste0("P0",1:3),times=3)
# Filter sequences.
filter_sequences(input_files=input_files,
                 samples=samples,
                 PCR_replicates=PCR_replicates,
                 output_file=output_file)
Formats reference databases from MIDORI or UNITE for use with the local_taxa_tool function.
format_reference_database(
  path_to_input_database,
  path_to_output_database,
  input_database_source = "MIDORI",
  path_to_taxonomy_edits = NA,
  path_to_sequence_edits = NA,
  path_to_taxa_subset_list = NA,
  makeblastdb_command = "makeblastdb",
  ...
)
path_to_input_database |
String specifying path to input reference database in FASTA format. |
path_to_output_database |
String specifying path to output BLAST database in FASTA format. File path cannot contain spaces. |
input_database_source |
String specifying input reference database source ( |
path_to_taxonomy_edits |
String specifying path to taxonomy edits file in CSV format. The file must contain the following fields: 'Old_Taxonomy', 'New_Taxonomy', 'Notes'. Old taxonomies are replaced with new taxonomies in the order the records appear in the file. The taxonomic levels in the 'Old_Taxonomy' and 'New_Taxonomy' fields should be delimited by a semi-colon. If no taxonomy edits are desired, then set this variable to |
path_to_sequence_edits |
String specifying path to sequence edits file in CSV format. The file must contain the following fields: 'Action', 'Common_Name', 'Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species', 'Sequence', 'Notes'. The values in the 'Action' field must be either 'Add' or 'Remove', which will add or remove the respective sequence from the reference database. Values in the 'Common_Name' field are optional. Values should be supplied to all taxonomy fields. If using a reference database from MIDORI, then use NCBI domain names (e.g., 'Eukaryota') in the 'Domain' field. If using a reference database from UNITE, then use kingdom names (e.g., 'Fungi') in the 'Domain' field. The 'Species' field should contain species binomials. Sequence edits are performed after taxonomy edits, if applied. If no sequence edits are desired, then set this variable to |
path_to_taxa_subset_list |
String specifying path to list of species (in CSV format) to subset the reference database to. This option is helpful if the user wants the reference database to include only the sequences of local species. The file should contain the following fields: 'Common_Name', 'Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species'. There should be no |
makeblastdb_command |
String specifying path to the makeblastdb program, which is a part of BLAST. The default ( |
... |
Accepts former argument names for backwards compatibility. |
No return value. Writes formatted BLAST database files.
local_taxa_tool for performing geographically-conscious taxonomic assignment. adjust_taxonomies for adjusting a taxonomy system.
# Get path to example reference sequences FASTA file.
path_to_input_file<-system.file("extdata",
                                "example_reference_sequences.fasta",
                                package="LocaTT",
                                mustWork=TRUE)
# Create a temporary file path for the output reference database FASTA file.
path_to_output_file<-tempfile(fileext=".fasta")
# Format reference database.
format_reference_database(path_to_input_database=path_to_input_file,
                          path_to_output_database=path_to_output_file)
Gets the consensus taxonomy from a vector of taxonomic strings.
get_consensus_taxonomy(taxonomies, full_names = TRUE, delimiter = ";")
taxonomies |
A character vector of taxonomic strings. |
full_names |
Logical. If |
delimiter |
A character string of the delimiter between taxonomic levels in the input taxonomies. The default is |
A character string containing the taxonomy agreed upon by all input taxonomies. If the input taxonomies are not the same at any taxonomic level, then NA is returned.
get_taxonomic_level for extracting a taxonomic level from taxonomic strings. expand_taxonomies for extracting each taxonomic level from a vector of taxonomic strings.
get_consensus_taxonomy(taxonomies=
                         c("Eukaryota;Chordata;Amphibia;Caudata;Ambystomatidae;Ambystoma;Ambystoma_mavortium",
                           "Eukaryota;Chordata;Amphibia;Anura;Bufonidae;Anaxyrus;Anaxyrus_boreas",
                           "Eukaryota;Chordata;Amphibia;Anura;Ranidae;Rana;Rana_luteiventris"),
                       full_names=TRUE,
                       delimiter=";")
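One way to picture a consensus taxonomy is as the longest run of leading taxonomic levels shared by every input string. The base R sketch below follows that reading. The helper name `consensus_sketch` is hypothetical, it ignores the `full_names` argument, and it interprets the documented NA case as "no level is shared", which is an assumption rather than a statement about the package's internals.

```r
# Hypothetical helper (not part of LocaTT): longest run of leading
# taxonomic levels shared by every input string.
consensus_sketch <- function(taxonomies, delimiter = ";") {
  parts <- strsplit(taxonomies, split = delimiter, fixed = TRUE)
  # Only compare as many levels as the shortest taxonomy has.
  depth <- min(lengths(parts))
  shared <- character(0)
  for (i in seq_len(depth)) {
    level_values <- vapply(parts, function(p) p[i], character(1))
    # Stop at the first level where the taxonomies disagree.
    if (length(unique(level_values)) > 1) break
    shared <- c(shared, level_values[1])
  }
  # Assumption: when not even the first level is shared, return NA.
  if (length(shared) == 0) return(NA_character_)
  paste(shared, collapse = delimiter)
}

consensus <- consensus_sketch(c(
  "Eukaryota;Chordata;Amphibia;Caudata;Ambystomatidae;Ambystoma;Ambystoma_mavortium",
  "Eukaryota;Chordata;Amphibia;Anura;Bufonidae;Anaxyrus;Anaxyrus_boreas",
  "Eukaryota;Chordata;Amphibia;Anura;Ranidae;Rana;Rana_luteiventris"
))
```

For these three amphibians the taxonomies agree down to the class level and diverge at the order level, so the sketch stops after Amphibia.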
Gets the specified taxonomic level from a vector of taxonomic strings.
get_taxonomic_level(taxonomies, level, full_names = TRUE, delimiter = ";")
taxonomies |
A character vector of taxonomic strings. |
level |
A numeric value representing the taxonomic level to be extracted. A value of |
full_names |
Logical. If |
delimiter |
A character string of the delimiter between taxonomic levels in the input taxonomies. The default is |
A character vector containing the requested taxonomic level for each element of the input taxonomies.
expand_taxonomies for extracting each taxonomic level from a vector of taxonomic strings. get_consensus_taxonomy for generating a consensus taxonomy from taxonomic strings.
get_taxonomic_level(taxonomies=
                      c("Eukaryota;Chordata;Amphibia;Caudata;Ambystomatidae;Ambystoma;Ambystoma_mavortium",
                        "Eukaryota;Chordata;Amphibia;Anura;Bufonidae;Anaxyrus;Anaxyrus_boreas",
                        "Eukaryota;Chordata;Amphibia;Anura;Ranidae;Rana;Rana_luteiventris"),
                    level=5,
                    full_names=TRUE,
                    delimiter=";")
Formats taxonomies from IUCN Red List taxonomy.csv and common_names.csv files for use with the local_taxa_tool function.
get_taxonomies.IUCN(
  path_to_taxonomies,
  path_to_common_names,
  path_to_output_file,
  domain = "Eukaryota",
  path_to_taxonomy_edits = NA,
  ...
)
path_to_taxonomies |
String specifying path to input IUCN Red List taxonomy.csv file. |
path_to_common_names |
String specifying path to input IUCN Red List common_names.csv file. |
path_to_output_file |
String specifying path to output species list (in CSV format) with formatted taxonomies. |
domain |
String specifying the domain name to use for all species. The IUCN Red List files do not include domain information, so a domain name must be provided. If using a reference database from UNITE, provide a kingdom name here (e.g., |
path_to_taxonomy_edits |
String specifying path to taxonomy edits file in CSV format. The file must contain the following fields: 'Old_Taxonomy', 'New_Taxonomy', 'Notes'. Old taxonomies are replaced with new taxonomies in the order the records appear in the file. The taxonomic levels in the 'Old_Taxonomy' and 'New_Taxonomy' fields should be delimited by a semi-colon. If no taxonomy edits are desired, then set this variable to |
... |
Accepts former argument names for backwards compatibility. |
No return value. Writes an output CSV file with formatted taxonomies.
get_taxonomies.species_binomials for remotely fetching NCBI taxonomies from species binomials. adjust_taxonomies for adjusting a taxonomy system.
# Get path to example taxonomy CSV file.
path_to_taxonomies<-system.file("extdata",
                                "example_taxonomy.csv",
                                package="LocaTT",
                                mustWork=TRUE)
# Get path to example common names CSV file.
path_to_common_names<-system.file("extdata",
                                  "example_common_names.csv",
                                  package="LocaTT",
                                  mustWork=TRUE)
# Create a temporary file path for the output CSV file.
path_to_output_file<-tempfile(fileext=".csv")
# Format common names and taxonomies.
get_taxonomies.IUCN(path_to_taxonomies=path_to_taxonomies,
                    path_to_common_names=path_to_common_names,
                    path_to_output_file=path_to_output_file)
Remotely fetches taxonomies from the NCBI taxonomy database for a list of species binomials. Installation of the taxize package is required to use this function.
get_taxonomies.species_binomials(
  path_to_species_binomials,
  path_to_output_file,
  path_to_taxonomy_edits = NA,
  print_queries = TRUE,
  ...
)
path_to_species_binomials |
String specifying path to input species list with common and scientific names. The file should be in CSV format and contain the following fields: 'Common_Name', 'Scientific_Name'. Values in the 'Common_Name' field are optional. Values in the 'Scientific_Name' field are required. |
path_to_output_file |
String specifying path to output species list with added NCBI taxonomies. The output file will be in CSV format. |
path_to_taxonomy_edits |
String specifying path to taxonomy edits file in CSV format. The file must contain the following fields: 'Old_Taxonomy', 'New_Taxonomy', 'Notes'. Old taxonomies are replaced with new taxonomies in the order the records appear in the file. The taxonomic levels in the 'Old_Taxonomy' and 'New_Taxonomy' fields should be delimited by a semi-colon. If no taxonomy edits are desired, then set this variable to |
print_queries |
Logical. Whether taxa queries should be printed. The default is |
... |
Accepts former argument names for backwards compatibility. |
No return value. Writes an output CSV file with added taxonomies. Species which could not be found in the NCBI taxonomy database appear in the top records of the output file.
get_taxonomies.IUCN for formatting taxonomies from the IUCN Red List. adjust_taxonomies for adjusting a taxonomy system.
# Get path to example input species binomials CSV file.
path_to_species_binomials<-system.file("extdata",
                                       "example_species_binomials.csv",
                                       package="LocaTT",
                                       mustWork=TRUE)
# Create a temporary file path for the output CSV file.
path_to_output_file<-tempfile(fileext=".csv")
# Fetch taxonomies from species binomials.
get_taxonomies.species_binomials(path_to_species_binomials=path_to_species_binomials,
                                 path_to_output_file=path_to_output_file,
                                 print_queries=FALSE)
Trims DNA sequences to an amplicon region using forward and reverse primer sequences. Ambiguous nucleotides in forward and reverse primers are supported.
isolate_amplicon(sequences, forward_primer, reverse_primer)
sequences |
A character vector of DNA sequences to trim to the amplicon region. |
forward_primer |
A string specifying the forward primer sequence. Can contain ambiguous nucleotides. |
reverse_primer |
A string specifying the reverse primer sequence. Can contain ambiguous nucleotides. |
For each DNA sequence, nucleotides matching and preceding the forward primer are removed, and nucleotides matching and following the reverse complement of the reverse primer are removed. The reverse complement of the reverse primer is internally derived from the reverse primer using the reverse_complement function. Ambiguous nucleotides in primers (i.e., the forward and reverse primer arguments) are supported through the internal use of the substitute_wildcards function on the forward primer and the reverse complement of the reverse primer, and primer regions in DNA sequences are located using regular expressions. Trimming will fail for DNA sequences which contain ambiguous nucleotides in their primer regions (e.g., Ns), resulting in NAs for those sequences.
A character vector of DNA sequences trimmed to the amplicon region. NAs are returned for DNA sequences which could not be trimmed, which occurs when either primer region is missing from the DNA sequence or when the forward primer region occurs after a region matching the reverse complement of the reverse primer.
isolate_amplicon(sequences=c("ACACAATCGTGTTTATATTAACTTCAAGAGTGGGCATAGG",
                             "CGTGACAATCATGTTTGTGATTCGTACAAAAGTGCGTCCT"),
                 forward_primer="AATCRTGTTT",
                 reverse_primer="CSCACTHTTG")
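The trimming logic described in the details can be sketched in base R by turning each primer into a regular expression. The helpers below (`to_regex`, `revcomp`, `trim_sketch`) are hypothetical stand-ins, not the package's `substitute_wildcards`, `reverse_complement`, or `isolate_amplicon` functions. The sketch also searches for the reverse primer region only downstream of the forward primer match, which is a simplification.

```r
# IUPAC ambiguity codes mapped to regular-expression character classes.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]",
           K = "[GT]", M = "[AC]", B = "[CGT]", D = "[AGT]",
           H = "[ACT]", V = "[ACG]", N = "[ACGT]")

# Hypothetical helper: expand a primer into a regular expression.
to_regex <- function(primer) {
  paste(iupac[strsplit(primer, "")[[1]]], collapse = "")
}

# Hypothetical helper: reverse complement, including ambiguity codes.
revcomp <- function(x) {
  comp <- chartr("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN", x)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Hypothetical helper: trim one sequence to its amplicon region.
trim_sketch <- function(sequence, forward_primer, reverse_primer) {
  fwd <- regexpr(to_regex(forward_primer), sequence)
  if (fwd == -1) return(NA_character_)
  # Drop everything up to and including the forward primer match.
  start <- fwd + attr(fwd, "match.length")
  rest <- substr(sequence, start, nchar(sequence))
  rev_hit <- regexpr(to_regex(revcomp(reverse_primer)), rest)
  if (rev_hit == -1) return(NA_character_)
  # Keep only the bases before the reverse primer region.
  substr(rest, 1, rev_hit - 1)
}

sequences <- c("ACACAATCGTGTTTATATTAACTTCAAGAGTGGGCATAGG",
               "CGTGACAATCATGTTTGTGATTCGTACAAAAGTGCGTCCT")
amplicons <- vapply(sequences, trim_sketch, character(1),
                    forward_primer = "AATCRTGTTT",
                    reverse_primer = "CSCACTHTTG",
                    USE.NAMES = FALSE)
```

A sequence with an N inside a primer region would not match the expanded pattern, so the sketch returns NA for it, in line with the failure case described above.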
Performs taxonomic assignment of DNA metabarcoding sequences while considering geographic location.
local_taxa_tool(
  path_to_query_sequences,
  path_to_BLAST_database,
  path_to_output_file,
  path_to_local_taxa_list = NA,
  include_missing = FALSE,
  blast_e_value = 1e-05,
  blast_max_target_seqs = 2000,
  blast_task = "megablast",
  full_names = FALSE,
  underscores = FALSE,
  separator = ", ",
  blastn_command = "blastn",
  ...
)
path_to_query_sequences |
String specifying path to FASTA file containing sequences to classify. File path cannot contain spaces. |
path_to_BLAST_database |
String specifying path to BLAST reference database in FASTA format. File path cannot contain spaces. |
path_to_output_file |
String specifying path to output file of classified sequences in CSV format. |
path_to_local_taxa_list |
String specifying path to list of local species in CSV format. The file should contain the following fields: 'Common_Name', 'Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species'. There should be no 'NA's or blanks in the taxonomy fields. The species field should contain the binomial name without subspecies or other information below the species level. There should be no duplicate species (i.e., multiple records with the same species binomial and taxonomy) in the local species list. If local taxa suggestions are not desired, set this variable to |
include_missing |
Logical. If |
blast_e_value |
Numeric. Maximum E-value of returned BLAST hits (lower E-values are associated with more 'significant' matches). The default is |
blast_max_target_seqs |
Numeric. Maximum number of BLAST target sequences returned per query sequence. Enough target sequences should be returned to ensure that all minimum E-value matches are returned for each query sequence. A warning will be produced if this value is not sufficient. The default is |
blast_task |
String specifying BLAST task specification. Use |
full_names |
Logical. If |
underscores |
Logical. If |
separator |
String specifying the separator to use between taxa names in the output CSV file. The default is |
blastn_command |
String specifying path to the blastn program. The default ( |
... |
Accepts former argument names for backwards compatibility. |
Sequences are BLASTed against a global reference database, and the tool suggests locally occurring species which are most closely related (by taxonomy) to any of the best-matching BLAST hits (by bit score). Optionally, local sister taxonomic groups without reference sequences can be added to the local taxa suggestions by setting the include_missing argument to TRUE. If a local taxa list is not provided, then local taxa suggestions will be disabled, but all best-matching BLAST hits will still be returned. Alternatively, a reference database containing just the sequences of local species can be used, and local taxa suggestions can be disabled to return all best BLAST matches of local species. The reference database should be formatted with the format_reference_database function, and the local taxa lists can be prepared using the get_taxonomies.species_binomials and get_taxonomies.IUCN functions. Output field definitions are:
Sequence_name: The query sequence name.
Sequence: The query sequence.
Best_match_references: Species binomials of all best-matching BLAST hits (by bit score) from the reference database.
Best_match_E_value: The E-value associated with the best-matching BLAST hits.
Best_match_bit_score: The bit score associated with the best-matching BLAST hits.
Best_match_query_cover.mean: The mean query cover of all best-matching BLAST hits.
Best_match_query_cover.SD: The standard deviation of query cover of all best-matching BLAST hits.
Best_match_PID.mean: The mean percent identity of all best-matching BLAST hits.
Best_match_PID.SD: The standard deviation of percent identity of all best-matching BLAST hits.
Local_taxa (Field only present if a path to a local taxa list is provided): The finest taxonomic unit(s) which include both any species of the best-matching BLAST hits and any local species. If the species of any of the best-matching BLAST hits are local, then the finest taxonomic unit(s) are at the species level.
Local_species (Field only present if a path to a local taxa list is provided): Species binomials of all local species which belong to the taxonomic unit(s) in the Local_taxa field.
Local_taxa.include_missing (Field only present if both a path to a local taxa list is provided and the include_missing argument is set to TRUE): Local sister taxonomic groups without reference sequences are added to the local taxa suggestions from the Local_taxa field.
Local_species.include_missing (Field only present if both a path to a local taxa list is provided and the include_missing argument is set to TRUE): Species binomials of all local species which belong to the taxonomic unit(s) in the Local_taxa.include_missing field.
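The idea behind the Local_taxa field can be sketched for the simplest case of a single best-matching BLAST hit. The helper `finest_shared_unit` is hypothetical, the taxonomies are written as semicolon-delimited strings for convenience, and handling of multiple best-matching hits and of the include_missing option is deliberately left out. It illustrates the concept only and is not the package's algorithm.

```r
# Hypothetical helper (not part of LocaTT): the finest taxonomic unit that
# contains both a reference hit and at least one local species.
finest_shared_unit <- function(hit_taxonomy, local_taxonomies, delimiter = ";") {
  hit <- strsplit(hit_taxonomy, split = delimiter, fixed = TRUE)[[1]]
  locals <- strsplit(local_taxonomies, split = delimiter, fixed = TRUE)
  # For each local species, count the leading levels shared with the hit.
  shared_depth <- vapply(locals, function(l) {
    n <- min(length(hit), length(l))
    same <- hit[seq_len(n)] == l[seq_len(n)]
    if (all(same)) n else which(!same)[1] - 1
  }, numeric(1))
  best <- max(shared_depth)
  # No shared level at all: nothing to suggest.
  if (best == 0) return(NA_character_)
  hit[best]
}

local_taxonomies <- c(
  "Eukaryota;Chordata;Amphibia;Anura;Ranidae;Rana;Rana luteiventris",
  "Eukaryota;Chordata;Amphibia;Anura;Bufonidae;Anaxyrus;Anaxyrus boreas"
)

# The hit species is not local, but a congener is.
unit_genus <- finest_shared_unit(
  "Eukaryota;Chordata;Amphibia;Anura;Ranidae;Rana;Rana temporaria",
  local_taxonomies
)

# The hit species is itself local, so the unit is at the species level.
unit_species <- finest_shared_unit(
  "Eukaryota;Chordata;Amphibia;Anura;Bufonidae;Anaxyrus;Anaxyrus boreas",
  local_taxonomies
)
```

In the first call the finest unit shared with any local species is the genus Rana, so the local suggestion would be the local members of that genus. In the second call the hit is a local species, so the unit resolves to the species itself.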
No return value. Writes an output CSV file with fields defined in the details section.
A manuscript describing this taxonomic assignment method is in preparation.
format_reference_database for formatting reference databases. get_taxonomies.species_binomials and get_taxonomies.IUCN for creating local taxa lists. adjust_taxonomies for adjusting a taxonomy system.
# Get path to example query sequences FASTA file.
path_to_query_sequences<-system.file("extdata",
                                     "example_query_sequences.fasta",
                                     package="LocaTT",
                                     mustWork=TRUE)
# Get path to example BLAST reference database FASTA file.
path_to_BLAST_database<-system.file("extdata",
                                    "example_blast_database.fasta",
                                    package="LocaTT",
                                    mustWork=TRUE)
# Get path to example local taxa list CSV file.
path_to_local_taxa_list<-system.file("extdata",
                                     "example_local_taxa_list.csv",
                                     package="LocaTT",
                                     mustWork=TRUE)
# Create a temporary file path for the output CSV file.
path_to_output_file<-tempfile(fileext=".csv")
# Run the local taxa tool.
local_taxa_tool(path_to_query_sequences=path_to_query_sequences,
                path_to_BLAST_database=path_to_BLAST_database,
                path_to_output_file=path_to_output_file,
                path_to_local_taxa_list=path_to_local_taxa_list,
                include_missing=TRUE,
                full_names=TRUE,
                underscores=TRUE)
Merges forward and reverse DNA sequence reads.
merge_pairs(forward_reads, reverse_reads, minimum_overlap = 10)
forward_reads |
A character vector of forward DNA sequence reads. |
reverse_reads |
A character vector of reverse DNA sequence reads. |
minimum_overlap |
Numeric. The minimum length of an overlap that must be found between the end of the forward read and the start of the reverse complement of the reverse read in order for a read pair to be merged. The default is 10. |
For each pair of forward and reverse DNA sequence reads, the reverse complement of the reverse read is internally derived using the reverse_complement function, and the read pair is merged into a single sequence if an overlap of at least the minimum length is found between the end of the forward read and the start of the reverse complement of the reverse read. If an overlap of the minimum length is not found, then an NA is returned for the merged read pair.
A character vector of merged DNA sequence read pairs. NAs are returned for read pairs which could not be merged, which occurs when an overlap of at least the minimum length is not found between the end of the forward read and the start of the reverse complement of the reverse read.
truncate_and_merge_pairs for truncating and merging forward and reverse DNA sequence reads.
merge_pairs(forward_reads=c("CCTTACGAATCCTGT","TTCTCCACCCGCGGATA","CGCCCGGAGTCCCTGTAGTA"), reverse_reads=c("GACAAACAGGATTCG","CAATATCCGCGGGTG","TACTACAGGGACTCC"))
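As a rough illustration of the merging rule described for merge_pairs, the base R sketch below reverse-complements the reverse read and looks for an exact overlap between the end of the forward read and the start of that reverse complement. The helper names (`revcomp_sketch`, `merge_pair_sketch`), the exact-match requirement, and the choice to try the longest overlap first are assumptions made for illustration; they are not taken from the LocaTT source.

```r
# Hypothetical helpers; a sketch of overlap merging, not LocaTT's implementation.
revcomp_sketch <- function(s) {
  # Complement unambiguous bases, then reverse the character order.
  comp <- chartr("ACGT", "TGCA", s)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

merge_pair_sketch <- function(forward, reverse, minimum_overlap = 10) {
  rc <- revcomp_sketch(reverse)
  max_overlap <- min(nchar(forward), nchar(rc))
  if (max_overlap < minimum_overlap) return(NA_character_)
  # Assumption: try the longest candidate overlap first and require an exact match.
  for (k in seq(max_overlap, minimum_overlap)) {
    tail_f <- substr(forward, nchar(forward) - k + 1, nchar(forward))
    head_r <- substr(rc, 1, k)
    if (tail_f == head_r) {
      # Append the part of the reverse complement that extends past the overlap.
      return(paste0(forward, substr(rc, k + 1, nchar(rc))))
    }
  }
  NA_character_
}

forward_reads <- c("CCTTACGAATCCTGT", "TTCTCCACCCGCGGATA", "CGCCCGGAGTCCCTGTAGTA")
reverse_reads <- c("GACAAACAGGATTCG", "CAATATCCGCGGGTG", "TACTACAGGGACTCC")
merged <- unname(mapply(merge_pair_sketch, forward_reads, reverse_reads))
```

Read pairs without a sufficiently long overlap come back as NA, mirroring the behavior documented for merge_pairs.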
Extract regression coefficients from multivariate logistic regression models.
mlcoef(fit, probs = c(0.025, 0.25, 0.5, 0.75, 0.975), dimnames)
fit |
A |
probs |
Numeric vector of probabilities. Passed to the |
dimnames |
List (optional). If provided, then names within the returned 3-dimensional array will receive these values. Passed to the |
Extracts regression coefficient estimates from a multivariate logistic regression model fit using the mlreg function. Summarizes estimates by the quantiles of their posterior distributions, and returns summaries in a 3-dimensional array. The dimensions of the 3D array represent the posterior quantiles, the predictor variables, and the response variables, respectively. Values at probs = 0.025 and 0.975 comprise 95% credible intervals. Values at probs = 0.25 and 0.75 comprise 50% credible intervals, and values at probs = 0.5 represent point estimates.
Numeric 3-dimensional array of regression coefficient posterior quantiles.
mlreg for fitting multivariate logistic regression models. mlcor for extracting residual correlations from multivariate logistic regression models. mlformat for formatting output of multivariate logistic regression models.
# Define example data file path.
path<-system.file("extdata", "example_mvlogistic_data.rds", package="LocaTT", mustWork=TRUE)
# Read in example regression data.
data<-readRDS(file=path)
# Extract regression coefficients.
out<-mlcoef(fit=data$fit)
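The layout of the array returned by mlcoef (posterior quantiles by predictors by responses) can be mimicked with simulated draws. The sketch below does not use a stanfit object; the `draws` array, its dimensions, and the variable names are invented for illustration only.

```r
# Simulated posterior draws: iterations x predictors x responses (illustrative).
set.seed(1)
n_draws <- 500
draws <- array(rnorm(n_draws * 2 * 3), dim = c(n_draws, 2, 3))

# Summarize each coefficient by quantiles of its posterior draws.
probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
summary_array <- apply(draws, c(2, 3), quantile, probs = probs)

# Label the dimensions: quantiles, predictors, responses.
dimnames(summary_array) <- list(paste0(probs * 100, "%"),
                                c("x1", "x2"),
                                c("y1", "y2", "y3"))

# 95% credible interval and point estimate for predictor x1 on response y1.
summary_array[c("2.5%", "50%", "97.5%"), "x1", "y1"]
```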
Extract residual correlations from multivariate logistic regression models.
mlcor(fit, probs = c(0.025, 0.25, 0.5, 0.75, 0.975), dimnames)
fit |
A |
probs |
Numeric vector of probabilities. Passed to the |
dimnames |
List (optional). If provided, then names within the returned 3-dimensional array will receive these values. Passed to the |
Extracts residual correlation estimates from a multivariate logistic regression model fit using the mlreg function. Summarizes estimates by the quantiles of their posterior distributions, and returns summaries in a 3-dimensional array. The dimensions of the 3D array represent the posterior quantiles (dimension 1) and the response variables (both dimensions 2 and 3). Values at probs = 0.025 and 0.975 comprise 95% credible intervals. Values at probs = 0.25 and 0.75 comprise 50% credible intervals, and values at probs = 0.5 represent point estimates.
Numeric 3-dimensional array of residual correlation posterior quantiles.
mlreg for fitting multivariate logistic regression models. mlcoef for extracting regression coefficients from multivariate logistic regression models. mlformat for formatting output of multivariate logistic regression models.
# Define example data file path.
path<-system.file("extdata", "example_mvlogistic_data.rds", package="LocaTT", mustWork=TRUE)
# Read in example regression data.
data<-readRDS(file=path)
# Extract residual correlations.
out<-mlcor(fit=data$fit)
Format output of multivariate logistic regression models.
mlformat(fit, mode = "B", ci = 0.95, digits = 3, names.x, names.y)
fit |
A |
mode |
Character scalar. Specifies which parameter set to summarize. When |
ci |
Numeric scalar. Defines the credible interval for parameter summaries. When |
digits |
Numeric scalar. Positive integer value specifying the number of decimal places to which results will be rounded. The default is 3. |
names.x |
Character vector (optional). If provided, then supplies the names of predictor variables in the returned matrix. Names should match those of the predictor variables used to fit the |
names.y |
Character vector (optional). If provided, then supplies the names of response variables in the returned matrix. Names should match those of the response variables used to fit the |
Formats output of a multivariate logistic regression model fit using the mlreg function. When mode = "B" (the default), returns regression coefficient estimates. When mode = "R", returns residual correlation estimates (if mlreg is fit with multivariate = TRUE). Summarizes parameters by the quantiles of their posterior distributions, with a point estimate at the 50th percentile (i.e., the posterior median). Lower and upper limits are defined by the credible interval argument. At the default ci = 0.95, returns 95% credible intervals. When a credible interval does not overlap zero, the point estimate is appended with an asterisk.
Numeric matrix of posterior summaries.
mlreg for fitting multivariate logistic regression models. mlcoef for extracting regression coefficients from multivariate logistic regression models. mlcor for extracting residual correlations from multivariate logistic regression models.
# Define example data file path.
path<-system.file("extdata", "example_mvlogistic_data.rds", package="LocaTT", mustWork=TRUE)
# Read in example regression data.
data<-readRDS(file=path)
# Retrieve fitted regression model.
fit<-data$fit
# Retrieve predictor matrix.
X<-data$X
# Retrieve response matrix.
Y<-data$Y
# Extract regression coefficients.
B<-mlformat(fit=fit,mode="B", names.x=colnames(X), names.y=colnames(Y))
# Display regression coefficients.
print(B,quote=FALSE,right=TRUE)
# Extract residual correlations.
R<-mlformat(fit=fit,mode="R", names.y=colnames(Y))
# Display residual correlations.
print(R,quote=FALSE,right=TRUE)
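The asterisk convention described for mlformat (flagging a point estimate whose credible interval excludes zero) can be sketched for a single parameter as follows. `format_estimate` is a hypothetical helper operating on a plain vector of posterior draws; how mlformat builds its full matrix is not shown in this manual.

```r
# Hypothetical helper; summarizes one parameter from its posterior draws.
format_estimate <- function(draws, ci = 0.95, digits = 3) {
  alpha <- (1 - ci) / 2
  # Lower limit, posterior median, and upper limit.
  q <- quantile(draws, probs = c(alpha, 0.5, 1 - alpha), names = FALSE)
  excludes_zero <- q[1] > 0 || q[3] < 0
  est <- formatC(q[2], format = "f", digits = digits)
  # Append an asterisk when the interval does not overlap zero.
  if (excludes_zero) paste0(est, "*") else est
}

format_estimate(c(1, 2, 3, 4, 5))    # interval lies entirely above zero
format_estimate(c(-2, -1, 0, 1, 2))  # interval overlaps zero
```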
Generate predictions for multivariate logistic regression models.
mlpredict(X, fit, names)
X |
Numeric predictor matrix. Predictions are made for each record. Each field represents a predictor variable, and the predictor variables must match (in order) those used to fit the |
fit |
A |
names |
Vector (optional). If provided, then field names in the matrices of the returned list will receive these values. If omitted, then the matrices in the returned list will lack field names. |
Generates posterior predictions for multivariate logistic regression models fit with the mlreg function. Returns a list where each element contains a matrix of posterior predictions for the respective record of X. Field names for the element matrices can optionally be provided with the names argument.
A list whose elements contain numeric matrices of posterior predictions. Within the list, one element is returned for each record of X. Element names are taken from the row names of X.
mlreg for fitting multivariate logistic regression models. mlformat for formatting output of multivariate logistic regression models. mlWAIC for computing widely applicable information criteria for multivariate logistic regression models.
# Define example data file path.
path<-system.file("extdata", "example_mvlogistic_data.rds", package="LocaTT", mustWork=TRUE)
# Read in example regression data.
data<-readRDS(file=path)
# Predict with fitted multivariate logistic regression.
out<-mlpredict(X=data$X,fit=data$fit,names=colnames(data$Y))
Fit a multivariate logistic regression model. Installation of the rstan package is required to use this function.
mlreg(Y, X, multivariate = TRUE, priors = c(B.mu = 0, B.sd = 1, lkj = 1), iter = 20000, thin = 20, control = list(adapt_delta = 0.99, max_treedepth = 20, stepsize = 0.01), ...)
Y |
Numeric response matrix. Each record represents an observation, and each field represents a response dimension. Matrix cells contain binary values (i.e., |
X |
Numeric predictor matrix. Each record represents an observation, and each field represents a predictor variable. Matrix cells contain predictor values. |
multivariate |
Logical scalar. If |
priors |
Named numeric vector. Elements represent the prior values of their respective named parameters. When predictors are centered and scaled, the defaults generally represent weakly informative priors. Regression coefficients ( |
iter |
Numeric scalar. Integer value specifying the number of iterations for each chain (including warmup). The default is 20000. |
thin |
Numeric scalar. Integer value specifying the thinning interval. The default is 20. |
control |
Named list of parameters which control the behavior of the Stan sampler. Passed to the |
... |
Additional arguments passed to the |
Fits a multivariate logistic regression model using the rstan interface to Stan (Carpenter et al. 2017). The model follows Ovaskainen et al. (2010), where the Bernoulli marginals are reparameterized as truncated continuous latent variables (Albert & Chib 1993). The latent variables z receive a positive constraint when y = 1 and a negative constraint when y = 0, where z is a linear combination of predictors with correlated standard logistic errors. Equivalently, the latent variables follow a multivariate logistic distribution with scale parameters fixed at one (O'Brien & Dunson 2004), constructed in Stan as a Gaussian copula with logistic marginals (Song 2000). A stanfit object of the fitted model is returned, which can be used with standard rstan functions to evaluate model convergence (e.g., posterior trace plots, R-hat convergence diagnostics, and effective sample sizes). By default, weakly informative priors are used on the regression coefficients (B) and residual correlation matrix (R).
Returns a stanfit object of the fitted multivariate logistic regression model.
Albert JH, and Chib S. 1993. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422): 669-679.
Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, and Riddell A. 2017. Stan: A probabilistic programming language. Journal of Statistical Software, 76: 1-32. DOI: 10.18637/jss.v076.i01
O'Brien SM, and Dunson DB. 2004. Bayesian multivariate logistic regression. Biometrics, 60: 739-746. DOI: 10.1111/j.0006-341X.2004.00224.x
Ovaskainen O, Hottola J, and Siitonen J. 2010. Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91(9): 2514-2521. DOI: 10.1890/10-0173.1
Song P. 2000. Multivariate dispersion models generated from Gaussian copula. Scandinavian Journal of Statistics, 27(2): 305-320. DOI: 10.1111/1467-9469.00191
mlformat for formatting output of multivariate logistic regression models. mlpredict for generating predictions from multivariate logistic regression models. mlWAIC for computing widely applicable information criteria for multivariate logistic regression models.
# Define example data file path.
path<-system.file("extdata", "example_mvlogistic_data.rds", package="LocaTT", mustWork=TRUE)
# Read in example regression data.
data<-readRDS(file=path)
# Fit multivariate logistic regression.
out<-mlreg(Y=data$Y,X=data$X)
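To make the latent-variable construction behind mlreg more concrete, the sketch below simulates binary responses from a Gaussian copula with standard logistic marginals in base R. It is a data-generating illustration only, not the Stan model fit by mlreg; the coefficient matrix `B`, the correlation matrix `R`, the intercept column, and the sample size are all invented values.

```r
set.seed(42)
n <- 200                                       # number of observations (illustrative)
X <- cbind(1, rnorm(n))                        # assumed intercept plus one predictor
B <- matrix(c(0.5, 1, -0.5, -1), nrow = 2)     # predictors x responses (illustrative)
R <- matrix(c(1, 0.6, 0.6, 1), nrow = 2)       # residual correlation (illustrative)

# Correlated standard normal draws via the Cholesky factor of R.
g <- matrix(rnorm(n * 2), nrow = n) %*% chol(R)

# Gaussian copula: map normals to uniforms, then to standard logistic errors.
e <- qlogis(pnorm(g))

# Latent variables are a linear combination of predictors plus correlated errors.
z <- X %*% B + e

# y = 1 where the latent variable is positive, y = 0 otherwise.
Y <- (z > 0) * 1
```

Each column of `e` is marginally standard logistic, while the dependence between columns is inherited from `R` through the normal draws.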
Computes the widely applicable information criterion (WAIC) for multivariate logistic regression models. Serves as a wrapper for mlreg, mlpredict, djsdm, and waic for convenient WAIC calculations. Installation of the rstan package is required to use this function.
mlWAIC(Y, X, multivariate = TRUE, method = 2, priors = c(B.mu = 0, B.sd = 1, lkj = 1), iter = 20000, thin = 20, control = list(adapt_delta = 0.99, max_treedepth = 20, stepsize = 0.01), ...)
Y |
Numeric response matrix. Each record represents an observation, and each field represents a response dimension. Matrix cells contain binary values (i.e., |
X |
Numeric predictor matrix. Each record represents an observation, and each field represents a predictor variable. Matrix cells contain predictor values. |
multivariate |
Logical scalar. If |
method |
Numeric scalar. Options are |
priors |
Named numeric vector. Elements represent the prior values of their respective named parameters. When predictors are centered and scaled, the defaults generally represent weakly informative priors. Regression coefficients ( |
iter |
Numeric scalar. Integer value specifying the number of iterations for each chain (including warmup). The default is 20000. |
thin |
Numeric scalar. Integer value specifying the thinning interval. The default is 20. |
control |
Named list of parameters which control the behavior of the Stan sampler. Passed to the |
... |
Additional arguments passed to the |
For convenience, wraps the steps involved in WAIC calculations for Bayesian multivariate logistic regression models. Begins by fitting a Bayesian multivariate logistic regression model with the mlreg function, then generates resubstitution posterior predictions using the mlpredict function. The pointwise log-likelihood is calculated with the djsdm function given the response matrix and posterior predictions. WAIC is calculated from the pointwise log-likelihood using the waic function. Because djsdm does not consider residual correlations in density calculations, species interactions do not contribute to WAIC (i.e., response dimensions are independent).
Returns numeric scalar of the widely applicable information criterion.
Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, and Riddell A. 2017. Stan: A probabilistic programming language. Journal of Statistical Software, 76: 1-32. DOI: 10.18637/jss.v076.i01
Gelman A, Hwang J, and Vehtari A. 2014. Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6): 997-1016. DOI: 10.1007/s11222-013-9416-2
O'Brien SM, and Dunson DB. 2004. Bayesian multivariate logistic regression. Biometrics, 60: 739-746. DOI: 10.1111/j.0006-341X.2004.00224.x
Ovaskainen O, Hottola J, and Siitonen J. 2010. Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91(9): 2514-2521. DOI: 10.1890/10-0173.1
Watanabe S. 2010. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11(116): 3571-3594.
mlreg for fitting multivariate logistic regression models. mlpredict for generating predictions from multivariate logistic regression models. djsdm for probability mass function of a joint species distribution model. waic for generic function to compute widely applicable information criterion.
# Define example data file path.
path<-system.file("extdata", "example_mvlogistic_data.rds", package="LocaTT", mustWork=TRUE)
# Read in example regression data.
data<-readRDS(file=path)
# Compute WAIC for multivariate logistic regression.
out<-mlWAIC(Y=data$Y,X=data$X)
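The final step of the workflow wrapped by mlWAIC, computing WAIC from a pointwise log-likelihood, can be sketched in base R. The version below uses the variance-based penalty that Gelman et al. (2014) call p_WAIC2 and reports the result on the deviance scale; whether this corresponds to `method = 2`, and how the package's waic function expects its input to be arranged, are assumptions. `waic_sketch` is a hypothetical name.

```r
# Hypothetical helper; log_lik has posterior draws in rows and observations in columns.
waic_sketch <- function(log_lik) {
  # Log pointwise predictive density: log of the posterior mean likelihood.
  lppd <- sum(log(colMeans(exp(log_lik))))
  # Penalty: posterior variance of the log-likelihood, summed over observations.
  p_waic <- sum(apply(log_lik, 2, var))
  # Deviance scale, so that lower values indicate better expected predictive fit.
  -2 * (lppd - p_waic)
}

# Toy input: 10 draws, 4 observations, each with likelihood 0.5 in every draw.
toy_log_lik <- matrix(log(0.5), nrow = 10, ncol = 4)
waic_sketch(toy_log_lik)
```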
Normalizes a vector or each record of a matrix into a simplex.
normalize(x)
x |
Numeric vector or matrix. If vector, then the vector will be normalized to sum to one. If matrix, then each record will be normalized to sum to one (and a matrix returned). |
Returns a vector or matrix whose elements (if vector) or records (if matrix) are computed as x/sum(x). This normalizes a vector or matrix record into a set of proportions which sum to one (i.e., a simplex). If a matrix is provided for the x argument, then normalization is performed independently for each record.
A numeric vector or matrix whose elements (if vector) or records (if matrix) sum to one.
softmax for the softmax function.
# Normalize vector.
normalize(x=c(3,1,5,7))
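The x/sum(x) rule described for normalize, including the record-wise behavior for matrices, can be written in a few lines of base R. `normalize_sketch` is a hypothetical stand-in, not the package function.

```r
# Hypothetical stand-in for the normalization rule x / sum(x).
normalize_sketch <- function(x) {
  if (is.matrix(x)) {
    # Dividing by rowSums recycles down columns, so each record is scaled by its own sum.
    x / rowSums(x)
  } else {
    x / sum(x)
  }
}

# Vector input.
normalize_sketch(c(3, 1, 5, 7))

# Matrix input: each record (row) sums to one.
m <- matrix(c(1, 3, 2, 2, 5, 5), nrow = 2)
normalize_sketch(m)
```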
Generates proportion plots for multiple groups.
proportion(x, r = 1, b = 0.025, v = 1000, w = 1, f = 0.5, c = "lightskyblue", s = FALSE, a = TRUE, m = 3, ...)
x |
A list of vectors named |
r |
Numeric scalar. Radius of plot circle (default = |
b |
Numeric scalar. Plot radius buffer (proportion; default = |
v |
Numeric scalar. Vertex count of plot circle (default = |
w |
Numeric scalar. Line width of outer circle (default = |
f |
Numeric scalar. Line width of sectors as a proportion of |
c |
Character string. Fill color of sector proportions (default = |
s |
Logical value. If |
a |
Logical value. If |
m |
Numeric scalar. Maximum number of plot columns (default = |
... |
Additional arguments passed to |
Produces a pie-chart-like proportion plot with grouping structure. Each circle represents a group, and each sector represents a sample. When a = TRUE (the default), then the proportion of each sector filled with color represents the within-sample proportional abundance. Groups are sorted alphabetically (or inherit factor level ordering) and arranged from left to right and top to bottom. When s = FALSE (the default), then samples are sorted alphabetically and arranged in a clockwise orientation (from angle zero). When s = TRUE, then samples are sorted by decreasing proportional abundance. Samples are sorted independently for each group. This plot design is specialized for visualizing proportional abundance data.
No return value.
A manuscript describing this plot design is in preparation.
singular.proportion for singular proportion plots. detection for grouped detection plots.
set.seed(1234)
n.groups<-6
n.samples<-6
data<-list(g=rep(x=LETTERS[1:n.groups],each=n.samples), s=rep(x=letters[1:n.samples],times=n.groups), p=stats::rbeta(n=n.groups*n.samples, shape1=1,shape2=1))
proportion(x=data)
Reads FASTA files. Supports the reading of FASTA files with sequences wrapping multiple lines.
read.fasta(file)
file |
A string specifying the path to a FASTA file to read. |
A data frame with fields for sequence names and sequences.
write.fasta for writing FASTA files. read.fastq for reading FASTQ files. write.fastq for writing FASTQ files.
# Get path to example FASTA file.
path_to_fasta_file<-system.file("extdata", "example_query_sequences.fasta", package="LocaTT", mustWork=TRUE)
# Read the example FASTA file.
read.fasta(file=path_to_fasta_file)
Reads FASTQ files. Does not support the reading of FASTQ files with sequences or quality scores wrapping multiple lines.
read.fastq(file)
file |
A string specifying the path to a FASTQ file to read. |
A data frame with fields for sequence names, sequences, comments, and quality scores.
write.fastq for writing FASTQ files. read.fasta for reading FASTA files. write.fasta for writing FASTA files.
# Get path to example FASTQ file.
path_to_fastq_file<-system.file("extdata", "example_query_sequences.fastq", package="LocaTT", mustWork=TRUE)
# Read the example FASTQ file.
read.fastq(file=path_to_fastq_file)
Gets the reverse complement of a DNA sequence. Ambiguous nucleotides are supported.
reverse_complement(sequence)
sequence |
A string specifying the DNA sequence. Can contain ambiguous nucleotides. |
A string of the reverse complement of the DNA sequence.
reverse_complement(sequence="TTCTCCASCCGCGGATHTTG")
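For readers curious how ambiguous nucleotides can be handled when reverse complementing, here is a minimal base R sketch using the standard IUPAC complement pairs (for example, R pairs with Y, K with M, B with V, and D with H). `revcomp_iupac` is a hypothetical name, and uppercase input is assumed; the package's own implementation may differ.

```r
# Hypothetical helper; assumes an uppercase DNA string with IUPAC codes.
revcomp_iupac <- function(sequence) {
  # Complement each symbol; S, W, and N are their own complements.
  comp <- chartr("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN", sequence)
  # Reverse the order of the complemented symbols.
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

revcomp_iupac("TTCTCCASCCGCGGATHTTG")
```

Applying the function twice returns the original sequence, which is a quick sanity check on the complement table.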
Compute species richness from occupancy probabilities.
richness(psi)
psi |
Numeric vector or matrix of occupancy probabilities. If vector, then species richness is computed for the vector of probabilities (and a scalar is returned). If matrix, then species richness is computed independently for each record (and a vector is returned). |
Calculates species richness from occupancy probabilities. Given a vector of species occupancy probabilities, computes the expected number of species as the sum of the probabilities. If given a matrix of species occupancy probabilities (where each record represents a community), computes the expected number of species as the row sums.
Numeric scalar or vector of species richness values.
diversity for computing Hill diversity from proportional abundances. dissimilarity for computing Bray-Curtis dissimilarity from proportional abundances.
# Compute species richness.
richness(psi=c(0.506,0.825,0.135,0.683))
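The expected-richness calculation described for richness reduces to a sum of occupancy probabilities, taken per record for matrices. The sketch below uses a hypothetical `richness_sketch` helper and an invented two-community matrix.

```r
# Hypothetical stand-in: expected species count as the sum of occupancy probabilities.
richness_sketch <- function(psi) {
  if (is.matrix(psi)) rowSums(psi) else sum(psi)
}

# One community with four species.
richness_sketch(c(0.506, 0.825, 0.135, 0.683))

# Two communities (records) by three species (fields); values are illustrative.
psi_matrix <- matrix(c(0.2, 0.9,
                       0.5, 0.1,
                       0.3, 1.0), nrow = 2)
richness_sketch(psi_matrix)
```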
Draws a sector polygon.
sector(s, e, r, v = 1000, ...)
s |
Numeric scalar of start angle (degrees). |
e |
Numeric scalar of end angle (degrees). |
r |
Numeric scalar of circle radius. |
v |
Numeric scalar of full-circle vertex count (default = 1000). |
... |
Additional arguments passed to |
Draws a sector polygon given a start angle, end angle, and circle radius. The sector is drawn about the origin (i.e., x = 0, y = 0). Intended for use with template to generate detection and proportion plots.
No return value.
circle for plotting circle polygons.
template(l=1)
sector(s=0,e=45,r=1)
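The geometry of a sector about the origin can be sketched by computing its polygon vertices: the origin followed by points along the arc between the start and end angles. `sector_vertices` is a hypothetical helper; measuring angles counterclockwise from the positive x-axis and scaling the arc's vertex count by the fraction of the full circle are assumptions, and the package's sector function may use a different orientation.

```r
# Hypothetical helper; returns polygon vertices for a sector centered on the origin.
sector_vertices <- function(s, e, r, v = 1000) {
  # Assumption: the arc receives a share of the full-circle vertex count.
  n <- max(2, ceiling(v * abs(e - s) / 360))
  # Convert degrees to radians along the arc.
  theta <- seq(s, e, length.out = n) * pi / 180
  # First vertex is the origin; the rest lie on the circle of radius r.
  cbind(x = c(0, r * cos(theta)), y = c(0, r * sin(theta)))
}

xy <- sector_vertices(s = 0, e = 45, r = 1)
# The vertices could then be drawn with graphics::polygon(xy) on an existing plot.
```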
Generates a detection plot for a singular group.
singular.detection(x, r = 1, b = 0.025, v = 1000, w = 1, f = 0.5, c = "lightskyblue", t = "", ...)
x |
A list of vectors named |
r |
Numeric scalar. Radius of plot circle (default = |
b |
Numeric scalar. Plot radius buffer (proportion; default = |
v |
Numeric scalar. Vertex count of plot circle (default = |
w |
Numeric scalar. Line width of outer circle (default = |
f |
Numeric scalar. Line width of sectors as a proportion of |
c |
Character string. Fill color of sub-sector detections (default = |
t |
Character string. Plot title (default = |
... |
Additional arguments passed to |
Produces a pie-chart-like detection plot without grouping structure. Each sector represents a sample, and each sub-sector represents a replicate. Filled replicates represent detections. Samples are sorted alphabetically and arranged in a clockwise orientation (from angle zero). This plot design is specialized for visualizing binary detection data.
No return value.
A manuscript describing this plot design is in preparation.
detection for grouped detection plots. proportion for grouped proportion plots.
set.seed(1234)
n.samples<-6
n.replicates<-3
data<-list(s=letters[1:n.samples], r=rep(x=n.replicates,times=n.samples), d=sample(x=0:n.replicates,size=n.samples, replace=TRUE))
singular.detection(x=data)
Generates a proportion plot for a singular group.
singular.proportion(x, r = 1, b = 0.025, v = 1000, w = 1, f = 0.5, c = "lightskyblue", t = "", s = FALSE, a = TRUE, ...)
x |
A list of vectors named |
r |
Numeric scalar. Radius of plot circle (default = |
b |
Numeric scalar. Plot radius buffer (proportion; default = |
v |
Numeric scalar. Vertex count of plot circle (default = |
w |
Numeric scalar. Line width of outer circle (default = |
f |
Numeric scalar. Line width of sectors as a proportion of |
c |
Character string. Fill color of sector proportions (default = |
t |
Character string. Plot title (default = |
s |
Logical value. If |
a |
Logical value. If |
... |
Additional arguments passed to |
Produces a pie-chart-like proportion plot without grouping structure. Each sector represents a sample. When a = TRUE (the default), then the proportion of each sector filled with color represents the within-sample proportional abundance. When s = FALSE (the default), then samples are sorted alphabetically and arranged in a clockwise orientation (from angle zero). When s = TRUE, then samples are sorted by decreasing proportional abundance. This plot design is specialized for visualizing proportional abundance data.
No return value.
A manuscript describing this plot design is in preparation.
proportion for grouped proportion plots. singular.detection for singular detection plots.
set.seed(1234)
n.samples<-6
data<-list(s=letters[1:n.samples], p=stats::rbeta(n=n.samples, shape1=1,shape2=1))
singular.proportion(x=data)
Applies the softmax to a vector or each record of a matrix.
softmax(x)
x |
Numeric vector or matrix. If vector, then the softmax of the vector will be returned. If matrix, then the softmax will be applied independently to each record (and a matrix returned). |
Returns a vector or matrix whose elements (if vector) or records (if matrix) are computed as exp(x)/sum(exp(x)). The softmax converts a vector or matrix record into a set of proportions which sum to one. If a matrix is provided for the x argument, then the softmax is applied independently for each record.
A numeric vector or matrix whose elements (if vector) or records (if matrix) sum to one.
normalize for vector or matrix normalization.
# Perform softmax on vector.
softmax(x=c(-0.25,0.75,1.5,0))
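The exp(x)/sum(exp(x)) rule described for softmax can be sketched in base R as below. `softmax_sketch` is a hypothetical name. Subtracting the maximum before exponentiating is a common numerical-stability variation that leaves the result unchanged; it is added here as an assumption and is not claimed to be what the package does.

```r
# Hypothetical stand-in for the softmax; works on a vector or on each matrix record.
softmax_sketch <- function(x) {
  if (is.matrix(x)) {
    # Shift each record by its own maximum to avoid overflow in exp().
    e <- exp(x - apply(x, 1, max))
    e / rowSums(e)
  } else {
    e <- exp(x - max(x))
    e / sum(e)
  }
}

softmax_sketch(c(-0.25, 0.75, 1.5, 0))
```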
Substitutes wildcard characters in a DNA sequence with their associated nucleotides surrounded by square brackets. The output is useful for matching in regular expressions.
substitute_wildcards(sequence)
sequence |
A string specifying the DNA sequence containing wildcard characters. |
A string of the DNA sequence in which wildcard characters are replaced with their associated nucleotides surrounded by square brackets.
substitute_wildcards(sequence="CAADATCCGCGGSTGGAGAA")
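To show why bracketed wildcards are useful in regular expressions, the sketch below expands IUPAC ambiguity codes into character classes and uses the result as a pattern. `expand_wildcards` and the `iupac` lookup table are hypothetical; the order of nucleotides inside each bracket is an assumption and may differ from the output of substitute_wildcards.

```r
# IUPAC ambiguity codes mapped to regular-expression character classes.
iupac <- c(R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]", K = "[GT]", M = "[AC]",
           B = "[CGT]", D = "[AGT]", H = "[ACT]", V = "[ACG]", N = "[ACGT]")

# Hypothetical helper; replaces each wildcard symbol with its character class.
expand_wildcards <- function(sequence) {
  chars <- strsplit(sequence, "")[[1]]
  hit <- chars %in% names(iupac)
  chars[hit] <- iupac[chars[hit]]
  paste(chars, collapse = "")
}

pattern <- expand_wildcards("CAADATCCGCGGSTGGAGAA")

# The expanded pattern matches any sequence consistent with the wildcards.
grepl(pattern, "CAATATCCGCGGGTGGAGAA")
```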
For each base pair position, summarizes read length, Phred quality score, and the cumulative probability that all bases were called correctly.
summarize_quality_scores(forward_files, reverse_files, n.total = 10000, n.each = ceiling(n.total/length(forward_files)), seed = NULL, FUN = mean, ...)
forward_files |
A character vector of file paths to FASTQ files containing forward DNA sequence reads. |
reverse_files |
A character vector of file paths to FASTQ files containing reverse DNA sequence reads. |
n.total |
Numeric. The number of read pairs to randomly sample from the input FASTQ files. Ignored if |
n.each |
Numeric. The number of read pairs to randomly sample from each pair of input FASTQ files. The default is |
seed |
Numeric. The seed for randomly sampling read pairs. If |
FUN |
A function to compute summary statistics of the quality scores. The default is |
... |
Additional arguments passed to |
For each combination of base pair position and read direction, calculates summary statistics of read length, Phred quality score, and the cumulative probability that all bases were called correctly. The cumulative probability is calculated from the first base pair up to the current position. Quality scores are assumed to be encoded in Sanger format. Read pairs are selected by randomly sampling up to n.each read pairs from each pair of input FASTQ files. By default, n.each is derived from n.total, and n.total will be ignored if n.each is provided. By default, mean is used to compute the summary statistics, but the user may provide another summary function instead (e.g., median). Functions which return multiple summary statistics are also supported (e.g., summary and quantile). Arguments in ... are passed to the summary function.
Returns a data frame containing summary statistics of read length and quality score at each base pair position. The returned data frame contains the following fields:
Direction: The read direction (i.e., "Forward" or "Reverse").
Position: The base pair position.
Length: The summary statistic(s) of read lengths. If FUN returns multiple summary statistics, then a matrix of the summary statistics will be stored in this field, which can be accessed with $Length.
Score: The summary statistic(s) of Phred quality scores. If FUN returns multiple summary statistics, then a matrix of the summary statistics will be stored in this field, which can be accessed with $Score.
Probability: The summary statistic(s) of the cumulative probability that all bases were called correctly. If FUN returns multiple summary statistics, then a matrix of the summary statistics will be stored in this field, which can be accessed with $Probability.
decode_quality_scores for decoding quality scores.
# Get example forward FASTQ files.
forward_files<-system.file("extdata", paste0("S0",1:3,"F.fastq"), package="LocaTT", mustWork=TRUE)
# Get example reverse FASTQ files.
reverse_files<-system.file("extdata", paste0("S0",1:3,"R.fastq"), package="LocaTT", mustWork=TRUE)
# Summarize quality scores.
summarize_quality_scores(forward_files,reverse_files)
Initiates a blank template plot.
template(l, b = 0.025)
l |
Numeric scalar of axis limits (applies to both axes). |
b |
Numeric scalar to extend axis limits (see Details; default = |
Initiates a blank template plot with limits l and buffer b about the origin (i.e., x = 0, y = 0). l is used for axis limits in both the negative and positive directions. b extends the limits beyond l by a fixed proportion (i.e., l * (1 + b)). Intended for use with circle and sector.
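The relationship between l, b, and the resulting axis limits can be sketched with base graphics. The helper `template_limits` below is hypothetical and only mirrors the formula given above; it is not the package's internal code, and the use of `plot.new` and `plot.window` is an assumption about how a blank plot might be set up.

```r
# Hypothetical helper (not part of LocaTT): axis limits implied by l and b,
# symmetric about the origin and extended by the proportion b.
template_limits <- function(l, b = 0.025) {
  c(-1, 1) * l * (1 + b)
}

lims <- template_limits(l = 1)

# Open a blank plotting region with equal limits on both axes.
plot.new()
plot.window(xlim = lims, ylim = lims, asp = 1)
```

With the default buffer, a unit circle drawn in this region would sit slightly inside the plot boundary rather than touching it.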
No return value.
circle for plotting circle polygons. sector for plotting sector polygons.
template(l=1)
circle(r=1)
Trims a target nucleotide sequence from the front or back of DNA sequences. Ambiguous nucleotides in the target nucleotide sequence are supported.
trim_sequences( sequences, target, anchor = "start", fixed = TRUE, required = TRUE, quality_scores )
sequences |
A character vector of DNA sequences to trim. |
target |
A string specifying the target nucleotide sequence. |
anchor |
A string specifying whether the target nucleotide sequence should be trimmed from the start or end of the DNA sequences. Allowable values are |
fixed |
A logical value specifying whether the position of the target nucleotide sequence should be fixed at the ends of the DNA sequences. If |
required |
A logical value specifying whether trimming is required. If |
quality_scores |
An optional character vector of DNA sequence quality scores. If supplied, these will be trimmed to their corresponding trimmed DNA sequences. |
For each DNA sequence, the target nucleotide sequence is searched for at either the front or back of the DNA sequence, depending on the value of the anchor argument. If the target nucleotide sequence is found, then it is removed from the DNA sequence. If the required argument is set to TRUE, then DNA sequences in which the target nucleotide sequence was not found will be returned as NAs. If the required argument is set to FALSE, then untrimmed DNA sequences will be returned along with DNA sequences for which trimming was successful. Ambiguous nucleotides in the target nucleotide sequence are supported through the internal use of the substitute_wildcards function on the target nucleotide sequence, and a regular expression with a leading or ending anchor is used to search for the target nucleotide sequence in the DNA sequences. If the fixed argument is set to FALSE, then any number of characters are allowed between the start or end of the DNA sequences and the target nucleotide sequence. Trimming will fail for DNA sequences which contain ambiguous nucleotides (e.g., Ns) in their target nucleotide sequence region, resulting in NAs for those sequences if the required argument is set to TRUE.
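The anchored regular-expression approach described above can be sketched in base R. This is an illustration under stated assumptions, not the package's code: `iupac` and `trim_sketch` are hypothetical names, the lookup table stands in for whatever `substitute_wildcards` does internally, and only the `fixed = TRUE` behaviour is covered.

```r
# Hypothetical lookup (not part of LocaTT): IUPAC nucleotide codes mapped to
# regular-expression character classes.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]",
           K = "[GT]", M = "[AC]", B = "[CGT]", D = "[AGT]",
           H = "[ACT]", V = "[ACG]", N = "[ACGT]")

# Hypothetical helper: trim a target from the start or end of each sequence.
trim_sketch <- function(sequences, target, anchor = "start", required = TRUE) {
  # Expand each target base into its character class.
  core <- paste(iupac[strsplit(target, "")[[1]]], collapse = "")
  # Anchor the pattern at the start or the end of the sequence.
  pattern <- if (anchor == "start") paste0("^", core) else paste0(core, "$")
  found <- grepl(pattern, sequences)
  trimmed <- sub(pattern, "", sequences)
  # Sequences lacking the target become NA when trimming is required.
  if (required) trimmed[!found] <- NA_character_
  trimmed
}

trim_sketch(c("ATATAGCGCG", "TGCATATACG", "ATCTATCACCGC"), target = "ATMTA")
```

Since the character classes contain only unambiguous bases, an N inside the target region of a sequence cannot match, which is consistent with the trimming failure noted above.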
If quality scores are not provided, then a character vector of trimmed DNA sequences is returned. If quality scores are provided, then a list containing two elements is returned. The first element is a character vector of trimmed DNA sequences, and the second element is a character vector of quality scores which have been trimmed to their corresponding trimmed DNA sequences.
trim_sequences(sequences=c("ATATAGCGCG","TGCATATACG","ATCTATCACCGC"), target="ATMTA", anchor="start", fixed=TRUE, required=TRUE, quality_scores=c("989!.C;F@\"","A((#-#;,2F","HD8I/+67=1>?"))
Removes DNA read pairs containing ambiguous nucleotides, truncates reads by length and quality score, and merges forward and reverse reads.
truncate_and_merge_pairs( forward_files, reverse_files, output_files, truncation_length = NA, threshold.quality_score = 3, threshold.probability = 0.5, minimum_overlap = 10, cores = 1, progress = FALSE )
forward_files |
A character vector of file paths to FASTQ files containing forward DNA sequence reads. |
reverse_files |
A character vector of file paths to FASTQ files containing reverse DNA sequence reads. |
output_files |
A character vector of file paths to output FASTA files. |
truncation_length |
Numeric. The length to truncate DNA sequences to (passed to the |
threshold.quality_score |
Numeric. The Phred quality score threshold used for truncation (passed to the |
threshold.probability |
Numeric. The probability threshold used for truncation (passed to the |
minimum_overlap |
Numeric. The minimum length of an overlap that must be found between the end of the forward read and the start of the reverse complement of the reverse read in order for a read pair to be merged (passed to |
cores |
Numeric. If |
progress |
Logical. If |
For each pair of input FASTQ files, removes DNA read pairs containing ambiguous nucleotides, truncates reads by length, quality score threshold, and probability threshold (in that order), and then merges forward and reverse reads. Merged reads are summarized by frequency of occurrence and written to a FASTA file. See contains_wildcards, truncate_sequences.length, truncate_sequences.quality_score, truncate_sequences.probability, and merge_pairs for methods. Quality scores are assumed to be encoded in Sanger format. Forward and reverse reads can be truncated by different thresholds (see truncation_length, threshold.quality_score, and threshold.probability arguments).
Multicore parallel processing is supported on Mac and Linux operating systems (not available on Windows). When cores > 1 (parallel processing enabled), warnings and errors are printed to the console in addition to being invisibly returned as a list (see the return value section), and errors produced while processing a pair of FASTQ files will not interrupt the processing of other FASTQ file pairs. When cores = 1, FASTQ file pairs are processed sequentially on a single core, and errors will prevent the processing of subsequent FASTQ file pairs (but warnings will not).
If cores = 1, then no return value. Writes a FASTA file for each pair of input FASTQ files with DNA sequence counts stored in the header lines. If cores > 1, then also invisibly returns a list where each element contains warning or error messages associated with processing each pair of input FASTQ files. A NULL value in the returned list means that no warnings or errors were generated from processing the respective pair of FASTQ files.
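Summarizing merged reads by frequency of occurrence, with counts carried in the header lines, might look like the following base R sketch. The input reads and the header format (`Sequence_1;count=3`) are invented for illustration; the actual header layout written by the package is not specified here.

```r
# Hypothetical merged reads for one sample.
merged <- c("ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCC")

# Count occurrences of each unique sequence, most frequent first.
counts <- sort(table(merged), decreasing = TRUE)

# Hypothetical header format: a running identifier followed by the count.
headers <- paste0(">Sequence_", seq_along(counts), ";count=", as.integer(counts))

# Interleave header and sequence lines as they would appear in a FASTA file.
fasta_lines <- as.vector(rbind(headers, names(counts)))
fasta_lines
```

Collapsing identical reads in this way keeps the output file small while preserving the read counts needed for downstream filtering.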
A manuscript describing these methods is in preparation.
contains_wildcards for detecting ambiguous nucleotides in DNA sequences. truncate_sequences.length for truncating DNA sequences to a specified length. truncate_sequences.quality_score for truncating DNA sequences by Phred quality score. truncate_sequences.probability for truncating DNA sequences by cumulative probability that all bases were called correctly. merge_pairs for merging forward and reverse DNA sequence reads. filter_sequences for filtering merged read pairs by PCR replicate.
# Get example forward FASTQ files.
forward_files<-system.file("extdata", paste0("S0",1:3,"F.fastq"), package="LocaTT", mustWork=TRUE)
# Get example reverse FASTQ files.
reverse_files<-system.file("extdata", paste0("S0",1:3,"R.fastq"), package="LocaTT", mustWork=TRUE)
# Create paths for temporary output files.
output_files<-tempfile(pattern=paste0("O",1:3),fileext=".fasta")
# Truncate and merge pairs.
truncate_and_merge_pairs(forward_files=forward_files, reverse_files=reverse_files, output_files=output_files)
Truncates DNA sequences to a specified length.
truncate_sequences.length(sequences, length, quality_scores)
sequences |
A character vector of DNA sequences to truncate. |
length |
Numeric. The length to truncate DNA sequences to. |
quality_scores |
An optional character vector of DNA sequence quality scores. If supplied, these will be truncated to their corresponding truncated DNA sequences. |
If quality scores are not provided, then a character vector of truncated DNA sequences is returned. If quality scores are provided, then a list containing two elements is returned. The first element is a character vector of truncated DNA sequences, and the second element is a character vector of quality scores which have been truncated to their corresponding truncated DNA sequences.
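The two-element return structure can be pictured with plain `substr` calls. This is only a sketch of the idea in base R, not the package's implementation, and the element names `sequences` and `quality_scores` in the list are assumptions.

```r
sequences <- c("ATATAGCGCG", "TGCCGATATA", "ATCTATCACCGC")
quality_scores <- c("989!.C;F@\"", "A((#-#;,2F", "HD8I/+67=1>?")

# Keep the first five characters of each sequence and of its quality string,
# so the two stay aligned position by position.
truncated <- list(sequences = substr(sequences, 1, 5),
                  quality_scores = substr(quality_scores, 1, 5))
truncated
```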
truncate_sequences.quality_score for truncating DNA sequences by Phred quality score. truncate_sequences.probability for truncating DNA sequences by cumulative probability that all bases were called correctly. truncate_and_merge_pairs for truncating and merging forward and reverse DNA sequence reads.
truncate_sequences.length(sequences=c("ATATAGCGCG","TGCCGATATA","ATCTATCACCGC"), length=5, quality_scores=c("989!.C;F@\"","A((#-#;,2F","HD8I/+67=1>?"))
Calculates the cumulative probability that all bases were called correctly along each DNA sequence and truncates the DNA sequence immediately prior to the first occurrence of a probability being equal to or less than a specified value.
truncate_sequences.probability(sequences, quality_scores, threshold = 0.5)
sequences |
A character vector of DNA sequences to truncate. |
quality_scores |
A character vector of DNA sequence quality scores encoded in Sanger format. |
threshold |
Numeric. The probability threshold used for truncation. The default is |
A list containing two elements. The first element is a character vector of truncated DNA sequences, and the second element is a character vector of quality scores which have been truncated to their corresponding truncated DNA sequences.
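The truncation rule (cut immediately before the first position whose cumulative probability is at or below the threshold) can be sketched in base R as follows. The helper `truncate_by_probability` is hypothetical and assumes Sanger (Phred+33) encoding; it illustrates the rule as described rather than reproducing the package's code.

```r
# Hypothetical helper (not part of LocaTT).
truncate_by_probability <- function(sequences, quality_scores, threshold = 0.5) {
  keep <- vapply(quality_scores, function(q) {
    # Decode Sanger quality scores and accumulate per-base probabilities.
    scores <- utf8ToInt(q) - 33
    cumulative <- cumprod(1 - 10^(-scores / 10))
    # First position at or below the threshold, if any.
    hit <- which(cumulative <= threshold)
    # Keep everything before that position, or the whole read if none.
    as.numeric(if (length(hit) > 0) hit[1] - 1 else length(scores))
  }, numeric(1), USE.NAMES = FALSE)
  list(sequences = substr(sequences, 1, keep),
       quality_scores = substr(quality_scores, 1, keep))
}

truncate_by_probability(
  sequences = c("ATATAGCGCG", "TGCCGATATA", "ATCTATCACCGC"),
  quality_scores = c("989!.C;F@\"", "A((#-#;,2F", "HD8I/+67=1>?")
)
```

In this sketch the first two reads are cut to three bases because their fourth quality characters (`!` and `#`) pull the running product below 0.5, while the third read is kept whole.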
truncate_sequences.length for truncating DNA sequences to a specified length. truncate_sequences.quality_score for truncating DNA sequences by Phred quality score. truncate_and_merge_pairs for truncating and merging forward and reverse DNA sequence reads.
truncate_sequences.probability(sequences=c("ATATAGCGCG","TGCCGATATA","ATCTATCACCGC"), quality_scores=c("989!.C;F@\"","A((#-#;,2F","HD8I/+67=1>?"), threshold=0.5)
Truncates DNA sequences immediately prior to the first occurrence of a Phred quality score being equal to or less than a specified value.
truncate_sequences.quality_score(sequences, quality_scores, threshold = 3)
sequences |
A character vector of DNA sequences to truncate. |
quality_scores |
A character vector of DNA sequence quality scores encoded in Sanger format. |
threshold |
Numeric. The Phred quality score threshold used for truncation. The default is |
A list containing two elements. The first element is a character vector of truncated DNA sequences, and the second element is a character vector of quality scores which have been truncated to their corresponding truncated DNA sequences.
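A base R sketch of the quality-score rule is given below: each read is cut immediately before the first base whose Phred score is at or below the threshold. The helper `truncate_by_quality` is hypothetical and assumes Sanger (Phred+33) encoding; it is meant to illustrate the rule, not to reproduce the package's code.

```r
# Hypothetical helper (not part of LocaTT).
truncate_by_quality <- function(sequences, quality_scores, threshold = 3) {
  keep <- vapply(quality_scores, function(q) {
    # Decode Sanger quality scores: Phred score = ASCII code - 33.
    scores <- utf8ToInt(q) - 33
    # First position at or below the threshold, if any.
    hit <- which(scores <= threshold)
    # Keep everything before that position, or the whole read if none.
    as.numeric(if (length(hit) > 0) hit[1] - 1 else length(scores))
  }, numeric(1), USE.NAMES = FALSE)
  list(sequences = substr(sequences, 1, keep),
       quality_scores = substr(quality_scores, 1, keep))
}

truncate_by_quality(
  sequences = c("ATATAGCGCG", "TGCCGATATA", "ATCTATCACCGC"),
  quality_scores = c("989!.C;F@\"", "A((#-#;,2F", "HD8I/+67=1>?")
)
```

Here `!` decodes to a score of 0 and `#` to a score of 2, so the first two reads are cut to three bases; the lowest score in the third read is 10, so it is returned unchanged.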
truncate_sequences.length for truncating DNA sequences to a specified length. truncate_sequences.probability for truncating DNA sequences by cumulative probability that all bases were called correctly. truncate_and_merge_pairs for truncating and merging forward and reverse DNA sequence reads.
truncate_sequences.quality_score(sequences=c("ATATAGCGCG","TGCCGATATA","ATCTATCACCGC"), quality_scores=c("989!.C;F@\"","A((#-#;,2F","HD8I/+67=1>?"), threshold=3)
Generic function calculating widely applicable information criterion (WAIC) from the pointwise log-likelihood.
waic(loglik, method = 2)
loglik |
Numeric matrix of the pointwise log-likelihood. Each record represents a Markov chain Monte Carlo (MCMC) sample, and each field represents an observation. |
method |
Numeric scalar. Options are |
Given the pointwise log-likelihood, calculates WAIC (Watanabe 2010) using the formulas described in Gelman et al. (2014). The expected log pointwise predictive density (elppd) is estimated as the log pointwise predictive density (lppd) adjusted by a bias correction (either pWAIC1 or pWAIC2). To reflect the deviance scale, WAIC is defined as the elppd times negative two. As recommended by Gelman et al. (2014), pWAIC2 is used as the default bias correction (method = 2). See Gelman et al. (2014) for details.
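The calculation can be written out in a few lines of base R. The function below, `waic_sketch`, is a hypothetical re-implementation following the formulas in Gelman et al. (2014), not the package's source; it assumes that `method = 1` selects pWAIC1 and that any other value selects pWAIC2. Rows of `loglik` are MCMC samples and columns are observations, as described for the `loglik` argument.

```r
# Hypothetical re-implementation (not the package's code).
waic_sketch <- function(loglik, method = 2) {
  # Log of the mean likelihood over samples, using log-sum-exp for stability.
  log_mean_exp <- function(x) max(x) + log(mean(exp(x - max(x))))
  # Pointwise log predictive density, one value per observation.
  lppd_i <- apply(loglik, 2, log_mean_exp)
  lppd <- sum(lppd_i)
  p_waic <- if (method == 1) {
    # pWAIC1: twice the gap between log mean likelihood and mean log-likelihood.
    2 * sum(lppd_i - colMeans(loglik))
  } else {
    # pWAIC2: sum over observations of the sample variance of the log-likelihood.
    sum(apply(loglik, 2, var))
  }
  # Bias-corrected elppd, converted to the deviance scale.
  elppd <- lppd - p_waic
  -2 * elppd
}

# Simulated pointwise log-likelihood: 100 samples, 5 observations.
set.seed(1)
loglik <- matrix(rnorm(500, mean = -2, sd = 0.1), nrow = 100, ncol = 5)
waic_sketch(loglik)
```

When every MCMC sample gives the same log-likelihood for each observation, both bias corrections are zero and the result reduces to negative two times the summed log-likelihood.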
Returns numeric scalar of the widely applicable information criterion.
Gelman A, Hwang J, and Vehtari A. 2014. Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6): 997-1016. DOI: 10.1007/s11222-013-9416-2
Watanabe S. 2010. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11(116): 3571-3594.
dmWAIC for computing widely applicable information criteria for Dirichlet-multinomial regression models. mlWAIC for computing widely applicable information criteria for multivariate logistic regression models.
# Define example data file path.
path<-system.file("extdata", "example_regression_data.rds", package="LocaTT", mustWork=TRUE)
# Read in example regression data.
data<-readRDS(file=path)
# Compute WAIC from pointwise log-likelihood.
out<-waic(loglik=data$loglik)
Writes FASTA files.
write.fasta(names, sequences, file)
names |
A character vector of sequence names. |
sequences |
A character vector of sequences. |
file |
A string specifying the path to a FASTA file to write. |
No return value. Writes a FASTA file.
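For readers unfamiliar with the format, the file layout can be sketched in base R: each record is a header line beginning with `>` followed by the sequence. The helper `write_fasta_sketch` is hypothetical and writes each sequence on a single line; whether the package wraps long sequences is not stated here.

```r
# Hypothetical helper (not part of LocaTT): write one header line and one
# sequence line per record.
write_fasta_sketch <- function(names, sequences, file) {
  stopifnot(length(names) == length(sequences))
  # Interleave ">name" header lines with their sequences.
  lines <- as.vector(rbind(paste0(">", names), sequences))
  writeLines(lines, con = file)
}

path <- tempfile(fileext = ".fasta")
write_fasta_sketch(names = c("seq1", "seq2"),
                   sequences = c("ACGT", "TTGA"),
                   file = path)
readLines(path)
```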
read.fasta for reading FASTA files. write.fastq for writing FASTQ files. read.fastq for reading FASTQ files.
# Get path to example sequences CSV file.
path_to_CSV_file<-system.file("extdata", "example_query_sequences.csv", package="LocaTT", mustWork=TRUE)
# Read the example sequences CSV file.
df<-read.csv(file=path_to_CSV_file,stringsAsFactors=FALSE)
# Create a temporary file path for the FASTA file to write.
path_to_FASTA_file<-tempfile(fileext=".fasta")
# Write the example sequences as a FASTA file.
write.fasta(names=df$Name, sequences=df$Sequence, file=path_to_FASTA_file)
Writes FASTQ files.
write.fastq(names, sequences, quality_scores, file, comments)
names |
A character vector of sequence names. |
sequences |
A character vector of sequences. |
quality_scores |
A character vector of quality scores. |
file |
A string specifying the path to a FASTQ file to write. |
comments |
An optional character vector of sequence comments. |
No return value. Writes a FASTQ file.
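The four-line FASTQ record layout (header beginning with `@`, sequence, a `+` separator, and quality string) can be sketched in base R. The helper `write_fastq_sketch` is hypothetical; in particular, appending the comment to the header after a space follows common FASTQ practice and is an assumption about how comments are stored.

```r
# Hypothetical helper (not part of LocaTT): write four lines per record.
write_fastq_sketch <- function(names, sequences, quality_scores, file,
                               comments = NULL) {
  # Each sequence needs a quality string of the same length.
  stopifnot(length(names) == length(sequences),
            length(sequences) == length(quality_scores),
            all(nchar(sequences) == nchar(quality_scores)))
  headers <- paste0("@", names)
  # Assumed convention: the comment follows the name after a space.
  if (!is.null(comments)) headers <- paste(headers, comments)
  # Interleave header, sequence, separator, and quality lines.
  lines <- as.vector(rbind(headers, sequences, "+", quality_scores))
  writeLines(lines, con = file)
}

path <- tempfile(fileext = ".fastq")
write_fastq_sketch(names = c("read1", "read2"),
                   sequences = c("ACGT", "TTGA"),
                   quality_scores = c("IIII", "II5I"),
                   file = path,
                   comments = c("sample=A", "sample=B"))
readLines(path)
```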
read.fastq for reading FASTQ files. write.fasta for writing FASTA files. read.fasta for reading FASTA files.
# Get path to example sequences CSV file.
path_to_CSV_file<-system.file("extdata", "example_query_sequences.csv", package="LocaTT", mustWork=TRUE)
# Read the example sequences CSV file.
df<-read.csv(file=path_to_CSV_file,stringsAsFactors=FALSE)
# Create a temporary file path for the FASTQ file to write.
path_to_FASTQ_file<-tempfile(fileext=".fastq")
# Write the example sequences as a FASTQ file.
write.fastq(names=df$Name, sequences=df$Sequence, quality_scores=df$Quality_score, file=path_to_FASTQ_file, comments=df$Comment)