Diagnosability of mtDNA with Random Forests: Using sequence data to delimit subspecies. Mar. Mam. Sci. 33(special issue):101-131
F.I. Archer, K.K. Martien and B.L. Taylor
We examine the use of an ensemble method, Random Forests, to delimit subspecies using mitochondrial DNA (mtDNA) sequences. Diagnosability, a measure of the ability to correctly determine the taxon of a specimen of unknown origin, has historically been used to delimit subspecies, but few studies have explored how to estimate it from DNA sequences. Using simulated and empirical data sets, we demonstrate that Random Forests produces classification models that perform well for diagnosing subspecies and species. Populations with strong social structure and relatively low abundances (e.g., killer whales, Orcinus orca) were found to be as diagnosable as species. Conversely, comparisons involving abundant subspecies (e.g., spinner and spotted dolphins, Stenella longirostris and S. attenuata) are only as diagnosable as many population comparisons. Estimates of diagnosability reported in subspecies and species descriptions should include confidence intervals, which are influenced by the sample sizes of the training data. We also stress the importance of reporting the certainty with which individuals in the training data are classified in order to communicate the strength of the classification model and diagnosability estimate. Guidance as to ideal minimum diagnosability thresholds for subspecies will improve with more comprehensive analyses; however, values in the range of 80%–90% are considered appropriate.
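To illustrate why training sample size influences the confidence interval around a diagnosability estimate, the following minimal sketch treats diagnosability as a simple binomial proportion (specimens correctly classified out of specimens tested) and computes a Wilson score interval. This is an assumption made for illustration only: the Wilson interval is a generic technique substituted here, not necessarily the interval used in the paper, and the function name `wilson_interval` and the specimen counts are hypothetical.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    correct: number of specimens assigned to their true taxon (hypothetical input)
    total:   number of specimens classified
    z:       normal quantile; 1.96 corresponds to roughly 95% coverage
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts: the same 90% point estimate at two sample sizes
for correct, total in [(18, 20), (180, 200)]:
    low, high = wilson_interval(correct, total)
    print(f"n={total}: diagnosability={correct / total:.2f}, "
          f"interval=({low:.3f}, {high:.3f})")
```

Under these assumptions, both cases give the same point estimate, but the interval for the smaller sample is markedly wider, so a point estimate near a proposed threshold conveys little without its interval.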