Causes and analytical impacts of missing data in RADseq phylogenetics: Insights from an African frog (Afrixalus)

Crotti, Marco and Barratt, Christopher D. and Loader, Simon P. and Gower, David J. and Streicher, Jeffrey W. (2019) Causes and analytical impacts of missing data in RADseq phylogenetics: Insights from an African frog (Afrixalus). Zoologica Scripta, 48 (2). pp. 157-167.


Official URL: https://edoc.unibas.ch/72068/



Restriction site‐associated DNA sequencing (RADseq) has emerged as a useful tool in systematics and population genomics. A common feature of RADseq data sets is that they contain missing data that arise from multiple sources including genealogical sampling bias, assembly methodology and sequencing error. Many RADseq studies have demonstrated that allowing sites (single nucleotide polymorphisms, SNPs) with missing data can increase support for phylogenetic hypotheses. Two non‐mutually exclusive explanations for this observation are that (a) larger data sets contain more phylogenetic information; and (b) excluding missing data disproportionally removes sites with the highest mutation rates, causing the exclusion of characters that are likely variable and informative. Using a RADseq data set derived from the East African banana frog, Afrixalus fornasini (up to 1.1 million SNPs), we found that missing data thresholds were positively correlated with the proportion of parsimony‐informative sites and mean branch support. Using three proxies for estimating site‐specific rate, we found that the most conservative missing data strategies excluded rapidly evolving sites, with four‐state sites present only when allowing ≥60% missing data per SNP. Topological similarity among estimated phylogenies was highest for the data sets with ≥60% missing data per SNP. Our results suggest that several desirable phylogenetic qualities were observed when allowing ≥60% missing data per SNP. However, at the highest missing data thresholds (80% and 90% missing data per SNP), we observed differences in performance between high‐ and mixed‐weight DNA extraction samples, which may indicate there are trade‐offs to consider when using degraded genomic template with RADseq protocols.
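
The filtering idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy alignment, and the rule that every character other than A, C, G or T counts as missing (which ignores IUPAC heterozygote codes) are all assumptions made for illustration. Each SNP column is kept if its fraction of missing calls does not exceed a chosen threshold, and the retained columns are then scanned for parsimony-informative sites (at least two states, each present in at least two samples) and four-state sites.

```python
from collections import Counter

# Assumption: only unambiguous bases count as observed states;
# anything else (N, -, ?, IUPAC ambiguity codes) is treated as missing.
BASES = set("ACGT")


def missing_fraction(column):
    """Fraction of samples with no unambiguous base call at this site."""
    return sum(base not in BASES for base in column) / len(column)


def state_counts(column):
    """Count how many samples carry each observed base at this site."""
    return Counter(base for base in column if base in BASES)


def is_parsimony_informative(column):
    """True if at least two states are each seen in at least two samples."""
    return sum(1 for n in state_counts(column).values() if n >= 2) >= 2


def summarize(alignment, max_missing):
    """Filter SNP columns by a per-site missing data threshold and tally them.

    alignment   -- list of equal-length strings, one per sample (hypothetical input)
    max_missing -- highest fraction of missing calls allowed per SNP (0.0 to 1.0)
    """
    columns = ["".join(col) for col in zip(*alignment)]
    kept = [c for c in columns if missing_fraction(c) <= max_missing]
    return {
        "sites": len(kept),
        "parsimony_informative": sum(is_parsimony_informative(c) for c in kept),
        "four_state": sum(len(state_counts(c)) == 4 for c in kept),
    }


# Toy alignment (invented for illustration, not data from the study):
# five samples, four SNP columns.
toy_alignment = [
    "AACA",
    "AANC",
    "GCNG",
    "GNNT",
    "ANNN",
]

if __name__ == "__main__":
    for threshold in (0.0, 0.2, 0.5, 1.0):
        print(threshold, summarize(toy_alignment, threshold))
```

In this toy input the only four-state column has one missing call, so it is dropped under the strictest threshold (no missing data allowed) and retained once 20% missing data per SNP is permitted. This shows the mechanism by which strict filters can remove variable sites; it does not reproduce any result from the paper.
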
Faculties and Departments: 05 Faculty of Science > Departement Umweltwissenschaften > Ehemalige Einheiten Umweltwissenschaften > Biogeographie (Nagel)
UniBasel Contributors: Loader, Simon Paul and Barratt, Christopher
Item Type: Article, refereed
Article Subtype: Research Article
Note: Publication type according to Uni Basel Research Database: Journal article
Last Modified: 09 Nov 2020 16:16
Deposited On: 09 Nov 2020 16:16
