Misleading results of likelihood‐based phylogenetic analyses in the presence of missing data

@article{Simmons2012MisleadingRO,
  title={Misleading results of likelihood‐based phylogenetic analyses in the presence of missing data},
  author={Mark P. Simmons},
  journal={Cladistics},
  year={2012},
  volume={28},
  url={https://api.semanticscholar.org/CorpusID:53123024}
}
This study uses contrived and simulated examples to demonstrate that likelihood, even when applied to simple matrices with little or no homoplasy, homogeneous evolution across groups of characters, perfect model fit, and hundreds or thousands of variable characters, can provide strong support for incorrect topologies when the matrices have non‐random distributions of missing data distributed across all partitions.

The Impact of Missing Data on Species Tree Estimation.

It is demonstrated that concatenation (RAxML), gene-tree-based coalescent (ASTRAL, MP-EST, and STAR), and supertree (matrix representation with parsimony [MRP]) methods perform reliably, so long as missing data are randomly distributed and a sufficiently large number of genes are sampled.

Differences between hard and soft phylogenetic data

When building the tree of life, variability of phylogenetic signal is often accounted for by partitioning gene sequences and testing for differences. The same considerations, however, are rarely

Phylogenetic inference using discrete characters: performance of ordered and unordered parsimony and of three-item statements

The results suggest that the hierarchical character representation not only results in the greatest resolving power, but also in the highest artefactual resolution, both with the simulated and empirical data.

Divergence and support among slightly suboptimal likelihood gene trees

Contemporary phylogenomic studies frequently incorporate two‐step coalescent analyses wherein the first step is to infer individual‐gene trees, generally using maximum‐likelihood implemented in the
...

Missing data in phylogenetic analysis: reconciling results from simulations and empirical data.

Previous simulation and empirical studies showing that taxa with extensive missing data can be accurately placed in phylogenetic analyses and that adding characters with missing data can be beneficial (at least under some conditions) are confirmed.

Does Adding Characters with Missing Data Increase or Decrease Phylogenetic Accuracy?

The results show that the addition of a set of characters with missing data is generally more likely to increase phylogenetic accuracy than decrease it, but the potential benefits of adding these characters quickly disappear as the proportion of missing data increases, and it is suggested that accuracy can be increased to a surprising degree.

PROBLEMS DUE TO MISSING DATA IN PHYLOGENETIC ANALYSES INCLUDING FOSSILS: A CRITICAL REVIEW

Missing data simply represent the unknown and should not be viewed as an impediment to considering all available evidence in phylogenetic analyses, nor used as justification for excluding specific taxa or characters.

Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous

It is shown that maximum likelihood and BMCMC can become strongly biased and statistically inconsistent when the rates at which sequence sites evolve change non-identically over time.

Missing data, incomplete taxa, and phylogenetic accuracy.

In this study, simulations are used to show that the reduced accuracy associated with including incomplete taxa is caused by these taxa bearing too few complete characters rather than too many missing data cells, and suggest a more effective strategy for dealing with incomplete taxa.

The Effect of Ambiguous Data on Phylogenetic Estimates Obtained by Maximum Likelihood and Bayesian Inference

The results of this study have major implications for all analyses that rely on accurate estimates of topology or branch lengths, including divergence time estimation, ancestral state reconstruction, tree-dependent comparative methods, rate variation analysis, phylogenetic hypothesis testing, and phylogeographic analysis.

Effects of data incompleteness on the relative performance of parsimony and Bayesian approaches in a supermatrix phylogenetic reconstruction of Mustelidae and Procyonidae (Carnivora)

Parsimony and Bayesian analyses on a mustelid–procyonid molecular supermatrix found no compelling evidence in support of a relationship between the inferior performance of parsimony and taxon incompleteness, and the relatively good performance of the analyses may be related to the large number of sampled characters.

Quantification of the success of phylogenetic inference in simulations

This method represents an improvement relative to the commonly used approaches of quantifying the percentage of clades that are correctly resolved in the inferred trees or presenting the Robinson–Foulds distance between the inferred trees and the correct tree.