Journal of Systematics and Evolution 46 (3): 258–262 (2008) doi: 10.3724/SP.J.1002.2008.08008
(formerly Acta Phytotaxonomica Sinica) http://www.plantsystematics.com
Prokaryotic branch of the Tree of Life: A composition vector approach
1,2,3Bai-Lin HAO* 2,4Lei GAO
1(T-Life Research Center, Fudan University, Shanghai 200433, China)
2(Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100080, China)
3(Santa Fe Institute, Santa Fe, NM 87501, USA)
4(Present: Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA)
Abstract The Composition Vector Tree (CVTree) is a parameter-free and alignment-free method for inferring
prokaryotic phylogeny from complete genomes. It is distinct from the traditional 16S rRNA analysis in both the
input data and the methodology. The prokaryotic phylogenetic trees constructed by using the CVTree method
agree well with the Bergey’s taxonomy in all major groupings and fine branching patterns. Thus, combined use of
the CVTree approach and the 16S rRNA analysis may provide an objective and reliable reconstruction of the
prokaryotic branch of the Tree of Life.
Key words Bergey’s Manual, composition vector, CVTree, prokaryotic phylogeny, taxonomy.
Prokaryotes are the most abundant organisms on
Earth. They have been thriving for more than 3.7
billion years. They shaped most of the ecological and
even geochemical environments for all living organ-
isms. Yet our understanding of prokaryotes, particu-
larly, their taxonomy and phylogeny, has been quite
limited. No wonder merely a few years ago Carl
Woese called microbiology “the science without a
past” (Woese, 2000). Nevertheless, the use of 16S
rRNA sequences to infer prokaryotic phylogeny,
suggested by Woese and collaborators (Woese & Fox,
1977), has brought about a wealth of new knowledge.
As a consequence of this success, the modern pro-
karyotic taxonomy as reflected in the new edition of
Bergey’s Manual of Systematic Bacteriology (Ber-
gey’s Manual Trust, 2001–2009) is now largely based
on 16S rRNA analysis. This situation, however,
broaches a question of principle: the Bergey’s taxon-
omy needs verification independent of the 16S rRNA
analysis, in order to serve a function of demarcating
the natural boundaries among prokaryotic species in
an objective and convincing way. The fact that our
newly proposed CVTree approach, using entirely
different input data and methodology, supports most
of the 16S rRNA results, may put the prokaryotic
branch of the Tree of Life on a secure footing.
1 CVTree approach to prokaryotic phylogeny
The CVTree approach was announced in 2002
(Hao et al., 2003) and has been described in Qi et al.
(2004a) and Hao & Qi (2004). A Web Server has been
installed for public access (Qi et al., 2004b). In brief,
the input to CVTree is a collection of all translated
amino acid sequences from the genome of an organ-
ism. We use the NCBI curated RefSeq (Pruitt et al.,
2007) sequences in order to provide a common basis
of comparison. Then the number of K-peptides is
counted by using a sliding window, shifting one letter
at a time along all protein sequences. These counts are
kept in a fixed lexicographic order of amino acid
letters to form a vector with 20^K components. A key
procedure leading to the final composition vector is
the subtraction of a background caused mainly by
neutral mutations in order to highlight the shaping role
of natural selection. As mutations occur randomly at
molecular level (Kimura, 1983), this is done by using
a (K-2)-th order Markovian prediction based on the
number of (K-2)- and (K-1)-peptides from the same
genome. A distance matrix is calculated from these
composition vectors and the standard Neighbor-
Joining program from the PHYLIP package (Felsen-
stein, 1980–2008) is used to generate the CVTrees.
Instead of further elaboration of the method we em-
phasize its distinction from other more traditional
methods:
(1) It is an alignment-free method as each organ-
ism is represented by a composition vector with 20^K
components determined by the number of distinct
K-peptides in the collection of all translated protein
sequences. Sequence alignment is replaced by
K-peptide counting which is not challenged by the
huge difference in genome size and gene number of
prokaryotes.
——————————
Received: 23 January 2008 Accepted: 22 April 2008
* Author for correspondence. E-mail:
(2) It does not require the selection of RNA or
protein-coding gene(s) as all translated protein prod-
ucts in a genome are used. Associated with this is the
immunity of CVTree to lateral gene transfer (LGT).
According to the analysis of Carl Woese on the role of
LGT in cell evolution (Woese, 2002) LGT as an
innovative factor in prokaryote evolution has hap-
pened in more and more restricted ecological niches.
As speciation of prokaryotes is largely caused by
differentiation of ecological environments, the scope of
LGT may correlate with the relatedness of species.
When all protein products are taken into account, as
in the CVTree method, some LGT may even help
to group together closely related species.
(3) While the evaluation of traditional phyloge-
netic trees relies more or less on compatibility and
stability arguments and various statistical tests such as
bootstrapping or Jack-knifing have been invoked in
this spirit, the CVTree results are verified by direct
comparison with systematic bacteriology (Gao et al.,
2007). The CVTrees constructed for 69 to 432
organisms over the past 5 years bear a stable topology
in major branching patterns from phyla down to
species and strains. As compared to some traditional
phylogenetic tree construction methods, the CVTree
approach enjoys a nice feature of “the more genomes
the better agreement” with taxonomy.
(4) Moreover, the CVTree provides a parame-
ter-free method that takes the collection of all proteins
of the organisms under study as input and generates a
distance matrix as output. The peptide length K,
though appearing like a parameter, effectively controls
the resolution power of the method. In fact, the
CVTree method has shown rather high resolution to
elucidate the evolutionary relationship among differ-
ent strains of one and the same species.
(5) The high resolution power of the CVTrees
provides a means to elucidate evolutionary relation-
ships among different strains of one and the same
species when the 16S rRNA analysis may not be
strong enough to resolve very closely related strains.
For example, the 21 species/strains of Streptococcus
form a monophyletic group with converging branch-
ing patterns for K=3 to 6 (Gao et al., 2007).
(6) While the 16S rRNA analysis cannot be ap-
plied to the phylogeny of viruses as the latter do not
possess a ribosome, the CVTree method has been
successfully used to construct phylogeny of coronavi-
ruses including human SARS virus (Gao et al., 2003)
and double-strand DNA viruses (Gao & Qi, 2007). It
has been applied to chloroplasts as well (Chu et al.,
2004).
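The counting and background-subtraction steps outlined at the beginning of this section can be illustrated with a minimal Python sketch. It is not the CVTree program itself. The function names, the toy sequences, the relative-deviation normalization, and the cosine-based dissimilarity (1 - C)/2 are assumptions made for illustration, following our reading of the descriptions cited above (Qi et al., 2004a); only the sliding-window counting and the (K-2)-th order Markov prediction from (K-1)- and (K-2)-peptide frequencies are stated in the text.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, fixed lexicographic order


def kmer_frequencies(proteins, k):
    """Slide a window of width k one letter at a time over every protein and
    return the relative frequency of each k-peptide (non-standard letters skipped)."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if all(c in AMINO_ACIDS for c in word):
                counts[word] += 1
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}


def composition_vector(proteins, k):
    """Sparse composition vector (hypothetical helper): relative deviation of each
    K-peptide frequency from its (K-2)-th order Markov prediction. Only components
    with a non-zero prediction are stored; the rest of the 20**k components are 0."""
    if k < 3:
        raise ValueError("this sketch needs k >= 3")
    f_k = kmer_frequencies(proteins, k)
    f_k1 = kmer_frequencies(proteins, k - 1)
    f_k2 = kmer_frequencies(proteins, k - 2)
    vector = {}
    for prefix, f_prefix in f_k1.items():
        for last in AMINO_ACIDS:
            word = prefix + last
            f_suffix = f_k1.get(word[1:])
            f_middle = f_k2.get(word[1:-1])
            if f_suffix is None or f_middle is None:
                continue  # prediction is zero, so the component stays 0
            predicted = f_prefix * f_suffix / f_middle
            vector[word] = (f_k.get(word, 0.0) - predicted) / predicted
    return vector


def distance(u, v):
    """Assumed dissimilarity (1 - cosine similarity) / 2, ranging from 0 to 1."""
    dot = sum(x * v[w] for w, x in u.items() if w in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    cosine = dot / norm if norm else 0.0  # all-zero vectors are treated as uncorrelated
    return (1.0 - cosine) / 2.0


# Toy, made-up proteomes; real input would be all RefSeq proteins of a genome.
genome_a = ["MKVLAAGIVGLLAMKVLAAG", "MSTNPKPQRKTKRNT"]
genome_b = ["MKVLSAGIVGLLAMKVLSAG", "MSTNPKPQRKTKRNT"]
print(distance(composition_vector(genome_a, 3), composition_vector(genome_b, 3)))
```

Storing the vector as a dictionary avoids allocating all 20^K components (64 million at K=6), most of which are zero for any single genome. In a full pipeline the pairwise distances would be written to a matrix and passed to a neighbor-joining program, as described above.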
In Fig. 1 we show the highest rank CVTree at
K=5, adopted from the Supplementary Material of
Gao et al. (2007). Excluded is the highly degenerate
genome of Candidatus Carsonella ruddii, with a ge-
nome size of less than 160 kbp and 182 genes, much
smaller than that of any known free-living bacterium. Among
the 431 organisms 424 are grouped under the correct
phylum; the 7 outliers are not far from where they
might be placed. Detailed organism-level CVTrees for
K=3 to 6 may be found in the Supplementary Material
of Gao et al. (2007).
2 Comparison of CVTree phylogeny with
systematic bacteriology
Recently, we have performed an exhaustive
comparison of CVTrees based on 31 Archaea and 401
Bacteria genomes available on 31 December 2006
with biologists’ systematics (Gao et al., 2007). Ac-
cording to the Bergey’s taxonomy (Garrity et al.,
2004) these genomes represent 18 phyla, 35 classes,
79 orders, 120 families, 190 genera and 327 species.
(We analyzed strains as well but do not discuss them here, as there
is no taxonomic standard at the strain level.) Among
this hierarchy there are 145 taxa that contain two or
more lower taxa, e.g., 62 genera that contain more
than 2 species. These 145 cases were subject to com-
parison with the CVTrees. It turned out that in 103
(71%) cases the phylogeny was consistent with tax-
onomy, whereas some differences were observed in 42
(29%) cases. The Gao et al. (2007) paper and its
Supplementary Material described these discrepancies
case by case. Significantly, most of these
42 cases were already known to biologists (see examples
below).
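The taxon-by-taxon comparison can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not the procedure used in Gao et al. (2007): the edge-list tree representation, the function names, and the toy tree are hypothetical. A taxon is taken to agree with an unrooted tree when removing a single edge separates exactly its member organisms from all the others. The last lines merely recompute the percentages quoted above.

```python
def leaves_on_side(adj, start, blocked):
    """Leaves reachable from `start` without passing through `blocked`."""
    seen, stack, leaves = {start, blocked}, [start], set()
    while stack:
        node = stack.pop()
        if len(adj[node]) == 1:  # degree-1 nodes are the organisms (leaves)
            leaves.add(node)
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return leaves


def is_monophyletic(edges, taxon_leaves):
    """True if some edge of the unrooted tree splits off exactly `taxon_leaves`.
    In an unrooted tree the complement of such a set passes the same test."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    target = set(taxon_leaves)
    return any(
        leaves_on_side(adj, v, u) == target or leaves_on_side(adj, u, v) == target
        for u, v in edges
    )


def percent(part, whole):
    """Share of `part` in `whole`, rounded to the nearest integer percent."""
    return round(100 * part / whole)


# Toy unrooted tree with hypothetical organisms: ((A1, A2), (B1, B2), C1)
toy_edges = [("x", "A1"), ("x", "A2"), ("y", "B1"), ("y", "B2"),
             ("c", "x"), ("c", "y"), ("c", "C1")]
print(is_monophyletic(toy_edges, {"A1", "A2"}))  # True
print(is_monophyletic(toy_edges, {"A1", "B1"}))  # False

# Counts reported in the text: 103 consistent and 42 discrepant cases out of 145
print(percent(103, 145), percent(42, 145))  # 71 29
```

The check scans every edge, which is adequate for trees of a few hundred organisms; a rooted tree would need an additional rule for clades that contain the root.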
Since the submission of Gao et al. (2007) there
has appeared a new release of Taxonomic Outline of
Bacteria and Archaea (abbreviated as TOBA 7.7
below, see Garrity et al., 2007) and more than 200
new prokaryotic genomes have been sequenced.
Comparison of CVTrees built on more genomes with
newer or alternative taxonomic schemes has re-
moved some more of the 42 discrepant cases. We list
a few examples.
(1) In the Archaea branch of CVTrees at all K=3
to 6 the class Thermoplasmata appears in Phylum
Crenarchaeota. This is a cross-phylum discrepancy in
comparison with the Bergey’s Manual where the class
is listed under Euryarchaeota. However, this place-
ment in CVTrees agrees with the scheme given in the
book Five Kingdoms (Margulis & Schwartz, 1998).
Fig. 1. The highest rank CVTree at K=5. A taxon name represents a monophyletic cluster with the number of organisms given in parentheses. For
example, Gamma(100-1) is the cluster of Gammaproteobacteria with 99 organisms; the “outlier” Thidn actually finds its correct placement in the
Epsilon group. Outliers are given in smaller font. Given on the right are the phylum numbers in Bergey’s Outline Rel. 5 (Garrity et al., 2004). The
black dot denotes the trifurcation point of the main domains of life. Note that this is an unrooted tree and the branches are not to scale.
(2) In the genus tree representing 31 Archaea
species (Gao et al., 2007), Aeropyrum pernix from the
Desulfurococcales prevents the order Thermopro-
teales from forming a monophyletic branch; the
species Archaeoglobus fulgidus from the class Ar-
chaeoglobi prevents the class Methanomicrobia from
forming a monophyletic group. However, in our
newly produced CVTree for 47 Archaea and 569
Bacteria (unpublished) there are 6 more species in the
former group and 3 more in the latter group. Each of
the above-mentioned order and class now forms a
monophyletic group, and the whole Archaea branch of the
CVTree has reached full agreement with the TOBA
7.7 taxonomy. This is one of the examples of “the
more the better” mentioned above.
(3) The placement of Oceanobacillus was a
cross-phylum disagreement with older releases of
Bergey’s Outline (Garrity et al., 2002), where it was
listed under Proteobacteria. In all CVTrees it joins
other species of Class Bacilli of Phylum Firmicutes. It
was moved to Firmicutes in more recent releases of
the Outline (Garrity et al., 2003). Being already
consistent with Bergey’s Outline from 2004 on, this
case was not counted among the 42 differences, but is kept as
a historical record.
(4) In Outline Rel.5 (Garrity et al., 2004), the
genus Thiomicrospira contains the two species T.
crunogena and T. denitrificans. However, there was a
footnote on page 87: “The identity of T. denitrificans
is questionable as it belongs within the Epsilonpro-
teobacteria.” In our CVTrees T. denitrificans appears
within Epsilon group at all K. In TOBA 7.7 (Garrity et
al., 2007) T. denitrificans was renamed Sulfurimonas
denitrificans and put in Epsilon group of Proteobacte-
ria.
(5) In all CVTrees from K=3 to 6 the four organ-
isms Synechococcus sp. WH8102, sp. CC9605, sp.
CC9902, and sp. CC9311 form a stable monophyletic
branch which does not join other Synechococcus
species but falls into the Prochlorococcus cluster. In
our recent paper (Gao et al., 2007) we suggested that
these organisms should be ascribed to Prochlorococ-
cus. In TOBA 7.7 (Garrity et al., 2007) the only listed
strain of Synechococcus, sp. WH8102, indeed appears
under Prochlorococcus.
(6) In CVTrees at K=5 and 6, Pelodictyon luteo-
lum falls among the three species from genus
Chlorobium, preventing the latter from forming a
monophyletic group. It was suggested in Gao et al.
(2007) to move P. luteolum into genus Chlorobium.
Indeed, it is listed in TOBA 7.7 (Garrity et al., 2007) as
Chlorobium luteolum.
The examples above show the predictive power
of the CVTree approach. In fact, our
CVTrees hint at some further taxonomic
revisions. The efficiency of the CVTree Web Server is
being significantly improved to cope with situations
when 5,000 to 6,000 prokaryotic genomes will be-
come available in the upcoming few years according
to “Sequencing the Bergey’s Project” (2007).
3 Discussion
The CVTree approach is not meant to replace the
16S rRNA analysis, but to complement it. Being an
independent method, it supports the latter in an over-
whelming majority of cases and provides valuable
suggestions on taxonomic revisions. When 16S rRNA
analysis does not possess enough resolution, as in the
case of multiple strains of a species, the CVTree
method supplies additional useful information.
The use of complete genomes is both a merit and
a demerit of the CVTree approach. It is a merit as no
choice of genes is made and even lateral gene transfer
is taken into account to some extent. It is a demerit
because the number of available complete prokaryotic
genomes is always limited. However, with the pro-
gress of “Sequencing the Bergey’s Project” this
limitation will soon become less severe. With wide
taxonomic coverage of selected sequences it may
contribute to the establishment of a whole-genome
backbone for the prokaryotic branch of the Tree of
Life. We mention, in addition, that the CVTree method
has been tested on protein families and can yield
meaningful results (Wei et al., 2004).
It would be useful to see the CVTree method ap-
plied to such eukaryotes as fungi. Although the approach
has been successful, it is new and its foundation
is still being scrutinized; see, e.g., Shi et al.
(2007). For the time being the CVTree approach
yields only un-rooted trees with topology consistent
with taxonomy and the calibration of branch lengths
requires further study.
Acknowledgements The authors thank Dr. Yin-
Long QIU for carefully reading the manuscript and
making valuable suggestions. This research was
partially supported by the National Basic Research
Program of China (973 Program) (Grant No.
2007CB814800) and the Shanghai Leading Academic
Discipline Project (No. B111).
References
Bergey’s Manual Trust. 2001–2009. Bergey’s manual of
systematic bacteriology. 2nd ed. Vol. 1–5. New York:
Springer-Verlag.
Chu KH, Qi J (戚继), Yu Z-G (喻祖国), Ahn V. 2004. Origin
and phylogeny of chloroplasts revealed by a simple
correlation analysis of complete genomes. Molecular
Biology and Evolution 28: 70–76.
Felsenstein J. 1980–2008. PHYLIP (Phylogeny Inference
Package) version 3.5c [online]. Available from
evolution.genetics.washington.edu/phylip.html [accessed 23 January
2008].
Gao L (高雷), Qi J (戚继). 2007. Whole genome molecular
phylogeny of large dsDNA viruses using composition
vector method. BMC Evolutionary Biology 7: 41.
Gao L (高雷), Qi J (戚继), Sun J-D (孙健冬), Hao B-L (郝柏林).
2007. Prokaryote phylogeny meets taxonomy: An
exhaustive comparison of composition vector trees with
systematic bacteriology. Science in China Ser. C (中国科
学C辑) 50: 587–599.
Gao L (高雷), Qi J (戚继), Wei H-B (卫海滨), Sun Y-G (孙弈
钢), Hao B-L (郝柏林). 2003. Molecular phylogeny of
Coronaviruses including human SARS-CoV. Chinese
Science Bulletin (科学通报) 48: 1170–1174.
Garrity GM, Bell JA, Lilburn TG. 2003. Taxonomic Outline of
Prokaryotes. Bergey’s Manual of Systematic Bacteriology
[online]. 2nd ed. Rel. 4.0. Springer-Verlag.
doi: 10.1007/bergeysoutline200310. Available from
141.150.157.80/bergeysoutline/main.htm [accessed 23 January 2008].
Garrity GM, Bell JA, Lilburn TG. 2004. Taxonomic Outline of
Prokaryotes. Bergey’s Manual of Systematic Bacteriology
[online]. 2nd ed. Rel. 5.0. Springer-Verlag.
doi: 10.1007/bergeysoutline200405. Available from
141.150.157.80/bergeysoutline/main.htm [accessed 23 January 2008].
Garrity GM, Johnson KL, Bell J, Searles DB. 2002. Taxonomic
Outline of Prokaryotes. Bergey’s Manual of Systematic
Bacteriology [online]. 2nd ed. Rel. 3.0. Springer-Verlag.
doi: 10.1007/bergeysoutline200210. Available from
141.150.157.80/bergeysoutline/main.htm [accessed 23 January 2008].
Garrity GM, Lilburn TG, Cole JR, Harrison SH, Euzéby J,
Tindall BJ. 2007. Taxonomic Outline of Bacteria and
Archaea (TOBA). Rel 7.7, 6 March 2007. Michigan State
University [online]. Available from
www.taxonomicoutline.org [accessed 23 January 2008].
Hao B-L (郝柏林), Qi J (戚继). 2004. Prokaryote phylogeny
without sequence alignment: from avoidance signature to
composition distance. Journal of Bioinformatics and
Computational Biology 2: 1–19.
Hao B-L (郝柏林), Qi J (戚继), Wang B (王彬). 2003.
Prokaryote phylogeny based on complete genomes
without sequence alignment. Modern Physics Letters B 17:
91–94.
Kimura M. 1983. The neutral theory of molecular evolution.
Cambridge: Cambridge University Press.
Margulis L, Schwartz KV. 1998. Five Kingdoms: An illustrated
guide to the phyla of life on earth. 3rd ed. San Francisco:
W H Freeman.
Pruitt KD, Tatusova T, Maglott DR. 2007. NCBI reference
sequences (RefSeq): a curated non-redundant sequence
database of genomes, transcripts and proteins. Nucleic
Acids Research 35, Database Issue: D61–D65.
Qi J (戚继), Wang B (王彬), Hao B-L (郝柏林). 2004a. Whole
genome prokaryote phylogeny without sequence
alignment: a K-string composition approach. Journal of
Molecular Evolution 58: 1–11.
Qi J (戚继), Luo H (罗红), Hao B-L (郝柏林). 2004b. CVTree:
a phylogenetic tree reconstruction tool based on whole
genomes. Nucleic Acids Research 32, Web Server Issue:
W45–W47.
Sequencing the Bergey’s Project. 2007. Available from
www.sequencingbergeys.org [accessed 15 September
2007].
Shi X-L (史晓黎), Xie H-M (谢惠民), Zhang S-Y (张淑誉),
Hao B-L (郝柏林). 2007. Decomposition and
reconstruction of protein sequences: the problem of
uniqueness and factorizable language. Journal of Korean
Physical Society 50: 118–124.
Wei H-B (卫海滨), Qi J (戚继), Hao B-L (郝柏林). 2004.
Prokaryote phylogeny based on ribosomal proteins and
aminoacyl tRNA synthetases by using the composition
distance approach. Science in China Ser. C (中国科学C
辑) 47: 313–321.
Woese CR. 2000. Prokaryote systematics: the evolution of a
science. In: Balows A, Trupper HG, Dworkin M, Harder
W, Schleifer KH eds. The Prokaryotes. Vol. 3. New York:
Springer-Verlag.
Woese CR. 2002. On the evolution of cells. Proceedings of
the National Academy of Sciences USA 99: 8742–8747.
Woese CR, Fox GE. 1977. Phylogenetic structure of the
prokaryotic domain: the primary kingdoms. Proceedings
of the National Academy of Sciences USA 74: 5088–5090.