Schedule

Genome Phylogenies and Dynamics

9:00 Eric Alm The Phylogenomic Landscapes of Genes are Shallow
9:30 Csaba Pal Evolution of minimal metabolic networks
10:00 Christos Ouzounis Quantifying vertical and horizontal gene flow across species and time

Comparative, Population, and Environmental Genomics

11:00 Manolis Kellis Computational Comparative Genomics: Genes, Regulation, Evolution
11:30 Dawn Field Large-scale Comparative Genomic Ranking of Taxonomically Restricted Genes (TRGs) in Bacterial and Archaeal Genomes using the “Quality Index for Predicted Proteins” (QIPP)
1:00 Edward Feil Weakened selection and shifting patterns of polymorphism in Shigella
1:30 Patrick Schloss Developing a statistical toolbox to describe and compare communities using metagenomic sequences

Elucidation of Gene Regulation in Microbes

2:00 Mikhail Gelfand Evolution of bacterial regulatory systems
2:30 Erik van Nimwegen Comparative genomic inference of bacterial regulatory systems
3:00 Richard Bonneau Learning global transcriptional dynamics with the Inferelator and cMonkey

Computational Tools and Databases for Microbial Analysis

4:00 Andrei Osterman Genomic Reconstruction and Experimental Assessment of Metabolic Subsystems in Bacteria
4:30 Natalia Maltsev Evolutionary analysis of genomes and gene networks from taxonomic and phenotypic perspective
5:00 Dmitrij Frishman Software tools for high-throughput genome annotation
5:30 Peter Karp The BioCyc Collection of 260+ Pathway/Genome Databases
6:00 Tatiana Tatusov Microbial Genome Resources at NCBI

Micro-Sig Abstracts

Section: Genome Phylogenies and Dynamics

The Phylogenomic Landscapes of Genes are Shallow

Speaker: Eric J. Alm

Lawrence David1 and Eric J. Alm1,2,3,4,5

Program in Computational and Systems Biology1 and Departments of Biological2 and Civil & Environmental3 Engineering, Massachusetts Institute of Technology, Cambridge, MA; The Virtual Institute of Microbial Stress and Survival4, Berkeley, CA; The Broad Institute of MIT and Harvard5, Cambridge, MA.

Gene trees are often discordant with species phylogenies, and this is especially true for bacterial genes. Whether this is the result of horizontal gene transfer, gene duplication and loss, or simply statistical uncertainty in phylogenetic reconstruction remains unclear. To investigate the origin of phylogenetic incongruence, we implemented an exact algorithm called AnGST (ANalysis of Gene and Species Trees) to infer maximum parsimony ‘phylogenomic’ reconstructions of the gene content of 60 prokaryotic genomes broadly distributed across the tree of life. In its simplest form, the algorithm computes the most parsimonious ‘phylogenomic’ reconciliation of gene and species trees (consisting of duplication, loss, and transfer events). A more sophisticated version allows for an ensemble of gene trees as input, which are combined to produce the optimal ‘hybrid’ reconciled tree giving the most parsimonious evolutionary reconstruction. We then tested the methods on their ability to reconstruct simulated evolutionary scenarios as well as on real data. Surprisingly, even the more sophisticated approach that considered uncertainty in gene trees generated a large number of evolutionary scenarios nearly equal in score, indicating that the ‘phylogenomic landscape’ of genes is relatively flat, with multiple distinct evolutionary scenarios equally compatible with observed trees. Our results suggest that the ‘phylogenomic method’ of tree reconciliation, considered the ‘gold standard’ for inferring gene histories, can lead to a high rate of false predictions even if trees are manually curated.
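
As an illustration of what reconciling a gene tree with a species tree involves, the sketch below counts the duplications implied by a gene tree under the classic least-common-ancestor (LCA) mapping. It is a deliberately simplified stand-in, not AnGST: it ignores transfers and losses, and the nested-tuple tree encoding and all function names are assumptions made for this example.

```python
def leaves(tree):
    """Set of species labels below a node (leaves are plain strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])

def clades(species_tree):
    """All clades of the species tree, each represented by its leaf set."""
    if isinstance(species_tree, str):
        return [leaves(species_tree)]
    return ([leaves(species_tree)]
            + clades(species_tree[0])
            + clades(species_tree[1]))

def lca(species_clades, species_set):
    """Smallest species-tree clade that contains every species in species_set."""
    return min((c for c in species_clades if species_set <= c), key=len)

def count_duplications(gene_tree, species_clades):
    """A gene-tree node is a duplication if it maps to the same clade as a child."""
    if isinstance(gene_tree, str):
        return 0
    left, right = gene_tree
    here = lca(species_clades, leaves(gene_tree))
    is_dup = here in (lca(species_clades, leaves(left)),
                      lca(species_clades, leaves(right)))
    return (int(is_dup)
            + count_duplications(left, species_clades)
            + count_duplications(right, species_clades))

# Hypothetical three-species example; gene-tree leaves carry species names.
species_clades = clades((("A", "B"), "C"))
congruent = (("A", "B"), "C")
discordant = (("A", "C"), "B")
```

Here the congruent gene tree needs no duplications, while the discordant one needs one duplication (plus losses) when transfers are disallowed. A single transfer could explain the same topology, which is the kind of competing scenario a full reconciliation method has to score.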

Evolution of minimal metabolic networks

Speaker: Csaba Pal

It is possible to infer aspects of an organism’s lifestyle from its gene content. Can the reverse also be done? Here we consider this issue by modelling evolution of the reduced genomes of endosymbiotic bacteria. The diversity of gene content in these bacteria may reflect both variation in selective forces and contingency-dependent loss of alternative pathways. Using an in silico representation of the metabolic network of Escherichia coli, we examine the role of contingency by repeatedly simulating the successive loss of genes while controlling for the environment. The minimal networks that result are variable in both gene content and number. Partially different metabolisms can thus evolve owing to contingency alone. The simulation outcomes do preserve a core metabolism, however, which is over-represented in strict intracellular bacteria. Moreover, differences between minimal networks based on lifestyle are predictable: by simulating their respective environmental conditions, we can model evolution of the gene content in Buchnera aphidicola and Wigglesworthia glossinidia with over 80% accuracy.
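
The successive-loss procedure can be sketched on a toy network. This is only an illustration under stated assumptions: the five reactions, the metabolite names, and the reachability-based viability test are hypothetical stand-ins, not the Escherichia coli model or the viability criterion used in the study.

```python
import random

# Hypothetical toy network: gene -> (substrates, products).
REACTIONS = {
    "g1": ({"glc"}, {"a"}),
    "g2": ({"a"}, {"biomass"}),
    "g3": ({"glc"}, {"b"}),
    "g4": ({"b"}, {"biomass"}),
    "g5": ({"a"}, {"b"}),
}
NUTRIENTS = {"glc"}   # the fixed environment
TARGET = "biomass"

def viable(genes):
    """True if TARGET can be produced from NUTRIENTS using only `genes`."""
    available = set(NUTRIENTS)
    changed = True
    while changed:
        changed = False
        for g in genes:
            substrates, products = REACTIONS[g]
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return TARGET in available

def minimal_network(seed):
    """Try to delete each gene once, in random order, keeping viable deletions."""
    rng = random.Random(seed)
    genes = set(REACTIONS)
    order = sorted(genes)
    rng.shuffle(order)
    for g in order:
        if viable(genes - {g}):
            genes.discard(g)
    return frozenset(genes)

# Repeat the simulation; different deletion orders give different outcomes.
outcomes = {minimal_network(seed) for seed in range(50)}
```

Even in this tiny example the environment is held constant, yet the surviving gene sets differ in both content and size, depending only on the order of loss.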

Quantifying vertical and horizontal gene flow across species and time

Speaker: Christos Ouzounis

Using large-scale comparative genomics with appropriate algorithms and databases, described briefly, we present a phylogeny of microbial species that takes into account both vertical and horizontal gene transfer (HGT) events. So far, all phylogenetic reconstructions have presented microbial relationships as trees rather than networks. This first attempt to reconstruct such an evolutionary network is termed the “net of life”. We use available reconstruction methods to infer vertical inheritance, and use an ancestral state inference algorithm to map HGT events on the tree. We demonstrate that vertical inheritance constitutes the bulk of gene transfer on the tree of life. We term the bulk of horizontal gene flow between species “vines”, and show that multiple, yet tiny, vines interconnect throughout the tree. These results strongly suggest that the HGT network is a scale-free graph, a finding with important implications for genome evolution. Finally, we propose that genes might propagate extremely rapidly across microbial species through the HGT network, using certain organisms as hubs. Open problems and possible future directions for research will also be discussed.

Further reading:

Ouzounis CA (2005) Ancestral state reconstructions for genomes. Curr. Opin. Genet. Devel. 15, 595-600.

Kunin V, Goldovsky L, Darzentas N, Ouzounis CA (2005) The net of life: Reconstructing the microbial phylogenetic network. Genome Research 15, 954-959.

Kunin V, Ouzounis CA (2003) The balance of driving forces during genome evolution in prokaryotes. Genome Research 13, 1589-1594.


Section: Comparative, Population, and Environmental Genomics

Computational Comparative Genomics: Genes, Regulation, Evolution

Speaker: Manolis Kellis

Comparative genomics of multiple related species has emerged as one of the most powerful and systematic ways to identify the functional elements encoded in any genome, by virtue of their conservation across millions of years of evolution. These elements are subtle and hidden in millions of non-functional nucleotides, requiring new techniques to decipher them and to understand their roles across diverse experimental and functional datasets.

In my group, we have used comparative genomics of multiple closely related species for the de-novo discovery of genes, regulatory motifs, microRNAs, gene targets, and enhancer elements. By studying the conservation properties of known functional elements, we define evolutionary signatures, specific to each type of functional element and dictated by the precise selective constraints that it evolves under. We have successfully applied such approaches to study multiple fungal, fly, and mammalian genomes, refining the annotation of existing elements and revealing hundreds of new functional elements. In addition, comparative genomics has enabled us to study the evolutionary dynamics of genomes, gene families, and regulatory motifs, and their role in the emergence of new gene functions.

This talk will cover our recent work analyzing 12 flies and 17 fungi. In particular, I will focus on new probabilistic methods for the accurate reconstruction of gene family evolution across complex phylogenies and complete genomes, and the identification of unusual gene structures suggesting new mechanisms of post-transcriptional regulation.

Large-scale Comparative Genomic Ranking of Taxonomically Restricted Genes (TRGs) in Bacterial and Archaeal Genomes using the “Quality Index for Predicted Proteins” (QIPP)

Speaker: Dawn Field

G.A. Wilson1, E.J. Feil2, A.K. Lilley1, and D. Field1

1 CEH-Oxford, Mansfield Road, Oxford, OX1 3SR, UK
2 Department of Biology and Biochemistry, University of Bath, Claverton Down, Bath, UK BA2 7AY

Lineage-specific, or taxonomically restricted genes (TRGs), especially those which are species and strain-specific, are of special interest because they are expected to play a role in defining exclusive ecological adaptations to particular niches. Despite this, they are relatively poorly studied and little understood, in large part because many are still orphans or only have homologues in very closely related isolates. This lack of homology confounds attempts to establish the likelihood that a hypothetical gene is expressed and, if so, to determine the putative function of the protein. We have developed “QIPP” (“Quality Index for Predicted Proteins”), an index that scores the ‘quality’ of a protein based on non-homology-based criteria. QIPP can be used to assign a value between zero and one to any protein based on comparing its features to other proteins in a given genome. We have used QIPP to rank the predicted proteins in the proteomes of Bacteria and Archaea. This ranking reveals that there is a large amount of variation in QIPP scores and identifies many high-scoring orphans as potentially ‘authentic’ (expressed) orphans. There are significant differences in the distributions of QIPP scores between orphan and non-orphan genes for many genomes and a trend for less well-conserved genes to have lower QIPP scores. The implication of this work is that QIPP scores can be used to further annotate predicted proteins with information that is independent of homology. Such information can be used to prioritize candidates for further analysis. Data generated for this study can be found in the OrphanMine at http://www.genomics.ceh.ac.uk/orphan_mine.

Weakened selection and shifting patterns of polymorphism in Shigella

Speaker: Edward Feil

Dept Biology and Biochemistry, University of Bath

Most de novo mutations are neither advantageous to the bacterium, nor excessively harmful, but are instead slightly deleterious. Most of these mutations will be removed from the population by purifying selection, but this process takes time. The efficiency of selection depends upon the selective coefficient, s, and the effective population size, Ne. In small populations, such as those corresponding to an ecological “island”, selection is relatively inefficient and deleterious mutations accumulate. Here I will discuss the implications of this process, and will present a comparative analysis of E. coli and Shigella to illustrate how examining polymorphism frequencies provides a far more sensitive test for the efficiency of selection than analyses of base composition.
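
The dependence of selection efficiency on s and Ne can be made concrete with a small calculation. The sketch below uses Kimura's diffusion approximation for the fixation probability of a new mutation in a haploid population, taking Ne equal to the census size; that formula and the numerical values are assumptions chosen for illustration and are not taken from the talk.

```python
import math

def fixation_probability(s, ne):
    """Approximate fixation probability of a new mutation (haploid, size ne).

    Uses u = (1 - exp(-2s)) / (1 - exp(-2*ne*s)); s < 0 means deleterious.
    """
    if s == 0:
        return 1.0 / ne              # neutral expectation
    exponent = -2.0 * ne * s
    if exponent > 700:               # exp() would overflow; probability is ~0
        return 0.0
    return math.expm1(-2.0 * s) / math.expm1(exponent)

def relative_to_neutral(s, ne):
    """Fixation probability as a multiple of the neutral value 1/ne."""
    return fixation_probability(s, ne) * ne

# Hypothetical slightly deleterious mutation in populations of different size.
s = -1e-4
ratios = {ne: relative_to_neutral(s, ne) for ne in (1e2, 1e4, 1e6)}
```

With these illustrative numbers the mutation behaves almost neutrally when Ne is 100 but is essentially never fixed when Ne is one million, which is the sense in which selection is inefficient in small populations.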

Developing a statistical toolbox to describe and compare communities using metagenomic sequences

Speaker: Patrick Schloss

University of Massachusetts
Amherst, MA 01003, USA

Metagenomic shotgun sequencing has made it possible to interrogate the genetic diversity of as-yet-uncultured microbes, allowing for the study of microbial ecology at the genomic level. Two common refrains from these studies are the general inability to assemble large genomic fragments and the inability to assign a function to most of the predicted open reading frames (ORFs). Both of these limit the ability to make important ecological inferences from metagenomic shotgun sequencing. Therefore, we have begun to develop a statistical toolbox that uses individual sequence reads and does not require an annotation to describe and compare microbial communities. We have successfully adapted tools developed for analysis of sequence collections generated from PCR-derived clone libraries to the collection of ORFs extracted from individual sequence reads. These tools enable us to estimate the richness of operationally defined protein families (OPFs) and the richness of these families that are shared between communities. They also allow us to compare the relative abundance of these families. Using these approaches we have been able to identify OPFs that dominate their community, suggesting their ecological importance; however, these OPFs lack homology to proteins with a known function. Also, we have been able to predict the number of OPFs that are shared between disparate communities. Advancing statistical tools for describing and comparing microbial communities is essential to the design of future studies and the analysis of the data that are generated.
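
To show what a richness estimate of this kind looks like in practice, the sketch below applies the bias-corrected Chao1 estimator to counts of ORFs per OPF and reports relative abundances. Chao1 is a widely used nonparametric estimator in clone-library analyses; its use here is an assumption for illustration, since the abstract does not name specific estimators, and the OPF labels and counts are made up.

```python
from collections import Counter

def chao1(counts):
    """Bias-corrected Chao1: S_obs + f1*(f1 - 1) / (2*(f2 + 1))."""
    counts = [c for c in counts if c > 0]
    s_obs = len(counts)                         # families actually observed
    f1 = sum(1 for c in counts if c == 1)       # families seen exactly once
    f2 = sum(1 for c in counts if c == 2)       # families seen exactly twice
    return s_obs + f1 * (f1 - 1) / (2.0 * (f2 + 1))

def relative_abundance(counts):
    """Fraction of all ORFs that fall into each family."""
    total = sum(counts.values())
    return {family: n / total for family, n in counts.items()}

# Hypothetical assignment of ORFs (one per read) to OPFs.
opf_assignments = (["opf_a"] * 10 + ["opf_b"] * 4 + ["opf_c"] * 2
                   + ["opf_d"] * 2 + ["opf_e", "opf_f", "opf_g"])
opf_counts = Counter(opf_assignments)           # 7 families, 21 ORFs

estimated_richness = chao1(opf_counts.values())
abundance = relative_abundance(opf_counts)
dominant_opf = max(abundance, key=abundance.get)
```

The estimate exceeds the observed number of families whenever singletons are present, reflecting families that were probably missed at the given sequencing depth.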


Section: Elucidation of Gene Regulation in Microbes

Evolution of bacterial regulatory systems

Speaker: Mikhail S. Gelfand

Institute for Information Transmission Problems, RAS
Bolshoi Karetny pereulok 19, Moscow 127994, Russia

When one has hundreds of bacterial genomes to compare, with dozens of genomes in some “popular” taxa (alpha- and gamma-proteobacteria, Firmicutes), it becomes possible to study the evolution of regulatory systems. These studies demonstrate remarkable flexibility of these systems, with rapid regulon expansion, changes in regulator specificity, merging of regulons following regulator loss, complete replacement of regulatory systems, etc. I believe it is somewhat premature to speak about a quantitative description of such events, but we can now compile an inventory of the main types of events. It is also possible to reconstruct the detailed evolutionary history of some regulatory systems. I will try to do that using three main examples: the LacI family of transcription factors, RNA-level T-box regulation in the Firmicutes, and transcription factors regulating iron homeostasis in alpha-proteobacteria.

Comparative genomic inference of bacterial regulatory systems

Speaker: Erik van Nimwegen

In this talk I will briefly discuss two recent projects. First, we have developed a Bayesian model for predicting two-component system interactions across all sequenced bacteria directly from the amino acid sequences of the kinases and receivers. Tests on predicting known cognate interactions and comparison of predictions with known orphan interactions in Caulobacter suggest that the method accurately reconstructs two-component signaling networks. Second, we have developed a method for quantifying selection at non-coding positions genome-wide (both intergenic and at silent positions) from multiple alignments of clades of related bacterial genomes. The method explicitly models evolution at non-coding positions taking into account the phylogenetic relations between the species, base composition biases, and codon biases. Using this method we find that there are strikingly universal characteristics of selection patterns across all sequenced clades with some interesting implications. For example, whereas silent sites evolve according to a neutral background model, intergenic regions show significant evidence of selection in all clades with consistently more selection upstream than downstream of genes. However, although the number of transcription factors grows approximately quadratically with the number of genes in bacterial genomes, our analysis strongly suggests that the number of regulatory elements per upstream region is the same in bacteria with large and small genomes. This has important implications for the structure of regulatory networks between large and small bacterial genomes. We also find a universal pattern of high adenosine frequency, significant selection at silent positions, and strong avoidance of RNA secondary structure in the area immediately around translation start sites in all clades. This suggests that selection for translation initiation efficiency shapes the sequence composition around translation start in all clades. In addition, this universal pattern may be used to improve accuracy of gene start annotation.

Learning global transcriptional dynamics with the Inferelator and cMonkey

Speaker: Richard Bonneau

Authors: David Reiss, Vesteinn Thorsson, Nitin Baliga, Aviv Madar, Peter Waltman, Thadeous Kacmarczyk, Richard Bonneau

Our system for network inference and modeling consists of two major components: cMonkey (a method for learning co-regulated biclusters and pathways) and the Inferelator (regulatory network inference). We describe their integration into a functioning system applied to several prokaryotic organisms.

cMonkey: We have developed an integrative biclustering algorithm, cMonkey, which groups genes and conditions into biclusters on the basis of 1) coherence in expression data across subsets of experimental conditions, 2) co-occurrence of putative cis-acting regulatory motifs in the regulatory regions of bicluster members and 3) the presence of highly connected sub-graphs in metabolic and functional association networks. We describe the algorithm and the results of extensive tests of several previously described methods, showing that cMonkey has several advantages in the context of regulatory network inference.

The Inferelator: We have described a network inference algorithm, the Inferelator, which infers regulatory influences for genes and/or gene clusters from mRNA and/or protein expression levels. The procedure can simultaneously model equilibrium and time-course expression levels, such that both kinetic and equilibrium expression levels may be predicted by the resulting models. Through the explicit inclusion of time, and gene-knockout information, the method is capable of learning causal relationships. It also includes a novel solution to the problem of encoding interactions between predictors. We discuss the results from an initial application of this method to the halophilic archaeon, Halobacterium NRC-1. We have found the network to be predictive of 130 newly collected microarray datasets and have also validated parts of the network using ChIP-chip. This network offers a means of deciphering how this organism maintains homeostasis and responds to wide varieties of metabolic, genetic and environmental states.

For background on our integrated system see:
http://genomebiology.com/2006/7/5/R36
http://www.biomedcentral.com/1471-2105/7/280
http://www.biomedcentral.com/1471-2105/7/176


Section: Computational Tools and Databases for Microbial Analysis

Genomic Reconstruction and Experimental Assessment of Metabolic Subsystems in Bacteria

Speaker: Andrei Osterman

Burnham Institute for Medical Research, CA,
Fellowship for Interpretation of Genomes, IL

Large-scale sequencing and comparative analysis of diverse genomes have strongly enhanced our ability to project the knowledge of genes and pathways from model organisms to others. This progress is particularly notable in the reconstruction of the key metabolic subsystems that are present, with some variations, in a variety of diverse microbial species. Capturing and analyzing functional variants of annotated subsystems projected across a large collection of integrated genomes is one of the goals of The SEED project (http://theseed.uchicago.edu/FIG/index.cgi). The analysis of functional and genomic context, most importantly conserved operons and regulons, allows us to efficiently address open problems, e.g. to propose the most likely gene candidates to fill in the gaps in pathways. Although genomics-based functional predictions are emerging on a massive scale, the actual impact of this effort is contingent on our ability to complement it by systematic experimental verification. I will outline a concept of an annotation-reconstruction-prediction-verification pipeline as an efficient and scalable approach to address this problem. The power of the approach will be illustrated by examples from our systematic survey of carbohydrate utilization pathways in environmental bacteria. This and related studies set the stage for genome-scale metabolic modeling of diverse microbial species and interpretation of the emerging metagenomic data.

Evolutionary analysis of genomes and gene networks from taxonomic and phenotypic perspective

Speaker: Natalia Maltsev

Mathematics and Computer Science Division, Argonne National Laboratory,
Computation Institute, the University of Chicago

Evolutionary analysis of diverse sets of organisms is essential for understanding mechanisms of their adaptation to environments. Common ancestry of eukaryotes, prokaryotes and archaeobacteria leads to similarity of many molecular functions. However, differences in organisms’ structural complexity, physiology, and lifestyle result in divergent evolution and emergence of variants of molecular function, metabolic organization, and phenotypic features. Recent progress in genomics, bioinformatics and physiological studies now allows for systematic exploration of adaptive mechanisms that led to diversification of biological systems. Such adaptive changes are usually not limited to one component of the system; on the contrary, in the process of adaptation, organisms undergo co-adaptive changes, such as the complementary changes of protein sequences to accommodate changes in an enzyme’s active site or co-evolution of properties of different steps in metabolic pathways.

We have developed computational systems (e.g. PUMA2 [PUMA], GNARE [GNARE]), tools (e.g. Chisel [Chisel]) and databases (e.g. Sentra [Sentra]) that allow comparative and evolutionary analysis of genomes, metabolic networks and signal transduction systems in the framework of phenotypic and taxonomic information. Such an organism-centric view allows exploring genomic data from a systems-level perspective. It facilitates tracing the evolutionary steps that underlie the emergence of entire biochemical pathways, rather than individual gene products, and documenting the evolutionary progression of organisms that house these pathways. It also allows identifying evolutionary patterns of organization characteristic of particular taxonomic groups and phenotypes.

The talk will describe the approach and its application to analysis of microbial genomes and metagenomes.

References:

[PUMA] Maltsev N, Glass E, Sulakhe D, Rodriguez A, Syed MH, Bompada T, Zhang Y, D’Souza M. PUMA2 – grid-based high-throughput analysis of genomes and metabolic pathways. Nucleic Acids Res. 2006 Jan 1;34(Database issue) :D369-372.

[GNARE] D. Sulakhe, M. D’Souza, M. Syed, A. Rodriguez, Y. Zhang, E. Glass, M. Romine, N. Maltsev. GNARE – A Grid-based Server for the Analysis of User Submitted Genomes. Accepted for publication in Nucleic Acids Res. 2007 (special issue). NAR-00335-Web-B-2007.R1

[Sentra] M. D’Souza, E. Glass, M. Syed, Yi Zhang, M. Galperin, N. Maltsev. Sentra – a database of Prokaryotic Signal Transduction. Nucleic Acids Res. 2007 Jan;35 (Database issue):D271-3. Epub 2006 Nov 29.

[Chisel] Chisel – http://compbio.mcs.anl.gov/CHISEL

Software tools for high-throughput genome annotation

Speaker: Dmitrij Frishman

Genomic sequences are being deciphered at an unprecedented pace, and the demand for sequence data is also continuously growing, fueled by thrilling potential applications which range from personalized genome-based medicine and targeted cancer therapies to microbial strain optimization and bioterrorism prevention. In my talk I will focus on modern computational methods and software infrastructure for analyzing protein function at genomic scale. The new version of the PEDANT genome analysis system and its integration with SIMAP - a precomputed similarity matrix for six million amino acid sequences - will be described. Another novel software system - PROMPT - is a comprehensive bioinformatics software environment which enables the user to compare arbitrary protein sequence sets, revealing statistically significant differences in their annotation features. I will also discuss possible ways to reduce the error level of automatically generated annotation by using machine intelligence.

The BioCyc Collection of 260+ Pathway/Genome Databases

Speaker: Peter Karp

Peter D. Karp, Pallavi Kaipa, Alexander Shearer, SRI International

The BioCyc [1] collection of 260+ Pathway/Genome Databases (PGDBs) is available at BioCyc.org. BioCyc includes the EcoCyc DB, which describes the metabolic and genetic regulatory network of Escherichia coli; and the MetaCyc DB, which describes more than 900 metabolic pathways that were experimentally elucidated in more than 900 organisms.

In addition, BioCyc contains PGDBs for most organisms with completely sequenced genomes. Each BioCyc PGDB combines the annotated genome sequence with predicted metabolic pathways, predictions of pathway hole fillers, and operon predictions. BioCyc is also available through SRI’s BioWarehouse relational database system, and as a downloadable Lisp executable.

Each BioCyc PGDB uses the same DB schema, facilitating comparisons among the DBs. We have developed a number of systems-level comparisons of pathway and genome information that will be presented. In addition, we will present several novel tools for analysis of omics datasets that can be applied to any BioCyc PGDB.

[1] “Expansion of the BioCyc collection of pathway/genome databases to 160 genomes,” Nucleic Acids Research 19:6083-89 2005.

Microbial Genome Resources at NCBI

Speaker: Tatiana Tatusov

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

An ever increasing number of complete and unfinished microbial genomes are being submitted to the public databases. Proper genome analysis is of paramount importance in understanding genomic biology and in annotating genome sequences. The National Center for Biotechnology Information (NCBI) provides a number of resources to analyze microbial genomes and proteomes. These include various databases housing the original data submissions, as well as explicit links connecting them. Pre-calculated results for various similarity searches such as sequence, domain, and structural, are stored to facilitate quick retrieval. Data from both the original submission and pre-calculated results are browsable and retrievable through the various web portals and displays available at NCBI and also through the FTP site or through the set of programs collectively known as E-Utilities. A set of interactive tools provides additional access to a number of pre-calculated similarity results that allow users to manipulate the results in a biologically meaningful way. All of these resources provide a rich set of tools for the analysis and understanding of genomic biology.
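
For readers unfamiliar with programmatic access, the sketch below builds an E-Utilities ESearch request URL. The base URL and the db, term, retmax and retmode parameters are those of the current public E-Utilities service; the choice of database and the example query are assumptions made for illustration and are not taken from the talk. The code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the public E-Utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def esearch_url(db, term, retmax=20):
    """Return an ESearch URL that would list record IDs matching `term` in `db`."""
    params = {
        "db": db,            # Entrez database to search
        "term": term,        # Entrez query string
        "retmax": retmax,    # maximum number of IDs to return
        "retmode": "json",   # ask for a JSON response
    }
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)

# Hypothetical query: nucleotide records for one organism.
url = esearch_url("nuccore", "Halobacterium[Organism]", retmax=5)
```

The resulting URL can be fetched with any HTTP client; the returned IDs can then be passed to other E-Utilities calls to retrieve the records themselves.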

Learning global transcriptional dynamics with the Inferelator and cMonkey

Speaker: Richard Bonneau

Authors: David Reiss, Vesteinn Thorsson, Nitin Baliga, Aviv Madar, Peter Waltman, Thadeous Kacmarczyk, Richard Bonneau

Our system for network inference and modeling consists of two major components: cMonkey (a method for learning co-regulated biclusters and pathways), the Inferelator (regulatory network inference). We describe their integration into a functioning integrated system applied to several prokaryotic organisms.

cMonkey: We have developed an integrative biclustering algorithm, cMonkey, which groups genes and conditions into biclusters on the basis of 1) coherence in expression data across subsets of experimental conditions, 2) co-occurrence of putative cis-acting regulatory motifs in the regulatory regions of bicluster members and 3) the presence of highly connected sub-graphs in metabolic and functional association networks. We describe the algorithm and the results of extensive tests of several previously described methods, showing that cMonkey has several advantages in the context of regulatory network inference.

The Inferelator: We have described a network inference algorithm, the Inferelator, which infers regulatory influences for genes and/or gene clusters from mRNA and/or protein expression levels. The procedure can simultaneously model equilibrium and time-course expression levels, such that both kinetic and equilibrium expression levels may be predicted by the resulting models. Through the explicit inclusion of time, and gene-knockout information, the method is capable of learning causal relationships. It also includes a novel solution to the problem of encoding interactions between predictors. We discuss the results from an initial application of this method to the halophilic archaeon, Halobacterium NRC-1. We have found the network to be predictive of 130 newly collected microarray datasets and have also validated parts of the network using ChIP-chip. This network offers a means of deciphering how this organism maintains homeostasis and responds to wide varieties of metabolic, genetic and environmental states.

For background on our integrated system, see:
http://genomebiology.com/2006/7/5/R36
http://www.biomedcentral.com/1471-2105/7/280
http://www.biomedcentral.com/1471-2105/7/176

Microbial Genome Resources at NCBI

Speaker: Tatiana Tatusov

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

An ever-increasing number of complete and unfinished microbial genomes are being submitted to the public databases. Proper genome analysis is of paramount importance in understanding genomic biology and in annotating genome sequences. The National Center for Biotechnology Information (NCBI) provides a number of resources to analyze microbial genomes and proteomes. These include various databases housing the original data submissions, as well as explicit links connecting them. Pre-calculated results for various similarity searches (sequence, domain, and structural) are stored to facilitate quick retrieval. Data from both the original submissions and the pre-calculated results can be browsed and retrieved through the various web portals and displays available at NCBI, through the FTP site, or through the set of programs collectively known as E-Utilities. A set of interactive tools provides additional access to a number of pre-calculated similarity results and allows users to manipulate the results in a biologically meaningful way. Together, these resources provide a rich set of tools for the analysis and understanding of genomic biology.
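As a minimal sketch of programmatic access through E-Utilities, the snippet below only builds an ESearch request URL and does not send it. The base URL and the `db`, `term`, `retmax` and `retmode` parameters reflect our understanding of the current public interface and should be checked against the NCBI documentation; the helper name `build_esearch_url` and the example query are hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL of the current public E-Utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch URL for an Entrez database.

    Hypothetical helper for illustration only.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Example query: nucleotide records for an organism name (illustrative).
url = build_esearch_url("nuccore", "Halobacterium[Organism]", retmax=5)
print(url)
```

The resulting URL can be fetched with any HTTP client; NCBI's usage guidelines on request rates apply.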

Evolutionary analysis of genomes and gene networks from taxonomic and phenotypic perspective

Speaker: Natalia Maltsev

Mathematics and Computer Science Division, Argonne National Laboratory

Computation Institute, the University of Chicago

Evolutionary analysis of diverse sets of organisms is essential for understanding the mechanisms by which they adapt to their environments. Common ancestry of eukaryotes, prokaryotes and archaeobacteria leads to similarity of many molecular functions. However, differences in organisms’ structural complexity, physiology, and lifestyle result in divergent evolution and the emergence of variants of molecular function, metabolic organization, and phenotypic features. Recent progress in genomics, bioinformatics and physiological studies now allows for systematic exploration of the adaptive mechanisms that led to the diversification of biological systems. Such adaptive changes are usually not limited to one component of the system. On the contrary, in the process of adaptation, organisms undergo co-adaptive changes, such as complementary changes in protein sequences to accommodate changes in an enzyme’s active site, or co-evolution of the properties of different steps in metabolic pathways.

We have developed computational systems (e.g. PUMA2 [PUMA], GNARE [GNARE]), tools (e.g. Chisel [Chisel]) and databases (e.g. Sentra [Sentra]) that allow comparative and evolutionary analysis of genomes, metabolic networks and signal transduction systems in the framework of phenotypic and taxonomic information. Such an organism-centric view allows genomic data to be explored from a systems-level perspective. It facilitates tracing the evolutionary steps that underlie the emergence of entire biochemical pathways, rather than individual gene products, and documenting the evolutionary progression of the organisms that house these pathways. It also allows the identification of evolutionary patterns of organization characteristic of particular taxonomic groups and phenotypes.

The talk will describe the approach and its application to the analysis of microbial genomes and metagenomes.

References:

[PUMA] Maltsev N, Glass E, Sulakhe D, Rodriguez A, Syed MH, Bompada T, Zhang Y, D’Souza M. PUMA2 – grid-based high-throughput analysis of genomes and metabolic pathways. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D369-372.

[GNARE] Sulakhe D, D’Souza M, Syed M, Rodriguez A, Zhang Y, Glass E, Romine M, Maltsev N. GNARE – a Grid-based Server for the Analysis of User Submitted Genomes. Accepted for publication in Nucleic Acids Res. 2007 (special issue). NAR-00335-Web-B-2007.R1

[Sentra] D’Souza M, Glass E, Syed M, Zhang Y, Galperin M, Maltsev N. Sentra – a database of Prokaryotic Signal Transduction. Nucleic Acids Res. 2007 Jan;35(Database issue):D271-3. Epub 2006 Nov 29.

[Chisel] Chisel – http://compbio.mcs.anl.gov/CHISEL

Comparative genomic inference of bacterial regulatory systems

Speaker: Erik van Nimwegen

In this talk I will briefly discuss two recent projects. First, we have developed a Bayesian model for predicting two-component system interactions across all sequenced bacteria directly from the amino acid sequences of the kinases and receivers. Tests on predicting known cognate interactions, and comparison of predictions with known orphan interactions in Caulobacter, suggest that the method accurately reconstructs two-component signaling networks. Second, we have developed a method for quantifying selection at non-coding positions genome-wide (both intergenic and at silent positions) from multiple alignments of clades of related bacterial genomes. The method explicitly models evolution at non-coding positions, taking into account the phylogenetic relations between the species, base composition biases, and codon biases. Using this method we find that there are strikingly universal characteristics of selection patterns across all sequenced clades, with some interesting implications. For example, whereas silent sites evolve according to a neutral background model, intergenic regions show significant evidence of selection in all clades, with consistently more selection upstream than downstream of genes. However, although the number of transcription factors grows approximately quadratically with the number of genes in bacterial genomes, our analysis strongly suggests that the number of regulatory elements per upstream region is the same in bacteria with large and small genomes. This has important implications for how the structure of regulatory networks differs between large and small bacterial genomes. We also find a universal pattern of high adenosine frequency, significant selection at silent positions, and strong avoidance of RNA secondary structure in the area immediately around translation start sites in all clades. This suggests that selection for translation initiation efficiency shapes the sequence composition around the translation start in all clades. In addition, this universal pattern may be used to improve the accuracy of gene start annotation.
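One way to read the scaling argument is with simple arithmetic. The constants a and c below are illustrative, and the sketch assumes roughly one upstream region per gene (ignoring operon structure), which the abstract does not state.

```latex
% G: number of genes; N_{TF}: number of transcription factors;
% a, c: illustrative proportionality constants (assumed).
N_{TF} \approx a\,G^{2}
\quad\Rightarrow\quad
\frac{N_{TF}}{G} \approx a\,G

% With a constant c regulatory elements per upstream region
% and about G upstream regions in total:
N_{sites} \approx c\,G
\quad\Rightarrow\quad
\frac{N_{sites}}{N_{TF}} \approx \frac{c}{a\,G}
```

Under these assumptions, doubling the genome size quadruples the number of transcription factors but only doubles the total number of regulatory elements, so the average number of elements per transcription factor would halve.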

Quantifying vertical and horizontal gene flow across species and time

Speaker: Christos Ouzounis

Using large-scale comparative genomics with appropriate algorithms and databases, described briefly, we present a phylogeny of microbial species that takes into account both vertical and horizontal gene transfer (HGT) events. So far, all phylogenetic reconstructions have presented microbial relationships as trees rather than networks. This first attempt to reconstruct such an evolutionary network is termed the “net of life”. We use available reconstruction methods to infer vertical inheritance, and use an ancestral state inference algorithm to map HGT events on the tree. We demonstrate that vertical inheritance constitutes the bulk of gene transfer on the tree of life. We term the bulk of horizontal gene flow between species “vines”, and show that multiple, yet tiny, vines interconnect throughout the tree. These results strongly suggest that the HGT network is a scale-free graph, a finding with important implications for genome evolution. Finally, we propose that genes might propagate extremely rapidly across microbial species through the HGT network, using certain organisms as hubs. Open problems and possible future directions for research will also be discussed.
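To make the terms "scale-free" and "hub" concrete, the sketch below computes node degrees and the degree distribution for a small edge list. The edges are invented purely for illustration and are not data from the study; in a scale-free graph the fraction of nodes with degree k falls off roughly as a power of k, so a few hubs carry most of the connections.

```python
from collections import Counter

# Hypothetical HGT edges between species labels (invented toy data).
toy_edges = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
    ("A", "F"), ("B", "C"), ("G", "A"), ("H", "A"),
]


def degrees(edges):
    """Count how many edges touch each node (undirected view)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def degree_distribution(deg):
    """Fraction of nodes having each degree k."""
    counts = Counter(deg.values())
    n_nodes = len(deg)
    return {k: c / n_nodes for k, c in sorted(counts.items())}


deg = degrees(toy_edges)
hub, hub_degree = deg.most_common(1)[0]  # most connected node
dist = degree_distribution(deg)
print(hub, hub_degree, dist)
```

In this toy graph most nodes have one or two connections while a single node touches nearly every edge, which is the qualitative pattern a hub-dominated network shows.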

Further reading:

Ouzounis CA (2005) Ancestral state reconstructions for genomes. Curr. Opin. Genet. Devel. 15, 595-600.

Kunin V, Goldovsky L, Darzentas N, Ouzounis CA (2005) The net of life: Reconstructing the microbial phylogenetic network. Genome Research 15, 954-959.

Kunin V, Ouzounis CA (2003) The balance of driving forces during genome evolution in prokaryotes. Genome Research 13, 1589-1594.

Weakened selection and shifting patterns of polymorphism in Shigella

Speaker: Edward Feil

Dept Biology and Biochemistry, University of Bath

Most de novo mutations are neither advantageous to the bacterium, nor excessively harmful, but are instead slightly deleterious. Most of these mutations will be removed from the population by purifying selection, but this process takes time. The efficiency of selection depends upon the selective coefficient, s, and the effective population size, Ne. In small populations, such as those corresponding to an ecological “island”, selection is relatively inefficient and deleterious mutations accumulate. Here I will discuss the implications of this process, and will present a comparative analysis of E. coli and Shigella to illustrate how examining polymorphism frequencies provides a far more sensitive test for the efficiency of selection than analyses of base composition.
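The dependence on s and Ne can be illustrated with Kimura's diffusion approximation for the fixation probability of a new mutation. This is a standard textbook result, not a method taken from the talk, and the sketch assumes a haploid population whose census size equals Ne; the function names and parameter values are illustrative.

```python
import math


def fixation_probability(s, n_e):
    """Kimura's diffusion approximation for a single new mutation.

    Assumes a haploid population with census size equal to n_e:
    u = (1 - exp(-2s)) / (1 - exp(-2 * n_e * s)).
    """
    if s == 0:
        return 1.0 / n_e  # neutral limit
    exponent = -2.0 * n_e * s
    if exponent > 700.0:
        # Strongly deleterious in a large population: effectively zero
        # (also avoids floating-point overflow in expm1).
        return 0.0
    return math.expm1(-2.0 * s) / math.expm1(exponent)


def relative_to_neutral(s, n_e):
    """Fixation probability as a multiple of the neutral value 1/n_e."""
    return fixation_probability(s, n_e) * n_e


# A slightly deleterious mutation (s = -0.001) in populations of
# increasing effective size.
for n_e in (100, 10_000, 1_000_000):
    print(n_e, relative_to_neutral(-0.001, n_e))
```

When the product of Ne and |s| is well below 1, the mutation fixes almost as often as a neutral one; when it is well above 1, fixation becomes vanishingly rare, which is why small "island" populations accumulate slightly deleterious changes.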