The computer age has ushered in the discipline of bioinformatics, the effort to solve biological problems with the power of silicon. Using computers in biological studies has allowed researchers to perform, in far less time, tasks that might have taken weeks to complete in the wet lab. One application of bioinformatics that has thrived in the past five years is genome sequencing. Genome sequencing, at its most basic, entails determining the sequence of nucleotides in an organism’s DNA. Of course, the process is much more complicated than that, and can take years to complete even with the aid of computers. A library of the organism’s DNA must be created, the DNA must be sequenced, gaps in the sequence must be closed using both the laboratory and the computer, and finally the sequence must be annotated.
This last phase is crucial, for it reflects one of the more important reasons for genome sequencing. As any student of biology knows, the fundamental mechanism driving an organism’s existence is the protein. Each organism has a specific set of proteins determined by its genes, the coding parts of the organism’s DNA. Annotation is the process of identifying the location and function of genes within the genome sequence. Without annotation, we could only compute descriptive statistics and study interesting features of the nucleotide sequence, such as repeats. With annotation, we can begin to understand an organism (its reproductive habits, its metabolism, its environment, its place in the evolutionary tree) even without observing it. Two major methods of annotating a genome are stochastic models and comparative genomics. Stochastic models often take the form of Markov models, in which a probability model is trained on the sequences of known genes. The non-annotated sequence is then run through the model, and genes are identified probabilistically. However, this does not help identify a gene’s function, so we may turn to comparative genomics, which rests on the simple idea that sequences conserved among species tend to share the same function. The most popular form of comparative genomics is a BLAST search, in which the non-annotated sequence is queried against a database of annotated sequences to search for similarities.
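To make the Markov model idea concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions, not the method used by the project or by any particular gene finder: it trains a first-order Markov chain (each base depends only on the previous base) on known coding sequences and on background sequence, then scores a candidate region by its log-odds. The function names, the pseudocount, and the toy training sequences are all hypothetical; real gene finders typically use higher-order or hidden Markov models.

```python
import math

BASES = "ACGT"

def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | prev).

    The pseudocount (an assumed smoothing choice) keeps every
    transition probability above zero.
    """
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for seq in sequences:
        seq = seq.upper()
        for prev, nxt in zip(seq, seq[1:]):
            # Skip ambiguous bases such as N.
            if prev in counts and nxt in counts[prev]:
                counts[prev][nxt] += 1
    model = {}
    for prev, row in counts.items():
        total = sum(row.values())
        model[prev] = {nxt: c / total for nxt, c in row.items()}
    return model

def log_odds(seq, coding, background):
    """Sum of log(P_coding / P_background) over adjacent base pairs.

    A positive score means the sequence looks more like the coding
    training set than like the background training set.
    """
    seq = seq.upper()
    score = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        if prev in coding and nxt in coding[prev]:
            score += math.log(coding[prev][nxt] / background[prev][nxt])
    return score

# Toy training data, chosen only to make the two models differ.
coding_model = train_markov(["GCGCGCGCGC"])
background_model = train_markov(["ATATATATAT"])
candidate_score = log_odds("GCGCGC", coding_model, background_model)
```

In practice the coding model would be trained on genes already known in related organisms, and a score threshold would decide which regions are reported as candidate genes.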
I have been working with the Streptococcus sanguinis genome sequencing project here at VCU. The project has produced a single, uninterrupted sequence that represents the entire genome of S. sanguinis. Furthermore, bioinformaticians on the project have run a preliminary annotation to identify the location and function of genes. The project is therefore nearing completion of its goals and publication of its results. However, some work remains. I will be helping with some or all of the following activities.
First, the annotation must be double-checked manually. Although computers allow us to do things unimaginable a couple of decades ago, the possibility of error remains (the algorithms and programs run for annotation are still written by humans, after all). Furthermore, comparative genomics is a process just as dependent on probability as Markov models, though in a different capacity. Therefore, members of the project will perform what has been termed ‘manual curation’ of the genome. Each of us will be assigned certain regions of the sequence and will go through the annotations there by hand, looking for any ‘interesting’ elements, such as inconsistencies between annotations, annotations in places where there could not be genes, missing annotations where a start codon suggests a gene, and the like.
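Some of these checks can be partly automated before a human looks at a region. The sketch below is a hypothetical example rather than the project's actual curation tooling. It assumes forward-strand genes given as 0-based, half-open coordinates, and it assumes the start codons ATG, GTG and TTG and the stop codons TAA, TAG and TGA of the standard bacterial genetic code. The function name check_annotation is made up for illustration.

```python
START_CODONS = {"ATG", "GTG", "TTG"}   # assumed bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_annotation(genome, start, end):
    """Return a list of problems for a forward-strand gene at genome[start:end].

    An empty list means the annotated region at least looks like an
    open reading frame; it does not prove the region is a real gene.
    """
    gene = genome[start:end].upper()
    problems = []
    if len(gene) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [gene[i:i + 3] for i in range(0, len(gene) - 2, 3)]
    if not codons or codons[0] not in START_CODONS:
        problems.append("does not begin with a start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        problems.append("contains an in-frame internal stop codon")
    return problems

# Toy example: a three-codon gene embedded in flanking sequence.
toy_genome = "CCATGAAATGACC"
toy_problems = check_annotation(toy_genome, 2, 11)
```

Annotations that return a non-empty list would be natural candidates to flag for manual review. Genes on the reverse strand would need to be reverse-complemented first, which this sketch does not do.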
Second, I will examine the GC content and codon usage of the genome. Most genome studies report the GC content and codon usage for the entire genome, but I will be looking at segments. For example, I will look at 1000 bp subsequences of the genome and note whether any differ significantly from the rest. The reasoning is as follows: if one part of a genome has a GC content significantly different from that of the rest of the genome, that part may contain something worth examining and reporting. The same goes for codon usage: if the codon usage for several amino acids in a subsequence differs from the genome-wide norm, it may be worth examining what is going on there.
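As a rough sketch of how this segment-wise analysis might be set up in Python, the code below computes GC content over consecutive, non-overlapping windows and flags windows whose GC content lies far from the mean of all windows. The z-score cutoff of 2.0, the non-overlapping window scheme, and all function names are assumptions made for illustration; a proper analysis would need a more careful significance test. The codon counter at the end assumes its input is already an in-frame coding sequence.

```python
from collections import Counter
from statistics import mean, pstdev

def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases in seq."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted

def windowed_gc(genome, window=1000):
    """List of (start, gc_fraction) for consecutive non-overlapping windows.

    A trailing partial window shorter than `window` is dropped.
    """
    return [(start, gc_fraction(genome[start:start + window]))
            for start in range(0, len(genome) - window + 1, window)]

def unusual_windows(genome, window=1000, z_cutoff=2.0):
    """Windows whose GC fraction is at least z_cutoff standard deviations
    from the mean over all windows (an assumed, simple criterion)."""
    values = windowed_gc(genome, window)
    if not values:
        return []
    fractions = [gc for _, gc in values]
    mu = mean(fractions)
    sigma = pstdev(fractions)
    if sigma == 0:
        return []
    return [(start, gc) for start, gc in values
            if abs(gc - mu) / sigma >= z_cutoff]

def codon_counts(coding_seq):
    """Count codons in a sequence assumed to be in frame."""
    coding_seq = coding_seq.upper()
    return Counter(coding_seq[i:i + 3]
                   for i in range(0, len(coding_seq) - 2, 3))

# Toy genome: nine 10 bp windows at 40% GC followed by one all-GC window.
toy = "ACGTACGTAT" * 9 + "GGGGCCCCGG"
flagged = unusual_windows(toy, window=10)
```

Codon usage within a flagged window could then be compared with genome-wide codon counts, restricted to annotated genes so that the reading frame is known.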
Finally, we must find and describe several gene and protein processes that are unique to S. sanguinis, in effect making the study of the organism worth the time. Once manual curation is complete, this entails determining the organism’s gene and protein networks and pathways and comparing them with those of other organisms. Any strange or novel processes found in S. sanguinis will be worth reporting and publishing.