The Proteome Challenge

John S. Garavelli

Bioinformatics Scientist

About the Lecture

Publications heralding the "completion" of the human genome on February 15, 2001, were destined to be regarded as marking an important event in human history. Political, legal and commercial considerations were ultimately more important in the timing of the simultaneous publications than the science. What had been announced as the completion of the human genome was the point at which computer analysis could assemble a reliably consistent overlapped mapping of the sequenced fragments of DNA with only a small number of gaps whose size and composition could be estimated. A lot of hard work remains to be done. The private human genome collaboration reported that 12,809 (41%) of the probable gene sequences could not be assigned a function based on similarity with any other sequences of known function. There are four major questions to be answered in molecular biology. Knowing the sequence of a gene, can we accurately and reliably predict the sequence of a protein? Knowing the sequence of a protein, can we reliably predict its structural conformations? Knowing the structure of a protein, can we predict the mechanism of its function? Knowing the functional mechanism of a protein, the timing and location of its expression, and the ensemble of molecules with which it interacts, can we predict its metabolic role? Meeting this challenge will require database development, bioinformatics tools, and computational resources to support proteomic research and protein function assignment.

About the Speaker

John S. Garavelli received a B.Sc. in chemistry from Duke University in 1969 and, after service at the Walter Reed Army Institute of Research in 1970 and 1971, earned a Ph.D. in biochemistry at Washington University, Saint Louis, in 1975. He did post-doctoral work at the Duke University Marine Laboratory and taught at the University of Delaware and Texas A&M University. He was a National Research Council Senior Research Fellow at the Extraterrestrial Research Division, NASA Ames Research Center. Since 1989 he has been a Senior Research Scientist at the National Biomedical Research Foundation, and he was Associate Director of the Protein Information Resource from 1997 to 2001. He has conducted research in biotechnology database operation and bioinformatics, computational chemistry, molecular evolution, information theory, and space biology.

Minutes

President Robert Collins called the 2140th meeting to order at 8:15 p.m. on January 25, 2002. Former President Ron Hietala read the minutes of the 2138th meeting for the Recording Secretary and they were approved. Mr. Collins introduced Mr. Garavelli, who needed no introduction. The first challenge, Mr. Garavelli said, is explaining what a proteome is. It is a neologism based on “genome”. A genome is an ensemble of all the genes of a cell. The proteome is the ensemble of all the proteins expressed in a particular cell under particular conditions and the functional roles of those proteins. Using a typical gene, Mr. Garavelli illustrated how to determine a protein sequence from a gene sequence. The DNA bases are represented as sequences of g's, c's, a's, and t's. The points where the transcription of DNA to RNA begins and where the translation of RNA to protein begins are identified. Three bases at a time are translated into one of twenty amino acids. At some points, the translation appears to stop in the middle of a letter. This is an “intron,” an apparently inserted piece of nucleotide sequence that is transcribed from the gene into RNA, and then removed from the RNA. This removal of introns and splicing of coding sections, or exons, is done by RNA enzymes in the nucleus. Two of the biggest surprises in molecular biology were the discovery of intron sequences and then, a few years later, the discovery that the enzymes that spliced the exons were not proteins, but other molecules of RNA. Interpreting a gene is analogous to interpreting a file on a computer disk. To read either, you must find the correct groupings: 8 bits on the computer disk and 3 bases of DNA in the nucleic acid. You must find the notations that tell where the file begins and ends. Just as files are stored in small sections on the disk that the operating system must organize, the cell must somehow splice together the exons to make messenger RNA. That briefly tells how genes are found and interpreted in a genome.
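The reading procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's method: the toy gene, the exon coordinates, and the helper names `splice` and `translate` are all assumptions made for the example; only the standard genetic code table is established fact. The sketch cuts out one intron, joins the two exons, and then reads the result three bases at a time.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def splice(gene, exons):
    """Join the exon segments of a gene, dropping the introns between them.

    `exons` is a list of (start, end) positions, 0-based and end-exclusive
    (a hypothetical convention chosen for this sketch).
    """
    return "".join(gene[start:end] for start, end in exons)


def translate(seq):
    """Translate from the first ATG, three bases at a time, until a stop codon."""
    seq = seq.upper()
    start = seq.find("ATG")
    if start < 0:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy gene (invented for illustration): the intron interrupts the second codon.
exon1 = "ATGGC"
intron = "GTAAGTTTTCAG"
exon2 = "CTGGTAA"
gene = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(gene))]

spliced = splice(gene, exons)
print(spliced)             # ATGGCCTGGTAA
print(translate(spliced))  # MAW
```

Reading the unspliced toy gene straight through gives a different, truncated result (`MA`), which is why the intron boundaries must be located before translation.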
The “completion” of the human genome, announced February 15, 2001, was heralded as an important event in human history. The political and commercial considerations were more important in the orchestration of the publications than the science. What was called the “completion” of the human genome was the point at which computer analysis produced a reliable overlapped map of the sequenced fragments of DNA with only a few gaps whose size and composition could be estimated. Many announcements will follow, not as genes are identified but as their functions are determined. The public may well become bored with them. Both collaborations were surprised by the small number of genes found, 30,000–35,000 in one case and about 39,000 in the other. Estimates had predicted 50,000 to over 140,000. Two years ago, few believed there were fewer than 40,000 genes. One possible reason for the overestimates is that predicting smaller numbers would not have generated greater funding. A second is that many of the high estimates were based on multiplying proteome sizes by the estimated number of cell types using the naïve assumption of “one gene, one protein,” not recognizing that, through alternative introns or splice sites, different proteins are made from the same gene. Thus, “one gene, many proteins”. Some think the missing genes are there and want to keep looking. William Haseltine of Human Genome Sciences [PSW Meeting 2038] told the Boston Globe on April 9, “We believe they have missed as many as two-thirds of the genes …” and even mentioned “sloppy science and sloppy conclusions.” The President of Double Twist has said he will prove there are more than 35,000. From the figures reported by the private group, at least 6.1% of a set of 4512 known human genes, or about 275 genes, would not have been detected. One reason is that there are some very short exons interspersed with extremely long introns; another is the occurrence of overlapping reading frames. Both of these masked some genes.
Now that it is known how they are hidden, they can be found relatively easily once the right methods are employed. Even with fewer than the expected number of genes, tens of thousands of the protein sequences have no known function. The private collaboration reported that 41% (12,809) of them could not be assigned a function. Upon entering the post-genomic era, we have four big questions in molecular biology. Knowing the sequence of a gene, can we predict the sequence of a protein? This is genomics, and as we have seen, it is essentially done; it is just not yet infallible. Knowing the sequence of a protein, can we reliably predict its structural conformations? This question has previously been discussed here by George Rose [PSW Meeting 2060] and by John Moult [PSW Meeting 2101]. Knowing the structure of a protein, can we predict the mechanism of its function? Knowing the functional mechanism of a protein, the timing and location of its expression, and the ensemble of molecules with which it interacts, can we predict its metabolic role? These last three questions are the challenges of proteomics. The typical questions addressed in experimental proteomics that can't be resolved by genome sequence are: What are the relative abundances of proteins in cells? What and where are the post-translational modifications? Where are proteins localized within cells? What is a protein's turnover rate? With which other molecules does a protein interact? Seven years ago, Mr. Garavelli began producing a database of post-translational modifications. That database, however, addresses only a small part of what needs to be done. With the general failure of public annotation schemes, there is no alternative to data mining in the post-genomic sequence era. Improvements in genomic, gene expression and proteomic databases and data mining methods will be critical for predicting protein function in processes such as metabolic pathways and regulation.
Mr. Garavelli described several bioinformatics tools or methods, including the hidden Markov model profile method (HMM), which is used for active site prediction, recognition and classification of protein structural domains. It can be a useful tool for predicting the function of protein sequences with no similarities detected by the standard comparison methods. Attempts to assign functional names by sequence similarity without taking into account the equivocation in functional names have led to “transitive identification catastrophe,” or “genome rot.” For more accurate identification, sequences should be classified on several scales: at the level of overall architecture, at the level of structural homology domains, and at the level of local motifs. In preparing annotations for sequence databases, it has become increasingly necessary to rely on automated methods. Annotation scripts were first introduced in 1995. In 1998, Mr. Garavelli developed automated procedures using scripts to prepare feature annotations for sequences once they have been classified or when a homology domain has been detected in a sequence. For quality control of the annotations, he also developed procedures that monitored entries that failed to receive expected annotations. Mr. Garavelli turned to the problem of predicting protein conformation. Since 1968, knowledgeable people have thought that proteins, especially those that spontaneously renature, contain within their amino acid sequences a code that makes them assume their native conformations. In 34 years of research, no one has deciphered a protein conformation code that predicts or determines the protein conformation from its amino acids. Although sequence homologies on the scale of twenty or more residues can be used to predict conformation, protein sequences with less than 20% sequence similarity manage to adopt the same conformation.
The difficulties in predicting from first principles the structural conformations of a protein sequence have discouraged neither academics nor entrepreneurs. Everyone seems certain the problem will be solved when enough computational power is harnessed. The remaining two problems, predicting function from structure and predicting metabolic role, are thought by most researchers to be solvable if information about expression, processing, and interaction can be collected and retrieved from well designed databases. But that is precisely where academic and commercial interests come to cross-purposes. It appears that the economic, social, and political issues are the more difficult ones, and the scientific issues and the wonderful prospects they promise must await their solution. Mr. Garavelli kindly answered questions from the floor. President Collins thanked Mr. Garavelli, a long-time member and former President of the Society. The President announced the next meeting and the parking rules, invited everyone to stay for refreshments, and adjourned the 2140th meeting to the social hour at 9:25 p.m.

Attendance: 54
Temperature: 9.7°C
Weather: clear
Links: http://home.earthlink.net/~jsgaravelli/

Respectfully submitted,
Ronald O. Hietala
For the Recording Secretary

The 2,140th Meeting of the Society

January 25, 2002
