How Many Human Genes Are There?

Steven Salzberg

Horvitz Professor of Computer Science
Director, Center for Bioinformatics and Computational Biology
University of Maryland

About the Lecture

Ever since the discovery of the genetic code in the early 1960s, scientists have been working to identify all the genes in the human genome. At the same time, many scientists have been looking, thus far without much success, for genes that make humans special compared to other species. As these efforts have progressed and DNA sequencing technology improved, initial estimates of the number of human genes have been revised, and over time have steadily decreased. Many scientists expected this question to be finally answered by determining the complete DNA sequence of the human genome. But the information obtained at completion of the sequence in 2001 did not, as it turned out, allow scientists to identify all the genes or determine how many there are. Since then, estimates of the gene count have continued to fluctuate, both up and down. And comparisons of the sequence of the human genome with the genomes of other species seem to show that nothing about the human gene count is exceptional. In fact, while some simpler organisms have fewer genes, other “lower” species have considerably larger genomes and more genes than humans. This talk will review the history of efforts to identify human genes and explain the evidence for our own current best estimate of 22,333 genes.

About the Speaker

STEVEN SALZBERG is Horvitz Professor of Computer Science and Director, Center for Bioinformatics and Computational Biology, University of Maryland, College Park. He works on better ways of finding genes, assembling genomes, and understanding evolution. He has developed open source software for DNA sequence analysis that has been used globally by thousands of scientists. He has contributed to many seminal sequencing projects, including the Human Genome Project and the Bacillus anthracis sequencing project after the anthrax attacks in 2001. He also co-founded the Influenza Genome Sequencing Project, which sequenced thousands of viruses. His work has been featured in the national press including the New York Times, the Wall St. Journal, the Washington Post, and National Public Radio. Dr. Salzberg received his B.A., M.S., and M.Phil. degrees from Yale University, and his Ph.D. in Computer Science from Harvard University. He joined Johns Hopkins University as an Assistant Professor in 1989, and moved to The Institute for Genomic Research (TIGR) in 1997. He joined the Computer Science Department at the University of Maryland in 2005, where he now holds joint appointments in Genetics and Bioengineering. He has published over 175 scientific papers and two books. He is a Fellow of the American Association for the Advancement of Science and a past member of the Board of Scientific Counselors of the National Center for Biotechnology Information at NIH. For more information visit his website at http://cbcb.umd.edu/~salzberg or his science blog at http://genome.fieldofscience.com.

Minutes

President Robin Taylor called the 2,274th meeting to order at 8:21 pm on October 29, 2010 in the Powell Auditorium of the Cosmos Club. Ms. Taylor made some announcements and introduced six new members of the Society. The minutes of the 2,273rd meeting were read and approved.

Ms. Taylor then introduced the speaker of the evening, Mr. Steven Salzberg, Director of the Center for Bioinformatics and Computational Biology at the University of Maryland. Mr. Salzberg spoke on “How Many Human Genes Are There?”

DNA, Mr. Salzberg began, is the stuff of life. He showed a picture of some human cells and reminded us that in each of them there is a long strand of DNA, a molecule of deoxyribonucleic acid. The molecule contains all of the organism’s genes; it’s about 3 billion bases long.

Mr. Salzberg treats the genes as information. The 3 billion letters allow us to do everything we do and to pass on characteristics to offspring.

Genes differ by about 0.1% from one individual to another. Twenty years ago, it was assumed that humans have the most genes. We don’t. We are well ahead of fungi and bacteria, but some plants have more than we do, as do some amphibians. Some plants have slightly more than 10^11; some amphibians have just under that number. Loblolly pine trees, very simple trees, have 24 billion bases in their genes.

One of the first genes discovered, in 1840, was hemoglobin. Its structure was determined in 1959. Following that, in the “letters to nature” section of Nature, one F. Vogel detailed a set of observations and assumptions and announced a preliminary estimate of 6.7 million genes. The number of bases Mr. Vogel thought was typical of all genes was wrong and the assumption that everything in the DNA encodes proteins was also wrong. Still, it was a reasonable guess for the time.

He illustrated how genes are embedded in DNA strands by showing a large page of random letters. Embedded within them there was a small amount of meaningful text. That, unfortunately, is not how genes are. Large parts of the genes are not meaningful, that is, they do not determine the structure of proteins. Between the “start” codon and the “stop” codon, there are exons, which are meaningful, and introns, which are not. So before the gene can be identified, the start codon must be located, then the stop codon, and then the exons, the meaningful sequences, must be isolated from the introns, or meaningless sequences. The amount of signal in the string is very small.
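The start-to-stop search described above can be sketched in a few lines of Python. This is only a toy illustration, not a method presented in the talk: it assumes the standard start codon ATG and the standard stop codons TAA, TAG and TGA, scans only the forward strand, and ignores introns entirely, which is exactly why such a simple scan is not enough for human genes. The function name `find_orfs` and the example sequence are hypothetical.

```python
# Toy illustration only: real human genes contain introns, so a plain
# start-to-stop scan like this is at best a first step.
START = "ATG"                  # standard start codon (assumed)
STOPS = {"TAA", "TAG", "TGA"}  # standard stop codons (assumed)


def find_orfs(dna, min_codons=2):
    """Return sorted (start, end) index pairs of open reading frames.

    An open reading frame here begins at ATG and runs, codon by codon,
    to the first in-frame stop codon on the forward strand. `end` is
    exclusive and includes the stop codon. `min_codons` counts the
    codons before the stop, including the ATG itself.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == START:
                # Walk forward in this reading frame until a stop codon.
                for j in range(i + 3, len(dna) - 2, 3):
                    if dna[j:j + 3] in STOPS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3))
                        i = j  # resume scanning after this stop codon
                        break
            i += 3
    return sorted(orfs)


# Made-up example sequence with one short reading frame.
print(find_orfs("CCATGAAATTTTAGGG"))
```

On a real genome this kind of scan returns a very large number of candidates, most of them spurious, which is the low signal-to-noise problem the speaker described.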

Gene identification is a four-part process – ab initio gene finding, alignment of expressed sequence tags (ESTs) and full-length cDNA sequences, alignment of protein sequences to genomic DNA, and finally, combining the evidence.

Ab initio gene finding is a kind of statistical, logical code-breaking process. It begins by finding the start codons, which are ATG sequences. Near each of these, both before and after, there are sequences that are, with known probabilities, characteristic of genes. Theoretical models are programmed to score the likelihood of these being genes. These models are species-specific. The human model will work fairly well for humans, but poorly for plants.
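One simple way to picture this kind of scoring is a log-odds calculation over the bases next to a candidate start codon. The sketch below is an assumption-laden illustration, not the model used by the speaker's group or by any real gene finder: the probability table `SITE_MODEL` is invented, the background model assumes all four bases are equally likely, and only the three bases upstream of the ATG are scored.

```python
import math

# Hypothetical probabilities of each base at the three positions just
# before an ATG. These values are invented for illustration and are not
# measured from any genome.
SITE_MODEL = [
    {"A": 0.5, "C": 0.2, "G": 0.2, "T": 0.1},
    {"A": 0.2, "C": 0.5, "G": 0.2, "T": 0.1},
    {"A": 0.2, "C": 0.5, "G": 0.2, "T": 0.1},
]
BACKGROUND = 0.25  # assumption: bases are equally likely in non-gene DNA


def start_site_score(dna, atg_pos):
    """Log-odds score (base 2) of the bases upstream of an ATG.

    Returns None if there is no ATG at atg_pos or not enough upstream
    sequence. A positive score means the context looks more like the
    site model than like background; a negative score means the reverse.
    """
    width = len(SITE_MODEL)
    if atg_pos < width or dna[atg_pos:atg_pos + 3] != "ATG":
        return None
    upstream = dna[atg_pos - width:atg_pos]
    return sum(
        math.log2(SITE_MODEL[k][base] / BACKGROUND)
        for k, base in enumerate(upstream)
    )


print(start_site_score("ACCATGG", 3))  # context favored by the toy model
print(start_site_score("TTTATGG", 3))  # context disfavored by the toy model
```

A table like `SITE_MODEL` would have to be estimated separately for each organism, which is one way to see why a model built for humans performs poorly on plants.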

Expressed sequence tag (EST) alignment focuses on the meaningful parts, the exons, the parts that determine the kind of protein produced. By looking at the structure of the protein produced, they get an indication of which parts of the gene are exons.

Mr. Salzberg’s group combines evidence with a program called JIGSAW. There was a study published in 2006 in Genome Biology, a “bakeoff,” as he called it, to compare all the programs to identify genes using all four kinds of analysis, and JIGSAW was among the highest in specificity and accuracy. JIGSAW, which combines the output of a number of other algorithms, has a sensitivity of about 80%, meaning that about 80% of the exons were identified exactly right.
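For readers unfamiliar with the terms, the sketch below shows how exon-level sensitivity and specificity are commonly computed in gene-finder evaluations: a predicted exon counts as correct only if both of its boundaries match an annotated exon exactly. It is a minimal illustration with made-up coordinates, not the evaluation code used in the 2006 study, and the function name `exon_accuracy` is hypothetical.

```python
def exon_accuracy(true_exons, predicted_exons):
    """Return (sensitivity, specificity) at the exon level.

    Exons are (start, end) pairs. A prediction is correct only when both
    boundaries match an annotated exon exactly.
    sensitivity = exactly correct exons / annotated exons
    specificity = exactly correct exons / predicted exons
    (as "specificity" is commonly used in gene-finding evaluations).
    """
    true_set = set(true_exons)
    pred_set = set(predicted_exons)
    exact = true_set & pred_set
    sensitivity = len(exact) / len(true_set) if true_set else 0.0
    specificity = len(exact) / len(pred_set) if pred_set else 0.0
    return sensitivity, specificity


# Made-up coordinates: five annotated exons, six predictions.
annotated = [(100, 200), (300, 400), (500, 600), (700, 800), (900, 1000)]
predicted = [(100, 200), (300, 400), (500, 600), (700, 800),
             (905, 1000),    # one boundary off, so it does not count
             (1200, 1300)]   # no annotated counterpart at all

sens, spec = exon_accuracy(annotated, predicted)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this made-up case four of the five annotated exons are recovered exactly, giving a sensitivity of 80%, while the two wrong predictions pull specificity down to about 67%.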

Does that enable us to say how many genes there are? In the early 80's, people thought that once all the base pairs were identified, they would be able to identify genes exactly and would know how many there are. Not so. There are too many uncertainties in the methods of gene verification.

In the 1990's, estimates ranged from 50,000 to 100,000. By 2000, they had come down to somewhat below 50,000, although there was one estimate of 120,000. The lead of an article in Science in 2000 showed a gambling wheel and an invitation to “Place your bet.” One paper, in Nature in 2001, said 30,000 to 40,000 genes were likely but admitted a “large degree of uncertainty.”

Another paper in 2001 featured a list of authors at least a page long. Our speaker’s name appeared about a third of the way down, about 90 names after the lead name, J. Craig Venter. Your recording secretary is not sure the list did not continue on more pages, although the last name on the first page did begin with “Z.” In a blush of confidence, perhaps fortified by the strength of numbers, these folks put the number of genes at precisely 26,588.

In 2004, the International Human Genome Sequencing Consortium published a document announcing “Finishing the Euchromatic Sequence of the Human Genome.” Lacking the boldness, or the temerity, of the Venter group, they retreated from the five-significant-digit level of precision to about 1 ½ digits, putting the number at 20,000 to 25,000 genes.

Perhaps wishing things had been simpler, Mr. Salzberg noted that E. coli have genes that are easy to find. E. coli genes appear to number slightly more than 4,000, and people are confident that number is fairly close.

So, where are we now, he asked. We are between a chicken and a grape. Grapes seem to have about 30,434 genes, humans 22,333, and chickens about 16,736. And the estimates of the number still vary considerably, depending on methods and assumptions. However, he considers these about the best current estimates for these organisms.

There are also different definitions of genes. There are pseudogenes and redundant genes. Some people include RNA genes and others do not. He gave a list of several databases where people can get information on genes.

He showed a dramatic chart of how the number of estimated genes came down after 1990. Initial numbers were up around 100,000. In 2000, they ranged from 50,000 downward. Now they range from 19,000 to 22,000 or so.

But it’s not so simple. A new technique is revealing many new genes. It’s called RNA-seq. Machines process the full collection of RNA in a cell, which is called the transcriptome. The record is aligned back to the genome to produce a clear picture. One thing they’ve learned: there are new proteins they did not know about, because 90% or more of genes undergo alternative splicing.

Another surprise is the number of differences in the pan-genome. There are about 5 million different base-pairs between an Asian and an African human.

Lest all this might depress us, he ended with a positive suggestion. After all this, we can still eat both the chicken and the grape.

In the Q&A, one question was, how many genes are in the part of the chromosome that goes directly from mothers to children? “We don’t know,” he said.

Another person asked if the definition of humans gets vague as they sample DNA from past millennia. Yes, Mr. Salzberg said. They have done it from samples 35,000 to 40,000 years old. He noted that some of us have Neanderthal genes in us and others do not.

Are there any overlapping genes? No, at least not usually.

Someone suggested working on viruses, which are much simpler. They are, he admitted, very much simpler. A great deal of work is done on them; indeed, they were the first thing sequenced.

After the talk, Ms. Taylor presented to the speaker a plaque commemorating the occasion. She made the usual housekeeping announcements. She invited visitors to apply for membership. She announced some upcoming meetings. Finally, at 9:51 pm, she adjourned the 2,274th meeting to the social hour.

Attendance: 66
The weather: Cool, slightly cloudy
The temperature: 10°C
Respectfully submitted,

Ronald O. Hietala,
Recording secretary

The 2,274th Meeting of the Society

October 29, 2010 at 8:00 PM

Powell Auditorium at the Cosmos Club