WHAT IS THE HUMAN GENOME AND WHY DO WE NEED TO SEQUENCE IT?
The Human Genome Project is one of the most ambitious and challenging quests ever undertaken by science. Its goal is to completely map and sequence all of the genetic material that makes us human. When it is done, we will have a new and profoundly powerful tool to help us unravel the mysteries of how the human body grows and functions.
The cells in our bodies each contain a master program which controls how and when they develop and how they should function. This information is organised in units called genes, which are arrayed one after the other along long polymers called chromosomes. We have 46 chromosomes, arranged in 23 pairs and kept in the nucleus of most cells. The chromosomes are made of deoxyribonucleic acid, or DNA. Chemically, DNA is one of the simplest molecules in the cell. It is composed of just four building blocks, or residues, strung together in enormously long strings. The residues combine to make our genes, and our genes string together to make our chromosomes.
The sequence of the building blocks is not random. It is inherited from our parents, who in turn inherited it from their parents. The sequence has been moulded by environmental influences over many aeons and directs our responses to the environmental stimuli we face today. To some extent our genome dictates our future. It may hold versions of genes that predispose us to certain illnesses, or conversely to good health and perhaps longevity. Even the basis of our personality may owe some debt to our genome.
GENES ENCODE PROTEINS, THE "WORKERS" OF THE CELL
The order of the residues in our genes encodes the ultimate structure of the proteins in a cell. It also controls when particular proteins will be produced. All of these control functions are themselves switched on or off by the action of yet more proteins. A knowledge of these proteins, their physical structure and how and when they are turned on or off can help biologists decipher many of the secrets of cell development and regulation.
This in turn would lead to a better understanding of how our bodies function normally and therefore a better comprehension of the processes that lead to disease. It is also thought, although the idea has not yet fully crystallised, that new biological tools will spring from a knowledge of the sequence of the human genome. These tools would rely heavily on computer manipulation of the huge amounts of data emanating from the Human Genome Project.
THE GENOME PROJECT EMPLOYS A "DIVIDE AND CONQUER" STRATEGY
The human genome comprises about three billion building blocks, or residues. This is a lot of information. If each residue were the equivalent of one byte of computer memory, the sequence of the genome from just one person would fill a respectably large hard disk.
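The storage estimate above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the one-byte-per-residue figure comes from the text, while the two-bits-per-residue packing is an added assumption, based on the fact that four possible residues can be told apart with two bits.

```python
# Rough storage estimate for one human genome.
GENOME_RESIDUES = 3_000_000_000  # "about three billion", as stated in the text

# Case 1: one byte per residue, as in the text.
bytes_one_per_residue = GENOME_RESIDUES * 1

# Case 2 (assumption, not from the text): four possible residues need
# only 2 bits each, so four residues fit into a single byte.
bytes_two_bits_per_residue = GENOME_RESIDUES * 2 // 8


def to_gigabytes(n_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 1_000_000_000


print("one byte per residue:", to_gigabytes(bytes_one_per_residue), "GB")
print("two bits per residue:", to_gigabytes(bytes_two_bits_per_residue), "GB")
```

Either way the result is of the order of gigabytes, which was indeed a respectably large disk in 1997.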
The sequence of the residues is not easy to work out. The order can only be deduced after millions of chemical reactions. The laboratory methods for doing this are very small scale and are still carried out with technology similar to that described by the English scientist Fred Sanger in the 1970s.
The ideal approach would be to start sequencing at the beginning of a chromosome and stop once we reached the other end. Unfortunately this is not possible. With existing technology, only strings of around 1000 residues can be sequenced at a time. This means that the final sequence must be assembled from millions of such smaller sequences. This task is akin to taking 10 copies of the complete Oxford English Dictionary, all 12 volumes, ripping each page into 300,000 minute pieces, placing all of the pieces into a large barrel, thoroughly mixing them and then trying to put all the pieces back together again.
This is near impossible. However, if we did it one page at a time, the task would be much easier. Different teams could also work on different pages at the same time, because the page numbers act as a type of scaffold which tells us where each page belongs.
A similar approach is being used to sequence the human genome. A scaffold has been built on which to place these millions of smaller sequences and to act as the template for breaking the huge task into many smaller tasks. This scaffold is the physical map of the chromosomes.
THE PHYSICAL MAP OF THE GENOME PROVIDES THE SCAFFOLD FOR THE SEQUENCING PROJECT
The process of building the scaffold for the human genome sequencing effort has almost been completed. This process is called physical mapping. It involves making large scale maps of landmarks that lie along the landscape of the chromosomal DNA. The landmarks that have been used are short pieces of DNA that have already been sequenced. These sequences are then used as tags for their chromosomal neighbourhood, a little like one would use the name of a town on a map.
The order of these tags relative to all other tags on a chromosome is then deduced by another series of biochemical tricks. These tricks involve smashing chromosomes into small pieces, finding out which of our landmarks belong to which chromosomal fragment and then trying to reassemble the whole into some semblance of its former self. During this process, the order of the landmarks can be deduced. The whole process is analogous to taking several copies of an RACV strip-map of the Hume Highway connecting Sydney and Melbourne, cutting this into many pieces and then trying to reconstruct the original map from the fragments.
The best way to do this would be to find pieces of map containing a given town and overlap these pieces. Let us take all pieces with Wangaratta, for example. Some of these pieces will also show Albury and others may well show Benalla. From this information it is possible to deduce the relative order of these three towns: Wangaratta must lie between Albury and Benalla. The orientation is not yet known, nor is their actual proximity to either Sydney or Melbourne. As this algorithm is repeated over and over again, the original strip map will be reconstructed and the order of all towns along the Hume Highway can be deduced.
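The town-ordering idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method used by mapping laboratories: each map piece is represented simply as the set of towns visible on it, the function names are hypothetical, and every possible order is tried by brute force, which is only feasible for a handful of towns. The pieces mentioning Gundagai and Seymour are added for illustration and do not come from the text.

```python
from itertools import permutations


def is_contiguous(order, piece):
    """True if all towns on one map piece sit next to each other in `order`."""
    positions = sorted(order.index(town) for town in piece)
    return positions[-1] - positions[0] == len(positions) - 1


def consistent_orders(pieces):
    """Return every town order in which each piece covers an unbroken stretch.

    An order and its mirror image cannot be told apart (the orientation
    is unknown), so only one of the two is reported.
    """
    towns = sorted(set().union(*pieces))
    orders = set()
    for order in permutations(towns):
        if all(is_contiguous(order, piece) for piece in pieces):
            orders.add(min(order, order[::-1]))
    return sorted(orders)


# Hypothetical map pieces; each set lists the towns visible on one piece.
pieces = [
    {"Albury", "Wangaratta"},
    {"Wangaratta", "Benalla"},
    {"Albury", "Wangaratta", "Benalla"},
    {"Gundagai", "Albury"},
    {"Benalla", "Seymour"},
]

for order in consistent_orders(pieces):
    print(" - ".join(order))
```

With these pieces only one order survives, with Wangaratta between Albury and Benalla. As in the text, the sketch cannot say which end of the list is nearer Sydney.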
The same approach is used to map our chromosomal landmarks. These maps then become the basis for the scaffolding onto which will be pinned all the small pieces of sequenced DNA. It is no longer a biological blind man's bluff, but an orderly progression of many labs performing small parts of this huge task and joining them together to form a whole based on a prior knowledge of the map of the human genome.
The way this works is analogous to our Oxford English Dictionary example. The counterpart of a dictionary page is a cloned, isolated fragment of genomic DNA. These fragments are some 100,000 residues in length, and their position in the genome is known because their location on the scaffold has been found. Each clone is then broken into even smaller pieces that are sequenced. These smaller pieces of sequence can then be assembled to deduce the sequence of the original 100,000 residue genomic fragment. The sequence of the chromosome is in turn deduced by overlapping the sequences from adjacent fragments.
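Assembling small sequenced pieces by their overlaps can be illustrated with a toy example. The sketch below is a simple greedy merge, chosen for brevity; it is an assumption made for illustration, not a description of the software used by the sequencing centres. It ignores sequencing errors, repeated sequences and the fact that DNA has two strands. The four residues are written with their conventional letters A, C, G and T, and the function names and example sequence are made up.

```python
def overlap(a, b, min_len=3):
    """Length of the longest end of `a` that matches the start of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two pieces that share the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; leave the pieces separate
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three short, made-up pieces, deliberately listed out of order.
pieces = ["CTGAAGTC", "ATGCGTAC", "GTACCTGA"]
print(greedy_assemble(pieces))
```

For these three pieces the merge recovers a single 16-residue sequence. With real data the pieces number in the thousands per clone, which is why the prior placement of each clone on the scaffold matters: it keeps each assembly problem small.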
PARTIAL SEQUENCING OF MANY GENES
Genes are the important functional units of our genome. There are somewhere between 60,000 and 100,000 genes in our genome; the actual number is not yet known, and finding it is one benefit to be gained from the sequence of the entire genome. While the sequence of the genome will allow biologists to identify most of the genes, many people have been unwilling to wait the 5-7 years that it will take to sequence all our DNA. They have taken a shortcut.
Genes do not make protein directly. First a copy is made in another nucleic acid called ribonucleic acid, or RNA. This RNA is called messenger RNA because it takes the information encoded by the gene from its home in the cell nucleus to the machinery that translates it into protein in the cell cytoplasm. Messenger RNA can be captured and tamed by molecular biologists. They have been doing it for decades. It is a process called "cloning" and does not involve sheep.
Each molecule of messenger RNA is first copied into DNA and then isolated in a bacterium, where it can be purified away from other molecules and amplified simply by letting the bacterium do what it does best: reproduce. These amplified, cloned copies of the messenger RNA can then be purified from the bacteria and their nucleotide sequence deduced. Part of the Human Genome Project has been doing this on a grand scale, and there are now several million partial RNA sequences available in databases.
There is a problem with these data: most have been produced by a small number of companies that see a profit in selling them, and these companies therefore do not release their data to the general biological community. This has been partly overcome by Merck, a large multinational drug company, which has been funding the Washington University sequencing group to replicate this work. That group has deposited some 600,000 partial messenger RNA sequences in public databases.
This is many times the number of actual genes, but the redundancy is deliberate. There is no real way of telling how many genes there are, and genes produce vastly different levels of messenger RNA in different cell types. The approach therefore sequences the most common RNAs many times so as to be certain of seeing some of the rarer RNAs a small number of times. These RNA sequences initially came from genes, so they are "tags" for genes. Once sequenced, they can also function as biochemical signposts of the genome and can be integrated into the physical maps like the landmarks described above. The end result is that we now have many of the known genes placed onto the "scaffold" of the genome. This is important for geneticists who are looking for the genes responsible for diseases.
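The reasoning behind sequencing far more messenger RNAs than there are genes can be made concrete with a small calculation. The sketch below assumes, as a simplification, that each sequenced clone is an independent random pick from the pool of messenger RNAs. The abundance figures are hypothetical and chosen only for illustration; the 600,000 figure is the number of public sequences mentioned above.

```python
def chance_of_seeing(abundance, n_reads):
    """Probability that an RNA making up the fraction `abundance` of the
    pool turns up at least once in `n_reads` independent random picks."""
    return 1.0 - (1.0 - abundance) ** n_reads


N_READS = 600_000  # partial sequences deposited in public databases

# Hypothetical abundances: the fraction of all messenger RNA molecules
# that comes from one particular gene.
examples = [
    ("common RNA", 1 / 1_000),
    ("rare RNA", 1 / 100_000),
    ("very rare RNA", 1 / 1_000_000),
]

for label, abundance in examples:
    expected_copies = abundance * N_READS
    probability = chance_of_seeing(abundance, N_READS)
    print(f"{label}: about {expected_copies:.1f} copies expected, "
          f"chance of at least one = {probability:.3f}")
```

Under these assumptions the common RNA is sequenced hundreds of times over, the rare one is almost certain to be seen, and the very rare one is seen a little less than half the time.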
WHAT WILL HAPPEN WHEN WE HAVE THE SEQUENCE OF THE HUMAN GENOME?
This is what we thought in 1997 when this article was written: The sequencing of the human genome should be complete within five or six years. When it is complete, the availability of this immense amount of knowledge will spawn new areas of biology. The interface between computer science, statistics and biology will need to be greatly enhanced to cope with this amount of data. With a list of all genes in the cell and the knowledge of when these are turned on and off, computer scientists can begin modelling biological processes inside our cells. Biologists will be able to use the information in new applications. At the moment there is a small glimpse of how this will happen. A company in the United States, Affymetrix, has designed a silicon chip that has DNA synthesised onto its surface. This DNA can represent many hundreds of thousands of genes and can be used by biologists to test which genes are turned on in any cell or tissue at a given time. Undoubtedly many more clever ideas will turn the information encoded within the genome into techniques that will tell us more about biology.
This article was written in 1997 by Dr. Simon, who was a Wellcome Senior Australian Science fellow at the Walter and Eliza Hall Institute of Medical Research.