The human genome is old news. Next stop: the human proteome

Researchers have announced the initial plans for an ambitious effort to begin mapping the complete human proteome: the set of all human proteins expressed in all of our cells at all points during our development and adult life.

This is a project of vastly greater magnitude and complexity than the sequencing of the human genome. Unlike the genome, which remains essentially static between cell types and over time, the proteome is tremendously dynamic, changing constantly in response to cell-cell signalling and environmental stimuli. Thus even though – with some small exceptions – every cell in your body carries the same genome, the proteome can be wildly different between different tissues and can change rapidly over time (the image on the left is the result of proteomic analysis of a single tissue, the human kidney; each spot represents one protein). In addition, the function of proteins can change depending on where they localise within the cell, and which other proteins are around for them to interact with.

The complete mapping of the human proteome would require analysing the expression, localisation and interactions of all proteins in human tissue samples from all tissues at all stages of development, and following exposure to all possible forms of environmental stimulus. That's completely impossible with current technology, so the architects of the human proteome project have drawn up a more realistic wish-list:

The plan is to tackle this with three different experimental approaches. One would use mass spectrometry to identify proteins and their quantities in tissue samples; another would generate antibodies to each protein and use these to show its location in tissues and cells; and the third would systematically identify, for each protein, which others it interacts with in protein complexes. The project would also involve a massive bioinformatics effort to ensure that the data could be pooled and accessed, and the production of shared reagents.

It's unclear exactly which tissue samples will be used for the first phase of the project, but it appears that this stage will rely heavily on pooling data from pre-existing studies. After that, the project may move on to a detailed analysis of the expression levels, cellular localisation and interaction partners of proteins encoded by genes on chromosome 21 (the smallest human chromosome); alternative suggestions include a comprehensive analysis of all of the proteins found in a specific cellular location such as the mitochondria or the cell membrane.

There are some daunting technical obstacles to overcome for this project to be successful. Given that the project will be carried out by multiple laboratories around the world, there needs to be a serious attempt at standardising the protocols used to extract and characterise proteins. The article notes that "results from the Human Plasma Proteome project and other proteomics efforts showed that different laboratories – and even the same lab – often identify very different sets of proteins from exactly the same sample".
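To see why that finding is so troubling, it helps to put a number on inter-lab agreement. A common way to do this is the Jaccard index over the two sets of identified proteins. The sketch below uses invented protein lists (the gene symbols are real plasma proteins, but the "lab results" are hypothetical) purely to illustrate the calculation:

```python
# Sketch: quantifying agreement between two labs' protein identifications
# from the same sample. The protein lists below are invented for illustration.

def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B|. 1.0 means identical sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

lab_1 = {"ALB", "APOA1", "TF", "HP", "FGA"}
lab_2 = {"ALB", "APOA1", "TF", "SERPINA1", "C3"}

print("shared:", sorted(lab_1 & lab_2))
print(f"Jaccard overlap: {jaccard(lab_1, lab_2):.2f}")  # 3 shared / 7 total ≈ 0.43
```

An overlap well below 1.0 between runs on the same sample is exactly the kind of result the Human Plasma Proteome project reported, and it is why standardised protocols are a prerequisite for pooling data across laboratories.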

The project will be complicated by the fact that many genes encode multiple different proteins, differing from one another in various regions, through a process known as alternative splicing. The proposed solution to that problem is to ignore it altogether:

[...] the group plans to focus on only a single protein produced from each gene, rather than its many forms. "We got rid of all this complexity," Bergeron says.

That may simplify the analysis, but it will also significantly reduce the power of the project. The single protein isoform selected by the project will not necessarily be the most important isoform produced by that gene (this is likely to differ substantially between different tissues). That means that the project will miss crucial information about the function of many of the proteins it analyses.
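The cost of that simplification can be made concrete with a small sketch. Collapsing each gene to one "reference" isoform (here, arbitrarily, the most abundant one overall) silently discards isoforms that dominate in other tissues. All gene names, isoform names and abundance values below are invented:

```python
# Sketch of the "one protein per gene" simplification and what it loses.
# All identifiers and abundances here are hypothetical.

isoforms = [
    # (gene, isoform, tissue, abundance)
    ("GENE_A", "GENE_A-201", "liver", 90),
    ("GENE_A", "GENE_A-202", "brain", 75),  # dominant isoform in brain
    ("GENE_B", "GENE_B-201", "liver", 40),
]

# Collapse to a single reference isoform per gene: highest abundance overall.
reference = {}
for gene, iso, tissue, abundance in isoforms:
    if gene not in reference or abundance > reference[gene][1]:
        reference[gene] = (iso, abundance)

kept = {iso for iso, _ in reference.values()}
missed = {iso for _, iso, _, _ in isoforms} - kept
print("kept:  ", sorted(kept))    # one isoform per gene
print("missed:", sorted(missed))  # tissue-specific isoforms the project ignores
```

In this toy example the brain-specific isoform of GENE_A is dropped entirely, even though it may be the functionally important form in that tissue.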

Actually, there are caveats of varying severity for nearly all of the currently available technologies for separating, identifying and characterising proteins. It's extremely difficult to develop methods that can accurately examine both low- and high-abundance proteins in a single run. Generating antibodies that reliably and specifically bind to each protein in the proteome will be a mammoth undertaking, and will be confounded by the alternative splicing issues mentioned above. High-throughput methods for detecting protein-protein interactions, while they have been used extensively (for instance in characterising the yeast protein interaction network), still suffer from a range of problems that can result in both false-positive and false-negative findings.
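One widely used way of dealing with the false-positive problem in high-throughput interaction data is to require that an interaction be detected by more than one independent method before trusting it. The sketch below illustrates that filtering idea on invented data (the protein names and method labels are hypothetical):

```python
# Sketch: filtering protein-protein interaction data by requiring support
# from at least two independent detection methods. Data are invented.

from collections import defaultdict

observations = [
    # (protein_a, protein_b, method)
    ("P1", "P2", "yeast_two_hybrid"),
    ("P1", "P2", "affinity_purification"),
    ("P1", "P3", "yeast_two_hybrid"),       # seen once: possible false positive
    ("P2", "P4", "affinity_purification"),  # seen once: possible false positive
]

support = defaultdict(set)
for a, b, method in observations:
    support[frozenset((a, b))].add(method)

high_confidence = [sorted(pair) for pair, methods in support.items()
                   if len(methods) >= 2]
print(high_confidence)  # only P1–P2 survives the filter
```

Of course, this trades false positives for false negatives: genuine interactions detected by only one method are thrown away along with the noise, which is precisely the trade-off that plagues current interaction datasets.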

However, these are largely technology-driven constraints. Similar negative arguments were thrown at the human genome project, and look how that turned out! If anything, it seems likely that a proteome project of this magnitude would provide strong incentives to overcome the technical hurdles and standardisation problems that currently plague proteomics in general.

As a useful side-effect, this project (or its successors) will provide information that will help in interpreting the results of whole-genome sequencing. As I've noted before, we still know so little about our own genome that it's likely that most of us will have complete genome sequences well before we really have the tools and understanding to decipher what that sequence actually means. In order to have any chance of figuring out what effects a rare variant in an unannotated gene might have on our health we will need to call on data from many different fields of biology.

At the very least, large-scale analysis of the human proteome should allow researchers to tentatively place many of our currently anonymous genes into functional pathways. That's a step forward for personal genomics: knowing that you have a loss-of-function mutation in a gene that may be involved in cholesterol biosynthesis is a lot more useful (in terms of guiding further clinical testing) than simply knowing that you have a mutation in hypothetical gene C11orf68.