Modelling interaction sites in protein domains with interaction profile hidden Markov models

Torben Friedrich 1, Birgit Pils 1,2, Thomas Dandekar 1, Jörg Schultz 1 and Tobias Müller 1,*

1 Bioinformatik, Biozentrum, Am Hubland, Universität Würzburg, 97074 Würzburg, Germany
2 Present: Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK

*To whom correspondence should be addressed.


    ABSTRACT
 

Motivation: Due to the growing number of completely sequenced genomes, the functional annotation of proteins is becoming increasingly important. Here, we describe a method for predicting the sites within protein domains that take part in protein–ligand interactions. As recently demonstrated, these sites are not trivial to detect because the conservation of their location and type varies within a domain family.

Results: The developed method for the prediction of protein–ligand interaction sites is based on a newly defined interaction profile hidden Markov model (ipHMM) topology that takes structural and sequence data into account. It performs a homology search via a posterior decoding algorithm that yields probabilities for interacting sequence positions, and it inherits the efficiency and power of the profile hidden Markov model (pHMM) methodology. The algorithm enhances the quality of interaction site predictions and, as already demonstrated for pHMMs, is suitable for large-scale studies.

Availability: The MATLAB-files are available on request from the first author.

Contact: tobias.mueller@biozentrum.uni-wuerzburg.de

Supplementary information: http://domains.bioapps.biozentrum.uni-wuerzburg.de/


    INTRODUCTION
 
Sequence databases are growing at a steadily increasing pace. Most of these sequences are generated in large-scale sequencing projects. As the experimental characterization of a protein is time-consuming, the gap between uncharacterized and characterized protein sequences keeps widening. This is reflected, for example, in the size difference between TrEMBL (Wu et al., 2006), a database of translated DNA sequences, and Swiss-Prot (Boeckmann et al., 2003), which contains manually curated entries. Whereas the former contains more than 2.2 M entries (release 10/10/2005), the latter holds only 195 589 sequences (release 10/10/2005). This discrepancy underlines the importance of tools for the automated functional annotation of proteins.

In recent years it has become clear, not least through several large-scale projects, that a major aspect of the function of a protein is its interaction with other proteins. Identifying these partners places a protein in its cellular context and gives insight into higher-level function. Still, these data do not provide any details about the type of interaction or about the regions of the protein that are important for it. To address this problem, three-dimensional reconstruction of protein complexes was performed (Aloy et al., 2004). This approach does reveal many details of the structural basis of an interaction, but it may be too time-consuming and elaborate for large-scale applications. A compromise is to predict the regions of a protein that are involved in interactions. Accordingly, different methods have been developed to analyse and predict residue patches involved in protein binding. For all of these tools, the Protein Data Bank (PDB) (Deshpande et al., 2005) is the standard source of verified structural information on proteins and protein–ligand complexes.

Three main strategies have been followed to approach the detailed analysis of binding interfaces and their interaction sites. For a large number of proteins, no data on binding interfaces are available. Features of binding sites such as the accessible surface area, the hydrophobicity or the interface residue propensity were inferred from resolved protein–ligand complexes (Jones and Thornton, 1997) and transferred to predictions for new structures via SVM (Bradford and Westhead, 2005; Koike and Takagi, 2004; Chung et al., 2006), neural networks (Zhou and Shan, 2001; Fariselli et al., 2002) and by homology using FastA and further tools (Hendlich et al., 2003; Milburn et al., 1998). Although these approaches are useful for transferring knowledge of binding interfaces to protein structures, their application is restricted to the small number of proteins with known structure.

A combination of sequence and structure information provides an indication of evolutionary distance of functional sites. The evolutionary trace (ET) method searches for a structural cluster of conserved residues in a protein within a set of homologous sequences (Lichtarge et al., 1996). All tools described above are restricted to work with protein structures as input.

As the number of unidentified and uncharacterized protein sequences is growing very fast, the need for tools that annotate them automatically on the basis of existing knowledge is obvious. Ofran and Rost (2003) trained a neural network for the assignment of interaction sites in protein sequences for which no structure information is available. This approach performed well only in detecting interactions with strong evidence. Recently, a profile-based heuristic method for the localization of binding patches for small molecules was published (Snyder et al., 2006). It transfers annotated binding interfaces of small molecules from PDB entries to a query sequence and ranks them by calculating a ligand score for the binding patch. In a first step, domains of the query are detected with RPS-Blast. Though this approach might give reasonable results for small ligands, it will probably have difficulty with more variable interfaces such as those for peptide or nucleotide ligands.

A challenge for the prediction of interaction sites arises from the fact that, even within one protein or domain family, the position and the type of these sites can vary, as highlighted for example by the sterile α motif (SAM) domain. This domain is known to form homotypic and heterotypic oligomers (Thanos et al., 1999; Schultz et al., 1997). Other studies reported SAM-mediated protein–protein interactions such as the interaction between the ELK and the Grb10/2 proteins (Schultz et al., 1997). In recent publications SAM was described to bind RNA (Edwards et al., 2005), and the domain is even thought to be involved in binding of p73 to lipid membranes (Barrera et al., 2003). As described by Kim and Bowie (2003) for oligomerization and RNA-binding, these interaction partners bind to different interfaces on the surface of the SAM domains. A recent large-scale analysis of structurally characterized protein domains showed that the variability exhibited by the SAM domain is the rule rather than the exception. Within most of the analysed domain families, neither the position nor the type of amino acids involved in an interaction was conserved (Pils et al., 2005). Obviously, this variability will hinder any straightforward prediction approach that simply transfers interaction sites of one family member to all other family members.

To address this challenge, we have adapted the statistical approach of hidden Markov models (HMMs) to learn the patterns of functional sites in homologous sequences. The theory of HMMs goes back to the 1960s and has been established in several applications of computational biology since the early 1990s. Examples are programs like GENSCAN for the detection of coding regions in DNA sequences (Krogh et al., 1994a), TMHMM for predicting transmembrane areas in protein sequences (Sonnhammer et al., 1998), and HMMer (Eddy, 1998) and the Sequence Alignment and Modelling System (SAM) (Hughey and Krogh, 1996) for assigning a protein sequence to protein or domain families by homology. The underlying profile hidden Markov model (pHMM) of the latter two solutions enables a probabilistic representation of a protein or domain family, respectively. The databases SMART (Letunic et al., 2004; Schultz et al., 1998), Pfam (Bateman et al., 2004) and TIGRFAM (Haft et al., 2003) are sources of this kind of HMM accessible via the Internet. SMART is a database of pHMMs of signalling, extracellular and chromatin-associated domains, while Pfam and TIGRFAM contain pHMMs from all types of domain families. A general introduction to HMMs was written by Rabiner (1989), and Durbin et al. (1998) described their application to biological tasks.

Here, we describe a novel type of pHMMs integrating sequence and function information. One of its main features is the fully probabilistic detection of domains and interaction sites in proteins.


    SYSTEM AND METHODS
 
Hidden Markov models
The developed method for the prediction of protein–ligand interaction sites is based on the well-established HMM theory (Rabiner, 1989; Durbin et al., 1998). An HMM is a probabilistic network of nodes $Q = \{q_1, \dots, q_N\}$, so-called states. One state $q_i$ is connected to another state $q_j$ by a transition probability $\tau_{ij}$. Non-silent states are able to emit symbols from an alphabet $\Sigma$. A special topology of HMMs, termed pHMM, is frequently used in homology detection of protein families (Krogh et al., 1994b; Eddy, 1998). The applied topology is an extended version of the HMMer plan-7 architecture (Eddy, 2003). Transition and emission probabilities were estimated by a maximum likelihood approach. This estimation procedure was combined with regularization by pseudocounts to adapt the method to predictions of remotely related sequences. A standard algorithm to obtain site- and path-dependent probabilities for every hidden state is posterior decoding. It calculates state-specific posterior probabilities as the product of the summed probability of all paths leading to a state at a certain sequence position (forward) and of all paths continuing from it (backward), divided by the total probability of all paths.
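
As an illustration of posterior decoding, the following minimal Python sketch runs the forward and backward recursions on a generic two-state HMM. It is not the MATLAB implementation used in this work: the function name `forward_backward`, the state order and all toy probabilities are assumptions chosen for illustration, and no scaling against numerical underflow is applied, so it is only suitable for short sequences.

```python
def forward_backward(init, trans, emit, obs):
    """Posterior probabilities P(state at t = k | obs) for a discrete HMM.

    init[k]     : probability of starting in state k
    trans[k][l] : transition probability from state k to state l
    emit[k][s]  : probability that state k emits symbol s
    obs         : sequence of observed symbols
    """
    n, T = len(init), len(obs)
    # Forward values: f[t][k] = P(obs[0..t], state at t = k)
    f = [[0.0] * n for _ in range(T)]
    for k in range(n):
        f[0][k] = init[k] * emit[k][obs[0]]
    for t in range(1, T):
        for l in range(n):
            f[t][l] = emit[l][obs[t]] * sum(f[t - 1][k] * trans[k][l] for k in range(n))
    # Backward values: b[t][k] = P(obs[t+1..] | state at t = k)
    b = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for k in range(n):
            b[t][k] = sum(trans[k][l] * emit[l][obs[t + 1]] * b[t + 1][l] for l in range(n))
    total = sum(f[T - 1])  # probability of the sequence summed over all paths
    return [[f[t][k] * b[t][k] / total for k in range(n)] for t in range(T)]


# Toy model: state 0 prefers symbol "A", state 1 prefers symbol "G" (hypothetical values).
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [{"A": 0.9, "G": 0.1}, {"A": 0.1, "G": 0.9}]
posterior = forward_backward(init, trans, emit, list("AAGG"))
for t, row in enumerate(posterior):
    print(t, [round(p, 3) for p in row])
```

Each row of `posterior` sums to one, which is a convenient sanity check when the recursions are adapted to a different topology.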

Protein interaction data
As described in more detail by Pils et al. (2005), a HMMer search of all PDB sequences (October 2004 version, 27 969 structures) against the SMART database was performed to get all SMART sequences with a structure representation in PDB. All structures without ligands and all homodimer complexes were excluded, because homodimers are often an artefact of the crystallization process. The remaining sequences were scanned for atom–atom distances smaller than 4 Å between protein and ligand atoms. This length is consistent with the distance between two oxygen atoms in a hydrogen bond. After filtering, the training set contained 5590 sequences, each associated with one of 248 domains. Every sequence was linked to its ligand-specific interaction profiles. Interaction site information was grouped according to the three considered ligand categories: peptides, nucleotides and ions. Other ligand types were not incorporated in our analyses because of the low amount of data or unclassifiable ligands. The distribution of sequences to the final set of domains within the three ligand groups is listed in Supplementary Tables S9–S11.
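
The distance criterion can be sketched as follows. This is a simplified illustration and not the scanning procedure of Pils et al. (2005): the function name `interaction_profile`, the coordinate layout and the toy coordinates are assumptions, and parsing of PDB files is left out entirely.

```python
from math import dist

CUTOFF = 4.0  # distance threshold in Å, as stated in the text


def interaction_profile(residues, ligand_atoms, cutoff=CUTOFF):
    """Return one 0/1 label per residue.

    residues     : list of residues, each a list of (x, y, z) atom coordinates
    ligand_atoms : list of (x, y, z) coordinates of the ligand atoms
    A residue is labelled 1 if any of its atoms is closer than `cutoff`
    to any ligand atom, otherwise 0.
    """
    profile = []
    for atoms in residues:
        close = any(dist(a, l) < cutoff for a in atoms for l in ligand_atoms)
        profile.append(1 if close else 0)
    return profile


# Hypothetical coordinates: one ligand atom and two residues.
ligand = [(5.0, 0.0, 0.0)]
residues = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],  # second atom is 3.5 Å from the ligand atom
    [(10.0, 0.0, 0.0)],                  # 5.0 Å from the ligand atom
]
print(interaction_profile(residues, ligand))
```

The resulting 0/1 vector corresponds to the per-position interaction status that is later attached to each training sequence.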

Validation with generated sequences
The recognition of self-emitted interaction profiles by an ipHMM is a necessary condition for the prediction of binding sites in new sequences. In a first validation step we used the feature of trained ipHMMs to emit domain-specific sequences according to their model parameters. Interaction sites of generated sequences were predicted and these predictions were compared with generated state paths in the same way as described below. This evaluation considered the same ligand-specific ipHMMs as in cross-validation tests with a limit of at least 20 sequences in the ipHMM-alignment. The process of generation was repeated 10 times for every domain.
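
The idea of letting a trained model emit labelled sequences can be sketched with a strongly reduced model. The sketch below is an assumption-laden simplification: it uses only the two match states (named `Mni` and `Mi` here), ignores insert, delete, begin and end states, draws a fixed-length path, and uses invented probabilities. The function name `sample_path` is hypothetical.

```python
import random


def sample_path(init, trans, emit, length, rng):
    """Draw a state path and the emitted sequence from a discrete HMM.

    init, trans and emit are dictionaries keyed by state name; `rng` is a
    random.Random instance so that runs are reproducible.
    """
    states = list(init)
    state = rng.choices(states, weights=[init[s] for s in states])[0]
    path, seq = [], []
    for _ in range(length):
        symbols = list(emit[state])
        seq.append(rng.choices(symbols, weights=[emit[state][a] for a in symbols])[0])
        path.append(state)
        successors = list(trans[state])
        state = rng.choices(successors, weights=[trans[state][s] for s in successors])[0]
    return path, "".join(seq)


# Hypothetical parameters for a non-interacting (Mni) and an interacting (Mi) match state.
init = {"Mni": 0.8, "Mi": 0.2}
trans = {"Mni": {"Mni": 0.8, "Mi": 0.2}, "Mi": {"Mni": 0.5, "Mi": 0.5}}
emit = {"Mni": {"A": 0.5, "L": 0.5}, "Mi": {"D": 0.7, "E": 0.3}}

path, seq = sample_path(init, trans, emit, 30, random.Random(0))
print(seq)
print("".join("1" if s == "Mi" else "0" for s in path))
```

Because the generating state path is known, a prediction for `seq` can be compared position by position with `path`, which is the principle behind this validation step.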

Cross-validation and ROC curves
We tested the prediction accuracy of the interaction profile HMM with 5-fold cross-validation. This was done for all domains with at least 20 sequences in the training set. All domain-specific sets of sequences were partitioned into five equally sized parts. Each part served once as the test set, while ipHMMs were estimated from the remaining four parts. Testing was done by applying the developed posterior decoding algorithm to all test sequences of a domain and finding matches between the predicted binding sites and the extracted interaction profile of the sequences (true positives, TP). The initial threshold for the assignment of an interaction site was set to a posterior probability of 0.5. True negatives (TN) are correctly predicted non-interacting sites, and false positives (FP) are predicted interacting positions that were characterized as non-interacting in the preceding structure scan. Finally, false negatives (FN) are the reverse case: positions predicted as non-interacting that were characterized as interacting in the structure scan. We then calculated sensitivity, specificity and FP rates as shown in equations (1)–(3).

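\[ \text{sensitivity} = \frac{TP}{TP + FN} \tag{1} \]

\[ \text{specificity} = \frac{TN}{TN + FP} \tag{2} \]

\[ \text{FP rate} = \frac{FP}{FP + TN} \tag{3} \]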
To get a closer look at the quality of our interaction site predictions, we determined receiver operator characteristics (ROC) for ipHMMs from all three ligand categories (see Fig. 2). For this purpose, the sensitivity was plotted against the FP rate of equation (3). These values were calculated for increasing discrimination thresholds (steps of 0.02) in the range from 0 to 1.
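
The threshold sweep can be sketched in a few lines of Python. The sketch assumes the common definitions sensitivity = TP/(TP+FN) and FP rate = FP/(FP+TN); the function names `confusion`, `roc_points` and `auc` as well as the toy posteriors and labels are hypothetical and not taken from the study.

```python
def confusion(posteriors, truth, threshold):
    """Count (TP, FP, TN, FN) when sites with posterior > threshold are called interacting."""
    tp = fp = tn = fn = 0
    for p, t in zip(posteriors, truth):
        predicted = p > threshold
        if predicted and t:
            tp += 1
        elif predicted and not t:
            fp += 1
        elif not predicted and t:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def roc_points(posteriors, truth, step=0.02):
    """Return (FP rate, sensitivity) pairs for the thresholds 0, step, ..., 1."""
    points = []
    n_steps = round(1.0 / step)
    for i in range(n_steps + 1):
        tp, fp, tn, fn = confusion(posteriors, truth, i * step)
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        fp_rate = fp / (fp + tn) if fp + tn else 0.0
        points.append((fp_rate, sensitivity))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2.0 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


# Toy data: posterior of the interacting match state and the observed 0/1 label per site.
posteriors = [0.9, 0.8, 0.3, 0.6, 0.1, 0.05]
truth = [1, 1, 1, 0, 0, 0]
points = roc_points(posteriors, truth)
print(confusion(posteriors, truth, 0.5))
print(round(auc(points), 3))
```

Using integer multiples of the step avoids the accumulation of floating-point error that repeated addition of 0.02 would cause.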


    RESULTS
 
Interaction profile hidden Markov model
We applied the probabilistic approach of HMMs to the problem of predicting protein–ligand interaction sites. Our approach is based on the assumption that sequence patterns encoding protein function are shared between members of a domain family. These patterns are often weak and variable. To describe the above mentioned features of domain families, a novel HMM topology was designed by the adaptation of the pHMM (Eddy, 1998; Krogh et al., 1994b) architecture, which is the method of choice for homology detection (Madera and Gough, 2002). The state repertoire was extended by one further match state, namely an interacting match state (Mi). It inherits all features of a match state in the pHMM architecture. The resulting HMM topology is shown in Figure 1.


Fig. 1 Topology of the ipHMM following the restrictions and connectivity of the HMMer architecture. The match states of the classical pHMM are split into a non-interacting (Mni) and an interacting match state (Mi). Bold arrows indicate inserted transitions to or from new match states.

 

 
Like a pHMM, every ipHMM is a probabilistic representation of a protein or domain family. The parameters of an ipHMM are estimated from a multiple sequence alignment of domain family members that incorporates data on their binding sites and ligands from Pils et al. (2005). The same classification of alignment columns as for pHMMs is used, except that Mi states may additionally occur in match columns. The new kind of state has the same properties as a match state in the classic pHMM architecture: interacting match states are able to emit all amino acid symbols with probabilities according to their fitted parameters. In Figure 1, all bold arrows indicate new transition possibilities. Transition events are restricted to delete, insert and the two match states (main states). The last non-interacting match state demands a transition to the end state.

The information content of domain-specific training data influences the accuracy of model parameters. Therefore, ipHMMs were only built for protein domains with more than 20 domain family members in heterocomplexes with resolved structure information in PDB. All sequence positions were labelled with the corresponding interaction status (0 for not interacting and 1 for interacting). The ipHMM parameters are estimated by maximum likelihood, with transition events counted together with state emissions. A position-based weighting scheme (Henikoff and Henikoff, 1994) was applied to compensate for sequence redundancies that arise from PDB entries of one protein with different ligands. It derives sequence weights from column-specific weights that reflect the degree of redundancy within each alignment column.

Because some domains are represented by only a small amount of data, a regularization method is required to prevent zero probabilities in the HMM, especially for small training sets. Weighted pseudocounts (Durbin et al., 1998) with a total value of 20 for emissions and 5 for the transition set of each type of state are used for this purpose. The models were estimated for all ligand groups separately with the intention of increasing the power of prediction.
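
The two ingredients of the estimation, position-based sequence weights and pseudocount regularization, can be sketched for a single match column as follows. This is a hedged illustration rather than the authors' procedure: the alignment is a gap-free toy example, the pseudocounts are spread uniformly over the alphabet, the weights are rescaled to sum to the number of sequences, and the names `henikoff_weights` and `column_emissions` are hypothetical. Only the total pseudocount of 20 for emissions is taken from the text.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def henikoff_weights(alignment):
    """Position-based sequence weights (Henikoff and Henikoff, 1994), normalised to sum to 1."""
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        k = len(counts)  # number of distinct symbols in this column
        for i, symbol in enumerate(column):
            weights[i] += 1.0 / (k * counts[symbol])
    total = sum(weights)
    return [w / total for w in weights]


def column_emissions(column, labels, weights, pseudo_total=20.0):
    """Emission estimates for the Mni (label 0) and Mi (label 1) state of one match column.

    Weighted symbol counts are added to pseudocounts that sum to `pseudo_total`
    and are spread uniformly over the alphabet (an assumption of this sketch).
    """
    estimates = {}
    for label, state in ((0, "Mni"), (1, "Mi")):
        counts = {a: pseudo_total / len(AMINO_ACIDS) for a in AMINO_ACIDS}
        for symbol, lab, w in zip(column, labels, weights):
            if lab == label:
                counts[symbol] += w
        total = sum(counts.values())
        estimates[state] = {a: c / total for a, c in counts.items()}
    return estimates


alignment = ["DAG", "DAG", "EKG"]  # toy alignment with two redundant sequences
weights = henikoff_weights(alignment)
scaled = [w * len(alignment) for w in weights]  # rescale to the number of sequences
first_column = [row[0] for row in alignment]
emissions = column_emissions(first_column, [1, 1, 0], scaled)
print([round(w, 3) for w in weights])
print(round(emissions["Mi"]["D"], 3), round(emissions["Mi"]["E"], 3))
```

The two identical sequences share their weight, so the divergent third sequence counts more, and no amino acid receives a zero emission probability in either match state.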

The next problem is to apply ipHMMs to the prediction of binding sites in proteins of unknown function. A major advantage of the approach is that posterior decoding can be adapted to the new topology. The algorithm calculates probabilities for all emitting states at each sequence site, as shown in Figure 3 for the EF-hand domain, which displays the probabilities of the non-interacting and the interacting match state. Additionally, delete state probabilities can be displayed to obtain alignment information corresponding to the domain family. It was necessary to adapt the recursions of the forward and backward algorithms to the extended architecture of the ipHMMs.


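\[ f_{M^{i}_{k}}(j) = e_{M^{i}_{k}}(x_j) \left[ f_{M^{ni}_{k-1}}(j-1)\,\tau_{M^{ni}_{k-1} M^{i}_{k}} + f_{M^{i}_{k-1}}(j-1)\,\tau_{M^{i}_{k-1} M^{i}_{k}} + f_{I_{k-1}}(j-1)\,\tau_{I_{k-1} M^{i}_{k}} + f_{D_{k-1}}(j-1)\,\tau_{D_{k-1} M^{i}_{k}} \right] \tag{4} \]

\[ b_{M^{i}_{k}}(j) = b_{M^{ni}_{k+1}}(j+1)\,\tau_{M^{i}_{k} M^{ni}_{k+1}}\,e_{M^{ni}_{k+1}}(x_{j+1}) + b_{M^{i}_{k+1}}(j+1)\,\tau_{M^{i}_{k} M^{i}_{k+1}}\,e_{M^{i}_{k+1}}(x_{j+1}) + b_{I_{k}}(j+1)\,\tau_{M^{i}_{k} I_{k}}\,e_{I_{k}}(x_{j+1}) + b_{D_{k+1}}(j)\,\tau_{M^{i}_{k} D_{k+1}} \tag{5} \]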
In equations (4) and (5), the adaptation of the forward and backward values of the interacting match state Mi at sequence site j and profile position k is presented. The emission probability is denoted by e for the indicated state corresponding to a certain sequence and profile position. The transition probability τ is subscripted with indices of the present and the following state as well as the profile position. Forward and backward probabilities for the other states are obtained analogously (Rabiner, 1989; Durbin et al., 1998).

The posterior probabilities can then be calculated from the forward and backward values. The final step is the search for the state path with maximum posterior probabilities via back-tracking.
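
Once the posteriors are available, calling interaction sites reduces to a comparison with a threshold. The sketch below is a simplification under stated assumptions: the posteriors are supplied as one dictionary per sequence position, the most probable state is picked independently at every position instead of by back-tracking through allowed transitions, and the names `call_sites`, `Mni`, `Mi` and `D` as well as the toy values are hypothetical.

```python
def call_sites(posteriors, threshold=0.5):
    """Return the per-position best state and the 1-based positions called interacting.

    posteriors : list with one dict per sequence position, mapping state name
                 to posterior probability
    threshold  : positions whose interacting match state ("Mi") exceeds this
                 value are reported as interaction sites
    """
    best_states = [max(p, key=p.get) for p in posteriors]
    sites = [j + 1 for j, p in enumerate(posteriors) if p.get("Mi", 0.0) > threshold]
    return best_states, sites


# Toy posteriors for three sequence positions.
toy = [
    {"Mni": 0.7, "Mi": 0.2, "D": 0.1},
    {"Mni": 0.3, "Mi": 0.6, "D": 0.1},
    {"Mni": 0.1, "Mi": 0.1, "D": 0.8},
]
best_states, sites = call_sites(toy)
print(best_states, sites)
```

Lowering `threshold` trades specificity for sensitivity, which is why the threshold is treated as a tunable parameter in the evaluation.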

Validation tests with generated sequences
Estimated ipHMMs are able to emit typical sequences for the corresponding domain family. This feature was used to generate a large test set for a first validation of the prediction power of ipHMMs. The sequences were derived from ipHMM-specific emission and transition probabilities. That a sequence follows the model parameters of a certain domain family was the prerequisite for predicting its binding sites. The predicted state path was then aligned to the generated one. Sensitivity, specificity and accuracy were calculated from the alignment of state paths as mentioned below. Detailed results for all considered domains are listed in Supplementary Tables S5–S7. The average sensitivity and specificity values of 0.64 and 0.70 indicate a good quality of predictions for domain-related sequences in comparison with alternative methods (see below).

Receiver operator characteristics
The prediction method was evaluated by ROC analysis for several SMART domains. As shown in Figure 2a, ROC curves were calculated for peptide-ligand ipHMMs of the EF-Hand domain, the pancreatic RNAse domain, the alkaline phosphatase domain and the extension to Ser-/Thr-type protein kinase. Figure 2b presents ROC curves for the prediction of ion binding sites; the ipHMMs correspond to the alkaline phosphatase, EF-Hand, PBPe and Villin headpiece domains. Figure 2c shows ROC curves of ipHMMs for nucleotide ligands, comprising Pumilio-like repeats, the pancreatic RNAse domain, the HTH lactose operon repressor and the C4 zinc finger domain. Supplementary Table S8 summarises test values of all considered ipHMMs. The evaluation consists of cross-validation for a varying discrimination threshold. In this setting, ROC curves allow the expected prediction quality of a predictor on new data to be estimated. Accurate predictors exhibit areas under their ROC curves near one.


Fig. 2 ROC curves indicating the prediction power at various thresholds for the prediction of peptide, ion and nucleotide interaction sites. These calculations were performed for ipHMMs concerning peptide (a), ion (b) and nucleotide ligands (c).

 

 
The examined ipHMMs trained on nucleotide-ligand data showed on average the largest areas under their ROC curves (AUC) and consequently the highest prediction power. The prediction quality is more diverse in the other two ligand categories. The EF-hand ipHMM is an example of a non-optimal predictor in the cases of peptide and ion binding. This might be caused by too few or too similar training data. In contrast, the alkaline phosphatase ipHMM turned out to be a good predictor of ion–ligand interaction sites, while its prediction of peptide interactions is not perfect. Though the prediction quality of ipHMMs varied in some cases, AUC values were overall at a high level.

Validation of predictions on SMART domains
Testing was then extended to the whole set of estimated ipHMMs with at least 20 sequences of known structure. We performed a 5-fold cross-validation to calculate the expected prediction accuracy on new data. Based on the ROC results described above, a discrimination threshold of 0.2 for posterior match probabilities was chosen as the switching point between an interaction and no interaction to balance average sensitivity and specificity.
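
The partitioning step of a 5-fold cross-validation can be sketched as follows. How the sequences were actually assigned to the five parts is not stated; the interleaved assignment, the function name `k_fold_splits` and the placeholder sequence identifiers are assumptions of this sketch.

```python
def k_fold_splits(items, k=5):
    """Yield (train, test) pairs so that every item is in exactly one test set."""
    folds = [items[i::k] for i in range(k)]  # interleaved parts of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Placeholder identifiers for the 20 sequences of one domain family.
sequences = [f"seq{n:02d}" for n in range(20)]
splits = list(k_fold_splits(sequences))
for train, test in splits:
    print(len(train), len(test), test)
```

In each round an ipHMM would be estimated from `train` and evaluated on `test`, and the per-fold counts of TP, FP, TN and FN would then be combined.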

To evaluate the new approach, different prediction quality indicators were calculated, including sensitivity, specificity and accuracy. Their values for all ipHMMs are given in the Supplementary material. Results of the validation methods and of the ROC analysis indicate the best prediction performance for sites that interact with nucleotide ligands. These findings are supported by the higher sensitivity values for predictions of nucleotide binding sites.

An investigation of the prediction performance in case of the EF-Hand domain reveals accuracies between 0.73 and 0.86 depending on test and ligand type.

Interaction site prediction for the EF-hand domain
As an example application, the method was applied to the EF-Hand domains of calmodulin from Xenopus laevis, whose structure has already been resolved in complex with a peptide of Caenorhabditis elegans CaM-kinase kinase (Kurokawa et al., 2001). Predictions of peptide binding sites were performed for all sequences of the EF-Hand family separately using the trained ipHMM for the EF-Hand domain. The output of posterior probabilities for the C-terminal domain is displayed in Figure 3. The initial threshold for the prediction of interacting sites was set to a posterior probability of 0.5. The upper graph contains probabilities of non-interacting match states, while the graph below shows posterior values of possible interacting sites. Overall, we find a tendency towards higher interacting probabilities for match states at the edges of the domain. These areas correspond to its α-helices. The observed interacting positions are localized at sites 1, 5, 9, 22, 25 and 26, while sites 4 and 8 are incorrectly predicted as interactions (FP). The alignment of the EF-Hand sequence to the ipHMM resulted in six correct out of eight predicted interacting sites.
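
The counts reported for the C-terminal EF-Hand domain can be checked with simple set arithmetic. The sketch assumes that every observed site was among the predicted ones, which is what "six correct out of eight predicted" implies; the variable names are chosen for illustration only.

```python
observed = {1, 5, 9, 22, 25, 26}        # interacting sites observed in the complex structure
false_positives = {4, 8}                # sites reported as incorrectly predicted
predicted = observed | false_positives  # assumption: all observed sites were predicted

tp = len(predicted & observed)
fp = len(predicted - observed)
fn = len(observed - predicted)
fraction_correct = tp / (tp + fp)  # share of predicted sites that are correct
sensitivity = tp / (tp + fn)
print(tp, fp, fn, fraction_correct, sensitivity)
```

Under this assumption three quarters of the predicted sites are correct and no observed site is missed in this domain.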


Fig. 3 The stacked bar graph represents the prediction result of the posterior decoding for the C-terminal EF-hand motif of X. laevis calmodulin. It contains posterior probabilities of interacting (dark red) and non-interacting match states (orange) and delete states depending on the sequence position. The probabilities for all other states are not displayed because of their low level. All sites with a posterior probability higher than 0.5 for the interacting match state were predicted to interact with a peptide ligand.

 

 
Figure 4 visualizes the mapping of correct and false predictions focused on peptide-binding for all four EF-Hand domains of calmodulin from X. laevis. All proposed interaction sites were located on the α-helices and their residues were orientated towards the ligand.

Fig. 4 The 3-dimensional protein structure of X. laevis calmodulin in a calcium induced ligand binding conformation. The α-helix in the centre of the molecule is a ligand group from a CaM-kinase kinase. The four EF-hand domains are indicated in different blue colors. Residues marked in red are correctly predicted as interacting sites, orange residues are not detected as interactions and yellow residues are erroneously labelled as interacting.

 

 
Alternative approaches
An alternative approach to predict protein–protein interactions from sequence information only is based on a neural network with back-propagation (Ofran and Rost, 2003). The underlying dataset was derived from PDB by defining interactions as atom–atom distances smaller than 6 Å. The method showed a high rate of contact site detection when only trying to predict sites with the highest interacting evidence. But when the algorithm was tuned to predict less evident contact sites, the prediction quality dropped significantly. The application of the neural network to an unfiltered prediction of binding sites revealed a low sensitivity of ~30%. The results of this neural network based approach suggest that ipHMMs will exhibit a better performance for predictions of whole binding interfaces.

Many studies have dealt with the binding characteristics of protein sequences. Most of them focused, in contrast to the method presented here, on the investigation of binding patches in proteins of known structure. For this reason we concentrate on a comparison with the most recently published method for predicting small molecule binding interfaces of proteins (Snyder et al., 2006). In order to exemplify differences between the approaches, the peroxisome proliferator-activated receptor-gamma (PPAR-γ) was chosen as query protein. A large scale comparison of both approaches was not reasonable because of the restriction of SMID-BLAST to small molecule interactions. The transcription factor in complex with coactivating ligands influences important cellular processes like adipogenesis, anti-inflammatory effects and anti-proliferating function in many types of cancer (Lehrke and Lazar, 2005). Experimentally determined interaction sites to a protein were derived from the homodimeric crystal structure of PPAR-γ with a fragment of the steroid receptor coactivator 1 (SRC-1) and rosiglitazone, a high affinity ligand for PPAR-γ [PDB identifier: 2PRG; Fig. 5; Nolte et al. (1998)]. The PPAR-γ sequence was excluded from the training set (consisting of 32 sequences) of a new HOLI-ipHMM, which was built for comparison purposes.


Fig. 5 A comparison of SMID-BLAST and ipHMMs mapped to the crystal structure of the homodimeric PPAR-γ in complex with an LXXLL helix of the SRC-1 co-activator (yellow). PPAR-γ contains a ligand binding domain (SMART: HOLI, brick red) at 81 to 220 amino acids. All verified interacting residues were highlighted by red sticks. The colors blue and green represent false positives of SMID-BLAST and ipHMM predictions, respectively.

 

 
The binding interface of PPAR-γ is located on the hormone receptor binding domain (SMART domain HOLI). The interaction sites were determined as described above, and the predictions are mapped on the structure of the PPAR-γ complex. The HOLI domain is coloured in brick red, and peptide-binding sites correctly predicted with the ipHMM are displayed in red. The SMID-BLAST prediction overlaps with the experimentally derived binding interface only at position 314Q. Further interactions incorrectly assigned by SMID-BLAST are shown in blue. SMID-BLAST provides a list of binding patches to the user, each binding one small molecule. While evaluating the results of SMID-BLAST, all top 10 binding patches were scanned to get an entire set of different predicted interaction sites.

The example prediction underlines the advantages of the HMM-based method in predicting the more variable protein interaction sites. In contrast to SMID-BLAST, the knowledge of interaction sites is not directly transferred from single members of the domain family, but probabilistically assigned taking all known interactions in the domain at a given position into account. The ipHMM was able to find all verified peptide interactions except those at sites 20 and 26, while SMID-BLAST only found one contact site that overlaps with the binding patch derived from the structure of the complex. In addition, SMID-BLAST seems to produce a high rate of false positives in this case. The ligands proposed by SMID-BLAST were mainly ions or organic compounds. For a general comparison of Blast-based approaches with ipHMMs, a simple predictor was created, which searches for homologous sequences in the dataset of proteins with verified interactions. The best hit at a given identity threshold was taken to transfer its interactions to the query. The sensitivity of the prediction was significantly lower compared to the average values observed for ipHMMs. Data on the results of these predictions can be found in the Supplementary material.
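
The principle of such a best-hit transfer predictor can be sketched as follows. This is not the predictor used in the comparison: instead of a Blast search, the sketch assumes that the query and all database sequences are already aligned to the same columns, and the function names `identity` and `transfer_sites`, the identity threshold of 0.4 and the toy sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of identical residues over aligned column pairs without gaps."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def transfer_sites(query, database, min_identity=0.4):
    """Copy the 0/1 interaction labels of the most similar database entry to the query.

    database : list of (aligned_sequence, labels) pairs; labels has one 0/1
               entry per alignment column
    Returns a list of 0/1 labels for the query; all zeros if no entry reaches
    `min_identity`.
    """
    best = max(database, key=lambda entry: identity(query, entry[0]), default=None)
    if best is None or identity(query, best[0]) < min_identity:
        return [0] * len(query)
    sequence, labels = best
    return [lab if q != "-" and s != "-" else 0
            for q, s, lab in zip(query, sequence, labels)]


# Toy data: the first entry shares five of six residues with the query.
database = [
    ("DKDGNG", [1, 0, 1, 0, 1, 0]),
    ("AAAAAA", [0, 1, 0, 1, 0, 1]),
]
query = "DKDGDG"
print(transfer_sites(query, database))
```

Such a predictor depends entirely on the annotation of a single hit, whereas an ipHMM pools the interaction data of all family members at each profile position.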


    CONCLUSION
 
In this article, a new method for the prediction of protein binding sites for different types of protein ligands was introduced. It is the first to incorporate information about homology as well as binding sites in an HMM topology. HMMs have already been applied to various analytical tasks because of their efficiency and comparatively high accuracy. The main novelty in the architecture of the interaction profile HMM is a second match state that represents interacting sequence positions. It was demonstrated in validation tests and on the example of calmodulin binding sites that the algorithm is able to detect the majority of existing interactions in a protein sequence. Interacting positions were determined from structures of protein–ligand complexes according to the length of a hydrogen bond (4 Å).

The introduced approach enables the detection of a wide range of ligand binding sites. In contrast to most alternative solutions, the ipHMMs predict interaction sites in the context of domain families, which leads to a higher prediction quality. The increase in predictive power is indicated by a significantly higher sensitivity. In comparison to alternative methods like SMID-BLAST, ipHMMs cover a larger spectrum of predictable types of interaction sites. Furthermore, interfaces consisting of a novel combination of known interaction sites in a domain family could be detected by ipHMMs.

For all existing predictors including ipHMMs, initial structure information is necessary for the training process. Once the corresponding ipHMMs have been trained, binding sites can be determined in sequences of unknown structure. The sensitivity for contact site detection of ion ipHMMs is slightly lower than for peptide and nucleotide ipHMMs because of lower sequence coverage. Increasing numbers of identified protein structures will improve the prediction power of ipHMMs in general and especially in cases where little sequence information is available so far.

The proposed method also provides information on the quality of individual interaction site predictions. The state-associated posterior probabilities of sequence positions indicate how well the ipHMM used can distinguish between state alternatives, which is a valuable aid in interpreting prediction results. The new ipHMMs inherit all features of pHMMs. Once the amount of interaction site data reaches a sufficient level, existing HMMs in frequently used databases such as SMART and Pfam could be replaced by ipHMMs.
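How state-associated posterior probabilities separate the two match states can be sketched with a forward-backward computation on a strongly simplified model. The sketch assumes one residue per model column (no insert or delete states), position-independent transitions and a two-letter toy alphabet; all parameter values and names are hypothetical and are not those of the trained ipHMMs.

```python
def posterior_interacting(seq, emit_non, emit_int, trans, start):
    """Posterior P(column k is in the interacting match state | seq).

    seq:      string with one residue per model column (toy model without gaps)
    emit_non: list of dicts, emissions of the non-interacting match state per column
    emit_int: list of dicts, emissions of the interacting match state per column
    trans:    trans[s][t] = P(next state t | current state s); 0 = non, 1 = interacting
    start:    [P(first state non-interacting), P(first state interacting)]
    """
    n = len(seq)

    def emit(k, s):
        table = emit_non if s == 0 else emit_int
        return table[k][seq[k]]

    # Forward pass: fwd[k][s] = P(x_0..x_k, state_k = s)
    fwd = [[start[s] * emit(0, s) for s in (0, 1)]]
    for k in range(1, n):
        fwd.append([
            emit(k, t) * sum(fwd[k - 1][s] * trans[s][t] for s in (0, 1))
            for t in (0, 1)
        ])

    # Backward pass: bwd[k][s] = P(x_{k+1}..x_{n-1} | state_k = s)
    bwd = [[1.0, 1.0] for _ in range(n)]
    for k in range(n - 2, -1, -1):
        bwd[k] = [
            sum(trans[s][t] * emit(k + 1, t) * bwd[k + 1][t] for t in (0, 1))
            for s in (0, 1)
        ]

    total = sum(fwd[n - 1])  # likelihood of the sequence under the toy model
    return [fwd[k][1] * bwd[k][1] / total for k in range(n)]


# Hypothetical toy parameters: only column 1 discriminates between the states
emit_non = [{"A": 0.5, "D": 0.5}, {"A": 0.9, "D": 0.1}, {"A": 0.5, "D": 0.5}]
emit_int = [{"A": 0.5, "D": 0.5}, {"A": 0.1, "D": 0.9}, {"A": 0.5, "D": 0.5}]
trans = [[0.9, 0.1], [0.6, 0.4]]
start = [0.8, 0.2]

post_d = posterior_interacting("ADA", emit_non, emit_int, trans, start)
post_a = posterior_interacting("AAA", emit_non, emit_int, trans, start)
print(post_d[1], post_a[1])
```

In this toy setting, the posterior of the interacting state at column 1 is high when the residue favoured by that state is observed and low otherwise. Values close to 0.5 would mark positions where the model cannot clearly distinguish between the state alternatives.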

Future perspective
The growing amount of protein sequence and structure data will lead to good sequence coverage for the majority of domain families and consequently to improved ipHMMs. Furthermore, these new sequences and structures open up the possibility of building ipHMMs for new domain families, or for known families that are not yet included in the ipHMM library because too little data is available. Other types of binding interfaces, such as those for carbohydrates or lipids, could easily be modelled with the same HMM topology.


    Acknowledgments
 
The authors would like to thank all members of the departments of bioinformatics in Würzburg for helpful discussion. Special thanks go to J. Engelmann, S. Pinkert, J. Thakar and P. Seibel. Furthermore, the authors thank S. Rahmann for technical discussion. The authors gratefully acknowledge funding by DFG (B01099/5-3) and Land Bayern (Foringen TP D1).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Golan Yona

Received on June 29, 2006; revised on September 8, 2006; accepted on September 15, 2006

    REFERENCES

    Aloy, P., et al. (2004) Structure-based assembly of protein complexes in yeast. Science, 303, 2026–2029.

    Barrera, F.N., et al. (2003) Binding of the C-terminal sterile alpha motif (SAM) domain of human p73 to lipid membranes. J. Biol. Chem., 278, 46878–46885.

    Bateman, A., et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.

    Boeckmann, B., et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.

    Bradford, J.R. and Westhead, D.R. (2005) Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics, 21, 1487–1494.

    Chung, J.-L., et al. (2006) Exploiting sequence and structure homologs to identify protein-protein binding sites. Proteins, 62, 630–640.

    Deshpande, N., et al. (2005) The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema. Nucleic Acids Res., 33, D233–D237.

    Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.

    Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.

    Eddy, S. (2003) Hmmer. Technical report, Howard Hughes Medical Institute, Dept. of Genetics; Washington University School of Medicine.

    Edwards, T.A., et al. (2005) Solution structure of the Vts1 SAM domain in the presence of RNA. J. Mol. Biol., 356, 1065–1072.

    Fariselli, P., et al. (2002) Prediction of protein–protein interaction sites in heterocomplexes with neural networks. Eur. J. Biochem., 269, 1356–1361.

    Haft, D.H., et al. (2003) The TIGRFAMs database of protein families. Nucleic Acids Res., 31, 371–373.

    Hendlich, M., et al. (2003) Relibase: design and development of a database for comprehensive analysis of protein-ligand interactions. J. Mol. Biol., 326, 607–620.

    Henikoff, S. and Henikoff, J.G. (1994) Position-based sequence weights. J. Mol. Biol., 243, 574–578.

    Hughey, R. and Krogh, A. (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput. Appl. Biosci., 12, 95–107.

    Jones, S. and Thornton, J.M. (1997) Prediction of protein–protein interaction sites using patch analysis. J. Mol. Biol., 272, 133–143.

    Kim, C.A. and Bowie, J.U. (2003) SAM domains: uniform structure, diversity of function. Trends Biochem. Sci., 28, 625–628.

    Koike, A. and Takagi, T. (2004) Prediction of protein–protein interaction sites using support vector machines. Protein Eng. Des. Sel., 17, 165–173.

    Krogh, A., et al. (1994a) A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Res., 22, 4768–4778.

    Krogh, A., et al. (1994b) Hidden Markov models in computational biology. Applications to protein modeling. J. Mol. Biol., 235, 1501–1531.

    Kurokawa, H., et al. (2001) Target-induced conformational adaptation of calmodulin revealed by the crystal structure of a complex with nematode Ca(2+)/calmodulin-dependent kinase kinase peptide. J. Mol. Biol., 312, 59–68.

    Lehrke, M. and Lazar, M.A. (2005) The many faces of PPARgamma. Cell, 123, 993–999.

    Letunic, I., et al. (2004) SMART 4.0: towards genomic data integration. Nucleic Acids Res., 32, D142–D144.

    Lichtarge, O., et al. (1996) An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol., 257, 342–358.

    Madera, M. and Gough, J. (2002) A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res., 30, 4321–4328.

    Milburn, D., et al. (1998) Sequences annotated by structure: a tool to facilitate the use of structural information in sequence analysis. Protein Eng., 11, 855–859.

    Nolte, R.T., et al. (1998) Ligand binding and co-activator assembly of the peroxisome proliferator-activated receptor-gamma. Nature, 395, 137–143.

    Ofran, Y. and Rost, B. (2003) Predicted protein–protein interaction sites from local sequence information. FEBS Lett., 544, 236–239.

    Pils, B., et al. (2005) Variation in structural location and amino acid conservation of functional sites in protein domain families. BMC Bioinformatics, 6, 210.

    Rabiner, L.R. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257–286.

    Schultz, J., et al. (1997) SAM as a protein interaction domain involved in developmental regulation. Protein Sci., 6, 249–253.

    Schultz, J., et al. (1998) SMART, a simple modular architecture research tool: identification of signaling domains. Proc. Natl. Acad. Sci. USA, 95, 5857–5864.

    Snyder, K.A., et al. (2006) Domain-based small molecule binding site annotation. BMC Bioinformatics, 7, 152.

    Sonnhammer, E.L., et al. (1998) A hidden Markov model for predicting transmembrane helices in protein sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol., 6, 175–182.

    Thanos, C.D., et al. (1999) Oligomeric structure of the human EphB2 receptor SAM domain. Science, 283, 833–836.

    Wu, C.H. (2006) The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res., 34, D187–D191.

    Zhou, H.X. and Shan, Y. (2001) Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins, 44, 336–343.