Protein-protein interactions play an important role in vital biological
processes such as metabolic and signaling pathways, and cell cycle control.
A number of (mostly manually curated) databases have been created to
store protein interaction information in structured and standard formats.
However, the biomedical literature on protein interactions is growing
rapidly, and it is difficult for interaction database curators to detect
and curate protein interaction information manually. As a result, most of
the protein interaction information remains hidden in the text of the papers
in the biomedical literature. The development of information extraction and
text mining techniques for automatically extracting protein interaction
information from free text is therefore an important research area.
We are working on natural language processing and machine learning methods for automatically extracting protein
interaction information from biomedical articles.
System Description
Below is a sample biomedical text, where protein names are marked in red and the interacting pairs are connected with dashed lines.

Our aim is to identify the interacting protein pairs and extract the sentences that describe their interactions. The figure below gives an overview of our system.
The first step is to identify the protein names. Next, we use dependency parsing to extract features from the sentences. Unlike a syntactic parse (which describes the syntactic constituent structure of a sentence), the dependency parse of a sentence captures the semantic predicate-argument relationships among its words. The figure below shows the dependency parse tree of the last sentence of the sample biomedical text above.
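To make the predicate-argument view concrete, a dependency parse can be stored as a list of head-to-dependent links. The sketch below hand-codes such a parse for a made-up sentence. The sentence, the placeholders PROT1 and PROT2, the relation labels, and the helper names are all illustrative assumptions; none of this is output from the parser used in our system.

```python
# Hand-written dependency parse of the made-up sentence
# "PROT1 interacts with PROT2", stored as (head, relation, dependent)
# triples. The relation labels are illustrative assumptions.
DEPENDENCIES = [
    ("interacts", "nsubj", "PROT1"),
    ("interacts", "prep", "with"),
    ("with", "pobj", "PROT2"),
]


def root_of(dependencies):
    """Return the one word that heads other words but has no head itself."""
    heads = {head for head, _, _ in dependencies}
    dependents = {dep for _, _, dep in dependencies}
    roots = heads - dependents
    if len(roots) != 1:
        raise ValueError("expected exactly one root word")
    return roots.pop()


def dependents_of(head, dependencies):
    """Return the (relation, dependent) pairs attached directly to `head`."""
    return [(rel, dep) for h, rel, dep in dependencies if h == head]


# The root is the predicate; its direct dependents are its arguments.
predicate = root_of(DEPENDENCIES)
print(predicate, dependents_of(predicate, DEPENDENCIES))
```

Here the verb is the root of the tree and both protein mentions hang below it, which is the kind of relationship a constituent structure does not state directly.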

We extract the path between the two proteins of each candidate pair from the dependency parse tree of the sentence and define similarity functions over these paths. We use these path similarity measures with supervised and semi-supervised machine learning approaches: nearest-neighbor methods (k-NN and harmonic functions) and kernel-based methods (SVM and transductive SVM, TSVM).
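The path extraction and nearest-neighbor steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the GIN implementation: the similarity used here (one minus a length-normalized word-level edit distance) merely stands in for the path similarity functions, the training paths and labels are invented, words are assumed to be unique within a sentence, and only plain k-NN is shown (harmonic functions, SVM, and TSVM are omitted).

```python
from collections import Counter


def path_to_root(word, head_of):
    """Follow head links from `word` up to the root of the tree."""
    path = [word]
    while head_of.get(path[-1]) is not None:
        path.append(head_of[path[-1]])
    return path


def dependency_path(a, b, head_of):
    """Return the words on the tree path from `a` to `b`, endpoints included."""
    up_a = path_to_root(a, head_of)
    up_b = path_to_root(b, head_of)
    ancestors_b = set(up_b)
    for i, word in enumerate(up_a):
        if word in ancestors_b:  # lowest common ancestor of a and b
            j = up_b.index(word)
            return up_a[: i + 1] + up_b[:j][::-1]
    raise ValueError("words are not in the same tree")


def edit_distance(p, q):
    """Word-level Levenshtein distance between two paths."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, start=1):
        curr = [i]
        for j, b in enumerate(q, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def path_similarity(p, q):
    """Assumed similarity: 1 minus the length-normalized edit distance."""
    longest = max(len(p), len(q))
    return 1.0 if longest == 0 else 1.0 - edit_distance(p, q) / longest


def knn_predict(path, labeled_paths, k=3):
    """Majority label among the k labeled paths most similar to `path`."""
    ranked = sorted(
        labeled_paths,
        key=lambda example: path_similarity(path, example[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical parse of "PROT1 associates with PROT2": word -> head word.
HEAD_OF = {"associates": None, "PROT1": "associates",
           "with": "associates", "PROT2": "with"}

# Invented labeled paths (True = the pair interacts).
TRAIN = [
    (["PROT1", "interacts", "with", "PROT2"], True),
    (["PROT1", "binds", "PROT2"], True),
    (["PROT1", "binds", "to", "PROT2"], True),
    (["PROT1", "levels", "unlike", "those", "of", "PROT2"], False),
    (["PROT1", "expressed", "in", "cells", "lacking", "PROT2"], False),
]

query = dependency_path("PROT1", "PROT2", HEAD_OF)
print(query, knn_predict(query, TRAIN, k=3))
```

A real system would index tokens by position rather than by word form, since the same word can occur more than once in a sentence.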
System Availability
Our protein interaction extraction approach is a component of the GIN (Gene Interaction Network) system.
Publications
Gunes Erkan, Arzucan Ozgur, and Dragomir R. Radev, "Semi-supervised classification for extracting protein interaction sentences using dependency parsing", Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '07), Prague, Czech Republic, June 28-30, 2007.
Gunes Erkan, Arzucan Ozgur, and Dragomir R. Radev, "Extracting Interacting Protein Pairs and Evidence Sentences by using Dependency Parsing and Machine Learning Techniques", Proceedings of the Second BioCreative Challenge Evaluation Workshop, ISBN 84-933255-6-2, Madrid, Spain, April 2007.
Gunes Erkan, Arzucan Ozgur, and Dragomir R. Radev, "Extracting protein interactions using syntactic dependencies", NCIBI All-Hands Meeting Poster Session, 2007.
This work was supported in part by grants R01-LM008106 and U54-DA021519 from the US National Institutes of Health.