Protein interaction recognition in text

Overview

Protein-protein interactions play an important role in vital biological processes such as metabolic and signaling pathways and cell cycle control. A number of (mostly manually curated) databases have been created to store protein interaction information in structured, standard formats. However, the biomedical literature on protein interactions is growing rapidly, and it is difficult for database curators to detect and curate this information manually. As a result, most protein interaction information remains hidden in the text of biomedical papers. The development of information extraction and text mining techniques for automatically extracting protein interaction information from free text is therefore an important research area.

We are working on natural language processing and machine learning methods for automatically extracting protein interaction information from biomedical articles.

 

System Description

 

Below is a sample biomedical text, where protein names are marked in red and the interacting pairs are connected with dashed lines.

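[Figure: sample biomedical text with protein names marked in red and interacting pairs connected by dashed lines]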

 

Our aim is to identify the interacting protein pairs and extract the sentences that describe their interactions. The figure below gives an overview of our system.

 

The first step is identifying the protein names. Next, we use dependency parsing to extract features from the sentences. Unlike a syntactic parse (which describes the syntactic constituent structure of a sentence), the dependency parse of a sentence captures the semantic predicate-argument relationships among its words. The figure below shows the dependency parse tree of the last sentence in the sample biomedical text above.

 

 

 

 

We extract the path between each protein pair from the dependency parse tree of each sentence and define similarity functions between these paths. We then use the path similarity measures with supervised and semi-supervised nearest-neighbor (k-NN and harmonic functions) and kernel-based (SVM and TSVM) machine learning approaches.
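As a rough illustration of the path-based idea described above, the following Python sketch extracts the path between two protein mentions in a toy dependency tree, scores two paths with a normalized edit distance, and classifies a new path with a plain k-NN vote. Everything in it is an assumption made for illustration: the function names, the toy sentence and labels, the PROT1/PROT2 placeholders, and the choice of edit distance as the similarity function are not taken from this page, and the semi-supervised and kernel-based variants are not shown.

```python
from collections import Counter, deque


def dependency_path(edges, source, target):
    """Return the token path between two nodes of a dependency tree.

    `edges` is a list of (head, dependent) pairs. Tokens are plain strings
    here for brevity; a real system would use token indices so that
    repeated words stay distinct.
    """
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, []).append(dep)
        graph.setdefault(dep, []).append(head)
    parent = {source: None}
    queue = deque([source])
    while queue:  # breadth-first search over the undirected tree
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return []  # the two tokens are not connected


def edit_distance(a, b):
    """Token-level Levenshtein distance between two paths."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def path_similarity(a, b):
    """Assumed similarity: 1 minus the length-normalized edit distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def knn_predict(query, labeled_paths, k=3):
    """Majority vote among the k labeled paths most similar to `query`."""
    ranked = sorted(labeled_paths,
                    key=lambda item: path_similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy dependency tree for "PROT1 interacts with PROT2" (hypothetical parse).
edges = [("interacts", "PROT1"), ("interacts", "with"), ("with", "PROT2")]
query = dependency_path(edges, "PROT1", "PROT2")

# Hypothetical labeled training paths.
labeled = [
    (["PROT1", "binds", "to", "PROT2"], "interaction"),
    (["PROT1", "interacts", "with", "PROT2"], "interaction"),
    (["PROT1", "and", "PROT2"], "no_interaction"),
]
print(query, knn_predict(query, labeled, k=1))
```

Replacing the protein names with placeholders such as PROT1 and PROT2 keeps the comparison focused on the connecting words rather than on the specific proteins; this is a design choice of the sketch, not a documented detail of the system.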

 

System Availability

 

Our protein interaction extraction approach is a component of the GIN (Gene Interaction Network) system.

 

Publications

 

Gunes Erkan, Arzucan Ozgur, and Dragomir R. Radev, "Semi-supervised classification for extracting protein interaction sentences using dependency parsing", Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '07), Prague, Czech Republic, June 28-30, 2007. (PDF)

 

Gunes Erkan, Arzucan Ozgur, and Dragomir R. Radev, "Extracting Interacting Protein Pairs and Evidence Sentences by using Dependency Parsing and Machine Learning Techniques", Proceedings of the Second BioCreative Challenge Evaluation Workshop, ISBN 84-933255-6-2. Madrid, Spain, April 2007. (PDF)
 

Gunes Erkan, Arzucan Ozgur, and Dragomir R. Radev, "Extracting protein interactions using syntactic dependencies", NCIBI All-Hands Meeting Poster Session, 2007. (PDF)

Funding

This work was supported in part by grants R01-LM008106 and U54-DA021519 from the US National Institutes of Health.