Extracting Protein Names from Biological Literature

Named entity recognition is an essential task in extracting biological knowledge. In a biological corpus, protein names and other terminology are mixed into natural language sentences. Whether an abbreviation is a protein name sometimes depends on the context. Protein names are often composed of gene names, cell names, or even drug names. Moreover, the number of newly coined protein names keeps increasing. Even with the assistance of a dictionary, it is still hard to automatically and correctly identify all protein names in a biological corpus. We modify a hierarchical model of protein name tokens. On the one hand, we use a rule-based method to improve the accuracy of protein name recognition. On the other hand, we use an N-gram language model to determine the boundaries of protein names. Numerous studies have noted that the hardest part is identifying abbreviations and words beginning with an uppercase letter. To enhance recognition performance, we use a dictionary to strengthen the recognition of abbreviations and words beginning with an uppercase letter. Experimental results show an increase of about 10% in performance. We use the YAPEX and GENIA corpora for the experiments. In our study, the F-score reaches 0.697 on the YAPEX corpus and 0.691 on the GENIA corpus. Finally, when recognition of abbreviations is strengthened using the UniProt dictionary database, the F-score reaches 0.797 on the YAPEX corpus and 0.806 on the GENIA corpus.
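As a rough illustration of the dictionary step and the evaluation metric mentioned above, the following minimal sketch checks abbreviation-like and capitalized tokens against a dictionary and computes an F-score. It is not the method of this work: the tiny `PROTEIN_DICT` set, the `is_candidate` heuristic, and the function names are all hypothetical, and a real system would load entries from UniProt and evaluate on multi-token name spans.

```python
import re

# Hypothetical stand-in for a protein-name dictionary (assumed entries, lowercased).
PROTEIN_DICT = {"p53", "il-2", "nf-kappab", "stat3"}

def is_candidate(token: str) -> bool:
    """Assumed heuristic: token starts with an uppercase letter,
    or mixes letters and digits (abbreviation-like)."""
    starts_upper = re.match(r"[A-Z]", token) is not None
    has_letter = re.search(r"[A-Za-z]", token) is not None
    has_digit = re.search(r"\d", token) is not None
    return starts_upper or (has_letter and has_digit)

def tag_tokens(tokens, dictionary=PROTEIN_DICT):
    """Keep candidate tokens whose lowercased form is in the dictionary."""
    return [tok for tok in tokens if is_candidate(tok) and tok.lower() in dictionary]

def f_score(predicted, gold):
    """Balanced F-score (harmonic mean of precision and recall) over sets of names."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

tokens = "The STAT3 protein binds IL-2 and DNA".split()
found = tag_tokens(tokens)                      # dictionary hits among candidates
score = f_score(found, ["STAT3", "IL-2", "p53"])  # compare against an assumed gold set
```

In this toy run, "The" and "DNA" pass the candidate heuristic but are rejected by the dictionary lookup, which is the role the dictionary plays for capitalized words and abbreviations.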


Huang-Cheng Kuo

Department of Computer Science and Information Engineering, National Chiayi University, Chia-Yi City, Taiwan

Ken-I Lin

Department of Computer Science and Information Engineering, National Chiayi University, Chia-Yi City, Taiwan