Extracting Protein Names from Biological Literature

Publication year: 1392 (Solar Hijri)
Document type: Journal article
Language: English

The full text of this article is 11 pages long and is available in PDF format.

National scientific document ID:

JR_ACSIJ-3-2_009

Indexing date: 24 Farvardin 1393

Abstract:

Named entity recognition is an essential task in extracting biological knowledge. In a biological corpus, protein names and other terminology are mixed into natural language sentences. Whether an abbreviation is a protein name sometimes depends on the context. Protein names are often composed of gene names, cell names, or even drug names, and the number of newly coined protein names keeps increasing. Even with the assistance of a dictionary, it is still hard to automatically and correctly identify all protein names in a biological corpus. We modify a hierarchical model of protein name tokens. On the one hand, we use a rule-based method to improve the accuracy of protein name recognition. On the other hand, we use the N-gram language model to determine the boundaries of protein names. Numerous studies have noted that the hardest part is identifying abbreviations and words beginning with an uppercase letter. To enhance recognition performance, we use a dictionary to strengthen the recognition of abbreviations and words beginning with an uppercase letter. Experimental results show an increase of about 10% in performance. We use the YAPEX and GENIA corpora for the experiments. In our study, the F-score reaches 0.697 on the YAPEX corpus and 0.691 on the GENIA corpus. Finally, when abbreviation recognition is strengthened with the UniProt dictionary database, the F-score reaches 0.797 on the YAPEX corpus and 0.806 on the GENIA corpus.
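
As a rough illustration of the dictionary step and the evaluation metric mentioned in the abstract, the Python sketch below flags tokens that are abbreviations or begin with an uppercase letter, confirms them against a dictionary, and computes an F-score. It is not the authors' implementation. The dictionary entries, the abbreviation pattern, and the function names are all hypothetical, and the F-score is assumed to be the balanced (F1) form, which the abstract does not specify.

```python
import re

# Hypothetical stand-in for a protein-name dictionary such as UniProt;
# these entries are illustrative only.
PROTEIN_DICTIONARY = {"p53", "IL-2", "Insulin"}

# Assumed abbreviation heuristic: a few letters followed by a digit or hyphen,
# or a short all-uppercase token. The paper's actual rules are not given here.
ABBREVIATION = re.compile(r"[A-Za-z]{1,5}[-\d][A-Za-z\d-]*|[A-Z]{2,6}")


def is_candidate(token):
    """Return True for tokens that begin with uppercase or look like abbreviations."""
    return token[:1].isupper() or bool(ABBREVIATION.fullmatch(token))


def tag_tokens(tokens, dictionary=PROTEIN_DICTIONARY):
    """Mark a token as a protein name only if it is a candidate found in the dictionary."""
    known = {name.lower() for name in dictionary}
    return [is_candidate(t) and t.lower() in known for t in tokens]


def f_score(true_positives, false_positives, false_negatives):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = ["The", "p53", "protein", "binds", "DNA", "and", "IL-2"]
    print(tag_tokens(sentence))
    print(round(f_score(8, 2, 2), 3))
```

In this sketch, "The" and "DNA" are candidates because they begin with an uppercase letter, but they are rejected because they are absent from the dictionary. This is the kind of ambiguity that a dictionary lookup is meant to resolve.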

Authors

Huang-Cheng Kuo

Department of Computer Science and Information Engineering, National Chiayi University, Chia-Yi City, Taiwan

Ken-I Lin

Department of Computer Science and Information Engineering, National Chiayi University, Chia-Yi City, Taiwan