TM-Bench: A benchmark dataset for thermophilic-mesophilic proteins classification

Publication year: 1400 (Solar Hijri)
Document type: Conference paper
Language: English

National scientific document ID:

IBIS10_041

Indexing date: 5 Tir 1401

Abstract:

Recently, machine learning approaches have become conventional methods for solving biological problems. The thermal stability of thermophilic and hyper-thermophilic proteins has made them suitable candidates for medical and industrial applications. Thus, various machine learning methods have been introduced to predict thermophilic proteins and discriminate them from their mesophilic counterparts based on sequence information. Most of these studies have reported accuracies of more than 90 percent, which seems optimistic. Using an inappropriate dataset can be the main source of this overestimation. Hence, comparing the various approaches has become challenging due to the lack of a gold standard dataset. Here we introduce the TM-Bench dataset. Zhang and Fang made the first effort to discriminate thermophilic and mesophilic proteins via pattern recognition methods. Since then, a variety of approaches such as SVM, artificial neural network, decision tree, k-nearest neighbor, genetic algorithm, and Naive Bayes have been adopted for the classification of thermophilic and non-thermophilic proteins solely based on protein sequence information. In this study, we used the BacDive database to extract a list of thermophilic and mesophilic organisms based on their optimum growth temperature. Next, after extracting the corresponding protein sequences from the Swiss-Prot database, redundancies in the primary dataset were removed with the CD-HIT tool. Subsequently, the balanced and imbalanced datasets were fed to the above-mentioned methods to re-evaluate their performance. Our results indicate that sensitivity, specificity, and accuracy were lower than previously reported for the balanced dataset, and that with the imbalanced dataset, sensitivity drops dramatically. Overall, Multi-Layer Perceptron and Logit Boost showed better performance than the other methods on the balanced dataset, with accuracies of 81% and 78%, sensitivities of 82% and 79%, and specificities of 80% and 77%, respectively.
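The three measures reported in the abstract can be illustrated with a minimal sketch. It assumes thermophilic proteins are the positive class, and every count below is hypothetical: none is taken from TM-Bench. The names `ConfusionCounts`, `balanced`, and `imbalanced` are illustrative only. The balanced counts were chosen so that they reproduce the 82% sensitivity and 80% specificity reported for Multi-Layer Perceptron, which shows that on equal-sized classes accuracy is simply the mean of the two (81%). The imbalanced counts show how accuracy can stay high while sensitivity collapses.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts for a binary classifier; thermophilic is assumed to be the positive class."""
    tp: int  # thermophilic proteins predicted as thermophilic
    fn: int  # thermophilic proteins predicted as mesophilic
    tn: int  # mesophilic proteins predicted as mesophilic
    fp: int  # mesophilic proteins predicted as thermophilic


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of thermophilic proteins that were recovered."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of mesophilic proteins that were correctly rejected."""
    return c.tn / (c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all proteins that were classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)


# Hypothetical balanced test set: 100 thermophilic and 100 mesophilic proteins.
balanced = ConfusionCounts(tp=82, fn=18, tn=80, fp=20)

# Hypothetical imbalanced test set: 100 thermophilic and 1000 mesophilic proteins.
imbalanced = ConfusionCounts(tp=30, fn=70, tn=950, fp=50)

for name, counts in [("balanced", balanced), ("imbalanced", imbalanced)]:
    print(
        f"{name}: sensitivity={sensitivity(counts):.2f} "
        f"specificity={specificity(counts):.2f} "
        f"accuracy={accuracy(counts):.2f}"
    )
```

In the hypothetical imbalanced case the majority class dominates accuracy, so accuracy alone hides the loss of sensitivity. This is one reason to report all three measures when comparing methods.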

Authors

Saber Mohammadi

Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran

Danial Khadivi

Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran

Javad Zahiri

Department of Neuroscience, University of California San Diego, California, USA; Department of Pediatrics, University of California San Diego, La Jolla, CA, USA

Seyed Shahriar Arab

Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran