TM-Bench: A benchmark dataset for thermophilic-mesophilic proteins classification
Publication year: 1400 (Solar Hijri)
Document type: Conference paper
Language: English
The full text of this paper has not been provided and is not available.
National scientific document ID: IBIS10_041
Indexing date: 5 Tir 1401
Abstract:
Machine learning approaches have recently become standard tools for solving biological problems. The thermal stability of thermophilic and hyper-thermophilic proteins makes them suitable candidates for medical and industrial applications. Various machine learning methods have therefore been introduced to predict thermophilic proteins and discriminate them from their mesophilic counterparts based on sequence information. Most of these studies report accuracies above 90 percent, which seems optimistic. An inappropriate dataset can be the main source of this overestimation, and the lack of a gold standard dataset makes it challenging to compare the various approaches. Here we introduce the TM-Bench dataset. Zhang and Fang made the first effort to discriminate thermophilic and mesophilic proteins via pattern recognition methods. Since then, a variety of approaches, such as SVM, artificial neural networks, decision trees, k-nearest neighbors, genetic algorithms, and Naive Bayes, have been adopted to classify thermophilic and non-thermophilic proteins solely from protein sequence information. In this study, we used the BacDive database to extract a list of thermophilic and mesophilic organisms based on their optimum growth temperature. After extracting the corresponding protein sequences from the Swiss-Prot database, we removed redundancies in the primary dataset with the CD-HIT tool. The balanced and imbalanced datasets were then fed to the above-mentioned methods to re-evaluate their performance. Our results indicate that sensitivity, specificity, and accuracy were lower than previously reported for the balanced dataset, and that sensitivity drops dramatically with the imbalanced dataset. Overall, Multi-Layer Perceptron and Logit Boost performed better than the other methods on the balanced dataset, with accuracies of 81% and 78%, sensitivities of 82% and 79%, and specificities of 80% and 77%, respectively.
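The abstract reports sensitivity, specificity, and accuracy but does not define them. The minimal sketch below assumes the usual confusion-matrix definitions, with thermophilic proteins as the positive class. The function names and the toy counts are hypothetical, chosen only for illustration, and are not taken from the paper. It shows how accuracy can stay high on an imbalanced dataset while sensitivity is low.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, TN, FP for a binary labelling (positive = thermophilic)."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    sensitivity = tp / (tp + fn)   # fraction of thermophilic proteins recovered
    specificity = tn / (tn + fp)   # fraction of mesophilic proteins recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical imbalanced toy set: 10 thermophilic (1) and 90 mesophilic (0).
y_true = [1] * 10 + [0] * 90
# Hypothetical predictions: 4 of 10 positives found, 2 false alarms.
y_pred = [1] * 4 + [0] * 6 + [1] * 2 + [0] * 88

sens, spec, acc = metrics(*confusion_counts(y_true, y_pred))
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

With these made-up counts the accuracy is 0.92 even though only 40% of the positive class is detected, which is why reporting accuracy alone on an imbalanced dataset can look optimistic.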
Authors
Saber Mohammadi
Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
Danial Khadivi
Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
Javad Zahiri
Department of Neuroscience, University of California San Diego, California, USA; Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
Seyed Shahriar Arab
Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran