Preprocessing breast cancer dataset to improve data quality for classification
Published in: 12th International Breast Cancer Congress
Publication year: 1394 (Solar Hijri)
Document type: Conference paper
Language: English
The full text of this paper has not been provided and is not available.
National scientific document ID:
ICBCMED12_116
Indexing date: 2 Tir 1397
Abstract:
Introduction: A large amount of breast cancer data is now available in the databases of medical centers. Although valuable knowledge can be discovered from these data sources, important challenges must be addressed first, including data quality (data consistency, correctness, …) and data cleaning (missing data, noise, …). Data mining is a useful and reliable means of knowledge discovery only if the data are of adequate quality. Appropriate data preprocessing methods should therefore be used to eliminate data flaws.

Materials and Methods: This study investigated a breast cancer dataset of patients referred to the Reza Radiation Oncology Center in Mashhad. The study was conducted retrospectively using data on malignant breast cancer patients collected from 2009 to 2014, consisting of 1923 samples and 101 features. The patients were divided into two categories: recurrence and non-recurrence. Data mining was performed with three major classification algorithms: K-Nearest Neighbor, Naïve Bayes, and Sequential Minimal Optimization (SMO). First, classification was carried out on four subsets of the original data features. Then, to improve data quality, irrelevant and non-essential features were removed, errors in the features and samples were eliminated, and some of the features with no content were filled in using rules discovered from the features. Finally, the data mining algorithms were applied to the preprocessed dataset, and the results before and after preprocessing were compared.

Results: Accuracy and sensitivity were used to evaluate the results. The predictions of all three classification algorithms improved after data preprocessing. The accuracy and sensitivity of SMO improved to 99.33 and 84.61, respectively, followed by Naïve Bayes at 98.84 and 77.69 and K-Nearest Neighbor at 98.06 and 70.64.

Conclusion: According to the results, data preprocessing techniques significantly improved data quality and, in turn, the performance of the classification algorithms. Appropriate data preprocessing techniques should be selected and applied in a proper sequence before the classification algorithm is applied.
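The two evaluation measures named in the Results, accuracy and sensitivity, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the 0/1 label encoding (1 = recurrence, 0 = non-recurrence), and the toy labels are all assumptions made for illustration, and the example uses only the standard definitions of the two measures.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn), treating label 1 as the positive class.

    Assumed encoding (hypothetical): 1 = recurrence, 0 = non-recurrence.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if any(label not in (0, 1) for label in list(y_true) + list(y_pred)):
        raise ValueError("labels must be 0 or 1")
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:  # t == 1 and p == 0
            fn += 1
    return tp, tn, fp, fn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of all samples that were classified correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples given")
    return (tp + tn) / total


def sensitivity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true positive-class samples that were detected (recall)."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    if tp + fn == 0:
        raise ValueError("no positive samples given")
    return tp / (tp + fn)


# Toy labels, invented for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

acc = accuracy(y_true, y_pred)
sens = sensitivity(y_true, y_pred)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f}")
```

In this toy case a single missed positive sample lowers sensitivity much more than accuracy, which is one reason to report both measures when the positive class is the smaller one.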
Authors
Zeinab Sajjadnia
M.Sc. student, Computer Software Engineering, Shiraz University of Technology
Seyed Raof Khayami
Assistant Professor, Department of Computer and IT Engineering, Shiraz University of Technology
Seyed Mohammad Reza Moosavi
Assistant Professor, Department of Computer Science and Engineering and IT, Shiraz University
Mahdieh Dayyani
Radiation Oncologist, Director of Education and Research Committee, Reza Radiation Oncology Center in Mashhad