Preprocessing breast cancer dataset to improve data quality for classification

Publication year: 1394 (Solar Hijri calendar)
Document type: Conference paper
Language: English

The full text of this paper has not been provided and is not available.

National scientific document ID:

ICBCMED12_116

Indexing date: 2 Tir 1397 (Solar Hijri calendar)

Abstract:

Introduction: Large amounts of breast cancer data are now available in the databases of medical centers. Although valuable knowledge can be discovered from these sources, important challenges must be addressed first, including data quality (consistency, correctness, …) and data cleaning (missing data, noise, …). Data mining is a useful and reliable means of knowledge discovery only if the data are of adequate quality, so appropriate preprocessing methods should be applied to eliminate data flaws.

Materials and Methods: This study investigated a breast cancer dataset of patients referred to the Reza Radiation Oncology Center in Mashhad. The study was conducted retrospectively using data on malignant breast cancer patients collected from 2009 to 2014, comprising 1923 samples and 101 features. The patients are divided into two categories: recurrence and non-recurrence. Data mining was performed with three major classification algorithms: K-Nearest Neighbor, Naïve Bayes, and Sequential Minimal Optimization (SMO). First, classification was carried out on four subsets of the original features. Then, to improve data quality, irrelevant and non-essential features were removed, errors in the features and samples were eliminated, and some empty feature values were filled in using rules discovered from the features. Finally, the algorithms were run on the preprocessed dataset, and the results before and after preprocessing were compared.

Results: Accuracy and sensitivity were used to evaluate the results. The predictions of all three classification algorithms improved after data preprocessing. SMO achieved the highest values, with accuracy and sensitivity improving to 99.33% and 84.61%, respectively, followed by Naïve Bayes (98.84% and 77.69%) and K-Nearest Neighbor (98.06% and 70.64%).

Conclusion: According to the results, data preprocessing techniques significantly improved data quality and, in turn, the performance of the classification algorithms. Appropriate preprocessing techniques should be selected and applied in a proper sequence before the classification algorithm is run.
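
The preprocessing steps and the two evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the feature names (`record_id`, `had_surgery`, `surgery_type`), and the fill rule are all hypothetical, invented only to show the idea. Sensitivity is computed here as TP / (TP + FN), assuming recurrence is the positive class (label 1).

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Record = Dict[str, Optional[str]]


def drop_features(records: List[Record], irrelevant: Set[str]) -> List[Record]:
    """Return copies of the records without the listed feature names."""
    return [{k: v for k, v in r.items() if k not in irrelevant} for r in records]


def fill_missing(records: List[Record], feature: str,
                 rule: Callable[[Record], Optional[str]]) -> List[Record]:
    """Fill empty values of `feature` using a rule over the other features."""
    filled = []
    for r in records:
        r = dict(r)  # copy, so the input records are left unchanged
        if r.get(feature) is None:
            r[feature] = rule(r)
        filled.append(r)
    return filled


def accuracy_and_sensitivity(y_true: List[int],
                             y_pred: List[int]) -> Tuple[float, float]:
    """Accuracy = correct / total; sensitivity = TP / (TP + FN), positive = 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p != 1)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, sensitivity


# Hypothetical records: "record_id" stands in for an irrelevant feature and
# "surgery_type" for a feature with empty values (both names are invented).
records = [
    {"record_id": "a", "had_surgery": "no", "surgery_type": None},
    {"record_id": "b", "had_surgery": "yes", "surgery_type": "mastectomy"},
]
cleaned = drop_features(records, {"record_id"})
# Hypothetical rule: no surgery implies the surgery type is "none".
cleaned = fill_missing(
    cleaned, "surgery_type",
    lambda r: "none" if r.get("had_surgery") == "no" else None,
)

# Toy labels (1 = recurrence, 0 = non-recurrence); not data from the study.
acc, sens = accuracy_and_sensitivity([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```

With recurrence cases far rarer than non-recurrence cases, accuracy alone can look high even when many recurrences are missed, which is one reason to report sensitivity alongside it.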

Authors

Zeinab Sajjadnia

M.Sc. student, Computer Software Engineering, Shiraz University of Technology

Seyed Raof Khayami

Assistant Professor, Department of Computer and IT Engineering, Shiraz University of Technology

Seyed Mohammad Reza Moosavi

Assistant Professor, Department of Computer Science and Engineering and IT, Shiraz University

Mahdieh Dayyani

Radiation Oncologist, Director of Education and Research Committee, Reza Radiation Oncology Center in Mashhad