Missing data in machine learning: Evaluation of missing data in the dataset obtained from a study to determine the risk of mortality in critically ill COVID-19 patients with kidney disease as an example of a dataset in the field of medical research
Published in: First International Congress on Artificial Intelligence in Medical Sciences
Year of publication: 1402 (Solar Hijri calendar)
Document type: Conference paper
Language: English
National scientific document ID:
AIMS01_152
Indexing date: 1 Mordad 1402
Abstract:
Background and aims: Missing data is an important issue in machine learning. Missing values are common in medical research and cause problems such as performance degradation, difficulties in data analysis, and biased outcomes caused by differences between missing and complete values. In this study, we analyzed a medical dataset to determine the magnitude of its missing data.

Method: We evaluated the dataset from a study that prospectively compared the risk of mortality and length of hospitalization of 296 patients with COVID-19, with and without kidney disease, admitted to the ICU at Imam Khomeini Hospital in Sari from February to August 2020. We used the missingno Python library and Little's MCAR test to identify and visualize missing data and to assess the mechanisms that produce it: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

Results: The dataset comprised 37 variables and 10,952 values, of which about 849 were missing. The mechanisms behind the missing data were MAR and MNAR: 117 cases of MAR and 739 of MNAR, based on visual interpretation of the matrix plot from the missingno package in Python. There was no case of MCAR (Little's MCAR test, p-value < 0.05). The percentage of missing data was less than 5% in 27 variables, 5-50% in 8 variables, and more than 50% in 2 variables.

Conclusion: Even carefully designed studies in the medical field may contain missing data. Identifying and appropriately handling missing data is a key step before applying machine learning algorithms.
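The per-variable percentages and the overall figures reported in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it uses only the Python standard library rather than missingno, the function names (`missing_fraction_by_variable`, `bucket`, `summarize`) are hypothetical, the toy records and their variable names are invented for illustration, and missing values are assumed to be encoded as `None`. Only the counts 296, 37, and 849 come from the abstract.

```python
from collections import Counter


def missing_fraction_by_variable(rows, variables):
    """Return {variable: fraction of rows whose value is None or absent}."""
    n = len(rows)
    return {v: sum(1 for r in rows if r.get(v) is None) / n for v in variables}


def bucket(fraction):
    """Map a missing-data fraction to the three bands used in the abstract."""
    if fraction < 0.05:
        return "<5%"
    if fraction <= 0.50:
        return "5-50%"
    return ">50%"


def summarize(rows, variables):
    """Count how many variables fall into each missing-data band."""
    fractions = missing_fraction_by_variable(rows, variables)
    return Counter(bucket(f) for f in fractions.values())


# Hypothetical toy data (4 patients, 3 variables); NOT taken from the study.
toy_rows = [
    {"age": 61, "creatinine": 1.4, "ferritin": None},
    {"age": 70, "creatinine": None, "ferritin": None},
    {"age": 55, "creatinine": 0.9, "ferritin": None},
    {"age": 48, "creatinine": 1.1, "ferritin": 300},
]
toy_vars = ["age", "creatinine", "ferritin"]
toy_summary = summarize(toy_rows, toy_vars)

# Overall figures as reported in the abstract.
n_patients, n_variables, n_missing = 296, 37, 849
total_cells = n_patients * n_variables            # values in the dataset
overall_missing_pct = 100 * n_missing / total_cells

print(dict(toy_summary))
print(total_cells, round(overall_missing_pct, 2))
```

With the reported counts, 296 patients by 37 variables gives the 10,952 values stated in the Results, and 849 missing values correspond to roughly 7.75% of all cells. For a pandas DataFrame, `df.isna().mean()` yields the same per-variable fractions.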
Authors
Masoud Alyali
Mazandaran University of Medical Sciences, Sari, IRAN
Najmeh Sadeghian
Mazandaran University of Medical Sciences, Sari, IRAN
Zeynab Barzegar
School of Advanced Technologies in Medicine, Iran University of Medical Sciences, Tehran, IRAN
Hamidreza Sadeghsalehi
School of Advanced Technologies in Medicine, Iran University of Medical Sciences, Tehran, IRAN
Hamid Ariannejad
School of Advanced Technologies in Medicine, Iran University of Medical Sciences, Tehran, IRAN
Ashkan Vatankhah
School of Advanced Technologies in Medicine, Iran University of Medical Sciences, Tehran, IRAN