Missing data in machine learning: Evaluation of missing data in the dataset obtained from a study to determine the risk of mortality in critically ill COVID-19 patients with kidney disease as an example of a dataset in the field of medical research

Publication year: 1402 (Solar Hijri)
Document type: Conference paper
Language: English
The full text of this paper has not been provided and is not available.

National scientific document ID:

AIMS01_152

Indexing date: 1 Mordad 1402 (23 July 2023)

Abstract:

Background and aims: Missing data is an important issue in machine learning. Missing values are common in medical research and cause problems such as degraded model performance, difficulties in data analysis, and biased outcomes arising from differences between missing and complete values. In this study, we analyzed a medical dataset to assess the magnitude of its missing data.

Method: We evaluated the dataset from a study that prospectively compared the risk of mortality and length of hospitalization of 296 COVID-19 patients, with and without kidney disease, admitted to the ICU of Imam Khomeini Hospital in Sari from February to August 2020. The dataset was evaluated using the missingno Python library and Little's MCAR test to identify and visualize missing data and to determine the mechanisms that generated it: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

Results: The study used 37 variables. The resulting dataset contained 10952 values, of which about 849 were missing. The mechanisms behind these missing data were MAR and MNAR, with 117 cases of MAR and 739 of MNAR (based on visual interpretation of the matrix plot from the missingno package in Python); there was no case of MCAR (Little's MCAR test, p-value < 0.05). The percentage of missing data was less than 5% in 27 variables, 5-50% in 8 variables, and more than 50% in 2 variables.

Conclusion: Even carefully designed studies in the medical field may contain missing data. Identifying and appropriately handling missing data is a key step before applying machine learning algorithms.
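
The per-variable summary reported in the Results can be sketched roughly as follows. This is a minimal illustration, not the authors' code. The data are synthetic stand-ins with the same shape as the study dataset (296 patients by 37 variables). The column names, the injected missingness rates, and the helper names `summarize_missingness` and `plot_missing_matrix` are all assumptions made for the example. Little's MCAR test is not shown.

```python
import numpy as np
import pandas as pd


def summarize_missingness(df: pd.DataFrame) -> dict:
    """Count missing cells and bin variables by their percentage of missing values."""
    pct_missing = df.isna().mean() * 100  # per-variable percentage of missing values
    return {
        "total_cells": int(df.size),
        "missing_cells": int(df.isna().sum().sum()),
        "vars_below_5pct": int((pct_missing < 5).sum()),
        "vars_5_to_50pct": int(((pct_missing >= 5) & (pct_missing <= 50)).sum()),
        "vars_above_50pct": int((pct_missing > 50).sum()),
    }


def plot_missing_matrix(df: pd.DataFrame):
    """Draw the missingno matrix plot (needs the optional missingno package)."""
    import missingno as msno  # imported lazily so the summary works without it

    return msno.matrix(df)


# Synthetic stand-in data (NOT the study data): 296 rows x 37 columns.
rng = np.random.default_rng(0)
n_patients, n_variables = 296, 37
df = pd.DataFrame(
    rng.normal(size=(n_patients, n_variables)),
    columns=[f"var_{i}" for i in range(n_variables)],  # hypothetical names
)

# Inject missing values at arbitrary, illustrative rates.
df.iloc[:200, 0] = np.nan  # 200 of 296 rows missing (more than 50%)
df.iloc[:30, 1] = np.nan   # 30 of 296 rows missing (between 5% and 50%)
df.iloc[:5, 2] = np.nan    # 5 of 296 rows missing (less than 5%)

summary = summarize_missingness(df)
print(summary)
```

The total of 296 x 37 = 10952 cells matches the number of values reported in the abstract. Calling `plot_missing_matrix(df)` would draw the kind of matrix plot the authors interpreted visually, assuming missingno and matplotlib are installed.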

Authors

Masoud Alyali

Mazandaran University of Medical Sciences, Sari, IRAN

Najmeh Sadeghian

Mazandaran University of Medical Sciences, Sari, IRAN

Zeynab Barzegar

School of Advanced Technologies in Medicine, Iran University of Medical Sciences, Tehran, IRAN

Hamidreza Sadeghsalehi

School of Advanced Technologies in Medicine, Iran University of Medical Sciences, Tehran, IRAN

Hamid Ariannejad

School of Advanced Technologies in Medicine, Iran University of Medical Sciences, Tehran, IRAN

Ashkan Vatankhah

School of Advanced Technologies in Medicine, Iran University of Medical Sciences, Tehran, IRAN