Structure-Aware Initialization for K-Means Clustering: An IQR-Weighted Approach to Mitigate Outlier Impact
Published in: 4th International and 9th National Conference on Computer, Information Technology and Applications of Artificial Intelligence
Publication year: 1404 (Solar Hijri calendar)
Document type: Conference paper
Language: English
The full text of this paper is available as a 5-page PDF.
National scientific document ID:
CEITCONF09_023
Indexing date: 24 Khordad 1405
Abstract:
The K-Means algorithm remains a fundamental tool in data mining, and its linear time complexity has made it extremely popular. However, its performance is sensitive to the choice of initial seed points, and it often converges to local optima. Although probabilistic seeding with K-Means++ is theoretically more effective, its results are still stochastic, so it requires more computation because of multiple passes over the dataset. Moreover, traditional deterministic seed selection methods ignore the differing discriminative power of features in high-dimensional datasets and therefore treat noise features and signal features as equivalent. This paper proposes a new deterministic seed selection algorithm, the IQR Weighted Initializer, which uses Interquartile Range (IQR) values to weight feature importance when choosing seed points. By assigning higher importance to structural features with large dispersion, the algorithm suppresses the effect of outliers. Experiments on 10 UCI datasets show that the proposed algorithm outperforms the current best deterministic and hierarchical algorithms. On datasets where outliers play a significant role, such as Glass Identification, it reduces the Sum of Squared Errors (SSE) by approximately 7% relative to Bisecting K-Means, avoiding the local optima in which hierarchical algorithms become stuck. In addition, validation with the Silhouette Score and the Adjusted Rand Index shows that the algorithm groups the data into better-structured clusters that are closer to the actual labels.
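The abstract does not spell out the selection rule, so the following is only a minimal sketch of how IQR-based feature weights could drive a deterministic initializer. The function names (`iqr_weights`, `iqr_weighted_seeds`), the normalisation of weights to sum to 1, and the sort-and-split seeding rule are all assumptions made for illustration and should not be read as the authors' method. It uses only the Python standard library.

```python
from statistics import median, quantiles


def iqr_weights(X):
    """Per-feature weights proportional to each feature's interquartile range.

    Assumption: weights are simply IQR values normalised to sum to 1.
    """
    iqrs = []
    for column in zip(*X):
        q1, _, q3 = quantiles(column, n=4, method="inclusive")
        iqrs.append(q3 - q1)
    total = sum(iqrs)
    if total == 0:
        # Every feature has zero IQR: fall back to uniform weights.
        return [1.0 / len(iqrs)] * len(iqrs)
    return [value / total for value in iqrs]


def iqr_weighted_seeds(X, k):
    """Deterministically pick k seed points (hypothetical seeding rule).

    Points are ranked by an IQR-weighted, median-centred score, the ranking
    is cut into k contiguous chunks, and each seed is the feature-wise
    median of one chunk. No random numbers are involved.
    """
    n = len(X)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    weights = iqr_weights(X)
    centre = [median(column) for column in zip(*X)]

    def score(point):
        # High-IQR features dominate the ranking; low-IQR features barely move it.
        return sum(w * (x - c) for w, x, c in zip(weights, point, centre))

    ranked = sorted(X, key=score)
    bounds = [i * n // k for i in range(k + 1)]
    seeds = []
    for start, stop in zip(bounds, bounds[1:]):
        chunk = ranked[start:stop]
        seeds.append([median(column) for column in zip(*chunk)])
    return seeds


if __name__ == "__main__":
    # Toy data: feature 0 separates two groups, feature 1 is near-constant
    # except for a single extreme value in the last row.
    data = [
        [1.0, 0.0], [1.2, 0.1], [0.8, -0.1],
        [5.0, 0.0], [5.2, 0.1], [4.8, -0.1],
        [1.1, 50.0],
    ]
    print("weights:", iqr_weights(data))
    print("seeds:  ", iqr_weighted_seeds(data, k=2))
```

In this toy case the second feature receives a small weight because its IQR is tiny despite the extreme value, and taking chunk medians instead of means keeps that value from pulling a seed away from the bulk of the data. The resulting seeds could then be passed to any K-Means implementation that accepts explicit initial centers.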
Keywords:
Authors
Mohammad Hamzeei
Department of Computer Engineering, Birjand University of Technology, Birjand, Iran
Mostafa Sabzekar
Department of Computer Engineering, Birjand University of Technology, Birjand, Iran