Density Measure in Context Clustering for Distributional Semantics of Word Sense Induction

Masood Ghayoomi

Published in: Journal of Information Systems and Telecommunication, Volume: 8, Issue: 1

Year of publication: 1399 (Solar Hijri)

Document type: Journal article

Language: English

The full text of this article is available as a 10-page PDF file.

This article is classified under the following subject areas:

Artificial Intelligence > Machine Learning

Permanent link to this article:

https://civilica.com/doc/1352390

National scientific document ID:

JR_JIST-8-1_002

Indexing date: 28 Azar 1400 (Solar Hijri)

Abstract:

Word Sense Induction (WSI) aims to induce word senses from data without using prior knowledge. Because no labeled data is used, researchers have turned to clustering techniques for this task. Clustering algorithms are of two types: parametric and non-parametric. Although non-parametric clustering algorithms are in principle more suitable for inducing word senses, their shortcomings make them impractical. Parametric clustering algorithms show competitive results, but they suffer from a major problem: the number of clusters must be fixed in advance. The main contribution of this paper is to show that using the silhouette score, normally an internal evaluation metric, to measure cluster density in a parametric clustering algorithm such as K-means captures word senses in the WSI task better than the state-of-the-art models. To this end, a word embedding approach is used to represent the contextual information of words as vectors. To capture the context in the vectors, we propose two experimental modes: building the vectors from either the whole sentence or a limited number of surrounding words in the local context of the target word. The experimental results based on the V-measure evaluation metric show that the two modes of our proposed model outperform the state-of-the-art models by 4.48% and 5.39%. Moreover, the average and maximum numbers of clusters in the outputs of our proposed models are close to those in the gold data.
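
The idea of using the silhouette score to choose the number of K-means clusters can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`kmeans`, `silhouette`, `choose_k`), the Euclidean distance, the farthest-point initialisation (chosen here only to keep the run deterministic), and the toy two-dimensional vectors standing in for embedding-based context vectors are all assumptions made for illustration.

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors (an assumed choice of metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50):
    """Plain K-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from its nearest existing centre.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the centre of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def choose_k(points, k_min=2, k_max=6):
    """Run K-means for each candidate k and keep the k with the best silhouette."""
    scores = {}
    for k in range(k_min, min(k_max, len(points) - 1) + 1):
        scores[k] = silhouette(points, kmeans(points, k))
    return max(scores, key=scores.get), scores

# Hypothetical context vectors for one target word (three well-separated groups).
contexts = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
            (5.0, 5.0), (5.2, 5.1), (5.1, 4.8),
            (0.0, 5.0), (0.2, 5.2), (0.1, 4.9)]

best_k, scores = choose_k(contexts)
print(best_k, scores)
```

On this toy data the three-cluster solution has the highest silhouette score, so the number of induced senses is selected from the data instead of being fixed in advance. With real data, each context vector would be built from word embeddings of the whole sentence or of a local window around the target word, as the abstract describes.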

Keywords:

Word Sense Induction, Word Embedding, Clustering, Silhouette Score, Unsupervised Machine Learning, Distributional Semantics, Density

Authors

Masood Ghayoomi

Faculty of Linguistics, Institute for Humanities and Cultural Studies, Tehran, Iran