An Analysis of Text Similarity Measures: Introducing a Lin Wang similarity measure

  • Publication year: 1402 (Solar Hijri calendar)
  • Venue: 5th International Conference on Soft Computing
  • COI code: CSCG05_058
  • Paper language: English

Authors

Alireza Pakgohar

Department of Statistics, Payame Noor University (PNU);

Mehdi Fazli Aghdaei

Department of Mathematics, Payame Noor University, Tehran, Iran;

Abstract

Accurately measuring the similarity between texts is crucial for many natural language processing tasks, from plagiarism detection to information retrieval. This paper surveys several approaches to calculating text similarity and examines their strengths and limitations. We begin with character-based methods, including the Jaro and N-gram algorithms, which are suited to detecting typos and minor edits. We then address semantic and corpus-based approaches, which capture more of the meaning and context of a text. These include the Dice coefficient, Euclidean distance, and cosine distance, which compare texts through set intersections and vector representations. Finally, we introduce the Lin-Wong similarity measure, a statistical measure that quantifies the commonality between the probability distributions of words in two texts and can therefore capture semantic similarity. Comparing these methods highlights the importance of choosing the right measure for the specific task and dataset. The paper also identifies avenues for future research, suggesting that knowledge graphs and deep learning techniques may further refine text similarity measurement. The survey is intended to give researchers and practitioners a basis for analyzing and comparing textual data.
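
As a rough illustration of three of the measures named in the abstract, the following Python sketch computes a Dice coefficient on character bigrams, a cosine similarity on word counts, and a similarity derived from the Lin-Wong directed divergence. It is not taken from the paper. The divergence is assumed to have the form K(p, q) = sum_i p_i log(2 p_i / (p_i + q_i)), and the rescaling 1 - K / log 2 used to turn it into a similarity is a hypothetical choice made here, as are the whitespace tokenization and all function names.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dice_coefficient(a, b, n=2):
    """Dice coefficient on n-gram sets: 2 * |A & B| / (|A| + |B|)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0  # two strings with no n-grams are treated as identical
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def word_counts(text):
    """Bag of words from whitespace tokenization (a simplifying assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of two texts."""
    ca, cb = word_counts(a), word_counts(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def lin_wong_divergence(a, b):
    """Assumed form: K(p, q) = sum_i p_i * log(2 * p_i / (p_i + q_i)).

    p and q are the relative word frequencies of texts a and b.
    Words absent from a contribute nothing, since p_i = 0 there.
    """
    ca, cb = word_counts(a), word_counts(b)
    na, nb = sum(ca.values()), sum(cb.values())
    if na == 0 or nb == 0:
        raise ValueError("both texts must contain at least one word")
    total = 0.0
    for word, count in ca.items():
        p = count / na
        q = cb[word] / nb  # Counter returns 0 for unseen words
        total += p * math.log(2 * p / (p + q))
    return total


def lin_wong_similarity(a, b):
    """Hypothetical rescaling of the divergence to the range [0, 1].

    K lies between 0 (identical distributions) and log 2 (disjoint
    vocabularies), so 1 - K / log 2 is 1 for identical word
    distributions and 0 when the texts share no words.
    """
    return 1.0 - lin_wong_divergence(a, b) / math.log(2)
```

For example, `dice_coefficient("night", "nacht")` compares the bigram sets {ni, ig, gh, ht} and {na, ac, ch, ht}, which share one element out of eight in total, giving 0.25. Note that the divergence as written is directed, so `lin_wong_similarity(a, b)` and `lin_wong_similarity(b, a)` can differ; a symmetric variant would average the two directions.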

Keywords

Lin-Wong Divergence, Similarity Measure, Editing Distance, Text Mining, Similarity Algorithm, Distance Measure
