Topic Detection on COVID-19 Tweets: A Comparative Study on Clustering and Transfer Learning Models

  • Publication year: 1401 (Solar Hijri)
  • Published in: Tabriz Journal of Electrical Engineering (quarterly), Volume 52, Issue 4
  • COI code: JR_TJEE-52-4_007
  • Article language: Persian

Authors

Elnaz Zafarani-Moattar

Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran

Mohammad Reza Kangavari

Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran

Amir Masoud Rahmani

Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran

Abstract

Automatic topic detection has become essential in social media analysis because of the large volume of text that users generate. Clustering-based methods are among the most important and current approaches to topic detection, and this research aims to study that category broadly. To that end, the paper examines the main components of clustering-based topic detection: embedding methods, distance metrics, and clustering algorithms. Transfer learning, and with it pretrained language models and word embeddings, has received considerable attention in recent years. Given the importance of embedding methods, the paper compares the efficiency of five embedding methods, ranging from earlier to recent ones. For the study, the authors implemented two commonly used distance metrics and five clustering algorithms that are important in topic detection. Because COVID-19 has become a trending topic on social networks in recent years, the study uses a dataset of one month of tweets collected with COVID-19-related hashtags. More than 7500 experiments were performed to set the tunable parameters. All combinations of embedding methods, distance metrics, and clustering algorithms (50 combinations) were then evaluated using the Silhouette metric. The results show that T5 strongly outperforms the other embedding methods, cosine distance is slightly better than the other distance metric, and DBSCAN is superior to the other clustering algorithms.
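
To make the evaluation step concrete, the following is a minimal sketch of how a Silhouette score could be computed over embedded texts using cosine distance. It is not the authors' implementation: the function names, the toy two-dimensional vectors, and the cluster labels are all assumptions made for illustration, and real tweet embeddings would have hundreds of dimensions.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def silhouette_score(points, labels, dist=cosine_distance):
    """Mean silhouette coefficient over all points.

    Points in singleton clusters are scored 0, a common convention.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy "embeddings": two groups pointing in different directions.
embeddings = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
good_labels = [0, 0, 1, 1]  # groups vectors with similar direction
bad_labels = [0, 1, 0, 1]   # mixes the two directions

good = silhouette_score(embeddings, good_labels)
bad = silhouette_score(embeddings, bad_labels)
print(f"good clustering: {good:.3f}, bad clustering: {bad:.3f}")
```

A higher score means points sit closer to their own cluster than to the nearest other cluster, which is why the metric can rank combinations of embedding method, distance metric, and clustering algorithm without labeled topics.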

Keywords

Topic Detection, Transfer Learning, Embedding Methods, Distance Metrics, Clustering Methods, COVID-19
