Investigating Hostile Post Detection in Gujarati: A Machine Learning Approach

Year of publication: 1403 (Solar Hijri)
Published in: International Journal of Engineering (monthly), Volume: 37, Issue: 7
COI code: JR_IJE-37-7_008
Article language: English

Authors

Department of Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Department of Computer Engineering, Sarvajanik College of Engineering and Technology, Gujarat Technological University, Ahmedabad, Gujarat, India

Abstract

Hostile posts on social media are a serious problem for individuals, governments and organizations. There is a critical need for an automated system that can investigate and identify hostile posts in large-scale data. Gujarati is the sixth most spoken language in India. In this work, we construct a large hostile post dataset in the Gujarati language, with data collected from Twitter, Instagram and Facebook. The dataset consists of 151,000 distinct comments, of which 10,000 posts are manually annotated and labeled as either Hostile or Non-Hostile. We use the dataset in two forms: (i) the original Gujarati text and (ii) English text translated from the Gujarati text. We also compare performance with and without pre-processing, where pre-processing removes extra symbols and replaces emojis with their textual descriptions. We conduct experiments with supervised machine learning models (Support Vector Machine, Decision Tree, Random Forest, Gaussian Naive Bayes, Logistic Regression and K-Nearest Neighbor) and with an unsupervised model (k-means clustering). We evaluate these models with Bag-of-Words and TF-IDF feature extraction and observe that classification using TF-IDF features is effective. Among these models, Logistic Regression performs best, with an accuracy of 0.68 and an F1-score of 0.67. The purpose of this research is to create a benchmark dataset and provide baseline results for detecting hostile posts in the Gujarati language.
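
The pre-processing and feature extraction steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `preprocess` and `tfidf` are hypothetical, the use of Unicode character names as emoji descriptions is an assumption (the abstract does not say where its descriptions come from), and the smoothed IDF formula is one common variant rather than the weighting the paper necessarily used. Only the Python standard library is used.

```python
import math
import unicodedata
from collections import Counter

# Zero-width joiner / non-joiner, kept because Indic scripts such as
# Gujarati use them inside words (assumption: not treated as "extra symbols").
JOINERS = {"\u200c", "\u200d"}


def preprocess(text):
    """Remove extra symbols and substitute emoji descriptions (illustrative)."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "So":
            # "Symbol, other" covers most emojis; use the lowercased Unicode
            # name as a stand-in description (assumption, see lead-in).
            kept.append(" " + unicodedata.name(ch, "").lower() + " ")
        elif category[0] in "LMN" or ch in JOINERS:
            # Letters, combining marks (e.g. Gujarati vowel signs) and digits.
            kept.append(ch)
        else:
            # Punctuation, hashes, control characters and so on become spaces.
            kept.append(" ")
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(kept).split())


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Term counts per document are the Bag-of-Words representation; each
    count is scaled by document length and by a smoothed inverse document
    frequency, ln((1 + n) / (1 + df)) + 1.
    """
    tokenized = [preprocess(doc).split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # Bag-of-Words counts for this document
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors
```

Calling `tfidf` on a list of raw posts yields sparse weight dictionaries in which terms shared by many posts are down-weighted relative to terms that appear in few. These vectors could then be passed to any of the classifiers listed in the abstract, such as Logistic Regression.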

Keywords

Hostile Text Detection, Machine Learning, Hate Text Detection, Text Classification, Gujarati Text Dataset
