Classification of Persian News Articles using Machine Learning Techniques

  • Year of publication: 1400 (Solar Hijri calendar)
  • Published in: Journal of Computer and Knowledge Engineering, Volume 4, Issue 1
  • COI code: JR_CKE-4-1_001
  • Article language: English

Authors

Sareh Mostafavi

Department of Computational Linguistics, Regional Information Center for Science and Technology (RICeST), Shiraz, Fars, Iran

Bahareh Pahlevanzadeh

Department of Design and System Operations, Regional Information Center for Science and Technology (RICeST), Shiraz, Fars, Iran

Mohammad Reza Falahati Qadimi Fumani

Department of Computational Linguistics, Regional Information Center for Science and Technology (RICeST), Shiraz, Fars, Iran

Abstract

Automatic text classification, the process of automatically assigning texts to predefined categories, has many everyday applications and has recently gained much attention because of the increased number of text documents available in electronic form. Classifying news articles is one such application. Automatic classification is a machine learning task in which a classifier is built by learning from pre-classified documents. Naïve Bayes and k-Nearest Neighbor are among the most common machine learning algorithms for text classification. In this paper, we suggest a way to improve the performance of a text classifier using the mutual information (MI) and chi-square feature selection algorithms. We have observed that the MI feature selection method can improve the accuracy of the Naïve Bayes classifier by up to 10%. Experimental results show that the proposed model achieves an average accuracy of 80% and an average F1-measure of 80%.
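
The abstract does not give the exact MI variant, the preprocessing, or the number of features the authors kept, so the following is only a minimal sketch of the general idea: score terms by mutual information with the class labels, keep the top k, and train a multinomial Naïve Bayes classifier on the reduced vocabulary. The function names, the add-one smoothing, the "highest MI over classes" scoring rule, and the toy English documents are all assumptions made for illustration, not the paper's setup.

```python
import math
from collections import Counter, defaultdict

def mutual_information(docs, labels):
    """Score each term by its highest MI with any class (document-level presence)."""
    n = len(docs)
    class_count = Counter(labels)
    term_class = defaultdict(Counter)  # term -> class -> number of docs containing it
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_class[term][label] += 1
    scores = {}
    for term, per_class in term_class.items():
        n_term = sum(per_class.values())
        best = 0.0
        for c, n_c in class_count.items():
            n11 = per_class[c]           # term present, class c
            n10 = n_term - n11           # term present, other classes
            n01 = n_c - n11              # term absent, class c
            n00 = n - n11 - n10 - n01    # term absent, other classes
            cells = ((n11, n_term, n_c), (n10, n_term, n - n_c),
                     (n01, n - n_term, n_c), (n00, n - n_term, n - n_c))
            mi = 0.0
            for n_tc, n_t, n_cls in cells:
                if n_tc > 0:  # empty cells contribute nothing to the sum
                    mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_cls))
            best = max(best, mi)
        scores[term] = best
    return scores

def select_features(docs, labels, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = mutual_information(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])

def train_naive_bayes(docs, labels, vocab):
    """Multinomial Naive Bayes with add-one smoothing, restricted to vocab."""
    class_count = Counter(labels)
    term_count = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_count[label].update(t for t in tokens if t in vocab)
    model = {}
    for c in class_count:
        total = sum(term_count[c].values()) + len(vocab)
        log_prior = math.log(class_count[c] / len(docs))
        log_like = {t: math.log((term_count[c][t] + 1) / total) for t in vocab}
        model[c] = (log_prior, log_like)
    return model

def predict(model, tokens):
    """Pick the class with the highest log posterior; unseen terms are ignored."""
    def score(c):
        log_prior, log_like = model[c]
        return log_prior + sum(log_like[t] for t in tokens if t in log_like)
    return max(sorted(model), key=score)

# Hypothetical toy corpus (already tokenized); not data from the paper.
docs = [
    ["team", "wins", "match", "today"],
    ["team", "loses", "match", "today"],
    ["market", "stocks", "fall", "today"],
    ["market", "bank", "rates", "today"],
]
labels = ["sport", "sport", "economy", "economy"]

selected = select_features(docs, labels, k=3)
model = train_naive_bayes(docs, labels, selected)
print(sorted(selected))
print(predict(model, ["team", "match", "tonight"]))
```

In this toy corpus the word "today" occurs in every document, so its MI with either class is zero and it is dropped, while the terms that occur only in one class are kept. Chi-square selection would slot into the same place by swapping the scoring function.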

Keywords

Automatic Persian text classification, k-Nearest Neighbor, Naïve Bayes, News text classification, Text mining
