Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

سال انتشار: 1398
نوع سند: مقاله ژورنالی
زبان: انگلیسی
مشاهده: 379

فایل این مقاله در 12 صفحه با فرمت PDF قابل دریافت می باشد

استخراج به نرم افزارهای پژوهشی:

لینک ثابت به این مقاله:

شناسه ملی سند علمی:

JR_JACET-5-2_006

تاریخ نمایه سازی: 18 تیر 1398

چکیده مقاله:

Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for the creation of Arabic text corpora. In particular, we create a text classification process for Arabic news articles downloaded from web news portals and sites. The suggested procedure is a pilot project that uses some human predefined set of documents that have been assigned to some subjects or categories. A vectorized Term Frequency, Inverse Document Frequency (TF-IDF) based information processing was used for the initial verification of the categories. The resulting validated categories used to predict categories for new documents. The experiment used 1000 initial documents pre-assigned into five categories of each with 200 documents assigned. An initial set of 2195 documents were downloaded from a number of Arabic news sources. They were pre-processed for use in testing the utility of the suggested classification procedure using the cosine similarity as a classifier. Results were very encouraging with very satisfying precision, recall and F1-score. It is the intention of the authors to improve the procedure and to use it for Arabic corpora creation.

نویسندگان

Mohamed Elhadi

Computer Technology Department, Faculty of Information Technology, Zawia University, Libya