A text classification method based on combination of Information gain and graph clustering

  • Publication year: 1398 (Solar Hijri calendar)
  • Published in: International Journal of Communication and Information Technology, Volume 11, Issue 4
  • Dedicated COI code: JR_ITRC-11-4_007
  • Article language: English
  • View count: 242

Authors

Alireza Abdollahpouri

University of Kurdistan

Shadi Rahimi

University of Kurdistan

Fatemeh Zamani

University of Kurdistan

Parham Moradi

University of Kurdistan

Abstract

Text classification has a wide range of applications, such as spam filtering, automated indexing of scientific articles, identification of document genre, and news monitoring. Text datasets usually contain a great deal of irrelevant and noisy information, which reduces the efficiency and increases the cost of classification. Feature selection methods are therefore widely used to handle the high dimensionality of the data. In this paper, a novel feature selection method based on the combination of information gain and the FAST algorithm is proposed. In the proposed method, the information gain of each feature is calculated first, and the features with higher information gain are selected. The FAST algorithm, which uses graph-theoretic clustering methods, is then applied to the selected features. To evaluate the performance of the proposed method, we carry out experiments on three text datasets and compare our algorithm with several feature selection techniques. The results confirm that the proposed method produces a smaller feature subset in a shorter time. In addition, the evaluation of a K-nearest neighbor classifier on validation data shows that the proposed algorithm gives higher classification accuracy.
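The first stage described in the abstract, ranking features by information gain and keeping the highest-scoring ones, can be illustrated with a minimal sketch. This is not the authors' implementation: the binary "term occurs in the document" feature model, the toy documents, and the names `entropy`, `information_gain`, and `top_k_terms` are all assumptions made for illustration, and the second stage (graph-theoretic clustering with FAST) is not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'.

    docs is a list of sets of terms; labels is the parallel list of classes.
    """
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    total = len(labels)
    conditional = (
        (len(present) / total) * entropy(present)
        + (len(absent) / total) * entropy(absent)
    )
    return entropy(labels) - conditional


def top_k_terms(docs, labels, k):
    """Return the k terms with the highest information gain.

    Ties keep alphabetical order because the vocabulary is pre-sorted
    and Python's sort is stable.
    """
    vocab = sorted(set().union(*docs))
    ranked = sorted(
        vocab,
        key=lambda t: information_gain(docs, labels, t),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy corpus: each document is a set of terms.
docs = [
    {"free", "offer", "win"},
    {"free", "meeting"},
    {"meeting", "agenda"},
    {"win", "offer"},
]
labels = ["spam", "ham", "ham", "spam"]

selected = top_k_terms(docs, labels, 3)
print(selected)
```

In this toy corpus, "free" appears equally often in both classes, so its information gain is zero and it is ranked last. A real pipeline would pass the retained terms to the clustering stage to remove redundant features.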

Keywords

Feature selection, Information gain, text categorization, FAST algorithm
