Language detection for classification and content-based web pages filtering

  • Publication year: 1390
  • Venue: National Conference on Electronic City
  • COI code: IAUHNCEC01_064
  • Paper language: English

Authors

Saman Bashbaghi

Computer Engineering Dept., Bu-Ali Sina University, Hamedan, Iran

Hassan Khotanlou

Computer Engineering Dept., Bu-Ali Sina University, Hamedan, Iran

Abstract

As the number of documents on the internet grows daily, automatic language detection is becoming more important. In this paper we use a language detection system to classify and filter immoral web pages based on their content. The system can detect the 10 languages most used in immoral web pages, including Farsi. We introduce a new combined method that consists of three parts: a URL processor, a page encoding processor, and a text processor. To produce a final result, the system has a voter that combines the results of these three parts. We used immoral web pages and labeled web pages as the input data set, both to build a linguistic model for each language and to evaluate the system. Our experiments show 95% accuracy. No single part can solve the problem by itself: the name used in a page's address may not reveal that the page is immoral, and many web pages in different languages may use the same encoding. The paper shows that combining the three methods gives very promising results. The paper is organized into related work, problem definition, the proposed solution, interpretation of results, and conclusion and future work.
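
The three-part design with a voter can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not describe the voting rule, the features, or the language models, so the weighted sum, the weights, the function names, and the small lookup tables below are all assumptions made for illustration only.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical lookup tables (illustrative only; not taken from the paper).
TLD_HINTS = {"ir": "fa", "de": "de", "fr": "fr", "ru": "ru"}
ENCODING_HINTS = {
    "windows-1256": ["ar", "fa"],
    "windows-1251": ["ru"],
    "iso-8859-1": ["en", "fr", "de"],  # one encoding shared by several languages
}
STOPWORDS = {
    "en": {"the", "and", "of", "is"},
    "fr": {"le", "la", "et", "est"},
    "de": {"der", "die", "und", "ist"},
}

def url_votes(url):
    """URL processor: guess a language from the top-level domain, if known."""
    host = urlparse(url).hostname or ""
    lang = TLD_HINTS.get(host.rsplit(".", 1)[-1])
    return {lang: 1.0} if lang else {}

def encoding_votes(encoding):
    """Encoding processor: split one vote across all languages using this encoding."""
    langs = ENCODING_HINTS.get(encoding.lower(), [])
    return {lang: 1.0 / len(langs) for lang in langs}

def text_votes(text):
    """Text processor: score languages by the share of stopword hits."""
    words = text.lower().split()
    hits = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    total = sum(hits.values())
    return {lang: h / total for lang, h in hits.items() if h} if total else {}

def vote(url, encoding, text, weights=(1.0, 1.0, 2.0)):
    """Voter: weighted sum of the three parts (weights are an assumption)."""
    parts = (url_votes(url), encoding_votes(encoding), text_votes(text))
    tally = Counter()
    for weight, votes in zip(weights, parts):
        for lang, score in votes.items():
            tally[lang] += weight * score
    return tally.most_common(1)[0][0] if tally else None
```

For example, `vote("http://example.com/x", "iso-8859-1", "le chat et la souris")` gets no hint from the URL and an ambiguous hint from the encoding, so the text processor decides the outcome. This mirrors the abstract's point that no single part is enough on its own.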

Keywords

Text classification; automatic language detection; web page filtering; immoral web pages
