Language detection for classification and content-based web pages filtering

Publication year: 1390 (Solar Hijri)
Document type: Conference paper
Language: English

The full text of this paper is available as a 5-page PDF.

Indexing date: 18 Tir 1391

Abstract:

As the number of documents on the internet grows daily, automatic language detection is becoming more important. In this paper we use a language detection system to classify and filter immoral web pages based on their content. The system detects the 10 languages most used in immoral web pages, including Farsi. We introduce a new combined method that consists of three parts: a URL processor, a page encoding processor, and a text processor. To produce its final result, the system has a voter that combines the outputs of these three parts. We used immoral web pages and labeled web pages as the input data set, both to build a linguistic model for each language and to evaluate the system. Our experiments show 95% accuracy. The URL alone is not reliable, because the name used in the address may not reveal that a page is immoral. The encoding alone is not reliable either, because many web pages in different languages can use the same encoding. Consequently, none of the methods can solve the problem by itself. This paper shows that the combination of these three methods gives very promising results. The paper is organized into related work, problem definition, the proposed solution, interpretation of results, and conclusion and future work.
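
The three parts and the voter described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the voter weighs its inputs or how each processor works internally, so everything below is an assumption made for illustration: the weighted-majority rule, the weights, the lookup tables TLD_HINTS, ENCODING_HINTS and STOPWORDS, and all function names are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Optional, Tuple
from urllib.parse import urlparse

# Hypothetical lookup tables, for illustration only (not from the paper).
TLD_HINTS = {"ir": "fa", "de": "de", "fr": "fr", "ru": "ru"}
ENCODING_HINTS = {"windows-1256": "fa", "windows-1251": "ru", "iso-8859-1": "en"}
STOPWORDS = {
    "en": {"the", "and", "of", "to"},
    "de": {"der", "und", "die", "das"},
    "fr": {"le", "et", "les", "des"},
}


def url_processor(url: str) -> Optional[str]:
    """Guess a language from the top-level domain of the URL, if known."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return TLD_HINTS.get(tld)


def encoding_processor(encoding: str) -> Optional[str]:
    """Guess a language from the declared page encoding, if known."""
    return ENCODING_HINTS.get(encoding.lower())


def text_processor(text: str) -> Optional[str]:
    """Guess a language by counting stopword hits (a stand-in for a real
    per-language linguistic model)."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def vote(url: str, encoding: str, text: str,
         weights: Tuple[float, float, float] = (1.0, 1.0, 2.0)) -> Optional[str]:
    """Combine the three guesses by weighted majority (assumed rule).

    The text processor gets a larger weight here because the URL and the
    encoding can both be misleading; that choice is an assumption.
    """
    guesses = (url_processor(url), encoding_processor(encoding), text_processor(text))
    tally = Counter()
    for guess, weight in zip(guesses, weights):
        if guess is not None:
            tally[guess] += weight
    if not tally:
        return None  # no processor produced a guess
    return tally.most_common(1)[0][0]
```

For example, vote("http://example.ir/x", "utf-8", "the cat and the dog") picks the text-based guess over the guess suggested by the address, which mirrors the abstract's point that the URL alone can be misleading. A real system would replace the stopword counter with a trained model per language.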


Saman Bashbaghi

Computer Engineering Dept., Bu-Ali Sina University, Hamedan, Iran

Hassan Khotanlou

Computer Engineering Dept., Bu-Ali Sina University, Hamedan, Iran
