CIVILICA We Respect the Science
(Specialized publisher of national conferences / Publishing license number from the Ministry of Culture and Islamic Guidance: 8971)

Employing a novel content-based similarity measure for a machine learning-driven focused crawler

Paper title: Employing a novel content-based similarity measure for a machine learning-driven focused crawler
National paper ID: CEPS06_121
Published in: Sixth National Conference on Applied Research in Computer Engineering and Information Technology, 1398 (Solar Hijri calendar)
Authors:

Atiye Jabalameli - Department of Electrical and Computer Engineering, University of Kashan, Kashan, Iran
S. Mehdi Vahidipour - Department of Electrical and Computer Engineering, University of Kashan, Kashan, Iran
Mohammad Mahdi Mohammadi - Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran

Abstract:
The World Wide Web is growing rapidly, to the point where managing its data is challenging. Search engines collect data from across the web on behalf of users. Web crawlers, a major component of search engines, retrieve data from the web that is relevant to user requests. A focused crawler, in particular, targets a predefined subject and retrieves only the pages relevant to it. In this paper, we propose an efficient focused web crawling approach that combines a content-based similarity measure with a Naive Bayes classifier to find pages relevant to a particular subject. Our initial experiments show improvements of 4% in accuracy and 1% in recall.
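
The abstract does not say how the similarity measure and the classifier are combined, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a standard TF-IDF cosine similarity in place of the paper's novel measure, a small multinomial Naive Bayes with Laplace smoothing, and a hypothetical rule that accepts a page only when both signals agree. The names `tfidf_vectors`, `cosine`, `NaiveBayes`, `is_relevant`, the threshold of 0.1, and the toy training data are all illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real crawler would also strip HTML and stop words.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (k / len(toks)) * idf[t] for t, k in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(tokenize(doc))
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        return self

    def predict(self, doc):
        scores = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c])
            for t in tokenize(doc):
                if t in self.vocab:  # terms never seen in training are ignored
                    score += math.log((self.counts[c][t] + 1) / total)
            scores[c] = score
        return max(scores, key=scores.get)


# Toy data, invented purely for illustration.
train_docs = [
    "web crawler downloads pages and follows links",
    "focused crawler retrieves pages relevant to a topic",
    "recipe for chocolate cake with butter and sugar",
    "football match results and league table",
]
train_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]
topic = "focused web crawler retrieves relevant pages"
clf = NaiveBayes().fit(train_docs, train_labels)


def is_relevant(page, threshold=0.1):
    # Assumed combination rule: the page must be similar enough to the topic
    # description AND be labelled relevant by the classifier.
    vecs = tfidf_vectors(train_docs + [topic, page])
    similarity = cosine(vecs[-2], vecs[-1])
    return similarity >= threshold and clf.predict(page) == "relevant"


print(is_relevant("a crawler follows links to retrieve relevant web pages"))
print(is_relevant("chocolate cake recipe with sugar"))
```

Requiring both signals to agree favors precision over recall; a crawler that prefers broader coverage could instead accept a page when either signal fires, or rank the crawl frontier by a weighted sum of the two scores.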

Keywords:
Focused crawler, Web crawler, Naive Bayes classification, Relevant page, TF-IDF criteria

Paper page and full-text download: https://civilica.com/doc/1011676/