Prioritizing the ordering of URL queue in focused crawler

  • Publication year: 1392 (Solar Hijri calendar)
  • Published in: Journal of Artificial Intelligence and Data Mining, Volume 2, Issue 1
  • COI code: JR_JADM-2-1_004
  • Article language: English

Authors

D. Koundal

University Institute of Engineering and Technology, Panjab University, Chandigarh, India

Abstract

The enormous growth of the World Wide Web in recent years has made efficient resource discovery a necessity. Downloading only domain-specific web pages is not a simple task for a crawler, and an unfocused crawl often returns undesired results. Several ideas have been proposed to address this; among them, focused crawling is a key technique that crawls particular topical portions of the World Wide Web quickly and efficiently without having to explore all web pages. The proposed approach does not rely only on keywords for the crawl, but also on high-level background knowledge with concepts and relations, which are compared with the text of the searched page. In this paper, a combined crawling strategy is proposed that integrates a link analysis algorithm with an association metric. The approach identifies relevant pages before they are crawled and prioritizes the URL queue, based on a domain-dependent ontology, so that the more relevant pages are downloaded first. The strategy uses the ontology to estimate the semantic content of a URL without exploring the page it points to, which in turn strengthens the ordering metric for the URL queue and leads to the retrieval of the most relevant pages.
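
The abstract does not give the scoring formula, so the following is only a minimal sketch of how a URL queue might be ordered by combining a link-analysis score with an ontology-based association score. The names `ONTOLOGY`, `association_score`, `priority`, `URLQueue`, the term weights, and the mixing weight `alpha` are all hypothetical and are not taken from the paper.

```python
import heapq
import re

# Hypothetical domain ontology: concept term -> weight in [0, 1].
# A real ontology would also carry relations between concepts.
ONTOLOGY = {"crawler": 1.0, "ontology": 0.9, "search": 0.6, "web": 0.4}


def association_score(url, anchor_text, ontology=ONTOLOGY):
    """Average ontology weight over the tokens of the URL and its anchor text.

    The page behind the URL is never downloaded; only the link itself is used.
    """
    tokens = re.findall(r"[a-z]+", (url + " " + anchor_text).lower())
    if not tokens:
        return 0.0
    return sum(ontology.get(t, 0.0) for t in tokens) / len(tokens)


def priority(url, anchor_text, link_score, alpha=0.5):
    """Assumed linear mix of a link-analysis score and the association score."""
    return alpha * link_score + (1 - alpha) * association_score(url, anchor_text)


class URLQueue:
    """Priority queue that returns the highest-scoring unseen URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker so equal scores keep insertion order

    def push(self, url, anchor_text, link_score):
        if url in self._seen:
            return  # skip URLs that were already queued
        self._seen.add(url)
        score = priority(url, anchor_text, link_score)
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

In this sketch, `link_score` stands in for whatever link analysis algorithm the crawler uses, and `alpha` controls how much weight it receives relative to the ontology match. Two links with the same `link_score` are dequeued in order of how strongly their URL and anchor text match the ontology terms.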

Keywords

Web crawler, Importance metrics, Association metric, Ontology
