Pattern Matching for Extraction of Core Contents from News Web Pages

  • Year of publication: 1395
  • Venue: Second International Conference on Web Research
  • COI code: IRANWEB02_075
  • Language of the paper: English

Authors

Sandeep Sirsat

Associate Professor and Head, Department of Computer Science, Shri Shivaji Science & Arts College, Chikhali, Maharashtra, India

Vinay Chavan

Associate Professor and Head, Department of Computer Science, S. K. Prowal College, Kampti, Nagpur, India

Abstract

Besides their core content, web pages contain other elements such as banners, navigational elements, copyright information, and external links. This noisy content covers a larger area of the page and is typically unrelated to the page's main subject. Most information on web pages is represented in XML, HTML, or XHTML, formats that mostly hold semi-structured text documents lacking a formal document structure. Such a document does not separate the text from the schema, and the amount of structure used to represent the text depends on its purpose. No semantics are applied to semi-structured documents. It is therefore necessary to extract the core content of a text document before its words or sentences can be analysed to retrieve relevant information. Although many existing methods formulate content identification as a DOM tree node selection problem, each has shortcomings. Here we propose an approach based on a pattern matching technique. The technique uses a simple heuristic to extract core content from web pages, which are mostly semi-structured in nature. It requires visiting the appropriate news web site through its URL, accessing the links related to each news page of a specified category, and extracting the data, including metadata, from each of these news web pages. The approach uses a devised algorithm that applies regular expressions (regexes) to identify the correct pattern for extracting the actual text content from these news documents. The proposed approach handles news web pages of any size and extracts core content efficiently and with high accuracy.
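The pipeline outlined in the abstract (collect the links of one news category, then apply regular expressions to each page to separate article text and metadata from noise) might look roughly like the sketch below. The abstract does not give the authors' actual patterns or heuristic, so every regex, the `min_words` threshold, the function names, and the `SAMPLE` page are assumptions made purely for illustration. The sketch works on HTML strings and leaves out the download step.

```python
import re
from html import unescape

# Hypothetical patterns; the abstract does not list the regexes the authors use.
NOISE_BLOCKS = re.compile(
    r"<(script|style|nav|header|footer|aside)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
PARAGRAPH = re.compile(r"<p\b[^>]*>(.*?)</p\s*>", re.IGNORECASE | re.DOTALL)
TITLE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)
LINK = re.compile(r'<a\b[^>]*\bhref="([^"]+)"', re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def category_links(index_html, category):
    """Return hrefs from an index page whose path contains /<category>/."""
    return [href for href in LINK.findall(index_html) if f"/{category}/" in href]


def clean(fragment):
    """Replace inline tags with spaces, decode entities, collapse whitespace."""
    return " ".join(unescape(TAG.sub(" ", fragment)).split())


def extract_core(page_html, min_words=5):
    """Return the page title (metadata) and the paragraphs kept as core text."""
    title_match = TITLE.search(page_html)
    title = clean(title_match.group(1)) if title_match else ""
    # Drop whole blocks that usually carry navigation, scripts, or footers.
    body = NOISE_BLOCKS.sub("", page_html)
    paragraphs = [clean(p) for p in PARAGRAPH.findall(body)]
    # Assumed heuristic: very short paragraphs are treated as noise.
    core = [p for p in paragraphs if len(p.split()) >= min_words]
    return {"title": title, "text": core}


# Made-up page used only to exercise the functions above.
SAMPLE = """<html><head><title>City opens new bridge</title>
<style>p{color:red}</style></head>
<body><nav><p>Home | World | Sports and more links here today</p></nav>
<p>The new bridge opened to traffic on Monday after two years of work.</p>
<p>Share this</p>
<footer><p>Copyright notice for this example news site, all rights reserved.</p></footer>
</body></html>"""

if __name__ == "__main__":
    print(extract_core(SAMPLE))
```

Regexes keep the sketch short and dependency-free, but they are fragile on malformed or deeply nested markup, so the patterns would need tuning for each news site.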

Keywords

Pattern matching, Information extraction, Document Object Model, tags
