Web Data Extraction Using Textual Anchors

سال انتشار: 1394
نوع سند: مقاله ژورنالی
زبان: انگلیسی
مشاهده: 425

فایل این مقاله در 8 صفحه با فرمت PDF قابل دریافت می باشد

استخراج به نرم افزارهای پژوهشی:

لینک ثابت به این مقاله:

شناسه ملی سند علمی:

JR_JKBEI-2-4_004

تاریخ نمایه سازی: 17 شهریور 1395

چکیده مقاله:

In this paper, we present an approach and a visual tool, called ABDES, for creating web wrappers to extract data records from web pages. In our approach, we rely mainly on the visible page content, simulating the way a human user scans a web page for specific data. To create a wrapper, we use text features such as textual delimiters, keywords, constants or text patterns, which we call anchors, to create patterns for the target data regions and data records. We offer a polynomial data extraction algorithm, in which these patterns are checked against the page elements in a mixed bottom-up and top-down traverse of the DOM tree. The extracted data is directly mapped onto a hierarchical XML structure as the output of the algorithm. The wrappers generated by the system are robust and independent of the HTML structure. Therefore, they can be adapted to multiple websites to gather and integrate information.

کلیدواژه ها:

نویسندگان

Ahmad Pouramini

Department of Computer Engineering,Sirjan University of Technology, Sirjan, Iran