Data Extraction using Content-Based Handles

A. Pouramini; S. Khaje Hassani; Sh. Nasiri

Published in: Journal of Artificial Intelligence and Data Mining, Volume 6, Issue 2

Year of publication: 1397 (Solar Hijri calendar)

Document type: Journal article

Language: English

The full text of this article is available as a 9-page PDF file.

Permanent link to this article:

https://civilica.com/doc/894099

National scientific document ID:

JR_JADM-6-2_015

Indexing date: 19 Tir 1398 (Solar Hijri calendar)

Abstract:

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers that extract data records from web pages. Our approach relies mainly on the visible page content to identify data regions on a web page. The extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text features such as textual delimiters, keywords, constants, or text patterns, which we call handles, to construct patterns for the target data regions and data records. We offer a polynomial-time algorithm in which these patterns are checked against the page elements in a mixed bottom-up and top-down traversal of the DOM tree. The extracted data is mapped directly onto a hierarchical XML structure, which forms the output of the wrapper. The wrappers generated by this method are robust and independent of the HTML structure; they can therefore be adapted to similar websites to gather and integrate information.
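
The general idea of matching text handles against visible page content can be illustrated with a minimal Python sketch. This is not the HWrap algorithm from the paper; it is a simplified stand-in under stated assumptions. The sample page, the field names, the regular expressions in HANDLES, and all function names are hypothetical, and the page is assumed to be well-formed XHTML so that the standard-library XML parser can read it. The sketch treats the deepest element whose visible text matches every handle as a data record, then maps the captured values onto an XML structure.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample page (assumed well-formed so ElementTree can parse it).
PAGE = """
<html><body>
  <div id="nav"><a>Home</a></div>
  <table>
    <tr><td><b>Title:</b> Dune</td><td>Price: $9.99</td></tr>
    <tr><td><b>Title:</b> Emma</td><td>Price: $4.50</td></tr>
  </table>
</body></html>
"""

# Hypothetical handles: each field is located by a visible label, not by a tag.
HANDLES = {
    "title": re.compile(r"Title:\s*(.+?)\s*Price:"),
    "price": re.compile(r"Price:\s*\$([\d.]+)"),
}

def visible_text(elem):
    """Concatenate all text under elem and collapse runs of whitespace."""
    return " ".join(" ".join(elem.itertext()).split())

def matches_all(elem):
    """True if every handle pattern occurs in the element's visible text."""
    text = visible_text(elem)
    return all(p.search(text) for p in HANDLES.values())

def find_records(root):
    """Return the deepest elements that match all handles, in document order."""
    return [
        elem for elem in root.iter()
        if matches_all(elem) and not any(matches_all(child) for child in elem)
    ]

def extract(page):
    """Build a <records> XML tree from the handle matches found in page."""
    root = ET.fromstring(page.strip())
    out = ET.Element("records")
    for rec in find_records(root):
        text = visible_text(rec)
        node = ET.SubElement(out, "record")
        for name, pattern in HANDLES.items():
            ET.SubElement(node, name).text = pattern.search(text).group(1)
    return out

if __name__ == "__main__":
    print(ET.tostring(extract(PAGE), encoding="unicode"))
```

Because the patterns refer only to visible labels such as "Title:" and "Price:", the same HANDLES also work in this sketch on a page that lays out the same text with list items instead of table rows, which mirrors the structure-independence the abstract describes.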

Keywords:

Web Data Record Extraction, Web Wrapper Generation, Web Information Extraction

Authors

A. Pouramini

Department of Computer Engineering, University of Sirjan Technology, Sirjan, Iran.

S. Khaje Hassani

Department of Computer Engineering, University of Sirjan Technology, Sirjan, Iran.

Sh. Nasiri

Department of Computer Engineering, University of Sirjan Technology, Sirjan, Iran.