Identifying Duplicate Records by Using Estimation of Distribution Algorithms to Learn the Semantics

  • Publication year: 1384 (Solar Hijri)
  • Published at: 11th Annual Conference of the Computer Society of Iran
  • Dedicated COI code: ACCSI11_218
  • Paper language: English

Authors

Saied Haidarian Shahri 1

Control and Intelligent Processing Center of Excellence (CIPCE), Department of Computer and Electrical Engineering, University of Tehran, Tehran, Iran

Caro Lucas

Control and Intelligent Processing Center of Excellence (CIPCE), Department of Computer and Electrical Engineering, University of Tehran, Tehran, Iran

Babak N. Araabi 1,2

School of Cognitive Sciences, Institute for Studies in Theoretical Physics and Mathematics, Tehran, Iran

Abstract

When data is gathered from various sources for inclusion in integrated information systems such as data warehouses, the likelihood of duplicate and inconsistent records increases. A flexible, automatic reasoning mechanism is required to clean the data so that users can draw accurate statistics and reports from it for use in enterprise decision making. In this paper, we employ a deduplication approach built on a fuzzy logic framework. The fuzzy inference system is then optimized by means of the Bayesian Optimization Algorithm, a member of the class of Estimation of Distribution Algorithms that can learn complex multivariate relations of bounded order. This class of algorithms is inspired by the breeder genetic algorithm, which in turn draws on the science of livestock breeding. The experiments reveal that this approach is capable of eliminating duplicates that are fraught with uncertainty, and the resulting data is therefore of better quality.
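
For readers unfamiliar with Estimation of Distribution Algorithms, the sample, select, and re-estimate loop they share can be sketched as follows. This is a minimal illustration only: it uses the much simpler univariate variant (UMDA) rather than the Bayesian Optimization Algorithm named in the abstract, and a toy fitness (counting 1-bits) rather than the paper's fuzzy inference system. The function name `umda_onemax` and all parameter values are assumptions made for this sketch, not taken from the paper.

```python
import random

def umda_onemax(n_bits=20, pop_size=60, n_select=30, generations=40, seed=0):
    """Univariate EDA (UMDA) maximizing the number of 1-bits in a bit string.

    Hypothetical toy example; the paper itself uses BOA, which learns a
    Bayesian network over the variables instead of independent marginals.
    """
    rng = random.Random(seed)
    probs = [0.5] * n_bits  # probabilistic model: P(bit i = 1)
    best = None
    for _ in range(generations):
        # Sample a population from the current model.
        population = [[1 if rng.random() < p else 0 for p in probs]
                      for _ in range(pop_size)]
        # Truncation selection: keep the fittest individuals.
        population.sort(key=sum, reverse=True)
        selected = population[:n_select]
        if best is None or sum(selected[0]) > sum(best):
            best = selected[0]
        # Re-estimate each marginal from the selected set, clamped
        # away from 0 and 1 so no bit value becomes unreachable.
        for i in range(n_bits):
            freq = sum(ind[i] for ind in selected) / n_select
            probs[i] = min(max(freq, 0.05), 0.95)
    return best, probs
```

Calling `umda_onemax()` returns the best bit string found and the final marginal probabilities. BOA would replace the independent marginals with a learned Bayesian network, which is what allows it to capture the multivariate relations the abstract refers to.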

Keywords

Duplicate Elimination, Estimation of Distribution Algorithms, Fuzzy Inference System

More information about COI

COI stands for CIVILICA Object Identifier, that is, the CIVILICA identifier for documents. A COI is a code assigned, according to the place of publication, to papers from domestic conferences and journals when they are indexed in the CIVILICA citation database.

The COI serves as a national identification code for documents indexed in CIVILICA. It is unique and permanent, and for that reason it can always be cited and tracked.