RePersian - A Fast Relation Extraction Tool in Persian

سال انتشار: 1398
محل انتشار: فصلنامه بین المللی وب پژوهی، دوره: 2، شماره: 2
کد COI اختصاصی: JR_IJWR-2-2_003
زبان مقاله: انگلیسی
تعداد مشاهده: 337

نویسندگان

IUST

Department of Computer Engineering Iran University of Science and Technology

Associate Professor and director of research in the Computer Engineering Department at Iran University of Science and Technology

چکیده

The task of extracting semantic relations from raw data is called relation extraction. One of the most important fields in open information extraction is the automatically extraction of relations in any domain, especially in web mining. There are many works and approaches for relation extraction in English and other languages. Some of these approaches are based on parsing trees. Dependency parsing in the Persian language is difficult and time-consuming, since Persian is a low resource language and has also a dependency grammar and lexical structure, which affects also the speed of relations extraction in Persian. In this paper we will introduce a fast relation extraction method in Persian called RePersian. RePersian is dependent on part-of-speech (POS) tags of a sentence and special relation patterns, which are extracted by analyzing sentence structures in Persian. For finding relation patterns, RePersian searches through POS-tags that are given in regular expression forms. By matching the correct POS pattern to a relation pattern, RePersian extracts the semantic relations in a sentence. We appraise RePersian in two different scenarios on the Dadegan Persian dependency tree dataset. RePersian had on average the precisions 78.05%, 80.4% and 54.85% in finding the first argument on a relation, the second argument and the right relation between them.

کلیدواژه ها

Relation Extraction, Persian language, Regex, POS Tag

اطلاعات بیشتر در مورد COI

COI مخفف عبارت CIVILICA Object Identifier به معنی شناسه سیویلیکا برای اسناد است. COI کدی است که مطابق محل انتشار، به مقالات کنفرانسها و ژورنالهای داخل کشور به هنگام نمایه سازی بر روی پایگاه استنادی سیویلیکا اختصاص می یابد.

کد COI به مفهوم کد ملی اسناد نمایه شده در سیویلیکا است و کدی یکتا و ثابت است و به همین دلیل همواره قابلیت استناد و پیگیری دارد.