Text Anomalies Detection Using Histograms of Words

Year of publication: 1395
Document type: Journal article
Language: English

The file of this article is available for download as a 6-page PDF.

National scientific document ID:

JR_ACSIJ-5-1_010

Indexing date: 19 Aban 1397

Abstract:

Authors of written texts can be characterized mainly by a collection of attributes obtained from their texts. Texts by the same author are very similar in style. We can assume that the attributes of a full text are very similar to the attributes of parts of the same text. In the same way, different parts of the same text can be compared with each other. In this paper, we describe an algorithm based on histograms of a text mapped to the interval [0, 1]. The mapping keeps the word order of the text. Histograms are analyzed from a clustering point of view. If the cluster dispersion is not large, the text was probably written by a single author. If the cluster dispersion is large, the text is split into two or more parts and the same analysis is applied to each part. The experiments were done on English and Arabic texts. For combined English texts, our algorithm discovers that the texts were not written by one author. We obtained similar results for combined Arabic texts. Our algorithm can be used for a basic analysis of whether a text was written by one author.
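
The pipeline outlined in the abstract (map words to [0, 1] in text order, build histograms, measure cluster dispersion, split when dispersion is large) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the mapping, the dispersion measure, or the splitting rule. Here normalized word length stands in for the mapping, dispersion is the mean Euclidean distance of window histograms from their centroid, and a text is split into two halves. The names `map_words`, `histogram`, `dispersion`, `analyze` and all parameter values are hypothetical.

```python
import re
from statistics import mean


def map_words(text, max_len=20):
    """Map each word to a value in [0, 1], keeping the word order.

    Assumption: normalized word length replaces the paper's unspecified mapping.
    """
    words = re.findall(r"\w+", text)
    return [min(len(w), max_len) / max_len for w in words]


def histogram(values, bins=10):
    """Relative-frequency histogram of values that lie in [0, 1]."""
    counts = [0] * bins
    for v in values:
        # A value of exactly 1.0 is placed in the last bin.
        counts[min(int(v * bins), bins - 1)] += 1
    return [c / len(values) for c in counts]


def dispersion(values, window=50, bins=10):
    """Mean Euclidean distance of window histograms from their centroid.

    Assumption: this is one simple notion of cluster dispersion.
    """
    hists = [histogram(values[i:i + window], bins)
             for i in range(0, len(values) - window + 1, window)]
    if len(hists) < 2:
        return 0.0
    centroid = [mean(column) for column in zip(*hists)]
    distances = [sum((h - c) ** 2 for h, c in zip(hist, centroid)) ** 0.5
                 for hist in hists]
    return mean(distances)


def analyze(values, threshold=0.15, window=50, min_len=200, start=0):
    """Return (start, end, dispersion) tuples for the parts of the text.

    A part is split into two halves and re-analyzed while its dispersion
    exceeds the (hypothetical) threshold and it is long enough to split.
    """
    d = dispersion(values, window)
    if d <= threshold or len(values) < min_len:
        return [(start, start + len(values), d)]
    mid = len(values) // 2
    return (analyze(values[:mid], threshold, window, min_len, start)
            + analyze(values[mid:], threshold, window, min_len, start + mid))
```

Calling `analyze(map_words(text))` returns one tuple when the whole text looks homogeneous under these assumptions, and several tuples when the text was split. Parts that are still reported with a large dispersion are candidates for mixed authorship.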

Authors

Abdulwahed Almarimi

Institute of Computer Science, Faculty of Science, P. J. Šafárik University in Košice, 04001 Košice, Slovakia

Gabriela Andrejková

Institute of Computer Science, Faculty of Science, P. J. Šafárik University in Košice, 04001 Košice, Slovakia