A Framework for Evaluating Word Boundary Detection in Persian Tokenizers

Year of publication: 1404 (Solar Hijri)
Document type: Journal article
Language: English

The full text of this article is 14 pages long and is available in PDF format.

National scientific document ID:

JR_JICSE-2-1_003

Indexing date: 15 Esfand 1403

Abstract:

Tokenization is a critical stage in text preprocessing and presents numerous challenges in languages like Persian, where word boundaries are not deterministic. These challenges include identifying multi-function morphemes, separating punctuation marks, recovering omitted spaces between tokens, and handling extra spaces inside words. Tokenizer evaluation typically focuses on overall performance, and test data does not necessarily cover all challenging linguistic phenomena. As a result, the strengths and weaknesses of tokenizers on specific challenges are not assessed independently. This paper examines the challenges that the Persian script poses for word boundary detection and evaluates how seven tokenizers handle them. A test set of 4091 tokens across 483 sentences was prepared, of which 1010 were considered challenging tokens, and the tokenizers were evaluated on this dataset. The results indicate varying performance among tokenizers when dealing with Persian orthography. Some tokenizers performed better at separating compound words, while others excelled at identifying and preserving zero-width non-joiners (half-spaces). A detailed comparison reveals that no tokenizer fully addresses all challenges, highlighting the need for improved algorithms and more sophisticated solutions for Persian word boundary detection. By introducing a comprehensive benchmark and identifying the strengths and weaknesses of available tokenizers, this study paves the way for the development of better Persian language processing tools.
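
The abstract does not state which scoring method the authors used. As an illustration only, the following is a minimal sketch of one common way to score word boundary detection: comparing the token end positions of a tokenizer's output against a gold tokenization. The function names `boundaries` and `boundary_prf` are hypothetical, and ignoring whitespace and the zero-width non-joiner when counting positions is an assumption made here so that differently spaced outputs stay comparable.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, the Persian "half-space"


def boundaries(tokens):
    """Return the set of offsets at which a token ends.

    Offsets count only characters that are neither whitespace nor ZWNJ
    (an assumption of this sketch), so two tokenizations of the same
    sentence can be compared even if they differ in spacing.
    """
    ends = set()
    pos = 0
    for tok in tokens:
        pos += sum(1 for ch in tok if not ch.isspace() and ch != ZWNJ)
        ends.add(pos)
    return ends


def boundary_prf(gold_tokens, pred_tokens):
    """Precision, recall and F1 of predicted token boundaries."""
    gold = boundaries(gold_tokens)
    pred = boundaries(pred_tokens)
    tp = len(gold & pred)  # boundaries found in both tokenizations
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# The gold standard keeps the half-space word as one token; the
# hypothetical tokenizer output splits it at the ZWNJ.
gold = ["می\u200cروم", "."]
pred = ["می", "روم", "."]
precision, recall, f1 = boundary_prf(gold, pred)
```

In this example the predicted tokenization contains every gold boundary plus one spurious split at the half-space, so recall is perfect while precision drops. Restricting the same computation to sentences that contain a given phenomenon would yield per-challenge scores of the kind the abstract argues for.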

Authors

Mostafa Karimi Manesh

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

Mehrnoush Shamsfard

Faculty of Computer Science and Engineering, Shahid Beheshti University G.C, Tehran, Iran