Fine-Tuning BERT for Persian Poet Identification

Soroosh Akef; Mohammad Bahrani

Paper title: Fine-Tuning BERT for Persian Poet Identification
National paper ID: EMAA20_015
Published in the 20th International Conference on Modern Research in Science and Technology, 1400 (Solar Hijri calendar)

Author details:

Soroosh Akef - Languages and Linguistics Center, Sharif University of Technology
Mohammad Bahrani - Department of Computer, Allameh Tabataba'i University

Abstract:

In this experiment, we address poet identification for four prominent Persian poets (Hafez, Omar Khayyam, Rumi, and Saadi Shirazi) by fine-tuning the BERT language representation model on a dataset of hemistichs. One challenge was the imbalanced class distribution: one class contained more than 52,000 hemistichs, while another contained fewer than 1,300. Moreover, the short length of a hemistich makes the task harder than identifying a poet from a whole poem or even a full verse. The poets' diction was also shown to be similar to some degree, which added to the difficulty. The model attained a Matthews correlation coefficient of 0.646 on the test set, demonstrating that transfer learning can be effective for processing literary works even with limited data, an imbalanced dataset, and similar diction.
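
The abstract reports a Matthews correlation coefficient (MCC) rather than accuracy, which is a natural choice for imbalanced classes. The following is a minimal sketch of the standard multiclass MCC formula in plain Python, not the authors' evaluation code. The function name, the label names, and the toy 90/10 class split are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient for two label sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                   # total number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly classified samples
    t_k = Counter(y_true)                             # true count per class
    p_k = Counter(y_pred)                             # predicted count per class
    numerator = c * s - sum(t_k[k] * p_k[k] for k in t_k)
    denom = sqrt((s * s - sum(v * v for v in t_k.values()))
                 * (s * s - sum(v * v for v in p_k.values())))
    # By convention, return 0.0 when the coefficient is undefined
    # (for example, when every prediction is the same class).
    return numerator / denom if denom else 0.0


# Hypothetical imbalanced toy data: 90 majority-class and 10 minority-class items.
y_true = ["majority"] * 90 + ["minority"] * 10
always_majority = ["majority"] * 100

accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
mcc_trivial = multiclass_mcc(y_true, always_majority)
mcc_perfect = multiclass_mcc(y_true, y_true)
```

On this toy data, a classifier that always predicts the majority class reaches 90% accuracy but an MCC of 0, while a perfect classifier reaches an MCC of 1. This is why a score such as 0.646 on a heavily imbalanced test set is more informative than raw accuracy would be.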

Keywords:

author identification, digital humanities, imbalanced data, poet identification, computational linguistics, BERT

Paper page and full-text download: https://civilica.com/doc/1412278/