Register Variation in Persian: A Corpus-Driven Study of Slang, Verbs, and Lexical Items across Informal and Formal Texts

Publication year: 1404
Document type: Journal article
Language: Persian

National scientific document ID:

JR_ICSS-27-NaN_027

Indexing date: 15 Azar 1404

Abstract:

This study examines Persian lexical variation by contrasting formal written language with the fast-changing vernacular of social media. It focuses on slang and dialectal expressions that traditional corpora typically miss. Using a corpus-based methodology, it compares the formal Bijankhan Corpus with the informal Large-Scale Colloquial Persian (LSCP) corpus, which is composed of Persian tweets. Both datasets were tokenized, cleaned, and normalized during natural language processing (NLP) preprocessing. Frequency analyses were then conducted to identify lexical items distinctive to each register, with special attention to slang and colloquial terms prevalent in LSCP. The findings highlight the vocabulary richness of informal Persian and contribute to a more nuanced understanding of language variation. They also support incorporating multiple registers into NLP pipelines, which promises to improve the accuracy and cultural relevance of Persian language technologies. Overall, the comparison emphasizes the need to augment linguistic analysis and NLP tools with more informal language data.
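
The preprocessing and frequency comparison described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not name the tools or the distinctiveness measure used, so the character normalization rules, the token pattern, the smoothed log-ratio score, and the function names `normalize`, `tokenize`, and `distinctive_items` are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Assumed normalization rules: map Arabic yeh/kaf to their Persian forms,
# and strip short-vowel diacritics and tatweel (kashida).
_CHAR_MAP = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})
_STRIP = re.compile(r"[\u064B-\u0652\u0640]")
# A token is a run of letters, optionally joined by zero-width non-joiners.
_TOKEN = re.compile(r"[^\W\d_]+(?:\u200c[^\W\d_]+)*")


def normalize(text):
    """Unify character variants and remove diacritics."""
    return _STRIP.sub("", text.translate(_CHAR_MAP))


def tokenize(text):
    """Normalize, then split into word tokens (punctuation and digits dropped)."""
    return _TOKEN.findall(normalize(text))


def distinctive_items(target_docs, reference_docs, top_n=10, min_count=2):
    """Rank target-corpus words by the log2 ratio of add-one-smoothed
    relative frequencies (target vs. reference). Hypothetical measure."""
    target = Counter(t for doc in target_docs for t in tokenize(doc))
    reference = Counter(t for doc in reference_docs for t in tokenize(doc))
    vocab_size = len(set(target) | set(reference))
    target_total = sum(target.values())
    reference_total = sum(reference.values())

    scores = {}
    for word, count in target.items():
        if count < min_count:
            continue  # skip rare items, whose ratios are unreliable
        p_target = (count + 1) / (target_total + vocab_size)
        p_reference = (reference[word] + 1) / (reference_total + vocab_size)
        scores[word] = math.log2(p_target / p_reference)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Toy stand-ins for the informal (LSCP-like) and formal (Bijankhan-like) data.
    informal = ["خیلی باحال بود دمت گرم", "دمت گرم خیلی خوب بود"]
    formal = ["این پژوهش بسیار خوب بود", "نتایج این پژوهش بسیار مهم است"]
    for word, score in distinctive_items(informal, formal, top_n=3):
        print(word, round(score, 2))
```

Swapping the argument order ranks items distinctive to the formal register instead. On corpora of realistic size, a significance-based measure such as log-likelihood may be a better choice than a raw ratio.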

Authors

Hossein Fallah Yakhdani

MA Student, Department of Linguistics, Allame Tabataba'i University, Tehran, Iran

Elham Mizban

PhD in Linguistics, Ferdowsi University of Mashhad, Iran