A Signal Processing Method for Text Language Identification

H. Hassanpour; M. M. AlyanNezhadi; M. Mohammadi

Published in: International Journal of Engineering, Volume 34, Issue 6

Year of publication: 1400 (Solar Hijri)

Document type: Journal article

Language: English

The full article is 6 pages long and is available in PDF format.

This article is classified under the following subject areas:

Artificial Intelligence > Neural Network

Permanent link to this article:

https://civilica.com/doc/1224245

National scientific document ID:

JR_IJE-34-6_004

Indexing date: 12 Khordad 1400 (Solar Hijri)

Abstract:

Language identification is a critical step prior to any natural language processing. In this paper, a signal processing method for language identification is proposed. The sequence of characters in a word and the order of words in a stream identify the language. The sequence of characters in a stream provides a signature for recognizing the language without understanding its meaning. This signature can be extracted using signal processing techniques by converting texts into time series. Although several research and commercial software tools have been developed to identify text language, they need a standard dictionary for each language. We propose a dictionary-independent method consisting of three main steps: I) preprocessing, II) clustering, and III) classification. First, the texts are converted to time series using UTF-8 codes. Second, to group similar languages, the obtained series are clustered. Third, each cluster is decomposed into 32 sub-bands using a wavelet packet, and 32 features are extracted from each sub-band. A multilayer perceptron neural network is then used to classify the extracted features. The proposed method was tested on our dataset of 31,000 texts from 31 different languages and achieved 72.20% accuracy for language identification.
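The preprocessing and decomposition steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the wavelet or the features, so the Haar filter, the padding rule for odd lengths, and the per-band energy feature are placeholders chosen for simplicity, and all function names are hypothetical. The clustering step and the multilayer perceptron classifier are omitted.

```python
import math


def text_to_series(text):
    """Map a text to a numeric series of its UTF-8 byte values (0..255)."""
    return [float(b) for b in text.encode("utf-8")]


def haar_step(signal):
    """One Haar analysis step (assumed filter): return (approximation, detail)."""
    if len(signal) % 2:
        # Assumption: pad odd-length bands by repeating the last sample.
        signal = signal + [signal[-1]]
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def wavelet_packet(signal, depth=5):
    """Full wavelet packet tree: every band is split at every level,
    so depth=5 yields 2**5 = 32 sub-bands."""
    bands = [list(signal)]
    for _ in range(depth):
        next_bands = []
        for band in bands:
            approx, detail = haar_step(band)
            next_bands.extend([approx, detail])
        bands = next_bands
    return bands


def band_energies(bands):
    """Assumed feature: sum of squared coefficients in each sub-band."""
    return [sum(x * x for x in band) for band in bands]


# Usage: a 64-byte sample text gives 32 sub-bands of 2 coefficients each.
series = text_to_series("abcdefgh" * 8)
bands = wavelet_packet(series, depth=5)
features = band_energies(bands)
```

In this sketch a feature vector such as `features` would be the input to a classifier. Because multi-byte UTF-8 sequences produce distinctive byte patterns, texts in different scripts yield visibly different series even before any decomposition.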

Keywords:

Language Identification, Signal Processing, Wavelet Packet Transform, Artificial Neural Network

Authors

H. Hassanpour

Image Processing & Data Mining Lab, Shahrood University of Technology, Shahrood, Iran

M. M. AlyanNezhadi

Department of Mathematics, University of Science and Technology of Mazandaran, Behshahr, Iran

M. Mohammadi

Department of Information Technology, Lebanese French University, Erbil, Kurdistan Region, Iraq
