Extracting Persian-English Parallel Sentences from Document Level Aligned Comparable Corpus using Bi-Directional Translation

Bilingual parallel corpora are very important in various fields of natural language processing (NLP). The quality of a Statistical Machine Translation (SMT) system depends strongly on the amount of training data. For low-resource language pairs such as Persian-English, there are not enough parallel sentences to build an accurate SMT system. This paper describes a new approach that uses Wikipedia as a comparable corpus to extract Persian-English parallel sentences and eventually improve SMT system performance. The approach is also applicable to other low-resource language pairs. To calculate the similarity score between two sentences, a novel bi-directional translation-based information retrieval system is proposed. A length penalty score is introduced to increase the accuracy of the extracted corpus. Using the extracted parallel sentences, the performance of an existing Persian-English SMT system is improved drastically.

Keywords:

comparable corpus, bi-directional translation, statistical machine translation, Wikipedia, information retrieval

Authors

Ebrahim Ansari

Department of Computer Science and Engineering, Shiraz University, Shiraz, Fars, Iran

Mohammad Hadi Sadreddin

Department of Computer Science and Engineering, Shiraz University, Shiraz, Fars, Iran

Alireza Tabebordba

Department of Computer Science and Engineering, Shiraz University, Shiraz, Fars, Iran

Richard WALLAC

Distributed Systems Architecture Research Group, Complutense University, Madrid, Spain

Permanent link to this paper:

https://civilica.com/doc/308917

National scientific document ID:

JR_ACSIJ-3-5_008

Indexing date: 12 Aban 1393

How to cite this paper:

To cite this paper in your own research, you can use the following entry in your reference list:

Ansari, Ebrahim and Sadreddin, Mohammad Hadi and Tabebordba, Alireza and WALLAC, Richard, 1393, Extracting Persian-English Parallel Sentences from Document Level Aligned Comparable Corpus using Bi-Directional Translation, https://civilica.com/doc/308917

Within the text, wherever a statement or result from this paper is mentioned, the following details are given in parentheses after it.
First mention: (1393, Ansari, Ebrahim; Mohammad Hadi Sadreddin and Alireza Tabebordba and Richard WALLAC)
Second and later mentions: (1393, Ansari; Sadreddin and Tabebordba and WALLAC)

Scientometrics and ranking of the paper

The institution that produced this paper is described as follows:

Scientific ranking of Shiraz University

Type of institution: public university

Number of papers: 34,641
