Extracting Persian-English Parallel Sentences from Document Level Aligned Comparable Corpus using Bi-Directional Translation
Year of publication: 1393 (Solar Hijri)
Document type: Journal article
Language: English
The full text of this article is available as a 7-page PDF file.
National scientific document ID:
JR_ACSIJ-3-5_008
Indexing date: 12 Aban 1393 (Solar Hijri)
Abstract:
Bilingual parallel corpora are very important in various fields of natural language processing (NLP). The quality of a Statistical Machine Translation (SMT) system depends strongly on the amount of training data. For low-resource language pairs such as Persian-English, there are not enough parallel sentences to build an accurate SMT system. This paper describes a new approach that uses Wikipedia as a comparable corpus to extract Persian-English parallel sentences and ultimately improve SMT system performance. The approach is also applicable to other low-resource language pairs. To calculate the similarity score between two sentences, a novel bi-directional translation-based information retrieval system is proposed. A length penalty score is introduced to increase the accuracy of the extracted corpus. Using the extracted parallel sentences, the performance of an existing Persian-English SMT system is improved drastically.
Authors
Ebrahim Ansari
Department of Computer Science and Engineering, Shiraz University, Shiraz, Fars, Iran
Mohammad Hadi Sadreddin
Department of Computer Science and Engineering, Shiraz University, Shiraz, Fars, Iran
Alireza Tabebordba
Department of Computer Science and Engineering, Shiraz University, Shiraz, Fars, Iran
Richard WALLAC
Distributed Systems Architecture Research Group, Complutense University, Madrid, Spain