A New Dataset of Persian Handwritten Documents and its Segmentation

Publication year: 1390 (Solar Hijri)
Document type: Conference paper
Language: English

The full text of this paper is available as a 5-page PDF file.

National scientific document ID:

ICMVIP07_138

Indexing date: 28 Mordad 1391 (Solar Hijri)

Abstract:

In document image analysis, and especially in handwritten document image recognition, standard datasets play vital roles in evaluating the performance of algorithms and comparing results obtained by different groups of researchers. In this paper, an unconstrained Persian handwritten text dataset (PHTD) is introduced. The PHTD contains 140 handwritten documents of three different categories written by 40 individuals. The total numbers of text-lines and words/subwords in the dataset are 1787 and 27073, respectively. In most of the PHTD documents, either overlapping or touching text-lines are present. The average number of text-lines per document in the PHTD is 13. Two types of ground truth, based on pixel information and content information, are generated for the dataset. With these two types of ground truth, the PHTD can be utilized in many areas of document image processing, such as sentence recognition/understanding, text-line segmentation, word segmentation, word recognition, and character segmentation. To provide a framework for other research, recent text-line segmentation results on this dataset are also reported.

Authors

Alireza Alaei

Department of Studies in Computer Science, University of Mysore, Mysore 570006, India

P. Nagabhushan

Department of Studies in Computer Science, University of Mysore, Mysore 570006, India

Umapada Pal

Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata–108, India