Layout Analysis in textual information with NLP

  • Publication year: 1403 (Solar Hijri calendar)
  • Venue: 23rd International Conference on Information Technology, Computer and Telecommunications
  • COI (CIVILICA Object Identifier) code: ITCT23_002
  • Article language: English

Authors

Mohammadreza Faraji

B.Sc. in Computer Engineering, Fouman Faculty of Engineering, College of Engineering, University of Tehran, Iran

Atefeh Hasan-Zadeh

Fouman Faculty of Engineering, College of Engineering, University of Tehran, Iran, P.O. Box: 43581-39115

Abstract

As the number of scientific journals increases, analyzing trends and the latest technologies in a particular scientific field becomes a very time-consuming and tedious task. In response to the urgent need for information, which the existing systematic review model does not address well, several review types have emerged, namely the quick review and the investigation of the limits. In this paper, we propose an NLP-enabled tool that automates most of the text document review process. OCR, in turn, has two main purposes: recognizing text in images and converting images into text. One of the tasks currently performed by OCR is layout analysis, which classifies the different parts of a text image, including tables, headings, and paragraphs, into separate classes. There are two general methods for this purpose: (1) the computer vision method and (2) the natural language processing method. In this study, we use the second method and examine it in detail. The natural language processing method applied in this paper achieves an accuracy of 0.74 on textual information in the evaluation section, which is significant and can be relied on as a result.
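To make the idea of text-only layout analysis concrete, the following is a minimal sketch that assigns text blocks to the classes named in the abstract (table, heading, paragraph) and computes accuracy as the fraction of correct predictions. It is a toy rule-based illustration, not the method used in the paper; the function names `classify_block` and `accuracy`, the cue rules, and the thresholds are all assumptions made for this example.

```python
import re


def classify_block(text: str) -> str:
    """Assign a text block to a layout class using text-only cues.

    Hypothetical heuristic for illustration; the thresholds are assumptions.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "paragraph"

    # Table rows tend to contain column separators:
    # tabs, pipes, or runs of three or more spaces between tokens.
    table_like = sum(1 for ln in lines if re.search(r"\t|\||\S {3,}\S", ln))
    if len(lines) >= 2 and table_like == len(lines):
        return "table"

    # Headings tend to be one short line without closing punctuation.
    words = text.split()
    ends_like_sentence = text.rstrip().endswith((".", "!", "?", ":"))
    if len(lines) == 1 and len(words) <= 8 and not ends_like_sentence:
        return "heading"

    return "paragraph"


def accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of blocks whose predicted class matches the expected class."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)
```

A learned classifier would replace the hand-written rules with a model trained on labeled blocks, but the evaluation step stays the same: predictions are compared with reference labels, and the share of matches is reported as accuracy.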

Keywords

OCR, layout analysis, Transformer, BERT, LayoutLM
