DOSTE: Document Similarity Matching considering Informative Name Entities

Publication year: 1404 (Solar Hijri)
Document type: Journal article
Language: English

The full text of this article is available as an 11-page PDF.

National scientific document ID:

JR_JADM-13-1_008

Indexing date: 12 Shahrivar 1404

Abstract:

Document similarity matching is essential for efficient text retrieval, plagiarism detection, and content analysis. Existing studies in this field fall into three categories: statistical analysis, deep learning, and hybrid approaches. However, to the best of our knowledge, none of them incorporates the importance of named entities into its methodology. In this paper, we propose DOSTE, a method that first extracts named entities and then uses them to enhance document similarity matching through statistical and graph-based analysis. Empirical results indicate that, by emphasizing named entities, DOSTE improves the average recall metric by 9% on average compared to baseline methods. Unlike LLM-based approaches, DOSTE does not require extensive GPU resources. Additionally, a non-empirical interpretation of the results indicates that DOSTE is particularly effective at identifying similarity in short documents and in complex document comparisons.
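The abstract does not give DOSTE's actual formulas, so the following is only a minimal sketch of the general idea of emphasizing named entities in a statistical similarity score. Everything in it is an assumption made for illustration: the capitalization heuristic in `extract_entities` stands in for a real named-entity recognizer, the `entity_weight` parameter and all function names are hypothetical, and the graph-based part of the method is not covered.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into alphanumeric tokens, keeping the original case."""
    return re.findall(r"[A-Za-z0-9]+", text)


def extract_entities(text):
    """Hypothetical stand-in for a NER model (assumption, not DOSTE's extractor).

    Treats capitalized tokens that are not sentence-initial as entities.
    """
    entities = set()
    for sentence in re.split(r"[.!?]+", text):
        tokens = tokenize(sentence)
        for tok in tokens[1:]:  # skip the sentence-initial token
            if tok[0].isupper():
                entities.add(tok.lower())
    return entities


def weighted_vector(text, entity_weight=3.0):
    """Term-frequency vector in which entity terms are multiplied by entity_weight."""
    entities = extract_entities(text)
    counts = Counter(tok.lower() for tok in tokenize(text))
    return {
        term: n * (entity_weight if term in entities else 1.0)
        for term, n in counts.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def entity_weighted_similarity(doc_a, doc_b, entity_weight=3.0):
    """Cosine similarity of the two entity-weighted term vectors."""
    return cosine(
        weighted_vector(doc_a, entity_weight),
        weighted_vector(doc_b, entity_weight),
    )


if __name__ == "__main__":
    a = "The report says Tehran hosted the summit."
    b = "A summit was hosted in Tehran last week."
    print(entity_weighted_similarity(a, b, entity_weight=1.0))  # no entity boost
    print(entity_weighted_similarity(a, b, entity_weight=3.0))  # entities emphasized
```

With `entity_weight=1.0` the score reduces to plain term-frequency cosine similarity, so comparing the two printed values shows how much a shared entity contributes under this assumed weighting.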

Authors

Milad Allhgholi

School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran.

Hossein Rahmani

School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran.

Amirhossein Derakhshan

School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran.

Saman Mohammadi Raouf

School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran.
