Analyzing Diagnostic Patterns in Scientific Cancer Articles Using Machine Learning Algorithms

Publication year: 1404 (Solar Hijri)
Document type: Conference paper
Language: English

The full text of this paper has not been published; only the abstract or extended abstract is available in the database.

National scientific document ID:

AIMS02_503

Indexing date: 29 Tir 1404

Abstract:

Background and Aims: Accurate cancer diagnosis remains a central challenge in medical research and calls for analytical approaches that improve both precision and efficiency. This study analyzed the abstracts of 7,570 scientific articles on thyroid, colon, and lung cancers, using machine learning models to uncover patterns and refine diagnostic methodologies. The underlying hypothesis was that patterns embedded in the text of these abstracts could yield useful insights for cancer classification and prediction.

Methods: The dataset was obtained from Kaggle and consisted of abstracts labeled by cancer type. Pandas was used for data handling, and the distribution of samples across cancer types was assessed. Preprocessing comprised cleaning the text, vectorizing it, and splitting it into training and testing subsets; custom Python functions were defined to streamline these steps. Word clouds generated with the WordCloud library highlighted frequently occurring terms in the abstracts, giving a graphical view of dominant research themes. Five machine learning algorithms (Linear Regression, Logistic Regression, Decision Tree, Random Forest, and K-Nearest Neighbors) were trained and evaluated.

Results: Of the five models, Random Forest achieved the highest accuracy in classifying cancer type from abstract content; accuracy was the evaluation metric. The word cloud analysis revealed key terms frequently associated with each cancer type, allowing rapid identification of significant keywords in the corpus and providing context on the focus of the existing literature.

Conclusion: The results highlight the effectiveness of machine learning, especially Random Forest, for analyzing large-scale cancer-related text. The findings suggest applications in automating literature reviews and improving diagnostic tools. Random Forest's superior performance underscores the value of ensemble methods for text classification. Future research could integrate more datasets, explore deep learning techniques, and refine preprocessing to improve accuracy.
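The workflow described in the abstract (vectorize the text, split it into training and testing subsets, train a classifier, report accuracy) can be illustrated with a minimal standard-library sketch. It is not the authors' code. It assumes a plain bag-of-words vectorization and a cosine-similarity K-Nearest Neighbors vote, since the abstract does not specify the vectorization technique or any model settings. The toy abstracts, the function names, and the parameters `k`, `test_fraction`, and `seed` are all hypothetical. In practice, a library such as scikit-learn would more likely be used for the models named in the abstract.

```python
import math
import random
from collections import Counter

# Toy, invented abstracts for illustration only; NOT the Kaggle dataset.
DATA = [
    ("thyroid nodule ultrasound biopsy papillary carcinoma", "thyroid"),
    ("papillary thyroid carcinoma iodine therapy nodule", "thyroid"),
    ("thyroid hormone nodule fine needle biopsy", "thyroid"),
    ("colon polyp colonoscopy screening adenoma", "colon"),
    ("colorectal adenoma colonoscopy colon resection", "colon"),
    ("colon screening polyp stool test colonoscopy", "colon"),
    ("lung nodule smoking ct scan adenocarcinoma", "lung"),
    ("non small cell lung adenocarcinoma smoking", "lung"),
    ("lung ct screening smoking bronchoscopy", "lung"),
]


def vectorize(text):
    """Bag-of-words vector (assumed technique): token -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text, k=3):
    """Majority label among the k training vectors most similar to text."""
    query = vectorize(text)
    ranked = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def split(data, test_fraction=0.33, seed=0):
    """Shuffle with a fixed seed and hold out a fraction for testing."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def accuracy(train, test, k=3):
    """Fraction of held-out abstracts whose predicted label is correct."""
    correct = sum(knn_predict(train, text, k) == label for text, label in test)
    return correct / len(test)


train_rows, test_rows = split(DATA)
train = [(vectorize(text), label) for text, label in train_rows]
print(f"accuracy on {len(test_rows)} held-out toy abstracts: "
      f"{accuracy(train, test_rows):.2f}")
```

With only nine invented abstracts, the printed accuracy says nothing about real performance. The sketch only makes the vectorize, split, train, and evaluate sequence concrete.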

Authors

Nahideh Khoshmaram

Student Research Committee, Tabriz University of Medical Sciences, Tabriz, Iran

Fahimeh Khoshmaram

Faculty of Physical Education and Sports Sciences, Kharazmi University, Tehran, Iran