Addressing Left Censoring in Cancer Genomics: A Semi-Supervised Learning Approach
Published in: 2nd International Cancer Genomics Congress
Year of publication: 1403 (Solar Hijri)
Document type: Conference paper
Language: English
The full text of this paper has not been provided and is not available.
National scientific document ID:
ICGCS02_354
Indexing date: 17 Dey 1403
Abstract:
Overall survival is a critical endpoint in cancer research, reflecting the length of time patients remain alive following diagnosis or treatment. However, clinical metadata are often incomplete, because it is difficult to follow every patient from the start to the end of a study. This issue is particularly significant in genomics, where data scarcity makes each data point valuable (Stephens et al., 2015). Our study addresses a common problem in this field: left censoring, where the exact time of an event is unknown but the event is known to have occurred before a certain point (Gómez et al., 2009). To tackle this issue, we propose a novel approach that uses semi-supervised learning to label data affected by left censoring. We employed two widely used semi-supervised learning algorithms: label spreading and co-training. Label spreading is an iterative algorithm that propagates labels from labeled to unlabeled data points based on their similarity. In this model, we used K-nearest neighbors as the fitness function, with Spearman distance because of the rank-dependent nature of transcriptomic data. Co-training, on the other hand, uses multiple views of the data to train separate classifiers that then teach each other. In this model, we used K-nearest neighbors and Random Forest as the fitness algorithms (Zhu and Goldberg, 2009). To assess this approach, we applied both methods to an RNA-Seq gene expression dataset for lung adenocarcinoma, retrieved from The Cancer Genome Atlas (TCGA) program, consisting of 514 samples, 188 of which were subject to left censoring. Both techniques performed strongly on this task. The label spreading model achieved perfect scores across all metrics, with accuracy, AUC, and F1-score all reaching 1.00. The co-training model performed nearly as well, with scores of 0.98 for all three metrics.
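The label spreading step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a symmetric K-nearest-neighbor graph built from Spearman distance (1 minus the rank correlation) and the normalized-graph iteration commonly used for label spreading. The function names, the toy expression profiles, and the values of `k`, `alpha`, and `n_iter` are all hypothetical choices made for illustration.

```python
from math import sqrt

def ranks(values):
    """Return 1-based average ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            result[order[pos]] = mean_rank
        start = end + 1
    return result

def spearman_distance(a, b):
    """1 - Spearman rho (Pearson correlation of ranks). Assumes non-constant profiles."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return 1.0 - cov / sqrt(va * vb)

def knn_affinity(X, k):
    """Symmetric 0/1 affinity matrix linking each sample to its k nearest neighbors."""
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest = sorted(others, key=lambda j: spearman_distance(X[i], X[j]))[:k]
        for j in nearest:
            W[i][j] = W[j][i] = 1.0
    return W

def label_spreading(X, y, k=2, alpha=0.8, n_iter=50):
    """Spread labels over the graph; y uses -1 for unlabeled samples."""
    n = len(X)
    classes = sorted({label for label in y if label != -1})
    W = knn_affinity(X, k)
    degree = [sum(row) for row in W]
    # Symmetrically normalized affinity: S = D^(-1/2) W D^(-1/2)
    S = [[W[i][j] / sqrt(degree[i] * degree[j]) for j in range(n)] for i in range(n)]
    # One-hot seed matrix; unlabeled rows are all zeros
    Y0 = [[1.0 if y[i] == c else 0.0 for c in classes] for i in range(n)]
    F = [row[:] for row in Y0]
    for _ in range(n_iter):
        # F <- alpha * S F + (1 - alpha) * Y0
        F = [[alpha * sum(S[i][j] * F[j][c] for j in range(n)) + (1 - alpha) * Y0[i][c]
              for c in range(len(classes))] for i in range(n)]
    return [classes[max(range(len(classes)), key=lambda c: F[i][c])] for i in range(n)]

# Hypothetical toy "expression profiles": four genes per sample, two rank patterns.
X = [
    [1, 2, 3, 4], [2, 3, 5, 9], [0.5, 1, 4, 6], [1, 3, 2, 8],   # rising profiles
    [9, 5, 3, 1], [8, 6, 2, 1], [7, 4, 5, 0.5], [6, 5, 2, 1],   # falling profiles
]
y = [0, -1, -1, -1, 1, -1, -1, -1]  # one labeled seed per group, the rest unlabeled

predicted = label_spreading(X, y)
print(predicted)
```

On this toy input, the three unlabeled profiles in each group receive the label of their group's seed sample, because the rank-based neighbor graph connects samples only within a group. Using ranks rather than raw values means two profiles with the same gene ordering are treated as identical regardless of scale, which is the property the abstract attributes to Spearman distance.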
These findings suggest that semi-supervised learning approaches, particularly label spreading and co-training, could effectively address left censoring in genomics data. By accurately labeling data that would otherwise be discarded as unusable, our method has the potential to significantly expand the number of samples available for downstream analyses in cancer genomics research. This expansion could lead to more robust analyses with greater statistical power, and potentially to new insights in the field. Future work could focus on benchmarking these techniques on larger and more diverse genomics datasets, as well as exploring their potential in other areas of medical research where left censoring is a common issue.
References:
1. Stephens, Z. D., Lee, S. Y., Faghri, F., Campbell, R. H., Zhai, C., Efron, M. J., ... & Robinson, G. E. (2015). Big data: astronomical or genomical? PLoS Biology, 13(7), e1002195.
2. Gómez, G., Calle, M. L., Oller, R., & Langohr, K. (2009). Tutorial on methods for interval-censored data and their implementation in R. Statistical Modelling, 9(4), 259-297.
3. Zhu, X., & Goldberg, A. B. (2009). Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1), 1-130.
Keywords:
Authors
Seyed Alireza Khanghahi
Department of Biophysics, Faculty of Biological Science, Tarbiat Modares University, Tehran, Iran
Sina Farazmandi
Department of Biophysics, Faculty of Biological Science, Tarbiat Modares University, Tehran, Iran
Parviz Abdolmaleki
Department of Biophysics, Faculty of Biological Science, Tarbiat Modares University, Tehran, Iran