Innovative Dictionary Learning and Sparse Coding for Effective Feature Selection in Cancer Gene Expression Data
Published in: 2nd International Congress on Cancer Genomics
Year of publication: 1403 (Solar Hijri)
Document type: Conference paper
Language: English
The full text of this paper has not been provided and is not available.
National scientific document ID:
ICGCS02_530
Indexing date: 17 Dey 1403
Abstract:
Feature selection for predictive analytics involves choosing a subset of features that predicts an outcome of interest as accurately as possible while keeping the selected set small. In cancer genomics, where gene expression data often include tens of thousands of genes, scalable and efficient feature selection methods are critical. Gene expression analysis has emerged as a powerful approach to assessing gene activity, particularly in cancer genomics, where it can elucidate the molecular mechanisms driving cancer and reveal heterogeneity and functional diversity among cancer cell populations.
Methods: This study proposes dictionary learning methods, including k-Singular Value Decomposition (k-SVD), as an advanced approach to feature selection and dimensionality reduction in RNA sequencing data, specifically a breast cancer dataset derived from the TCGA BRCA project. Dictionary learning provides a framework for sparse representation of data, which is well suited to the high-dimensional nature of gene expression datasets. The algorithm alternates between updating a dictionary of features and updating the sparse coefficients, extracting the most informative features while discarding redundant information. We further analyzed the biological relevance of the selected features by mapping them to known cancer pathways. Additionally, we compared this technique with conventional methods such as Least Absolute Shrinkage and Selection Operator (LASSO), Group LASSO, and Orthogonal Matching Pursuit (OMP) to validate the efficacy of k-SVD in identifying the most relevant gene expression signatures in cancer studies.
Results: In our analysis of transcriptome sequencing datasets from cancer studies, dictionary learning methods, particularly k-SVD, significantly reduce the dimensionality of high-throughput RNA sequencing data while preserving critical predictive information.
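The alternating update described under Methods can be illustrated with a minimal NumPy sketch of k-SVD, using OMP for the sparse coding step. It is not the authors' implementation: the synthetic matrix `Y`, the function names `omp` and `ksvd`, and all parameter values are assumptions for illustration. The abstract also does not say how the learned dictionary is turned into a gene ranking, so the final scoring line (largest absolute loading of each gene across atoms) is only one hypothetical choice.

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Orthogonal Matching Pursuit: greedy sparse code of y over the columns of D."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(n_nonzero):
        # Pick the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        # Re-fit all selected atoms jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

def ksvd(Y, n_atoms, n_nonzero, n_iter=10, seed=0):
    """Minimal k-SVD: Y (features x samples) ~ D (features x atoms) @ X (atoms x samples)."""
    rng = np.random.default_rng(seed)
    # Initialise the dictionary with randomly chosen, normalised data columns.
    idx = rng.choice(Y.shape[1], size=n_atoms, replace=False)
    D = Y[:, idx] / (np.linalg.norm(Y[:, idx], axis=0) + 1e-12)
    for _ in range(n_iter):
        # Sparse coding step: code every sample with the dictionary held fixed.
        X = np.column_stack([omp(D, Y[:, j], n_nonzero) for j in range(Y.shape[1])])
        # Dictionary update step: refine one atom at a time via a rank-1 SVD.
        for k in range(n_atoms):
            users = np.flatnonzero(X[k])
            if users.size == 0:
                continue
            # Residual without atom k, restricted to the samples that use it.
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]
            X[k, users] = s[0] * Vt[0]
    return D, X

# Synthetic stand-in for an expression matrix: 50 genes (rows) x 40 samples (columns).
rng = np.random.default_rng(1)
Y = rng.normal(size=(50, 40))
D, X = ksvd(Y, n_atoms=8, n_nonzero=3)
rel_err = np.linalg.norm(Y - D @ X) / np.linalg.norm(Y)

# Hypothetical gene score (an assumption, not from the abstract): largest |loading| over atoms.
gene_scores = np.abs(D).max(axis=1)
top_genes = np.argsort(gene_scores)[::-1][:10]
print(f"relative reconstruction error: {rel_err:.3f}")
print("indices of the 10 highest-scoring genes:", top_genes)
```

Each sample is represented by at most `n_nonzero` atoms, which is what makes the representation sparse. On real data, `n_atoms` and `n_nonzero` would need tuning.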
In our analysis of the breast cancer gene expression dataset, the dictionary learning methods demonstrated superior performance in identifying key gene expression signatures associated with cancer initiation, progression, and metastasis. Compared with traditional methods, dictionary learning consistently performs as well or better in feature selection and binary classification tasks, achieving a high AUC (0.9980) and balanced accuracy (0.9934) in tumor vs. normal classification. Moreover, feature selection with dictionary learning proved robust, with lower false positive rates and better adaptability to complex gene expression profiles. The computational efficiency and scalability of dictionary learning methods were further enhanced through parallel processing optimizations, making them highly suitable for large-scale gene expression datasets.
Conclusion: Dictionary learning, exemplified by algorithms such as k-SVD, is a robust, scalable, and generalizable technique for feature selection in gene expression data analysis in cancer genomics. As the complexity of high-dimensional gene expression datasets in cancer research increases, traditional methods often struggle with high dimensionality and with achieving convergence. Dictionary learning addresses these challenges through a sparse representation framework that improves both feature selection efficiency and predictive power. Our extensive evaluations suggest that dictionary learning is a promising tool for decoding cancer transcriptomic profiles, providing new insights into cancer biology and potential therapeutic targets. Our future work will focus on optimizing parameter tuning and integrating dictionary learning with deep learning frameworks to further improve predictive modeling.
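The two metrics reported in the Results, AUC and balanced accuracy, can be computed from scratch as in the plain-Python sketch below. The labels and scores are made up for illustration (1 = tumor, 0 = normal) and are not the study's data. The 0.5 decision threshold and the function names are assumptions.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on tumor) and specificity (recall on normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)

def roc_auc(y_true, scores):
    """AUC as the fraction of (tumor, normal) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative, imbalanced toy data: six tumor samples, two normal samples.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.65, 0.1]
# Assumed decision threshold of 0.5 on the classifier score.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

bal_acc = balanced_accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"balanced accuracy = {bal_acc:.4f}, AUC = {auc:.4f}")
```

Balanced accuracy averages the per-class recalls, so a majority class cannot dominate the score. AUC does not depend on the threshold because it only compares how tumor and normal samples are ranked.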
Keywords:
Authors
Bahar Mahdavi
Department of Computer Sciences, Faculty of Mathematical Sciences, Tarbiat Modares University, Tehran, Iran
Soheil Tabatabaei Mortazavi
Department of Computer Sciences, Faculty of Mathematical Sciences, Tarbiat Modares University, Tehran, Iran
Muhammad Moein Salehi Nejad
Department of Medical Genetics, Faculty of Medical Sciences, Tarbiat Modares University, Tehran, Iran
Mehdi Rajabizadeh
Department of Biodiversity, Institute of Science, High Technology and Environmental Sciences, Graduate University of Advanced Technology, Kerman, Iran
Mansoor Rezghi
Department of Computer Sciences, Faculty of Mathematical Sciences, Tarbiat Modares University, Tehran, Iran