Compositional bias in metagenomic data analysis

Year of publication: 1398 (Solar Hijri)
Document type: Conference paper
Language: English

The full text of this paper has not been published; only the abstract or extended abstract is available in the database.

National scientific document ID:

IBIS09_017

Indexing date: 19 Esfand 1399 (Solar Hijri)

Abstract:

Metagenomic gene abundance data are affected by high levels of systematic variability, which can greatly reduce statistical power and increase false positives. Many sources of systematic variation can affect the observed abundance of genes and microorganisms. An important one is the difference in sequencing depth: each sample yields a different number of sequencing reads. Other sources include inconsistencies in sampling methods and DNA extraction, variation in the quality of sequencing runs, errors in read mapping, and incomplete reference gene catalogs. In addition, differences in the average genome size of the microorganisms, in species richness, and in the GC content of the reads can affect the observed gene abundance. NGS data are also inherently compositional, meaning that the relative abundance of each nucleotide fragment depends on the abundance of the other fragments. This property stems from the sequencing equipment and the underlying methodology, and the resulting sequences are affected by the bias introduced during amplification and subsequent nucleotide sampling. Hence, compositionality arises because the measurements capture only an unknown fraction of the whole (e.g., metagenomic count data generated by NGS). Compositional Data Analysis (CoDA) refers to methods for handling and resolving this bias. Metagenomic count data also face more severe challenges than other NGS data. The first is the highly variable number of sequenced reads, or sequencing depth, across samples. The second is the very high percentage of zeros in metagenomic count data (roughly 50% to 90%), referred to as zero-inflation.
Metagenomic data are also very high dimensional compared with other NGS data (e.g., a gut microbiome gene catalog contains ~10M gene sequences, or features). Moreover, because of the low sampling frequency of DNA, very rare taxa may go unrecorded, producing what are called technical zeros. Other taxa are not captured because they are absent from the population, producing structural zeros. Further challenges are the size of the study (the number of taxa or genes is much larger than the number of samples) and the large variance in taxa distributions (over-dispersion).
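The quantities named in the abstract (sequencing depth, the fraction of zeros, and compositionality) can be illustrated with a small sketch. The abstract does not name a specific CoDA method; the centered log-ratio (CLR) transform used here is one common choice and is an assumption of this example, as are the pseudocount of 0.5 used to handle zeros, the helper names, and the toy count matrix.

```python
import math

def library_sizes(counts):
    """Total reads per sample (sequencing depth); one row per sample."""
    return [sum(row) for row in counts]

def zero_fraction(counts):
    """Fraction of entries in the count matrix that are exactly zero."""
    total = sum(len(row) for row in counts)
    zeros = sum(1 for row in counts for c in row if c == 0)
    return zeros / total

def clr(row, pseudocount=0.5):
    """Centered log-ratio transform of one sample.

    The pseudocount (an assumption of this sketch) is added to every
    count so that zeros do not produce log(0).
    """
    logs = [math.log(c + pseudocount) for c in row]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical toy matrix: 3 samples x 4 genes.
# Sample 2 has the same composition as sample 1 at 10x the depth.
counts = [
    [10, 20, 5, 65],
    [100, 200, 50, 650],
    [0, 3, 0, 7],
]

depths = library_sizes(counts)
zero_frac = zero_fraction(counts)
clr_rows = [clr(row) for row in counts]

# Without a pseudocount, CLR depends only on the composition, so the
# two zero-free samples map to the same vector despite different depths.
clr_a = clr(counts[0], pseudocount=0)
clr_b = clr(counts[1], pseudocount=0)

print("depths:", depths)
print("zero fraction:", round(zero_frac, 3))
print("same CLR at different depth:",
      all(abs(a - b) < 1e-9 for a, b in zip(clr_a, clr_b)))
```

Each CLR-transformed sample sums to zero by construction, which reflects the fact that only ratios between features carry information in compositional data. Adding a pseudocount breaks exact scale invariance, which is one reason zero handling is a challenge in its own right for zero-inflated metagenomic counts.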

Authors

Mohammad Hossein Norouzi Beirami

Laboratory of Complex Biological Systems and Bioinformatics (CBB), Department of Bioinformatics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran

Sayed Amir Marashi

Department of Biotechnology, College of Science, University of Tehran, Tehran, Iran

Ali Mohammad Banaei Moghaddam

Laboratory of Genomics and Epigenomics (LGE), Department of Biochemistry, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran

Kaveh Kavousi

Laboratory of Complex Biological Systems and Bioinformatics (CBB), Department of Bioinformatics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran