Physlr-mol: Physical Map of Linked-reads for De-Novo Barcode to Molecule Deconvolution

Year of publication: 1399 (Solar Hijri calendar)
Document type: Conference paper
Language: English

The full text of this paper has not been provided and is not available.

National scientific document ID:

CIGS16_006

Indexing date: 14 Ordibehesht 1400

Abstract:

Background and Aim: Long-range sequence information extracted by novel genome sequencing technologies has drastically transformed our understanding of genomics. Unlike expensive and error-prone long-read sequencing technologies such as those of Oxford Nanopore and PacBio, linked-read technologies such as 10xG Chromium provide long-range information while using high-quality short-read platforms. They assign an identical barcode to the short reads derived from the same long DNA molecule and then sequence those reads with short-read sequencing technologies, thus offering the fidelity and cost of short reads. One main challenge in analyzing linked reads arises from barcode reuse, whereby distinct molecules are assigned the same barcode.

Methods: Here we present Physlr-molecule, a method that deconvolutes these barcodes into their component molecules without using a reference genome. A barcode overlap graph is constructed, where each edge represents two barcodes that share minimizer k-mers. To split a barcode into molecules, we inspect each barcode's neighbourhood graph, the vertex-induced subgraph of the barcode's immediate neighbours. This neighbourhood subgraph is composed of multiple communities, one community per molecule. Physlr-molecule detects these communities in millions of subgraphs, each comprising hundreds to thousands of vertices. In such a setting, state-of-the-art community-detection algorithms fail to scale. To reduce the running time of these superlinear-time algorithms, each subgraph is partitioned into chunks; communities are detected using k-clique percolation and a cosine similarity measure, and then merged if needed.

Results: The novel community-detection approach described above reduces the running time from 8 weeks to 8 minutes for Drosophila melanogaster, and 18 minutes for H. sapiens (hg004). The deconvoluted barcode set resulted in a chromosome-level assembly of the human genome (hg004 10xG dataset) with an NG50 of 59.2 Mb, compared to 38.5 Mb by Supernova, the only existing tool for linked-read assembly. It also decreased the number of misassemblies from 1071 to 507. Three chromosomes are assembled in one scaffold, and six are 90% assembled in only two pieces.

Conclusion: Physlr-molecule efficiently deconvolutes barcodes of linked reads. By building a physical map from the deconvoluted set, it can scaffold draft genome assemblies and yield assemblies of chromosome-level contiguity. Physlr accepts linked reads from different technologies, such as 10xG Chromium and MGI stLFR.
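The neighbourhood-subgraph idea described in the Methods can be illustrated with a minimal Python sketch. It is not the Physlr implementation: the function names, the toy minimizer sets, and the `min_shared` threshold are all hypothetical, and plain connected components stand in for the k-clique percolation and cosine-similarity steps named in the abstract. The chunking of large subgraphs is also omitted.

```python
from collections import defaultdict
from itertools import combinations


def build_overlap_graph(barcode_minimizers, min_shared=1):
    """Connect two barcodes when they share at least `min_shared` minimizers."""
    # Index barcodes by minimizer so only co-occurring pairs are compared.
    by_minimizer = defaultdict(set)
    for barcode, minimizers in barcode_minimizers.items():
        for m in minimizers:
            by_minimizer[m].add(barcode)
    shared = defaultdict(int)
    for barcodes in by_minimizer.values():
        for u, v in combinations(sorted(barcodes), 2):
            shared[(u, v)] += 1
    graph = {b: set() for b in barcode_minimizers}
    for (u, v), count in shared.items():
        if count >= min_shared:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def neighbourhood_subgraph(graph, barcode):
    """Subgraph induced by the immediate neighbours of `barcode` (itself excluded)."""
    nodes = graph[barcode]
    return {u: graph[u] & nodes for u in nodes}


def communities(subgraph):
    """Connected components: a simplified stand-in for community detection."""
    seen, result = set(), []
    for start in sorted(subgraph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in component:
                continue
            component.add(u)
            stack.extend(subgraph[u] - component)
        seen |= component
        result.append(component)
    return result


def split_barcode(graph, barcode):
    """One community of neighbouring barcodes per putative molecule."""
    return communities(neighbourhood_subgraph(graph, barcode))


# Toy data (assumed): barcode X was reused for two molecules, one covering
# minimizers 1-3 and another covering minimizers 10-12.
toy = {
    "X": {1, 2, 3, 10, 11, 12},
    "A": {1, 2},
    "B": {2, 3},
    "C": {10, 11},
    "D": {11, 12},
}
graph = build_overlap_graph(toy)
molecules = split_barcode(graph, "X")
print(molecules)
```

In this toy case the neighbours of X fall into two groups, A with B and C with D, which would correspond to two molecules sharing the barcode X. On real data, connected components would tend to over-merge, since spurious edges can link unrelated groups; that is one reason a stricter community-detection step is needed.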

Authors

Amirhossein Afshinfard

University of British Columbia, Vancouver, Canada; BC Cancer Genome Science Centre, Vancouver, Canada

Shaun Jackman

10X Genomics, CA, USA

Lauren Coombe

BC Cancer Genome Science Centre, Vancouver, Canada

Justin Chu

BC Cancer Genome Science Centre, Vancouver, Canada

Johnathan Wong

BC Cancer Genome Science Centre, Vancouver, Canada

Vladimir Nikolic

University of British Columbia, Vancouver, Canada; BC Cancer Genome Science Centre, Vancouver, Canada