Computer Engineering and Applications, 2022, Vol. 58, Issue (3): 201-206. DOI: 10.3778/j.issn.1002-8331.2008-0204

• Pattern Recognition and Artificial Intelligence •

SuperLLEC: New Assembly and Error Correction Algorithm for Long Reads and Linked-Reads

CUI Yaxuan, ZHANG Shaoqiang

  1. College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
  • Online: 2022-02-01 Published: 2022-01-28

Abstract: To address the high error rate of third-generation sequencing data and improve the accuracy of genome assembly, an assembly and error correction algorithm called SuperLLEC is designed for PacBio long-read data, based on 10X Genomics linked-read sequencing data. The algorithm uses Wtdbg2 to assemble the PacBio long reads of a genome into scaffolds. It then uses Bowtie2 to align the linked-reads to these scaffolds and further assembles the scaffolds according to the barcodes of the linked-reads. For each mismatched alignment site, Fisher's exact test is used to predict whether the site is a single nucleotide polymorphism (SNP) or a base miscalled by PacBio sequencing. Algorithm comparison experiments on long-read and linked-read data from three groups of human cells show that SuperLLEC noticeably improves the accuracy of genome assembly, the NG50 length, and the accuracy of SNP site prediction.

Key words: linked-reads, long reads, scaffolds, assembly, error correction, Fisher's exact test
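
The abstract does not say how the contingency table for Fisher's exact test is built at a mismatched site, so the following is only a minimal sketch of the test itself, not the authors' implementation. It computes a two-sided p-value for a 2x2 table of read counts with the Python standard library. The function name `fisher_exact_two_sided`, the table layout (two barcode groups by scaffold base versus alternative base), the counts, and the 0.05 threshold are all assumptions made for illustration.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability that the top-left cell equals x,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table that is no more likely than the
    # observed one; the small tolerance guards against float round-off.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts at one mismatched site (an assumed layout, not the paper's):
# rows are two barcode groups, columns are linked-reads supporting the
# alternative base versus the scaffold base.
p_value = fisher_exact_two_sided(12, 1, 2, 11)
allele_support_differs = p_value < 0.05  # assumed significance threshold
print(f"p = {p_value:.3g}, allele support differs between groups: {allele_support_differs}")
```

In practice, `scipy.stats.fisher_exact` offers the same test; the standard-library version is used here only to keep the sketch self-contained.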