We present a specialized compressor designed for efficient storage of FASTQ files produced by high-throughput DNA sequencers. Because the method has been optimized for compression quality, it is especially suitable for long-term storage and for genome research centers that process huge amounts of data (counted in petabytes). The proposed compressor uses high-order statistical models for range encoding. These are similar to Markov models, except that the whole input is considered when building a symbol's context. DNA reads are compressed in an LZ-style manner using a model of order 5 to 7, while nucleotide quality scores are encoded with a 3rd-order model.
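To illustrate why a higher-order context model helps, the following minimal sketch estimates the ideal code length (in bits) that a range coder could approach when driven by an adaptive order-k model. It is not the authors' implementation: it uses a plain order-k Markov context (the k preceding symbols) rather than the whole-input context described above, and the function name `adaptive_code_length`, the add-one smoothing, and the toy input are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def adaptive_code_length(symbols, order, alphabet):
    """Ideal code length in bits under an adaptive order-`order` model.

    Hypothetical helper: the context is simply the `order` preceding
    symbols, and probabilities use add-one (Laplace) smoothing.
    """
    alphabet_size = len(set(alphabet))
    # counts[context][symbol] = times `symbol` followed `context` so far
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for i, sym in enumerate(symbols):
        ctx = symbols[max(0, i - order):i]
        ctx_counts = counts[ctx]
        seen = sum(ctx_counts.values())
        # Probability the model assigns to `sym` before seeing it
        p = (ctx_counts[sym] + 1) / (seen + alphabet_size)
        total_bits += -math.log2(p)
        # Update the model only after coding, as a decoder would
        ctx_counts[sym] += 1
    return total_bits


if __name__ == "__main__":
    reads = "ACGT" * 50  # toy, highly repetitive input
    for k in (0, 3):
        bits = adaptive_code_length(reads, k, "ACGT")
        print(f"order {k}: {bits:.1f} bits")
```

On repetitive input the order-3 model should need far fewer bits than the order-0 model, since each context quickly becomes predictive. A real range coder would turn these probabilities into an actual bitstream, at a cost close to the estimate.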
DOI: 10.1051/ro/2015039
Keywords: High-throughput DNA sequencing, data compression, FASTQ files
Chlopkowski, Marek; Antczak, Maciej; Slusarczyk, Michal; Wdowinski, Aleksander; Zajaczkowski, Michal; Kasprzak, Marta. High-order statistical compressor for long-term storage of DNA sequencing data. RAIRO - Operations Research - Recherche Opérationnelle, Special issue: Research on Optimization and Graph Theory dedicated to COSI 2013 / Special issue: Recent Advances in Operations Research in Computational Biology, Bioinformatics and Medicine, Volume 50 (2016) no. 2, pp. 351-361. doi: 10.1051/ro/2015039. http://www.numdam.org/articles/10.1051/ro/2015039/