Digital search trees and chaos game representation
ESAIM: Probability and Statistics, Tome 13 (2009), pp. 15-37.

In this paper, we consider a possible representation of a DNA sequence in a quaternary tree, in which one can visualize repetitions of subwords (seen as suffixes of subsequences). The CGR-tree turns a sequence of letters into a Digital Search Tree (DST), obtained from the suffixes of the reversed sequence. Several results are known concerning the height, the insertion depth for DST built from independent successive random sequences having the same distribution. Here the successive inserted words are strongly dependent. We give the asymptotic behaviour of the insertion depth and the length of branches for the CGR-tree obtained from the suffixes of a reversed i.i.d. or markovian sequence. This behaviour turns out to be at first order the same one as in the case of independent words. As a by-product, asymptotic results on the length of longest runs in a markovian sequence are obtained.

DOI : 10.1051/ps:2007043
Classification : 60C05, 68R15, 92D20, 05D40
Mots-clés : random tree, digital search tree, CGR, lengths of the paths, height, insertion depth, asymptotic growth, strong convergence
@article{PS_2009__13__15_0,
     author = {C\'enac, Peggy and Chauvin, Brigitte and Ginouillac, St\'ephane and Pouyanne, Nicolas},
     title = {Digital search trees and chaos game representation},
     journal = {ESAIM: Probability and Statistics},
     pages = {15--37},
     publisher = {EDP-Sciences},
     volume = {13},
     year = {2009},
     doi = {10.1051/ps:2007043},
     mrnumber = {2493853},
     language = {en},
     url = {http://www.numdam.org/articles/10.1051/ps:2007043/}
}
TY  - JOUR
AU  - Cénac, Peggy
AU  - Chauvin, Brigitte
AU  - Ginouillac, Stéphane
AU  - Pouyanne, Nicolas
TI  - Digital search trees and chaos game representation
JO  - ESAIM: Probability and Statistics
PY  - 2009
SP  - 15
EP  - 37
VL  - 13
PB  - EDP-Sciences
UR  - http://www.numdam.org/articles/10.1051/ps:2007043/
DO  - 10.1051/ps:2007043
LA  - en
ID  - PS_2009__13__15_0
ER  - 
%0 Journal Article
%A Cénac, Peggy
%A Chauvin, Brigitte
%A Ginouillac, Stéphane
%A Pouyanne, Nicolas
%T Digital search trees and chaos game representation
%J ESAIM: Probability and Statistics
%D 2009
%P 15-37
%V 13
%I EDP-Sciences
%U http://www.numdam.org/articles/10.1051/ps:2007043/
%R 10.1051/ps:2007043
%G en
%F PS_2009__13__15_0
Cénac, Peggy; Chauvin, Brigitte; Ginouillac, Stéphane; Pouyanne, Nicolas. Digital search trees and chaos game representation. ESAIM: Probability and Statistics, Tome 13 (2009), pp. 15-37. doi : 10.1051/ps:2007043. http://www.numdam.org/articles/10.1051/ps:2007043/

[1] M. Abramowitz and I.A. Stegun, Handbook of mathematical functions with formulas, graphs, and mathematical tables, National Bureau of Standards Applied Mathematics Series 55. For sale by the superintendent of Documents, U.S. Government Printing Office, Washington, D.C. (1964). | MR | Zbl

[2] D. Aldous and P. Shields, A diffusion limit for a class of randomly-growing binary search trees. Probab. Theory Related Fields 79 (1998) 509-542. | MR | Zbl

[3] J.S. Almeida, J.A. Carriço, A. Maretzek, P.A. Noble and M. Fletcher, Analysis of genomic sequences by Chaos Game Representation. Bioinformatics 17 (2001) 429-437.

[4] G. Blom and D. Thorburn, How many random digits are required until given sequences are obtained? J. Appl. Probab. 19 (1982) 518-531. | MR | Zbl

[5] P. Cénac, Test on the structure of biological sequences via chaos game representation. Stat. Appl. Genet. Mol. Biol. 4 (2005) 36 (electronic). | MR | Zbl

[6] P. Cénac, G. Fayolle and J.M. Lasgouttes, Dynamical systems in the analysis of biological sequences. Technical Report 5351, INRIA (2004).

[7] M. Drmota, The variance of the height of digital search trees. Acta Informatica 38 (2002) 261-276. | MR | Zbl

[8] M. Duflo, Random Iterative Models. Springer (1997). | MR | Zbl

[9] P. Erdős and P. Révész, On the length of the longest head run, in Topics in Information Theory 16 (1975) 219-228, I. Csizàr and P. Elias, Eds. North-Holland, Amsterdam Colloq. Math. Soc. Jànos Bolyai. | MR | Zbl

[10] P. Erdős and P. Révész, On the length of the longest head-run. In Topics in information theory (Second Colloq., Keszthely, 1975). Colloq. Math. Soc. János Bolyai 16 (1977) 219-228. | MR | Zbl

[11] J. Fayolle, Compression de données sans perte et combinatoire analytique. Thèse de l'université Paris VI (2006), available at http://www.lri.fr/ fayolle/these.pdf.

[12] J.C. Fu, Bounds for reliability of large consecutive-k-out-of-n:f system. IEEE Trans. Reliability 35 (1986) 316-319. | Zbl

[13] J.C. Fu and M.V. Koutras, Distribution theory of runs: a markov chain approach. J. Amer. Statist. Soc. 89 (1994) 1050-1058. | MR | Zbl

[14] H. Gerber and S. Li, The occurence of sequence patterns in repeated experiments and hitting times in a markov chain. Stochastic Process. Appl. 11 (1981) 101-108. | MR | Zbl

[15] N. Goldman, Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences. Nucleic Acids Res. 21 (1993) 2487-2491.

[16] L. Gordon, M.F. Schilling and M.S. Waterman, An extreme value theory for long head runs. Probab. Theory Related Fields 72 (1986) 279-287. | MR | Zbl

[17] H.J. Jeffrey, Chaos Game Representation of gene structure. Nucleic Acid. Res. 18 (1990) 2163-2170.

[18] M.V. Koutras, Waiting times and number of appearances of events in a sequence of discrete random variables, in Advances in combinatorial methods and applications to probability and statistics, Stat. Ind. Technol., Birkhäuser Boston, Boston, MA (1997) 363-384. | MR | Zbl

[19] Shuo-Yen Robert Li, A martingale approach to the study of occurrence of sequence patterns in repeated experiments. Ann. Probab. 8 (1980) 1171-1176. | MR | Zbl

[20] H.M. Mahmoud, Evolution of random search trees. Wiley-Interscience Series in Discrete Mathematics and Optimization. John Wiley & Sons Inc., New York (1992). | MR | Zbl

[21] W. Penney, Problem: Penney-ante. J. Recreational Math. 2 (1969) 241.

[22] V. Petrov, On the probabilities of large deviations for sums of independent random variables. Theory Prob. Appl. (1965) 287-298. | MR | Zbl

[23] B. Pittel, Asymptotic growth of a class of random trees. Annals Probab. 13 (1985) 414-427. | MR | Zbl

[24] V. Pozdnyakov, J. Glaz, M. Kulldorff and J.M. Steele, A martingale approach to scan statistics. Ann. Inst. Statist. Math. 57 (2005) 21-37. | MR | Zbl

[25] M. Régnier, A unified approach to word occurence probabilities. Discrete Appl. Math. 104 (2000) 259-280. | MR | Zbl

[26] G. Reinert, S. Schbath and M.S. Waterman, Probabilistic and statistical properties of words: An overview. J. Comput. Biology 7 (2000) 1-46.

[27] S. Robin and J.J. Daudin, Exact distribution of word occurences in a random sequence of letters. J. Appl. Prob. 36 (1999) 179-193. | MR | Zbl

[28] A. Roy, C. Raychaudhury and A. Nandy, Novel techniques of graphical representation and analysis of DNA sequences - A review. J. Biosci. 23 (1998) 55-71.

[29] S.S. Samarova, On the length of the longest head-run for a markov chain with two states. Theory Probab. Appl. 26 (1981) 498-509. | MR | Zbl

[30] V. Stefanov and A.G. Pakes, Explicit distributional results in pattern formation. Ann. Appl. Probab. 7 (1997) 666-678. | MR | Zbl

[31] W. Szpankowski, Average Case Analysis of Algorithms on Sequences. John Wiley & Sons, New York (2001). | MR | Zbl

[32] D. Williams, Probability with martingales. Cambridge Mathematical Textbooks. Cambridge University Press, Cambridge (1991). | MR | Zbl

Cité par Sources :