Research


List of publications



Abstracts


In a tabular database, patterns which occur over a frequency threshold are called frequent patterns. They are central in numerous data processes and various efficient algorithms were recently designed for mining them. Unfortunately, very few is known about the real difficulty of this mining, which is closely related to the number of frequent patterns. The worst case analysis always leads to an exponential number of frequent patterns, but experimentations show that algorithms become efficient for reasonable frequency thresholds. We perform here a probabilistic analysis of the number of frequent patterns. We first introduce a general model of random databases that encompasses all the previous classical models. In this model, the rows of the database are seen as independent words generated by the same probabilistic source [i.e. a random process that emits symbols]. Under natural conditions on the source, the average number of frequent patterns is studied for various frequency thresholds. Then, we exhibit a large class of sources, the class of dynamical sources, which is proven to satisfy our general conditions. This finally shows that our results hold in a quite general context of random databases. pdf
This thesis deals with two main algorithmical domains: Data Mining and Arithmetical computations. In both, we are interested in the average-case analysis of algorithms, and, we adopt more precisely the dynamical analysis point of vue which is a mixed method between Analysis of Algorithms and Dynamical Systems. The Euclid algorithms compute the gcd of two numbers; these are fundamental blocks in computer algebra, but their fine probabilistic behavior is always unknown. Thanks to Dynamical Analysis methods, recent important results have been obtained. In this thesis, we extend this approach to a precise analysis of parameters, as the binary complexity or the size of remainders. These parameters are essential for the Divide and Conquer gcd algorithm due to Knuth-Schönhage. Dynamical Analysis is also used for proven computations of spectral constants. The dynamical approach is then adapted to on polynomial Euclid algorithms even if, in this case, classical Analytic Combinatorics already applies. We also deal with Data Mining. We restrict ourselves to binary databases where the knowledge is represented by 'frequent patterns'. The number of frequent patterns is essential for analysing algorithms but experiments show that it significantly changes with the parameters of the database. Then, the worst case analysis is not meaningful in practice. In this thesis, we elucidate the average beahvior of the number of frequent patterns under a large model of databases built with eventually correlated sources. ps pdf
We provide sharp estimates for the probabilistic behaviour of the main parameters of the Euclid Algorithms, both on polynomials and on integer numbers. We study in particular the distribution of the bit-complexity which involves two main parameters : digit-costs and length of remainders. We first show here that an asymptotic gaussian law holds for the length of remainders at a fraction of the execution, which exhibits a deep regularity phenomenon. Then, we study in each framework -polynomials (P) and integer numbers (I)- two gcd algorithms, the standard one (S) which only computes the gcd, and the extended one (E) which also computes the Bezout pair, and is widely used for computing modular inverses. The extended algorithm is more regular than the standard one, and this explains that our results are more precise for the Extended algorithm: we exhibit an asymptotic gaussian law for the bit-complexity of the extended algorithm, in both cases (P) and (I). We also prove that an asymptotic gaussian law for the bit-complexity of the standard gcd in case (P), but we do not succeed obtaining a similar result in case (I). The integer study is more involved than the polynomial study, as it is usually the case. In the polynomial case, we deal with the central tools of the distributional analysis of algorithms, namely bivariate generating functions. In the integer case, we are led to dynamical methods, which heavily use the dynamical system underlying the number Euclidean algorithm, and its transfer operator. Baladi and Vallée have recently designed a general framework for "distributional dynamical analysis", where they have exhibited asymptotic gaussian laws for a large family of parameters. However, this family does not contain neither the bit-complexity cost nor the size of remainders, and we have to extend their methods for obtaining our results. Even if these dynamical methods are not necessary in case (P), we explain how the polynomial dynamical system can be also used for proving our results. This provides a common framework for both analyses, which well explains the similarities and the differences between the two cases (P) and (I), for the algorithms themselves, and also for their analysis. The article below is an extended abstract of this paper. pdf
We provide sharp estimates for the probabilistic behaviour of the main parameters of the Euclid algorithm, and we study in particular the distribution of the bit-complexity which involves two main parameters : digit costs and length of continuants. We perform a dynamical analysis which heavily uses the dynamical system underlying the Euclidean algorithm. Baladi and Vallée have recently designed a general framework for distributional dynamical analysis, where they have exhibited asymptotic gaussian laws for a large class of digit costs. However, this family contains neither the bit complexity cost nor the length of continuants. We first show here that an asymptotic gaussian law also holds for the length of continuants at a fraction of the execution. There exist two gcd algorithms, the standard one which only computes the gcd, and the extended one which also computes the Bezout pair, and is widely used for computing modular inverses. The extended algorithm is more regular than the standard one, and this explains that our results are more precise for the extended algorithm. We prove that the bit complexity of the extended Euclid algorithm asymptotically follows a gaussian law, and we exhibit the speed of convergence towards the normal law. We describe also conjectures [quite plausible], under which we can obtain an asymptotic gaussian law for the plain bit-complexity, or a sharper estimate of the speed of convergence towards the gaussian law. pdf
There exist fast variants of the gcd algorithm which are all based on principles due to Knuth and Schonhage. On inputs of size n, these algorithms use a Divide and Conquer approach, perform FFT multiplications and stop the recursion at a depth slightly smaller than log n. A rough estimate of the worst-case complexity of these fast versions provides the bound O(n(log n)^2 loglog n). However, this estimate is based on some heuristics and is not actually proven. Here, we provide a precise probabilistic analysis of some of these fast variants, and we prove that their average bit-complexity on random inputs of size n is (n(log n)^2(log log n)^2), with a precise remainder term. We view such a fast algorithm as a sequence of what we call interrupted algorithms, and we obtain three results about the (plain) Euclid Algorithm which may be of independent interest. We precisely describe the evolution of the distribution during the execution of the (plain) Euclid Algorithm; we obtain a sharp estimate for the probability that all the quotients produced by the (plain) Euclid Algorithm are small enough; we also exhibit a strong regularity phenomenon, which proves that these interrupted algorithms are locally similar to the total algorithm. This finally leads to the precise evaluation of the average bit-complexity of these fast algorithms. This work uses various tools, and is based on a precise study of generalised transfer operators related to the dynamical system underlying the Euclid Algorithm. pdf.
In data mining, enumerate the frequent or the closed patterns is often the first difficult task leading to the association rules discovery. The number of these patterns represents a great interest. The lower bound is known to be constant whereas the upper bound is exponential, but both situations correspond to pathological cases. For the first time, we give an average analysis of the number of frequent or closed patterns. Average analysis is often closer to real situations and gives more information about the role of the parameters. In this paper, two probabilistic models are studied: a BERNOULLI and a MARKOVIAN. In both models and for large databases, we prove that the number of frequent patterns, for a fixed frequency threshold , is exponential in the number of items and polynomial in the number of transactions. On the other hand, for a proportional frequency threshold , the number of frequent patterns is polynomial in the number of items and does not involve the number of transactions. Finally, we prove in the BERNOULLI model that the number of closed patterns, for a proportional frequency threshold, is polynomial in the number of items. ps, pdf.
Frequent and closed patterns are at the core of numerous Knowledge Discovery processes. Their mining is known to be difficult, because of the huge size of the search space, exponentially growing with the number of attributes. Unfortunately, most studies about pattern mining do not address the difficulty of the task, and provide their own algorithm. In this paper, we propose some new results about the average number of frequent patterns, by using probabilistic techniques and we extend these results to the number of closed patterns. In a first step, the probabilistic model is simple and far from the real life since the attributes and the objects are considered independent. Nevertheless according to this model, frequency threshold phenomena observed in practice are explained. We also prove that, for a fixed threshold, the number of frequent patterns is asymptotically exponential in the number of attributes and polynomial in the number of objects whereas, for a frequency threshold proportional to the number of objects, the number of frequent and closed patterns is asymptotically polynomial in the number of attributes without depending on the number of objects. ps, pdf.
We describe a class of algorithms which compute in polynomial time important constants related to the Euclidean Dynamical System. Our algorithms are based on a method which has been previously introduced by Daudé Flajolet and Vallée. However, the authors did not prove the correctness of the algorithm and did not provide any complexity bound. Here, we describe a general framework where the DFV-method leads to a proven polynomial time algorithm that computes spectral constants relative to a class of Dynamical Systems. These constants are closely related to eigenvalues of the transfer operator. Since it acts on an infinite dimensional space, exact spectral computations are almost always impossible, and are replaced by (proven) numerical approximations. The transfer operator can be viewed as an infinite matrix M which is the limit (in some precise sense) of the sequence of truncated matrices Mn of order n where exact computations are possible. We prove that each isolated eigenvalue of M is a limit of a sequence of eigenvalues of Mn, with exponential speed. Then, coming back to the Euclidean Dynamical System, we compute (in polynomial time) three important constants which play a central rôle in the Euclidean algorithm: (i) the Gauss-Kuzmin-Wirsing constant related to the speed of convergence of the continued fraction algorithm to its limit density; (ii) the Hensley constant which occurs in the leading term of the variance of the number of steps of the Euclid algorithm; (iii) the Hausdorff dimension of the Cantor sets relative to constrained continued fraction expansions. ps, pdf.
In information theory, the fundamental mechanism that creates the text is called a source. By definition, it is a process that emits symbols (or letters) from an alphabet A at each unit time. The complexity of a source depends on correlations that exist between the successives symbols.
In text algorithmic, the source clearly influences the performances of the algorithms, and this, through two characteristics constants: the entropy and the probability of b-coincidence. The entropy measures the uncertainty of the source whereas the probability of b-coincidence is related to the probability that b words start with the same prefix. These constants are omnipresents in domains that I mentioned previously but they are in general difficult to calculate.
My work consists in finding a method which is able to approximate these constants. For it, I use dynamical sources defined by Brigitte Vallée in 1998. In master thesis, I defined two possible approaches. One of them leads to certain results of convergence. ps, pdf.

Hensley Constant: I applyed the results of my DEA practicum to the source of continuous fractions in order to get the first proven approximation of the Hensley's constant.

Found value:
9.08037 31646 96850 36622 38092 15123 21337 99333 44476 28431 ± 10-50
NB: the first fourty digits are exacts.

* * *