Research
List of publications
- Gaussian laws for the main parameters of the Euclid
algorithms
, invited paper for the revue Algorithmica (always in evolution), 2007,
(with Brigitte Vallée ),
pdf
-
Number of frequent patterns in random databases ,
ASMDA 2007 ,
pdf
- Analysis of fast versions of the Euclid Algorithm,
Analco 2007 ,
(with Benoit Daireaux ,
Eda Cesaratto,
Julien Clément ,
Véronique Maume-Deschamps ,
Brigitte Vallée ),
pdf.
- Thesis: Gcd algorithms and Data Mining: the dynamical analysis point
of vue , 2006 (in french),
ps
pdf
- Sharp Estimates for the Main Parameters of the Euclid Algorithm,
LATIN'06 ,
(with Brigitte Vallée ),
pdf
- Average Number of Frequent (Closed) Patterns in Bernoulli and Markovian databases,
ICDM'05 , (with François Rioult and
Arnaud Soulet ),
ps,
pdf.
- Average Number of Frequent and Closed Patterns in Random Databases
(with
François Rioult and
Arnaud Soulet ), CAP'05 ,
ps,
pdf
- Computation of a Class of Continued Fraction Constants, Proceedings of ANALCO 2004,
ps,
pdf.
- Report in master thesis (in french): "Modélisation et
Approximation de sources complexes", 2002,
ps,
pdf
Abstracts
- Number of frequent patterns in random databases ,
ASMDA 2007
In a tabular database, patterns which occur over a frequency threshold
are called frequent patterns. They are central in numerous data
processes and various efficient algorithms were recently designed for
mining them. Unfortunately, very few is known about the real
difficulty of this mining, which is closely related to the number of
frequent patterns. The worst case analysis always leads to an
exponential number of frequent patterns, but experimentations show
that algorithms become efficient for reasonable frequency
thresholds. We perform here a probabilistic analysis of the number of
frequent patterns. We first introduce a general model of random
databases that encompasses all the previous classical models. In this
model, the rows of the database are seen as independent words
generated by the same probabilistic source [i.e. a random process that
emits symbols]. Under natural conditions on the source, the average
number of frequent patterns is studied for various frequency
thresholds. Then, we exhibit a large class of sources, the class of
dynamical sources, which is proven to satisfy our general
conditions. This finally shows that our results hold in a quite
general context of random databases.
pdf
- Thesis: Gcd algorithms and Data Mining: the dynamical analysis point
of vue (in french)
This thesis deals with two main
algorithmical domains: Data Mining and Arithmetical computations. In
both, we are interested in the average-case analysis of algorithms,
and, we adopt more precisely the dynamical analysis point of vue which
is a mixed method between Analysis of Algorithms and Dynamical
Systems. The Euclid algorithms compute the gcd of two numbers; these
are fundamental blocks in computer algebra, but their fine
probabilistic behavior is always unknown. Thanks to Dynamical Analysis
methods, recent important results have been obtained. In this thesis,
we extend this approach to a precise analysis of parameters, as the
binary complexity or the size of remainders. These parameters are
essential for the Divide and Conquer gcd algorithm due to
Knuth-Schönhage. Dynamical Analysis is also used for proven
computations of spectral constants. The dynamical approach is then
adapted to on polynomial Euclid algorithms even if, in this case,
classical Analytic Combinatorics already applies. We also deal with
Data Mining. We restrict ourselves to binary databases where the
knowledge is represented by 'frequent patterns'. The number of
frequent patterns is essential for analysing algorithms but
experiments show that it significantly changes with the parameters of
the database. Then, the worst case analysis is not meaningful in
practice. In this thesis, we elucidate the average beahvior of the
number of frequent patterns under a large model of databases built
with eventually correlated sources.
ps
pdf
-
Gaussian laws for the main parameters of the Euclid
algorithms
, invited paper for the revue Algorithmica (always in evolution),
(with Brigitte Vallée ).
We provide sharp
estimates for the probabilistic behaviour of the main parameters of
the Euclid Algorithms, both on polynomials and on integer numbers. We
study in particular the distribution of the bit-complexity which
involves two main parameters : digit-costs and length of
remainders. We first show here that an asymptotic gaussian law holds
for the length of remainders at a fraction of the execution, which
exhibits a deep regularity phenomenon. Then, we study in each
framework -polynomials (P) and integer numbers (I)- two gcd
algorithms, the standard one (S) which only computes the gcd, and the
extended one (E) which also computes the Bezout pair, and is widely
used for computing modular inverses. The extended algorithm is more
regular than the standard one, and this explains that our results are
more precise for the Extended algorithm: we exhibit an asymptotic
gaussian law for the bit-complexity of the extended algorithm, in both
cases (P) and (I). We also prove that an asymptotic gaussian law for
the bit-complexity of the standard gcd in case (P), but we do not
succeed obtaining a similar result in case (I). The integer study is
more involved than the polynomial study, as it is usually the case. In
the polynomial case, we deal with the central tools of the
distributional analysis of algorithms, namely bivariate generating
functions. In the integer case, we are led to dynamical methods, which
heavily use the dynamical system underlying the number Euclidean
algorithm, and its transfer operator. Baladi and Vallée have recently
designed a general framework for "distributional dynamical analysis",
where they have exhibited asymptotic gaussian laws for a large family
of parameters. However, this family does not contain neither the
bit-complexity cost nor the size of remainders, and we have to extend
their methods for obtaining our results. Even if these dynamical
methods are not necessary in case (P), we explain how the polynomial
dynamical system can be also used for proving our results. This
provides a common framework for both analyses, which well explains the
similarities and the differences between the two cases (P) and (I),
for the algorithms themselves, and also for their analysis. The
article below is an extended abstract of this paper.
pdf
We provide sharp estimates for the probabilistic behaviour
of the main parameters of the Euclid algorithm, and we study in particular
the distribution of the bit-complexity which involves two main
parameters : digit costs and length of continuants. We perform a dynamical
analysis which heavily uses the dynamical system underlying
the Euclidean algorithm. Baladi and Vallée have recently designed a
general framework for distributional dynamical analysis, where they
have exhibited asymptotic gaussian laws for a large class of digit costs.
However, this family contains neither the bit complexity cost nor the
length of continuants. We first show here that an asymptotic gaussian
law also holds for the length of continuants at a fraction of the execution.
There exist two gcd algorithms, the standard one which only computes
the gcd, and the extended one which also computes the Bezout pair,
and is widely used for computing modular inverses. The extended algorithm
is more regular than the standard one, and this explains that our
results are more precise for the extended algorithm. We prove that the
bit complexity of the extended Euclid algorithm asymptotically follows
a gaussian law, and we exhibit the speed of convergence towards the
normal law. We describe also conjectures [quite plausible], under which
we can obtain an asymptotic gaussian law for the plain bit-complexity,
or a sharper estimate of the speed of convergence towards the gaussian
law.
pdf
There exist fast variants of the gcd algorithm which are all based on
principles due to Knuth and Schonhage. On inputs of size n, these
algorithms use a Divide and Conquer approach, perform FFT
multiplications and stop the recursion at a depth slightly smaller
than log n. A rough estimate of the worst-case complexity of these
fast versions provides the bound O(n(log n)^2 loglog n). However, this
estimate is based on some heuristics and is not actually proven. Here,
we provide a precise probabilistic analysis of some of these fast
variants, and we prove that their average bit-complexity on random
inputs of size n is (n(log n)^2(log log n)^2), with a precise
remainder term. We view such a fast algorithm as a sequence of what we
call interrupted algorithms, and we obtain three results about the
(plain) Euclid Algorithm which may be of independent interest. We
precisely describe the evolution of the distribution during the
execution of the (plain) Euclid Algorithm; we obtain a sharp estimate
for the probability that all the quotients produced by the (plain)
Euclid Algorithm are small enough; we also exhibit a strong regularity
phenomenon, which proves that these interrupted algorithms are locally
similar to the total algorithm. This finally leads to the precise
evaluation of the average bit-complexity of these fast
algorithms. This work uses various tools, and is based on a precise
study of generalised transfer operators related to the dynamical
system underlying the Euclid Algorithm.
pdf.
In data mining, enumerate the frequent or the closed patterns
is often the first difficult task leading to the association
rules discovery. The number of these patterns represents a
great interest. The lower bound is known to be constant
whereas the upper bound is exponential, but both situations
correspond to pathological cases. For the first time, we give
an average analysis of the number of frequent or closed patterns.
Average analysis is often closer to real situations
and gives more information about the role of the parameters.
In this paper, two probabilistic models are studied:
a BERNOULLI and a MARKOVIAN. In both models and
for large databases, we prove that the number of frequent
patterns, for a fixed frequency threshold , is exponential in
the number of items and polynomial in the number of transactions.
On the other hand, for a proportional frequency
threshold , the number of frequent patterns is polynomial
in the number of items and does not involve the number of
transactions. Finally, we prove in the BERNOULLI model
that the number of closed patterns, for a proportional frequency
threshold, is polynomial in the number of items.
ps,
pdf.
Frequent and closed patterns are at the core of numerous Knowledge
Discovery processes. Their mining is known to be difficult, because of
the huge size of the search space, exponentially growing with the
number of attributes. Unfortunately, most studies about pattern
mining do not address the difficulty of the task, and provide their
own algorithm. In this paper, we propose some new results about the
average number of frequent patterns, by using probabilistic techniques
and we extend these results to the number of closed patterns. In a
first step, the probabilistic model is simple and far from the real
life since the attributes and the objects are considered
independent. Nevertheless according to this model, frequency threshold
phenomena observed in practice are explained. We also prove that, for
a fixed threshold, the number of frequent patterns is asymptotically
exponential in the number of attributes and polynomial in the number
of objects whereas, for a frequency threshold proportional to the
number of objects, the number of frequent and closed patterns is
asymptotically polynomial in the number of attributes without
depending on the number of objects.
ps,
pdf.
- Computation of a Class of Continued Fraction Constants, Proceedings of ANALCO 2004.
We describe a class of algorithms which compute in
polynomial time important constants related to the Euclidean
Dynamical System. Our algorithms are based on a
method which has been previously introduced by Daudé Flajolet
and Vallée. However,
the authors did not prove the correctness of the algorithm
and did not provide any complexity bound. Here, we describe
a general framework where the DFV-method leads to
a proven polynomial time algorithm that computes spectral
constants relative to a class of Dynamical Systems.
These constants are closely related to eigenvalues of the
transfer operator. Since it acts on an infinite dimensional
space, exact spectral computations are almost always impossible,
and are replaced by (proven) numerical approximations.
The transfer operator can be viewed as an infinite
matrix M which is the limit (in
some precise sense) of the sequence of truncated matrices
Mn of order n where exact computations
are possible. We prove that each isolated
eigenvalue of M is a limit of a sequence of eigenvalues of Mn, with
exponential speed. Then, coming back to the Euclidean Dynamical
System, we compute (in polynomial time) three important
constants which play a central rôle in the Euclidean
algorithm: (i) the Gauss-Kuzmin-Wirsing constant related
to the speed of convergence of the continued fraction algorithm
to its limit density; (ii) the Hensley constant which
occurs in the leading term of the variance of the number of
steps of the Euclid algorithm; (iii) the Hausdorff dimension
of the Cantor sets relative to constrained continued fraction
expansions.
ps,
pdf.
- Report in master thesis (in french): "Modélisation et
Approximation de sources complexes"
In information theory, the fundamental mechanism that creates the text is
called a source. By definition, it is a process that emits symbols
(or letters) from an alphabet A at each unit time. The complexity
of a source depends on correlations that exist between the successives
symbols.
In text algorithmic, the source clearly influences the performances of
the algorithms, and this, through two characteristics constants: the
entropy and the probability of b-coincidence. The
entropy measures the uncertainty of the source whereas the probability
of b-coincidence is related to the probability that b words start with
the same prefix. These constants are omnipresents in domains that I
mentioned previously but they are in general difficult to calculate.
My work consists in finding a method which is able to approximate
these constants. For it, I use dynamical sources defined by Brigitte Vallée
in 1998. In master thesis, I defined two possible
approaches. One of them leads to certain results of
convergence.
ps,
pdf.
Hensley Constant: I applyed the results
of my DEA practicum to the source of continuous fractions in order to get
the first proven approximation of the
Hensley's constant.
Found value:
9.08037 31646 96850 36622 38092 15123 21337 99333 44476 28431 ± 10-50
NB: the first fourty digits are exacts.
* * *