Algebraic Statistics Short Course

Instructor:  Seth Sullivant
Dates:  May 30  - June 2, 2006
Times:   1:00 PM  - 3:30 PM
Location:   SCI 705,  Harvard University
Abstract:  Algebraic statistics advocates the use of algebraic geometry, commutative algebra, and geometric combinatorics as tools for making statistical inferences. The starting point for this connection is the observation that most statistical models for discrete random variables are, in fact, algebraic varieties. While some of the varieties that appear are classical varieties (like Segre varieties and toric varieties), most are new, and there are many challenging open problems about the algebraic structure of these varieties. These lectures will provide an introduction to algebraic statistics, emphasizing both the interesting algebraic questions that arise and the statistical consequences of the algebraic analysis.

Time/Date Tues, May 30 Weds, May 31 Thurs, June 1 Friday, June 2
1 PM - 2PM Lecture 1 Lecture 3 Lecture 4 Lecture 6
2 PM  - 2:30 PM Break/Discussion Computer
Break/Discussion Computer
2:30 PM - 3:30 PM Lecture 2 Lecture 5

Abstracts of the Lectures:

Lecture 1:  Statistical models are algebraic varieties Notes
This lecture will be devoted to introducing the basic objects of the lecture series.  I'll explain what I mean by a statistical model for discrete random variables, what are algebraic varieties and their ideals, and how these concepts are related to each other.  In particular, I will try to explain what the title means and the subtleties that arise, illustrating everything with lots of examples.  This lecture is the most important because it will introduce a language and notation for talking about the various statistical and algebraic problems that follow in the subsequent lectures.
Lecture 2:  Log-linear models, toric varieties, and their Markov bases Notes
An important first class of statistical models for discrete random variables are the log-linear models.  These are models where the logarithm of the mean vectors can be represented as a linear combination of the parameters.  They also go by the name of discrete exponential families.  In algebraic geometry, the corresponding varieties are called toric varieties and they represent an important first class of varieties.  Their vanishing ideals are called toric ideals.  In this lecture, I will introduce these objects, explain the connection, and show how the generating sets of the toric ideals provide the local moves for running certain random walks.
Lecture 3:  Algebraic tools in statistical disclosure limitation
Statistical disclosure limitation is concerned with determining what summaries of sensitive data are safe to release to the public.  Among the many variations on this problem, and the focus of this talk, will be the case where the data consists of a contingency tables with sensitive cell entries and the summaries are low dimensional marginals of this table.  In this setting, "safe" means that it is not possible for a data snooper to combine information from the margins to deduce details of the hidden sensitive cells.  This naturally leads to challenging problems in integer linear programming, which can be analyzed using algebraic techniques.  An important role is played by the toric ideals from Lecture 2 and their Gröbner bases, which will be explained in this lecture.
Computing the integer programming gap  (Hosten, Sturmfels)
Computer Experiments 1:  Toric Ideals examples
If there is interest from the audience, I will explain how to do computations with toric ideals, using the computer systems 4ti2 and CoCoA.
Lecture 4:  Algebraic approaches to maximum likelihood estimation
Given a parametric statistical model and some sample data, the maximum likelihood estimate is the parameter value that maximize the probability of observing the given data.  For algebraic statistical models, this leads to the problem of maximizing the product of powers of polynomials in the parameters.  Alternatively, in the implicit formulation, this can be viewed as a contrained optimization problem with polynomial contraints.  In this lecture, I will explain how this algebraic formulation can be used to try to answer questions like: "How many (interesting) solutions to the critical (score) equations are there?"
The maximum likelihood degree (Catanese, Hosten, Khetan, Sturmfels)
Solving the likelihood equations (Hosten, Khetan, Sturmfels)
Lecture 5:  Phylogenetic Invariants
Phylogenetics is concerned with the problem of determining the ancestral relationship between a collection of living species or organisms.  In model-based phylogenetics, a statistical model is used to express the evolution of characters (typically, DNA sequences) and the goal is to use properties of the model determine the best tree relating the species. Following the main theme of these lectures, we will explore the algebraic structure of these models and the role that the structure of the defining ideals (the "phylogenetic invariants") has to play.  
Phylogenetic algebraic geometry (Eriksson, Ranestad, Sturmfels, Sullivant)
Lecture 6:  Parametric Inference
Given a statistical model with hidden variables, fixed parameters, and a single observation, the maximum a posteriori (MAP) estimate is the configuration of the hidden variables that maximizes the probability of making the observation given the parameters.  These MAP estimates are a major part of algorithms that align DNA sequence and also appear when working with hidden Markov models.  In parametric inference, one asks the question of how the MAP estimate changes with changing values of the parameters.  For parametric algebraic statistical models, this amounts to computing the (normal fan of) the Newton polytope of a polynomial.  In this lecture, I will explain the connection of MAP estimation and parametric inference to polyhedral geometry and describe how underlying graphical structures can lead to surprising complexity bounds on the resulting algorithms.
Parametric inference for biological sequence analysis (Pachter, Sturmfels)
Tropical geometry of statistical models  (Pachter, Sturmfels)
Computer Experiments 2:  Parametrizations examples
If there is interest from the audience, I will explain how to compute the ideal of a general parametrization using CoCoA and Macaulay2.