PGM (1) -- Introduction and Overview

As with the other series, these are course notes, for Stanford's PGM course on Coursera.

Background

Required

basic probability theory

It's going to be really hard to do this class without some understanding of basic probability theory.
This doesn't have to be very advanced stuff; we're talking about things like independence, Bayes' rule, and just the basics of discrete distributions.

some programming

The programming assignments will require that you have some prior experience programming, because this is not a programming class.

some algorithms and data structures

And because this class merges ideas from both probability theory and computer science, it’s really important you have some background in algorithms and data structures.

Recommended

machine learning and simple optimization

Recommended but not strictly necessary, and we certainly don't require it; we give you the background as we go. A little bit of experience with machine learning, maybe some simple optimization like gradient descent, nothing very sophisticated, will help.

Matlab or Octave

And it will be helpful to have some experience with Matlab or Octave.

What are probabilistic graphical models?

Probabilistic Graphical Models are a bit of a mouthful, so before we define them, let’s first figure out what they might be used for.

So, one example application, which in fact is the one where probabilistic graphical models, or PGMs as they're called, first made their way into computer science and artificial intelligence, is medical diagnosis.

Consider a doctor who is faced with a patient. The doctor has a fair amount of information at her disposal when she looks at the patient.

A very different application where PGMs have also been used is that of image segmentation.

Here we might have an image such as this that has thousands or even hundreds of thousands of pixels, and what we’d like to do is we’d like to figure out what each pixel corresponds to.

For example, if we break up the image into these fairly large regions, to have less stuff to reason about, we want to figure out which of these corresponds to grass, sky, cow, or horse.

What do these two problems have in common?

First, they involve a very large number of variables that we have to reason about.

In the context of the doctor, it's all the predisposing factors, test results, possible diseases, and so on. And in the context of image segmentation, it's the labels for the different pixels or the larger regions called superpixels.

The second thing that these applications have in common is that fundamentally there is going to be significant uncertainty about the right answer, no matter how clever the algorithms that we design.


So, probabilistic graphical models are a framework for dealing with this kind of application. So, let's first understand what each of these words means in the context of this framework.

So first, let’s consider the word models.

Models

The model is a declarative representation of our understanding of the world.

So it is a representation within the computer that captures our understanding of what these variables are and how they interact with each other.

And the fact that it's declarative means that the representation stands on its own, which means that we can look into it and make sense of it aside from any algorithm that we might choose to apply to it.

So, why is that important? It's important because that same representation, that same model, can then be used in the context of one algorithm that answers one kind of question, or other algorithms that might answer different kinds of questions, or the same question in more efficient ways, or that make different trade-offs between accuracy and computational cost.

The other advantage of having a stand-alone model is that we can separate out the construction of the model from the algorithms that are used to reason over it.

So, we can construct methodologies that elicit these models from a human expert, or ones that learn them from historical data using statistical machine learning techniques, or a combination of the two. And once again, the separation between the model, the reasoning algorithms, and the learning allows us to tackle each of these problems separately.

So, that was the word model, what about probabilistic?

Probabilistic

Partial knowledge of state of the world

The word probabilistic is in there because these models are designed to help us deal with large amounts of uncertainty. So uncertainty comes in many forms and for many different reasons.

So, first, it comes about because we just have partial knowledge of the state of the world; for example, the doctor doesn't get to measure every symptom or every test result, and she's certainly uncertain about the diseases that the patient has.

Noisy observations

Uncertainty comes because of noisy observations.

So even when we get to observe certain things like blood pressure, those observations are often subject to significant amounts of noise.

Phenomena not covered by our model

Uncertainty also comes in because of modeling limitations, so we’re going to have phenomena that are just not covered by our model.

All sorts of different obscure diseases, for example, might cause the same set of symptoms. It's impossible for us to write down a model so detailed that it includes every possible contingency and every possible factor.

And so you’re going to have uncertainty and variability that is simply due to modeling limitations.

Inherent stochasticity

And finally, some people would argue that the world is inherently stochastic. Certainly, if you go down to the quantum level, that’s true.

But even at a higher level, the modeling limitations of complicated systems are such that one might as well view the world as inherently stochastic.

Probability theory

Probability theory is a framework that allows us to deal with uncertainty in ways that are principled and that bring to bear important and valuable tools.

Declarative representation with clear semantics

So first, probabilistic models provide us with, again, this word declarative.

A declarative representation, one that stands alone, where you can look at a probability distribution and it has clear semantics that represent our uncertainty about the different states that the world might be in.

Powerful reasoning patterns

It also provides us with a toolbox comprising powerful reasoning patterns that include, for example, conditioning on new forms of evidence or decision making under uncertainty.

Established learning methods

And because of the intricate connection between probability theory and statistics, you can bring to bear a range of powerful learning methodologies from statistical learning, which allow us to learn these models effectively from historical data, avoiding the need for a human to specify every single aspect of the model by hand.

Complex systems

Finally, the word graphical.

The word graphical is here from the perspective of computer science, because probabilistic graphical models are a synthesis between ideas from probability theory and statistics and ideas from computer science.

And the idea here is to use concepts from computer science, specifically graphs, to allow us to represent systems that are very complicated and involve large numbers of variables.

And we've already seen those large numbers of variables in both of the applications that we used as examples, in the medical example as well as in the image segmentation example.

And so in order to capture probability distributions over spaces involving such a large number of factors, we need to have probability distributions over what are called random variables.

And so the focus of this class and what we’ll do for most of it is to think about the world as represented by a set of random variables, X1 up to Xn, each of which captures some facet of the world.

So, one symptom that may be present or absent, or a test result that might have a continuous set of possible values or a pixel that might have one of several labels.

So each of these is a random variable and our goal is to capture our uncertainty about the possible states of the world in terms of their probability distribution or what’s called a joint distribution over the possible assignments to the set of random variables.

Now, the important thing to realize when looking at this is that even in the simplest case where each of these is, say, binary valued, which is not usually the case, but say just for the sake of argument: if you have n binary-valued variables, then this is a distribution over 2^n possible states of the world, one for each possible assignment.

And so we have to deal with objects that are intrinsically, exponentially large. And our only way to do that is by exploiting data structures, using ideas from computer science, that exploit the structure in the distribution and allow us to represent and manipulate it in an effective way.
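Just to make that blowup concrete, here is a minimal sketch (mine, not from the lecture; the variable counts are arbitrary) of how fast the number of joint assignments grows:

```python
# The number of joint assignments over n binary-valued random variables is 2**n;
# a full joint distribution table has one entry per assignment.
for n in (10, 20, 30):
    print(f"{n} binary variables -> {2 ** n:,} entries in the joint distribution")
```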

Graphical models

Let’s look at a couple of very simple examples, so here’s a toy Bayesian network, one that will accompany us through much of the first part of this course.

A Bayesian network is one of the two main classes of probabilistic graphical models, and it uses a directed graph as the intrinsic representation.

In this case, remember we had a set of random variables X1 up to Xn. The random variables are represented by nodes in the graph.

So, take a look at this very simple example, which we'll discuss again later. Here we have a situation where we have a student who takes a course and gets a grade in the course, and so that's one of our random variables. We have other random variables that are also related to that: for example, the intelligence of the student taking the course, and the difficulty of the course. And others that might also be of interest: for example, the quality of the recommendation letter that the student gets in the course, which depends on things such as the student's grade, and the score that the student might receive on the SAT.

So, this is a representation of a probability distribution, in this case over these five random variables. And the edges in this graph represent the probabilistic connections between those random variables in a way that is very formal as we’ll define later on.
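For concreteness, in the standard version of this example (with variables Difficulty, Intelligence, Grade, SAT, Letter), the directed graph encodes the factorization

P(D, I, G, S, L) = P(D) · P(I) · P(G | I, D) · P(S | I) · P(L | G)

where the Grade depends on Intelligence and Difficulty, the SAT score depends on Intelligence, and the Letter depends on the Grade. The exact edge structure here is my reading of the standard student example rather than something spelled out in the paragraph above.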

The other main class of probabilistic graphical model is what’s called the Markov network and that uses an undirected graph.

And in this case, we have an undirected graph over four random variables, A, B, C, D, and we'll give an example of this type of network a little bit later on.

So these were toy examples, here are some real examples of the same type of framework.

So, this is a real Bayesian network.

It's a network that's called CPCS; it's a real medical diagnosis network. It was designed at Stanford University for the purpose of diagnosing internal diseases, and it has 480-some nodes and a little bit over 900 edges. And it was used for diagnosing internal diseases by physicians here.

Another real graphical model, in this case on the Markov network side, is one that’s used for the image segmentation tasks that we talked about before.

Here, the random variables represent the labels of pixels or superpixels, say one per superpixel. And the edges represent, again, probabilistic relationships between the label of a pixel and the label of an adjacent pixel, since these are likely to be related to each other.

Graphical representation

Intuitive & compact data structure

So to summarize, the graphical representation gives us an intuitive and compact data structure for capturing these high dimensional probability distributions.

Efficient reasoning using general-purpose algorithms

It provides us at the same time, as we’ll see later in the course, a suite of methods for efficient reasoning, using general purpose algorithms that exploit the graphical structure.

Sparse parameterization

  feasible elicitation
  learning from data

And because of the way in which the graph structure encodes the parameterization of the probability distribution, we can represent these high-dimensional probability distributions efficiently using a very small number of parameters.

This allows for feasible elicitation by hand from an expert, as well as automatic learning from data. And in both cases, a reduction in the number of parameters is very valuable.
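As a rough worked example on the student network above (the cardinalities are the standard ones, two values each for I, D, S, L and three for G; the counts are mine, not from the slide): the full joint distribution has 2 · 2 · 3 · 2 · 2 = 48 entries, whereas the CPD tables P(D), P(I), P(G | I, D), P(S | I), P(L | G) together have only 2 + 2 + 12 + 4 + 6 = 26 entries, and even fewer independent parameters.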

Many applications

This framework is a very general one and has been applied to a very broad range of applications. I'm not going to list all of the ones on the slide, and there are many others that I could have put on the slide had there been more space.

Let me just very briefly mention a few of them on subsequent slides.

So, we’ve already discussed the image segmentation. So, just to motivate the benefits of the PGM framework in this context, let’s look at these two images as an example.

Here are the original images, and this is the division of these images into what I mentioned were called superpixels, which are slightly larger coherent regions.

And this is what you get if you apply a state-of-the-art machine learning classifier to individual superpixels separately.

So, just trying to classify each superpixel separately, you can see that you get, especially in the case of the cow, a total mess, with different superpixels having drastically different labels.

You can't even see the cow in this segmented image. Whereas if you construct a probabilistic graphical model to capture the global structure of the scene and the correlations, the probabilistic relationships between these superpixels, you end up with a much more coherent segmentation that captures the structure of the scene.

Textual information extraction

A very different application is one of textual information extraction.

Here we might have an article from, say, a newspaper, and we'd like to take this unstructured data and convert it into a structured form, where we have some representation of the people, locations, organizations, and perhaps relationships.

So one such task might be to take this kind of sentence and recognize that these two words together form a person, which might not be that hard given the presence of the word Mrs. But this is a little bit harder, because Green is also an ordinary word, and yet we want to identify it as a person. We might then want to infer that this is a location, and perhaps that this is an organization.

It turns out that the state-of-the-art methodology for solving this problem is as a probabilistic graphical model, where we have a variable for each word that encodes the label for that word.

For example, here we have the beginning of a person unit and an intermediate label for the person unit. And here is another person unit, whereas this variable is one we would like to label as a location. But we would like to capture, importantly, the correlations between both adjacent words as well as between non-adjacent words, by using one occurrence of the word Green to infer that another occurrence of the word Green is also a name.

Multi-Sensor integration: Traffic

A very different example altogether is one that integrates data from multiple sensors.

This occurs in many applications; one such application is integrating data related to traffic from sensors that we might have in the road or on top of bridges, weather information, and incident reports that we might get in some form.

We’d like to take all this together and use a model that as it happens was learned from data.

And that model is then used to predict both current and future road speeds, not only on roads where we have sensors that measure traffic, but even more interestingly, on roads where traffic wasn't measured.

And it turns out that this was a very successful application that was fielded in several large cities with very good results.

Overview

So, let’s conclude this introduction with an overview of what we’ll learn in the class.

So, we will cover three main pieces related to probabilistic graphical models. We’ll cover representation of PGMs, including both directed and undirected representation.

We'll also talk about higher-level structures that allow us to encode more complicated scenarios, including ones that involve temporal structure as well as ones that involve repeated or relational structure, specifically a class of models called plate models.

We'll talk about inference, or reasoning, using these models, covering both exact reasoning, where exact probabilities are the outcome, as well as approximate methods that provide different trade-offs between accuracy and computation.

And we’ll also talk about how this class of models can be used for decision making under uncertainty.

And finally, we’ll talk about how you might learn these models from historical statistical data.

And we'll talk about learning both the parameters as well as the structure of these models automatically, and about dealing with both the somewhat simpler scenario where we have complete data, that is, all of the variables are always observed, as well as the more difficult case where some of the variables might not be observed all the time, or perhaps not at all.

And that introduces an interesting set of complications but also a wealth of very exciting applications as we’ll see.

Probability distributions

So before we get into the details of probabilistic graphical models, we need to talk a little bit about what a probability distribution is, just so we have a shared vocabulary. So, let’s start with a very simple example of a joint distribution.

Joint distribution

One that is going to be extended in examples later on, in other parts of the course. Let's start with an example that involves just three random variables.

This is what I call the student example: you have a student who can be described, in this case, by a variable representing his intelligence, which could be high or low.

The student is taking a class. The class might be difficult or not, so that's the random variable D. So, the random variable I has two values, the difficulty variable also has two values, and then there is the grade that the student gets in the course, which has three values.

In this case, we’re going to assume A, B, and C.

Now here's an example of a joint distribution over this set of three random variables. So this is an example of P(I, D, G).

It's a joint distribution. Let's think about how many entries are in such a joint distribution. Well, since we have three variables and we need to represent the probability of every combination of values for each of these three variables, we have 2 × 2 × 3 possible combinations, for a total of twelve possible values that we need to assign a probability to. So there are twelve total parameters in this distribution, and I'm going to introduce the notion of independent parameters, which we're going to talk about later as well.

Independent parameters are parameters whose value is not completely determined by the value of other parameters.

So in this case, because this is a probability distribution, we know that all of the numbers on the right have to sum to one. And therefore, if you tell me eleven out of the twelve, I know what the twelfth is, and so the number of independent parameters is eleven. We'll see that this is a useful notion later on when we start evaluating the relative expressive power of different probability distributions.
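As a minimal sketch of this bookkeeping in code (the uniform probabilities below are placeholders, not the numbers on the slide):

```python
import itertools

# Hypothetical joint distribution P(I, D, G): keys are assignments, values are probabilities.
vals_I = ["i0", "i1"]          # low / high intelligence
vals_D = ["d0", "d1"]          # easy / hard course
vals_G = ["g1", "g2", "g3"]    # grade A / B / C

assignments = list(itertools.product(vals_I, vals_D, vals_G))
print(len(assignments))        # 2 * 2 * 3 = 12 entries, i.e. 12 parameters

joint = {a: 1.0 / 12 for a in assignments}        # placeholder values summing to 1
assert abs(sum(joint.values()) - 1.0) < 1e-9      # 11 of the 12 are independent
```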

What are things that we can do with probability distributions?

Well, one important thing that we can do is condition the probability distribution on a particular observation.

So, for example, assume that we observe that the student got an A, and so we now have an assignment to the variable G, which is g1. That immediately eliminates all possible assignments that are not consistent with my observation.

So everything but the g1 entries is removed. And that gives me a reduced probability distribution, so this is an operation that's called reduction. I've taken the probability distribution and reduced away everything that is not consistent with what I've observed.

Now, that by itself doesn’t give me a probability distribution, because notice that these numbers no longer sum to one, because they summed to one before I threw out a bunch of stuff.

So, what I do in order to get a probability distribution is take this un-normalized measure. The word measure indicates that it's a form of distribution, but the fact that it's un-normalized means that it doesn't sum to one.

So this un-normalized measure if we want to turn it into a probability distribution, the obvious thing to do is to normalize it.

And so what we’re going to do is take all of these entries and we’re going to sum them up. And that’s going to give us a number, which in this case is 0.447. And we can now divide each of these by 0.447.

And that’s going to give us a normalized distribution. Which in this case corresponds to the probability of I, D given G1.

So that's a way of taking an un-normalized measure and turning it into a normalized probability distribution. We'll see that this operation is one of the more important ones that we'll be using throughout the course.
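Continuing the small sketch above (same hypothetical `joint` table), reduction followed by normalization might look like this:

```python
# Reduce: keep only the entries consistent with the observation G = g1,
# dropping G from the keys since it is now fixed.
reduced = {(i, d): p for (i, d, g), p in joint.items() if g == "g1"}

# The result is an un-normalized measure; divide by its total mass
# (0.447 on the lecture slide, a different number for our placeholder table).
z = sum(reduced.values())
cond = {a: p / z for a, p in reduced.items()}     # P(I, D | g1)
assert abs(sum(cond.values()) - 1.0) < 1e-9
```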

Marginalization

Okay, the final operation I'm going to talk about regarding probability distributions is the operation of marginalization. That is an operation that takes a probability distribution over a larger set of variables and produces a probability distribution over a subset of those variables.

So in this case, we have a probability distribution over I and D, and say that we want to marginalize out I, which means we're going to sum over I, throw it away, and restrict attention to D.

And so, what does that do? For example, if I want to compute the probability of d0, I'm going to add up both of the entries that have d0 associated with them: the one corresponding to i0 and the one corresponding to i1. And that's the marginalization of this probability distribution.
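And marginalization, continuing the same sketch, is just a sum over the values of the variable being removed:

```python
# Marginalize I out of P(I, D | g1): P(d | g1) = sum over i of P(i, d | g1).
p_D = {}
for (i, d), p in cond.items():
    p_D[d] = p_D.get(d, 0.0) + p
print(p_D)    # e.g. P(d0 | g1) = P(i0, d0 | g1) + P(i1, d0 | g1)
```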

Factors

A critical building block in a lot of what we'll do, both in terms of defining probability distributions and in terms of manipulating them for inference, is the notion of a factor. So let's define what a factor is and the kinds of operations that you can do on factors.

So a factor really is a function, or a table. It takes a bunch of arguments.

In this case, a set of random variables X1 up to Xk, and just like any function, it gives us a value for every assignment to those random variables. So it takes all possible assignments in the cross-product space of X1 up to Xk, that is, all possible combinations of assignments, and in this case it gives me a real value for each such combination.

And the set of variables X1 up to Xk is called the scope of the factor; that is, it's the set of arguments that the factor takes.
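In code, a factor can be stored very directly as a scope plus a table; here is a minimal sketch of that idea (my own representation, not code from the course):

```python
# A factor phi(X1, ..., Xk): a scope (the argument variables) and a table
# mapping every assignment of the scope to a real number.
phi = {
    "scope": ("A", "B"),
    "table": {("a0", "b0"): 30.0,    # placeholder values; note they need not
              ("a0", "b1"): 5.0,     # lie in [0, 1] or sum to 1
              ("a1", "b0"): 1.0,
              ("a1", "b1"): 100.0},
}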

Let's look at some examples of factors. We've already seen a joint distribution; a joint distribution is a factor. For every combination of the variables I, D, and G, for example, it gives me a number. As it happens, this number is a probability, and it happens that they sum to one, but that doesn't matter. What's important is that for every combination of values of I, D, and G, I get a number. That's why it's a factor.

Here's a different factor: an un-normalized measure is a factor also. In this case we have a factor such as P(I, D, g1), and notice that in this case the scope of the factor is actually I and D, because there is no dependence of the factor on the value of the variable G; the variable G in this case is constant. So this is a factor whose scope is {I, D}.

Finally, a type of factor that we will use extensively is what’s called a conditional probability distribution, typically abbreviated CPD.

This, as it happens, is a CPD that's written as a table, although that's not necessarily the case. And this is a CPD that gives us the conditional probability of the variable G given I and D. So what does that mean? It means that for every combination of values of the variables I and D, we have a probability distribution over G.

So, for example, if I have an intelligent student in a difficult class, which is this last line over here, this tells us that the probability of getting an A is 0.5, a B is 0.3, and a C is 0.2. And as we can see, these numbers sum to 1, as they should, because this is a probability distribution over G for this particular conditioning context.

And you can easily verify that this is true for all of the other lines in this table. So this is, again, a particular type of factor, one that satisfies certain constraints, in this case that each row sums to 1.
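A quick sketch of that row-sum property, using the one row quoted above (the other rows here are placeholders I made up just so the check runs):

```python
# P(G | I, D): one distribution over G = (A, B, C) for each combination of (I, D).
cpd_G = {
    ("i0", "d0"): (0.3, 0.4, 0.3),     # placeholder row
    ("i0", "d1"): (0.05, 0.25, 0.7),   # placeholder row
    ("i1", "d0"): (0.9, 0.08, 0.02),   # placeholder row
    ("i1", "d1"): (0.5, 0.3, 0.2),     # the row from the lecture: smart student, hard class
}
for parents, dist in cpd_G.items():
    assert abs(sum(dist) - 1.0) < 1e-9   # each row sums to 1
```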

General Factors

Now, the factors that we're dealing with will not always correspond to probabilities. So here is an example of a general factor that really doesn't correspond in any way to a probability, because the numbers aren't even in the range [0, 1].

As we'll see, these kinds of factors are nonetheless useful. This is a factor whose scope is the set of variables {A, B}, and it still assigns a real-valued number to each assignment of A and B.

Factor Product

Some operations that we’re going to do on factors. One of the most common operations is what’s called factor product.

It's taking two factors, say Φ1 and Φ2, and multiplying them together. So let's think about what that means. Here we have a factor Φ1 with scope {A, B}, and Φ2 with scope {B, C}. And what we're doing is a lot like multiplying a function f(x, y) times g(y, z): you're going to get a function of all three arguments, x, y, z.

So in this case we get a factor whose scope is {A, B, C}. And if we want to figure out, for example, the value of the row a1, b1, c1, it comes from taking the a1, b1 row from here and the b1, c1 row from here and multiplying them together. So we're going to get 0.25.

So this is effectively taking the functions or the tables, and just multiplying them together.
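Here is a minimal sketch of the factor product on dictionary-backed tables; the entry values are placeholders, chosen only so that φ1(a1, b1) · φ2(b1, c1) = 0.5 · 0.5 = 0.25 matches the worked row above:

```python
phi1 = {("a1", "b1"): 0.5, ("a1", "b2"): 0.8,      # scope {A, B}
        ("a2", "b1"): 0.1, ("a2", "b2"): 0.0}
phi2 = {("b1", "c1"): 0.5, ("b1", "c2"): 0.7,      # scope {B, C}
        ("b2", "c1"): 0.1, ("b2", "c2"): 0.2}

# Product: for every joint assignment (a, b, c), multiply the entries that
# agree on the shared variable B.
product = {(a, b, c): phi1[(a, b)] * phi2[(bb, c)]
           for (a, b) in phi1 for (bb, c) in phi2 if bb == b}
print(product[("a1", "b1", "c1")])                 # 0.5 * 0.5 = 0.25
```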

Factor Marginalization

Another important operation is factor marginalization. Factor marginalization is very similar to, in fact identical to, the marginalization of probability distributions, except that here it doesn't have to be a probability distribution.

So, for example, if we have a factor here whose scope is {A, B, C}, and we want to marginalize out B to get a factor whose scope is {A, C}, what we're going to do is, again, take all possible values of B (in this case, because B is binary, there are only two values) and add them up in order to get the entry for a1, c1: so 0.25 + 0.08.

All of the other rows in this table are computed in exactly the same way, from the corresponding rows in the original, larger factor.
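The same operation as a small code sketch; only the 0.25 and 0.08 entries come from the example above, the rest are placeholders:

```python
# A factor with scope {A, B, C}; marginalize out B to get scope {A, C}.
phi = {("a1", "b1", "c1"): 0.25, ("a1", "b2", "c1"): 0.08,
       ("a1", "b1", "c2"): 0.35, ("a1", "b2", "c2"): 0.16,
       ("a2", "b1", "c1"): 0.05, ("a2", "b2", "c1"): 0.07,
       ("a2", "b1", "c2"): 0.10, ("a2", "b2", "c2"): 0.14}

marginal = {}
for (a, b, c), v in phi.items():
    marginal[(a, c)] = marginal.get((a, c), 0.0) + v
print(marginal[("a1", "c1")])    # 0.25 + 0.08 = 0.33
```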

Factor Reduction

Finally, factor reduction. Again, this is very similar to reduction in the context of probability distributions: we want to reduce the factor to a particular context.

For example, to the context C = c1. So we're going to focus only on the rows that have the value C = c1, and that's going to give us a reduced factor, which only has c1. Once again, the scope of this factor is {A, B}, because there's no longer any dependence on C. That's basically the final operation.
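And finally, a sketch of factor reduction on the same kind of table (values are placeholders): keep only the rows consistent with C = c1 and drop C from the scope.

```python
phi = {("a1", "b1", "c1"): 0.25, ("a1", "b1", "c2"): 0.35,   # scope {A, B, C}
       ("a2", "b1", "c1"): 0.08, ("a2", "b1", "c2"): 0.16,
       ("a1", "b2", "c1"): 0.05, ("a1", "b2", "c2"): 0.07,
       ("a2", "b2", "c1"): 0.10, ("a2", "b2", "c2"): 0.14}

# Reduce to the context C = c1: the result has scope {A, B}.
reduced = {(a, b): v for (a, b, c), v in phi.items() if c == "c1"}
print(reduced)    # {('a1', 'b1'): 0.25, ('a2', 'b1'): 0.08, ('a1', 'b2'): 0.05, ('a2', 'b2'): 0.10}
```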

Why Factors

It turns out that factors are the fundamental building block for defining these distributions in high-dimensional spaces.

That is, the way in which we're going to define an exponentially large probability distribution over n random variables is by taking a bunch of little pieces and putting them together, multiplying factors, in order to define these high-dimensional probability distributions.

It turns out also that the same set of basic operations that we use to define the probability distributions in these high dimensional spaces are also what we use for manipulating them in order to give us a set of basic inference algorithms.