The first thing to do is to load all of the standard python libraries that we'll need.
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import requests
%matplotlib inline
A Bayesian classifier (also known as a MAP classifier, also known as a minimum-probability-of-error or MPE classifier) is one for which the discriminant function is the posterior probability of $y$ given $\vec{x}$: $$\mbox{one possibility:}~~~g_y(\vec{x}) = p_{Y|X}(y|\vec{x})$$ ... or any monotonic function of that.
For example, from the definition of conditional probability, we get $$\hat{y}=\arg\max_y p_{Y|X}(y|\vec{x}) = \arg\max_y\frac{p_{X,Y}(\vec{x},y)}{p_X(\vec{x})} =\arg\max_y p_{X,Y}(\vec{x},y)$$ so instead of using the posterior probability as the discriminant function, we could use the joint probability
$$\mbox{another possibility that gives the same result:}~~~g_y(\vec{x})=p_{X,Y}(\vec{x},y)$$On the other hand, logarithm is a monotonic function, so $$\hat{y}=\arg\max_y p_{X,Y}(\vec{x},y) = \arg\max_y \ln p_{X,Y}(\vec{x},y)$$ so we could use $$\mbox{yet another possibility that gives the same result:}~~~g_y(\vec{x})=\ln p_{X,Y}(\vec{x},y)$$
It's particularly useful to divide that into the prior probability $p_Y(y)$, and the likelihood $p_{X|Y}(\vec{x}|y)$. The reason it's useful: because $p_Y(y)$ is just a lookup table, and $p_{X|Y}(\vec{x}|y)$ is often much, much easier to learn from data than is $p_{Y|X}(y|\vec{x})$. Thus we often want to use this discriminant:
$$\mbox{This one usually is easiest to compute and use:}~~~g_y(\vec{x})=\ln p_Y(y) + \ln p_{X|Y}(\vec{x}|y)$$You need to know these four terms. If $\vec{x}$ is the observation and $y$ is the class label, then
So let's talk about how to estimate the likelihood function. First, let's get some data. Here I'll download the "Iris" dataset, which is the one that Fisher used in the paper where he invented linear discriminant analysis.
# Download data from the UCI repository
r=requests.get('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')
# Split at every newline; split each line at commas
dataset = [ x.split(',') for x in r.text.split('\n') ]
dataset[0:4]
# Get a dictionary from labels t indices, and back again
label2class = { 'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2 }
class2label = { 0:'Iris-setosa', 1:'Iris-versicolor', 2:'Iris-virginica' }
# Read out a list of the class labels, convert to integers
Y = [ label2class[x[4]] for x in dataset if len(x)==5 ]
# Plot the class labels of each token
plt.bar(range(0,len(Y)),Y)
plt.title('Class labels of all of the tokens')
plt.xlabel('Token number')
plt.ylabel('Class label')
# Create a numpy arrays for each data subset
X = np.array([ x[0:4] for x in dataset if len(x)==5 ], dtype='float64')
X.shape
# Plot a scatter plot of the three classes, so we can see how well they separate in the first two dimensions
plt.plot(X[0:50,0],X[0:50,1],'x',X[50:100,0],X[50:100,1],'o',
X[100:150,0],X[100:150,1],'^')
plt.title('Scatter plot of the three classes, in the first two dimensions')
plt.xlabel('Dimension 0')
plt.ylabel('Dimension 1')
Now let's build a Gaussian model of each of those classes, and see how well it fits.
# Compute the mean of each class. axis=0 means to compute the average row vector
mu0 = np.mean(X[0:50,:],axis=0)
mu1 = np.mean(X[50:100,:],axis=0)
mu2 = np.mean(X[100:150,:],axis=0)
plt.plot(mu0[0],mu0[1],'x',mu1[0],mu1[1],'o',mu2[0],mu2[1],'^')
# Compute the covariance of each class.
Sig0 = np.cov(X[0:50,:],rowvar=False)
Sig1 = np.cov(X[50:100,:],rowvar=False)
Sig2 = np.cov(X[100:150,:],rowvar=False)
[ Sig0, Sig1, Sig2 ]
# Create a coordinate system on which we can calculate the Gaussian pdf
x0coords, x1coords = np.mgrid[4:8:0.01,2:4.5:0.01]
# Combine x0coords and x1coords into an array that we can use in the plt.contourf function
# ... this syntax is copied from the scipy.stats.multivariate_normal web page
coords = np.empty(x0coords.shape + (2,))
coords[:,:,0] = x0coords
coords[:,:,1] = x1coords
coords.shape
# Calculate the Gaussian pdfs at the same set of points
pdf0 = stats.multivariate_normal(mu0[0:2],Sig0[0:2,0:2]).pdf(coords)
pdf1 = stats.multivariate_normal(mu1[0:2],Sig1[0:2,0:2]).pdf(coords)
pdf2 = stats.multivariate_normal(mu2[0:2],Sig2[0:2,0:2]).pdf(coords)
plt.contourf(x0coords,x1coords,pdf0)
plt.title('Contour plot of the Gaussian model for class 0')
# Show contour plot of the maximum of all three classes
maxpdf = np.maximum(pdf0,np.maximum(pdf1,pdf2))
plt.contourf(x0coords,x1coords,maxpdf)
plt.title('Maximum of the three likelihood functions')
A Gaussian classifier is just a Bayesian classifier, with a Gaussian pdf.
Remember that a Bayesian classifier means that $$\hat{y} = \arg\max_y \ln p_Y(y)+\ln p_{X|Y}(\vec{x}|y)$$
A Gaussian classifier is one in which $p_{X|Y}(\vec{x}|y)$ is Gaussian: $$p_{X|Y}(\vec{x}|y)={\mathcal N}(\vec{x};\vec{\mu}_y,\Sigma_y)$$
Suppose that all of the classes have equal priors, $p_Y(y)=\frac{1}{3}$ for the cases of three classes. Then the Gaussian classifier is just $$\hat{y} = \arg\max_y {\mathcal N}(\vec{x};\vec{\mu}_y,\Sigma_y)$$
# Find the (x0,x1) coordinates for which class 0 is the best choice
yhat_is_0 = (pdf0 == maxpdf)
yhat_is_1 = (pdf1 == maxpdf)
yhat_is_2 = (pdf2 == maxpdf)
yhat = yhat_is_1 + 2*yhat_is_2
# Now let's plot that
plt.contourf(x0coords,x1coords,yhat)
plt.title('Classifier output, yhat, plotted as a function of x')
Remember that the Gaussian pdf is given by $$\hat{y} = \arg\max p_{X|Y}(\vec{x}|y) = \arg\max \ln p_{X|Y}(\vec{x}|y)$$
But the logarithm of a Gaussian is a quadratic function of $\vec{x}$: $$\ln p_{X|Y}(\vec{x}|y) = -d_\Sigma^2(\vec{x},\vec{\mu}_y) + \mbox{constant}$$
So the boundary between any two classes, in a Gaussian classifier, needs to be either a straight line, or a quadratic.
So, notice that pdf2 is looking elliptical here, even though the data distribution wasn't really very elliptical. Let's analyze that a little more carefully, by drawing a scatter plot of the same data on top of the countour plot.
plt.contourf(x0coords,x1coords,pdf2)
plt.hold(True)
plt.plot(X[100:150,0],X[100:150,1],'g^')
A GMM models the likelihood of each class as the weighted sum of $K$ different Gaussians, like this: $$p_{X|Y}(\vec{x}|y) = \sum_{k=1}^K c_k {\mathcal N}(\vec{x};\vec\mu_{yk},\Sigma_{yk})$$
Notice that we've increased the number of parameters by a factor of $K$. A Gaussian model has just one $D\times D$ covariance matrix per class. A GMM has $K$ of them. There are two possible solutions to avoid parameter overload:
Here are some facts you know about pdfs in general:
The Gaussian is a valid pdf. That means that these things are true for a Gaussian: $${\mathcal N}(\vec{x};\vec\mu,\Sigma)\ge 0,~~~\mbox{and}~~~\int {\mathcal N}(\vec{x};\vec\mu,\Sigma)d\vec{x}=$$
Since these things are already true for each of the Gaussians separately, making them true for the GMM is simple. We just have to choose the right mixture weights, $c_k$. In fact, the rules are easy:
The covariance matrix has to be positive semi-definite, meaning that it has to have all non-negative eigenvalues.
In fact, a positive-semidefinite covariance matrix is kind of hard to deal with --- it's better if you keep it positive definite, instead. This can be done by finding the eigenvalues, then if there are any zero-valued eigenvalues, just add a small constant to each one. Usually people add a constant somewhere between 0.01 and 0.001, then check to see if their results make sense.
OK, the title of this slide is a lie. I'm not going to tell you, yet, how to learn a GMM from data. I'll tell you about that in the next lecture.
For now, let's just look at the scatter plot for class $y=2$, the Iris-virginica class. Notice that you could draw an ellipse around all the data points with $x_0<6.8$, and you could draw another ellipse around all the data points with $x_0>6.8$. So let's arbitrarily divide the data into those two halves, and find the mean and covariance of each of those two half-datasets.
WARNING: this is not the way you should ever train a GMM in practice. You'll learn a much better method in the next lecture.
# Arbitrarily split the class-2 data into two halves, at the threshold x0=6.8
class2 = X[100:150,:]
class2_left = class2[class2[:,0] <= 6.8,:]
class2_right = class2[class2[:,0] > 6.8,:]
plt.plot(class2_left[:,0],class2_left[:,1],'x',class2_right[:,0],class2_right[:,1],'^')
plt.title('Scatter plot of class 2 data: left half as x, right half as ^')
# Now we'll estimate the mixture weights, mean, and covariance of the Gaussians on the left half and right half
c_2_0 = class2_left.shape[0]/class2.shape[0]
c_2_1 = class2_right.shape[0]/class2.shape[0]
mu_2_0 = np.mean(class2_left[:,0:2],axis=0)
mu_2_1 = np.mean(class2_right[:,0:2],axis=0)
Sig_2_0 = np.cov(class2_left[:,0:2],rowvar=False)
Sig_2_1 = np.cov(class2_right[:,0:2],rowvar=False)
# Just as a check, print the mixture weights, and the two mean vectors
[ [c_2_0, c_2_1], mu_2_0, mu_2_1 ]
# Now let's calculate the GMM model of class 2
gaussian_2_0 = stats.multivariate_normal(mu_2_0,Sig_2_0).pdf(coords)
gaussian_2_1 = stats.multivariate_normal(mu_2_1,Sig_2_1).pdf(coords)
gmm_pdf2 = c_2_0*gaussian_2_0 + c_2_1*gaussian_2_1
# And let's plot its contour plot
plt.contourf(x0coords,x1coords,gmm_pdf2)
plt.title('Gaussian Mixture Model of the class Iris-virginica, class y=2')
A GMM classifier is a Bayesian classifier in which one or more of the classes has a GMM likelihood function.
So the classification function is $$\hat{y}=\arg\max_y p_Y(y)p_{X|Y}(\vec{x},y)$$
where $p_{X|Y}(\vec{x}|y)$ is a GMM for at least one of the classes.
# Let's try creating a classifier.
# We'll use Gaussian models for classes 0 and 1,
# and a GMM for class 2
maxpdf = np.maximum(pdf0,np.maximum(pdf1,gmm_pdf2))
yhat_using_gmm = (pdf1==maxpdf) + 2*(gmm_pdf2==maxpdf)
plt.contourf(x0coords,x1coords,yhat_using_gmm)
plt.title('Classification function: Classes 0 and 1 are Gaussian, Class 2 is GMM')
Remember that in a Gaussian classifier, the border between any two classes needs to be quadratic or linear. In a GMM classifier, that's no longer true. Roughly speaking, the border can have as many wiggles as the GMM has Gaussians.