The naive Bayes classifier is one of the simplest machine learning algorithms. It uses Bayes’ rule and an assumption that makes it… well, naive.

This is what Bayes’ rule asserts:

$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$$

where $Y$ is the (categorical) classification and $X$ a feature vector (or a data point). You can imagine this formula as describing the process of drawing a colored marble, where the weights and sizes of the different marbles are their features (the components of $X$). You can think of the color itself as the classification of the marbles, say, for instance, red and blue.

Now, if I had drawn a marble and measured its weight ($x_1$) and diameter ($x_2$), I could ask the following question: what are the chances that a marble weighing $x_1$ and measuring $x_2$ in diameter is colored red? Well, the answer that Bayes suggests is simply *proportional* to the product of the probability of drawing a marble with features $x_1$ and $x_2$ given that it is red (i.e., $P(X \mid Y = \text{red})$) and the probability of drawing any red marble (i.e., $P(Y = \text{red})$). And I said *proportional* since you can think of the denominator $P(X)$ as a normalization factor.
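To make that concrete, here is a tiny sketch (all the probability values are invented for illustration) showing that the unnormalized posterior for each color is just likelihood times prior, and that dividing by their sum recovers the normalization:

```python
# Toy numbers, invented for illustration: the likelihoods P(x1, x2 | color)
# for one drawn marble, and the priors P(color).
likelihood = {"red": 0.30, "blue": 0.10}
prior = {"red": 0.40, "blue": 0.60}

# Unnormalized posteriors: P(color | x) is proportional to P(x | color) * P(color)
unnorm = {c: likelihood[c] * prior[c] for c in prior}

# The denominator P(X) is just the normalization factor that makes
# the posteriors sum to one:
evidence = sum(unnorm.values())
posterior = {c: p / evidence for c, p in unnorm.items()}

print(posterior["red"])  # 0.12 / (0.12 + 0.06), i.e. about 0.667
```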

This is essentially what a Bayes classifier does: it calculates the probability of a data vector $X$ belonging to each class $y_k$. And of course, the class with the greatest probability will be the most likely candidate to be correct. In practice, $P(Y)$, or the prior, is easily calculated. But $P(X \mid Y)$ is not quite so easy, since estimating it directly would require far more training data than is usually available (for more details on this you can check Tom M. Mitchell’s notes).

A nice and *naive* (so to speak) approximation is to assume that all the components of $X$ are conditionally independent given $Y$! That is, we assume that $P(X \mid Y) = \prod_{i=1}^{n} P(x_i \mid Y)$. And this is how we rewrite Bayes’ rule into the “naive Bayes” rule:

$$P(Y \mid X) = \frac{P(Y) \prod_{i=1}^{n} P(x_i \mid Y)}{P(X)}$$

where $n$ is the number of features (or dimensions) in the data. As I’ve mentioned before, to use this as a classifier we just need to find the class $y_k$ with the greatest probability for a given $X$. This means that we don’t actually lose anything if we drop the denominator of the previous equation (it’s only a scaling factor, after all). So we can define the naive Bayes classification rule as

$$\hat{y} = \arg\max_{y_k} \; P(Y = y_k) \prod_{i=1}^{n} P(x_i \mid Y = y_k),$$

which says that you get the most likely classification $\hat{y}$ of $X$ when $y_k$ maximizes the expression inside the $\arg\max$.
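In code, that classification rule is a single argmax over the classes. A minimal sketch (the priors and per-feature likelihood values here are hypothetical, as if already evaluated at the observed $x_i$):

```python
import math

# Hypothetical learned quantities: priors P(Y=y) and the per-feature
# likelihoods P(x_i | Y=y) already evaluated at the observed x.
priors = {"red": 0.4, "blue": 0.6}
feature_likelihoods = {
    "red":  [0.50, 0.60],   # P(x1|red), P(x2|red)
    "blue": [0.20, 0.10],   # P(x1|blue), P(x2|blue)
}

def naive_bayes_score(y):
    # P(Y=y) * prod_i P(x_i | Y=y); the denominator P(X) is dropped
    # since it does not change the argmax.
    return priors[y] * math.prod(feature_likelihoods[y])

prediction = max(priors, key=naive_bayes_score)
print(prediction)  # "red": 0.4*0.5*0.6 = 0.12 beats 0.6*0.2*0.1 = 0.012
```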

Now for continuous $x_i$, you’ll have to decide what distribution you’ll assume for $P(x_i \mid Y)$. A common choice is the normal distribution. In this case, the learning stage boils down to just estimating the means and variances of each feature for each target class.
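Here is a standalone sketch of that learning stage under the normal assumption (this is not the mlearn code, just a minimal version using NumPy and scipy.stats, with made-up class means): “training” is nothing more than computing per-class priors, feature means, and feature standard deviations, and classifying is the argmax of the log posterior.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)

# Two-class toy data: 50 points per class, 2 features each
X0 = rng.normal([1.0, 1.0], 1.0, size=(50, 2))
X1 = rng.normal([3.4, 2.1], 1.0, size=(50, 2))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(50), np.ones(50)])

# "Training": per-class priors, feature means, and feature std devs
classes = np.unique(y)
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
stds = np.array([X[y == c].std(axis=0, ddof=1) for c in classes])

def classify(x):
    # log P(Y=c) + sum_i log P(x_i | Y=c) under the normal assumption;
    # logs avoid underflow when multiplying many small likelihoods
    log_post = np.log(priors) + st.norm.logpdf(x, means, stds).sum(axis=1)
    return classes[np.argmax(log_post)]

print(classify(np.array([0.5, 0.8])))  # a point near the first mean: class 0
```

Working in log space is the usual trick here: the product of $n$ likelihoods turns into a sum of log-likelihoods, which doesn’t underflow.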

I’ve implemented this in Python as one of the classifiers in my mlearn module. You can find it on my GitHub (which also includes two other classifiers). But using the module is pretty straightforward (an example for using the module is also included in the git repository); you just need a few lines of code:

from pylab import *
import scipy.stats as st
import mlearn
# Create a binary-class data set and randomly divide it
# into training and testing sets
nPts = 200
nCls = 2
mean1 = array([1., 1.])
sd1 = array([1., 1.])
mean2 = array([3.4, 2.1])
sd2 = array([1., 1.])
# Integer division: scipy expects an integer sample size
g1_x = st.norm.rvs(mean1[0], sd1[0], size=nPts // nCls)
g1_y = st.norm.rvs(mean1[1], sd1[1], size=nPts // nCls)
g1 = column_stack((g1_x, g1_y))
g2_x = st.norm.rvs(mean2[0], sd2[0], size=nPts // nCls)
g2_y = st.norm.rvs(mean2[1], sd2[1], size=nPts // nCls)
g2 = column_stack((g2_x, g2_y))
data = concatenate((g1, g2), axis=0)
# Append the class labels (0 and 1) as a third column
target = concatenate((zeros(nPts // nCls), ones(nPts // nCls))).reshape(nPts, 1)
data = concatenate((data, target), axis=1)
shuffle(data)
trainData = data[:nPts // 2, :]
testData = data[nPts // 2:, :]  # was data[101:,:], which silently dropped a point
# Create the naive Bayes classifier object and train it
bayMod = mlearn.Ngbayes(trainData[:, :2], trainData[:, 2])
bayMod.train()
# Run it forward (classify the held-out points)
prd = bayMod.classify(testData[:, :2])

Here’s the result of this simple example:

Note that the classifier isn’t perfect: even for this simple example, it misclassifies four data points.
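Counting the misclassifications is a one-liner once you have the predictions. Toy arrays stand in here for prd and the true test labels:

```python
import numpy as np

# Hypothetical predictions vs. true labels, just to show the check;
# in the example above you'd compare prd against testData[:, 2]
prd = np.array([0., 0., 1., 1., 0.])
truth = np.array([0., 1., 1., 1., 0.])

n_wrong = int(np.sum(prd != truth))
accuracy = np.mean(prd == truth)
print(n_wrong, accuracy)  # 1 0.8
```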

Alternatively, you can also do it in R (you’ll need the e1071 CRAN package):

require(e1071)
# Assumes trainData is a data frame of the feature columns plus a
# factor column named trainTarget holding the class labels
mod <- naiveBayes(trainTarget ~ ., data=trainData)
prd <- predict(mod, testData)