## Latex lineno package, where art thou?

If you’re trying to add line numbers to your LaTeX document, you’ll need the lineno package. For the Ubuntu 12.04 users out there, you won’t find it in the texlive-latex-base package. In fact, it isn’t even in texlive-latex-extra (which was my first guess). Oddly enough, it’s packaged with texlive-humanities.


## Extracting PDF pages using Ghostscript (in Linux)

Here’s a way to extract specific consecutive pages from a PDF file using Ghostscript in Linux. It’s pretty handy since most Linux systems already have Ghostscript installed.

To extract pages 7 to 11 from the file ‘my_thesis.pdf’, just type:

gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dFirstPage=7 -dLastPage=11 \
   -sOutputFile=selected_pages.pdf my_thesis.pdf
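If you do this often, the same command is easy to wrap in a small Python helper (just a sketch; `gs_extract_args` and `extract_pages` are hypothetical names, and the helper simply shells out to the gs invocation above):

```python
import subprocess

def gs_extract_args(src, dst, first, last):
    # build the Ghostscript command line shown above
    return ["gs", "-sDEVICE=pdfwrite", "-dNOPAUSE", "-dBATCH",
            "-dFirstPage=%d" % first, "-dLastPage=%d" % last,
            "-sOutputFile=%s" % dst, src]

def extract_pages(src, dst, first, last):
    # requires Ghostscript to be installed and on the PATH
    subprocess.run(gs_extract_args(src, dst, first, last), check=True)
```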

## Generating Poisson random variates with Boost libraries

Boost libraries come in pretty handy whenever you need to generate random variates distributed according to some distribution.

To generate Poisson-distributed random variates, you can do something like this:

#include <iostream>
using std::cout;
using std::endl;
#include <iomanip>
using std::setprecision;
#include <cstdlib>
using std::atoi;
using std::atof;
#include <ctime>
using std::time;
#include <boost/random/mersenne_twister.hpp>
using boost::mt19937;
#include <boost/random/poisson_distribution.hpp>
using boost::poisson_distribution;
#include <boost/random/variate_generator.hpp>
using boost::variate_generator;

int main(int argc, char * argv[]) {
  int Nsim = atoi(argv[1]);
  double exp = atof(argv[2]);
  mt19937 gen;
  gen.seed(time(NULL));
  poisson_distribution<int> pdist(exp);
  variate_generator< mt19937, poisson_distribution<int> > rvt(gen, pdist);
  for(int i = 0; i < Nsim; i++) cout << rvt() << endl;
  return 0;
}


In this snippet you first declare a generator object gen based on the Mersenne Twister algorithm, then you declare the distribution object pdist with mean exp, and finally you declare the variate generator object rvt, which takes the two previously declared objects as parameters. You then generate the random variates by invoking rvt() (not really critical… but to be precise, rvt() is just overloading the () operator; see its template definition here).

To compile you just need to point the compiler at the Boost header files:

g++ -I$BOOSTROOT/include poisson_gen.cc -o poisson_gen

where $BOOSTROOT is the directory where my Boost libraries live.

Now suppose you wanted to generate 3 columns of Poisson random variates, where each column corresponds to a certain mean. You might be tempted to do the following:

#include <iostream>
using std::cout;
using std::endl;
#include <iomanip>
using std::setprecision;
#include <cstdlib>
using std::atoi;
using std::atof;
#include <ctime>
using std::time;
#include <boost/random/mersenne_twister.hpp>
using boost::mt19937;
#include <boost/random/poisson_distribution.hpp>
using boost::poisson_distribution;
#include <boost/random/variate_generator.hpp>
using boost::variate_generator;

int main(int argc, char * argv[]) {
  int Nsim = atoi(argv[1]);
  double exp[] = {3.2, 10.45, 45.3};
  mt19937 gen;
  gen.seed(time(NULL));
  for(int i = 0; i < Nsim; i++) {
    for(int j = 0; j < 3; j++) {
      poisson_distribution<int> pdist(exp[j]);
      variate_generator< mt19937, poisson_distribution<int> > rvt(gen, pdist);
      cout << rvt() << " ";
    }
    cout << endl;
  }
  return 0;
}


But if you run this code, you’ll get the following result:

2 8 46
2 8 46
2 8 46
2 8 46
...


which is far from three sets of Poisson random variates. The culprit is the inner loop, where a variate_generator is constructed from the same mt19937 object on every iteration. Because the engine is passed by value, each variate_generator receives a copy of the identical, untouched engine state, so every row repeats the same draws.

A solution to this problem is to create a vector containing the variate_generator objects (one for each mean) before entering the for loop, and then call each variate_generator inside the main loop. The fix is shown here:

#include <iostream>
using std::cout;
using std::endl;
#include <iomanip>
using std::setprecision;
#include <cstdlib>
using std::atoi;
using std::atof;
#include <ctime>
using std::time;
#include <vector>
using std::vector;
#include <boost/random/mersenne_twister.hpp>
using boost::mt19937;
#include <boost/random/poisson_distribution.hpp>
using boost::poisson_distribution;
#include <boost/random/variate_generator.hpp>
using boost::variate_generator;

int main(int argc, char * argv[]) {
  int Nsim = atoi(argv[1]);
  double exp[] = {3.2, 10.45, 45.3};
  int simVrt;
  mt19937 gen;
  gen.seed(time(NULL));
  vector< variate_generator< mt19937, poisson_distribution<int> > > rvt;
  for(int j = 0; j < 3; j++) {
    poisson_distribution<int> pdist(exp[j]);
    variate_generator< mt19937, poisson_distribution<int> > j_vgen(gen, pdist);
    rvt.push_back(j_vgen);
  }
  cout << setprecision(20);
  for(int i = 0; i < Nsim; i++) {
    for(int j = 0; j < 3; j++) {
      simVrt = rvt[j]();
      cout << simVrt << " ";
    }
    cout << endl;
  }
  return 0;
}


Running this code you should get something like this:

2 10 41
2 11 59
4 8 44
3 9 31
3 12 49
3 14 50
1 14 43
...


which looks much more like what we wanted in the first place.
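For comparison, the same three-column generation takes one line per column in Python with numpy, where a single shared generator advances its state on every draw, so the copied-state pitfall above doesn't arise (a sketch, not part of the Boost example):

```python
import numpy as np

means = [3.2, 10.45, 45.3]
nsim = 5
rng = np.random.default_rng(42)  # one shared generator; its state advances per draw
# each column draws from the same underlying stream, so rows differ
samples = np.column_stack([rng.poisson(m, size=nsim) for m in means])
print(samples)
```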

## Running synergy in command line

I use synergy almost every day to connect my laptop at the office. It would be even better if I could somehow ALT+TAB into my laptop, but for the time being I’m content with using the mouse to move control over to it.

Here’s how you set it up in Linux (Ubuntu 12.04). First, install it (on both the server and the client machine):

sudo apt-get install synergy

On the server machine, create a .synergy.conf file in your $HOME with the following content:

section: screens
    server-name:
    client-name:
end

section: links
    server-name:
        left = client-name
    client-name:
        right = server-name
end


where server-name and client-name are, respectively, the names of your server and client computers. With the configuration above, moving your mouse pointer past the left edge of your server’s monitor will make the pointer appear on the right edge of your client’s monitor.

To start running the synergy server you just have to enter (on the server machine):

synergys

and to start running the synergy client you just have to enter (on the client machine):

synergyc server-name

To stop either one, I just find its pid with:

ps -A | grep synergy

and kill it with

kill pid-number

Of course you can always write a script to do all that. One final note: you might need to configure your firewall to allow connections through port 24800.

## Nice looking CV or resume with Latex moderncv

If you’re in need of updating your CV or resume, I’d suggest doing it in $\LaTeX$. “For some reason anything that’s written in $\LaTeX$ looks more legitimate…” or at least that’s what the late Peter Eklund once told me.

I recently learned about moderncv, which provides a few really nice templates for CVs (although they can also be tweaked to produce resumes). The only problem I found was that the version in the Ubuntu repository was out of date, but I did find a simple solution online to update moderncv.

First you need to get the updated moderncv package (you can get it here). Once you’ve downloaded and extracted the contents of the file, copy them to the texmf directory. Don’t worry about any older version of moderncv you may have: whatever is in that directory takes precedence.

To find where that directory is just type:

kpsewhich -var-value TEXMFLOCAL

On my computer (Ubuntu 12.04) that directory is located in

/usr/local/share/

You’ll need to create a new directory for moderncv:

sudo mkdir -p /usr/local/share/texmf/tex/latex/moderncv

After that, just copy the contents you downloaded into the newly created directory. Finally, run:

sudo texhash

## Compounding p-values

Some time ago, I needed a way to summarize different independent experiments that yielded different p-values (with and without weighting the p-values). Fortunately, this is a known problem with a known solution.

The problem was the following: suppose you have $n$ experiments (of continuous random variables) with p-values $\alpha_1, \alpha_2, \ldots, \alpha_n$, whose product is $\pi = \prod^n_{i=1} \alpha_i$. If you repeated the $n$ experiments and obtained p-values $p_1, p_2, \ldots, p_n$, what is the probability that the product $P = \prod^n_{i=1} p_i$ is less than or equal to $\pi$? (i.e., what’s $\mathbb{P}(P\leq\pi)$?)

R. A. Fisher solved this problem (no surprise there) for unweighted, independent experiments, and Good extended it to weighted, independent ones. If you’re interested, you can see both proofs in one of my notes. For Fisher’s compounding, I wrote down my own version of the proof, which is a little different from what Fisher originally did, but the end results are the same.

Anyway, for unweighted independent p-values the compounding can be calculated as:

$\mathbb{P}(P\leq\pi) = \pi \displaystyle\sum_{j=0}^{n-1} \dfrac{(-\ln\pi)^j}{j!}$

On the other hand, for weighted p-values with weights $w_1, w_2, \ldots, w_n$, observed weighted product $\Pi = \displaystyle\prod^n_{i=1} \alpha_i^{w_i}$, and repeated-experiment product $P = \displaystyle\prod^n_{i=1} p_i^{w_i}$, the compounding can be determined as:

$\mathbb{P}(P \leq \Pi) = \displaystyle\sum^{n}_{j=1} \Lambda_j \Pi^{1/w_j}$

where $\Lambda_j = \frac{w^{n-1}_{j}}{\displaystyle\prod_{k \neq j}(w_j-w_k)}$.

The following is a simple implementation (in Python) of both Fisher’s and Good’s methods:

from pylab import *
import scipy.stats as st
import math as m

def getFisherPvalue(pvalues):
    tt = prod(pvalues)
    nSam = len(pvalues)
    fP = 0
    for i in range(nSam):
        fP += tt * ((-log(tt))**i) / m.factorial(i)
    return fP

def getGoodPvalue(pvalues, weights):
    # pvalues and weights are expected to be numpy arrays
    TT = prod(pvalues**weights)
    nSam = len(pvalues)
    gP = 0
    for i in range(nSam):
        crossProd = 1.
        for j in range(nSam):
            if i == j:
                continue
            crossProd *= weights[i] - weights[j]
        gP += ((weights[i]**(nSam-1)) / crossProd) \
              * TT**(1./weights[i])
    return gP
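Fisher's series is also equal to the tail of a $\chi^2$ distribution with $2n$ degrees of freedom evaluated at $-2\ln\pi$ (the familiar form of Fisher's method), which gives a quick numerical cross-check of the formula above (a sketch, assuming scipy is available):

```python
import math
from scipy.stats import chi2

def fisher_series(pvalues):
    # the closed-form series: pi * sum_{j=0}^{n-1} (-ln pi)^j / j!
    log_pi = sum(math.log(p) for p in pvalues)
    return math.exp(log_pi) * sum((-log_pi)**j / math.factorial(j)
                                  for j in range(len(pvalues)))

pvals = [0.05, 0.20, 0.41]
series = fisher_series(pvals)
# equivalent chi-square form: P(chi2_{2n} >= -2 ln pi)
chisq = chi2.sf(-2.0 * sum(math.log(p) for p in pvals), df=2 * len(pvals))
```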


One problem with both Fisher’s and Good’s methods (especially the latter) is that an implementation can run out of floating-point precision when there are many small p-values (unless you use arbitrary-precision arithmetic). An alternative is to use an approximation. I have grown used to calling it Bhoj’s approximation (although I’m not sure anybody else does). The approximate calculation is:

$\mathbb{P}(P \leq \Pi) \approx 1 - \displaystyle\sum^n_{j=1} w_j \int^{-\ln{\Pi}}_{0} \Gamma(\tfrac{1}{w_j},w_j)(y)\ dy$

As horrible as this last approximation may look, it’s actually quite easy to evaluate numerically, since the integral is just the cumulative distribution function of a gamma distribution (with shape $1/w_j$ and scale $w_j$). Here’s a solution in Python (with scipy.stats):

from pylab import *
import scipy.stats as st

def getBhojPvalue(pvalues, weights):
    logTT = sum(weights*log(pvalues))
    nSam = len(pvalues)
    bP = 0
    for i in range(nSam):
        bP += weights[i] * st.gamma.cdf(-logTT, 1./weights[i], \
                                        loc=0, scale=weights[i])
    return 1. - bP


I find Bhoj’s approximation quite handy since you can also use it for unweighted p-values: you just have to define dummy weights that are all equal and add up to unity (i.e., $w_j = 1/n$).
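In that equal-weight case, $-\ln\Pi$ is exactly Gamma-distributed with shape $n$ and scale $1/n$, so the gamma-tail expression coincides with Fisher's exact result. Here's a quick self-contained check (a sketch, assuming scipy; it reimplements both formulas rather than importing the earlier snippets):

```python
import math
from scipy.stats import gamma, chi2

def bhoj_equal_weights(pvalues):
    # Bhoj's gamma-tail approximation with equal weights w_j = 1/n
    n = len(pvalues)
    w = 1.0 / n
    log_TT = w * sum(math.log(p) for p in pvalues)
    return 1.0 - sum(w * gamma.cdf(-log_TT, 1.0 / w, scale=w) for _ in range(n))

def fisher_exact(pvalues):
    # Fisher's method as a chi-square tail with 2n degrees of freedom
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2.sf(stat, df=2 * len(pvalues))

pvals = [0.03, 0.25, 0.6, 0.11]
```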

IMPORTANT: I can’t stress enough that all three methods discussed in this entry are only valid for measurements of continuous random variables! The reason is that the methods assume a uniform distribution of p-values under the null hypothesis, which holds for continuous distributions but not for discrete ones. Here’s the proof if you’re interested.

## Naive Bayes Classifier

The naive Bayes classifier is one of the simplest machine learning algorithms. It uses Bayes’ rule together with an assumption that makes it… well, naive.

This is what Bayes’ rule asserts:

$\mathbb{P}(C|X) = \dfrac{\mathbb{P}(X|C)\mathbb{P}(C)}{\sum_j \mathbb{P}(X|C=c_j)\mathbb{P}(C=c_j)}$

where $C$ is the (categorical) classification and $X$ a feature vector (a data point). You can picture this formula as describing the drawing of a colored marble, where the marbles’ weights and sizes are their features (the components of $X$) and the color itself, say red or blue, is their classification.

Now, if I drew a marble and measured its weight ($w_o$) and diameter ($d_o$), I could ask the following question: what are the chances that a marble weighing $w_o$ and measuring $d_o$ in diameter is red? The answer Bayes suggests is simply proportional to the product of the probability of drawing a red marble with features $w_o$ and $d_o$ (i.e., $\mathbb{P}(X|C)$) and the probability of drawing any red marble (i.e., $\mathbb{P}(C)$). I say proportional because you can think of the denominator as a normalization factor.

This is essentially what a Bayes classifier does: it calculates the probability of a data vector $X$ belonging to each class $C$, and the class with the greatest probability is the most likely candidate. In practice, the prior $\mathbb{P}(C)$ is easily estimated, but $\mathbb{P}(X|C)$ not quite so (for more details you can check Tom M. Mitchell’s notes).

A nice and naive (so to speak) approximation is to assume that all the components of $X$ are conditionally independent given $C$! That is, we assume that $\mathbb{P}(X|C) = \mathbb{P}(x_1|C) \mathbb{P}(x_2|C) \ldots \mathbb{P}(x_p|C)$. And this is how we rewrite Bayes rule into the “Naive Bayes” rule:

$\mathbb{P}(C|X) = \dfrac{\prod^p_{i=1}\mathbb{P}(x_i|C)\mathbb{P}(C)}{\sum_j \prod^p_{i=1}\mathbb{P}(x_i|C=c_j)\mathbb{P}(C=c_j)}$

where $p$ is the number of features (or dimensions) in the data. As mentioned before, to use this as a classifier we just need to find the class $c_j$ with the greatest probability given $X$. This means we don’t actually lose anything by dropping the denominator of the previous equation (it’s only a scaling factor, after all). So we can define the naive Bayes classification rule as

$C = \underset{c_j}{\mathrm{argmax}}\left\{\ \mathbb{P}(c_j)\displaystyle\prod^p_{i=1}\mathbb{P}(x_i|c_j)\ \right\}$

which says that the most likely classification of $X$ is the $c_j$ that maximizes the expression inside the argmax.

Now for continuous $X$, you’ll have to decide what distribution you’ll assume for $\mathbb{P}(X|C)$. A common choice is to use a normal distribution function. In this case, the learning stage boils down to just estimating the means and variances for each feature and target class.
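To make that learning step concrete, here's a minimal Gaussian naive Bayes sketch in plain numpy (an illustration under the normal-likelihood assumption above, not the mlearn module itself):

```python
import numpy as np

def train_gnb(X, y):
    # "training" = estimating per-class prior, feature means, and variances
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0))
    return params

def classify_gnb(params, X):
    # pick the class maximizing log P(c) + sum_i log N(x_i; mu_ci, var_ci)
    classes = sorted(params)
    scores = []
    for c in classes:
        prior, mu, var = params[c]
        ll = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
        scores.append(np.log(prior) + ll)
    return np.array(classes)[np.argmax(np.column_stack(scores), axis=1)]

# toy usage: two well-separated 2-D Gaussian classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
pred = classify_gnb(train_gnb(X, y), X)
```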

I’ve implemented this in Python as one of the classifiers in my mlearn module. You can find it on my GitHub (which also includes two other classifiers). Using the module is pretty straightforward (an example is also included in the git repository); you just need a few lines of code:

from pylab import *
import scipy.stats as st
import mlearn

# Create a binary-class data set and randomly divide it
# into training and testing sets
nPts  = 200
nCls  = 2
mean1 = array([1., 1.])
sd1   = array([1., 1.])
mean2 = array([3.4, 2.1])
sd2   = array([1., 1.])
g1_x = st.norm.rvs(mean1[0], sd1[0], size=nPts//nCls)
g1_y = st.norm.rvs(mean1[1], sd1[1], size=nPts//nCls)
g1 = concatenate( (g1_x.reshape(len(g1_x),1), g1_y.reshape(len(g1_y),1)), axis=1 )
g2_x = st.norm.rvs(mean2[0], sd2[0], size=nPts//nCls)
g2_y = st.norm.rvs(mean2[1], sd2[1], size=nPts//nCls)
g2 = concatenate( (g2_x.reshape(len(g2_x),1), g2_y.reshape(len(g2_y),1)), axis=1 )
data = concatenate( (g1, g2), axis=0 )
target = concatenate( (zeros(nPts//nCls), ones(nPts//nCls)) ).reshape(nPts, 1)
data = concatenate( (data, target), axis=1 )
shuffle(data)
trainData = data[:100,:]
testData  = data[100:,:]  # start at 100 so no row is skipped

# Create naive bayes classifier object and train it
bayMod = mlearn.Ngbayes(trainData[:,:2], trainData[:,2])
bayMod.train()

# Run it forward (classify)
prd = bayMod.classify(testData[:,:2])


Here’s the result of this simple example:

Note that the classifier isn’t perfect. Even in this simple example, it misclassifies four data points.

Alternatively, you can do the same in R (you’ll need the e1071 CRAN package; here trainData is a data frame that includes the trainTarget column):

require(e1071)
mod <- naiveBayes(trainTarget ~ ., data=trainData)
prd <- predict(mod, testData)