<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>404: Page not found &#187; programming</title>
	<atom:link href="http://krishnamurthy.net.in/blog/category/programming/feed/" rel="self" type="application/rss+xml" />
	<link>http://krishnamurthy.net.in/blog</link>
	<description>You are what you read, and with whom you cook</description>
	<lastBuildDate>Sun, 18 Dec 2011 06:04:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>NetSVMLight: Stratified Cross Validation for Parameter Selection</title>
		<link>http://krishnamurthy.net.in/blog/2011/05/25/netsvmlight-stratified-cross-validation-for-parameter-selection/</link>
		<comments>http://krishnamurthy.net.in/blog/2011/05/25/netsvmlight-stratified-cross-validation-for-parameter-selection/#comments</comments>
		<pubDate>Wed, 25 May 2011 07:17:44 +0000</pubDate>
		<dc:creator>Krishnamurthy Koduvayur Viswanathan</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://krishnamurthy.net.in/blog/?p=338</guid>
		<description><![CDATA[SVMlight provides the option to perform leave one out cross validation using the -x switch. The problem is that leave one out is too expensive for moderately large datasets. If you want to do say 10-fold cross validation, then you need to manage the folds yourself. I finished writing an n-fold cross validation functionality for [...]]]></description>
			<content:encoded><![CDATA[<p>SVMlight can perform leave-one-out cross validation through the -x switch, but leave-one-out is too expensive for even moderately large datasets. If you want to do, say, 10-fold cross validation, you need to manage the folds yourself. I have finished writing n-fold cross validation for <a href="http://krishnamurthy.net.in/blog/2011/05/09/netsvmlight-a-net-wrapper-for-svmlight/">NetSVMLight</a>. The example in this post demonstrates how you can use cross validation to pick the values of parameters to train your SVM.</p>
<p>For my problem, the parameter I was trying to pick was the cost-factor -j. The principled way of doing this is as follows: for each candidate value of the cost-factor, perform 10-fold (or n-fold) cross validation on the training dataset and compute the average accuracy, precision, and recall over the n models (one per fold). Pick the value of the cost-factor that gives the best cross validation results. What counts as best depends on your requirements, for example whether you are willing to trade off some recall for additional precision. With the selected value of the parameter (the cost-factor in this case), train a new model on the entire training dataset.</p>
<p>The basic idea behind n-fold cross validation is that the training dataset is divided into n folds. In stratified cross validation (which is what NetSVMLight does), each fold contains approximately the same proportion of positive and negative class labels as the entire dataset. Within that constraint, each feature vector is randomly assigned to one of the n folds, which ensures a fair distribution. Using a fixed value for all parameters, a model is trained on n-1 of the n folds and then tested on the remaining unseen fold, which yields a value for precision, recall and accuracy. This process is repeated n times in total, so that each fold is used exactly once for testing. The results of the n runs are averaged and reported as the results of the n-fold cross validation.</p>
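<p>To make the fold construction concrete, here is a minimal Python sketch of stratified fold assignment. It is not the NetSVMLight implementation (which is in C# and writes the folds to disk); the function names and the round-robin dealing strategy are my own assumptions for illustration.</p>

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds, seed=0):
    """Assign each example index to one of n_folds folds, keeping the
    class proportions in every fold close to those of the whole dataset."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)  # random assignment within each class
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)  # deal out round-robin
    return folds

def train_test_splits(folds):
    """Yield (train_indices, test_indices); each fold is the test set once."""
    for held_out in range(len(folds)):
        train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
        yield train, folds[held_out]
```

<p>With 30 positive and 70 negative examples and n = 10, every fold produced by this sketch holds 3 positives and 7 negatives, and iterating over train_test_splits gives the n train/test runs described above.</p>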
<p>Hence, in order to pick a parameter, such an n-fold cross validation is performed for each value of the parameter to be chosen. The value of the parameter that gives the most desired cross validation results is picked.<br />
<script src="https://gist.github.com/990475.js?file=Program.cs"></script><br />
In the code above, the ConstructNFolds method first constructs each of the n-folds on the disk. In the first for-loop, ten different values for the parameter are tried (you can try as many as you wish). The results of each cross validation set are stored in a dictionary, along with the corresponding value of the parameter being tested. The foreach loop simply goes over the dictionary, and saves the results to a file and prints them to the console.</p>
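<p>The same sweep can be sketched in a few lines of Python. This is only an outline of the loop described above, not the C# code from the gist: cross_validate is a hypothetical callable that runs one n-fold cross validation for a given parameter value and returns one (precision, recall, accuracy) triple per fold.</p>

```python
def sweep_parameter(candidate_values, cross_validate):
    """Run one cross validation per candidate value and collect the
    averaged (precision, recall, accuracy) triple for each value."""
    results = {}
    for value in candidate_values:
        fold_scores = cross_validate(value)  # one (p, r, a) triple per fold
        count = len(fold_scores)
        results[value] = tuple(sum(column) / count for column in zip(*fold_scores))
    return results
```

<p>The returned dictionary maps each parameter value to its averaged scores, which can then be printed, saved or plotted to choose the tradeoff you want.</p>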
<p>For my dataset, I tried 10 different values of the cost-parameter and recorded the corresponding precision, recall and accuracy from each run of the 10-fold cross validation.</p>
<p><a href="https://lh4.googleusercontent.com/_8DTdVNRe-EE/TdytR5EUkoI/AAAAAAAABu8/g7Qqg5q4v9E/10foldcv.jpg"><img class="alignnone" src="https://lh4.googleusercontent.com/_8DTdVNRe-EE/TdytR5EUkoI/AAAAAAAABu8/g7Qqg5q4v9E/10foldcv.jpg" alt="" width="525" height="310" /></a></p>
<p>At cost = 0.75, I think there is a reasonable tradeoff between precision and recall. Hence I pick this value and train my SVM over the entire training dataset.</p>
<p>The latest source and binaries can be downloaded <a href="http://code.google.com/p/net-svmlight/downloads/detail?name=NetSVMLightv1.1.zip&amp;can=2&amp;q=">here</a> or <a href="http://code.google.com/p/net-svmlight/source/checkout">checked out</a> using SVN.</p>
]]></content:encoded>
			<wfw:commentRss>http://krishnamurthy.net.in/blog/2011/05/25/netsvmlight-stratified-cross-validation-for-parameter-selection/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>NetSVMLight: A .NET Wrapper for SVMlight</title>
		<link>http://krishnamurthy.net.in/blog/2011/05/09/netsvmlight-a-net-wrapper-for-svmlight/</link>
		<comments>http://krishnamurthy.net.in/blog/2011/05/09/netsvmlight-a-net-wrapper-for-svmlight/#comments</comments>
		<pubDate>Tue, 10 May 2011 04:24:26 +0000</pubDate>
		<dc:creator>Krishnamurthy Koduvayur Viswanathan</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://krishnamurthy.net.in/blog/?p=323</guid>
		<description><![CDATA[Thorsten Joachims&#8217;s SVMlight does not need an introduction to the relevant audience. For Windows it comes as a pair of binaries that can be run on the command line. The software is originally written in C and built with gcc. I am not very good at using C/C++, but had some specific requirements where I had to run several experiments [...]]]></description>
			<content:encoded><![CDATA[<p>Thorsten Joachims&#8217;s <a href="http://svmlight.joachims.org/">SVMlight</a> does not need an introduction to the relevant audience. For Windows it comes as a pair of binaries that can be run on the command line. The software is originally written in C and built with gcc. I am not very good at using C/C++, but had some specific requirements where I had to run several experiments and analyze the results. Doing this on the command line was extremely tedious, and naturally I was looking for an SVMlight API in Java or Python.</p>
<p>There are a bunch of extensions and additions listed on the website that provide an interface to SVMlight. Most of these (at least the ones in Python) simply provide an interface to set parameters and run the executables from within the code. At least one of the Python extensions did not support different kernels. I found some of them hard to understand due to insufficient documentation. So I thought of writing one in C#, simply because I did not find one featured on the SVMlight page. Here is a list of features:</p>
<ul>
<li>Well documented</li>
<li>Strongly typed parameters for svm_learn and svm_classify</li>
<li>Supports all kernels provided by SVMlight</li>
<li>Supports most commonly used parameters (including kernel params and cross validation)</li>
<li>Can provide support for other parameters upon request</li>
<li>Constructs training and test sets from a given dataset and percentage split</li>
<li>Creates a list of all misclassified instances for further analysis</li>
</ul>
<p>Here is an example usage:<br />
<script src="https://gist.github.com/963041.js?file=Program.cs"></script></p>
<p>You can download the source, binaries and documentation from:</p>
<p><a href="http://code.google.com/p/net-svmlight/">http://code.google.com/p/net-svmlight/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://krishnamurthy.net.in/blog/2011/05/09/netsvmlight-a-net-wrapper-for-svmlight/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Thank your friends on Facebook&#8230;more Graph API</title>
		<link>http://krishnamurthy.net.in/blog/2011/05/08/thank-your-friends-on-facebook-more-graph-api/</link>
		<comments>http://krishnamurthy.net.in/blog/2011/05/08/thank-your-friends-on-facebook-more-graph-api/#comments</comments>
		<pubDate>Sun, 08 May 2011 10:29:26 +0000</pubDate>
		<dc:creator>Krishnamurthy Koduvayur Viswanathan</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://krishnamurthy.net.in/blog/?p=312</guid>
		<description><![CDATA[I am going to get a lot of flak for this post. In some sense it is like digging my own grave, but I think the geek in you might be able to appreciate the ability to do something cool (even though it may take more time to create), rather than the monotonous repetitive. It [...]]]></description>
			<content:encoded><![CDATA[<p>I am going to get a lot of flak for this post. In some sense it is like digging my own grave, but I think the geek in you might be able to appreciate the ability to do something cool (even though it may take more time to create), rather than something monotonously repetitive.</p>
<p>It was my birthday yesterday. I was overwhelmed by the number of friends who wished me on Facebook, and I really wanted to thank each one of them by commenting on each &#8220;Happy Birthday&#8221; message. But when the number of such wishes becomes unmanageable, manually sitting and typing &#8220;thank you&#8221; (let alone a more personalized thank you) is quite daunting. Let me state up front that I really do thank all my friends for their good wishes, but I could not resist the idea of automating the process of writing thank you messages in response to each birthday wish.</p>
<p>So I set out with the plan that for each birthday message on my wall, I would create a comment carrying a personalized thank you with my friend&#8217;s first name. This seemed like a worthwhile thing to do, since I would be able to individually thank each of my friends, and the only other alternative was to be lazy and do nothing at all (since I was not going to be able to do this task manually).</p>
<p>So I set out to hack a python script using the Facebook Graph API. A few things to be considered:</p>
<ol>
<li>When you retrieve items from your wall, Facebook usually returns multiple pages of data. After playing around with paging for a while, I gave up on it: it was increasing the complexity of what I wanted to do, and I found their paging parameters difficult to understand.</li>
<li>Instead, I determined that the birthday wishes all fell within the hundred newest posts on my wall (some of those posts were not birthday wishes, and would need to be filtered out).</li>
<li>I decided to identify the birthday wishes using the criterion that such a message would contain the word &#8220;happy&#8221; in it (simply because there were different ways in which people wrote birthday e.g. bday, budday and various others). This may not be a very smart filtering mechanism, but it seemed to have pretty good precision for me, and I was willing to let go of the minor loss of recall (I would deal with these exceptional cases manually).</li>
<li>So here is the basic algorithm:
<ul>
<li>Get the list of the 100 most recent posts on my wall</li>
<li>If the post does not contain the word &#8220;happy&#8221;, or already has a comment on it, or doesn&#8217;t contain a &#8220;message&#8221; section in the JSON response, then discard it.</li>
<li>Else, extract the messageID, and make an additional HTTP request using the ID of the message sender to determine the sender&#8217;s first name</li>
<li>Make a new HTTP post to https://graph.facebook.com/messageID/comments using the <strong>appropriate</strong> access token to write the comment.</li>
</ul>
</li>
<li>One final note about access tokens: The list of wall posts could be obtained using the access token described in <a href="http://krishnamurthy.net.in/blog/2011/04/16/console-application-continued-for-facebook/">my previous post about the Facebook Graph API</a>, but in order to be able to post on my wall, I needed another access_token that would contain write permissions. So, I had to implement the OAuth authentication workflow and request the <strong>publish_stream</strong> permission. This procedure is <strong>not </strong>documented very well on <a href="http://developers.facebook.com/docs/authentication/">http://developers.facebook.com/docs/authentication/</a>.
<ul>
<li> Essentially, I made a request to
<pre>https://www.facebook.com/dialog/oauth?client_id=YOUR_APP_ID&amp;redirect_uri=http://www.facebook.com/connect/login_success.html&amp;scope=publish_stream</pre>
<p>This made my application ask me for write permission to my wall. Since I am the app creator, I authorized it.</p></li>
<li>This returned a <strong>code</strong> in the redirect URL, which I used to make another HTTP request to
<pre>https://graph.facebook.com/oauth/access_token?client_id=YOUR_APP_ID&amp;redirect_uri=http://www.facebook.com/connect/login_success.html&amp;client_secret=YOUR_CLIENT_SECRET&amp;code=CODE_FROM_ABOVE</pre>
</li>
<li>The response contains a new access token that can be used for write operations.
</li>
</ul>
</li>
</ol>
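<p>The filtering step of the algorithm above can be sketched as a small predicate. This is an illustration under assumptions, not the script from the gist: I assume each post is a dictionary parsed from the JSON response with optional &#8220;message&#8221; and &#8220;comments&#8221; keys, and I match &#8220;happy&#8221; case-insensitively. No HTTP requests are made here.</p>

```python
def needs_thank_you(post):
    """Return True if a wall post looks like an unanswered birthday wish.
    The keys "message" and "comments" are assumed field names."""
    message = post.get("message")
    if message is None:                 # no "message" section in the response
        return False
    if "happy" not in message.lower():  # keyword filter described above
        return False
    if post.get("comments"):            # already has a comment on it
        return False
    return True

def comment_url(post_id):
    """Build the comments endpoint for a post, following the URL pattern above."""
    return "https://graph.facebook.com/%s/comments" % post_id
```

<p>Posts for which needs_thank_you returns True are the ones that still need a comment; comment_url gives the address to which the comment would be posted with the write-enabled access token.</p>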
<p>Armed with this new access token, and the old one I already have, I could now read posts on my wall, and write comments to them. Here is the code in Python:<br />
<script src="https://gist.github.com/961236.js?file=comment.py"></script></p>
<p>One final thing: there were at least two posts on my wall that were completely missed by the Graph API, and no matter what I did, they were just not retrieved. I had to respond to these manually.</p>
]]></content:encoded>
			<wfw:commentRss>http://krishnamurthy.net.in/blog/2011/05/08/thank-your-friends-on-facebook-more-graph-api/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Console Application&#8230;continued for facebook</title>
		<link>http://krishnamurthy.net.in/blog/2011/04/16/console-application-continued-for-facebook/</link>
		<comments>http://krishnamurthy.net.in/blog/2011/04/16/console-application-continued-for-facebook/#comments</comments>
		<pubDate>Sat, 16 Apr 2011 15:25:27 +0000</pubDate>
		<dc:creator>Krishnamurthy Koduvayur Viswanathan</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://krishnamurthy.net.in/blog/?p=282</guid>
		<description><![CDATA[A couple of days ago, I was talking about how to use your private access token to extract information from your Twitter account using the Twitter API. Today, I tried using the RestFB library to extract information about my friends from facebook. Now, there is a different kind of a hack that I had to [...]]]></description>
			<content:encoded><![CDATA[<p>A couple of days ago, I was talking about how to use your private access token to extract information from your Twitter account using the Twitter API.</p>
<p>Today, I tried using the RestFB library to extract information about my friends from Facebook. There is a different kind of hack that I had to use in order to get an access token: log in to Facebook and go to this page: <a href="http://developers.facebook.com/docs/reference/api/user/">http://developers.facebook.com/docs/reference/api/user/</a>. Click on the example link on that page. In your browser address bar, you will see an access token. Save this in your configuration file (this access token expires after a certain duration).</p>
<p>There are a few further hoops to jump through to simply access data. The RestFB library gives you a list of your friends&#8217; names and ids. Even though it has methods to access other information such as the location, gender etc. of your friends, all of these method calls return null UNLESS each of your friends gives your &#8220;application&#8221; specific permissions to access these fields. Crazy ain&#8217;t it? Look at this thread on <a href="http://stackoverflow.com/questions/5516627/facebook-graph-api-get-friends-info">stackoverflow</a>. So much distress to simply programmatically access information about my friends that I can already see in my Facebook account.</p>
<p>So I had to take a different approach: Calling the graph API directly using direct HTTP GET requests as myself, rather than go through the application that needs to be specifically approved by my 350+ friends on FB.</p>
<p>Since I already have my friends&#8217; user IDs (from the previous call using restFB), I made individual HTTP requests to the following URL:</p>
<p><a href="https://graph.facebook.com/id?access_token=abcd">https://graph.facebook.com/id?access_token=abcd</a> (where abcd is the access token I received using the method described above). In the HTTP response body, you receive a JSON string which represents your friend&#8217;s information.</p>
<p>One final thing: making a GET request to an https link is painful in Java. I used a crude hack in order to do this (essentially installing an all-trusting trust manager&#8230;never do this). Here is the complete code:<br />
<script src="https://gist.github.com/918808.js?file=facebooknetwork.java"></script></p>
]]></content:encoded>
			<wfw:commentRss>http://krishnamurthy.net.in/blog/2011/04/16/console-application-continued-for-facebook/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What about the good ole&#8217; console application?</title>
		<link>http://krishnamurthy.net.in/blog/2011/04/12/what-about-the-good-ole-console-application/</link>
		<comments>http://krishnamurthy.net.in/blog/2011/04/12/what-about-the-good-ole-console-application/#comments</comments>
		<pubDate>Tue, 12 Apr 2011 22:05:10 +0000</pubDate>
		<dc:creator>Krishnamurthy Koduvayur Viswanathan</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://krishnamurthy.net.in/blog/?p=272</guid>
		<description><![CDATA[This is my second attempt at writing this post. Somehow I had a demented idea of using IE9 while writing this; and due to a stupid browser glitch, I lost my original draft. Now back on trusty old Firefox: essentially I was ranting about how I was having to jump through several hoops while using [...]]]></description>
			<content:encoded><![CDATA[<p>This is my second attempt at writing this post. Somehow I had a demented idea of using IE9 while writing this; and due to a stupid browser glitch, I lost my original draft. Now back on trusty old Firefox: essentially I was ranting about how I was having to jump through several hoops while using the Twitter and Facebook Graph APIs, mostly due to their OAuth workflow.</p>
<p>I am writing a simple console application that needs to collect some data and do some social network analysis. The complicated browser-based authentication workflows supported by both FB and Twitter are a huge hindrance to this purpose, since they assume that you are making a web application with several users. Essentially, your application exchanges your API key and some other secret key for an auth token (specific to each user), and this is done using the browser.</p>
<p>After a lot of digging, I discovered that Twitter decided to be a little considerate by providing a single-user auth_token that can be used for single-user scenarios like mine. So, I can simply start using my secret auth_token and completely bypass the OAuth workflow. To obtain your access token, log in at dev.twitter.com, select your app, and click on &#8220;My Access Token&#8221;.</p>
<p>Given below is an example. I hope this is useful for people who have requirements similar to mine. I will follow up with a post about how to get this working for Facebook.</p>
<p><script src="https://gist.github.com/916489.js?file=gistfile1.java"></script></p>
]]></content:encoded>
			<wfw:commentRss>http://krishnamurthy.net.in/blog/2011/04/12/what-about-the-good-ole-console-application/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Naive Bayes Classifier in 50 lines</title>
		<link>http://krishnamurthy.net.in/blog/2010/12/08/naive-bayes-classifier-in-50-lines/</link>
		<comments>http://krishnamurthy.net.in/blog/2010/12/08/naive-bayes-classifier-in-50-lines/#comments</comments>
		<pubDate>Wed, 08 Dec 2010 17:04:04 +0000</pubDate>
		<dc:creator>Krishnamurthy Koduvayur Viswanathan</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://krishnamurthy.net.in/blog/?p=254</guid>
		<description><![CDATA[The Naive Bayes classifier is one of the most versatile machine learning algorithms that I have seen around during my meager experience as a graduate student, and I wanted to do a toy implementation for fun. At its core, the implementation is reduced to a form of counting, and the entire Python module, including a [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">Naive Bayes classifier</a> is one of the most versatile machine learning algorithms that I have seen around during my meager experience as a graduate student, and I wanted to do a toy implementation for fun. At its core, the implementation is reduced to a form of counting, and the entire Python module, including a test harness took only 50 lines of code. I haven&#8217;t really evaluated the performance, so I welcome any comments. I am a Python amateur, and am sure that experienced Python hackers can trim a few rough edges off this code.</p>
<h2>Intuition and Design</h2>
<p>Here is the definition of the classifier functionality (from Wikipedia):</p>
<p><img class="alignleft" src="http://upload.wikimedia.org/math/c/2/e/c2e227dfe0979e43cf06bfa318652dd3.png" alt="" width="504" height="55" /></p>
<p>This means that for each possible class label, we multiply together the conditional probabilities of each feature given the class label. So to implement the classifier, all we need to do is compute these individual conditional probabilities p(Fi | Cj) for each label and each feature, and multiply them together with the prior probability p(Cj) for that label. The label for which we get the largest product is the label returned by the classifier.</p>
<p>In order to compute these individual conditional probabilities, we use the <a href="http://en.wikipedia.org/wiki/Maximum_likelihood">Maximum Likelihood Estimation</a> method. In short, we approximate these probabilities using the counts from the input/training vectors.</p>
<p>Hence we have: p(Fi | Cj) = count( Fi ^ Cj) / count(Cj)</p>
<p>That is, from the training corpus we compute the ratio of the number of occurrences of the feature Fi together with the label Cj to the total number of occurrences of the label Cj.</p>
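<p>As a small illustration of this counting, here is a sketch that estimates p(Fi | Cj) directly from a list of feature vectors. The function name and the toy data are made up for this example; like the ARFF convention used later in this post, the last element of each vector is assumed to be the class label.</p>

```python
def mle_conditional(vectors, feature_index, feature_value, label):
    """Estimate p(F = feature_value | C = label) as count(F ^ C) / count(C).
    Each vector is a list whose last element is the class label."""
    label_count = sum(1 for v in vectors if v[-1] == label)
    joint_count = sum(1 for v in vectors
                      if v[-1] == label and v[feature_index] == feature_value)
    return joint_count / label_count  # raises ZeroDivisionError for an unseen label

# Toy data: one feature (outlook) followed by the label.
data = [["sunny", "no"], ["sunny", "yes"], ["rain", "yes"], ["rain", "yes"]]
print(mle_conditional(data, 0, "rain", "yes"))
```

<p>Here &#8220;rain&#8221; occurs with the label &#8220;yes&#8221; twice and the label &#8220;yes&#8221; occurs three times, so the estimate is 2/3.</p>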
<h2>Zero Probability Problem</h2>
<p>What if we have never seen a particular feature Fa and a particular label Cb together in the training dataset? Whenever they occur together in the test data, p(Fa | Cb) will be zero, and hence the overall product will also be zero. This is a problem with maximum likelihood estimates: just because a particular observation was not made during training does not mean that it will never occur in the test data. To remedy this, we use what is known as smoothing. The simplest kind of smoothing, which we use in this code, is called &#8220;add one smoothing&#8221;. Essentially, the probability of an unseen event should be greater than zero. We achieve this by adding one to every count, so that no count is zero. The net effect is that some of the probability mass is redistributed from the non-zero count observations to the zero-count observations. Hence, we also need to increase the total count for each label by the number of possible observations, in order to keep the total probability mass at 1.</p>
<p>e.g. if we have two classes C = 0 and C = 1, then after smoothing, the smoothed MLE probabilities can be written as:</p>
<p>p-smoothed(Fi | Cj) = [count(Fi ^ Cj) + 1]/[count(Cj) + N] where N is the total number of observations across all features in the training corpus.</p>
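<p>The smoothed estimate is a one-line function. In this sketch the value of N is passed in as the argument n rather than computed; in the usage note I assume, purely for illustration, that n is the number of possible values of a single feature, which is the choice that makes the smoothed estimates for that feature sum to one.</p>

```python
def smoothed_probability(joint_count, label_count, n):
    """Add-one smoothed estimate: (count(Fi ^ Cj) + 1) / (count(Cj) + N)."""
    return (joint_count + 1) / (label_count + n)
```

<p>For a feature with two possible values seen 0 and 3 times under a label that occurs 3 times, the smoothed estimates are (0 + 1)/(3 + 2) and (3 + 1)/(3 + 2): the unseen value no longer has zero probability, and the two estimates still add up to one.</p>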
<h2>Code</h2>
<p>For simplicity, we will use Weka&#8217;s <a href="http://www.cs.waikato.ac.nz/~ml/weka/arff.html">ARFF</a> file format as input. We have a single class called Model which has a few dictionaries and lists to store the counts and feature vector details. In this implementation, we only deal with discrete valued features.<br />
<script src="https://gist.github.com/731471.js"> </script></p>
<p>The dictionary &#8216;features&#8217; saves all possible values for a feature. &#8216;<em>featureNameList</em>&#8217; is simply a list that contains the names of the features in the same order that they appear in the ARFF file. This is because our features dictionary does not have any intrinsic order, and we need to maintain feature order explicitly. &#8216;<em>featureCounts</em>&#8217; contains the actual counts for co-occurrence of each feature value with each label value. The keys for this dictionary are tuples of the form (class_label, feature_name, feature_value). Hence, if we have observed the feature F1 with the value &#8216;x&#8217; for the label &#8216;yes&#8217; fifteen times, then, before smoothing is taken into account, the key (&#8216;yes&#8217;, &#8216;F1&#8217;, &#8216;x&#8217;) maps to the count 15 in the dictionary. <strong>Note</strong> how the default value for counts in this dictionary is &#8216;1&#8217; instead of &#8216;0&#8217;. This is because we are smoothing the counts. The &#8216;<em>featureVectors</em>&#8217; list contains all the input feature vectors from the ARFF file. The last feature in each vector is the class label itself, as is the convention with Weka ARFF files. Finally, &#8216;<em>labelCounts</em>&#8217; stores the counts of the class labels themselves, i.e. how many times we saw the label Ci during training.<br />
We also have the following member functions in the Model class:<br />
<script src="https://gist.github.com/731482.js"> </script><br />
The above method simply reads the feature names (including class labels), their possible values, and the feature vectors themselves, and populates the appropriate data structures defined above.<br />
<script src="https://gist.github.com/731487.js"> </script><br />
The TrainClassifier method simply counts the number of co-occurrences of each feature value with each class label, and stores them keyed by 3-tuples. These counts are automatically smoothed by add-one smoothing, as the default value of a count in this dictionary is &#8216;1&#8217;. The counts of the labels are also adjusted by incrementing them by the total number of observations.<br />
<script src="https://gist.github.com/731492.js"> </script><br />
Finally, we have the Classify method, which accepts a single feature vector (as a list) as its argument and computes the product of the individual conditional probabilities (smoothed MLE) for each label. The final computed probabilities for each label are stored in the &#8216;<em>probabilityPerLabel</em>&#8217; dictionary. In the last line, we return the entry from <em>probabilityPerLabel</em> which has the highest probability. Note that the multiplication is actually done as addition in the log domain, as the numbers involved are extremely small. Also, one of the factors used in this multiplication is the prior probability of the class label.<br />
Here is the complete code, including a test method:<br />
<script src="https://gist.github.com/731413.js"> </script><br />
Download the <a href="http://cs.umbc.edu/~krishna3/linked-files/tennis.arff">sample ARFF file</a> to try it out.</p>
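<p>The log-domain trick used by Classify can be shown in isolation. The sketch below is not the code from the gists: it assumes the prior and the smoothed conditional probabilities have already been computed and are passed in as dictionaries, with the conditionals keyed by (class_label, feature_name, feature_value) tuples like the counts described above.</p>

```python
import math

def classify(feature_vector, feature_names, priors, conditionals):
    """Return the label with the largest log-probability score.

    priors maps label to p(C); conditionals maps
    (label, feature_name, feature_value) to the smoothed p(F | C)."""
    log_score = {}
    for label, prior in priors.items():
        total = math.log(prior)  # start from the prior of the label
        for name, value in zip(feature_names, feature_vector):
            # Adding logs is equivalent to multiplying probabilities,
            # but does not underflow when there are many small factors.
            total += math.log(conditionals[(label, name, value)])
        log_score[label] = total
    return max(log_score, key=log_score.get)
```

<p>Because the logarithm is monotonic, the label with the largest sum of logs is also the label with the largest product, so the decision is unchanged. This only works because smoothing guarantees that no probability is exactly zero.</p>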
<p>Update: I found a bug in the second-to-last line of the GetValues() function. This line gets the possible attribute values from the ARFF file and stores them in self.features. It did not deal with whitespace correctly. Update this line to: <code>self.features[self.featureNameList[len(self.featureNameList) - 1]] = [featureName.strip() for featureName in line[line.find('{')+1: line.find('}')].strip().split(',')]</code></p>
]]></content:encoded>
			<wfw:commentRss>http://krishnamurthy.net.in/blog/2010/12/08/naive-bayes-classifier-in-50-lines/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Can&#8217;t connect to MySQL server on &#8216;server&#8217;</title>
		<link>http://krishnamurthy.net.in/blog/2010/06/04/cant-connect-to-mysql-server-on-server/</link>
		<comments>http://krishnamurthy.net.in/blog/2010/06/04/cant-connect-to-mysql-server-on-server/#comments</comments>
		<pubDate>Fri, 04 Jun 2010 07:58:27 +0000</pubDate>
		<dc:creator>Krishnamurthy Koduvayur Viswanathan</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://krishnamurthy.net.in/blog/2010/06/04/cant-connect-to-mysql-server-on-server/</guid>
		<description><![CDATA[It&#8217;s been a long time since I wrote anything here. So, I thought let me resume by sharing a little piece of information I gathered the other day. So, I have been using MySQL to manage some data related to my research. I have installations on multiple machines that I use, and recently I had [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been a long time since I wrote anything here, so I thought I would resume by sharing a little piece of information I gathered the other day. I have been using MySQL to manage some data related to my research. I have installations on multiple machines that I use, and recently I had to install it on another Ubuntu machine. I did the following:</p>
<p><code>sudo apt-get install php5 mysql-server apache2 phpmyadmin</code></p>
<p>It worked fine, but then my Python script that runs on another machine began to complain that it could not connect to my MySQL server:</p>
<p><em><strong> Can&#8217;t connect to MySQL server on &#8216;server&#8217;</strong></em></p>
<p>Now that was just ridiculous because this had never happened before. So I searched and searched till I found what I was looking for:</p>
<p><a href="http://www.webmasterworld.com/forum10/6141.htm">http://www.webmasterworld.com/forum10/6141.htm</a></p>
<p>It turns out that there is a tiny piece of configuration in the /etc/mysql/my.cnf file that says:</p>
<p>bind-address = 127.0.0.1</p>
<p>which means that the server listens only on the loopback interface, so connections coming in from anywhere other than the local machine are refused. Comment out that line (or set it to the address the server should listen on) and restart the MySQL server. Things start working!</p>
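<p>Before digging through configuration files, it can help to check from the client machine whether the MySQL port is reachable at all. The following is a minimal sketch, assuming the server listens on the default MySQL port 3306; the function name <code>can_reach</code> and the host name used in the note below are placeholders of mine, not part of any MySQL tooling.</p>

```python
import socket


def can_reach(host, port=3306, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds.

    This only tests reachability of the port; it does not log in to MySQL.
    """
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing accepts connections on that address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

<p>Calling <code>can_reach("server")</code> from the other machine should return False while the server is bound only to 127.0.0.1, and True once it accepts remote connections.</p>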
]]></content:encoded>
			<wfw:commentRss>http://krishnamurthy.net.in/blog/2010/06/04/cant-connect-to-mysql-server-on-server/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Shuffle shuffle</title>
		<link>http://krishnamurthy.net.in/blog/2009/03/16/shuffle-shuffle/</link>
		<comments>http://krishnamurthy.net.in/blog/2009/03/16/shuffle-shuffle/#comments</comments>
		<pubDate>Sun, 15 Mar 2009 19:58:00 +0000</pubDate>
		<dc:creator>Krishnamurthy Koduvayur Viswanathan</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://krishnamurthy.net.in/blog/2009/03/16/shuffle-shuffle/</guid>
		<description><![CDATA[Several weeks ago, I found myself wondering how to shuffle a given list or array into a random order. This is a common task in many applications: online card games, your favourite music player, etc. The interesting thing about this problem is how some naive approaches, even though easy to code [...]]]></description>
			<content:encoded><![CDATA[<p>Several weeks ago, I found myself wondering how to shuffle a given list or array into a random order. This is a common task in many applications: online card games, your favourite music player, etc.</p>
<p>The interesting thing about this problem is how some naive approaches, even though easy to code and efficient enough, do not achieve the desired randomness.</p>
<p>Let&#8217;s define the problem: You have an input array which has elements in positions 1 through n. The objective is to produce a random permutation of the array. By a random permutation, we mean that each permutation of the array is equally likely. So a correct shuffling algorithm must generate each of the n! permutations with a probability of 1/n!.</p>
<p>The first approach that comes to mind is the naive one: generate a random number between 1 and n for each element in the array and place the element at the position indicated by that number. This can be done with an auxiliary array that receives the elements at their new positions. The pseudocode could look something like the following:<br />
<code><br />
NaiveShuffle(A[1...n], B[1...n])            //randomly permutes the elements in array A<br />
for i from 1 to n<br />
&nbsp;&nbsp;&nbsp;&nbsp;random &lt;- Random(1, n)<br />
&nbsp;&nbsp;&nbsp;&nbsp;B[random] &lt;- A[i]<br />
for i from 1 to n<br />
&nbsp;&nbsp;&nbsp;&nbsp;A[i] &lt;- B[i]<br />
</code></p>
<p>The above approach is crude: it uses an auxiliary array and makes two passes over the data instead of just one. Still, the worst case running time of the above algorithm is O(n).</p>
<p>Random(1, n) generates a random number between 1 and n (both inclusive). We assume that the random number generator generates truly random numbers in the interval specified and returns each one in O(1) time. We also assume some kind of collision resolution mechanism for the case where two elements are assigned the same position in B.</p>
<p>All that said, an O(n) algorithm is not bad at all for this purpose. There is only one problem, this algorithm is WRONG! (hah&#8230;fat chance, didn&#8217;t we name it a naive algorithm?). Why is that? This is because in every iteration, the random number can take n possible values. Since there are n such iterations, the total number of equally likely outcomes of the random draws is n.n.n.n&#8230;&#8230;.n (n times) = n^n</p>
<p>The total number of permutations possible while shuffling an array is n!. For the shuffle to be uniform, the n^n equally likely outcomes would have to be split evenly among the n! permutations, but n^n is not exactly divisible by n! for any n greater than 2. So there have to be some permutations which appear more frequently than the others (basic pigeonhole principle). Thus this naive algorithm does not generate truly random permutations.</p>
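<p>The counting argument can be checked by brute force for a small n. The sketch below is an illustration under an assumption of mine: since the collision resolution of the array-based version is left unspecified, it enumerates a closely related naive variant in which every position i is swapped with a position drawn from the whole range. The function name <code>naive_swap_outcomes</code> is hypothetical.</p>

```python
from collections import Counter
from itertools import product


def naive_swap_outcomes(n):
    """Count how often each permutation results when every position i is
    swapped with a position drawn from the whole range 0..n-1."""
    counts = Counter()
    # Enumerate all n**n equally likely sequences of random draws.
    for draws in product(range(n), repeat=n):
        items = list(range(1, n + 1))
        for i, r in enumerate(draws):
            items[i], items[r] = items[r], items[i]
        counts[tuple(items)] += 1
    return counts
```

<p>For n = 3 there are 27 draw sequences spread over 6 permutations. Since 27 is not divisible by 6, the counts cannot all be equal: three permutations turn up 4 times and the other three turn up 5 times.</p>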
<p>Let&#8217;s try something else. We generate a random priority between 1 and n^3 using the Random(1, n^3) routine and assign one to each element in the array. Then we sort the array based on these priorities.</p>
<p>e.g. if the original array is A&lt;1,2,3,4&gt; and we randomly generate the priorities P&lt;34,56,8,77&gt;, then sorting A in increasing order of the priorities in P gives the shuffled array &lt;3, 1, 2, 4&gt;.</p>
<p>We use the interval [1, n^3] to generate priorities so as to make collisions unlikely. I will not delve into the correctness of this algorithm. The running time of this shuffling by sorting algorithm depends on which sorting algorithm we use. With a comparison sort such as merge sort, it is Big-Theta(n lg n).</p>
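<p>A minimal Python sketch of shuffling by sorting follows. The function names are mine, and Python&#8217;s <code>random.randint</code> stands in for the Random(1, n^3) routine. Ties between equal priorities are simply broken by input order here, which is an assumption, since collision handling is not spelled out above.</p>

```python
import random


def order_by_priority(items, priorities):
    """Return items sorted by increasing priority (ties keep input order)."""
    paired = sorted(zip(priorities, range(len(items))))
    return [items[index] for _, index in paired]


def shuffle_by_sorting(items, rng=random):
    """Draw one priority per element from [1, n^3], then sort by priority."""
    n = len(items)
    priorities = [rng.randint(1, n ** 3) for _ in range(n)]
    return order_by_priority(items, priorities)
```

<p>With the priorities from the example, <code>order_by_priority([1, 2, 3, 4], [34, 56, 8, 77])</code> returns <code>[3, 1, 2, 4]</code>.</p>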
<p>Another approach to solve this problem is to shuffle by swapping:<br />
<code>Shuffle(A[1...n])<br />
for i from 2 to n<br />
&nbsp;&nbsp;&nbsp;&nbsp;temp &lt;- Random(1, i)<br />
&nbsp;&nbsp;&nbsp;&nbsp;swap(A[i], A[temp])</code></p>
<p>Notice that the interval for generating random numbers grows by one with every iteration. For i = 2 there are 2 choices, for i = 3 there are 3 choices, and so on up to n choices in the last iteration.</p>
<p>Hence the total number of sequences of choices this algorithm can make is: 2.3&#8230;..(n-1).n = n!</p>
<p>which is exactly equal to the total number of possible permutations. Each sequence of choices produces a different permutation, so every permutation is generated with a probability of 1/n!. Moreover, since we iterate over the array only once, this algorithm runs in O(n) time.</p>
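<p>The swap-based shuffle translates almost line for line into Python. This is a sketch with 0-based indices, so the loop runs over positions 1 to n-1 instead of 2 to n; the function name is mine, and <code>random.randint</code> (inclusive on both ends) plays the role of the random number routine.</p>

```python
import random


def shuffle_by_swapping(items, rng=random):
    """Shuffle the list in place and return it.

    Position i is swapped with a position chosen uniformly from 0..i,
    so the number of choices grows by one on every iteration.
    """
    for i in range(1, len(items)):
        j = rng.randint(0, i)  # both ends inclusive; j may equal i
        items[i], items[j] = items[j], items[i]
    return items
```

<p>In everyday Python code the standard library&#8217;s <code>random.shuffle</code> does the same job; the function above only spells out the steps.</p>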
]]></content:encoded>
			<wfw:commentRss>http://krishnamurthy.net.in/blog/2009/03/16/shuffle-shuffle/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sending bulk emails using Outlook and C#</title>
		<link>http://krishnamurthy.net.in/blog/2008/12/24/sending-bulk-emails-using-outlook-and-c/</link>
		<comments>http://krishnamurthy.net.in/blog/2008/12/24/sending-bulk-emails-using-outlook-and-c/#comments</comments>
		<pubDate>Wed, 24 Dec 2008 09:27:50 +0000</pubDate>
		<dc:creator>Krishnamurthy Koduvayur Viswanathan</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://krishnamurthy.net.in/blog/2008/12/24/sending-bulk-emails-using-outlook-and-c/</guid>
		<description><![CDATA[I have always derived pleasure writing programs that solve real world problems. This is one such problem that I was able to solve. With the holiday season on, you might want to send greetings to your numerous business contacts. If you have several contacts that you want to send personalized messages to, then you very [...]]]></description>
			<content:encoded><![CDATA[<p>I have always derived pleasure from writing programs that solve real world problems. This is one such problem that I was able to solve. With the holiday season on, you might want to send greetings to your numerous business contacts. If you have several contacts that you want to send personalized messages to, you can well imagine how much time and effort that will take.<br />
So this is what I set out to do: create an application that would send out emails to several contacts, with a personalized greeting line, but similar message body. Also, depending on the type of contact, you might want to send a different message. E.g. if it is a close colleague of yours, then you might want to send a more personalized mail rather than a one liner. Since these are personalized emails, they need to be sent from your actual email id rather than an SMTP server on your dev machine. Also, I needed this application to work for someone else who runs only MS Office on his machine. So I decided to use Microsoft Office Outlook 2007 for this task.<br />
The first thing to do was to decide the format in which I would store all the configuration information used by the application. I created two different text files, mail1.txt and mail2.txt, each with a separate email message:</p>
<p>Hello {0}<br />
Mail body</p>
<p>Where {0} is a placeholder that will be replaced by the receiver name. mail1.txt and mail2.txt have the same structure except for the mail body depending on the requirement.<br />
Next, I needed to create a list of names and the corresponding email IDs to which the mails are to be sent. Also, I needed a flag that will indicate the type of message that is to be sent, i.e. mail1 or mail2. I created a comma separated file with the following format:</p>
<p>&lt;Receiver’s name&gt;, &lt;email id&gt;, &lt;mail body to be sent, i.e. 1 or 2&gt;</p>
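<p>To make the two file formats concrete, here is a small sketch of how one line of the address file and one message template fit together. It is written in Python purely for brevity and is not the application code; the function names and the sample contact in the note below are made up for illustration.</p>

```python
def parse_recipient(line):
    """Split 'name, email, flag' into three stripped fields.

    Raises ValueError if the line does not have exactly three fields.
    """
    name, email, flag = [field.strip() for field in line.split(",")]
    return name, email, flag


def render_body(template, name):
    """Fill the {0} placeholder of a message template with the receiver's name."""
    return template.format(name)
```

<p>For a line such as <code>Asha, asha@example.com, 2</code>, the flag "2" selects mail2.txt, and its {0} placeholder is replaced by "Asha". Note that a receiver name containing a comma would break this simple split.</p>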
<p>I wrote a console application in C# that uses the Microsoft Office 2007 Primary Interop assemblies to automate sending emails to all these contacts specified. The emails get sent using the default account configured in your Outlook. The code looks something like the following. Please note that this is a quick hack which actually works and that I have not really done a lot of error handling or exception management on this because I know the conditions under which this will be used.</p>
<pre>
// Assumed using directives at the top of the file (the project also needs
// references to System.Configuration and the Outlook 2007 Primary Interop Assembly):
// using System;
// using System.Configuration;
// using System.Diagnostics;
// using System.IO;
// using Microsoft.Office.Interop.Outlook;

Microsoft.Office.Interop.Outlook.Application app = null;
Microsoft.Office.Interop.Outlook._NameSpace ns = null;
Microsoft.Office.Interop.Outlook.PostItem item = null;
Microsoft.Office.Interop.Outlook.MAPIFolder inboxFolder = null;
Microsoft.Office.Interop.Outlook.MAPIFolder subFolder = null;
Microsoft.Office.Interop.Outlook.MailItem memo = null;
Microsoft.Office.Interop.Outlook.MAPIFolder sentFolder = null;
StreamReader addressReader = null;
StreamReader contentsReader = null;
StreamWriter logWriter = null; 

try
{
addressReader = new StreamReader(ConfigurationManager.AppSettings["addresses"]);
String currentLine = String.Empty;
String[] currentReceiver = null;
String messageBodyFile = String.Empty;
logWriter = new StreamWriter(Path.Combine(Environment.CurrentDirectory, "Log.txt"), false);
while (!addressReader.EndOfStream)
{
currentLine = addressReader.ReadLine();
currentReceiver = currentLine.Split(',');
switch (currentReceiver[2])
{
case "1":
messageBodyFile = ConfigurationManager.AppSettings["contentsFile1"];
break; 

case "2":
messageBodyFile = ConfigurationManager.AppSettings["contentsFile2"];
break; 

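default:
// Unknown message flag: report the contact and skip to the next line
// instead of trying to send with no message body file.
Console.WriteLine("Could not send email to {0}", currentReceiver[0]);
logWriter.WriteLine("Could not send email to {0}", currentReceiver[0]);
continue;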
} 

#region EmailInit 

app = new Microsoft.Office.Interop.Outlook.Application();
ns = app.GetNamespace("MAPI");
ns.Logon(null, null, false, false);
sentFolder = ns.GetDefaultFolder(OlDefaultFolders.olFolderSentMail);
memo = (Microsoft.Office.Interop.Outlook.MailItem)app.CreateItem(OlItemType.olMailItem); 

#endregion 

contentsReader = new StreamReader(messageBodyFile);
memo.To = currentReceiver[1].Trim();
memo.Subject = ConfigurationManager.AppSettings["mailSubject"].Trim();
memo.Body = String.Format(contentsReader.ReadToEnd(), currentReceiver[0]);
memo.BodyFormat = OlBodyFormat.olFormatHTML;
memo.Send();
Console.WriteLine("{0}: Sent email with body {1} to {2}:{3}", DateTime.Now, currentReceiver[2], currentReceiver[0], currentReceiver[1]);
logWriter.WriteLine("{0}: Sent email with body {1} to {2}:{3}", DateTime.Now, currentReceiver[2], currentReceiver[0], currentReceiver[1]);
contentsReader.Close();
contentsReader.Dispose();
}
} 

catch (System.Exception ex)
{
Console.WriteLine(ex.ToString());
EventLog.WriteEntry("Email Automation", ex.Message, EventLogEntryType.Error);
} 

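finally
{
// Drop the Outlook references and close whatever readers/writers were opened.
// The null checks matter: if a constructor threw, the variable is still null.
ns = null;
app = null;
inboxFolder = null;
if (contentsReader != null) contentsReader.Dispose();
if (addressReader != null) addressReader.Dispose();
if (logWriter != null) logWriter.Dispose();
}</pre>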
]]></content:encoded>
			<wfw:commentRss>http://krishnamurthy.net.in/blog/2008/12/24/sending-bulk-emails-using-outlook-and-c/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Touchy feely gcc</title>
		<link>http://krishnamurthy.net.in/blog/2008/10/09/touchy-feely-gcc/</link>
		<comments>http://krishnamurthy.net.in/blog/2008/10/09/touchy-feely-gcc/#comments</comments>
		<pubDate>Thu, 09 Oct 2008 00:47:47 +0000</pubDate>
		<dc:creator>Krishnamurthy Koduvayur Viswanathan</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://krishnamurthy.net.in/blog/?p=46</guid>
		<description><![CDATA[I am writing code in C after several years. Needless to say, I am woefully out of touch and don&#8217;t remember the most basic of things. Add to that, I am writing code using a simple text editor and compiling it using gcc on the command line. Every time I see a funny error, it takes me [...]]]></description>
			<content:encoded><![CDATA[<p>I am writing code in C after several years. Needless to say, I am woefully out of touch and don&#8217;t remember the most basic of things. Add to that, I am writing code using a simple text editor and compiling it using gcc on the command line. Every time I see a funny error, it takes me a while to actually understand what is wrong. A really good IDE with awesome IntelliSense really does spoil you!</p>
<p>So I got this funny little compilation error which left me stumped:</p>
<p><em>/tmp/cckI2FzP.o:(.eh_frame+0x11): undefined reference to `__gxx_personality_v0'<br />
collect2: ld returned 1 exit status</em></p>
<p>I googled and found that this error is normally related to C++, but I was writing code in plain old C. So what was wrong? I found later that I had named my code file as List.C instead of List.c. After renaming it to List.c, all was well.</p>
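<p>Turns out that filenames in Linux are case sensitive (wonder why I did not run into that problem all these years), and that gcc treats the .C extension (capital C) as C++ source. The file was therefore compiled as C++, but linking through gcc does not pull in the C++ runtime library, which is where __gxx_personality_v0 lives.</p>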
]]></content:encoded>
			<wfw:commentRss>http://krishnamurthy.net.in/blog/2008/10/09/touchy-feely-gcc/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

