SVMlight provides the option to perform leave-one-out cross validation using the -x switch. The problem is that leave-one-out is too expensive for even moderately large datasets, and if you want to do, say, 10-fold cross validation instead, you need to manage the folds yourself. I have finished writing n-fold cross validation support for NetSVMLight. An example follows, which demonstrates how you can use cross validation to pick the values of the parameters used to train your SVM.
For my problem, the parameter I was trying to pick was the cost-factor -j. In general, the principled way of doing this is as follows:
1. Perform 10-fold (or n-fold) cross validation on the training dataset for each candidate value of the cost-factor.
2. Get the average accuracy, precision, and recall for each set of models (one model per fold).
3. Pick the value of the cost-factor that gives the best cross validation results. What counts as best is determined by your requirements, for example whether you are willing to trade off some recall for some additional precision.
4. With the selected value of the parameter (the cost-factor in this case), train a new model on the entire training dataset.
The basic idea behind n-fold cross validation is that the training dataset is divided into n folds. Each feature vector is randomly assigned to one of the n folds, and in stratified cross validation (which is what NetSVMLight does) each fold contains approximately the same proportion of positive and negative class labels as the entire dataset. This ensures a fair distribution. Then, using a pre-determined value for all parameters, a model is trained on n-1 of the n folds and tested on the remaining unseen fold, which yields a value for precision, recall and accuracy. This process is run n times in total, so that each fold is used exactly once for testing. The results of the n runs are averaged and reported as the results of the n-fold cross validation.
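To make the fold construction concrete, here is a minimal Python sketch of stratified fold assignment. It is not the NetSVMLight implementation; the function names `stratified_folds` and `train_test_splits` and the toy label list are my own assumptions, chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Return n_folds lists of example indices.

    Indices are shuffled within each class and then dealt out round-robin,
    so every fold gets roughly the same share of each label.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    folds = [[] for _ in range(n_folds)]
    position = 0  # keeps dealing where the previous class stopped
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        for index in indices:
            folds[position % n_folds].append(index)
            position += 1
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    for held_out, test in enumerate(folds):
        train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
        yield train, test


# Toy dataset (hypothetical): 20 positive and 80 negative examples.
labels = [1] * 20 + [-1] * 80
folds = stratified_folds(labels, n_folds=10)
for k, fold in enumerate(folds):
    positives = sum(1 for i in fold if labels[i] == 1)
    print(f"fold {k}: {len(fold)} examples, {positives} positive")
```

Dealing the shuffled indices round-robin per class is just one simple way to keep the class proportions close to those of the full dataset; any scheme that randomises within each class would serve the same purpose.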
Hence, in order to pick a parameter, such an n-fold cross validation is performed for each candidate value of the parameter. The value that gives the most desirable cross validation results is picked.
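The selection loop can be sketched in Python as follows. This is an illustration under stated assumptions, not NetSVMLight code: `evaluate_fold` is a hypothetical callback that would train on one split with the given cost-factor and return the confusion counts for the held-out fold, and `placeholder_evaluate_fold` returns invented counts purely so the sketch runs. The F1 score used to choose a winner is also just one possible criterion; substitute whatever trade-off between precision and recall suits your problem.

```python
from statistics import mean


def metrics(tp, fp, tn, fn):
    """Precision, recall and accuracy from the confusion counts of one fold."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


def cross_validate(cost, splits, evaluate_fold):
    """Average the per-fold metrics for one value of the cost-factor."""
    per_fold = [metrics(*evaluate_fold(cost, train, test)) for train, test in splits]
    return {name: mean(m[name] for m in per_fold) for name in per_fold[0]}


def sweep(costs, splits, evaluate_fold):
    """Map each candidate cost-factor to its averaged cross validation metrics."""
    return {cost: cross_validate(cost, splits, evaluate_fold) for cost in costs}


def f1(m):
    """One possible selection criterion: harmonic mean of precision and recall."""
    total = m["precision"] + m["recall"]
    return 2 * m["precision"] * m["recall"] / total if total else 0.0


def placeholder_evaluate_fold(cost, train, test):
    # Stand-in for "train on `train` with this cost, classify `test`".
    # Returns invented confusion counts (tp, fp, tn, fn), NOT real results.
    tp = min(10, int(4 + 4 * cost))
    fp = int(2 * cost)
    return tp, fp, 90 - fp, 10 - tp


# Hypothetical fold identifiers; a real run would pass file paths or index lists.
splits = [(f"train{k}", f"test{k}") for k in range(10)]
results = sweep([0.5, 1.0, 2.0], splits, placeholder_evaluate_fold)
best_cost = max(results, key=lambda cost: f1(results[cost]))
for cost, m in results.items():
    print(cost, {name: round(value, 3) for name, value in m.items()})
print("selected cost-factor:", best_cost)
```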
In the code above, the ConstructNFolds method first constructs each of the n folds on disk. In the first for-loop, ten different values of the parameter are tried (you can try as many as you wish). The results of each cross validation run are stored in a dictionary, keyed by the corresponding value of the parameter being tested. The foreach loop simply goes over the dictionary, saves the results to a file, and prints them to the console.
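For readers who want a picture of what constructing the folds on disk could look like, here is a small Python sketch. It does not reproduce what ConstructNFolds does internally; the function `write_fold_files`, the file naming scheme, and the toy examples are all assumptions. The only thing taken from SVMlight is its example file format, one example per line as a target followed by feature:value pairs.

```python
import os
import tempfile


def write_fold_files(lines, folds, directory):
    """Write one train/test file pair per fold and return their paths.

    `lines` are examples already in SVMlight format; `folds` is a list of
    index lists, each of which becomes the test set of one pair.
    """
    paths = []
    for k, test_indices in enumerate(folds):
        held_out = set(test_indices)
        train_path = os.path.join(directory, f"fold{k}.train")
        test_path = os.path.join(directory, f"fold{k}.test")
        with open(train_path, "w") as train, open(test_path, "w") as test:
            for i, line in enumerate(lines):
                target = test if i in held_out else train
                target.write(line.rstrip("\n") + "\n")
        paths.append((train_path, test_path))
    return paths


# Toy examples (hypothetical) in SVMlight format.
lines = [
    "+1 1:0.5 2:1.0",
    "-1 1:0.1 3:0.7",
    "-1 2:0.3",
    "+1 1:0.9",
    "-1 3:0.2",
    "-1 1:0.4 2:0.4",
]
folds = [[0, 3], [1, 4], [2, 5]]  # fixed folds, just for the demonstration
directory = tempfile.mkdtemp()
paths = write_fold_files(lines, folds, directory)
for train_path, test_path in paths:
    print(train_path, test_path)
```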
For my dataset, I tried 10 different values of the cost-factor and recorded the corresponding precision, recall and accuracy from each run of the 10-fold cross validation.
At cost = 0.75, I think there is a reasonable tradeoff between precision and recall. Hence I pick this value and train my SVM over the entire training dataset.
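If you were driving the SVMlight command-line tool directly rather than going through NetSVMLight, the final training step would amount to one svm_learn call with the chosen cost-factor. The Python sketch below only assembles that command; the file names train.dat and model.dat and the helper names are assumptions, and it expects svm_learn to be on your PATH if you actually run it.

```python
import subprocess


def svm_learn_command(cost, train_path, model_path, binary="svm_learn"):
    """Build the svm_learn invocation: options first, then example and model files."""
    return [binary, "-j", str(cost), train_path, model_path]


def train_final_model(cost, train_path, model_path):
    """Run svm_learn on the entire training set (requires SVMlight to be installed)."""
    subprocess.run(svm_learn_command(cost, train_path, model_path), check=True)


# Hypothetical file names; 0.75 is the cost-factor selected above.
command = svm_learn_command(0.75, "train.dat", "model.dat")
print(" ".join(command))
```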
The latest source and binaries can be downloaded here or checked out using SVN.
