Cross-Validation on a Weka Decision Tree


% Pree Thiengburanathum
% File Name: DecisionTreeC4.5.m
% Last updated 5 December 2014
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Description:
% This function constructs a C4.5 decision tree model using the Weka
% library, and evaluates the performance of feature selection algorithms
% using balanced k-fold cross-validation.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Input:
% inputs - pre-processed vectors of independent variables
% algoName - the name of the feature selection algorithm being validated
%
% Output:
% bestAccuRate - the best accuracy
% bestModel - the best model
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [bestAccuRate, bestModel] = DecisionTreeC4p5(inputs, algoName)
disp('Running Decision Tree C4.5...');
outFileName = strcat(algoName, '_DT_C4.5.txt');
fileID = fopen(outFileName, 'wt');
relationName = 'tpd';
bestAccuRate = 0;
bestModel = NaN;

nRun = 5;
nFea = size(inputs, 2);
nFold = 10;

for k = 1:nRun
    for i = 1:3:nFea
        X = inputs(:, 1:i);
        % randomly generated indices for a balanced K-fold cross-validation
        [trainIdx, testIdx] = sampling(table2array(X)', table2array(inputs(:, end))');
        totalAccuRate = [];
        for j = 1:nFold
            fprintf('Run# %s\n', int2str(k));
            fprintf('Important feature = %s\n', int2str(i));
            fprintf('Balanced K-fold cross-validation fold: %s\n', int2str(j));
            trainData = X(trainIdx(j, :), :);
            trainData = [trainData inputs(trainIdx(j, :), end)];

            testData = X(testIdx(j, :), :);
            testData = [testData inputs(testIdx(j, :), end)];

            disp('Converting to Weka Java object...');
            wekaTrainObj = Matlab2weka(relationName, trainData.Properties.VariableNames, table2array(trainData));
            wekaTestObj = Matlab2weka(relationName, testData.Properties.VariableNames, table2array(testData));

            %{
            SaveARFF(strcat(algoName, '_train_', int2str(i), '.arff'), wekaTrainObj);
            SaveARFF(strcat(algoName, '_test_', int2str(i), '.arff'), wekaTestObj);
            %}

            % train the C4.5 (J48) classifier
            model = trainWekaClassifier(wekaTrainObj, 'trees.J48');

            % test the classifier model
            predicted = wekaClassify(wekaTestObj, model);

            % the actual class index values according to the test dataset
            actual = wekaTestObj.attributeToDoubleArray(wekaTestObj.numAttributes - 1);

            accuRate = sum(actual == predicted) * (100 / numel(predicted));

            correct = find(actual == predicted);
            incorrect = find(actual ~= predicted);
            disp(['number of correctly classified instances: ', int2str(numel(correct))]);
            disp(['number of incorrectly classified instances: ', int2str(numel(incorrect))]);
            disp(['accuracy rate = ', num2str(accuRate)]);
            disp(' ');
            totalAccuRate(end+1) = accuRate;
        end
        avgAccuRate = mean(totalAccuRate);
        disp('********************************************************');
        disp([algoName, ' with ', int2str(i), ' selected features']);
        fprintf(fileID, '%s with %s selected features\n', algoName, int2str(i));
        disp([int2str(nFold), '-fold CV accuracy rate = ', num2str(avgAccuRate), '%']);
        fprintf(fileID, '%s-fold CV accuracy rate = %s%%\n', int2str(nFold), num2str(avgAccuRate));
        disp('********************************************************');

        if avgAccuRate > bestAccuRate
            disp('****found best model****');
            bestAccuRate = avgAccuRate;
            bestModel = model;
        end
    end
end
disp('Finished Decision Tree C4.5!');
fclose(fileID);
end % end function DecisionTreeC4p5
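A minimal usage sketch for the function above. The CSV file name and the algorithm label are illustrative assumptions, not part of the original script, and the helpers `sampling`, `Matlab2weka`, `trainWekaClassifier`, and `wekaClassify` must already be on the MATLAB path along with the Weka jar:

```matlab
% Hypothetical driver script: 'tpd.csv' and 'ReliefF' are placeholders.
inputs = readtable('tpd.csv');    % table with the class label in the last column
[bestAccuRate, bestModel] = DecisionTreeC4p5(inputs, 'ReliefF');
fprintf('best 10-fold CV accuracy: %.2f%%\n', bestAccuRate);
```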

One-of-N Encoding for Nominal Variables

 

This is my MATLAB implementation of the function.
One-of-N encoding is a very simple way of encoding nominal classes for a machine learning method.
A nominal variable can take one of several non-numeric values, and each value gets its own binary indicator column.
The number of classes must be known ahead of time.

function Y = OneOfNEncodingNominal(X)
nv = size(X, 2); % number of variables
nc = size(X, 1); % number of cases
Y = X;

% for each variable
for i = 1:nv
    atts = unique(table2array(X(:, i)));
    % only encode variables with more than two states;
    % binary variables (e.g., 0 or 1) are left as they are
    if size(atts, 1) ~= 2

        numVar = size(atts, 1);
        % create one new indicator variable per possible state
        v = zeros(nc, numVar);
        % for each case
        for j = 1:nc
            % find the index of the state of the variable
            idx = find(atts == X{j, i});
            if size(idx, 1) == 1
                v(j, idx) = 1;
            else
                error('Error: Index error when encoding.');
            end
        end
        % remove the variable and replace it with the new variables
        removedVarName = X(:, i).Properties.VariableNames;
        Y(:, removedVarName) = [];

        newVars = array2table(v);
        % rename the new variables after the removed variable
        for k = 1:numVar
            name = strcat(removedVarName, '_v', int2str(k));
            newVars.Properties.VariableNames(k) = name;
        end

        Y = [Y newVars];

    end % end if
end
end % end function OneOfNEncodingNominal
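A small sketch of how the encoder behaves on a toy table (assuming the function is saved as OneOfNEncodingNominal.m and the nominal states are stored as numeric codes, which the `==` comparison requires):

```matlab
color = [1; 2; 3; 2];   % nominal variable with 3 states -> gets encoded
flag  = [0; 1; 0; 1];   % binary variable -> left unchanged
X = table(color, flag);
Y = OneOfNEncodingNominal(X);
% 'color' is replaced by indicator columns color_v1, color_v2, color_v3:
% row 1 -> [1 0 0], row 2 -> [0 1 0], row 3 -> [0 0 1], row 4 -> [0 1 0]
```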

Stratified K-fold Cross-Validation in MATLAB

My implementation of stratified K-fold cross-validation, much like c = cvpartition(group,'KFold',k) from the MATLAB Statistics Toolbox.
function [X, partition] = KfoldCVBalance(X, y, k)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author: Pree Thiengburanathum
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Description:
% This function ensures that the training, testing, and validation sets
% have similar proportions of classes (e.g., 20 classes). This stratified
% sampling technique gives the analyst more control over the sampling
% process.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Input:
% X - dataset
% y - the class labels
% k - number of folds
%
% Output:
% X - new dataset
% partition - fold index
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
n = size(X, 1);
partition = zeros(n, 1);
% shuffle the dataset
[~, idx] = sort(rand(1, n));
X = X(idx, :);
y = y(idx);
% find the unique class
group = unique(y);
nGroup = numel(group);
% find the maximum number of samples in any class
nmax = 0;
for i=1:nGroup
    idx = find(y == group(i));
    ni = length(idx);
    nmax = max(nmax, ni);
end
% create fold indices
foldIndices = zeros(nGroup, nmax);
for i=1:nGroup
    idx = find(y == group(i));
    foldIndices(i, 1:numel(idx)) = idx;
end
% compute fold size for each fold
foldSize = zeros(nGroup, 1);
for i=1:nGroup
    % find the number of element of the class
    numElement = numel(find(foldIndices(i,:) ~= 0));
    % calculate number of element for each fold
    foldSize(i) = floor(numElement/k);
end
ptr = ones(nGroup, 1);
for i=1:k
    for j=1:nGroup
        idx = foldIndices(j, ptr(j) : ptr(j)+foldSize(j)-1);
        idx = idx(idx ~= 0); % guard against zero padding
        partition(idx) = i;
        ptr(j) = ptr(j)+foldSize(j);
    end
end
% dump the rest of index to the last fold
idx = find(partition == 0);
partition(idx) = k;
data = [X partition];
% check class balance for each fold
for i=1:k
    idx = find(data(:, end) == i);
    fold = y(idx);
    disp(['fold# ', int2str(i), ' has ', int2str(numel(fold)), ' samples']);
    for j=1:nGroup
        idx = find(fold == group(j));
        percentage = (numel(idx)/numel(fold)) * 100;
        disp(['class# ', int2str(j), ' = ', num2str(percentage), '%']);
    end
    disp(' ');
end
end % end function
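A quick sketch of calling the function on toy data (the class labels and sizes below are illustrative; KfoldCVBalance shuffles the rows internally, so `partition` indexes the returned, shuffled dataset):

```matlab
rng(1);                                   % reproducible shuffle
X = rand(30, 4);                          % 30 cases, 4 numeric features
y = [ones(10,1); 2*ones(10,1); 3*ones(10,1)];
[Xs, partition] = KfoldCVBalance(X, y, 5);
% partition(i) gives the fold (1..5) of row i of the shuffled Xs;
% each fold should hold roughly 10/5 = 2 samples of each class.
```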

 

Entropy and Probability State, MATLAB Version


classdef ProbabilityState < handle
% Author: Pree Thiengburanathum
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Description:
% This is an update of the previous MI implementation, which used array
% indexing. This implementation uses MATLAB's Map container to store the
% probability states and speed up indexing.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


    properties
        X % discrete random vector X
        Y % discrete random vector Y
        probXMap % probability mass function of X,  p(x)
        probYMap % probability mass function of Y,  p(y)
        jointMap % joint probability mass function, p(x,y)
    end
    methods
        function obj = ProbabilityState(X, Y)
            if nargin > 0
                obj.X = X;
                obj.Y = Y;
                obj.probXMap = containers.Map();
                obj.probYMap = containers.Map();
                obj.jointMap = containers.Map();
                obj.calculateProbabilityX();
                obj.calculateProbabilityY();
                obj.calculateJointProbabilityMassFunction();
            end
        end % end constructor
      
        function calculateProbabilityX(obj)
            % count the occurrences of each value of X
            for i=1:numel(obj.X)
                tmpKey = sprintf('%d', obj.X(i));
                if isKey(obj.probXMap, tmpKey)
                    obj.probXMap(tmpKey) = obj.probXMap(tmpKey) + 1;
                else
                    obj.probXMap(tmpKey) = 1;
                end
            end

            % calculate the probability of each x in X
            key = keys(obj.probXMap);
            for i=1:numel(key)
                 obj.probXMap(key{i}) = obj.probXMap(key{i}) / numel(obj.X);
            end
        end % end function
      
        function calculateProbabilityY(obj)
            % count the occurrences of each value of Y
            for i=1:numel(obj.Y)
                tmpKey = sprintf('%d', obj.Y(i));
                if isKey(obj.probYMap, tmpKey)
                    obj.probYMap(tmpKey) = obj.probYMap(tmpKey) + 1;
                else
                    obj.probYMap(tmpKey) = 1;
                end
            end

            % calculate the probability of each y in Y
            key = keys(obj.probYMap);
            for i=1:numel(key)
                 obj.probYMap(key{i}) = obj.probYMap(key{i}) / numel(obj.Y);
            end
        end % end function


        function result = marginalProbabilityX(obj, x)
            result = 0;
            if isKey(obj.probXMap, x)
                result = obj.probXMap(x);
            end
        end % end function

        function result = marginalProbabilityY(obj, y)
            result = 0;
            if isKey(obj.probYMap, y)
                result = obj.probYMap(y);
            end
        end % end function
      
        function result = jointProbability(obj, x, y)
            result = 0;
            keySet = sprintf('%d,%d', x, y);
            if isKey(obj.jointMap, keySet)
                result = obj.jointMap(keySet);
            end
        end % end function
      
        function entropyX = calculateEntropyX(obj)
            % H(X) = -sumX p(x)log p(x)
            entropyX = 0.0;
            key = keys(obj.probXMap);
            for i=1:numel(key)
                entropyX = entropyX - obj.probXMap(key{i}) * log2(obj.probXMap(key{i}));
            end
        end % end function
      
         function entropyY = calculateEntropyY(obj)
            % H(Y) = -sumY p(y)log p(y)
            entropyY = 0.0;
            key = keys(obj.probYMap);
            for i=1:numel(key)
                entropyY = entropyY - obj.probYMap(key{i}) * log2(obj.probYMap(key{i}));
            end
        end % end function
      
        function entropyXY = calculateJointEntropy(obj)
            % H(X,Y) = - sumx sumy p(x,y) log p(x,y)
            entropyXY = 0.0;
            key = keys(obj.jointMap);
            for i=1:numel(key)
               entropyXY = entropyXY  - obj.jointMap(key{i}) * log2(obj.jointMap(key{i}));
            end
        end % end function
      
        function displayProbXMap(obj)
            disp(keys(obj.probXMap));disp(values(obj.probXMap));
        end % end function
      
        function displayProbYMap(obj)
            disp(keys(obj.probYMap));disp(values(obj.probYMap));
        end % end function
      
        function displayJointMap(obj)
            disp(keys(obj.jointMap));disp(values(obj.jointMap));
        end % end function
      end % end method
    
      methods (Access = private)
        function calculateJointProbabilityMassFunction(obj)
            % count the co-occurrences of the pairs (x, y)
            for i=1:numel(obj.X) % X and Y have the same length
                jointKey = sprintf('%d,%d', obj.X(i), obj.Y(i));
                if isKey(obj.jointMap, jointKey)
                    obj.jointMap(jointKey) = obj.jointMap(jointKey) + 1;
                else
                    obj.jointMap(jointKey) = 1;
                end
            end

            % calculate the probability p(x,y)
            key = keys(obj.jointMap);
            for i=1:numel(key)
                obj.jointMap(key{i}) = obj.jointMap(key{i}) / numel(obj.X);
            end
        end % end function
      end % end method
end % end class
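The three entropies give mutual information directly, via I(X;Y) = H(X) + H(Y) - H(X,Y). A short sketch on toy vectors (assumes ProbabilityState.m is on the MATLAB path; the data values are illustrative):

```matlab
X = [1 1 2 2 3 3];
Y = [1 1 2 2 2 2];                      % here Y is a function of X
ps = ProbabilityState(X, Y);
hX  = ps.calculateEntropyX();           % H(X)
hY  = ps.calculateEntropyY();           % H(Y)
hXY = ps.calculateJointEntropy();       % H(X,Y)
mi  = hX + hY - hXY;                    % since Y = f(X), I(X;Y) equals H(Y)
fprintf('H(X)=%.3f H(Y)=%.3f H(X,Y)=%.3f I(X;Y)=%.3f\n', hX, hY, hXY, mi);
```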