% Pree Thiengburanathum
% File Name: DecisionTreeC4.5.m
% Last updated 5 December 2014
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Description:
% This function constructs the C4.5 decision tree model using the Weka
% library and evaluates the performance of feature selection algorithms
% using balanced k-fold cross-validation.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Input:
% inputs   - table of pre-processed independent variables; the last
%            column holds the class label
% algoName - the name of the feature selection algorithm being validated
%
% Output:
% bestAccuRate - the best accuracy
% bestModel    - the best model
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [bestAccuRate, bestModel] = DecisionTreeC4p5(inputs, algoName)
disp('Running Decision Tree C4.5...');
outFileName = strcat(algoName, '_DT_C4.5.txt');
fileID = fopen(outFileName, 'wt'); % open the result file, not algoName itself
relationName = 'tpd';
bestAccuRate = 0;
bestModel = NaN;
nRun = 5;
nFea = size(inputs, 2) - 1; % the last column is the class label, not a feature
nFold = 10;

for k=1:nRun
    for i=1:3:nFea
        % keep the first i (most important) features
        X = inputs(:, 1:i);

        % randomly generated indices for a balanced K-fold cross-validation
        [trainIdx, testIdx] = sampling(table2array(X)', table2array(inputs(:, end))');
        totalAccuRate = [];

        for j=1:nFold
            fprintf('Run# %s\n', int2str(k));
            fprintf('Important feature = %s\n', int2str(i));
            fprintf('Balanced K-folds Cross validation fold: %s\n', int2str(j));

            % append the class label column to the selected features
            trainData = X(trainIdx(j, :), :);
            trainData = [trainData inputs(trainIdx(j, :), end)];
            testData = X(testIdx(j, :), :);
            testData = [testData inputs(testIdx(j, :), end)];

            disp('Converting to weka Java object...');
            wekaTrainObj = Matlab2weka(relationName, trainData.Properties.VariableNames, table2array(trainData));
            wekaTestObj = Matlab2weka(relationName, testData.Properties.VariableNames, table2array(testData));

            %{
            SaveARFF(strcat(algoName, '_train_', int2str(i), '.arff'), wekaTrainObj);
            SaveARFF(strcat(algoName, '_test_', int2str(i), '.arff'), wekaTestObj);
            %}

            % train the classifier model
            model = trainWekaClassifier(wekaTrainObj, 'trees.J48');

            % test the classifier model
            predicted = wekaClassify(wekaTestObj, model);

            % the actual class index value according to the test dataset
            actual = wekaTestObj.attributeToDoubleArray(wekaTestObj.numAttributes - 1);
            accuRate = sum(actual == predicted)*(100/numel(predicted));

            corrected = find(actual == predicted);
            incorrected = find(actual ~= predicted);
            disp(['number of correctly classified instances: ', int2str(numel(corrected))]);
            disp(['number of incorrectly classified instances: ', int2str(numel(incorrected))]);
            disp(['accuracy rate= ', num2str(accuRate), '%']);
            disp(' ');
            totalAccuRate(end+1) = accuRate;
        end

        avgAccuRate = mean(totalAccuRate);
        disp('********************************************************');
        disp([algoName, ' with ', int2str(i), ' selected features']);
        fprintf(fileID, '%s with %s selected features\n', algoName, int2str(i));
        disp([int2str(nFold), ' folds CV accuracy rate = ', num2str(avgAccuRate), '%']);
        fprintf(fileID, '%s folds CV accuracy rate = %s%%\n', int2str(nFold), num2str(avgAccuRate));
        disp('********************************************************');

        if avgAccuRate > bestAccuRate
            disp('****found best model****');
            bestAccuRate = avgAccuRate;
            bestModel = model;
        end
    end
end

disp('Finished Decision Tree C4.5!');
fclose(fileID);
end % end function DecisionTreeC4p5

My implementation of stratified K-fold cross-validation, similar to c = cvpartition(group,'KFold',k) from the MATLAB Statistics Toolbox.
<pre>function [X, partition] = KfoldCVBalance(X, y, k)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author: Pree Thiengburanathum
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Description:
% Ensures that the training, testing, and validation datasets have similar
% proportions of classes (e.g., 20 classes). This stratified sampling
% technique gives the analyst more control over the sampling process.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Input:
% X - dataset
% y - the class data
% k - number of folds
%
% Output:
% X         - shuffled dataset
% partition - fold index of each row of the shuffled dataset
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
n = size(X, 1);
partition = zeros(n, 1);

% shuffle the dataset
[~, idx] = sort(rand(1, n));
X = X(idx, :);
y = y(idx);

% find the unique classes
group = unique(y);
nGroup = numel(group);

% find the maximum number of samples per class
nmax = 0;
for i=1:nGroup
    idx = find(y == group(i));
    ni = length(idx);
    nmax = max(nmax, ni);
end

% collect the sample indices of each class (one row per class, zero-padded)
foldIndices = zeros(nGroup, nmax);
for i=1:nGroup
    idx = find(y == group(i));
    foldIndices(i, 1:numel(idx)) = idx;
end

% compute the per-fold sample count for each class
foldSize = zeros(nGroup, 1);
for i=1:nGroup
    % find the number of elements of the class
    numElement = numel(find(foldIndices(i,:) ~= 0));
    % calculate the number of elements for each fold
    foldSize(i) = floor(numElement/k);
end

ptr = ones(nGroup, 1);
for i=1:k
    for j=1:nGroup
        % take exactly foldSize(j) samples of class j for fold i
        idx = foldIndices(j, ptr(j):(ptr(j)+foldSize(j)-1));
        partition(idx) = i;
        ptr(j) = ptr(j)+foldSize(j);
    end
end

% assign the remaining samples to the last fold
partition(partition == 0) = k;

% check class balance for each fold (uses the shuffled labels y)
for i=1:k
    fold = y(partition == i);
    disp(['fold# ', int2str(i), ' has ', int2str(numel(fold))]);
    for j=1:nGroup
        idx = find(fold == group(j));
        percentage = (numel(idx)/numel(fold)) * 100;
        disp(['class# ', int2str(j), ' = ', num2str(percentage), '%']);
    end
    disp(' ');
end
end % end function
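
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Usage sketch (an illustration, not part of the original code).
% partitionToMasks and demoKfoldCVBalance are hypothetical helpers. The
% sketch assumes numeric class labels (here 1..3) and appends them to X as
% the last column, because KfoldCVBalance returns the shuffled X but not
% the shuffled y.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [trainMask, testMask] = partitionToMasks(partition, k)
% Turn the fold index vector into k-by-n logical masks, so that
% X(trainMask(j, :), :) and X(testMask(j, :), :) select the rows for fold j.
% This mirrors the trainIdx(j, :) / testIdx(j, :) usage in DecisionTreeC4p5,
% assuming logical masks are acceptable there.
n = numel(partition);
testMask = false(k, n);
for j=1:k
    testMask(j, :) = (partition(:)' == j);
end
% every row belongs to exactly one fold, so training rows are the complement
trainMask = ~testMask;
end

function demoKfoldCVBalance()
rng(0); % fixed seed so the shuffle is repeatable
labels = [ones(30, 1); 2*ones(20, 1); 3*ones(10, 1)];
features = rand(numel(labels), 2);
k = 5;

% pass the labels as the last column of X so they are shuffled with it
[Xs, partition] = KfoldCVBalance([features labels], labels, k);
shuffledLabels = Xs(:, end);

% rows = folds, columns = classes; a stratified split of 30/20/10 samples
% into 5 folds should put 6/4/2 samples of each class in every fold
counts = accumarray([partition shuffledLabels], 1, [k 3]);
disp(counts);

[trainMask, testMask] = partitionToMasks(partition, k);
fprintf('fold 1: %d training rows, %d test rows\n', nnz(trainMask(1, :)), nnz(testMask(1, :)));
end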