November 5, 2016 / Pree / 0 Comments

Ten years ago, everyone thought that artificial neural networks were obsolete; they were wrong.

Deep learning is one of the biggest breakthroughs in AI (I am not sure whether it is the biggest in AI history), and it has been successfully applied in several domains. Many researchers have used deep learning to improve their classification accuracy, mostly in image recognition, face detection, and related tasks.

I have been playing around with Caffe and Theano lately. I still can't decide which framework to use for my project.

December 7, 2014 / Pree / 0 Comments

% Pree Thiengburanathum
% File Name: DecisionTreeC4.5.m
% Last updated 5 December 2014
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Description:
% This function constructs a C4.5 decision tree model using the Weka
% library and evaluates the performance of a feature selection algorithm
% using balanced k-fold cross-validation.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Input:
% inputs - pre-processed table of independent variables, with the class
%          labels in the last column
% algoName - the name of the feature selection algorithm being validated
%
% Output:
% bestAccuRate - the best accuracy
% bestModel - the best model
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [bestAccuRate, bestModel] = DecisionTreeC4p5(inputs, algoName)
disp('Running Decision Tree C4.5...');
outFileName = strcat(algoName, '_DT_C4.5.txt');
fileID = fopen(outFileName, 'wt');   % write the results to the output file
relationName = 'tpd';
bestAccuRate = 0;
bestModel = NaN;
nRun = 5;                  % number of repeated runs
nFea = size(inputs, 2);    % number of columns (features)
nFold = 10;                % number of cross-validation folds
for k = 1:nRun
    for i = 1:3:nFea
        X = inputs(:, 1:i);
        % randomly generated indices for a balanced K-fold cross-validation
        [trainIdx, testIdx] = sampling(table2array(X)', table2array(inputs(:, end))');
        totalAccuRate = [];
        for j = 1:nFold
            fprintf('Run# %s\n', int2str(k));
            fprintf('Important feature = %s\n', int2str(i));
            fprintf('Balanced K-folds cross-validation fold: %s\n', int2str(j));
            trainData = X(trainIdx(j, :), :);
            trainData = [trainData inputs(trainIdx(j, :), end)];
            testData = X(testIdx(j, :), :);
            testData = [testData inputs(testIdx(j, :), end)];
            disp('Converting to weka Java object...');
            wekaTrainObj = Matlab2weka(relationName, trainData.Properties.VariableNames, table2array(trainData));
            wekaTestObj = Matlab2weka(relationName, testData.Properties.VariableNames, table2array(testData));
            %{
            SaveARFF(strcat(algoName, '_train_', int2str(i), '.arff'), wekaTrainObj);
            SaveARFF(strcat(algoName, '_test_', int2str(i), '.arff'), wekaTestObj);
            %}
            model = trainWekaClassifier(wekaTrainObj, 'trees.J48');
            % test the classifier model
            predicted = wekaClassify(wekaTestObj, model);
            % the actual class index values according to the test dataset
            actual = wekaTestObj.attributeToDoubleArray(wekaTestObj.numAttributes - 1);
            accuRate = sum(actual == predicted) * (100 / numel(predicted));
            corrected = find(actual == predicted);
            incorrected = find(actual ~= predicted);
            disp(['number of correctly classified instances: ', int2str(numel(corrected))]);
            disp(['number of incorrectly classified instances: ', int2str(numel(incorrected))]);
            disp(['accuracy rate = ', num2str(accuRate)]);
            disp(' ');
            totalAccuRate(end+1) = accuRate;
        end
        avgAccuRate = mean(totalAccuRate);
        disp('********************************************************');
        disp([algoName, ' with ', int2str(i), ' selected features']);
        fprintf(fileID, '%s with %s selected features\n', algoName, int2str(i));
        disp([int2str(nFold), ' folds CV accuracy rate = ', num2str(avgAccuRate), '%']);
        fprintf(fileID, '%s folds CV accuracy rate = %s%%\n', int2str(nFold), num2str(avgAccuRate));
        disp('********************************************************');
        if avgAccuRate > bestAccuRate
            disp('****found best model****');
            bestAccuRate = avgAccuRate;
            bestModel = model;
        end
    end
end
disp('Finished Decision Tree C4.5!');
fclose(fileID);
end % end function DecisionTreeC4p5
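A minimal usage sketch (the file name and the feature selection algorithm name below are made-up placeholders); the helpers sampling, Matlab2weka, trainWekaClassifier, and wekaClassify must already be on the MATLAB path, with the Weka jar on the Java class path:

% Hypothetical usage: 'inputs' is a table whose columns are ordered by
% feature importance and whose last column holds the class labels.
inputs = readtable('dataset.csv');   % placeholder file name
[bestAcc, bestModel] = DecisionTreeC4p5(inputs, 'ReliefF');
fprintf('best balanced 10-fold CV accuracy: %.2f%%\n', bestAcc);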

November 28, 2014 / Pree / 0 Comments

This is an example of my MATLAB implementation of the one-of-N encoding function.

One-of-N encoding is a very simple way of encoding classes for a machine learning method.

A class (nominal) attribute is a dataset value that can take one of several non-numeric values.

The number of classes must be known ahead of time.
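
To make the mapping concrete, here is a minimal standalone sketch (separate from the implementation below; the state names are made up). Each state gets its own 0/1 indicator column:

% Standalone illustration of one-of-N encoding with made-up data.
states = {'low'; 'medium'; 'high'; 'low'};   % raw nominal values
atts = unique(states);                       % the known set of states
v = zeros(numel(states), numel(atts));
for j = 1:numel(states)
    v(j, strcmp(atts, states{j})) = 1;       % flag the matching state column
end
disp(v);   % one row per case, one indicator column per state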

function Y = OneOfNEncodingNominal(X)
nv = size(X, 2);   % number of variables
nc = size(X, 1);   % number of cases
Y = X;
% for each variable
for i = 1:nv
    atts = unique(table2array(X(:, i)));
    % we only encode a variable that has more than 2 states (e.g., not 0 or 1)
    if size(atts, 1) ~= 2
        numVar = size(atts, 1);
        % create one new variable per possible state of the variable
        v = zeros(nc, numVar);
        % for each case
        for j = 1:nc
            % find the index of the state of the variable
            idx = find(atts == X{j, i});
            if size(idx, 1) == 1
                v(j, idx) = 1;
            else
                error('Error: Index error when encoding.');
            end
        end
        % remove the variable and replace it with the new variables
        removedVarName = X(:, i).Properties.VariableNames;
        Y(:, removedVarName) = [];
        newVars = array2table(v);
        % rename the new variables according to the removed variable
        for k = 1:numVar
            name = strcat(removedVarName, '_v', int2str(k));
            newVars.Properties.VariableNames(k) = name;
        end
        Y = [Y newVars];
    end % end if
end
end % end function OneOfNEncodingNominal
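
A minimal usage sketch (the table and its values are made up); the nominal columns are stored as categorical so that the == comparison inside the function works:

% Hypothetical usage: the two-state column is left alone, while the
% three-state column is expanded into three indicator columns.
T = table(categorical({'yes'; 'no'; 'yes'}), ...
          categorical({'red'; 'green'; 'blue'}), ...
          'VariableNames', {'smoker', 'color'});
Y = OneOfNEncodingNominal(T);
disp(Y.Properties.VariableNames);   % smoker, color_v1, color_v2, color_v3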

October 30, 2014 / Pree / 0 Comments

My implementation of stratified K-fold cross-validation, pretty much like c = cvpartition(group, 'KFold', k) from the MATLAB Statistics Toolbox. Note that the function shuffles the rows, so it returns the shuffled labels alongside the shuffled dataset to keep them aligned.

function [X, y, partition] = KfoldCVBalance(X, y, k)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author: Pree Thiengburanathum
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Description:
% Ensures that the training, testing, and validation datasets have similar
% proportions of each class. This stratified sampling technique gives the
% analyst more control over the sampling process.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Input:
% X - dataset
% y - class labels
% k - number of folds
%
% Output:
% X - shuffled dataset
% y - shuffled class labels (same row permutation as X)
% partition - fold index for each row of the shuffled dataset
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
n = size(X, 1);
partition = zeros(n, 1);
% shuffle the dataset
[~, idx] = sort(rand(1, n));
X = X(idx, :);
y = y(idx);
% find the unique classes
group = unique(y);
nGroup = numel(group);
% find the maximum number of samples per class
nmax = 0;
for i = 1:nGroup
    idx = find(y == group(i));
    nmax = max(nmax, length(idx));
end
% create fold indices, one row per class, zero-padded to nmax
foldIndices = zeros(nGroup, nmax);
for i = 1:nGroup
    idx = find(y == group(i));
    foldIndices(i, 1:numel(idx)) = idx;
end
% compute the per-fold sample count for each class
foldSize = zeros(nGroup, 1);
for i = 1:nGroup
    % the number of elements of the class
    numElement = numel(find(foldIndices(i, :) ~= 0));
    % the number of elements of this class per fold
    foldSize(i) = floor(numElement/k);
end
% assign foldSize(j) samples of each class to every fold
ptr = ones(nGroup, 1);
for i = 1:k
    for j = 1:nGroup
        idx = foldIndices(j, ptr(j) : ptr(j)+foldSize(j)-1);
        partition(idx) = i;
        ptr(j) = ptr(j) + foldSize(j);
    end
end
% dump the rest of the indices into the last fold
idx = find(partition == 0);
partition(idx) = k;
% check the class balance of each fold
for i = 1:k
    idx = find(partition == i);
    foldLabels = y(idx);
    disp(['fold# ', int2str(i), ' has ', int2str(numel(foldLabels)), ' instances']);
    for j = 1:nGroup
        nClass = numel(find(foldLabels == group(j)));
        percentage = (nClass/numel(foldLabels)) * 100;
        disp(['class# ', int2str(j), ' = ', num2str(percentage), '%']);
    end
    disp(' ');
end
end % end function KfoldCVBalance
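
A minimal usage sketch with made-up data: 100 samples, 3 imbalanced classes, 10 folds. The returned fold index plays the same role as cvpartition's test indices:

% Hypothetical usage of KfoldCVBalance.
X = rand(100, 5);                                 % made-up feature matrix
y = [ones(50, 1); 2*ones(30, 1); 3*ones(20, 1)];  % made-up labels
[Xs, ys, partition] = KfoldCVBalance(X, y, 10);
for i = 1:10
    testIdx  = (partition == i);   % rows of Xs/ys that form fold i
    trainIdx = ~testIdx;           % the remaining rows form the training set
    % train on Xs(trainIdx, :), ys(trainIdx); test on Xs(testIdx, :), ys(testIdx)
end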