Getting started with deep learning

Ten years ago, many people thought artificial neural networks were obsolete, and they were wrong. Deep learning is one of the biggest breakthroughs in AI (I am not sure whether it is the biggest in AI history), and it has been applied successfully in several domains. Many researchers have used deep learning to improve their classification accuracy, mostly in image recognition, face detection, and similar tasks.

I have been playing around with Caffe and Theano lately, and I still can't decide which framework to use for my project.

Cross-validation on a Weka decision tree


% Pree Thiengburanathum
% File Name: DecisionTreeC4.5.m
% Last updated 5 December 2014
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Description:
% This function constructs a C4.5 decision tree model using the Weka
% library and evaluates the performance of feature selection algorithms
% using balanced k-fold cross-validation.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Input:
% inputs - pre-processed vectors of independent variables
% algoName - the name of the feature selection algorithm to validate
%
% Output:
% bestAccuRate - the best accuracy
% bestModel - the best model
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [bestAccuRate, bestModel] = DecisionTreeC4p5(inputs, algoName)
disp('Running Decision Tree C4.5...');
outFileName = strcat(algoName, '_DT_C4.5.txt');
fileID = fopen(outFileName, 'wt');
relationName = 'tpd';
bestAccuRate = 0;
bestModel = NaN;

nRun = 5;
nFea = size(inputs, 2);
nFold = 10;

for k=1:nRun
    for i=1:3:nFea
        X = inputs(:, 1:i);
        % randomly generated indices for a balanced K-fold cross-validation
        [trainIdx, testIdx] = sampling(table2array(X)', table2array(inputs(:, end))');
        totalAccuRate = [];
        for j=1:nFold
            fprintf('Run# %s\n', int2str(k));
            fprintf('Important feature = %s\n', int2str(i));
            fprintf('Balanced K-folds cross-validation fold: %s\n', int2str(j));
            trainData = X(trainIdx(j, :), :);
            trainData = [trainData inputs(trainIdx(j, :), end)];

            testData = X(testIdx(j, :), :);
            testData = [testData inputs(testIdx(j, :), end)];

            disp('Converting to Weka Java object...');
            wekaTrainObj = Matlab2weka(relationName, trainData.Properties.VariableNames, table2array(trainData));
            wekaTestObj = Matlab2weka(relationName, testData.Properties.VariableNames, table2array(testData));

            %{
            SaveARFF(strcat(algoName, '_train_', int2str(i), '.arff'), wekaTrainObj);
            SaveARFF(strcat(algoName, '_test_', int2str(i), '.arff'), wekaTestObj);
            %}

            % train the classifier (trees.J48 is Weka's C4.5 implementation)
            model = trainWekaClassifier(wekaTrainObj, 'trees.J48');

            % test the classifier model
            predicted = wekaClassify(wekaTestObj, model);

            % the actual class index values according to the test dataset
            actual = wekaTestObj.attributeToDoubleArray(wekaTestObj.numAttributes() - 1);

            accuRate = sum(actual == predicted) * (100/numel(predicted));

            corrected = find(actual == predicted);
            incorrected = find(actual ~= predicted);
            disp(['number of correctly classified instances: ', int2str(numel(corrected))]);
            disp(['number of incorrectly classified instances: ', int2str(numel(incorrected))]);
            disp(['accuracy rate = ', num2str(accuRate)]);
            disp(' ');
            totalAccuRate(end+1) = accuRate;
        end
        avgAccuRate = mean(totalAccuRate);
        disp('********************************************************');
        disp([algoName, ' with ', int2str(i), ' selected features']);
        fprintf(fileID, '%s with %s selected features\n', algoName, int2str(i));
        disp([int2str(nFold), ' folds CV accuracy rate = ', num2str(avgAccuRate), '%']);
        fprintf(fileID, '%s folds CV accuracy rate = %s%%\n', int2str(nFold), num2str(avgAccuRate));
        disp('********************************************************');

        if avgAccuRate > bestAccuRate
            disp('****found best model****');
            bestAccuRate = avgAccuRate;
            bestModel = model;
        end
    end
end
disp('Finished Decision Tree C4.5!');
fclose(fileID);
end % end function DecisionTreeC4p5
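
Here is a minimal call sketch, assuming a hypothetical preprocessed table dataTable whose last column holds the class labels, and that the Weka jar plus the Matlab2weka helpers are on the Java classpath (the jar path and the 'ReliefF' name are just placeholders):

% put Weka on the dynamic Java classpath (path is an assumption)
javaaddpath('weka.jar');
% run the CV experiment for one feature selection algorithm
[acc, model] = DecisionTreeC4p5(dataTable, 'ReliefF');
fprintf('best CV accuracy: %.2f%%\n', acc);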

One-of-N encoding for nominal variables

 

Here is an example of my MATLAB implementation of the function.
One-of-N encoding is a very simple way of encoding classes for a machine learning method.
A nominal variable can take one of several non-numeric values, and each possible value gets its own 0/1 indicator column.
The number of classes must be known ahead of time.

function Y = OneOfNEncodingNominal(X)
% Encode each nominal variable in table X that does not have exactly two
% states as a set of 0/1 indicator columns (one-of-N encoding).
nv = size(X, 2); % number of variables
nc = size(X, 1); % number of cases
Y = X;

% for each variable
for i=1:nv
    atts = unique(table2array(X(:, i)));
    % only encode variables that do not already have exactly 2 states (e.g., 0 or 1)
    if(size(atts, 1) ~= 2)
        numVar = size(atts, 1);
        % create one new indicator variable per possible state
        v = zeros(nc, numVar);
        % for each case
        for j=1:nc
            % find the index of the state of the variable
            idx = find( atts == X{j, i} );
            if(size(idx, 1) == 1)
                v(j, idx) = 1;
            else
                error('Error: Index error when encoding.');
            end
        end
        % remove the variable and replace it with the new indicator variables
        removedVarName = X(:, i).Properties.VariableNames;
        Y(:, removedVarName) = [];

        newVars = array2table(v);
        % rename the new variables according to the removed variable
        for k=1:numVar
            name = strcat(removedVarName, '_v', int2str(k));
            newVars.Properties.VariableNames(k) = name;
        end

        Y = [Y newVars];
    end % end if
end
end % end function OneOfNEncodingNominal
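
A tiny worked example of what the encoder does, assuming the function above is saved as OneOfNEncodingNominal.m (the table values here are made up):

% a nominal variable with 3 states expands into 3 indicator columns
X = table([1; 2; 3; 2], 'VariableNames', {'color'});
Y = OneOfNEncodingNominal(X);
% Y now holds color_v1, color_v2, color_v3; row 1 is [1 0 0],
% row 2 is [0 1 0], row 3 is [0 0 1], and row 4 is [0 1 0]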

Stratified K-fold cross-validation in MATLAB

My implementation of stratified K-fold cross-validation, much like c = cvpartition(group,'KFold',k) from the MATLAB Statistics Toolbox.

function [X, partition] = KfoldCVBalance(X, y, k)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author: Pree Thiengburanathum
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Description:
% Ensure that the training, testing, and validation datasets have similar
% proportions of classes (e.g., 20 classes). This stratified sampling
% technique gives the analyst more control over the sampling process.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Input:
% X - dataset
% y - the class labels
% k - number of folds
%
% Output:
% X - new dataset
% partition - fold index
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
n = size(X, 1);
partition = zeros(n, 1);
% shuffle the dataset
[~, idx] = sort(rand(1, n));
X = X(idx, :);
y = y(idx);
% find the unique class
group = unique(y);
nGroup = numel(group);
% find the maximum number of samples per class
nmax = 0;
for i=1:nGroup
    idx = find(y == group(i));
    ni = length(idx);
    nmax = max(nmax, ni);
end
% create fold indices
foldIndices = zeros(nGroup, nmax);
for i=1:nGroup
    idx = find(y == group(i));
    foldIndices(i, 1:numel(idx)) = idx;
end
% compute the per-fold quota for each class
foldSize = zeros(nGroup, 1);
for i=1:nGroup
    % find the number of elements of the class
    numElement = numel(find(foldIndices(i,:) ~= 0));
    % number of elements of this class per fold
    foldSize(i) = floor(numElement/k);
end
ptr = ones(nGroup, 1);
for i=1:k
    for j=1:nGroup
        % take the next foldSize(j) samples of class j for fold i;
        % the original range read one element past the quota, which
        % could index past the row end when k divides the class count
        idx = foldIndices(j, ptr(j) : ptr(j)+foldSize(j)-1);
        partition(idx) = i;
        ptr(j) = ptr(j) + foldSize(j);
    end
end
% dump the remaining indices into the last fold
idx = find(partition == 0);
partition(idx) = k;
% check class balance for each fold using the (shuffled) labels
for i=1:k
    foldClasses = y(partition == i);
    disp(['fold# ', int2str(i), ' has ', int2str(numel(foldClasses)), ' samples']);
    for j=1:nGroup
        percentage = (sum(foldClasses == group(j)) / numel(foldClasses)) * 100;
        disp(['class# ', int2str(j), ' = ', num2str(percentage), '%']);
    end
    disp(' ');
end
end % end function
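
A usage sketch, assuming X is an n-by-m numeric matrix and y an n-by-1 vector of numeric class labels:

[Xs, partition] = KfoldCVBalance(X, y, 10);
testMask = (partition == 1);  % rows of the shuffled Xs that land in fold 1
trainMask = ~testMask;
% the Statistics Toolbox built-in does the same job:
% c = cvpartition(y, 'KFold', 10); testMask = test(c, 1);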

 

Entropy and probability state, MATLAB version


classdef ProbabilityState < handle
% Author: Pree Thiengburanathum
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This is an update of the previous MI implementation, which used array
% indexing. This implementation uses MATLAB's Map container to hold the
% probability states, which makes the indexing cleaner.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


    properties
        X % discrete random vector X
        Y % discrete random vector Y
        probXMap % probability mass function of X,  p(x)
        probYMap % probability mass function of Y,  p(y)
        jointMap % joint probability mass function, p(x,y)
    end
    methods
        function obj = ProbabilityState(X, Y)
            if nargin > 0
                obj.X = X;
                obj.Y = Y;
                obj.probXMap = containers.Map();
                obj.probYMap = containers.Map();
                obj.jointMap = containers.Map();
            end
            obj.calculateProbabilityX();
            obj.calculateProbabilityY();
            obj.calculateJointProbabilityMassFunction();
        end % end constructor
      
        function calculateProbabilityX(obj)
            % create key and insert to the map
            for i=1:numel(obj.X) 
                tmpKey = sprintf('%d', obj.X(i));
                if( isKey(obj.probXMap, tmpKey ) )
                    obj.probXMap(tmpKey) =  obj.probXMap(tmpKey) + 1;
                else
                    newMap = containers.Map(tmpKey, 1);
                    obj.probXMap = [obj.probXMap; newMap];
                end
            end
          
            % calculate the probability of each x in X
            key = keys(obj.probXMap);
            for i=1:numel(key)
                 obj.probXMap(key{i}) = obj.probXMap(key{i}) / numel(obj.X);
            end
        end % end function
      
         function calculateProbabilityY(obj)
            % create key and insert to the map
            for i=1:numel(obj.Y) 
                tmpKey = sprintf('%d', obj.Y(i));
                if( isKey(obj.probYMap, tmpKey) )
                    obj.probYMap(tmpKey) =  obj.probYMap(tmpKey) + 1;
                else
                    newMap = containers.Map(tmpKey, 1);
                    obj.probYMap = [obj.probYMap; newMap];
                end
            end
          
            % calculate the probability of each y in Y
            key = keys(obj.probYMap);
            for i=1:numel(key)
                 obj.probYMap(key{i}) = obj.probYMap(key{i}) / numel(obj.Y);
            end
          
        end % end function


        function result = marginalProbabilityX(obj, x)
            % p(x), or 0 if x was never observed
            key = sprintf('%d', x);
            if ( isKey(obj.probXMap, key) )
                result = obj.probXMap(key);
            else
                result = 0;
            end
        end % end function

        function result = marginalProbabilityY(obj, y)
            % p(y), or 0 if y was never observed
            key = sprintf('%d', y);
            if ( isKey(obj.probYMap, key) )
                result = obj.probYMap(key);
            else
                result = 0;
            end
        end % end function
      
        function result = jointProbability(obj, x, y)
            % p(x,y), or 0 if the pair (x,y) was never observed
            keySet = strcat(sprintf('%d', x), ',', sprintf('%d', y));
            if ( isKey(obj.jointMap, keySet) )
                result = obj.jointMap(keySet);
            else
                result = 0;
            end
        end % end function
      
        function entropyX = calculateEntropyX(obj)
            % H(X) = -sumX p(x)log p(x)
            entropyX = 0.0;
            key = keys(obj.probXMap);
            for i=1:numel(key)
                entropyX = entropyX - obj.probXMap(key{i}) * log2(obj.probXMap(key{i}));
            end
        end % end function
      
         function entropyY = calculateEntropyY(obj)
            % H(Y) = -sumY p(y)log p(y)
            entropyY = 0.0;
            key = keys(obj.probYMap);
            for i=1:numel(key)
                entropyY = entropyY - obj.probYMap(key{i}) * log2(obj.probYMap(key{i}));
            end
        end % end function
      
        function entropyXY = calculateJointEntropy(obj)
            % H(X,Y) = - sumx sumy p(x,y) log p(x,y)
            entropyXY = 0.0;
            key = keys(obj.jointMap);
            for i=1:numel(key)
               entropyXY = entropyXY  - obj.jointMap(key{i}) * log2(obj.jointMap(key{i}));
            end
        end % end function
      
        function displayProbXMap(obj)
            disp(keys(obj.probXMap));disp(values(obj.probXMap));
        end % end function
      
        function displayProbYMap(obj)
            disp(keys(obj.probYMap));disp(values(obj.probYMap));
        end % end function
      
        function displayJointMap(obj)
            disp(keys(obj.jointMap));disp(values(obj.jointMap));
        end % end function
      end % end method
    
      methods (Access = private)
            function calculateJointProbabilityMassFunction(obj)
            % count frequency (occurrence) from the elements in X and Y
            for i=1:numel(obj.X) % X and Y have the same length
                jointKey = [sprintf('%d', obj.X(i)), ',', sprintf('%d', obj.Y(i))];
                if( isKey(obj.jointMap, jointKey ) )
                    obj.jointMap(jointKey) =  obj.jointMap(jointKey) + 1;
                else
                    newMap = containers.Map(jointKey, 1);
                    obj.jointMap = [obj.jointMap; newMap];
                end
            end
          
            % calculate probability p(x,y)
            key = keys(obj.jointMap);
            for i=1:numel(key)
                obj.jointMap(key{i}) = obj.jointMap(key{i}) / numel(obj.X);
            end
          
        end % end function
      end % end method
end % end class
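
A short usage sketch: with the entropies above, the mutual information this class was written for falls out directly (the vectors here are made-up discrete data):

X = [1 1 2 2 3 3];
Y = [1 2 1 2 1 2];
ps = ProbabilityState(X, Y);
% I(X;Y) = H(X) + H(Y) - H(X,Y)
MI = ps.calculateEntropyX() + ps.calculateEntropyY() - ps.calculateJointEntropy();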

Install Windows 7 without an external CD/DVD drive on OS X Mavericks

1. Edit the Info.plist inside the Boot Camp Assistant app bundle.
2. Place your Mac model and Boot ROM info into the config file.
3. Remove "Pre" from the PreUSBBootSupportedModels key.
4. Re-sign the app, like the following:

cocain:~ Pree$ sudo codesign -f -s - --deep "/Applications/Utilities/Boot Camp Assistant.app/Contents/MacOS/Boot Camp Assistant"
/Applications/Utilities/Boot Camp Assistant.app/Contents/MacOS/Boot Camp Assistant: replacing existing signature

cocain:~ Pree$

Obtaining real datasets for experiments, and concurrent computing

Prediction, scheduling, optimization, and classification problems are becoming more complex. Relying on a machine with high computational power doesn't guarantee that you can execute fast enough; it depends on many factors, such as the dataset, the algorithm, and the heuristic function. Parallel computation seems to be a good approach to the runtime-efficiency problem. Obtaining data for an experiment is a painful job, and reusing an existing dataset from someone else can cause even more headaches because of imbalanced information. Developing a web crawler to automatically gather information from heterogeneous websites is another approach worth considering when collecting generic information. Why do it manually, or hire someone to key it all in?