One-of-N Encoding for Nominal Variables

 

This is my Matlab implementation of one-of-N encoding.
One-of-N encoding is a very simple way of encoding classes for a machine learning method.
A class variable is a dataset attribute that can take one of several non-numeric values; each possible value becomes its own 0/1 column.
The number of classes must be known ahead of time.

function Y = OneOfNEncodingNominal(X)
% X - table of nominal variables
% Y - table in which every variable with more than 2 states is one-of-N encoded
nv = size(X, 2);   % number of variables (columns)
nc = size(X, 1);   % number of cases (rows)
Y = X;

% for each variable
for i = 1:nv
    atts = unique(table2array(X(:, i)));
    % only encode a variable that has more than 2 states
    % (a binary variable, e.g., 0 or 1, is left as it is)
    if numel(atts) > 2
        numVar = numel(atts);
        % create one new variable per possible state of the variable
        v = zeros(nc, numVar);
        % for each case
        for j = 1:nc
            % find the index of the state of the variable
            idx = find(atts == X{j, i});
            if numel(idx) == 1
                v(j, idx) = 1;
            else
                error('Error: Index error when encoding.');
            end
        end
        % remove the variable and replace it with the new variables
        removedVarName = X.Properties.VariableNames(i);
        Y(:, removedVarName) = [];

        newVars = array2table(v);
        % rename the new variables according to the removed variable
        for k = 1:numVar
            name = strcat(removedVarName, '_v', int2str(k));
            newVars.Properties.VariableNames(k) = name;
        end

        Y = [Y newVars];
    end % end if
end
end % end function OneOfNEncodingNominal
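
To see what the encoding produces, here is a minimal sketch on a single numeric-coded nominal variable. The toy vector `colors` is made up for illustration, and the sketch uses the third output of `unique` rather than the per-case loop above, so treat it as an alternative way to get the same kind of indicator matrix, not as part of the function.

```matlab
% Hypothetical toy data: one nominal variable with three states (1, 2, 3)
colors = [3; 1; 2; 3; 1];

% states holds the sorted unique values; idx(j) is the state index of case j
[states, ~, idx] = unique(colors);

n = numel(colors);
v = zeros(n, numel(states));

% put a single 1 in each row, in the column of that case's state
v(sub2ind(size(v), (1:n)', idx)) = 1;

disp(v);
```

Each row of `v` contains exactly one 1, so the original state can always be recovered from the column index.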

Stratified K-fold cross validation Matlab

My implementation of stratified K-fold cross-validation. It works much like c = cvpartition(group,'KFold',k) from the Matlab Statistics Toolbox.
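
Before the full function, a minimal sketch of the stratification idea may help. It assumes numeric class labels; the toy labels and the round-robin assignment are illustrative only and are not the block-wise method that KfoldCVBalance uses.

```matlab
% Hypothetical toy labels: 6 cases of class 1 and 4 cases of class 2
y = [1 1 1 1 1 1 2 2 2 2]';
k = 2;

partition = zeros(numel(y), 1);
classes = unique(y);
for c = 1:numel(classes)
    members = find(y == classes(c));
    % shuffle the cases within the class
    members = members(randperm(numel(members)));
    % deal the shuffled cases out to folds 1..k in turn
    partition(members) = mod((0:numel(members)-1)', k) + 1;
end

% report how many cases of each class landed in each fold
for f = 1:k
    fprintf('fold %d: %d of class 1, %d of class 2\n', f, ...
        sum(y(partition == f) == 1), sum(y(partition == f) == 2));
end
```

Because every class is split separately, each fold ends up with roughly the same class proportions as the whole dataset.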
<pre>function [X, partition] = KfoldCVBalance(X, y, k)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author: Pree Thiengburanathum
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Description:
% Ensures that the training, testing, and validation datasets have similar
% proportions of classes (e.g., 20 classes). This stratified sampling
% technique gives the analyst more control over the sampling process.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Input:
% X - dataset
% y - the class data (one label per row of X)
% k - number of folds
%
% Output:
% X - new dataset (rows shuffled; y is shuffled the same way internally)
% partition - fold index for each row of the new X
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
n = size(X, 1);
partition = zeros(n, 1);
% shuffle the dataset
idx = randperm(n);
X = X(idx, :);
y = y(idx);
% find the unique class
group = unique(y);
nGroup = numel(group);
% find the maximum number of samples per class
nmax = 0;
for i=1:nGroup
    idx = find(y == group(i));
    ni = length(idx);
    nmax = max(nmax, ni);
end
% create fold indices
foldIndices = zeros(nGroup, nmax);
for i=1:nGroup
    idx = find(y == group(i));
    foldIndices(i, 1:numel(idx)) = idx;
end
% compute fold size for each fold
foldSize = zeros(nGroup, 1);
for i=1:nGroup
    % find the number of element of the class
    numElement = numel(find(foldIndices(i,:) ~= 0));
    % calculate number of element for each fold
    foldSize(i) = floor(numElement/k);
end
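ptr = ones(nGroup, 1);
for i=1:k
    for j=1:nGroup
        % take the next foldSize(j) cases of class j; the range ends at
        % ptr+foldSize-1, since ending at ptr+foldSize overlaps the next
        % fold and can run past the last column of foldIndices
        idx = foldIndices(j, ptr(j):(ptr(j)+foldSize(j)-1));
        partition(idx) = i;
        ptr(j) = ptr(j)+foldSize(j);
    end
end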
% dump the rest of the indices into the last fold
partition(partition == 0) = k;
% check class balance for each fold
for i=1:k
    % class labels (not rows of X) of the cases assigned to fold i
    fold = y(partition == i);
    disp(['fold# ', int2str(i), ' has ', int2str( numel(fold) ) ]);
    for j=1:nGroup
        percentage = (sum(fold == group(j))/numel(fold)) * 100;
        disp(['class# ', int2str(j), ' = ', num2str(percentage), '%']);
    end
    disp(' ');
end
end % end function

 

Entropy and Probability state Matlab version


classdef ProbabilityState < handle
% Author: Pree Thiengburanathum
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This is an update of the previous MI implementation, which was
% implemented using array indexing.
% This implementation uses the Map container from Matlab to calculate
% the probability state to enhance indexing.</pre>
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


    properties
        X % discrete random vector X
        Y % discrete random vector Y
        probXMap % probability mass function of X,  p(x)
        probYMap % probability mass function of Y,  p(y)
        jointMap % joint probability mass function, p(x,y)
    end
    methods
        function obj = ProbabilityState(X, Y)
            % always start from empty maps so the object is usable
            % even when it is constructed without arguments
            obj.probXMap = containers.Map();
            obj.probYMap = containers.Map();
            obj.jointMap = containers.Map();
            if nargin > 0
                obj.X = X;
                obj.Y = Y;
                obj.calculateProbabilityX();
                obj.calculateProbabilityY();
                obj.calculateJointProbabilityMassFunction();
            end
        end % end constructor
      
        function calculateProbabilityX(obj)
            % count the occurrences of each state of X, keyed by its string form
            for i=1:numel(obj.X)
                tmpKey = sprintf('%d', obj.X(i));
                if isKey(obj.probXMap, tmpKey)
                    obj.probXMap(tmpKey) = obj.probXMap(tmpKey) + 1;
                else
                    % first occurrence: insert the key directly
                    obj.probXMap(tmpKey) = 1;
                end
            end

            % calculate the probability of each x in X
            key = keys(obj.probXMap);
            for i=1:numel(key)
                obj.probXMap(key{i}) = obj.probXMap(key{i}) / numel(obj.X);
            end
        end % end function
      
        function calculateProbabilityY(obj)
            % count the occurrences of each state of Y, keyed by its string form
            for i=1:numel(obj.Y)
                tmpKey = sprintf('%d', obj.Y(i));
                if isKey(obj.probYMap, tmpKey)
                    obj.probYMap(tmpKey) = obj.probYMap(tmpKey) + 1;
                else
                    % first occurrence: insert the key directly
                    obj.probYMap(tmpKey) = 1;
                end
            end

            % calculate the probability of each y in Y
            key = keys(obj.probYMap);
            for i=1:numel(key)
                obj.probYMap(key{i}) = obj.probYMap(key{i}) / numel(obj.Y);
            end
        end % end function


        function result = marginalProbabilityX(obj, x)
            % x is the string key of the state (e.g., '3'); 0 if never seen
            result = 0;
            if isKey(obj.probXMap, x)
                result = obj.probXMap(x);
            end
        end % end function

        function result = marginalProbabilityY(obj, y)
            % y is the string key of the state (e.g., '3'); 0 if never seen
            result = 0;
            if isKey(obj.probYMap, y)
                result = obj.probYMap(y);
            end
        end % end function

        function result = jointProbability(obj, x, y)
            % x and y are numeric states; the key has the form 'x,y'
            result = 0;
            keySet = [sprintf('%d', x), ',', sprintf('%d', y)];
            if isKey(obj.jointMap, keySet)
                result = obj.jointMap(keySet);
            end
        end % end function
      
        function entropyX = calculateEntropyX(obj)
            % H(X) = -sumX p(x)log p(x)
            entropyX = 0.0;
            key = keys(obj.probXMap);
            for i=1:numel(key)
                entropyX = entropyX - obj.probXMap(key{i}) * log2(obj.probXMap(key{i}));
            end
        end % end function
      
         function entropyY = calculateEntropyY(obj)
            % H(Y) = -sumY p(y)log p(y)
            entropyY = 0.0;
            key = keys(obj.probYMap);
            for i=1:numel(key)
                entropyY = entropyY - obj.probYMap(key{i}) * log2(obj.probYMap(key{i}));
            end
        end % end function
      
        function entropyXY = calculateJointEntropy(obj)
            % H(X,Y) = - sumx sumy p(x,y) log p(x,y)
            entropyXY = 0.0;
            key = keys(obj.jointMap);
            for i=1:numel(key)
               entropyXY = entropyXY  - obj.jointMap(key{i}) * log2(obj.jointMap(key{i}));
            end
        end % end function
      
        function displayProbXMap(obj)
            disp(keys(obj.probXMap));disp(values(obj.probXMap));
        end % end function
      
        function displayProbYMap(obj)
            disp(keys(obj.probYMap));disp(values(obj.probYMap));
        end % end function
      
        function displayJointMap(obj)
            disp(keys(obj.jointMap));disp(values(obj.jointMap));
        end % end function
      end % end method
    
      methods (Access = private)
        function calculateJointProbabilityMassFunction(obj)
            % count the frequency (occurrence) of each (x, y) pair;
            % X and Y must have the same length, so either one sets the loop size
            for i=1:numel(obj.X)
                jointKey = [sprintf('%d', obj.X(i)), ',', sprintf('%d', obj.Y(i))];
                if isKey(obj.jointMap, jointKey)
                    obj.jointMap(jointKey) = obj.jointMap(jointKey) + 1;
                else
                    % first occurrence: insert the key directly
                    obj.jointMap(jointKey) = 1;
                end
            end

            % calculate probability p(x,y)
            key = keys(obj.jointMap);
            for i=1:numel(key)
                obj.jointMap(key{i}) = obj.jointMap(key{i}) / numel(obj.X);
            end
        end % end function
      end % end method
end % end class
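
Since the class grew out of an MI (mutual information) implementation, here is a minimal standalone sketch of how the three entropies combine into I(X;Y) = H(X) + H(Y) - H(X,Y). It does not use the class: the toy vectors are made up, and the probabilities are counted with `accumarray` instead of a Map, so read it as a cross-check under those assumptions.

```matlab
% Hypothetical toy data: two discrete vectors of equal length
X = [1 1 2 2];
Y = [1 1 2 2];
n = numel(X);

% integer code (1..number of states) for every element of X and Y
[~, ~, ix] = unique(X(:));
[~, ~, iy] = unique(Y(:));

pxy = accumarray([ix iy], 1) / n;   % joint pmf p(x,y)
px  = sum(pxy, 2);                  % marginal p(x)
py  = sum(pxy, 1);                  % marginal p(y)

% entropy in bits; zero-probability cells are skipped to avoid log2(0)
H = @(p) -sum(p(p > 0) .* log2(p(p > 0)));

Hx  = H(px);
Hy  = H(py);
Hxy = H(pxy);
mi  = Hx + Hy - Hxy;                % mutual information I(X;Y)

fprintf('H(X)=%g H(Y)=%g H(X,Y)=%g I(X;Y)=%g\n', Hx, Hy, Hxy, mi);
```

With these toy vectors Y is an exact copy of X, so the mutual information equals H(X), which is 1 bit.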

Unable to find a C++ compiler in Matlab:
1. Remove the previous distribution (i.e., 2008).
2. Reinstall the SDK.
3. Then run
mex -setup

Install Windows 7 without an external CD/DVD drive on Mavericks
1. Edit the Info.plist inside the Boot Camp Assistant application.
2. Place the Mac model and ROM info into the config file.
3. Remove "Pre" from PreUSBBootSupportedModels.
4. Sign the code like the following:
cocain:~ Pree$ sudo codesign -f -s - "/Applications/Utilities/Boot Camp Assistant.app/Contents/MacOS/Boot Camp Assistant" --deep
/Applications/Utilities/Boot Camp Assistant.app/Contents/MacOS/Boot Camp Assistant: replacing existing signature
cocain:~ Pree$

Obtaining real data set for experiment and concurrency computing

Prediction, scheduling, optimization, and classification problems are becoming more complex. Relying on a high-powered machine doesn't guarantee that they will run fast enough; the run time depends on many factors, for example the data set, the algorithm, the heuristic function, and so on. Parallel computation seems to be a good approach to improving run-time efficiency. Getting the data for an experiment is a painful job. Using existing data sets from others can give you even more of a headache because of unbalanced information. Developing a web crawler to automatically gather information from heterogeneous websites is another approach that we should consider when obtaining generic information. Why do it manually or hire someone to key it in? Runni

I found my blog

It's good to get back to blogging and start writing things. It has been a year since my last post. I only just found out that my blog still exists somewhere on the internet. Let's see what I can do to get it updated somehow.

Subversion hosting

Looking for project hosting? If you need an SVN repository or Git hosting, unfuddle.com is a very neat project hosting service, and I really recommend it. Their web application is very easy to use, and it has a ticketing system, a calendar, email notifications, SSH access, and some cool project management tools. I created my account and have been working on my current compiler projects with them since last week, and I am very satisfied and impressed. Compare that to Google Code, which I have been using since last fall and still use. I like it a lot and it's a free service, but I think it only suits some types of projects. One thing that I don't like about Google project hosting is that it doesn't let me set permissions to control public read access to the project. This bothers me, because sometimes I feel I need to have just my own private access to my project.

No offense intended, but Google makes me feel that everyone's Internet activity is being watched. Almost every piece of information connected to the Internet is searchable from Google. Years ago, I understood that there was a script that we could put in our file directories to tell the Google bot not to crawl the files in those directories. Do those scripts still exist these days?

100F here in Denver


It's 100F here in Denver, and the weather is hot like mad. I am taking a day break out of town in a city called Canon. The place is not far from Denver; it took about 3 hours by car. Yeah, I expected to see rocks, mountains, and the river again. It's the same old things in Colorado, but I had a great time and enjoyed the day.