Crawling using Beautiful Soup 4

import requests
import re
from bs4 import BeautifulSoup

target_url = ''

def download_page(page_url):
    # Fetch the raw HTML of the page; return None on any request error
    try:
        return requests.get(page_url).text
    except Exception as e:
        print("Error: could not download the page", e)
        return None

def parse_page(page):
    # Collect every tag whose name contains the letter "t" (e.g. html, title, table, td)
    tags = []
    if not page:
        return tags
    soup = BeautifulSoup(page, 'html.parser')
    for tag in soup.find_all(re.compile("t")):
        tags.append(tag)
    return tags

def main():
    print('Basic crawling with BeautifulSoup is starting...')
    page = download_page(target_url)
    tags = parse_page(page)
    for tag in tags:
        print(tag.name)

if __name__ == '__main__':
    main()

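As a quick sanity check of what soup.find_all(re.compile("t")) actually returns, here is a minimal sketch run against a made-up inline HTML snippet (the snippet and variable names are mine, purely for illustration): passing a compiled regex to find_all filters on tag names, so every tag whose name contains the letter "t" comes back.

import re
from bs4 import BeautifulSoup

# made-up HTML snippet, only for illustrating the tag-name regex
sample = '<html><head><title>Demo</title></head><body><table><tr><td>x</td></tr></table><p>hi</p></body></html>'
soup = BeautifulSoup(sample, 'html.parser')

# find_all with a compiled regex matches against tag *names*, so any tag
# whose name contains "t" is returned: html, title, table, tr, td
print([tag.name for tag in soup.find_all(re.compile("t"))])
# ['html', 'title', 'table', 'tr', 'td']
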
Basic web-crawler process flow demo using Python

Demonstrates the basic process of crawling a static web page.
Pree Thiengburanathum
Python 3.7
import re
import requests
from urllib.parse import urlparse

target_url = ''

def get_links(page_url):
    # parse the host of the target URL (currently unused) and
    # collect every href value found on the page
    host = urlparse(page_url)
    page = download_page(page_url)
    links = extract_links(page)
    return links

def extract_links(page):
    # Return all href="..." values found in the raw HTML
    if not page:
        return []
    link_regex = re.compile('(?<=href=").*?(?=")')
    return link_regex.findall(page)

def download_page(url):
    # Fetch the raw HTML; return None on any request error
    try:
        return requests.get(url).text
    except Exception as e:
        print("Error: could not download the page", e)
        return None

def main():
    print('Basic crawler is starting...')
    links = get_links(target_url)
    for link in links:
        print(link)
    print('Program terminated successfully')

if __name__ == '__main__':
    main()
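To see what the href regex pulls out, here is a minimal, self-contained sketch (the HTML fragment and the html variable are made up for illustration): the lookbehind/lookahead pattern returns everything between href=" and the next quote, so relative links come back unresolved. Resolving them against the host parsed in get_links would be the natural next step.

import re

link_regex = re.compile('(?<=href=").*?(?=")')

# made-up HTML fragment, only for illustration
html = '<a href="https://example.com/a">A</a> <a href="/relative/b">B</a>'
print(link_regex.findall(html))
# ['https://example.com/a', '/relative/b']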


MTGO Sealed #1


MTGO Draft #1

HOUx2 AKHx1 drafting

Beginning deep learning

Ten years ago, everyone thought that artificial neural networks were obsolete, and they were wrong.
Deep learning is one of the biggest breakthroughs in AI (perhaps not the biggest in AI history), and it has been successfully applied in several domains. Many researchers have used deep learning to improve
their classification accuracy, mostly in image recognition, face detection, and similar tasks.

I have been playing around with Caffe and Theano lately. I still can't decide which framework to use for my project.

Cross-validation on a Weka decision tree

% Pree Thiengburanathum
% File Name: DecisionTreeC4p5.m
% Last updated 5 December 2014
% Description:
% This function constructs a C4.5 decision tree model using the Weka
% library and evaluates the performance of a feature selection algorithm
% using balanced k-fold cross-validation.
% Input:
%   inputs   - pre-processed table of independent variables (class label in the last column)
%   algoName - the name of the feature selection algorithm being validated
% Output:
%   bestAccuRate - the best accuracy
%   bestModel    - the best model
function [bestAccuRate, bestModel] = DecisionTreeC4p5(inputs, algoName)
disp('Running Decision Tree C4.5...');
outFileName = strcat(algoName, '_DT_C4.5.txt');
fileID = fopen(outFileName, 'wt');
relationName = 'tpd';
bestAccuRate = 0;
bestModel = NaN;

nRun = 5;
nFea = size(inputs, 2);
nFold = 10;

for k = 1:nRun
    for i = 1:3:nFea
        X = inputs(:, 1:i);
        % randomly generated indices for a balanced K-fold cross-validation
        [trainIdx, testIdx] = sampling(table2array(X)', table2array(inputs(:, end))');
        totalAccuRate = [];
        for j = 1:nFold
            fprintf('Run# %s\n', int2str(k));
            fprintf('Important feature = %s\n', int2str(i));
            fprintf('Balanced K-fold cross-validation fold: %s\n', int2str(j));
            trainData = X(trainIdx(j, :), :);
            trainData = [trainData inputs(trainIdx(j, :), end)];

            testData = X(testIdx(j, :), :);
            testData = [testData inputs(testIdx(j, :), end)];

            disp('Converting to Weka Java objects...');
            wekaTrainObj = Matlab2weka(relationName, trainData.Properties.VariableNames, table2array(trainData));
            wekaTestObj = Matlab2weka(relationName, testData.Properties.VariableNames, table2array(testData));

            SaveARFF(strcat(algoName, '_train_', int2str(i), '.arff'), wekaTrainObj);
            SaveARFF(strcat(algoName, '_test_', int2str(i), '.arff'), wekaTestObj);

            % train the C4.5 (J48) classifier
            model = trainWekaClassifier(wekaTrainObj, 'trees.J48');

            % test the classifier model
            predicted = wekaClassify(wekaTestObj, model);

            % the actual class index values according to the test dataset
            actual = wekaTestObj.attributeToDoubleArray(wekaTestObj.numAttributes - 1);

            accuRate = sum(actual == predicted)*(100/numel(predicted));

            corrected = find(actual == predicted);
            incorrected = find(actual ~= predicted);
            disp(['number of correctly classified instances: ', int2str(numel(corrected))]);
            disp(['number of incorrectly classified instances: ', int2str(numel(incorrected))]);
            disp(['accuracy rate = ', num2str(accuRate)]);
            disp(' ');
            totalAccuRate(end+1) = accuRate;
        end % end fold loop

        avgAccuRate = mean(totalAccuRate);
        disp([algoName, ' with ', int2str(i), ' selected features']);
        fprintf(fileID, '%s with %s selected features\n', algoName, int2str(i));
        disp([int2str(nFold), ' folds CV accuracy rate = ', num2str(avgAccuRate), '%']);
        fprintf(fileID, '%s folds CV accuracy rate = %s%%\n', int2str(nFold), num2str(avgAccuRate));

        if avgAccuRate > bestAccuRate
            disp('****found best model****');
            bestAccuRate = avgAccuRate;
            bestModel = model;
        end
    end % end feature loop
end % end run loop

fclose(fileID);
disp('Finished Decision Tree C4.5!');
end % end function DecisionTreeC4p5

One-of-N Encoding for Nominal Variables


This is an example of my MATLAB implementation of the function.
One-of-N encoding is a very simple way of encoding classes for a machine learning method:
a nominal (class) variable is a column that can take one of several non-numeric values,
and the encoding replaces it with one binary column per possible value.
The number of classes must be known ahead of time.

function Y = OneOfNEncodingNominal(X)
nv = size(X, 2);
nc = size(X, 1);
Y = X;

% for each variable
for i = 1:nv
    atts = unique(table2array(X(:, i)));
    % only encode a variable that has more than 2 states (a binary 0/1 variable is left as-is)
    if (size(atts, 1) ~= 2)
        numVar = size(atts, 1);
        % create new variables, one per possible state of the variable
        v = zeros(nc, numVar);
        % for each case
        for j = 1:nc
            % find the index of the state of the variable
            idx = find(atts == X{j, i});
            if (size(idx, 1) == 1)
                v(j, idx) = 1;
            else
                error('Error: Index error when encoding.');
            end
        end

        % remove the variable and replace it with the new variables
        removedVarName = X(:, i).Properties.VariableNames;
        Y(:, removedVarName) = [];

        newVars = array2table(v);
        % rename the new variables according to the removed variable
        for k = 1:numVar
            name = strcat(removedVarName, '_v', int2str(k));
            newVars.Properties.VariableNames(k) = name;
        end

        Y = [Y newVars];
    end % end if
end
end % end function OneOfNEncodingNominal
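
For comparison, here is a minimal sketch of the same one-of-N idea in Python, assuming pandas is available (the fruit/weight columns are made up for illustration). pandas.get_dummies does the column expansion in one call; unlike the MATLAB function above, it also expands two-state variables unless you restrict the columns argument.

import pandas as pd

# made-up example: one nominal column with three states plus a numeric column
df = pd.DataFrame({'fruit': ['apple', 'banana', 'apple', 'cherry'],
                   'weight': [1.0, 2.0, 1.5, 0.5]})

# one-of-N encoding: 'fruit' becomes fruit_apple, fruit_banana, fruit_cherry,
# with a 1 (True) in the column matching each row's original value
encoded = pd.get_dummies(df, columns=['fruit'])
print(encoded.columns.tolist())
# ['weight', 'fruit_apple', 'fruit_banana', 'fruit_cherry']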