Crawling using Beautiful Soup 4

'''
Demonstrate basic tag crawling using Beautiful Soup 4.
'''
import requests
import re
from bs4 import BeautifulSoup

target_url = 'https://www.poftut.com/python-main-function-use/'

def download_page(page_url):
	# Fetch the raw HTML of the page.
	print(page_url)
	try:
		return requests.get(page_url).text
	except requests.RequestException as e:
		print("Error: failed to download page:", e)

def parse_page(page):
	# Collect the names of every tag whose name contains the letter "t"
	# (e.g. html, title, script).
	tags = []
	if not page:
		return tags
	soup = BeautifulSoup(page, 'html.parser')
	for tag in soup.find_all(re.compile("t")):
		tags.append(tag.name)
	return tags

def main():
	print('Basic crawl with BeautifulSoup is starting...')
	page = download_page(target_url)
	tags = parse_page(page)
	print(tags)

if __name__ == '__main__':
	main()
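
The re.compile("t") filter is mostly there to show that find_all also accepts a regular expression; a more typical crawl asks for specific tags and attributes. Here is a minimal sketch along those lines, reusing the same target_url (the particular selectors are just examples):

import requests
from bs4 import BeautifulSoup

target_url = 'https://www.poftut.com/python-main-function-use/'

# Parse the page once, then query it for specific tags and attributes.
soup = BeautifulSoup(requests.get(target_url).text, 'html.parser')

# Print the page title, if the page has one.
if soup.title:
	print(soup.title.string)

# Print the href of every <a> tag that carries one.
for anchor in soup.find_all('a', href=True):
	print(anchor['href'])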

Basic web-crawler process flow demo using Python

'''
Demonstrate the basic process flow of crawling a static web page.
Pree Thiengburanathum
Python 3.7
'''
import re
import requests


target_url = 'https://www.poftut.com/python-main-function-use/'

def get_links(page_url):
	# Download the page and pull every href value out of it.
	page = download_page(page_url)
	links = extract_links(page)
	return links

def extract_links(page):
	if not page:
		return []
	# Lookbehind/lookahead pair: capture whatever sits between href=" and the next ".
	link_regex = re.compile('(?<=href=").*?(?=")')
	return link_regex.findall(page)

def download_page(url):
	# Fetch the raw HTML of the page.
	print(url)
	try:
		return requests.get(url).text
	except requests.RequestException as e:
		print("Error: failed to download page:", e)
		
def main():
	print('Basic crawler is starting...')
	links = get_links(target_url)
	print(len(links))
	for link in links:
		print(link)
	print('Program terminated successfully')

if __name__ == '__main__':
	main()
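
This demo stops after a single page. As a rough sketch of the rest of the crawler process flow, the snippet below reuses get_links() from above and walks outward breadth-first; the max_pages budget and the 'http' filter are just illustrative choices:

from collections import deque
from urllib.parse import urljoin

def crawl(start_url, max_pages=10):
	# Breadth-first crawl: visit pages level by level until the page budget runs out.
	visited = set()
	queue = deque([start_url])
	while queue and len(visited) < max_pages:
		url = queue.popleft()
		if url in visited:
			continue
		visited.add(url)
		for link in get_links(url):
			# Resolve relative hrefs against the page they came from.
			absolute = urljoin(url, link)
			if absolute.startswith('http') and absolute not in visited:
				queue.append(absolute)
	return visited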

Subversion hosting

Looking for project hosting? If you need an SVN repository or Git hosting, unfuddle.com is a very neat project hosting service that I really recommend. Their web application is very easy to use, and it also has a ticketing system, a calendar, email notifications, SSH access, and some cool project management tools. I created my account and have been working on my current compiler projects with them since last week, and I am very satisfied and impressed. Compared to Google Code, which I have been using since last fall and still use: I like it a lot, and it is a free service, but I think it only suits certain types of projects. One thing I don't like about Google's project hosting is that it doesn't let me set permissions to keep the project from being publicly readable. This bothers me, because sometimes I want purely private access to my own project.

No offense meant, but Google makes me feel that everyone's Internet activity is being watched. Almost every piece of information connected to the Internet is searchable through Google. Years ago, I understood there was a script we could put in our file directories to tell the Google bot not to crawl the files in them. Does that still exist these days?
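
The mechanism described above sounds like robots.txt, which does still exist: a plain file at the root of a site that tells well-behaved bots (including Googlebot) which directories not to crawl. As a small sketch, Python's standard library can read it for a crawler like the ones above (the URLs below are only placeholders):

from urllib.robotparser import RobotFileParser

# Placeholder site: read its robots.txt and ask whether a given URL may be fetched.
robots = RobotFileParser()
robots.set_url('https://www.example.com/robots.txt')
robots.read()

# '*' means "any crawler"; a specific user agent such as 'Googlebot' can be checked too.
print(robots.can_fetch('*', 'https://www.example.com/private/page.html'))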