'''
Demonstrate basic HTML parsing with BeautifulSoup.
'''
import re

import requests
from bs4 import BeautifulSoup

target_url = 'https://www.poftut.com/python-main-function-use/'


def download_page(page_url):
    print(page_url)
    try:
        return requests.get(page_url).text
    except Exception as e:
        print("Error: unable to download the page", e)


def parse_page(page):
    # Collect the names of every tag whose name contains the letter "t".
    tags = []
    if not page:
        return tags
    soup = BeautifulSoup(page, 'html.parser')
    for tag in soup.find_all(re.compile("t")):
        tags.append(tag.name)
    return tags


def main():
    print('Basic parsing with BeautifulSoup is starting...')
    page = download_page(target_url)
    tags = parse_page(page)
    print(tags)


main()
'''
Demonstrate the basic process of crawling a static web page snippet.
Pree Thiengburanathum
Python 3.7
'''
import re
from urllib.parse import urlparse

import requests

target_url = 'https://www.poftut.com/python-main-function-use/'


def get_links(page_url):
    host = urlparse(page_url)  # kept for later use, e.g. resolving relative links
    page = download_page(page_url)
    links = extract_links(page)
    return links


def extract_links(page):
    if not page:
        return []
    # Capture everything between href=" and the closing quote.
    link_regex = re.compile('(?<=href=").*?(?=")')
    return link_regex.findall(page)


def download_page(url):
    print(url)
    try:
        return requests.get(url).text
    except Exception as e:
        print("Error: unable to download the page", e)


def main():
    print('Basic crawler is starting...')
    links = get_links(target_url)
    print(len(links))
    for link in links:
        print(link)
    print('Program terminated successfully')


main()
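For readers curious how the href pattern in extract_links behaves, here is a small standalone check on a hand-written HTML fragment (the fragment and its URLs are made up for illustration):

```python
import re

# Made-up HTML fragment for illustration only.
html = ('<a href="https://example.com/a">A</a>'
        '<a href="/relative/path">B</a>'
        '<img src="logo.png">')

# The same lookbehind/lookahead pattern used in extract_links():
# capture whatever sits between href=" and the closing quote.
link_regex = re.compile('(?<=href=").*?(?=")')
print(link_regex.findall(html))
# -> ['https://example.com/a', '/relative/path']
```

Note that this naive pattern only matches double-quoted href attributes; single-quoted or unquoted values, and the text href=" appearing outside a tag, would confuse it. For robust extraction an HTML parser is the better tool.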
Looking for project hosting? If you need an SVN repository or Git hosting, unfuddle.com is a very neat project hosting service that I really recommend. Their web application is very easy to use, and it includes a ticketing system, a calendar, email notifications, SSH access, and some cool project management tools. I created my account and have been working on my current compiler projects with them since last week, and I am very satisfied and impressed. Compare that to Google Code, which I have been using since last fall and still use. I like it a lot, and it is a free service, but I think it only suits certain types of projects. One thing I don't like about Google's project hosting is that it doesn't let me restrict public read access to a project. This bothers me, because sometimes I feel I need private access to my own project.
No offense intended, but Google makes me feel that everyone's Internet activity is being watched. Almost every piece of information connected to the Internet is searchable through Google. Years ago, I understood that there was a file we could place in our directories to tell the Google bot not to crawl the files inside them. Does that mechanism still exist these days?
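The mechanism referred to above is the robots.txt file, which is still in use today: a site owner places it at the web root, and well-behaved crawlers consult it before fetching pages. Python's standard library can parse these rules. A minimal sketch, assuming a hypothetical robots.txt that blocks a /private/ directory (the domain and rules are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content -- the file a site owner would place
# at the web root to keep crawlers out of chosen directories.
rules = """User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A polite crawler checks can_fetch() before downloading a page.
print(parser.can_fetch('*', 'https://example.com/private/notes.html'))  # False
print(parser.can_fetch('*', 'https://example.com/public/index.html'))   # True
```

In practice you would call parser.set_url(...) with the site's real robots.txt URL and parser.read() to fetch it over the network; note that robots.txt is advisory only and does not enforce anything against ill-behaved bots.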