crawler Category Page - PythonForBeginners.com

BeautifulSoup Intro

PFB Staff Writer — Sun, 30 Sep 2012 04:12:04 +0000

What is BeautifulSoup?


BeautifulSoup is a Python library for parsing HTML and XML documents, published at www.crummy.com.

What can it do?


On their website they write: "Beautiful Soup parses anything you give it, and does the tree traversal stuff for you."

You can tell it to:

"Find all the links"

"Find all the links of class externalLink"

"Find all the links whose urls match 'foo.com'"

"Find the table heading that's got bold text, then give me that text."

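As a rough illustration of what those requests mean, the sketch below answers the first three using only the html.parser module from Python 3's standard library, not BeautifulSoup itself. The sample HTML, the LinkCollector class, and the variable names are all made up for this illustration; BeautifulSoup does the same kind of work with far less code.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = """
<html><body>
<a href="http://foo.com/about" class="externalLink">About foo</a>
<a href="/local/page">Local page</a>
<a href="http://example.org/" class="externalLink">Example</a>
<a name="anchor-only">No href here</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the attributes of every <a> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            self.links.append(dict(attrs))


collector = LinkCollector()
collector.feed(SAMPLE_HTML)

# "Find all the links"
all_links = collector.links

# "Find all the links of class externalLink"
# (simple equality; a tag with several classes would need a split() first)
external = [a for a in all_links if a.get("class") == "externalLink"]

# "Find all the links whose urls match foo.com"
foo_links = [a for a in all_links if re.search(r"foo\.com", a.get("href", ""))]

print(len(all_links))
print([a["href"] for a in external])
print([a["href"] for a in foo_links])
```

Each query is just a filter over the collected tags, which is roughly the service a parsing library provides in one call.
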
BeautifulSoup Example


In this example, we will try to find every link (a tag) in a webpage.

Before we start, we have to import two modules (BeautifulSoup and urllib2).

urllib2 is used to open the URL we want.

We will use the soup.findAll method to search the soup object for text and HTML tags within the page.

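from BeautifulSoup import BeautifulSoup
import urllib2

# Open the URL and read the page's HTML
url = urllib2.urlopen("http://www.python.org")
content = url.read()
# Parse the HTML and collect every "a" tag
soup = BeautifulSoup(content)
links = soup.findAll("a")
# Print each link that was found
for link in links:
    print link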

Output


That will print out every "a" element found on python.org.

(The "a" tag defines a hyperlink, which is used to link from one page to another.)

BeautifulSoup Example 2


To make it a bit more useful, we can specify which URLs we want returned.

from BeautifulSoup import BeautifulSoup
import urllib2
import re

url = urllib2.urlopen("http://www.python.org")
content = url.read()
soup = BeautifulSoup(content)
# Only consider "a" tags that have an href attribute
for a in soup.findAll('a', href=True):
    # Keep the links whose URL contains "python"
    if re.search('python', a['href']):
        print "Found the URL:", a['href']

Python Code : Get all the links from a website

PFB Staff Writer — Sat, 22 Sep 2012 12:17:13 +0000

Overview


In this script, we are going to use the re module to get all links from any website. 

One of the most powerful functions in the re module is "re.findall()".

While re.search() is used to find the first match for a pattern, re.findall() finds *all*
the matches and returns them as a list of strings, with each string representing one match.
(If the pattern contains more than one group, each item is a tuple of the groups instead.)
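
The difference is easy to see on a short string. This is a small sketch in Python 3 syntax; the sample text and variable names are invented for the illustration.

```python
import re

# Made-up sample text for the illustration.
text = "python.org, docs.python.org and pypi.org"

# re.search() stops at the first match and returns a match object (or None).
first = re.search(r"[\w.]+\.org", text)
print(first.group(0))

# re.findall() scans the whole string and returns every match as a list.
every = re.findall(r"[\w.]+\.org", text)
print(every)
```

Because re.search() can return None when nothing matches, it is worth checking the result before calling .group() on real input.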

Get all links from a website


This example will get all the links from any website's HTML code.

To find all the links, we will use the urllib2 module together
with the re module.

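import urllib2
import re

#example URL; replace it with the site you want to scan
url = "http://www.python.org"

#connect to a URL
website = urllib2.urlopen(url)

#read html code
html = website.read()

#use re.findall to get all the links
#(?:...) keeps the scheme out of the results, so each match is one URL string
links = re.findall('"((?:http|ftp)s?://.*?)"', html)

print links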
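
One detail worth checking by hand is what re.findall() returns when the pattern contains groups. The sketch below runs on a made-up HTML string instead of a live page, in Python 3 syntax; the sample markup and variable names are assumptions for illustration. With two capturing groups, each result is a tuple; making the scheme group non-capturing with (?:...) yields plain strings.

```python
import re

# Hypothetical HTML snippet, invented for this illustration.
html = (
    '<a href="http://www.python.org/">Python</a> '
    '<a href="https://docs.python.org/">Docs</a> '
    '<a href="ftp://ftp.example.org/pub">Files</a> '
    '<a href="/about">About</a>'
)

# Two capturing groups: each match is a (url, scheme) tuple.
pairs = re.findall(r'"((http|ftp)s?://.*?)"', html)
print(pairs)

# Non-capturing scheme group: each match is just the URL string.
urls = re.findall(r'"((?:http|ftp)s?://.*?)"', html)
print(urls)
```

Note that the relative link /about is not picked up by either pattern, since both only match quoted strings that start with http, https, ftp, or ftps.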


Happy scraping!


Python Command Line IMDB Scraper

PFB Staff Writer — Wed, 19 Sep 2012 11:56:15 +0000

Overview

This script will ask for a movie title and a year and then query IMDB for it.

Command Line IMDB Scraper

The first step is to import the necessary modules.

#!/usr/bin/env python27

#Importing the modules (only the ones this script uses)

import urllib2
import re
import json

#Ask for movie title
title = raw_input("Please enter a movie title: ")

#Ask for which year
year = raw_input("which year? ")

#Search for spaces in the title string
raw_string = re.compile(r' ')

#Replace spaces with a plus sign
searchstring = raw_string.sub('+', title)

#Prints the search string
print searchstring

#The actual query
url = "http://www.imdbapi.com/?t=" + searchstring + "&y="+year

request = urllib2.Request(url)

response = json.load(urllib2.urlopen(request))

print json.dumps(response,indent=2)
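
A slightly more defensive way to build the query string is to let the standard library do the escaping, so that characters such as & are encoded as well as spaces. The sketch below is in Python 3 syntax and makes no network request; build_query_url is a hypothetical helper, and the sample JSON is made up purely to show how json.loads() and json.dumps() fit together.

```python
import json
from urllib.parse import urlencode

# Base URL taken from the script above.
BASE_URL = "http://www.imdbapi.com/"


def build_query_url(title, year):
    """Return the query URL with title and year safely URL-encoded."""
    # urlencode() turns spaces into '+' and escapes reserved characters.
    return BASE_URL + "?" + urlencode({"t": title, "y": year})


url = build_query_url("The Big Lebowski", "1998")
print(url)

# Made-up response body, standing in for what the server might send back.
sample_body = '{"Title": "The Big Lebowski", "Year": "1998"}'
response = json.loads(sample_body)
print(json.dumps(response, indent=2))
```

With this approach a title like "Tom & Jerry" no longer breaks the query, because the ampersand is escaped before it reaches the URL.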

Enjoy it!

