crawler Category Page - PythonForBeginners.com
https://www.pythonforbeginners.com
Learn By Example
Fri, 28 Aug 2020 15:51:34 +0000

BeautifulSoup Intro
https://www.pythonforbeginners.com/beautifulsoup/python-beautifulsoup-basic
Sun, 30 Sep 2012 04:12:04 +0000

The post BeautifulSoup Intro appeared first on PythonForBeginners.com.

What is BeautifulSoup?

BeautifulSoup is a Python library for parsing HTML and XML documents. It is available from www.crummy.com.

What can it do?


On its website, the project says: "Beautiful Soup parses anything you give it, and does the tree traversal stuff for you."

You can tell it to:

"Find all the links"

"Find all the links of class externalLink"

"Find all the links whose urls match 'foo.com'"

"Find the table heading that's got bold text, then give me that text."

BeautifulSoup Example


In this example, we will try to find every link ("a" tag) in a web page.

Before we start, we have to import two modules: BeautifulSoup and urllib2.

urllib2 is used to open the URL we want.

We will use the soup.findAll method to search through the soup object for the HTML tags we want within the page.
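from BeautifulSoup import BeautifulSoup
import urllib2

# Open the URL and read the page's HTML
url = urllib2.urlopen("http://www.python.org")
content = url.read()

# Parse the HTML and collect every "a" tag
soup = BeautifulSoup(content)
links = soup.findAll("a")

# Print each link that was found
for link in links:
    print link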
Output

That will print out every element on the python.org page with an "a" tag.

(The "a" tag defines a hyperlink, which is used to link from one page to another.)
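
To make the idea of "find every a tag" concrete without downloading a page, here is a minimal sketch in Python 3 (the example above targets Python 2). It does not use BeautifulSoup; it swaps in the standard library's html.parser module, and the LinkCollector class and the SAMPLE_HTML page are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the attributes of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            self.links.append(dict(attrs))


# Hypothetical sample page, used instead of downloading python.org
SAMPLE_HTML = """
<html><body>
  <a href="http://www.python.org/about/">About</a>
  <p>No link here</p>
  <a class="externalLink" href="http://docs.python.org/">Docs</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)

# Print the attributes of each link that was found
for link in collector.links:
    print(link)
```

Each entry in collector.links is a dictionary of the tag's attributes, which plays roughly the same role as the tag objects that soup.findAll("a") returns.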

BeautifulSoup Example 2


To make it a bit more useful, we can specify which URLs we want returned. Here we keep only the links whose href contains "python".
from BeautifulSoup import BeautifulSoup
import urllib2
import re

url = urllib2.urlopen("http://www.python.org")
content = url.read()
soup = BeautifulSoup(content)

# Only consider "a" tags that actually have an href attribute
for a in soup.findAll('a', href=True):
    # Keep the links whose URL contains "python"
    if re.search('python', a['href']):
        print "Found the URL:", a['href']
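
The filtering step on its own can be tried without a network connection. The following is a small Python 3 sketch, assuming a hypothetical list of href values in place of the links pulled from the page; matching_urls is an illustrative helper, not part of any library.

```python
import re

# Hypothetical href values, standing in for a['href'] from each link found
hrefs = [
    "http://www.python.org/about/",
    "/download/",
    "http://docs.python.org/",
    "http://www.example.com/",
]


def matching_urls(urls, pattern):
    """Return the URLs in which the regular expression matches anywhere."""
    compiled = re.compile(pattern)
    return [u for u in urls if compiled.search(u)]


found = matching_urls(hrefs, "python")
for url in found:
    print("Found the URL:", url)
```

Changing the pattern argument is all it takes to look for a different site or a file extension.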
Further Reading

I recommend that you head over to http://www.crummy.com to read more about what you can do with this awesome module.

Python Code : Get all the links from a website
https://www.pythonforbeginners.com/code/regular-expression-re-findall
Sat, 22 Sep 2012 12:17:13 +0000

The post Python Code : Get all the links from a website appeared first on PythonForBeginners.com.

Overview

In this script, we are going to use the re module to get all the links from a website.

One of the most powerful functions in the re module is re.findall().

While re.search() finds only the first match for a pattern, re.findall() finds *all*
the matches and returns them as a list of strings, with each string representing one match.
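
The difference between the two functions is easy to see on a short string. This is a minimal Python 3 sketch; the text and the pattern are made up for the illustration.

```python
import re

text = "cat bat rat"

# re.search stops at the first match and returns a match object (or None)
first = re.search(r"[a-z]at", text)

# re.findall scans the whole string and returns every match as a string
all_matches = re.findall(r"[a-z]at", text)

print(first.group())
print(all_matches)
```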

Get all links from a website


This example will get all the links from a website's HTML code.

To find all the links, we will use the urllib2 module together
with the re module.
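import urllib2
import re

#example URL; replace it with the site you want to scan
url = "http://www.python.org"

#connect to a URL
website = urllib2.urlopen(url)

#read html code
html = website.read()

#use re.findall to get all the links
#(?:...) is a non-capturing group, so each match comes back as one string
links = re.findall('"((?:http|ftp)s?://.*?)"', html)

print links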
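
One detail of re.findall is worth knowing when writing a link pattern: if the pattern contains more than one capturing group, each result is a tuple of the groups rather than a single string. The Python 3 sketch below shows both behaviours on a hypothetical HTML snippet, so it runs without opening a URL.

```python
import re

# Hypothetical HTML snippet, used instead of a downloaded page
html = (
    '<a href="http://www.python.org/">Python</a> '
    '<a href="ftp://ftp.example.com/pub">Files</a> '
    '<a href="https://docs.python.org/">Docs</a>'
)

# Two capturing groups: each result is a (full_url, scheme) tuple
with_groups = re.findall(r'"((http|ftp)s?://.*?)"', html)

# Inner group made non-capturing with (?:...): each result is one string
links = re.findall(r'"((?:http|ftp)s?://.*?)"', html)

print(with_groups)
print(links)
```

With the tuple form, the full URL is the first item of each tuple; the non-capturing form gives the plain list of strings directly.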

Happy scraping!

Python Command Line IMDB Scraper
https://www.pythonforbeginners.com/code-snippets-source-code/imdb-crawler
Wed, 19 Sep 2012 11:56:15 +0000

The post Python Command Line IMDB Scraper appeared first on PythonForBeginners.com.

Overview

This script will ask for a movie title and a year and then query IMDB for it.

Command Line IMDB Scraper

The first step is to import the necessary modules.

#!/usr/bin/env python2.7

#Importing the modules
import urllib2
import re
import json

#Ask for movie title
title = raw_input("Please enter a movie title: ")

#Ask for which year
year = raw_input("which year? ")

#Search for spaces in the title string
raw_string = re.compile(r' ')

#Replace spaces with a plus sign
searchstring = raw_string.sub('+', title)

#Prints the search string
print searchstring

#The actual query
url = "http://www.imdbapi.com/?t=" + searchstring + "&y="+year

request = urllib2.Request(url)

response = json.load(urllib2.urlopen(request))

print json.dumps(response,indent=2)
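
The two steps that do the real work in the script, building the query string and decoding the JSON reply, can be sketched in Python 3 without contacting the service. This is only an illustration under stated assumptions: build_query_url is a hypothetical helper, urlencode from the standard library stands in for the regular-expression replacement used above, and sample_body (including its field names) is an invented reply rather than real output from the API.

```python
import json
from urllib.parse import urlencode


def build_query_url(title, year, base="http://www.imdbapi.com/"):
    """Build the search URL; urlencode turns spaces into plus signs."""
    return base + "?" + urlencode({"t": title, "y": year})


url = build_query_url("The Big Lebowski", "1998")
print(url)

# Hypothetical response body; the real service decides the actual fields
sample_body = '{"Title": "The Big Lebowski", "Year": "1998"}'

# json.loads decodes a string, as json.load does for a file-like response
response = json.loads(sample_body)
print(json.dumps(response, indent=2))
```

Using urlencode also escapes characters such as "&" in a title, which a plain space-to-plus replacement would leave untouched.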

Enjoy it!
