Web Scraping with BeautifulSoup

Web Scraping

“Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.”

HTML parsing is easy in Python, especially with the help of the BeautifulSoup library. In this post we will scrape a website (our own) to extract all URLs.

Getting Started

To begin with, make sure that you have the necessary modules installed. In the example below, we are using Beautiful Soup 4 and Requests on a system with Python 2.7 installed. Installing BeautifulSoup and Requests can be done with pip:


$ pip install requests

$ pip install beautifulsoup4

What is Beautiful Soup?

At the top of their website, you can read: “You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.”

Beautiful Soup Features:

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn’t take much code to write an application.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings, unless the document doesn’t specify an encoding and Beautiful Soup can’t autodetect one. Then you just have to specify the original encoding.
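
For example, when autodetection fails you can pass the original encoding yourself; the snippet below is only a small sketch (the markup string and the encoding are invented for the example):

from bs4 import BeautifulSoup

# A Latin-1 byte string; Beautiful Soup converts it to Unicode internally
markup = "<h1>Sacr\xe9 bleu!</h1>"

# Tell Beautiful Soup which encoding the document uses
soup = BeautifulSoup(markup, "html.parser", from_encoding="latin-1")

# The encoding Beautiful Soup ended up using
print(soup.original_encoding)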

Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.
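
For instance, the parser is selected with the second argument to the BeautifulSoup constructor. A small sketch of the options (lxml and html5lib have to be installed separately):

from bs4 import BeautifulSoup

markup = "<p>Some <b>bold</b> text</p>"

# Python's built-in parser -- always available
soup = BeautifulSoup(markup, "html.parser")

# Alternative strategies: lxml is fast, html5lib parses the way a web browser does
# soup = BeautifulSoup(markup, "lxml")
# soup = BeautifulSoup(markup, "html5lib")

print(soup.b.string)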

Extracting URLs from any website

Now that we know what BS4 is and have it installed on our machine, let’s see what we can do with it.


from bs4 import BeautifulSoup

import requests

# Ask the user which site to scrape (raw_input is Python 2)
url = raw_input("Enter a website to extract the URL's from: ")

# Fetch the page with Requests
r = requests.get("http://" + url)

data = r.text

# Parse the HTML into a soup object
soup = BeautifulSoup(data)

# Print the href attribute of every anchor tag
for link in soup.find_all('a'):
    print(link.get('href'))

When we run this program, it will ask us for a website to extract the URLs from:


Enter a website to extract the URL's from: www.pythonforbeginners.com
Learn Python By Example
https://www.pythonforbeginners.com/python-overview-start-here/
https://www.pythonforbeginners.com/dictionary/
https://www.pythonforbeginners.com/python-functions-cheat-sheet/
Lists
https://www.pythonforbeginners.com/loops/
https://www.pythonforbeginners.com/python-modules/
https://www.pythonforbeginners.com/strings/
https://www.pythonforbeginners.com/sitemap/
https://www.pythonforbeginners.com/feed/
Learn Python By Example
.... .... ....

I recommend that you read our introduction article, Beautiful Soup 4 Python, to gain more knowledge and understanding of Beautiful Soup.

More Reading

http://www.crummy.com/software/BeautifulSoup/

http://docs.python-requests.org/en/latest/index.html

Beautiful Soup 4 Python

Overview

This article is an introduction to BeautifulSoup 4 in Python. If you want to know more, I recommend reading the official documentation.

What is Beautiful Soup?

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

BeautifulSoup 3 or 4?

Beautiful Soup 3 has been replaced by Beautiful Soup 4. Beautiful Soup 3 only works on Python 2.x, but Beautiful Soup 4 also works on Python 3.x. Beautiful Soup 4 is faster, has more features, and works with third-party parsers like lxml and html5lib. You should use Beautiful Soup 4 for all new projects.

Installing Beautiful Soup

If you run Debian or Ubuntu, you can install Beautiful Soup with the system package manager

apt-get install python-bs4

Beautiful Soup 4 is published through PyPi, so if you can’t install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3.

easy_install beautifulsoup4

pip install beautifulsoup4

If you don’t have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py:

python setup.py install

BeautifulSoup Usage

Right after the installation you can start using BeautifulSoup. At the beginning of your Python script, import the library. Now you have to pass something to BeautifulSoup to create a soup object. That could be a document or a URL. BeautifulSoup does not fetch the web page for you; you have to do that yourself. That’s why I use urllib2 in combination with the BeautifulSoup library.
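
A minimal sketch of that pattern (BeautifulSoup only parses; urllib2 does the fetching):

from bs4 import BeautifulSoup
import urllib2

# Fetch the page ourselves -- BeautifulSoup never downloads anything
content = urllib2.urlopen("https://www.pythonforbeginners.com").read()

# Create the soup object from the downloaded document
soup = BeautifulSoup(content)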

Filtering

There are several different filters you can use with the search API. Below I will show some examples of how you can pass those filters into methods such as find_all. You can filter based on a tag’s name, on its attributes, on the text of a string, or on some combination of these.

A string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the ‘b’ tags in the document (you can replace b with any tag you want to find)

soup.find_all('b')

If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.

A regular expression

If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its match() method. This code finds all the tags whose names start with the letter “b”, in this case, the ‘body’ tag and the ‘b’ tag:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

This code finds all the tags whose names contain the letter “t”:

for tag in soup.find_all(re.compile("t")):
    print(tag.name)

A list

If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the ‘a’ tags and all the ‘b’ tags

print soup.find_all(["a", "b"])

True

The value True matches everything it can. This code finds all the tags in the document, but none of the text strings:

for tag in soup.find_all(True):
    print(tag.name)

A function

If none of the other matches work for you, define a function that takes an element as its only argument. Please see the official documentation if you want to do that.
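
As a quick sketch of the idea (not from the original article): the function should return True for the tags you want to keep. The example below keeps tags that have a class attribute but no id attribute:

def has_class_but_no_id(tag):
    # Keep tags that define a class attribute but no id attribute
    return tag.has_attr('class') and not tag.has_attr('id')

for tag in soup.find_all(has_class_but_no_id):
    print(tag.name)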

BeautifulSoup Object

As an example, we’ll use the very website you are currently on (https://www.pythonforbeginners.com). To parse the data from the content, we simply create a BeautifulSoup object for it. That creates a soup object from the content of the URL we passed in. From this point, we can use the Beautiful Soup methods on that soup object. We can use the prettify method to turn a BS parse tree into a nicely formatted Unicode string.

The find_all method

The find_all method is one of the most common methods in BeautifulSoup. It looks through a tag’s descendants and retrieves all descendants that match your filters.

soup.find_all("title")

soup.find_all("p", "title")

soup.find_all("a")

soup.find_all(id="link2")

Let’s see some examples on how to use BS 4

from bs4 import BeautifulSoup
import urllib2

url = "https://www.pythonforbeginners.com"

content = urllib2.urlopen(url).read()

soup = BeautifulSoup(content)

print soup.prettify()

print soup.title
>> <title>Python For Beginners</title>

print soup.title.string
>> Python For Beginners

print soup.p
print soup.a

Navigating the Parse Tree

If you want to know how to navigate the tree please see the official documentation . There you can read about the following things:

Going down

  • Navigating using tag names
  • .contents and .children
  • .descendants
  • .string .strings and stripped_strings

Going up

  • .parent
  • .parents

Going sideways

  • .next_sibling and .previous_sibling
  • .next_siblings and .previous_siblings

Going back and forth

  • .next_element and .previous_element
  • .next_elements and .previous_elements

Extracting all the URLs found within a page’s ‘a’ tags

One common task is extracting all the URLs found within a page’s ‘a’ tags. Using the find_all method gives us a whole list of elements with the tag “a”.

for link in soup.find_all('a'):
    print(link.get('href'))
Output:
https://www.pythonforbeginners.com
https://www.pythonforbeginners.com/python-overview-start-here/
https://www.pythonforbeginners.com/dictionary/
https://www.pythonforbeginners.com/python-functions-cheat-sheet/
https://www.pythonforbeginners.com/lists/python-lists-cheat-sheet/
https://www.pythonforbeginners.com/loops/
https://www.pythonforbeginners.com/python-modules/
https://www.pythonforbeginners.com/strings/
https://www.pythonforbeginners.com/sitemap/
...
...

Extracting all the text from a page

Another common task is extracting all the text from a page:

print(soup.get_text())
Output:
Python For Beginners
Python Basics
Dictionary
Functions
Lists
Loops
Modules
Strings
Sitemap
...
...

Get all links from Reddit

As a last example, let’s grab all the links from Reddit

from bs4 import BeautifulSoup
import urllib2

redditFile = urllib2.urlopen("http://www.reddit.com")
redditHtml = redditFile.read()
redditFile.close()

soup = BeautifulSoup(redditHtml)
redditAll = soup.find_all("a")
for links in redditAll:
    print(links.get('href'))
Output:
#content
http://www.reddit.com/r/AdviceAnimals/
http://www.reddit.com/r/announcements/
http://www.reddit.com/r/AskReddit/
http://www.reddit.com/r/atheism/
http://www.reddit.com/r/aww/
http://www.reddit.com/r/bestof/
http://www.reddit.com/r/blog/
http://www.reddit.com/r/funny/
http://www.reddit.com/r/gaming/
http://www.reddit.com/r/IAmA/
http://www.reddit.com/r/movies/
http://www.reddit.com/r/Music/
http://www.reddit.com/r/pics/
http://www.reddit.com/r/politics/
...

For more information, please see the official documentation.

Using Feedparser in Python

Overview

In this post we will take a look at how we can download and parse syndicated
feeds with Python.

The Python module we will use for that is “Feedparser”.

The complete documentation can be found here.

What is RSS?

RSS stands for Rich Site Summary and uses standard web feed formats to publish
frequently updated information: blog entries, news headlines, audio, video.

An RSS document (called “feed”, “web feed”, or “channel”) includes full or
summarized text, and metadata, like publishing date and author’s name. [source]

What is Feedparser?

Feedparser is a Python library that parses feeds in all known formats, including
Atom, RSS, and RDF. It runs on Python 2.4 all the way up to 3.3. [source]

RSS Elements

Before we install the feedparser module and start to code, let’s take a look
at some of the available RSS elements.

The most commonly used elements in RSS feeds are “title”, “link”, “description”,
“publication date”, and “entry ID”.

The less commonly used elements are “image”, “categories”, “enclosures”
and “cloud”.

Install Feedparser

To install feedparser on your computer, open your terminal and install it using
“pip” (a tool for installing and managing Python packages):

sudo pip install feedparser

To verify that feedparser is installed, you can run a “pip list”.

You can of course also enter the interactive mode, and import the feedparser
module there.

If you see an output like below, you can be sure it’s installed.

>>> import feedparser
>>>

Now that we have installed the feedparser module, we can go ahead and begin
to work with it.

Getting the RSS feed

You can use any RSS feed that you want. Since I like to read Reddit, I will use
that for my example.

Reddit is made up of many sub-reddits; the one I am particularly interested in for
now is the “Python” sub-reddit.

The way to get the RSS feed is just to look up the URL of that sub-reddit and
add “.rss” to it.

The RSS feed that we need for the python sub-reddit would be:
http://www.reddit.com/r/python/.rss

Using Feedparser

You start your program with importing the feedparser module.

import feedparser

Create the feed. Put in the RSS feed that you want.

d = feedparser.parse('http://www.reddit.com/r/python/.rss')

The channel elements are available in d.feed (Remember the “RSS Elements” above)

The items are available in d.entries, which is a list.

You access items in the list in the same order in which they appear in the
original feed, so the first item is available in d.entries[0].

Print the title of the feed

print d['feed']['title']

>>> Python

Print the feed link (feedparser resolves relative links)

print d['feed']['link']

>>> http://www.reddit.com/r/Python/

Print the feed subtitle (feedparser parses escaped HTML)

print d.feed.subtitle

>>> news about the dynamic, interpreted, interactive, object-oriented, extensible
programming language Python

See number of entries

print len(d['entries'])

>>> 25

Each entry in the feed is a dictionary. Use [0] to print the first entry.

print d['entries'][0]['title'] 

>>> Functional Python made easy with a new library: Funcy

Print the first entry’s link

print d.entries[0]['link'] 

>>> http://www.reddit.com/r/Python/comments/1oej74/functional_python_made_easy_with_a_new_
library/

Use a for loop to print all posts and their links.

for post in d.entries:
    print post.title + ": " + post.link + "\n"

>>>
Functional Python made easy with a new library: Funcy: http://www.reddit.com/r/Python/
comments/1oej74/functional_python_made_easy_with_a_new_
library/

Python Packages Open Sourced: http://www.reddit.com/r/Python/comments/1od7nn/
python_packages_open_sourced/

PyEDA 0.15.0 Released: http://www.reddit.com/r/Python/comments/1oet5m/
pyeda_0150_released/

PyMongo 2.6.3 Released: http://www.reddit.com/r/Python/comments/1ocryg/
pymongo_263_released/
.....
.......
........

Reports the feed type and version

print d.version      

>>> rss20

Full access to all HTTP headers

print d.headers          	

>>> 
{'content-length': '5393', 'content-encoding': 'gzip', 'vary': 'accept-encoding', 'server':
"'; DROP TABLE servertypes; --", 'connection': 'close', 'date': 'Mon, 14 Oct 2013 09:13:34
GMT', 'content-type': 'text/xml; charset=UTF-8'}

Just get the content-type from the header

print d.headers.get('content-type')

>>> text/xml; charset=UTF-8

Using feedparser is an easy and fun way to parse RSS feeds.
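
To tie the pieces together, here is a small end-to-end sketch built only from the calls shown above (using the same Python sub-reddit feed):

import feedparser

# Parse the Python sub-reddit feed
d = feedparser.parse('http://www.reddit.com/r/python/.rss')

# Channel elements live in d.feed, entries in d.entries
print d.feed.title
print d.feed.link
print len(d.entries), "entries"

# Print every post and its link
for post in d.entries:
    print post.title + ": " + post.link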

Sources

http://www.slideshare.net/LindseySmith1/feedparser
http://code.google.com/p/feedparser/

Scraping Wunderground

Overview

Working with APIs is both fun and educational.

Many companies, like Google, Reddit and Twitter, release their APIs to the public
so that developers can develop products that are powered by their services.

Working with APIs teaches you the nuts and bolts beneath the hood.

In this post, we will work the Weather Underground API.

Weather Underground (Wunderground)

We will build an app that connects to ‘Wunderground‘ and retrieves weather
forecasts.

Wunderground provides local & long range Weather Forecast, weather reports,
maps & tropical weather conditions for locations worldwide.

API

An API is a protocol intended to be used as an interface by software components
to communicate with each other. An API is a set of programming instructions and
standards for accessing web based software applications (such as above).

With APIs, applications talk to each other without any user knowledge or
intervention.

Getting Started

The first thing we need to do when we want to use an API is to see if the
company provides any API documentation. Since we want to write an application for
Wunderground, we will go to Wunderground’s website.

At the bottom of the page, you should see the “Weather API for Developers”.

The API Documentation

Most of the API features require an API key, so let’s go ahead and sign up for
a key before we start to use the Weather API.

In the documentation we can also read that the API requests are made over HTTP
and that Data features return JSON or XML.

To read the full API documentation, see this link.

Before we get the key, we need to first create a free account.

The API Key

Next step is to sign up for the API key. Just fill in your name, email address,
project name and website and you should be ready to go.

Many services on the Internet (such as Twitter, Facebook..) require that you
have an “API Key”.

An application programming interface key (API key) is a code passed in by
computer programs calling an API to identify the calling program, its developer,
or its user to the Web site.

API keys are used to track and control how the API is being used, for example
to prevent malicious use or abuse of the API.

The API key often acts as both a unique identifier and a secret token for
authentication, and will generally have a set of access rights on the API
associated with it.

Current Conditions in US City

Wunderground provides an example for us in their API documentation.

Current Conditions in US City

http://api.wunderground.com/api/0def10027afaebb7/conditions/q/CA/San_Francisco.json

If you click on the “Show response” button or copy and paste that URL into your
browser, you should see something similar to this:

{
	"response": {
		"version": "0.1"
		,"termsofService": "http://www.wunderground.com/weather/api/d/terms.html"
		,"features": {
		"conditions": 1
		}
	}
		,	"current_observation": {
		"image": {
		"url":"http://icons-ak.wxug.com/graphics/wu2/logo_130x80.png",
		"title":"Weather Underground",
		"link":"http://www.wunderground.com"
		},
		"display_location": {
		"full":"San Francisco, CA",
		"city":"San Francisco",
		"state":"CA",
		"state_name":"California",
		"country":"US",
		"country_iso3166":"US",
		"zip":"94101",
		"magic":"1",
		"wmo":"99999",
		"latitude":"37.77500916",
		"longitude":"-122.41825867",
		"elevation":"47.00000000"
		},
		.....

Current Conditions in Cedar Rapids

On the “Code Samples” page we can see the whole Python code to retrieve the
current temperature in Cedar Rapids.

Copy and paste this into your favorite editor and save it as anything you like.

Note, that you have to replace “0def10027afaebb7” with your own API key.

import urllib2
import json
f = urllib2.urlopen('http://api.wunderground.com/api/0def10027afaebb7/geolookup/conditions/q/IA/Cedar_Rapids.json')
json_string = f.read()

parsed_json = json.loads(json_string)

location = parsed_json['location']['city']

temp_f = parsed_json['current_observation']['temp_f']

print "Current temperature in %s is: %s" % (location, temp_f)

f.close()

To run the program in your terminal:

python get_current_temp.py

Your program will return the current temperature in Cedar Rapids:

Current temperature in Cedar Rapids is: 68.9

What is next?

Now that we have looked at and tested the examples provided by Wunderground,
let’s create a program by ourselves.

The Weather Underground provides us with a whole bunch of “Data Features” that
we can use.

It is important that you read through the information there, to understand how
the different features can be accessed.

Standard Request URL Format

“Most API features can be accessed using the following format.

Note that several features can be combined into a single request.”

http://api.wunderground.com/api/0def10027afaebb7/features/settings/q/query.format

where:

  • 0def10027afaebb7: your API key
  • features: one or more of the following data features
  • settings (optional): example: lang:FR/pws:0
  • query: the location for which you want weather information
  • format: json or xml
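
As a rough illustration of that format (the key below is just a placeholder, not a working one), the request URL can be assembled from those parts:

# Build a Wunderground request URL from its parts (placeholder API key)
api_key = "your_api_key"
feature = "forecast"
query = "France/Paris"
fmt = "json"

url = "http://api.wunderground.com/api/%s/%s/q/%s.%s" % (api_key, feature, query, fmt)

print url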

What I want to do is to retrieve the forecast for Paris.

The forecast feature returns a summary of the weather for the next 3 days.

This includes high and low temperatures, a string text forecast and the conditions.

Forecast for Paris

To retrieve the forecast for Paris, I will first have to find out the country
code for France, which I can find here:

Weather by country

Next step is to look for the “Feature: forecast” in the API documentation.

The string that we need can be found here:

http://www.wunderground.com/weather/api/d/docs?d=data/forecast

By reading the documentation, we should be able to construct a URL.

Making the API call

We now have the URL that we need and we can start with our program.

Now it’s time to make the API call to Weather Underground.

Note: Instead of using the urllib2 module as we did in the examples above,
we will in this program use the “requests” module.

Making the API call is very easy with the “requests” module.

r = requests.get("http://api.wunderground.com/api/your_api_key/forecast/q/France/Paris.json")

Now, we have a Response object called “r”. We can get all the information we need
from this object.

Creating our Application

Open your editor of choice, at the first line, import the requests module.

Note, the requests module comes with a built-in JSON decoder, which we can use
for the JSON data. That also means, that we don’t have to import the JSON
module (like we did in the previous example when we used the urllib2 module)

import requests

To begin extracting the information that we need, we first have to see
what keys the “r” object returns to us.

The code below will return the keys and should return [u’response’, u’forecast’]

import requests

r = requests.get("http://api.wunderground.com/api/your_api_key/forecast/q/France/Paris.json")

data = r.json()

print data.keys()

Getting the data that we want

Copy and paste the URL (from above) into a JSON editor.

I use http://jsoneditoronline.org/ but any JSON editor should do the work.

This will show an easier overview of all the data.

http://api.wunderground.com/api/your_api_key/forecast/q/France/Paris.json

Note, the same information can be gained via the terminal, by typing:

r = requests.get("http://api.wunderground.com/api/your_api_key/forecast/q/France/Paris.json")
print r.text

After inspecting the output given to us, we can see that the data that we are
interested in, is in the “forecast” key. Back to our program, and print out the
data from that key.

import requests

r = requests.get("http://api.wunderground.com/api/your_api_key/forecast/q/France/Paris.json")

data = r.json()

print data['forecast']

The result is stored in the variable “data”.

To access our JSON data, we simply use the bracket notation, like this:
data[‘key’].

Let’s navigate a bit more through the data, by adding ‘simpleforecast’

import requests

r = requests.get("http://api.wunderground.com/api/your_api_key/forecast/q/France/Paris.json")

data = r.json()

print data['forecast']['simpleforecast']

We are still getting a bit too much output, but hold on, we are almost there.

The last step in our program is to add [‘forecastday’] and instead of printing
out each and every entry, we will use a for loop to iterate through the dictionary.

We can access anything we want like this, just look up what data you are
interested in.

In this program I wanted to get the forecast for Paris.

Let’s see what the code looks like.

import requests

r = requests.get("http://api.wunderground.com/api/0def10027afaebb7/forecast/q/France/Paris.json")
data = r.json()

for day in data['forecast']['simpleforecast']['forecastday']:
    print day['date']['weekday'] + ":"
    print "Conditions: ", day['conditions']
    print "High: ", day['high']['celsius'] + "C", "Low: ", day['low']['celsius'] + "C", '
'

Run the program.

$ python get_temp_paris.py

Monday:
Conditions:  Partly Cloudy
High:  23C Low:  10C

Tuesday:
Conditions:  Partly Cloudy
High:  23C Low:  10C

Wednesday:
Conditions:  Partly Cloudy
High:  24C Low:  14C

Thursday:
Conditions:  Mostly Cloudy
High:  26C Low:  15C

The forecast feature is just one of many. I will leave it up to you to explore
the rest.

Once you understand an API and its JSON output, you understand
how most of them work.

More Reading

A comprehensive list of Python APIs
Weather Underground

Python – Quick Start Web

Python Quick Start Web

This post will be a collection of the posts we have written about Python for
the web. What is Python for the Web? Basically, Python programs that interface with or scrape websites and/or utilize web-based APIs.

If you like what you read, please take a moment to share it using the buttons
above.

How to use Reddit API in Python

In the API documentation, you can see there are tons of things to do.

How to access various Web Services in Python

A very good way of learning Python is trying to work with various Web Services
API’s. In order to answer that, we will first have to get some knowledge about API’s,
JSON, Data structures etc.

Web Scraping with BeautifulSoup

“Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.”

HTML parsing is easy in Python, especially with the help of the BeautifulSoup library. In this post we will scrape a website (our own) to extract all URLs.

Tweet Search with Python

Twitter’s API is REST-based and will return results as either XML or JSON, as well as both RSS and ATOM feed formats. Public timelines can be accessed by any client, but all other Twitter methods require authentication.

CommandLineFu with Python

A common first step when you want to use a web-based service is to see if it has an API.
How to use urllib2 in Python

urllib2 is a Python module that can be used for fetching URLs.

Using Requests in Python

Requests is an Apache2 Licensed HTTP library, written in Python. It is designed to be used by humans to interact with web services.

Beautiful Soup 4 Python

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

Using Feedparser in Python

In this post we will take a look on how we can download and parse syndicated
feeds with Python.

Scraping websites with Python

“Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.”

HTML parsing is easy in Python, especially with the help of the BeautifulSoup library. In this post we will scrape a website (our own) to extract all URLs.

Python API and JSON

An application programming interface (API) is a protocol intended to be used as
an interface by software components to communicate with each other.

Python Mechanize Cheat Sheet

A very useful Python module for navigating through web forms is Mechanize.

Using the YouTube API in Python

In this post we will be looking on how to use the YouTube API in Python. This program will show how we can use the API to retrieve feeds from YouTube.

How to use the Vimeo API in Python

Vimeo offers an API which lets us integrate with their site and build applications
on top of their data.

Browsing in Python with Mechanize

The mechanize module in Python gives you a browser like object to interact with web pages.

Fetching data from the Internet

urllib2 is a Python module for fetching URLs.

The awesome requests module

The Requests module is a an elegant and simple HTTP library for Python.

Parsing JSON in Python

A request to an HTTP API is often just a URL with some query parameters.

Parse JSON objects in Python

In this post we will explain how you can parse JSON objects in Python.

Knowing how to parse JSON objects is useful when you want to access an API
from various web services that give their responses in JSON.

What is JSON

JSON (JavaScript Object Notation) is a compact, text based format for computers to exchange data.

How to use Reddit API in Python

Reddit API – Overview

In an earlier post “How to access various Web Services in Python“, we described
how we can access services such as YouTube, Vimeo and Twitter via their API’s.

Note, there are a few Reddit Wrappers that you can use to interact with Reddit.

A wrapper is an API client that is commonly used to wrap the API calls into
easy-to-use functions, so the user can be less concerned with how the calls
actually work.

If you don’t use a wrapper, you will have to access the Reddit API directly,
which is exactly what we will do in this post.

Getting started

Since we are going to focus on the API from Reddit, let’s head over to their API
documentation. I recommend that you get familiar with the documentation and also
pay extra attention to the overview and the sections about “modhashes”,
“fullnames” and “type prefixes”.

The result from the API will return as either XML or JSON. In this post we will
use the JSON format.

Please refer to above post or the official documentation for more information
about the JSON structure.

API documentation

In the API documentation, you can see there are tons of things to do.

In this post, we have chosen to extract information from our own Reddit account.

The information we need for that is: GET /user/username/where[ .json | .xml ]

GET /user/username/where[ .json | .xml ]

  • /user/username/overview
  • /user/username/submitted
  • /user/username/comments
  • /user/username/liked
  • /user/username/disliked
  • /user/username/hidden
  • /user/username/saved

Viewing the JSON output

If we for example want to use “comments”, the URL would be:
http://www.reddit.com/user/spilcm/comments/.json

You can see that we have replaced “username” and “where” with our own input.

To see the data response, you can either make a curl request, like this:

curl http://www.reddit.com/user/spilcm/comments/.json

…or just paste the URL into your browser.

You can see that the response is JSON. This may be difficult to look at in the
browser, unless you have the JSONView plugin installed.

These extensions are available for Firefox and Chrome.

Start coding

Now that we have the URL, let’s start to do some coding.

Open up your favourite IDLE / Editor and import the modules that we will need.

Importing the modules. The pprint and json modules are optional.

from pprint import pprint

import requests

import json

Make The API Call

Now its time to make the API call to Reddit.

r = requests.get(r'http://www.reddit.com/user/spilcm/comments/.json')

Now, we have a Response object called “r”. We can get all the information we need
from this object.

JSON Response Content

The Requests module comes with a builtin JSON decoder, which we can use with
the JSON data.

The raw output that we get back from that call is not really what we
want to display.

The question is, how do we extract useful data from it?

If we just want to look at the keys in the “r” object:

r = requests.get(r'http://www.reddit.com/user/spilcm/comments/.json')

data = r.json()

print data.keys()

That should give us the following output:

[u’kind’, u’data’]

These keys are very important to us.

Now its time to get the data that we are interested in.

Get the JSON feed and copy/paste the output into a JSON editor to get an easier
overview over the data.

An easy way of doing that is to paste JSON result into an online JSON editor.

I use http://jsoneditoronline.org/ but any JSON editor should do the work.

Let’s see an example of this:

r = requests.get(r'http://www.reddit.com/user/spilcm/comments/.json')
r.text

As you can see from the output, we get the same keys (kind, data) as we did before
when we printed the keys.

Convert JSON into a dictionary

Let’s convert the JSON data into Python dictionary.

You can do that like this:

r.json()

#OR

json.loads(r.text)

Now that we have a Python dictionary, we can start using it to get the results
we want.

Navigate to find useful data

Just navigate your way down until you find what you’re after.

r = requests.get(r'http://www.reddit.com/user/spilcm/comments/.json')

r.text

data = r.json()

print data['data']['children'][0]

The result is stored in the variable “data”.

To access our JSON data, we simply use the bracket notation, like this:
data[‘key’].

Remember that an array is indexed from zero.

Instead of printing each and every entry, we can use a for loop to iterate
through our dictionary.

for child in data['data']['children']:

    print child['data']['id'], "\n", child['data']['author'], child['data']['body']

    print

We can access anything we want like this, just look up what data you are
interested in.

The complete script

As you can see in our complete script, we only have to import one module:
(requests)

import requests

r = requests.get(r'http://www.reddit.com/user/spilcm/comments/.json')

r.text

data = r.json()

for child in data['data']['children']:
    print child['data']['id'], "\n", child['data']['author'], child['data']['body']
    print

When you run the script, it will print the ID, author and body of each comment.

More Reading

http://docs.python-requests.org/en/latest/

How to access various Web Services in Python

Overview

A very good way of learning Python is trying to work with various Web Services
API’s.

How do I access web services such as Youtube, Vimeo, Twitter?

In order to answer that, we will first have to get some knowledge about API’s,
JSON, Data structures etc.

Getting Started

For those of you who have followed us, you have hopefully gained some basic
Python knowledge. For those who haven’t, I’d suggest that you start reading
our pages at the very top of the site or click on the link below that you want
to read more about.

Python Tutorial

Basics (Overview)

Dictionary

Functions

Lists

Loops

Modules

Strings

API : Application Programming Interface

An API is a protocol intended to be used as an interface by software components
to communicate with each other. An API is a set of programming instructions and
standards for accessing web based software applications (such as above).

With APIs, applications talk to each other without any user knowledge or
intervention.

Often, companies like Google, Vimeo and Twitter release their APIs to the public
so that developers can develop products that are powered by their services.

It is important to know that an API is a software-to-software interface,
not a user interface.

API Key

Many services on the Internet (such as Twitter, Facebook..) require that you
have an “API Key”.

An application programming interface key (API key) is a code passed in by
computer programs calling an API to identify the calling program, its developer,
or its user to the Web site.

API keys are used to track and control how the API is being used, for example
to prevent malicious use or abuse of the API.

The API key often acts as both a unique identifier and a secret token for
authentication, and will generally have a set of access rights on the API
associated with it.

When we interact with an API, we often get the responses in a format called JSON.

Json

Let’s very quickly, and without going into too much depth, see what JSON is.
JSON (JavaScript Object Notation) is a compact, text-based format for computers
to exchange data.

It’s built on two structures:

  • A collection of name/value pairs
  • An ordered list of values

JSON takes these forms: object, array, value, string, number.

Object
  • Unordered set of name/value pairs.
  • Begins with { and ends with }.
  • Each name is followed by : (colon).
  • The name/value pairs are separated by , (comma).

Array
  • Ordered collection of values.
  • Begins with [ and ends with ].
  • Values are separated by , (comma).

Value
  • Can be a string in double quotes, a number, true, false or null,
    or an object or an array.

String
  • A sequence of zero or more Unicode characters, wrapped in double
    quotes, using backslash escapes.

Number
  • Integer, long, float.
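
As a tiny, made-up example of those building blocks, here is a JSON object containing strings, a number and an array, parsed with the json module:

import json

# A small JSON object: name/value pairs, a number and an array
doc = '{"name": "Python For Beginners", "founded": 2012, "topics": ["APIs", "JSON"]}'

data = json.loads(doc)

print data['name']
print data['topics'][0]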

Accessing Web Services

Python provides us with the json and simplejson modules to interact with JSON.
At this time, we should know what an API is and what it does. Additionally, we now
know the basics of JSON.

To get started with accessing web services, we first need to find a URL to
call the API.

Before we get the URL, I’d really recommend that you read the documentation
provided (if any).

The documentation describes how to use the API and contains important information
on how we can interact with it.
The URL that we need can often be found on the company’s website,
at the same place where the API documentation is.

As an example:

YouTube
http://gdata.youtube.com/feeds/api/standardfeeds/most_popular?v=2&alt=json

Vimeo
http://vimeo.com/api/v2/video/video_id.output

Reddit
http://www.reddit.com/user/spilcm/comments/.json

Please note that these can be outdated; hence, verify that you have the latest
version.

When you have a URL and you have read the documentation provided, start by
importing the modules you need.

What modules do I need?

The modules I usually use when working with JSON are:
requests
json (or simplejson)
pprint

I used to use the urllib2 module to open the URL’s, but ever since Kenneth Reitz
gave us the Requests module, I’m letting that module do most of my HTTP tasks.

Working with the data

Once you know which URL you need and have imported the necessary modules,
you can use the requests module to get the JSON feed.

r = requests.get("http://www.reddit.com/user/spilcm/about/.json")
r.text

You can copy and paste the output into a JSON editor to get an easier overview
over the data.

I use http://jsoneditoronline.org/ but any JSON editor should do the work.

The next step would be to convert the JSON output into a Python dictionary.

Converting the data

This will take the JSON string and make it a dictionary:
json.loads(r.text)

Note: You can also take a python object and serialize it to JSON, by using
json.dumps().

However, that is not what we want to do now.

Looping through the result

We now have a Python dictionary and we can start using it to get the results
we want.

A common way of doing that is to loop through the result and get the data that
you are interested in.

This can sometimes be the tricky part and you need to look carefully at how the
structure is presented.
Again, using a JSON editor will make it easier.

Using the YouTube API

At this point, we should have enough knowledge and information to create a program.

This program will show the most popular videos on YouTube.

#Import the modules
import requests
import json

# Get the feed
r = requests.get("http://gdata.youtube.com/feeds/api/standardfeeds/top_rated?v=2&alt=jsonc")
r.text

# Convert it to a Python dictionary
data = json.loads(r.text)

# Loop through the result.
for item in data['data']['items']:

    print "Video Title: %s" % (item['title'])

    print "Video Category: %s" % (item['category'])

    print "Video ID: %s" % (item['id'])

    print "Video Rating: %f" % (item['rating'])

    print "Embed URL: %s" % (item['player']['default'])

    print

See how we loop through the result to get the keys and values that we want.

YouTube, Vimeo and Twitter Examples

How to use the YouTube API in Python

How to use the Vimeo API in Python

How to use the Twitter API in Python

Parsing JSON

API Documentation for various Web Services

YouTube
https://developers.google.com/youtube/2.0/developers_guide_json

Vimeo
http://developer.vimeo.com/apis/

Twitter
https://dev.twitter.com/docs/api/1.1/overview

Reddit
http://www.reddit.com/dev/api

Using pywhois for retrieving WHOIS information

What is pywhois?

pywhois is a Python module for retrieving WHOIS information of domains. pywhois works with Python 2.4+ and no external dependencies [Source]

Installation

The installation of pywhois is done through the pip command.

pip install python-whois

Now that the package is installed, you can start using it. Remember that you have to import it first.

import whois

pywhois Usage

We can use the pywhois module to query a WHOIS server directly and to parse WHOIS data for a given domain. We are able to extract data for all the popular TLDs (com, org, net, …)

pywhois Examples

On the pywhois project website, we can see how we can use pywhois to extract data.

Let’s begin by importing the whois module and create a variable.

>>> import whois
>>> w = whois.whois('pythonforbeginners.com')

To print the values of all found attributes, we simply type:

>>> print w

The output should look something like this:

creation_date: [datetime.datetime(2012, 9, 15, 0, 0), '15 Sep 2012 20:41:00']
domain_name: ['PYTHONFORBEGINNERS.COM', 'pythonforbeginners.com']
...
...
updated_date: 2013-08-20 00:00:00
whois_server: whois.enom.com

We can print out any attribute we want. Say that you just want to print out the expiration date:

>>> w.expiration_date 

Show the content downloaded from the whois server:

>>> w.text 

To make the program a bit more interactive, we can add a prompt where users can enter any domain they want to retrieve WHOIS information for.

import whois

data = raw_input("Enter a domain: ")
w = whois.whois(data)

print w

With the help of the pywhois module, we can use Python to do WHOIS lookups.
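
As a small, hedged extension of the examples above (the domain list is made up), you can loop over several domains and print a single attribute for each:

import whois

# A made-up list of domains to look up
domains = ['pythonforbeginners.com', 'python.org']

for domain in domains:
    w = whois.whois(domain)
    print domain, "expires:", w.expiration_date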

More reading

http://code.google.com/p/pywhois/

Tweet Search with Python

Overview

Twitter’s API is REST-based and will return results as either XML or JSON, as well as both RSS and ATOM feed formats. Public timelines can be accessed by any client, but all other Twitter methods require authentication.

About this script

The program is well documented and should be straightforward. Open up a text editor, copy & paste the code below.

Save the file as: “tweet_search.py” and exit the editor.

Getting Started

Let’s take a look at the program below that we call tweet_search.py

#!/usr/bin/python

import json
import sys
import urllib2
import os

usage = """
Usage: ./tweet_search.py 'keyword'
e.g ./tweet_search.py pythonforbeginners

Use "+" to replace whitespace"
e.g ./tweet_search.py "python+for+beginners"
"""

# Check that the user puts in an argument, else print the usage variable, then quit.
if len(sys.argv)!=2:
    print (usage)
    sys.exit(0)

# The screen name in Twitter, is the screen name of the user for whom to return results for. 

# Set the screen name to the second argument
screen = sys.argv[1]

# Open the twitter search URL the result will be shown in json format
url = urllib2.urlopen("http://search.twitter.com/search.json?q="+screen)

#convert the data and load it into json
data = json.load(url)

#to print out how many tweets there are
print len(data), "tweets"

# Start parse the tweets from the result

# Get only text
for tweet in data["results"]:
    print tweet["text"]

# Get the status and print out the contents
for status in data['results']:
    print "(%s) %s" % (status["created_at"], status["text"])

How does it work?

Let’s break down the script to see what it does.

The script starts with importing the modules we are going to need

Line 3-6

import json
import sys
import urllib2
import os

We create a usage variable to explain how to use the script.

Line 8-14

usage = """
Usage: ./tweet_search.py 'keyword'
e.g ./tweet_search.py pythonforbeginners

Use "+" to replace whitespace"
e.g ./tweet_search.py "python+for+beginners"
"""

On Line 16 we check that the user puts in an argument, else print the usage variable, then quit.



if len(sys.argv)!=2:
    print (usage)
    sys.exit(0)

Line 21-24 sets the Twitter screen name to the second argument.



screen = sys.argv[1]

Line 27 opens the Twitter search URL and the result will be shown in JSON format.



url = urllib2.urlopen("http://search.twitter.com/search.json?q="+screen)

Line 30 converts the data and loads it into json



data = json.load(url)

On Line 33 we print out the number of tweets



print len(data), "tweets"

From Line 38 we start to parse the tweets from the result

for tweet in data["results"]:
    print tweet["text"]

The last thing we do in this script is to get the status and print out the contents (Line 42)



for status in data['results']:
    print "(%s) %s" % (status["created_at"], status["text"])

Go through the script line by line to see what it does. Make sure to look at it, and try to understand it.

How to use urllib2 in Python

Overview

While the title of this post says “Urllib2”, we are going to show some
examples where you use urllib, since they are often used together.

This is going to be an introduction post of urllib2, where we are going to
focus on Getting URLs, Requests, Posts, User Agents and Error handling.

Please see the official documentation for more information.

Also, this article is written for Python version 2.x

HTTP is based on requests and responses – the client makes requests and
servers send responses.

A program on the Internet can work as a client (access resources) or as
a server (makes services available).

A URL identifies a resource on the Internet.

What is Urllib2?

urllib2 is a Python module that can be used for fetching URLs.

It defines functions and classes to help with URL actions (basic and digest
authentication, redirections, cookies, etc)

The magic starts with importing the urllib2 module.

What is the difference between urllib and urllib2?

While both modules do URL request related stuff, they have different
functionality

urllib2 can accept a Request object to set the headers for a URL request,
urllib accepts only a URL.

urllib provides the urlencode method which is used for the generation
of GET query strings, urllib2 doesn’t have such a function.

Because of that urllib and urllib2 are often used together.

Please see the documentation for more information.

Urllib
Urllib2

What is urlopen?

urllib2 offers a very simple interface, in the form of the urlopen function.

This function is capable of fetching URLs using a variety of different protocols
(HTTP, FTP, …)

Just pass the URL to urlopen() to get a “file-like” handle to the remote data.

Additionally, urllib2 offers an interface for handling common situations –
like basic authentication, cookies, proxies and so on.

These are provided by objects called handlers and openers.

Getting URLs

This is the most basic way to use the library.

Below you can see how to make a simple request with urllib2.

Begin by importing the urllib2 module.

Place the response in a variable (response)

The response is now a file-like object.

Read the data from the response into a string (html)

Do something with that string.

Note: if there is a space in the URL, you will need to encode it using urlencode.

Let’s see an example of how this works.

import urllib2
response = urllib2.urlopen('https://www.pythonforbeginners.com/')
print response.info()
html = response.read()
# do something
response.close()  # best practice to close the file

Note: you can also use a URL starting with "ftp:", "file:", etc.

The remote server accepts the incoming values and formats a plain text response
to send back.

The return value from urlopen() gives access to the headers from the HTTP server
through the info() method, and the data for the remote resource via methods like
read() and readlines().

Additionally, the file object that is returned by urlopen() is iterable.

Simple urllib2 script

Let’s show another example of a simple urllib2 script

import urllib2
response = urllib2.urlopen('http://python.org/')
print "Response:", response

# Get the URL. This gets the real URL. 
print "The URL is: ", response.geturl()

# Getting the code
print "This gets the code: ", response.code

# Get the Headers. 
# This returns a dictionary-like object that describes the page fetched, 
# particularly the headers sent by the server
print "The Headers are: ", response.info()

# Get the date part of the header
print "The Date is: ", response.info()['date']

# Get the server part of the header
print "The Server is: ", response.info()['server']

# Get all data
html = response.read()
print "Get all data: ", html

# Get only the length
print "Get the length :", len(html)

# Showing that the file object is iterable
for line in response:
 print line.rstrip()

# Note that the rstrip strips the trailing newlines and carriage returns before
# printing the output.
  

Download files with Urllib2

This small script will download a file from pythonforbeginners.com website

import urllib2

# file to be written to
file = "downloaded_file.html"

url = "https://www.pythonforbeginners.com/"
response = urllib2.urlopen(url)

#open the file for writing
fh = open(file, "w")

# read from request while writing to file
fh.write(response.read())
fh.close()

# You can also use the with statement:
with open(file, 'w') as f: f.write(response.read())

The difference in the next script is that we use ‘wb’, which means that we open the
file in binary mode.

import urllib2

mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")

output = open('test.mp3','wb')

output.write(mp3file.read())

output.close()

Urllib2 Requests

The Request object represents the HTTP request you are making.

In its simplest form you create a request object that specifies the URL you want
to fetch.

Calling urlopen with this Request object returns a response object for the URL
requested.

The Request class in the urllib2 module accepts both a URL and a data parameter.

When you don’t include the data (and only pass the url), the request being made
is actually a GET request

When you do include the data, the request being made is a POST request, where the
url will be your post url, and the parameter will be http post content.

Let’s take a look at the example below

import urllib2
import urllib

# Specify the url
url = 'https://www.pythonforbeginners.com'

# This packages the request (it doesn't make it) 
request = urllib2.Request(url)

# Sends the request and catches the response
response = urllib2.urlopen(request)

# Extracts the response
html = response.read()

# Print it out
print html 

You can set the outgoing data on the Request to post it to the server.

Additionally, you can pass extra information (“metadata”) about the data or
about the request itself to the server – this information is sent as HTTP
“headers”.

If you want to POST data, you first have to put the data in a dictionary.

Make sure that you understand what the code does.

# Prepare the data
query_args = { 'q':'query string', 'foo':'bar' }

# This urlencodes your data (that's why we need to import urllib at the top)
data = urllib.urlencode(query_args)

# Send HTTP POST request
request = urllib2.Request(url, data)

response = urllib2.urlopen(request)
 
html = response.read()

# Print the result
print html

User Agents

The way a browser identifies itself is through the User-Agent header.

By default urllib2 identifies itself as Python-urllib/x.y
where x and y are the major and minor version numbers of the Python release.

This could confuse the site, or just plain not work.

With urllib2 you can add your own headers to the request.

The reason why you would want to do that is that some websites dislike being
browsed by programs.

If you are creating an application that will access other people’s web resources,
it is courteous to include real user agent information in your requests,
so they can identify the source of the hits more easily.

When you create the Request object you can add your headers to a dictionary,
and use the add_header() to set the user agent value before opening the request.

That would look something like this:

# Importing the module
import urllib2

# Define the url
url = 'http://www.google.com/#q=my_search'

# Add your headers
headers = {'User-Agent' : 'Mozilla 5.10'}

# Create the Request. 
request = urllib2.Request(url, None, headers)

# Getting the response
response = urllib2.urlopen(request)

# Print the headers
print response.headers

You can also add headers with “add_header()”

syntax: Request.add_header(key, val)

urllib2.Request.add_header

The example below uses Mozilla 5.10 as the User Agent, and that is also what
will show up in the web server log file.

import urllib2

req = urllib2.Request('http://192.168.1.2/')

req.add_header('User-agent', 'Mozilla 5.10')

res = urllib2.urlopen(req)

html = res.read()

print html

This is what will show up in the log file:
“GET / HTTP/1.1” 200 151 “-” “Mozilla 5.10”

urlparse

The urlparse module provides functions to analyze URL strings.

It defines a standard interface to break Uniform Resource Locator (URL)
strings up in several optional parts, called components, known as
(scheme, location, path, query and fragment)

Let’s say you have a URL:
http://www.python.org:80/index.html

The scheme would be http

The location would be www.python.org:80

The path is index.html

We don’t have any query and fragment

The most common functions are urljoin and urlsplit

import urlparse

url = "http://python.org"

domain = urlparse.urlsplit(url)[1].split(':')[0]

print "The domain name of the url is: ", domain

For more information about urlparse, please see the official documentation.

urllib.urlencode

When you pass information through a URL, you need to make sure it only uses
specific allowed characters.

Allowed characters are any alphabetic characters, numerals, and a few special
characters that have meaning in the URL string.

The most commonly encoded character is the space character. You see this
character whenever you see a plus sign (+) in a URL; the plus sign acts as a
special character representing a space.

Arguments can be passed to the server by encoding them with urlencode and
appending them to the URL.

Let’s take a look at the following example.

import urllib
import urllib2

query_args = { 'q':'query string', 'foo':'bar' } # you have to pass in a dictionary  

encoded_args = urllib.urlencode(query_args)

print 'Encoded:', encoded_args

url = 'http://python.org/?' + encoded_args

print urllib2.urlopen(url).read()

If I would print this now, I would get an encoded string like this:
q=query+string&foo=bar

Python’s urlencode takes variable/value pairs and creates a properly escaped
querystring:

from urllib import urlencode

artist = "Kruder & Dorfmeister"

artist = urlencode({'ArtistSearch':artist})

This sets the variable artist equal to:

Output : ArtistSearch=Kruder+%26+Dorfmeister

Error Handling

This section on error handling is based on the information from Voidspace.org.uk’s great article:
Urllib2 – The Missing Manual

urlopen raises URLError when it cannot handle a response.

HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.

URLError

Often, URLError is raised because there is no network connection,
or the specified server doesn’t exist.

In this case, the exception raised will have a ‘reason’ attribute,
which is a tuple containing an error code and a text error message.

Example of URLError

req = urllib2.Request('http://www.pretend_server.org')

try: 
    urllib2.urlopen(req)

except urllib2.URLError, e:
    print e.reason

(4, 'getaddrinfo failed')

HTTPError

Every HTTP response from the server contains a numeric “status code”.

Sometimes the status code indicates that the server is unable to fulfill
the request.

The default handlers will handle some of these responses for you (for example,
if the response is a “redirection” that requests the client fetch the document
from a different URL, urllib2 will handle that for you).

For those it can’t handle, urlopen will raise an HTTPError.

Typical errors include ‘404’ (page not found), ‘403’ (request forbidden),
and ‘401’ (authentication required).

When an error is raised the server responds by returning an HTTP error code
and an error page.

You can use the HTTPError instance as a response object for the page returned.

This means that as well as the code attribute, it also has read, geturl,
and info, methods.

req = urllib2.Request('http://www.python.org/fish.html')

try:
    urllib2.urlopen(req)

except urllib2.HTTPError, e:
    print e.code
    print e.read()

from urllib2 import Request, urlopen, URLError

req = Request(someurl)

try:
    response = urlopen(req)

except URLError, e:

    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason

    elif hasattr(e, 'code'):
        print 'The server could not fulfill the request.'
        print 'Error code: ', e.code
else:
    # everything is fine
    pass

Please take a look at the links below to get more understanding of the Urllib2
library.

Sources and further reading

http://pymotw.com/2/urllib2/
http://www.kentsjohnson.com/
http://www.voidspace.org.uk/python/articles/urllib2.shtml
http://techmalt.com/
http://www.hacksparrow.com/
http://docs.python.org/2/howto/urllib2.html
http://www.stackoverflow.com
http://www.oreillynet.com/
