Tutorial on parsing HTML (and XML) in Python
What is a parser? (From Wikipedia)
The most common use of a parser is as a component of a compiler or interpreter. This parses the source code of a computer programming language to create some form of internal representation.
What is the point of this tutorial?
Learning how to grab any piece of information you want off any webpage, efficiently and easily.
Section 1 - Basics of HTML tags, attributes & difference from XML
HTML tags are keywords enclosed in angle brackets.
HTML Tag Examples Wrote:<html>
</img>
<br />
HTML elements have a start tag and an end tag, and consist of the start tag, the end tag, and everything in between. The exception is elements that contain their own end tag, like <br /> (break) and <img /> elements; in these cases the start tag contains the closing "/>".
HTML Element Examples Wrote:<p>I like to eat my cats' poop, HAHA</p>
<br />
<a href="http://meatspin.com">http://www.google.com/</a>
<img src="someimageurl.jpg" />
HTML attributes are basically extra information describing an element. Attributes always go in the start tag, in the format name="value".
HTML Attribute Examples Wrote:<a id="LINK" href="www.google.com">GOOGLE</a>
id and href are attributes of the a element; "LINK" is the id and www.google.com is the href.
<input type="text" class="purpleInput" />
type and class are attributes of the input element; the type is "text" and the class is "purpleInput".
An HTML document consists of a bunch of elements, so it's pretty much all just start tags, end tags, and plain text. All my examples above are proper elements by the XHTML guidelines, but when you are parsing a web page you cannot count on each page being properly formed. That's where the differences between XML and HTML come up. In this tutorial I will only be dealing with HTML, which is honestly harder to parse in most cases because of its loose rule set.
Differences between XML and HTML (structure-wise) Wrote:XML is case sensitive (tags are conventionally all lowercase) while HTML is not (though some things, like id attribute values, are still case sensitive).
HTML does not require proper closing tags (i.e. <a href=""> and <br> both work).
XHTML elements must be properly nested (i.e. <b><i></b></i> is not proper), and a document can contain only one root element (the <html> element in most cases).
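To see how a lenient parser copes with that looseness, here is a minimal sketch using lxml, the module introduced in Section 3 of this tutorial; the broken fragment is made up for illustration:
Code:
import lxml.html

# a fragment with bad nesting and missing closing tags
broken = "<p><b><i>badly nested</b></i> and no closing tags"
doc = lxml.html.document_fromstring(broken)

# lxml repairs it into a well-formed tree, printing something like:
# <html><body><p><b><i>badly nested</i></b> and no closing tags</p></body></html>
print lxml.html.tostring(doc)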
Now that you understand the basics of what your parser is trying to handle, on to the next step: getting the source of a web page in Python.
Section 2 - Source from a webpage with Python
Getting the source of a web page in Python is extremely easy, and this will probably be the smallest section in this tutorial.
All you really need to know is the urllib module, which most people already should.
To open a URL and grab the source, you open a connection to the web page and then read the page:
Code:
import urllib
"""REMEMBER http:// in your urlopen URL"""
webpage = urllib.urlopen("http://www.google.com")
webpageSource = webpage.read()
Now you have the variable webpageSource, which contains the source of www.google.com. This is pretty basic, but there are some tricks that can be used.
Checking if the URL has changed Wrote:This uses the geturl() method of the object that urlopen() returns.
This is useful if you try to search or go to a page but get forwarded to another one. I used this to check if I was forwarded to search.php after trying to go to an anime URL, which meant the anime URL wasn't right.
Code:
import urllib
webpage = urllib.urlopen("http://www.google.com")
if not webpage.geturl() == "http://www.google.com":
    print "oh noez i been forwarded"
Sending requests Wrote:To send a GET request you simply add (+) the data to the end of the URL string, and to send a POST request you pass the data as a second argument to the urlopen function. And that's pretty much all for this section: now you should be able to grab the source of any web page, and even send data so you know you got the proper page when server-side scripts affect the HTML output (PHP, duh).
Assuming DATA is a string variable containing data the web server can work with:
GET request:
Code:
import urllib
webpage = urllib.urlopen("http://www.somesite.com/search.php?" + DATA)
POST request:
Code:
import urllib
webpage = urllib.urlopen("http://www.somesite.com/search.php", DATA)
Section 3 - Basics of the LXML module, and the DOM
Section Sources:
- http://w3schools.com/htmldom/default.asp
- http://w3schools.com/xml/default.asp
- http://codespeak.net/lxml/lxmldoc-2.2.4.pdf
I highly recommend you download the above PDF.
I've decided to use the lxml module over other parsers (I was originally planning on using SGMLParser in this tutorial), but when it came to making a DOM representation from HTML source in Python things got a little hairy: SGMLParser couldn't do it at all, and the few tutorials I read were old and used unmaintained modules, which I didn't think would be useful for learning if they won't work in the future.
What made me choose LXML was this blog: http://blog.ianbicking.org/2008/03/30/py...rformance/
lxml homepage Wrote:lxml is a Pythonic binding for the libxml2 and libxslt libraries. It is unique in that it combines the speed and feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.
Installing lxml
lxml isn't part of the standard Python library and needs to be installed.
For people on Linux it should be as easy as installing "python-lxml" through your package manager.
If that doesn't work, or if you are on Windows, there are instructions on how to install it for different operating systems here:
http://codespeak.net/lxml/installation.html
Once it's installed you should be able to just "import lxml" without error. Please post if you have errors installing it and I can try and help you personally.
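As a quick sanity check after installing, a minimal sketch (lxml.etree exposes the library version as a tuple):
Code:
import lxml.etree

# prints something like (2, 2, 4, 0) if the install worked
print lxml.etree.LXML_VERSION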
The HTML DOM defines the objects and properties of all HTML elements, and the methods (interface) to access them. The DOM presents an HTML document as a tree-structure.
In other words: the HTML DOM is a standard way to get, change, add, or delete HTML elements.
Every single item in HTML is represented as a "node" in the DOM, which makes it much easier to find items that don't have ids, or to find an element nested under another element in a page.
First, there is only one root element, and that root element has "children", or childNodes.
For example, <html> is usually the root element, and it usually has two children: <head> and <body>.
<head> and <body> are "siblings" of each other, and <html> is their "parent" element. <head> and <body> can each have childNodes of their own, and the tree grows from there.
In the DOM you might think that the text in between tags belongs to that element's node, but that's not true: it belongs to its own childNode of that element.
Text is Always Stored in Text Nodes
For example, in <p>Hello</p>, the <p> element is the parentNode of a textNode that contains "Hello". This doesn't matter much in practice, though, because lxml has a method to grab the text from an element object and all of its childNodes.
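To make that concrete, here is a small sketch (the fragment is made up; lxml's .text property holds only the text directly inside the element, while text_content() gathers the text of all child nodes too):
Code:
import lxml.html

p = lxml.html.fromstring("<p>Hello <b>world</b></p>")
print p.text            # 'Hello ' - only the text directly inside <p>
print p.text_content()  # 'Hello world' - the element's text plus its children's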
The way lxml handles the DOM representation is simple: Python lists.
An element acts like a list, and the items in that list are its children. Let's say HTML is the document variable: print(len(HTML)) would print 2, for the <head> and <body> childNodes.
Now you are probably wondering: if it's a list, how do you even get the name of the tag you are working with? lxml comes with properties and methods on each element to get its tag name, attributes, and other cool stuff.
Basics of lxml DOM walking Wrote:Code:>>> import urllib
>>> import lxml.html
>>> webpage = urllib.urlopen("http://www.google.com")
>>> webpageSource = webpage.read()
"""HOW TO MAKE THE DOCUMENT OBJECT FROM SOURCE"""
>>> doc = lxml.html.document_fromstring(webpageSource)
"""HOW TO WALK THE DOCUMENT"""
>>> print(doc.tag)
html
>>> firstChild = doc[0]
>>> firstChild.tag
'head'
>>> childOfFirstChild = firstChild[0]
>>> childOfFirstChild.tag
'meta'
>>> attributes = childOfFirstChild.attrib
>>> print attributes
{'content': 'text/html; charset=ISO-8859-1', 'http-equiv': 'content-type'}
So as you can see from the example, walking around the node tree is pretty basic, and there are shortcuts rather than always starting at the root of the document: you can save any element as a variable and start walking from there.
And here are some useful lxml properties and methods:
lxml.html.document_fromstring(string) - parses a document from the given string; this always creates a correct HTML document
elementObject.tag - returns the tag name of the element
elementObject.attrib - returns the attributes of the element as a dictionary
elementObject.getparent() - returns the parent elementObject of the element
elementObject.getprevious() - returns the sibling immediately before the element
elementObject.getnext() - returns the sibling immediately after the element
elementObject.find_class(className) - returns a list of all the elements with the given CSS class name
elementObject.get_element_by_id(id) - returns the element with the given id, or the default if none is found
elementObject.text_content() - returns the text content of the element, including the text content of its children, with no markup
elementObject.get(attributeName) - returns the value of the given attribute of the element, or None if it is not set
elementObject.set(attributeName, value) - sets the given attribute to a certain value
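Here is a short sketch exercising a few of these (the fragment and the id/class names are made up for illustration):
Code:
import lxml.html

doc = lxml.html.document_fromstring(
    '<html><body><a id="LINK" class="nav" href="http://www.google.com">GOOGLE</a></body></html>')
link = doc.get_element_by_id("LINK")
print link.tag          # 'a'
print link.get("href")  # 'http://www.google.com'
link.set("href", "http://www.example.com")
print doc.find_class("nav")[0].get("href")  # 'http://www.example.com'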
So now you should be able to grab an elementObject anywhere in a page, and have a basic understanding of how node trees work.
Section 4 - Using what you learned
In this section we are going to combine everything you have (hopefully) learned in this tutorial and make a program that parses a photobucket page and grabs all the pictures in the gallery. It might seem too difficult, but with everything on this page the task is very possible.
When you are working on an application, the first step is to plan out what you want the program to do and see what the site does in each case. The first thing you will want is for the user to enter a photobucket gallery, which is simply done like this:
Code:
photobucketURL = raw_input("Enter a photobucket URL:\n")
downloadDirectory = raw_input("Enter local download directory:\n")
Now, hoping that the user enters proper information, you have the only two variables you need to start working on grabbing the pictures.
The type of URL we are looking for is:
Code:
http://s946.photobucket.com/albums/ad305/Leave-me-alone-plz2/
But when a person goes to a photobucket album, by default the GET request below gets added on, and if a person enters that into the raw_input it can mess up the script. You should NEVER hope that what a user enters is exactly what you expect; if they don't enter a GET request at the end, you should add your own, and if they do have one, remove it and add your own. There are about a million more errors that COULD happen with user input, but I'm not going to cover them in this tutorial, or even fix the ones I just brought up. Just PRETEND the user enters a URL without a GET request; if you want, after the tutorial you can add your own exception handling to the script.
Code:
http://s946.photobucket.com/albums/ad305/Leave-me-alone-plz2/?start=0
The ?start=0 is a GET request, and when you go to the next page you get ?start=20, so page 1 has 0, page 2 has 20, etc.
This is how you tell the server which pictures you want to view, so your script will need a variable that counts which page you are on; that way you can just multiply the variable by 20 and subtract 20 to get the right GET request. You will also have to check whether you are on the last page, which is pretty easy if you count the images: if there are fewer than 20 pictures on a page, break your loop and print that you are done.
A function to get the page source with a given page number
Code:
import os, urllib, lxml.html as html

downloadDirectory = raw_input("Enter local download directory:\n")
os.chdir(downloadDirectory)
photobucketURL = raw_input("Enter a photobucket URL:\n")
page = 1

def getPage(pageNum):
    global doc
    # page 1 -> ?start=0, page 2 -> ?start=20, and so on
    GETrequest = "?start=" + str(pageNum * 20 - 20)
    photobucketPage = urllib.urlopen(photobucketURL + GETrequest)
    photobucketSource = photobucketPage.read()
    doc = html.document_fromstring(photobucketSource)
    getPics()  # defined in the next step
Now you should have the source of photobucket album page 1 saved in a document object named 'doc'. Once you have the document object you can start parsing to download the pictures; that is what the getPics() function will contain. It will also increase the page number by 1 once it has finished the images on a page.
The next step is to parse the HTML for every direct link, and Photobucket makes this very easy: if you look at the page source, the direct-link input for each thumbnail has its own unique ID!
Code:
<label for="urlcode1">Direct Link</label>
<input type="text" class="txtCode" name="urlcode1" id="urlcode1" value="http://i946.photobucket.com/albums/ad305/Leave-me-alone-plz2/Untitled-6.png" readonly="readonly" onclick="tr('album_thumb_link_code_click');trackCodeClick(event,'image','0'); copyToClipboard(this);">
So as you can see from the above code, each image on the page has an id of urlcode plus a number from 0-19. We can use lxml's elementObject.get_element_by_id(id) method to grab each element with the ID urlcode+COUNT, and then grab the value attribute from that element to get the picture's direct link. Once we have the direct link to the picture, we are free to download it using basic Python.
Code:
import os, urllib, lxml.html as html

downloadDirectory = raw_input("Enter local download directory:\n")
os.chdir(downloadDirectory)
photobucketURL = raw_input("Enter a photobucket URL:\n")
page = 1
total = 1

def getPics():
    global page, total
    IDcounter = 0
    while IDcounter < 20:
        try:
            imgID = "urlcode" + str(IDcounter)
            imgElement = doc.get_element_by_id(imgID)  # raises if the ID is missing
            imgLink = imgElement.attrib['value']
            picURL = urllib.urlopen(imgLink)
            rPic = picURL.read()
            # 'wb' because image data is binary (no file extension is added)
            dlSpot = open('image' + str(total), 'wb')
            dlSpot.write(rPic)
            dlSpot.close()
            IDcounter += 1
            total += 1
        except:
            print "Less than 20 pictures on page"
            break
    else:
        # while/else: runs only if the loop finished without break,
        # i.e. all 20 pictures on this page were downloaded
        page += 1
        getPage(page)
The above function getPics() uses the global 'doc' variable to parse the web page. For every direct link it finds, it downloads the image, then increases the ID number and the total number of images (the total variable exists only for naming the files, because the IDcounter variable gets reset on every page). It then moves on to the next ID and downloads that. If IDcounter is still less than 20 but the ID does not exist on the page, the lookup raises an exception, meaning there are no more pictures on the page. If it gets all the way through 20 images, it increases the page count by 1 and calls the getPage() function to grab the source of that page and save it into the doc variable.
Here is the whole code. It's very, very basic: if you enter something wrong you can assume it won't work, and it gives no sign of which image it is on. If you feel up to the challenge, add some print statements to show which image it is currently on, etc.; all the variables are there for you ^__^
Code:
import os, urllib, lxml.html as html

downloadDirectory = raw_input("Enter local download directory:\n")
os.chdir(downloadDirectory)
photobucketURL = raw_input("Enter a photobucket URL:\n")
page = 1
total = 1

def getPics():
    global page, total
    IDcounter = 0
    while IDcounter < 20:
        try:
            imgID = "urlcode" + str(IDcounter)
            imgElement = doc.get_element_by_id(imgID)  # raises if the ID is missing
            imgLink = imgElement.attrib['value']
            picURL = urllib.urlopen(imgLink)
            rPic = picURL.read()
            # 'wb' because image data is binary (no file extension is added)
            dlSpot = open('image' + str(total), 'wb')
            dlSpot.write(rPic)
            dlSpot.close()
            IDcounter += 1
            total += 1
        except:
            print "Less than 20 pictures on page"
            break
    else:
        # runs only if the loop finished without break (a full page of 20)
        page += 1
        getPage(page)

def getPage(pageNum):
    global doc
    # page 1 -> ?start=0, page 2 -> ?start=20, and so on
    GETrequest = "?start=" + str(pageNum * 20 - 20)
    photobucketPage = urllib.urlopen(photobucketURL + GETrequest)
    photobucketSource = photobucketPage.read()
    doc = html.document_fromstring(photobucketSource)
    getPics()

getPage(page)
Hopefully you found this tutorial useful ^__^