Tutorial on parsing HTML (and XML) in Python
What is a parser? (From Wikipedia)
The most common use of a parser is as a component of a compiler or interpreter. This parses the source code of a computer programming language to create some form of internal representation.
What is the point of this tutorial?
Learning how to grab any piece of information you want off any webpage, efficiently and easily.
Section 1 - Basics of HTML tags, attributes & difference from XML
HTML tags are keywords enclosed in angle brackets.
HTML Tag Examples Wrote:<html>
</img>
<br />
HTML elements have a start tag and an end tag, and consist of the start tag, the end tag, and everything in between. The exception is elements that contain their own end tag, like <br /> (break) and <img /> elements; in these cases the start tag contains the closing "/>".
HTML Element Examples Wrote:<p>I like to eat my cats' poop, HAHA</p>
<br />
<a href="http://meatspin.com">http://www.google.com/</a>
<img src="someimageurl.jpg" />
HTML attributes are basically extra information describing an element. Attributes always go in the start tag, in the format name="value".
HTML Attribute Examples Wrote:<a id="LINK" href="www.google.com">GOOGLE</a>
id and href are attributes of the a element; "LINK" is the id and www.google.com is the href.
<input type="text" class="purpleInput" />
type and class are attributes of the input element; the type is "text" and the class is "purpleInput".
An HTML document consists of a bunch of elements, so it's pretty much all just start tags, end tags, and plain text. All my examples above are proper elements by the XHTML guidelines, but when you are parsing a web page you cannot count on each page being properly formed. That's where the differences between XML and HTML come up. In this tutorial I will only be dealing with HTML, which is honestly harder to parse in most cases because of its loose rule set.
Differences between XML and HTML (structure-wise) Wrote:XML is case sensitive (tags are conventionally all lowercase) while HTML is not (though some things, like id attribute values, are still case sensitive).
HTML does not require proper closing tags (i.e. <a href=""> and <br> both work).
XHTML elements must be properly nested (i.e. <b><i></b></i> is not proper), and a document can contain only one root element (the <html> element in most cases).
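To see how a lenient parser copes with that looseness, here is a minimal sketch using lxml, the module introduced in Section 3 of this tutorial; the broken fragment is made up for illustration:
Code:
import lxml.html

# a fragment with bad nesting and missing closing tags
broken = "<p><b><i>badly nested</b></i> and no closing tags"
doc = lxml.html.document_fromstring(broken)

# lxml repairs it into a well-formed tree, printing something like:
# <html><body><p><b><i>badly nested</i></b> and no closing tags</p></body></html>
print lxml.html.tostring(doc)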
Now that you understand the basics of what your parser is trying to handle, on to the next step: getting the source of a web page in Python.
Section 2 - Source from a webpage with Python
Getting the source of a web page in Python is extremely easy, and this will probably be the smallest section in this tutorial.
All you really need to know is the urllib module, which most people already should.
To open a URL and grab the source, you open a connection to the web page and then read the page:
Code:
import urllib
"""REMEMBER http:// in your urlopen URL"""
webpage = urllib.urlopen("http://www.google.com")
webpageSource = webpage.read()
Now you have the variable webpageSource, which contains the source of www.google.com. This is pretty basic, but there are some tricks that can be used.
Checking if the URL has changed Wrote:This uses the geturl() method of the object that urlopen() returns.
This is useful if you try to search or go to a page but get forwarded to another one. I used this to check if I was forwarded to search.php after trying to go to an anime URL, which meant the anime URL wasn't right.
Code:
import urllib
webpage = urllib.urlopen("http://www.google.com")
if not webpage.geturl() == "http://www.google.com":
    print "oh noez i been forwarded"
Sending requests Wrote:To send a GET request you simply add (+) the data to the end of the URL string, and to send a POST request you pass the data as a second argument to the urlopen function. And that's pretty much all for this section: now you should be able to grab the source of any web page, and even send data so you know you got the proper page when server-side scripts affect the HTML output (PHP, duh).
Assuming DATA is a string variable containing data the web server can work with:
GET request:
Code:
import urllib
webpage = urllib.urlopen("http://www.somesite.com/search.php?" + DATA)
POST request:
Code:
import urllib
webpage = urllib.urlopen("http://www.somesite.com/search.php", DATA)
Section 3 - Basics of the LXML module, and the DOM
Section Sources:
- http://w3schools.com/htmldom/default.asp
- http://w3schools.com/xml/default.asp
- http://codespeak.net/lxml/lxmldoc-2.2.4.pdf
I highly recommend you download the above PDF.
I've decided to use the lxml module over other parsers (I was originally planning on using SGMLParser in this tutorial), but when it came to making a DOM representation from HTML source in Python things got a little hairy: SGMLParser couldn't do it at all, and the few tutorials I read were old and used unmaintained modules, which I didn't think would be useful for learning if they won't work in the future.
What made me choose LXML was this blog: http://blog.ianbicking.org/2008/03/30/py...rformance/
lxml homepage Wrote:lxml is a Pythonic binding for the libxml2 and libxslt libraries. It is unique in that it combines the speed and feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.
Installing lxml
lxml isn't part of the standard Python library and needs to be installed.
For people on Linux it should be as easy as installing "python-lxml" through your package manager.
If that doesn't work, or if you are on Windows, there are instructions on how to install it for different operating systems here:
http://codespeak.net/lxml/installation.html
Once it's installed you should be able to just "import lxml" without error. Please post if you have errors installing it and I can try and help you personally.
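As a quick sanity check after installing, a minimal sketch (lxml.etree exposes the library version as a tuple):
Code:
import lxml.etree

# prints something like (2, 2, 4, 0) if the install worked
print lxml.etree.LXML_VERSION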
The HTML DOM defines the objects and properties of all HTML elements, and the methods (interface) to access them. The DOM presents an HTML document as a tree-structure.
In other words: the HTML DOM is a standard way to get, change, add, or delete HTML elements.
Every single item in HTML is represented as a "node" in the DOM, which makes it much easier to find items that don't have ids, or to find an element nested under another element in a page.
First, there is only one root element, and that root element has "children", or childNodes.
For example, <html> is usually the root element, and it usually has two children: <head> and <body>.
<head> and <body> are "siblings" of each other, and <html> is their "parent" element. <head> and <body> can each have childNodes of their own, and the tree grows from there.
In the DOM you might think that the text in between tags belongs to that element's node, but that's not true: it belongs to its own childNode of that element.
Text is Always Stored in Text Nodes
For example, in <p>Hello</p>, the <p> element is the parentNode of a textNode that contains "Hello". This doesn't matter much in practice, though, because lxml has a method to grab the text from an element object and all of its childNodes.
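To make that concrete, here is a small sketch (the fragment is made up; lxml's .text property holds only the text directly inside the element, while text_content() gathers the text of all child nodes too):
Code:
import lxml.html

p = lxml.html.fromstring("<p>Hello <b>world</b></p>")
print p.text            # 'Hello ' - only the text directly inside <p>
print p.text_content()  # 'Hello world' - the element's text plus its children's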
The way lxml handles the DOM representation is simple: Python lists.
An element acts like a list, and the items in that list are its children. Let's say HTML is the document variable: print(len(HTML)) would print 2, for the <head> and <body> childNodes.
Now you are probably wondering: if it's a list, how do you even get the name of the tag you are working with? lxml comes with properties and methods on each element to get its tag name, attributes, and other cool stuff.
Basics of lxml DOM walking Wrote:Code:>>> import urllib
>>> import lxml.html
>>> webpage = urllib.urlopen("http://www.google.com")
>>> webpageSource = webpage.read()
"""HOW TO MAKE THE DOCUMENT OBJECT FROM SOURCE"""
>>> doc = lxml.html.document_fromstring(webpageSource)
"""HOW TO WALK THE DOCUMENT"""
>>> print(doc.tag)
html
>>> firstChild = doc[0]
>>> firstChild.tag
'head'
>>> childOfFirstChild = firstChild[0]
>>> childOfFirstChild.tag
'meta'
>>> attributes = childOfFirstChild.attrib
>>> print attributes
{'content': 'text/html; charset=ISO-8859-1', 'http-equiv': 'content-type'}
So as you can see from the example, walking around the node tree is pretty basic, and there are shortcuts rather than always starting at the root of the document: you can save any element as a variable and start walking from there.
And here are some useful lxml properties and methods:
lxml.html.document_fromstring(string) - parses a document from the given string; this always creates a correct HTML document
elementObject.tag - returns the tag name of the element
elementObject.attrib - returns the attributes of the element as a dictionary
elementObject.getparent() - returns the parent elementObject of the element
elementObject.getprevious() - returns the sibling immediately before the element
elementObject.getnext() - returns the sibling immediately after the element
elementObject.find_class(className) - returns a list of all the elements with the given CSS class name
elementObject.get_element_by_id(id) - returns the element with the given id, or the default if none is found
elementObject.text_content() - returns the text content of the element, including the text content of its children, with no markup
elementObject.get(attributeName) - returns the value of the given attribute of the element, or None if it is not set
elementObject.set(attributeName, value) - sets the given attribute to a certain value
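Here is a short sketch exercising a few of these (the fragment and the id/class names are made up for illustration):
Code:
import lxml.html

doc = lxml.html.document_fromstring(
    '<html><body><a id="LINK" class="nav" href="http://www.google.com">GOOGLE</a></body></html>')
link = doc.get_element_by_id("LINK")
print link.tag          # 'a'
print link.get("href")  # 'http://www.google.com'
link.set("href", "http://www.example.com")
print doc.find_class("nav")[0].get("href")  # 'http://www.example.com'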
So now you should be able to grab an elementObject anywhere in a page, and have a basic understanding of how node trees work.
Section 4 - Using what you learned
In this section we are going to combine everything you have (hopefully) learned in this tutorial and make a program that parses a photobucket page and grabs all the pictures in the gallery. It might seem too difficult, but with everything on this page the task is very possible.
When you are working on an application, the first step is to plan out what you want the program to do and see what the site does in each case. The first thing you will want is for the user to enter a photobucket gallery, which is simply done like this:
Code:
photobucketURL = raw_input("Enter a photobucket URL:\n")
downloadDirectory = raw_input("Enter local download directory:\n")
Now, hoping that the user enters proper information, you have the only two variables you need to start working on grabbing the pictures.
The type of URL we are looking for is:
Code:
http://s946.photobucket.com/albums/ad305/Leave-me-alone-plz2/
But when a person goes to a photobucket album, by default the GET request below gets added on, and if a person enters that into the raw_input it can mess up the script. You should NEVER hope that what a user enters is exactly what you expect; if they don't enter a GET request at the end, you should add your own, and if they do have one, remove it and add your own. There are about a million more errors that COULD happen with user input, but I'm not going to cover them in this tutorial, or even fix the ones I just brought up. Just PRETEND the user enters a URL without a GET request; if you want, after the tutorial you can add your own exception handling to the script.
Code:
http://s946.photobucket.com/albums/ad305/Leave-me-alone-plz2/?start=0
The ?start=0 is a GET request, and when you go to the next page you get ?start=20, so page 1 has 0, page 2 has 20, etc.
This is how you tell the server which pictures you want to view, so your script will need a variable that counts which page you are on; that way you can just multiply the variable by 20 and subtract 20 to get the right GET request. You will also have to check whether you are on the last page, which is pretty easy if you count the images: if there are fewer than 20 pictures on a page, break your loop and print that you are done.
A function to get the page source with a given page number
Code:
import os, urllib, lxml.html as html

downloadDirectory = raw_input("Enter local download directory:\n")
os.chdir(downloadDirectory)
photobucketURL = raw_input("Enter a photobucket URL:\n")
page = 1

def getPage(pageNum):
    global doc
    # page 1 -> ?start=0, page 2 -> ?start=20, and so on
    GETrequest = "?start=" + str(pageNum * 20 - 20)
    photobucketPage = urllib.urlopen(photobucketURL + GETrequest)
    photobucketSource = photobucketPage.read()
    doc = html.document_fromstring(photobucketSource)
    getPics()  # defined in the next step
Now you should have the source of photobucket album page 1 saved in a document object named 'doc'. Once you have the document object you can start parsing to download the pictures; that is what the getPics() function will contain. It will also increase the page number by 1 once it has finished the images on a page.
The next step is to parse the HTML for every direct link, and Photobucket makes this very easy: if you look at the page source, the direct-link input for each thumbnail has its own unique ID!
Code:
<label for="urlcode1">Direct Link</label>
<input type="text" class="txtCode" name="urlcode1" id="urlcode1" value="http://i946.photobucket.com/albums/ad305/Leave-me-alone-plz2/Untitled-6.png" readonly="readonly" onclick="tr('album_thumb_link_code_click');trackCodeClick(event,'image','0'); copyToClipboard(this);">
So as you can see from the above code, each image on the page has an id of urlcode plus a number from 0-19. We can use lxml's elementObject.get_element_by_id(id) method to grab each element with the ID urlcode+COUNT, and then grab the value attribute from that element to get the picture's direct link. Once we have the direct link to the picture, we are free to download it using basic Python.
Code:
import os, urllib, lxml.html as html

downloadDirectory = raw_input("Enter local download directory:\n")
os.chdir(downloadDirectory)
photobucketURL = raw_input("Enter a photobucket URL:\n")
page = 1
total = 1

def getPics():
    global page, total
    IDcounter = 0
    while IDcounter < 20:
        try:
            imgID = "urlcode" + str(IDcounter)
            imgElement = doc.get_element_by_id(imgID)  # raises if the ID is missing
            imgLink = imgElement.attrib['value']
            picURL = urllib.urlopen(imgLink)
            rPic = picURL.read()
            # 'wb' because image data is binary (no file extension is added)
            dlSpot = open('image' + str(total), 'wb')
            dlSpot.write(rPic)
            dlSpot.close()
            IDcounter += 1
            total += 1
        except:
            print "Less than 20 pictures on page"
            break
    else:
        # while/else: runs only if the loop finished without break,
        # i.e. all 20 pictures on this page were downloaded
        page += 1
        getPage(page)
The above function getPics() uses the global 'doc' variable to parse the web page. For every direct link it finds, it downloads the image, then increases the ID number and the total number of images (the total variable exists only for naming the files, because the IDcounter variable gets reset on every page). It then moves on to the next ID and downloads that. If IDcounter is still less than 20 but the ID does not exist on the page, the lookup raises an exception, meaning there are no more pictures on the page. If it gets all the way through 20 images, it increases the page count by 1 and calls the getPage() function to grab the source of that page and save it into the doc variable.
Here is the whole code. It's very, very basic: if you enter something wrong you can assume it won't work, and it gives no sign of which image it is on. If you feel up to the challenge, add some print statements to show which image it is currently on, etc.; all the variables are there for you ^__^
Code:
import os, urllib, lxml.html as html

downloadDirectory = raw_input("Enter local download directory:\n")
os.chdir(downloadDirectory)
photobucketURL = raw_input("Enter a photobucket URL:\n")
page = 1
total = 1

def getPics():
    global page, total
    IDcounter = 0
    while IDcounter < 20:
        try:
            imgID = "urlcode" + str(IDcounter)
            imgElement = doc.get_element_by_id(imgID)  # raises if the ID is missing
            imgLink = imgElement.attrib['value']
            picURL = urllib.urlopen(imgLink)
            rPic = picURL.read()
            # 'wb' because image data is binary (no file extension is added)
            dlSpot = open('image' + str(total), 'wb')
            dlSpot.write(rPic)
            dlSpot.close()
            IDcounter += 1
            total += 1
        except:
            print "Less than 20 pictures on page"
            break
    else:
        # runs only if the loop finished without break (a full page of 20)
        page += 1
        getPage(page)

def getPage(pageNum):
    global doc
    # page 1 -> ?start=0, page 2 -> ?start=20, and so on
    GETrequest = "?start=" + str(pageNum * 20 - 20)
    photobucketPage = urllib.urlopen(photobucketURL + GETrequest)
    photobucketSource = photobucketPage.read()
    doc = html.document_fromstring(photobucketSource)
    getPics()

getPage(page)
Hopefully you found this tutorial useful ^__^