Support Forums

Full Version: Parsing Webpages in Python
Tutorial on parsing HTML (and XML) in Python:


What is a parser? (From Wikipedia)
The most common use of a parser is as a component of a compiler or interpreter. This parses the source code of a computer programming language to create some form of internal representation.

What is the point of this tutorial?
Learning how to grab any piece of information you want off any webpage, efficiently and easily.


Section 1 - Basics of HTML tags, attributes & difference from XML

Section 2 - Source from a webpage with Python

Section 3 - Basics of the LXML module, and the DOM

Section 4 - Using what you learned
This is coming along nicely so far, Nyx-, just tell me when you're finished and I'll take another read. I'm itching to learn Python now.
Can't wait to see it finished...
I was looking for this and you're going straight into my direction, thanks!

Will be checking this later.. to see if you have it finished ;)
Awesome :D

I never really touched any parsers and just stuck with regex and logic lol, but this seems really handy ^_^
(01-20-2010, 04:05 PM)Master of The Universe Wrote: Can't wait to see it finished...
I was looking for this and you're going straight into my direction, thanks!

Will be checking this later.. to see if you have it finished ;)

Hehe, awesome. Yeah, it's finished now. PLEASE PLEASE PLEASE read it and tell me if there are any parts I need to go into more detail on, or parts that have typos, are ugly, etc., so I can try and improve it ^__^. Having a web developer check whether my tutorial is understandable would be perfect, at least for the web development parts.
Ahh just pure knowledge, it's really easy to understand and I also don't see any kind of mistakes... for now ;)
I was following your tutorial, then I changed urllib to urllib2, now a question.... when does it matter which lib I use...?

Also I do it like Fallen and use regex, idk... it's better imo... ;)
I'm writing a little script here that I'll only be able to finish because of this tutorial, I'll post it later....

Thanks for the share....
(01-23-2010, 08:20 PM)Master of The Universe Wrote: Ahh just pure knowledge, it's really easy to understand and I also don't see any kind of mistakes... for now ;)
I was following your tutorial, then I changed urllib to urllib2, now a question.... when does it matter which lib I use...?

Also I do it like Fallen and use regex, idk... it's better imo... ;)
I'm writing a little script here that I'll only be able to finish because of this tutorial, I'll post it later....

Thanks for the share....

Uhh, I know they have some different functions etc., but for just opening up a link to read I don't think there would be much difference. You can look at the urllib and urllib2 library pages and see what they added on, but I couldn't say for sure tbh.
Yeah, I'm looking at both in the docs... I'm not sure, but it looks like urllib2 has functions for cookie handling and such things... idk, just started looking at it :D

But so far, your tut helped me understand the basics, I'm just now playing with different sites than just PB... :P
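
A note for anyone reading this on Python 3: urllib2 does not exist there, and what it offered (Request objects, openers, cookie handling) now lives in urllib.request, with the cookie jar in http.cookiejar. For simply opening a link and reading the source, the choice made little difference in Python 2; urllib2 mattered once you wanted custom headers or cookies. Below is a minimal sketch of that, assuming Python 3. The helper names make_cookie_opener and fetch and the user agent string are made up for illustration and are not from the tutorial.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return (opener, jar). The opener stores cookies in jar and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch(url, opener=None, user_agent="parsing-tutorial-example/0.1"):
    """Download url and return its body decoded to text.

    The user_agent default is a hypothetical value; many sites expect one.
    """
    if opener is None:
        # A plain opener with the default handlers (no cookie storage).
        opener = urllib.request.build_opener()
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with opener.open(request, timeout=10) as response:
        # Use the charset the server declared, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Typical use would be something like `opener, jar = make_cookie_opener()` followed by `source = fetch("http://example.com/", opener)`; reusing the same opener for later requests is what lets a login cookie carry over between pages.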
(01-24-2010, 05:07 PM)Master of The Universe Wrote: Yeah, I'm looking at both in the docs... I'm not sure, but it looks like urllib2 has functions for cookie handling and such things... idk, just started looking at it :D

But so far, your tut helped me understand the basics, I'm just now playing with different sites than just PB... :P

Yeah, hopefully you can help me with better "global" gallery rules ^_^, a.k.a. a better way to grab gallery pictures from more than just a single site. What I did was parse for every link that ended in .jpg, .gif, .png, etc., then add it to a list of direct links, skipping any link that was already in the list. What's even harder is figuring out a good global rule for going to the next page in a gallery. On PB I use a GET request, which is terrible since I doubt the GET parameters are going to be even similar across sites. I'd have to think of a decent parsing rule to grab a "next page" button or something.
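
The rule described above (collect every link ending in an image extension, skip duplicates, then look for a "next page" link) could be sketched roughly as follows. This is only an illustration, not the code from the tutorial: it assumes Python 3 and uses the standard library's html.parser instead of lxml so it runs without extra installs. The names GalleryParser and find_gallery_links are made up, and the "next page" heuristic (a rel="next" attribute, or link text starting with "next") is a guess that will not fit every site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".png")


class GalleryParser(HTMLParser):
    """Collect unique direct image links and a candidate next-page link."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_links = []      # direct image URLs, in page order
        self._seen = set()         # used to skip duplicates
        self.next_page = None      # first link that looks like "next page"
        self._open_href = None     # href of the <a> currently being read
        self._open_text = []       # text pieces inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "a" or not attrs.get("href"):
            return
        url = urljoin(self.base_url, attrs["href"])  # resolve relative links
        # Compare only the path, so "pic.png?size=large" still counts.
        if urlparse(url).path.lower().endswith(IMAGE_EXTENSIONS):
            if url not in self._seen:
                self._seen.add(url)
                self.image_links.append(url)
        # Heuristic 1: the page marks its pagination with rel="next".
        if self.next_page is None and "next" in (attrs.get("rel") or "").lower().split():
            self.next_page = url
        self._open_href, self._open_text = url, []

    def handle_data(self, data):
        if self._open_href is not None:
            self._open_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open_href is not None:
            text = "".join(self._open_text).strip().lower()
            # Heuristic 2: the visible link text starts with "next".
            if self.next_page is None and text.startswith("next"):
                self.next_page = self._open_href
            self._open_href = None


def find_gallery_links(html, base_url):
    """Return (list of unique image URLs, next-page URL or None)."""
    parser = GalleryParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.image_links, parser.next_page
```

Looking for a link in the page rather than building a GET request by hand means the code does not need to know each site's query parameters, which is the problem with the PB approach.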
Ahh I know what you mean now...
I can think of a way maybe, but I'll check it out first...
I will go through a few HTML sources along with the PB ones and look for any similar code...