Python Web Scraper with BeautifulSoup

You may have wondered how some sites display data that you have already seen on a few other sites. Often, they have scraped the data from the original source and reused it within their own application.

There are countless HTML (or SGML, or XML) parsing libraries across many languages, but we will use a Python library called BeautifulSoup, which takes care of almost all of the work for you. The BeautifulSoup library is an extremely helpful tool to have at your disposal, since it not only gives you functions to search and modify your parse tree but also handles the broken and malformed HTML you’re likely to encounter on the average Web page.

You can download the library from its Web page. It is also available in popular software repositories, such as the apt repositories used by the Debian and Ubuntu distributions.

We’ll write a Web scraper that prints all the displayed text contained within <p> tags. This is a very simple implementation that is easy to trip up, but it should be enough to demonstrate how using the library works.

First up, we need to retrieve the source of the page that we want to scrape. The following code will take an address given on the command line and put the contents into the variable html:
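The original listing is not reproduced here; a minimal sketch using Python 3's standard-library `urllib` (the exact imports, and the helper name `fetch`, are assumptions, since the article likely predates Python 3) might look like this:

```python
import sys
from urllib.request import urlopen

def fetch(url):
    """Download a page and decode the raw bytes into a string."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

if __name__ == "__main__" and len(sys.argv) > 1:
    # Take the address from the command line, e.g.:
    #   python scraper.py http://example.com/
    html = fetch(sys.argv[1])
```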

Then we need to build a parse tree using BeautifulSoup:
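With the modern `bs4` package (the original article would have used the older `BeautifulSoup` module directly; the sample HTML string here is just for illustration), building the tree is one call:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hello, <b>world</b>!</p></body></html>"

# Build the parse tree; "html.parser" is Python's built-in parser
soup = BeautifulSoup(html, "html.parser")
```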

At this point the source has already been cleaned up and converted to Unicode by the BeautifulSoup library; you can print soup.prettify() to get a clean dump of it.
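For example, prettify() can be tried on any small document (the sample markup below is just for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hi<b>there</b></p>", "html.parser")

# prettify() returns the cleaned-up tree, indented one tag per line
print(soup.prettify())
```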

Instead, what we want is to print all of the text, without the tags, so we need to find out which parts of the parse tree are text. In BeautifulSoup there are two kinds of nodes in the parse tree: plain text is represented by the NavigableString class, whereas Tags hold mark-up. Tags are recursive structures: they can hold many children, each of which is either another Tag or a NavigableString.
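You can see the two node kinds by iterating over a tag's children (the sample markup is just for illustration):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup("<p>plain text <b>bold</b></p>", "html.parser")

for child in soup.p.children:
    # Each child is either a NavigableString (text) or a Tag (markup)
    print(type(child).__name__, repr(str(child)))
```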

We want to write a recursive function that takes part of the tree: if it is a NavigableString, print it out; otherwise, run the function again on each subtree. Because we can iterate over a tag’s children simply by referring to the tag itself, this is easy.
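The recursive walk described above might be sketched like this (the function name print_text is an assumption, not the article's original):

```python
from bs4.element import NavigableString

def print_text(node):
    """Print the text of a subtree: strings directly, tags recursively."""
    if isinstance(node, NavigableString):
        print(node, end="")
    else:
        for child in node.children:
            print_text(child)
```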

Then we just need to run that function on all the <p> tags. We can use BeautifulSoup’s built-in parse tree searching functions to retrieve all of them:
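In the modern bs4 API this is find_all (spelled findAll in older BeautifulSoup versions); the sample markup here is just for illustration:

```python
from bs4 import BeautifulSoup

html = "<p>First.</p><div><p>Second.</p></div>"
soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag anywhere in the tree
for tag in soup.find_all("p"):
    print(tag.get_text())
```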

That’s it. You’ve got a fully functioning, if basic, HTML scraper. For more help with searching the parse tree, look up the BeautifulSoup documentation.

The full code for this example is as follows:
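The original listing is not reproduced here; a sketch combining all of the steps above, assuming Python 3, the bs4 package, and hypothetical helper names (print_text, scrape), might look like this:

```python
import sys
from urllib.request import urlopen

from bs4 import BeautifulSoup
from bs4.element import NavigableString

def print_text(node):
    """Recursively print the displayed text of a subtree."""
    if isinstance(node, NavigableString):
        print(node, end="")
    else:
        for child in node.children:
            print_text(child)

def scrape(html):
    """Print the text of every <p> tag in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("p"):
        print_text(tag)
        print()  # newline after each paragraph

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scraper.py http://example.com/
    page = urlopen(sys.argv[1]).read().decode("utf-8", errors="replace")
    scrape(page)
```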

Leave me a comment and let me hear your opinion. If you’ve got any thoughts or suggestions for things we could add, let me know! Also, please subscribe to our RSS feed for the latest tips, tricks and examples on cutting-edge stuff.
