Monday, May 30, 2011

Simple Python HTML, XML DOM Parser

This parser is intended to be simple and easy to use. You can download a copy from GitHub. It handles the different ways of writing attribute values: unquoted (id=lorem), single-quoted (id='lorem'), and double-quoted (id="lorem"). I have been using this script for my current scraping project.
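
To illustrate the three attribute styles, here is a minimal sketch of how one regular expression can cover all of them. It is not the code from spp.py; the names ATTR_RE and parse_attrs are made up for this example.

```python
import re

# Hypothetical pattern (not taken from spp.py): matches name=value where the
# value is double-quoted, single-quoted, or unquoted.
ATTR_RE = re.compile(
    r"""([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))"""
)


def parse_attrs(tag_text):
    """Return a dict mapping attribute names to values for one opening tag."""
    attrs = {}
    for name, double_quoted, single_quoted, bare in ATTR_RE.findall(tag_text):
        # At most one of the three value groups is non-empty.
        attrs[name.lower()] = double_quoted or single_quoted or bare
    return attrs


print(parse_attrs('<a id=lorem class=\'x y\' href="/home">'))
```

The three alternatives in the pattern are tried in order, so a quoted value is never mistaken for an unquoted one.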


I had been using Python's minidom module, but it is no use for malformed XML and HTML. So I wrote this script using only regular expressions.
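
The minidom problem is easy to reproduce. The sketch below uses only the standard library; the sample markup and the helper name try_minidom are made up for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

well_formed = '<div id="test"><p>hello</p></div>'
# Typical scraped HTML: unquoted attribute value and unclosed tags.
malformed = '<div id=test><p>hello<br></div>'


def try_minidom(markup):
    """Return the root tag name, or None if minidom rejects the markup."""
    try:
        return minidom.parseString(markup).documentElement.tagName
    except ExpatError:
        return None


print(try_minidom(well_formed))
print(try_minidom(malformed))
```

minidom relies on a strict XML parser, so the second document is rejected outright instead of being parsed as far as possible.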

Usage:

Copy spp.py into your working directory and import the module.

 Instantiate the parser class by passing a URL or an XML/HTML string.

 To get the 'id' attribute of the first 'style' tag:
 1. spp.parser('http://www.google.com').getByTag('style').item(0).attr('id')

 To get the 'style' attribute of the node whose id is 'csi':
 2. spp.parser('http://www.google.com').getById('csi').attr('style')

 # doc is the HTML or XML string you want to parse

 3. spp.parser(doc).getById('test').attr('href')

 To get the src of the 4th image in the document (item() counts from 0, as in example 1):
 4. spp.parser(doc).getByTag('img').item(3).attr('src')

 To get the content of a node that has no children:
 5. spp.parser(doc).getById('test').innerText()
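
For readers curious how lookups like these can be done with regular expressions alone, here is a rough sketch. It is an assumption about the general approach, not the actual spp.py implementation, and the function names opening_tags and inner_text_by_id are hypothetical.

```python
import re


def opening_tags(doc, tag):
    """Return the text of every opening <tag ...> in document order."""
    return re.findall(r"<%s\b[^>]*>" % re.escape(tag), doc, re.IGNORECASE)


def inner_text_by_id(doc, node_id):
    """Return the text of the childless element with this id, or None.

    The id value may be unquoted, single-quoted, or double-quoted.
    Elements that contain child tags are not matched.
    """
    pattern = re.compile(
        r"""<(\w+)\b[^>]*\sid\s*=\s*["']?%s["']?(?=[\s>/])[^>]*>([^<]*)</\1\s*>"""
        % re.escape(node_id),
        re.IGNORECASE,
    )
    match = pattern.search(doc)
    return match.group(2).strip() if match else None


doc = "<div><a id=test href=\"/x\">Home</a><img src=a.png><img src='b.png'></div>"

print(opening_tags(doc, 'img')[0])    # first image tag, index 0
print(inner_text_by_id(doc, 'test'))
```

A regex-only approach like this tolerates unquoted attributes and unclosed tags, but it cannot follow nesting, which is why the inner-text lookup is limited to nodes without children.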

There are some nice parsing tools available for Python:
  1. Beautiful Soup
  2. lxml

1 comment:

  1. Wow, it works great. Perfect for lightweight work. Would you like to extend it and add features like getByClass, parent(), etc.?


About Me

Web Developer From Dhaka, Bangladesh.