Monday, May 30, 2011

Simple Python HTML, XML DOM Parser

This parser is intended to be simple and easy to use. You can download a copy from GitHub. The parser recognizes the various ways attribute values are written, such as id=lorem, id='lorem', and id="lorem". I have been using this script for my current scraping project.
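To make the three attribute forms concrete, here is a minimal sketch of a regular expression that accepts double-quoted, single-quoted, and unquoted values. The names `ATTR_RE` and `parse_attributes` are hypothetical and chosen for illustration; they are not taken from the parser's source.

```python
import re

# Hypothetical pattern (an assumption, not the parser's own code):
# matches name=value where the value is double-quoted, single-quoted, or bare.
ATTR_RE = re.compile(
    r"""([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))"""
)

def parse_attributes(tag_text):
    """Return a dict mapping attribute names to values found in one tag."""
    attrs = {}
    for name, double_quoted, single_quoted, bare in ATTR_RE.findall(tag_text):
        # Exactly one of the three value groups is non-empty for each match
        # (or all are empty when the value itself is empty).
        attrs[name] = double_quoted or single_quoted or bare
    return attrs

print(parse_attributes("""<div id=lorem class='a b' title="x y">"""))
# {'id': 'lorem', 'class': 'a b', 'title': 'x y'}
```

A pattern like this is deliberately forgiving: it does not care whether the surrounding markup is well formed, only that a name is followed by `=` and a value.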

I had been using the Python module minidom, but it is of no use for malformed XML and HTML, so I wrote this script using only regular expressions.
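The minidom limitation is easy to reproduce with the standard library alone. The helper name `minidom_can_parse` and the sample strings below are made up for this illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def minidom_can_parse(markup):
    """Return True if minidom accepts the markup, False if it rejects it."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        # minidom relies on a strict XML parser, so any well-formedness
        # error (unclosed tag, unquoted attribute) ends up here.
        return False

well_formed = "<p>Hello <b>world</b></p>"
malformed = "<p>Hello <b>world</p>"  # <b> is never closed

print(minidom_can_parse(well_formed))  # True
print(minidom_can_parse(malformed))    # False
```

Real-world pages are full of markup like the second string, which is why a strict XML parser is a poor fit for scraping.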


Copy the script into your working directory and import the package.

 Instantiate the parser class by providing the URL or the XML/HTML string
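As a rough sketch of how one entry point can accept either a URL or a markup string, a small helper might look like the following. The name `load_markup` and its `timeout` parameter are hypothetical; this is not the parser's actual constructor.

```python
import re
from urllib.request import urlopen

def load_markup(source, timeout=10):
    """Return markup text: fetched if source is an http(s) URL, else source itself."""
    candidate = source.strip()
    if re.match(r"https?://", candidate, re.IGNORECASE):
        with urlopen(candidate, timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    # Anything that does not look like a URL is treated as markup.
    return source

doc = load_markup("<p id=intro>Hello</p>")
```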

 To get the 'id' attribute of the 'style' tag
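A regex-only approach to this lookup could be sketched as below. The function `get_attribute` is a hypothetical stand-in, not the parser's own method, and it assumes no attribute value contains a `>` character.

```python
import re

def get_attribute(doc, tag, attr):
    """Return the value of attr on the first <tag> in doc, or None."""
    # Step 1: find the first opening tag with the requested name.
    tag_match = re.search(r"<%s\b[^>]*>" % re.escape(tag), doc, re.IGNORECASE)
    if tag_match is None:
        return None
    # Step 2: find attr inside that tag; the lookbehind keeps "id" from
    # matching the tail of a longer name such as "data-id".
    attr_match = re.search(
        r"""(?<![\w:-])%s\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))"""
        % re.escape(attr),
        tag_match.group(0),
        re.IGNORECASE,
    )
    if attr_match is None:
        return None
    # Only one of the three value groups takes part in a match.
    return next(g for g in attr_match.groups() if g is not None)

doc = '<html><style id=main type="text/css">p{}</style></html>'
print(get_attribute(doc, "style", "id"))  # main
```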

 To get the attribute of a node whose id is 'csi'
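Looking a node up by its id can be sketched by scanning the opening tags and comparing each one's id value. The names `TAG_RE`, `ATTR_RE`, and `attributes_by_id` are assumptions made for this example, not the parser's API.

```python
import re

# Hypothetical patterns: one for opening tags, one for name=value pairs.
TAG_RE = re.compile(r"<[a-zA-Z][^>]*>")
ATTR_RE = re.compile(r"""([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""")

def attributes_by_id(doc, node_id):
    """Return the attribute dict of the first tag whose id equals node_id, or None."""
    for tag in TAG_RE.findall(doc):
        attrs = {
            name.lower(): dq or sq or bare
            for name, dq, sq, bare in ATTR_RE.findall(tag)
        }
        if attrs.get("id") == node_id:
            return attrs
    return None

doc = "<div id='csi' class=box title=\"Main panel\">x</div>"
attrs = attributes_by_id(doc, "csi")
print(attrs["class"])  # box
```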


 # doc is the html or xml string you want to parse


 To get the src of the 4th image in the document
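Picking the n-th occurrence of a tag can be done lazily with `re.finditer` and `itertools.islice`. This is only a sketch; `nth_image_src` is a hypothetical helper, and `n` is 1-based and must be at least 1.

```python
import re
from itertools import islice

IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
SRC_RE = re.compile(
    r"""(?<![\w:-])src\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""",
    re.IGNORECASE,
)

def nth_image_src(doc, n):
    """Return the src of the n-th <img> tag (1-based), or None if there is none."""
    # Skip the first n - 1 matches and take the next one, if any.
    tag = next(islice(IMG_RE.finditer(doc), n - 1, None), None)
    if tag is None:
        return None
    src = SRC_RE.search(tag.group(0))
    if src is None:
        return None
    return next(g for g in src.groups() if g is not None)

doc = (
    "<img src=a.png><img src=\"b.png\"><IMG SRC='c.png'>"
    "<img alt=x src=\"d.png\"><img src=e.png>"
)
print(nth_image_src(doc, 4))  # d.png
```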

 To get the content of a node that has no children
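For a node without children, the content is just the text between the opening and closing tag, so a pattern that forbids `<` inside the body is enough. The function `leaf_content` below is a hypothetical sketch, not the parser's own method.

```python
import re

def leaf_content(doc, tag):
    """Return the text of the first <tag>...</tag> with no child tags, or None."""
    # [^<]* refuses any nested tag, so only childless nodes can match.
    pattern = r"<%s\b[^>]*>([^<]*)</%s\s*>" % (re.escape(tag), re.escape(tag))
    match = re.search(pattern, doc, re.IGNORECASE)
    return match.group(1).strip() if match else None

doc = "<div><p>intro <b>bold</b></p><p class=note> plain text </p></div>"
print(leaf_content(doc, "p"))  # plain text
```

The first `<p>` is skipped because it contains a `<b>` child; the second one has only text, so it is the one returned.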

There are also some nice parsing tools available for Python:
  1. Beautiful Soup
  2. lxml

1 comment:

  1. Wow, it works great. Perfect for lightweight work. Would you like to extend it with features like getByClass, parent(), etc.?


About Me

Web Developer From Dhaka, Bangladesh.