SIMPLE_THOUGHTS: Simple Python HTML,XML DOM Parser

This parser is intended to be simple and easy to use. You can download a copy from GitHub. The parser recognizes attribute values in any of the common forms: id=lorem, id='lorem', and id="lorem". I have been using this script for my current scraping project.
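The three quoting forms can be matched with a single regular expression. What follows is only a minimal sketch of the idea, not the code from spp.py; the names ATTR_RE and parse_attrs are hypothetical and chosen for illustration.

```python
import re

# Hypothetical pattern (not taken from spp.py): name=value, where the value is
# double-quoted, single-quoted, or a bare token without whitespace, quotes or '>'.
ATTR_RE = re.compile(
    r"""([A-Za-z_:][\w:.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))"""
)


def parse_attrs(tag_text):
    """Return a dict mapping attribute names to values found in one opening tag."""
    attrs = {}
    for name, double_quoted, single_quoted, bare in ATTR_RE.findall(tag_text):
        # Exactly one of the three value groups is non-empty for each match.
        attrs[name.lower()] = double_quoted or single_quoted or bare
    return attrs


for tag in ("<div id=lorem>", "<div id='lorem'>", '<div id="lorem">'):
    print(tag, parse_attrs(tag))
```

Trying the alternatives in this order lets a quoted value contain spaces, while a bare value stops at the first whitespace or at the closing '>'.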

I had been using the Python module minidom, but it is of no use for malformed XML and HTML, so I wrote this script using only regular expressions.
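To see the problem with minidom, here is a small check using only the standard library. The helper name parses_with_minidom and the two sample strings are made up for this illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parses_with_minidom(markup):
    """Return True if minidom accepts the markup, False if it raises a parse error."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


# Illustrative inputs: the second one has an unquoted attribute and an unclosed
# <br>, which is common in real-world HTML but is not well-formed XML.
well_formed = "<p id='a'>hello</p>"
malformed = "<p id=a>hello<br></p>"

print(parses_with_minidom(well_formed))
print(parses_with_minidom(malformed))
```

minidom is a strict XML parser, so it rejects the second string outright instead of recovering the way a browser would.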

Usage:

Copy spp.py into your working directory and import the module.

Instantiate the parser class by passing it either a URL or an XML/HTML string.

1. To get the 'id' attribute of the first 'style' tag:

    spp.parser('http://www.google.com').getByTag('style').item(0).attr('id')

2. To get the 'style' attribute of the node whose id is 'csi':

    spp.parser('http://www.google.com').getById('csi').attr('style')

3. To get the 'href' attribute of the node whose id is 'test' (doc is the HTML or XML string you want to parse):

    spp.parser(doc).getById('test').attr('href')

4. To get the src of the fourth image in the document (item() counts from 0, as in the first example):

    spp.parser(doc).getByTag('img').item(3).attr('src')

5. To get the content of a node which has no child:

    spp.parser(doc).getById('test').innerText()
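For readers curious how a lookup like this can work with regular expressions alone, here is a rough stand-in. It is not the implementation in spp.py: the function get_by_id, the pattern names, and the sample markup (including the example.com URL) are all assumptions made for illustration, and it only handles a node with no child elements.

```python
import re

# Hypothetical patterns, not taken from spp.py.
OPEN_TAG_RE = re.compile(r"<(\w+)\b([^>]*)>")
ATTR_RE = re.compile(
    r"""([A-Za-z_:][\w:.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))"""
)


def get_by_id(doc, node_id):
    """Return (attrs, inner_text) for the first element with this id, else None."""
    for tag in OPEN_TAG_RE.finditer(doc):
        attrs = {
            name.lower(): dq or sq or bare
            for name, dq, sq, bare in ATTR_RE.findall(tag.group(2))
        }
        if attrs.get("id") != node_id:
            continue
        # Take everything up to the next closing tag of the same name.
        # This is only correct when the node has no child elements.
        close = re.compile(r"</%s\s*>" % re.escape(tag.group(1)), re.I)
        end = close.search(doc, tag.end())
        inner_text = doc[tag.end():end.start()] if end else ""
        return attrs, inner_text
    return None


sample = '<div><a id=test href="http://example.com/x">link text</a></div>'
attrs, text = get_by_id(sample, "test")
print(attrs["href"], text)
```

Scanning opening tags first, and only then searching for the matching close tag, means the target node is still found when it is nested inside other elements.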

There are also some nice parsing tools available for Python.

Monday, May 30, 2011
