This parser is intended to be simple and easy to use. You can download a copy from GitHub. The parser can detect the various forms of attribute syntax, such as id=lorem, id='lorem', and id="lorem". I have been using this script for my current scraping project.
I had been using the Python module minidom, but it is of no use for malformed XML and HTML, so I wrote this script using only regular expressions.
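To give an idea of how a regular expression can cope with all three attribute forms, here is a minimal sketch. It is not taken from spp.py; the names ATTR_RE and parse_attrs are made up for illustration, and the real script may well do this differently.

```python
import re

# Hypothetical pattern (not from spp.py): matches name=value where the value
# is double-quoted, single-quoted, or unquoted.
ATTR_RE = re.compile(
    r"""([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))"""
)

def parse_attrs(tag_text):
    """Return a dict of attribute names to values found in one opening tag."""
    attrs = {}
    for m in ATTR_RE.finditer(tag_text):
        name = m.group(1)
        # Exactly one of the three value groups takes part in each match.
        value = next(g for g in m.groups()[1:] if g is not None)
        attrs[name] = value
    return attrs

print(parse_attrs("""<div id=lorem class='a b' title="x y">"""))
```

Because each quoting style has its own alternative in the pattern, a value containing spaces is still captured whole when it is quoted, while an unquoted value stops at the first whitespace or '>'.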
Usage:
Copy spp.py into your working directory and import the module.
Instantiate the parser class by passing it a URL or an XML/HTML string.
To get the 'id' attribute of the first 'style' tag
spp.parser('http://www.google.com').getByTag('style').item(0).attr('id')
To get the 'style' attribute of the node whose id is 'csi'
spp.parser('http://www.google.com').getById('csi').attr('style')
# doc is the html or xml string you want to parse
spp.parser(doc).getById('test').attr('href')
To get the src of the 4th image in the document (assuming item() is zero-indexed, as item(0) above suggests)
spp.parser(doc).getByTag('img').item(3).attr('src')
To get the content of a node which has no child
spp.parser(doc).getById('test').innerText()
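For the curious, the last example can be approximated with a single regular expression. The following is only a sketch of the idea, not the code inside spp.py; the function name inner_text_by_id is hypothetical, and it assumes the node has no child elements and that the id is matched case-sensitively.

```python
import re

def inner_text_by_id(doc, node_id):
    """Return the text inside the first element whose id equals node_id, or None."""
    # Hypothetical pattern: opening tag with id=value (quoted or bare),
    # then the shortest content up to the matching closing tag.
    pattern = re.compile(
        r"""<([A-Za-z][\w:-]*)[^>]*?\sid\s*=\s*(["']?)%s\2(?=[\s>/])[^>]*>(.*?)</\1\s*>"""
        % re.escape(node_id),
        re.DOTALL,
    )
    m = pattern.search(doc)
    return m.group(3).strip() if m else None

doc = '<p>hi</p><a id=test href="/x">Click me</a>'
print(inner_text_by_id(doc, 'test'))
```

The backreference \2 makes the closing quote match the opening one, and the lookahead keeps 'test' from matching an id such as 'test2'. A pattern like this can still be fooled, for example by an id= that appears inside another attribute's value, so treat it as an illustration only.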
There are some nice parsing tools for python available.
Wow, it works great. Perfect for lightweight work. Would you like to extend it and add features like getByClass, parent(), etc.?