Parsing HTML with MSHTML

Now this is fun.
On Windows, you can actually parse HTML using the MSHTML COM component.

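# Python 2 on Windows; requires the pywin32 package.
import win32com.client
import urllib2

# Fetch the raw HTML of the page.
data = urllib2.urlopen("http://www.python.org").read()

# "htmlfile" is the ProgID of the MSHTML document object.
html = win32com.client.Dispatch("htmlfile")

# Feed the markup to MSHTML, which parses it into a DOM.
html.write(data)

print "Title: %s" % html.title
metas = html.getElementsByTagName("meta")

# Print the name and content of each <meta> element.
for m in metas:
    print "%s: %s\n" % (m.name, m.content)

Output

Title: Python Programming Language -- Official Website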
: text/html; charset=utf-8

keywords: python programming language object oriented web free source

description: Home page for Python, an interpreted, interactive, object-oriented, extensible
programming language. It provides an extraordinary combination of clarity and
versatility, and is free and comprehensively ported.
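
For comparison, here is a rough sketch of the same title-and-meta extraction without MSHTML, for machines where the COM component is not available. It is not part of the original experiment: it assumes Python 3 and uses the standard library's html.parser instead. The names TitleMetaParser and parse_title_and_metas, as well as the sample markup, are made up for illustration.

```python
from html.parser import HTMLParser


class TitleMetaParser(HTMLParser):
    """Hypothetical helper: collects <title> text and <meta> name/content pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.metas = []  # list of (name, content) tuples
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr_map = dict(attrs)
            # A missing name or content becomes an empty string, which mirrors
            # the blank name printed for the first meta tag in the output above.
            self.metas.append(
                (attr_map.get("name") or "", attr_map.get("content") or "")
            )

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is kept.
        if self._in_title:
            self.title += data


def parse_title_and_metas(markup):
    """Return (title, [(name, content), ...]) for an HTML string."""
    parser = TitleMetaParser()
    parser.feed(markup)
    parser.close()
    return parser.title.strip(), parser.metas


# Made-up sample markup, so the sketch runs without network access.
sample = """<html><head>
<title>Sample Page</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta name="keywords" content="python parsing">
</head><body></body></html>"""

title, metas = parse_title_and_metas(sample)
print("Title: %s" % title)
for name, content in metas:
    print("%s: %s" % (name, content))
```

Unlike MSHTML, html.parser does not build a DOM. It only fires callbacks as it scans the markup, so anything beyond simple extraction like this needs more bookkeeping.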