Parsing HTML with MSHTML
Well, this is fun.
On Windows, you can parse HTML using the MSHTML COM component.
    import urllib2
    import win32com.client

    # Fetch the page source
    data = urllib2.urlopen("http://www.python.org").read()

    # "htmlfile" is the ProgID of the MSHTML document object
    html = win32com.client.Dispatch("htmlfile")
    html.write(data)

    print "Title: %s" % html.title

    # Print the name and content of each <meta> element
    metas = html.getElementsByTagName("meta")
    for m in metas:
        print "%s: %s\n" % (m.name, m.content)

Output
Title: Python Programming Language -- Official Website
: text/html; charset=utf-8
keywords: python programming language object oriented web free source
description: Home page for Python, an interpreted, interactive, object-oriented, extensible
programming language. It provides an extraordinary combination of clarity and
versatility, and is free and comprehensively ported.
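
The script above is written for Python 2 (urllib2, print statement). For anyone trying this on Python 3, a rough port might look like the sketch below. It assumes pywin32 is installed and that falling back to UTF-8 is acceptable when the server sends no charset. The helper names fetch_html, parse_with_mshtml and summarize are made up for this example; only Dispatch("htmlfile"), write, title and getElementsByTagName come from the original script.

```python
import urllib.request


def fetch_html(url):
    """Download a page and decode it to text (UTF-8 assumed if no charset is sent)."""
    with urllib.request.urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def parse_with_mshtml(text):
    """Load HTML text into an MSHTML document object (Windows with pywin32 only)."""
    import win32com.client  # imported lazily so the module still loads elsewhere

    doc = win32com.client.Dispatch("htmlfile")
    doc.write(text)
    return doc


def summarize(doc):
    """Return a title line followed by one 'name: content' line per <meta> element."""
    lines = ["Title: %s" % doc.title]
    for m in doc.getElementsByTagName("meta"):
        lines.append("%s: %s" % (m.name, m.content))
    return lines
```

On a Windows machine, print("\n".join(summarize(parse_with_mshtml(fetch_html("http://www.python.org"))))) should behave much like the original script. The meta line with an empty name in the output above presumably comes from an http-equiv element, which has content but no name attribute.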