Dive Into Python/Extracting data from HTML documents
To extract data from HTML documents, subclass the SGMLParser class and define methods for each tag or entity you want to capture.
The first step to extracting data from an HTML document is getting some HTML. If you have some HTML lying around on your hard drive, you can use file functions to read it, but the real fun begins when you get HTML from live web pages.
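For the "HTML lying around on your hard drive" case, ordinary file functions are all you need. The sketch below (Python 3) writes a throwaway scrap of HTML to a temporary file just so the example is self-contained; with a real file you would skip straight to the `open` call. The file contents here are invented for the demo:

```python
# The on-disk case needs no network code at all -- plain file I/O is enough.
import os
import tempfile

html = "<html><head><title>Dive Into Python</title></head></html>"

# Create a throwaway file to stand in for HTML you already have on disk.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w") as f:
    f.write(html)

# Reading it back is ordinary file I/O, nothing urllib-specific.
with open(path) as f:
    htmlSource = f.read()
os.remove(path)

print(htmlSource)
```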
'''Example 8.5. Introducing urllib'''
 >>> import urllib
 >>> sock = urllib.urlopen("http://diveintopython.org/")
 >>> htmlSource = sock.read()
 >>> sock.close()
 >>> print htmlSource
 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head>
<meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'>
<title>Dive Into Python</title>
<link rel='stylesheet' href='diveintopython.css' type='text/css'>
<link rev='made' href='mailto:mark@diveintopython.org'>
<meta name='keywords' content='Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free'>
<meta name='description' content='a free Python tutorial for experienced programmers'>
</head>
<body bgcolor='white' text='black' link='#0000FF' vlink='#840084' alink='#0000FF'>
<table cellpadding='0' cellspacing='0' border='0' width='100%'>
<tr><td class='header' width='1%' valign='top'>diveintopython.org</td>
<td width='99%' align='right'><hr size='1' noshade></td></tr>
<tr><td class='tagline' colspan='2'>Python for experienced programmers</td></tr>
 [...]
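This session uses the Python 2 urllib module, which no longer exists in that form. On Python 3 the same call lives in urllib.request, and read() returns bytes rather than a string. The sketch below is an assumed translation, and it substitutes a data: URL for the live site so it runs offline; pass a real http:// URL to fetch an actual page:

```python
# Python 3 equivalent of Example 8.5: urllib.urlopen became
# urllib.request.urlopen, and read() now returns bytes instead of str.
from urllib.request import urlopen

# A data: URL stands in for http://diveintopython.org/ so this runs offline.
url = "data:text/html,<html><title>Dive%20Into%20Python</title></html>"

sock = urlopen(url)                       # file-like object, as in Python 2
htmlSource = sock.read().decode("utf-8")  # bytes -> str
sock.close()
print(htmlSource)
```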
'''Example 8.6. Introducing urllister.py'''
If you have not already done so, you can download this and other examples used in this book.
 from sgmllib import SGMLParser

 class URLLister(SGMLParser):
     def reset(self):                               #(1)
         SGMLParser.reset(self)
         self.urls = []

     def start_a(self, attrs):                      #(2)
         href = [v for k, v in attrs if k=='href']  #(3) (4)
         if href:
             self.urls.extend(href)
(1) reset is called by the __init__ method of SGMLParser, and it can also be called manually once an instance of the parser has been created. So if you need to do any initialization, do it in reset, not in __init__, so that it will be re-initialized properly when someone re-uses a parser instance.
(2) start_a is called by SGMLParser whenever it finds an <a> tag. The tag may contain an href attribute, and/or other attributes, like name or title. The attrs parameter is a list of tuples, [(attribute, value), (attribute, value), ...]. Or it may be just an <a>, a valid (if useless) HTML tag, in which case attrs would be an empty list.
(3) You can find out whether this <a> tag has an href attribute with a simple multi-variable list comprehension.
(4) String comparisons like k=='href' are always case-sensitive, but that's safe in this case, because SGMLParser converts attribute names to lowercase while building attrs.
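The sgmllib module was removed in Python 3, so the class above will not run on a current interpreter. The standard-library html.parser module covers the same ground; the class below is an assumed Python 3 translation of URLLister, not code from the book. Note that html.parser dispatches every start tag to a single handle_starttag method instead of per-tag methods like start_a:

```python
# Assumed Python 3 translation of URLLister (sgmllib -> html.parser).
from html.parser import HTMLParser

class URLLister(HTMLParser):
    def reset(self):
        # As with SGMLParser, reset is called by __init__ and can be called
        # again to re-use a parser instance, so per-document state lives here.
        HTMLParser.reset(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs has the same [(attribute, value), ...] shape as in sgmllib,
        # with tag and attribute names already lowercased.
        if tag == 'a':
            href = [v for k, v in attrs if k == 'href']
            if href:
                self.urls.extend(href)

parser = URLLister()
parser.feed('<a href="toc/index.html">TOC</a> <a name="top">no href</a>')
parser.close()
print(parser.urls)  # → ['toc/index.html']
```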
'''Example 8.7. Using urllister'''
 >>> import urllib, urllister
 >>> usock = urllib.urlopen("http://diveintopython.org/")
 >>> parser = urllister.URLLister()
 >>> parser.feed(usock.read())          #(1)
 >>> usock.close()                      #(2)
 >>> parser.close()                     #(3)
 >>> for url in parser.urls: print url  #(4)
 toc/index.html
 #download
 #languages
 toc/index.html
 appendix/history.html
 download/diveintopython-html-5.0.zip
 download/diveintopython-pdf-5.0.zip
 download/diveintopython-word-5.0.zip
 download/diveintopython-text-5.0.zip
 download/diveintopython-html-flat-5.0.zip
 download/diveintopython-xml-5.0.zip
 download/diveintopython-common-5.0.zip
 ... rest of output omitted for brevity ...
(1) Call the feed method, defined in SGMLParser, to get HTML into the parser.[1] It takes a string, which is what usock.read() returns.
(2) Like files, you should close your URL objects as soon as you're done with them.
(3) You should close your parser object, too, but for a different reason. You've read all the data and fed it to the parser, but the feed method isn't guaranteed to have actually processed all the HTML you give it; it may buffer it, waiting for more. Be sure to call close to flush the buffer and force everything to be fully parsed.
(4) Once the parser is closed, the parsing is complete, and parser.urls contains a list of all the linked URLs in the HTML document. (Your output may look different, if the download links have been updated by the time you read this.)
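The captured URLs are copied out of the href attributes verbatim, so most of them are relative (toc/index.html, #download, and so on). If you need absolute URLs, each one can be resolved against the address of the page it came from; a small sketch using the standard-library urljoin function (urllib.parse on Python 3, the urlparse module on Python 2):

```python
# Resolve relative links against the base URL of the page they came from.
from urllib.parse import urljoin

base = "http://diveintopython.org/"
urls = ["toc/index.html", "#download", "appendix/history.html"]

absolute = [urljoin(base, url) for url in urls]
for url in absolute:
    print(url)
# http://diveintopython.org/toc/index.html
# http://diveintopython.org/#download
# http://diveintopython.org/appendix/history.html
```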
Footnotes