Dive Into Python/Extracting data from HTML documents
To extract data from HTML documents, subclass the SGMLParser class and define methods for each tag or entity you want to capture.
The first step to extracting data from an HTML document is getting some HTML. If you have some HTML lying around on your hard drive, you can use file functions to read it, but the real fun begins when you get HTML from live web pages.
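For the "HTML lying around on your hard drive" case, ordinary file functions are all you need. The sketch below (Python 3) writes a throwaway scrap of HTML to a temporary file just so the example is self-contained; with a real file you would skip straight to the `open` call. The file contents here are invented for the demo:

```python
# The on-disk case needs no network code at all -- plain file I/O is enough.
import os
import tempfile

html = "<html><head><title>Dive Into Python</title></head></html>"

# Create a throwaway file to stand in for HTML you already have on disk.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w") as f:
    f.write(html)

# Reading it back is ordinary file I/O, nothing urllib-specific.
with open(path) as f:
    htmlSource = f.read()
os.remove(path)

print(htmlSource)
```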
'''Example 8.5. Introducing urllib'''
 >>> import urllib
 >>> sock = urllib.urlopen("http://diveintopython.org/")
 >>> htmlSource = sock.read()
 >>> sock.close()
 >>> print htmlSource
 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head>
<meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'>
<title>Dive Into Python</title>
<link rel='stylesheet' href='diveintopython.css' type='text/css'>
<link rev='made' href='mailto:mark@diveintopython.org'>
<meta name='keywords' content='Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free'>
<meta name='description' content='a free Python tutorial for experienced programmers'>
</head>
<body bgcolor='white' text='black' link='#0000FF' vlink='#840084' alink='#0000FF'>
<table cellpadding='0' cellspacing='0' border='0' width='100%'>
<tr><td class='header' width='1%' valign='top'>diveintopython.org</td>
<td width='99%' align='right'><hr size='1' noshade></td></tr>
<tr><td class='tagline' colspan='2'>Python for experienced programmers</td></tr>
 [...]
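This session uses the Python 2 urllib module, which no longer exists in that form. On Python 3 the same call lives in urllib.request, and read() returns bytes rather than a string. The sketch below is an assumed translation, and it substitutes a data: URL for the live site so it runs offline; pass a real http:// URL to fetch an actual page:

```python
# Python 3 equivalent of Example 8.5: urllib.urlopen became
# urllib.request.urlopen, and read() now returns bytes instead of str.
from urllib.request import urlopen

# A data: URL stands in for http://diveintopython.org/ so this runs offline.
url = "data:text/html,<html><title>Dive%20Into%20Python</title></html>"

sock = urlopen(url)                       # file-like object, as in Python 2
htmlSource = sock.read().decode("utf-8")  # bytes -> str
sock.close()
print(htmlSource)
```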
'''Example 8.6. Introducing urllister.py'''
If you have not already done so, you can download this and other examples used in this book.
 from sgmllib import SGMLParser

 class URLLister(SGMLParser):
     def reset(self):                               #(1)
         SGMLParser.reset(self)
         self.urls = []

     def start_a(self, attrs):                      #(2)
         href = [v for k, v in attrs if k=='href']  #(3) (4)
         if href:
             self.urls.extend(href)
(1) reset is called by the __init__ method of SGMLParser, and it can also be called manually once an instance of the parser has been created. So if you need to do any initialization, do it in reset, not in __init__, so that it will be re-initialized properly when someone re-uses a parser instance.
(2) start_a is called by SGMLParser whenever it finds an <a> tag. The tag may contain an href attribute, and/or other attributes, like name or title. The attrs parameter is a list of tuples, [(attribute, value), (attribute, value), ...]. Or it may be just an <a>, a valid (if useless) HTML tag, in which case attrs would be an empty list.
(3) You can find out whether this <a> tag has an href attribute with a simple multi-variable list comprehension.
(4) String comparisons like k=='href' are always case-sensitive, but that's safe in this case, because SGMLParser converts attribute names to lowercase while building attrs.
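The sgmllib module was removed in Python 3, so the class above will not run on a current interpreter. The standard-library html.parser module covers the same ground; the class below is an assumed Python 3 translation of URLLister, not code from the book. Note that html.parser dispatches every start tag to a single handle_starttag method instead of per-tag methods like start_a:

```python
# Assumed Python 3 translation of URLLister (sgmllib -> html.parser).
from html.parser import HTMLParser

class URLLister(HTMLParser):
    def reset(self):
        # As with SGMLParser, reset is called by __init__ and can be called
        # again to re-use a parser instance, so per-document state lives here.
        HTMLParser.reset(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs has the same [(attribute, value), ...] shape as in sgmllib,
        # with tag and attribute names already lowercased.
        if tag == 'a':
            href = [v for k, v in attrs if k == 'href']
            if href:
                self.urls.extend(href)

parser = URLLister()
parser.feed('<a href="toc/index.html">TOC</a> <a name="top">no href</a>')
parser.close()
print(parser.urls)  # → ['toc/index.html']
```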
'''Example 8.7. Using urllister'''
 >>> import urllib, urllister
 >>> usock = urllib.urlopen("http://diveintopython.org/")
 >>> parser = urllister.URLLister()
 >>> parser.feed(usock.read())          #(1)
 >>> usock.close()                      #(2)
 >>> parser.close()                     #(3)
 >>> for url in parser.urls: print url  #(4)
 toc/index.html
 #download
 #languages
 toc/index.html
 appendix/history.html
 download/diveintopython-html-5.0.zip
 download/diveintopython-pdf-5.0.zip
 download/diveintopython-word-5.0.zip
 download/diveintopython-text-5.0.zip
 download/diveintopython-html-flat-5.0.zip
 download/diveintopython-xml-5.0.zip
 download/diveintopython-common-5.0.zip
 ... rest of output omitted for brevity ...
(1) Call the feed method, defined in SGMLParser, to get HTML into the parser.[1] It takes a string, which is what usock.read() returns.
(2) Like files, you should close your URL objects as soon as you're done with them.
(3) You should close your parser object, too, but for a different reason. You've read all the data and fed it to the parser, but the feed method isn't guaranteed to have actually processed all the HTML you give it; it may buffer it, waiting for more. Be sure to call close to flush the buffer and force everything to be fully parsed.
(4) Once the parser is closed, the parsing is complete, and parser.urls contains a list of all the linked URLs in the HTML document. (Your output may look different, if the download links have been updated by the time you read this.)
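The captured URLs are copied out of the href attributes verbatim, so most of them are relative (toc/index.html, #download, and so on). If you need absolute URLs, each one can be resolved against the address of the page it came from; a small sketch using the standard-library urljoin function (urllib.parse on Python 3, the urlparse module on Python 2):

```python
# Resolve relative links against the base URL of the page they came from.
from urllib.parse import urljoin

base = "http://diveintopython.org/"
urls = ["toc/index.html", "#download", "appendix/history.html"]

absolute = [urljoin(base, url) for url in urls]
for url in absolute:
    print(url)
# http://diveintopython.org/toc/index.html
# http://diveintopython.org/#download
# http://diveintopython.org/appendix/history.html
```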
Footnotes