Zanurkuj w Pythonie/Extracting Data From HTML Documents

{{WEdycji}}
{{Podświetl|py}}
 
== Extracting data from HTML documents ==
 
'''Example 8.6. Introducing <tt>urllister.py</tt>'''
 
If you have not already done so, you can download [http://diveintopython.org/download/diveintopython-examples-5.4.zip this and other examples] used in this book.
 
from sgmllib import SGMLParser
self.urls.extend(href)
 
# <tt>reset</tt> is called by the <tt>__init__</tt> method of <tt>SGMLParser</tt>, and it can also be called manually once an instance of the parser has been created. So if you need to do any initialization, do it in <tt>reset</tt>, not in <tt>__init__</tt>, so that it will be re-initialized properly when someone re-uses a parser instance. There is no need to create a new parser object.
# <tt>start_a</tt> is called by <tt>SGMLParser</tt> whenever it finds an <tt><nowiki><a></nowiki></tt> tag. The tag may contain an <tt>href</tt> attribute, and/or other attributes, like <tt>name</tt> or <tt>title</tt>. The <tt>attrs</tt> parameter is a list of tuples, <tt>[(attribute, value), (attribute, value), ...]</tt>. Or it may be just an <tt><nowiki><a></nowiki></tt>, a valid (if useless) HTML tag, in which case <tt>attrs</tt> would be an empty list.
# You can find out whether this <tt><nowiki><a></nowiki></tt> tag has an <tt>href</tt> attribute with a simple multi-variable list comprehension.
# String comparisons like <tt>k=='href'</tt> are always case-sensitive, but that's safe in this case, because <tt>SGMLParser</tt> converts attribute names to lowercase while building <tt>attrs</tt>.
 
'''Example 8.7. Using <tt>urllister.py</tt>'''
 
>>> import urllib, urllister
... rest of output omitted for brevity ...
 
# Call the <tt>feed</tt> method, defined in <tt>SGMLParser</tt>, to get HTML into the parser.<ref>The technical term for a parser like <tt>SGMLParser</tt> is a ''consumer'': it consumes HTML and breaks it down. Presumably, the name <tt>feed</tt> was chosen to fit into the whole “consumer” motif. Personally, it makes me think of an exhibit in the zoo where there's just a dark cage with no trees or plants or evidence of life of any kind, but if you stand perfectly still and look really closely you can make out two beady eyes staring back at you from the far left corner, but you convince yourself that that's just your mind playing tricks on you, and the only way you can tell that the whole thing isn't just an empty cage is a small innocuous sign on the railing that reads, “Do not feed the parser.” But maybe that's just me. In any event, it's an interesting mental image.</ref> It takes a string, which is what <tt>usock.read()</tt> returns.
# Like files, you should close your URL objects as soon as you're done with them.
# You should close your parser object, too, but for a different reason. You've read all the data and fed it to the parser, but the feed method isn't guaranteed to have actually processed all the HTML you give it; it may buffer it, waiting for more. Be sure to call close to flush the buffer and force everything to be fully parsed.
# Once the parser is closed, the parsing is complete, and parser.urls contains a list of all the linked URLs in the HTML document. (Your output may look different, if the download links have been updated by the time you read this.)
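The feed-then-close sequence described above can be exercised offline, without <tt>urllib</tt> or a network connection. The sketch below assumes a Python 3 stand-in for the book's <tt>sgmllib</tt>-based class (the name <tt>LinkCollector</tt> and the local HTML strings are illustrative, not from the book), and deliberately splits the input mid-tag to show why <tt>close</tt> matters:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Minimal Python 3 stand-in for URLLister (sgmllib was removed):
    # collects href values from <a> tags.
    def reset(self):
        HTMLParser.reset(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.urls.extend(v for k, v in attrs if k == 'href')

parser = LinkCollector()
# feed() may be called repeatedly with partial input; the parser
# buffers an incomplete tag until more data (or close()) arrives.
parser.feed('<a href="index.ht')
parser.feed('ml">Home</a><a href="toc/index.html">TOC</a>')
parser.close()  # flush the buffer; parsing is now complete
for url in parser.urls:
    print(url)
```

Only after <tt>close</tt> is it safe to treat <tt>parser.urls</tt> as the complete list of links.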
 
'''Footnotes'''
----
 
<references/>
 
 
<noinclude>