Web scraping with Python

Dealing with stylesheet complexity

Some applications require a larger set of rather diverse stylesheets. lxml.etree allows you to deal with this in a number of ways. Here are some ideas to try.

The simplest way to reduce the diversity is to use XSLT parameters that you pass at call time to configure the stylesheets. The partial() function in the functools module may come in handy here. It allows you to bind a set of keyword arguments (i.e. stylesheet parameters) to a reference of a callable stylesheet. The same works for instances of the XPath() evaluator.
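As an illustration of binding stylesheet parameters with partial(), here is a minimal sketch. The stylesheet, the parameter name greeting, and the sample document are made up for this example; only etree.XSLT, etree.XSLT.strparam() and functools.partial() are real APIs.

```python
from functools import partial
from lxml import etree

# Hypothetical stylesheet with one top-level parameter called "greeting".
xslt_root = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:param name="greeting"/>
    <xsl:template match="/">
        <out><xsl:value-of select="$greeting"/>, <xsl:value-of select="/a/b"/></out>
    </xsl:template>
</xsl:stylesheet>''')
transform = etree.XSLT(xslt_root)

# Bind the parameter once; strparam() passes the value as a safely quoted string.
say_hello = partial(transform, greeting=etree.XSLT.strparam("Hello"))

doc = etree.XML("<a><b>World</b></a>")
result = say_hello(doc)
greeting_text = result.getroot().text
print(greeting_text)
```

The bound callable can then be reused for many documents without repeating the parameter at each call.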

You may also consider creating stylesheets programmatically. Just create an XSL tree, e.g. from a parsed template, and then add or replace parts as you see fit. Passing an XSL tree into the XSLT() constructor multiple times will create independent stylesheets, so later modifications of the tree will not be reflected in the already created stylesheets. This makes stylesheet generation very straightforward.
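The independence of stylesheets built from the same tree could be checked with a sketch like the following. The template and the element name out are invented for the example.

```python
from lxml import etree

# Hypothetical template stylesheet that emits a fixed <out> element.
xsl_tree = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
        <out>first</out>
    </xsl:template>
</xsl:stylesheet>''')

first = etree.XSLT(xsl_tree)

# Modify the XSL tree afterwards and build a second stylesheet from it.
xsl_tree.find(".//out").text = "second"
second = etree.XSLT(xsl_tree)

doc = etree.XML("<a/>")
first_text = first(doc).getroot().text
second_text = second(doc).getroot().text
print(first_text, second_text)
```

The first stylesheet keeps its original behaviour even though the tree it was built from has changed.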

How to edit XML with ElementTree

Editing XML with ElementTree is also very simple. To make things a bit more interesting, we'll add another appointment block to the XML:


<?xml version="1.0" ?>
<zAppointments reminder="15">
    <appointment>
        <begin>1181251680</begin>
        <uid>040000008200E000</uid>
        <alarmTime>1181572063</alarmTime>
        <state></state>
        <location></location>
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
    <appointment>
        <begin>1181253977</begin>
        <uid>sdlkjlkadhdakhdfd</uid>
        <alarmTime>1181588888</alarmTime>
        <state>TX</state>
        <location>Dallas</location>
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
</zAppointments>


Now we'll write code that changes the value of each begin tag from seconds since the epoch to something more readable. We'll use Python's time module to make life easier:


# -*- coding: utf-8 -*-
import time
import xml.etree.ElementTree as ET


def editXML(filename):
    """
    Edit the XML file.
    """
    tree = ET.ElementTree(file=filename)
    root = tree.getroot()

    for begin_time in root.iter("begin"):
        # the tag text is a string, so convert it to int for ctime()
        begin_time.text = time.ctime(int(begin_time.text))

    # pass a file name so that ElementTree handles the encoding itself
    tree.write("updated.xml")


if __name__ == "__main__":
    editXML("original_appt.xml")


Here we create an ElementTree object called tree and extract the root from it. Then we use the iter() method to find all tags named "begin".

Note that the iter() method was added in Python 2.7. In our for loop, we set the text content of each element to a more readable time format using time.ctime().

Note also that we need to convert the string to an integer before passing it to ctime. The result will look something like this:


<zAppointments reminder="15">
    <appointment>
        <begin>Thu Jun 07 16:28:00 2007</begin>
        <uid>040000008200E000</uid>
        <alarmTime>1181572063</alarmTime>
        <state />
        <location />
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
    <appointment>
        <begin>Thu Jun 07 17:06:17 2007</begin>
        <uid>sdlkjlkadhdakhdfd</uid>
        <alarmTime>1181588888</alarmTime>
        <state>TX</state>
        <location>Dallas</location>
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
</zAppointments>


You can also use ElementTree methods such as find() or findall() to search for specific tags in your XML. The find() method returns only the first match, while findall() returns every tag with the given name. This is very useful when editing or parsing, which is the topic of our next section!
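The difference between find() and findall() can be sketched as follows, using a shortened version of the appointments document as inline sample data (an assumption made to keep the example self-contained):

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the appointments XML used in this section.
XML = """<zAppointments reminder="15">
    <appointment>
        <uid>040000008200E000</uid>
        <state></state>
    </appointment>
    <appointment>
        <uid>sdlkjlkadhdakhdfd</uid>
        <state>TX</state>
    </appointment>
</zAppointments>"""

root = ET.fromstring(XML)

# find() returns only the first matching child (or None if nothing matches).
first_appt = root.find("appointment")
first_uid = first_appt.find("uid").text

# findall() returns a list with every matching child.
all_appts = root.findall("appointment")
states = [appt.findtext("state") for appt in all_appts]

print(first_uid, len(all_appts), states)
```

Note that findtext() returns an empty string for an element that exists but has no text.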

Method 2: Using BeautifulSoup (Reliable)

This is another good choice if, for some reason, the source XML is badly formatted. A plain XML parser may not work very well unless you pre-process the file first.

It turns out that BeautifulSoup works very well for all these types of files, so if you want to parse any kind of XML file, use this approach.

To install it, use pip to install the bs4 module:

pip3 install bs4

I’ll give you a small snippet for our previous XML file:

        <item name="item1">10</item>
        <item name="item2">20</item>
        <item name="item3">30</item>
        <item name="item4">40</item>

I'll read this file and then parse it using BeautifulSoup.

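from bs4 import BeautifulSoup

# Read the whole XML file into a string
with open('sample.xml', 'r') as fd:
    xml_file = fd.read()

# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(xml_file, 'lxml')

for tag in soup.find_all("item"):
    # Print the "name" attribute and the text of each <item> tag
    print(tag["name"], tag.text)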


The syntax is similar to the previous approach, so we still read each tag's attributes and text in much the same way as before.



We've now parsed this using BeautifulSoup too! If your source file is badly formatted, this method is the way to go, since BeautifulSoup has different rules for handling such files.

Microformat Example

This example parses the hCard microformat.

First we get the page:

>>> from urllib.request import urlopen
>>> from lxml.html import fromstring
>>> url = 'http://microformats.org/'
>>> content = urlopen(url).read()
>>> doc = fromstring(content)
>>> doc.make_links_absolute(url)

Then we create some objects to put the information in:

>>> class Card(object):
...     def __init__(self, **kw):
...         for name, value in kw.items():
...             setattr(self, name, value)
>>> class Phone(object):
...     def __init__(self, phone, types=()):
...         self.phone, self.types = phone, types

And some generally handy functions for microformats:

>>> def get_text(el, class_name):
...     els = el.find_class(class_name)
...     if els:
...         return els[0].text_content()
...     else:
...         return ''
>>> def get_value(el):
...     return get_text(el, 'value') or el.text_content()
>>> def get_all_texts(el, class_name):
...     return [e.text_content() for e in el.find_class(class_name)]
>>> def parse_addresses(el):
...     # Ideally this would parse street, etc.
...     return el.find_class('adr')

Then the parsing:

>>> for el in doc.find_class('hcard'):
...     card = Card()
...     card.el = el
...     card.fn = get_text(el, 'fn')
...     card.tels = []
...     for tel_el in el.find_class('tel'):
...         card.tels.append(Phone(get_value(tel_el),
...                                get_all_texts(tel_el, 'type')))
...     card.addresses = parse_addresses(el)


Many software products come with the pick-two caveat, meaning that you must choose only two: speed, flexibility, or readability. When used carefully, lxml can provide all three. XML developers who have struggled with DOM performance or with the event-driven model of SAX now have the chance to work with higher-level pythonic libraries. Programmers coming from a Python background who are new to XML have an easy way to explore the expressivity of XPath and XSLT. Both coding styles can co-exist happily in an lxml-based application.

lxml has more to offer than what was explored here. Be sure to look into the lxml.objectify module, especially for smaller datasets or applications that are not primarily XML-based. For content in HTML that might not be well formed, lxml provides two useful packages: the lxml.html module and the BeautifulSoup parser. It's also possible to extend lxml itself if you write Python modules that are callable from XSLT or create custom Python or C extensions. Find information about all of these in the lxml documentation mentioned in Related topics.

Related topics

  • Help getting lxml to work reliably on MacOS-X: Read this thread for invaluable help on installing lxml on MacOS X.
  • ElementTree Overview: Find information about the ElementTree API and cElementTree.
  • : In this section of the ElementTree documentation, get more information about the iteration pattern used in .
  • developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
  • Google U.S. copyright renewal data: Download and experiment with this U.S. copyright renewal data converted into XML by Google (371MB, zipped, 426,907 individual records).
  • Open Directory RDF content: Download RDF dumps of the Open Directory database (1.9GB, zipped, 5,354,663 individual records).
  • eXist: Check out this open source database management system that uses XQuery.
  • Psyco: Learn more about this Python extension module that can massively speed up the execution of Python code.
  • IBM trial software for product evaluation: Build your next project with trial software available for download directly from developerWorks, including application development tools and middleware products from DB2, Lotus, Rational, Tivoli, and WebSphere.


There are some key things to remember about XML and using ElementTree.

Tags build the tree structure and designate what values should be delineated there. Using smart structuring can make it easy to read and write to an XML. Tags always need opening and closing brackets to show the parent and children relationships.

Attributes further describe how to validate a tag or allow for boolean designations. Attributes typically take very specific values so that the XML parser (and the user) can use the attributes to check the tag values.

ElementTree is an important Python library that allows you to parse and navigate an XML document. ElementTree breaks the XML document down into a tree structure that is easy to work with. When in doubt, print it out: printing the serialized tree lets you view the entire XML document at once, which helps you check your work when editing, adding to, or removing from an XML document.
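The points above can be tried out with a small sketch. The inline document is a made-up sample; ET.tostring() with encoding="unicode" is one way to print the whole tree as text.

```python
import xml.etree.ElementTree as ET

# Made-up sample document for illustration.
root = ET.fromstring(
    '<zAppointments reminder="15">'
    '<appointment><subject>Bring pizza home</subject></appointment>'
    '</zAppointments>'
)

# Attributes of a tag are exposed as a dictionary.
reminder = root.attrib["reminder"]

# Iterating over an element yields its direct children.
child_tags = [child.tag for child in root]

# "When in doubt, print it out": serialize the whole tree to a string.
whole_document = ET.tostring(root, encoding="unicode")
print(whole_document)
```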

Now you are equipped to understand XML and begin parsing!


  • https://docs.python.org/3.5/library/xml.etree.elementtree.html
  • https://en.wikipedia.org/wiki/XML

Event types

The parse events are tuples (event-type, object). The event types supported by ElementTree and lxml.etree are the strings ‘start’, ‘end’, ‘start-ns’ and ‘end-ns’. The ‘start’ and ‘end’ events represent opening and closing elements. They are accompanied by the respective Element instance. By default, only ‘end’ events are generated, whereas the example above requested the generation of both ‘start’ and ‘end’ events.

The ‘start-ns’ and ‘end-ns’ events notify about namespace declarations. They do not come with Elements. Instead, the value of the ‘start-ns’ event is a tuple (prefix, namespaceURI) that designates the beginning of a prefix-namespace mapping. The corresponding end-ns event does not have a value (None). It is common practice to use a list as namespace stack and pop the last entry on the ‘end-ns’ event.
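The namespace stack mentioned above might look like the following sketch. It uses the standard-library ElementTree rather than lxml.etree, and the document and namespace URI are invented for the example.

```python
import io
import xml.etree.ElementTree as ET

# Invented sample document with a single namespace declaration.
XML = b'<root xmlns:a="http://example.com/a"><a:item>1</a:item><plain/></root>'

namespaces = []  # stack of (prefix, namespaceURI) tuples
log = []         # record of every event, for inspection

events = ("start", "end", "start-ns", "end-ns")
for event, obj in ET.iterparse(io.BytesIO(XML), events=events):
    if event == "start-ns":
        # obj is a (prefix, namespaceURI) tuple
        namespaces.append(obj)
        log.append((event, obj))
    elif event == "end-ns":
        # obj is None; drop the most recent mapping
        namespaces.pop()
        log.append((event, obj))
    else:
        # obj is the Element being opened or closed
        log.append((event, obj.tag))

for entry in log:
    print(entry)
```

After parsing has finished, the stack is empty again because every start-ns event has been matched by an end-ns event.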

Parsing with ElementTree

In this section, we'll learn how to create, edit, and parse XML files with ElementTree. For comparison, we'll use the same XML that we used in the previous section, to demonstrate the difference between using minidom and ElementTree. Here is our XML:


<?xml version="1.0" ?>
<zAppointments reminder="15">
    <appointment>
        <begin>1181251680</begin>
        <uid>040000008200E000</uid>
        <alarmTime>1181572063</alarmTime>
        <state></state>
        <location></location>
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
</zAppointments>


Let's start by learning how to create this XML structure with Python.
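One possible way to build this structure is sketched below with the standard-library ElementTree. This is an illustrative approach rather than the original author's code; the tag names and values are taken from the sample XML above.

```python
import xml.etree.ElementTree as ET

# Root element with its "reminder" attribute.
root = ET.Element("zAppointments", reminder="15")
appt = ET.SubElement(root, "appointment")

# (tag, text) pairs taken from the sample document.
fields = [
    ("begin", "1181251680"),
    ("uid", "040000008200E000"),
    ("alarmTime", "1181572063"),
    ("state", ""),
    ("location", ""),
    ("duration", "1800"),
    ("subject", "Bring pizza home"),
]
for tag, text in fields:
    child = ET.SubElement(appt, tag)
    child.text = text

# Serialize to bytes; ET.ElementTree(root).write("appt.xml") would save it to a file.
xml_bytes = ET.tostring(root)
print(xml_bytes.decode("ascii"))
```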

Parsing the book example

Well, the result of our example is a bit boring. Most of the time, you will want to save the extracted data and do something with it, not just print it to stdout. So in our next example we'll create a data structure to collect the results. In this example, our data structure will be a list of dicts. We'll use the MSDN book example. Save the following XML as example.xml.


<?xml version="1.0"?>
<catalog>
    <book id="bk101">
        <author>Gambardella, Matthew</author>
        <title>XML Developer's Guide</title>
        <genre>Computer</genre>
        <price>44.95</price>
        <publish_date>2000-10-01</publish_date>
        <description>An in-depth look at creating applications with XML.</description>
    </book>
    <book id="bk102">
        <author>Ralls, Kim</author>
        <title>Midnight Rain</title>
        <genre>Fantasy</genre>
        <price>5.95</price>
        <publish_date>2000-12-16</publish_date>
        <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description>
    </book>
    <book id="bk103">
        <author>Corets, Eva</author>
        <title>Maeve Ascendant</title>
        <genre>Fantasy</genre>
        <price>5.95</price>
        <publish_date>2000-11-17</publish_date>
        <description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description>
    </book>
</catalog>


Now we'll parse this XML and put it into our data structure!


# -*- coding: utf-8 -*-
from lxml import etree


def parseBookXML(xmlFile):
    with open(xmlFile) as fobj:
        xml = fobj.read()

    root = etree.fromstring(xml)

    book_dict = {}
    books = []
    # iterating over an element yields its child elements
    for book in root:
        for elem in book:
            if not elem.text:
                text = "None"
            else:
                text = elem.text
            print(elem.tag + " => " + text)
            book_dict[elem.tag] = text
        if book.tag == "book":
            books.append(book_dict)
            book_dict = {}
    return books


if __name__ == "__main__":
    parseBookXML("example.xml")


This example is very similar to the previous one, so we'll focus only on the differences. Before iterating over the context, we create an empty dictionary and an empty Python list. Then, inside the loop, we build our dictionary like this:

book_dict[elem.tag] = text

The text is either elem.text or the string "None". Finally, if the tag is book, then we are at the end of a book section, so we need to add the dict to our list and reset the dict for the next book. As you can see, that is exactly what we did. A more realistic example would be to put the extracted data into a Python Book class. I have done the latter with json feeds before. Now we're ready to move on to parsing XML with lxml.objectify!
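The idea of a Book class could be sketched as follows. This is a hypothetical design, not code from the original article: the Book dataclass and the parse_books() helper are invented here, the sample catalog is shortened, and the standard-library ElementTree is used in place of lxml to keep the example dependency-free.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Book:
    """Hypothetical container for one <book> entry."""
    id: str
    author: str
    title: str
    price: float


# Shortened version of the catalog shown above.
CATALOG = """<catalog>
    <book id="bk101">
        <author>Gambardella, Matthew</author>
        <title>XML Developer's Guide</title>
        <price>44.95</price>
    </book>
    <book id="bk102">
        <author>Ralls, Kim</author>
        <title>Midnight Rain</title>
        <price>5.95</price>
    </book>
</catalog>"""


def parse_books(xml_text):
    """Return a list of Book objects, one per <book> element."""
    root = ET.fromstring(xml_text)
    return [
        Book(
            id=book.get("id"),
            author=book.findtext("author"),
            title=book.findtext("title"),
            price=float(book.findtext("price")),
        )
        for book in root.findall("book")
    ]


books = parse_books(CATALOG)
for book in books:
    print(book)
```

Compared with a list of dicts, a class gives each field a fixed name and type, at the cost of having to decide on the fields up front.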

The parseString Method

There is one more method to create a SAX parser and to parse the specified XML string.

xml.sax.parseString(xmlstring, contenthandler[, errorhandler])

Here is the detail of the parameters −

  • xmlstring − This is the XML string to read from.

  • contenthandler − This must be a ContentHandler object.

  • errorhandler − If specified, errorhandler must be a SAX ErrorHandler object.



import xml.sax

class MovieHandler(xml.sax.ContentHandler):
   def __init__(self):
      super().__init__()
      self.CurrentData = ""
      self.type = ""
      self.format = ""
      self.year = ""
      self.rating = ""
      self.stars = ""
      self.description = ""

   # Call when an element starts
   def startElement(self, tag, attributes):
      self.CurrentData = tag
      if tag == "movie":
         print("*****Movie*****")
         title = attributes["title"]
         print("Title:", title)

   # Call when an element ends
   def endElement(self, tag):
      if self.CurrentData == "type":
         print("Type:", self.type)
      elif self.CurrentData == "format":
         print("Format:", self.format)
      elif self.CurrentData == "year":
         print("Year:", self.year)
      elif self.CurrentData == "rating":
         print("Rating:", self.rating)
      elif self.CurrentData == "stars":
         print("Stars:", self.stars)
      elif self.CurrentData == "description":
         print("Description:", self.description)
      self.CurrentData = ""

   # Call when character data is read
   def characters(self, content):
      if self.CurrentData == "type":
         self.type = content
      elif self.CurrentData == "format":
         self.format = content
      elif self.CurrentData == "year":
         self.year = content
      elif self.CurrentData == "rating":
         self.rating = content
      elif self.CurrentData == "stars":
         self.stars = content
      elif self.CurrentData == "description":
         self.description = content

if __name__ == "__main__":
   # create an XMLReader
   parser = xml.sax.make_parser()
   # turn off namespaces
   parser.setFeature(xml.sax.handler.feature_namespaces, 0)

   # override the default ContentHandler
   Handler = MovieHandler()
   parser.setContentHandler(Handler)

   # parse the XML document (the file name movies.xml is assumed here)
   parser.parse("movies.xml")

This would produce following result −

*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Viewable boredom
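The parseString method described at the start of this section can be tried with a minimal sketch like the one below. The TitleCollector handler and the inline movie data are made up for illustration; only xml.sax.parseString() and ContentHandler come from the standard library.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler that collects the title attribute of each <movie>."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, tag, attributes):
        if tag == "movie":
            self.titles.append(attributes["title"])


# Made-up sample data, passed as bytes instead of being read from a file.
XML = b"""<collection>
    <movie title="Enemy Behind"><type>War, Thriller</type></movie>
    <movie title="Transformers"><type>Anime, Science Fiction</type></movie>
</collection>"""

handler = TitleCollector()
xml.sax.parseString(XML, handler)
print(handler.titles)
```

Because the whole document is already in memory, there is no need to create a parser object explicitly.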

Parsing XML with lxml.objectify

The lxml module has a module called objectify that can turn XML documents into Python objects. I find “objectified” XML documents very easy to work with and I hope you will too. You may need to jump through a hoop or two to install it as pip doesn’t work with lxml on Windows. Be sure to go to the Python Package index and look for a version that’s been made for your version of Python. Also note that the latest pre-built installer for lxml only supports Python 3.2 (at the time of writing), so if you have a newer version of Python, you may have some difficulty getting lxml installed for your version.

Anyway, once you have it installed, we can start going over this wonderful piece of XML again:

<?xml version="1.0" ?>
<zAppointments reminder="15">
        <subject>Bring pizza home</subject>
        <subject>Check MS Office website for updates</subject>

Now we need to write some code that can parse and modify the XML. Let’s take a look at this little demo that shows a bunch of the neat abilities that objectify provides.

from lxml import etree, objectify

def parseXML(xmlFile):
    """Parse the XML file"""
    with open(xmlFile) as f:
        xml = f.read()

    root = objectify.fromstring(xml)

    # returns attributes in element node as dict
    attrib = root.attrib

    # how to extract element data
    begin = root.appointment.begin
    uid = root.appointment.uid

    # loop over elements and print their tags and text
    for appt in root.getchildren():
        for e in appt.getchildren():
            print("%s => %s" % (e.tag, e.text))

    # how to change an element's text
    root.appointment.begin = "something else"

    # how to add a new element
    root.appointment.new_element = "new data"

    # remove the py:pytype stuff
    objectify.deannotate(root)
    etree.cleanup_namespaces(root)
    obj_xml = etree.tostring(root, pretty_print=True)

    # save your xml (tostring() returns bytes, so write in binary mode)
    with open("new.xml", "wb") as f:
        f.write(obj_xml)

if __name__ == "__main__":
    f = r'path\to\sample.xml'
    parseXML(f)

The code is pretty well commented, but we’ll spend a little time going over it anyway. First we pass it our sample XML file and objectify it. If you want to get access to a tag’s attributes, use the attrib property. It will return a dictionary of the attributes of the tag. To get to sub-tag elements, you just use dot notation. As you can see, to get to the begin tag’s value, we can just do something like this:

begin = root.appointment.begin

One thing to be aware of is if the value happens to have leading zeroes, the returned value may have them truncated. If that is important to you, then you should use the following syntax instead:

begin = root.appointment.begin.text
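The leading-zero behaviour can be seen in a short sketch. The uid value 0042 is invented for the example, and the exact element types involved are lxml internals, so treat this as an illustration rather than a specification.

```python
from lxml import objectify

# Made-up document whose <uid> value has leading zeroes.
root = objectify.fromstring(
    "<zAppointments><appointment><uid>0042</uid></appointment></zAppointments>"
)

# objectify guesses a Python type from the text, so this behaves like a number.
as_number = root.appointment.uid

# The .text attribute keeps the original string untouched.
as_text = root.appointment.uid.text

print(as_number, as_text)
```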

If you need to iterate over the children elements, you can use the iterchildren method. You may have to use a nested for loop structure to get everything. Changing an element’s value is as simple as just assigning it a new value.

root.appointment.new_element = "new data"

Finding elements quickly

After parsing, the most common XML task is to locate specific data of interest inside the parsed tree. lxml offers several approaches, from a simplified search syntax to full XPath 1.0. As a user, you should be aware of the performance characteristics and optimization techniques for each approach.

Avoid use of find and findall

The find and findall methods, inherited from the ElementTree API, locate one or more descendant nodes using a simplified XPath-like expression language called ElementPath. Users migrating from ElementTree to lxml can naturally continue to use the find/ElementPath syntax.

lxml supplies two other options for discovering subnodes: the iterchildren/iterdescendants methods and true XPath. In cases where the expression should match a node name, it is far faster (in some cases twice as fast) to use the iterchildren or iterdescendants methods with their optional tag parameter than to use the equivalent ElementPath expressions.

For more complex patterns, use the XPath class to precompile search patterns. Simple patterns that mimic the behavior of iterchildren with tag arguments (for example, etree.XPath("child::Title")) execute in effectively the same time as their iterchildren equivalents. It's important to precompile, though. Compiling the pattern in each execution of the loop or using the xpath() method on an element (described in the lxml documentation, see Related topics) can be almost twice as slow as compiling once and then using that pattern repeatedly.

XPath evaluation in lxml is fast. If only a subset of nodes needs to be serialized, it is much better to limit with precise XPath expressions up front than to inspect all the nodes later. For example, limiting the sample serialization to include only titles containing the word night, as in Listing 8, takes 60 percent of the time needed to serialize the full set.
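The difference between a precompiled pattern and a per-call expression might be sketched like this. The two-record sample document is invented, and no timings are claimed here; the sketch only shows that both forms return the same nodes.

```python
from lxml import etree

# Invented sample data shaped like the <Record> elements discussed above.
root = etree.XML(
    "<Records>"
    "<Record><Title>Night watch</Title></Record>"
    "<Record><Title>Day shift</Title></Record>"
    "</Records>"
)

# Compile the expression once, outside the loop.
find_title = etree.XPath("child::Title")

precompiled = [
    find_title(record)[0].text
    for record in root.iterchildren(tag="Record")
]

# Equivalent result, but the expression is handled again on every call.
per_call = [
    record.xpath("child::Title")[0].text
    for record in root.iterchildren(tag="Record")
]

print(precompiled)
```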

Listing 8. Conditional serialization with XPath classes
from copy import deepcopy
from lxml import etree

def write_if_node(out, node):
    if node is not None:
        out.write(etree.tostring(node, encoding='utf-8'))

def serialize_with_xpath(elem, xp1, xp2):
    '''Take our source <Record> element and apply two pre-compiled XPath classes.
    Return a node only if the first expression matches.
    '''
    r = etree.Element('Record')

    t = etree.SubElement(r, 'Title')
    x = xp1(elem)
    if x:
        t.text = x[0].text
        for c in xp2(elem):
            # copy each matching node into the new record
            r.append(deepcopy(c))
        return r

xp1 = etree.XPath("child::Title[contains(text(), 'night')]")
xp2 = etree.XPath("child::Copyright")
# etree.tostring() returns bytes, so open the output in binary mode
out = open('out.xml', 'wb')
context = etree.iterparse('copyright.xml', events=('end',), tag='Record')
for event, elem in context:
    write_if_node(out, serialize_with_xpath(elem, xp1, xp2))
    elem.clear()  # free memory once the record has been handled
out.close()

Finding nodes in other parts of the document

Note that, even when using iterparse, it is possible to use XPath predicates based on looking ahead of the current node. To find all nodes that are immediately followed by a record whose title contains the word night, do this:

etree.XPath("Title[contains(../Record/following::Record/Title/text(), 'night')]")

However, when using a memory-efficient iteration strategy, this command returns nothing because preceding nodes are cleared as parsing proceeds through the document:

etree.XPath("Title[contains(../Record/preceding::Record/Title/text(), 'night')]")

While it is possible to write an efficient algorithm to solve this particular problem, tasks involving analyses across nodes—especially those that might be randomly distributed in the document—are usually more suited for an XML database that uses XQuery, such as eXist.

Errors and messages

Like most of the processing-oriented objects in lxml.etree, XSLT provides an error log that lists messages and error output from the last run. See the lxml documentation for a description of the error log.

>>> xslt_root = etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:template match="/">
...         <xsl:message terminate="no">STARTING</xsl:message>
...         <foo><xsl:value-of select="/a/b/text()" /></foo>
...         <xsl:message terminate="no">DONE</xsl:message>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_root)

>>> doc_root = etree.XML('<a><b>Text</b></a>')
>>> result = transform(doc_root)
>>> str(result)
'<?xml version="1.0"?>\n<foo>Text</foo>\n'

>>> print(transform.error_log)
<string>:0:0:ERROR:XSLT:ERR_OK: STARTING
<string>:0:0:ERROR:XSLT:ERR_OK: DONE

>>> for entry in transform.error_log:
...     print('message from line %s, col %s: %s' % (
...                entry.line, entry.column, entry.message))
...     print('domain: %s (%d)' % (entry.domain_name, entry.domain))
...     print('type: %s (%d)' % (entry.type_name, entry.type))
...     print('level: %s (%d)' % (entry.level_name, entry.level))
...     print('filename: %s' % entry.filename)
message from line 0, col 0: STARTING
domain: XSLT (22)
type: ERR_OK (0)
level: ERROR (2)
filename: <string>
message from line 0, col 0: DONE
domain: XSLT (22)
type: ERR_OK (0)
level: ERROR (2)
filename: <string>
