XPath ultra detailed summary

2022-07-16 07:55:00 【Steal the mask and run away】

XPath

XPath, short for XML Path Language, is a language for finding information in XML documents. It was originally designed for searching XML documents, but it works equally well on HTML documents, so a web crawler can use XPath to extract the information it needs.

1、XPath overview

XPath's selection capabilities are very powerful, and its path expressions are very concise. In addition, it provides more than 100 built-in functions for matching strings, numbers, and times and for processing nodes and sequences (that count applies to XPath 2.0 and later; lxml implements XPath 1.0, whose core function library is smaller). Almost any node you want to locate can be selected with XPath.
Official documentation: https://www.w3.org/TR/xpath/

2、XPath Common rules

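The most commonly used XPath rules are:

nodename    selects all nodes with the given name
/           selects direct children of the current node
//          selects descendants of the current node
.           selects the current node
..          selects the parent of the current node
@           selects attributes

An example is as follows: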

//title[@lang='eng']
This is an XPath rule. It selects every node named title whose lang attribute has the value eng. The following sections use Python's lxml library to parse HTML with XPath.
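
To see the rule //title[@lang='eng'] in action, here is a minimal sketch using lxml. The bookstore document is a hypothetical sample invented for this illustration; it is not part of the original article.

```python
from lxml import etree

# Hypothetical sample XML, made up for this illustration.
xml = b"""<bookstore>
  <book><title lang="eng">Harry Potter</title></book>
  <book><title lang="fra">Le Petit Prince</title></book>
  <title lang="eng">Learning XML</title>
</bookstore>"""

root = etree.fromstring(xml)

# // searches the whole document; [@lang='eng'] keeps only the title
# nodes whose lang attribute equals "eng".
titles = root.xpath("//title[@lang='eng']")
texts = [t.text for t in titles]
print(texts)
```

The French title is filtered out by the attribute predicate, while both English titles match regardless of how deeply they are nested.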

3、Installation

Windows (Python 3 environment): pip install lxml
Linux: pip install lxml

4、Introductory example

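from lxml import etree
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
html = etree.HTML(text)          # parse the string; lxml repairs the unclosed li
result = etree.tostring(html)    # serialize the corrected tree (bytes)
print(result.decode('utf-8'))    # convert bytes to str for printing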

First import the etree module from the lxml library, then declare a piece of HTML text and pass it to etree.HTML() to initialize it, which constructs an object that can be queried with XPath. Note that the last li node in the HTML text is not closed, but the etree module corrects the HTML automatically. Calling tostring() outputs the corrected HTML code, but the result is of type bytes; decode() converts it to str. The result is as follows:

<html><body><div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li></ul>
</div>
</body></html>

After processing, the li tag has been completed, and the body and html nodes have been added automatically. You can also read a text file directly and parse it:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))

test.html contains the HTML code from the example above:

<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>

The output is slightly different this time: there is an extra DOCTYPE declaration (omitted from the listing below), but it has no effect on parsing. The result is as follows:

<html><body><div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li></ul>
</div></body></html>

Using XPath in Python

from lxml import etree
# Method 1: parse an HTML string directly in Python code
text = """ <div> <ul> <li class="item-0"><a href="link1.html">first item</a></li> .... </ul> </div> """
resp_html = etree.HTML(text)
# Method 2: read an HTML file and parse it
html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))
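
The examples that follow read ./test.html, so that file has to exist first. The sketch below writes the sample HTML to a temporary directory (the use of tempfile is an assumption made here so the example cleans up after itself) and shows that both loading methods can be queried the same way.

```python
import os
import tempfile
from lxml import etree

sample = """<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""

# Method 2: write the sample to a file, then parse the file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    from_file = etree.parse(path, etree.HTMLParser())
    file_count = len(from_file.xpath("//li"))

# Method 1: parse the same text directly from a string.
from_string = etree.HTML(sample)
string_count = len(from_string.xpath("//li"))

print(file_count, string_count)
```

etree.HTML() returns the root element, while etree.parse() returns a tree object that wraps the root; both expose an xpath() method, which is why the later examples work with either.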

5、All nodes

An XPath rule that begins with // selects all nodes in the document that meet the requirements:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)
# Output:
""" [<Element html at 0x1d6610ebe08>, <Element body at 0x1d6610ebf08>, <Element div at 0x1d6610ebf48>, <Element ul at 0x1d6610ebf88>, <Element li at 0x1d6610ebfc8>, <Element a at 0x1d661115088>, <Element li at 0x1d6611150c8>, <Element a at 0x1d661115108>, <Element li at 0x1d661115148>, <Element a at 0x1d661115048>, <Element li at 0x1d661115188>, <Element a at 0x1d6611151c8>, <Element li at 0x1d661115208>, <Element a at 0x1d661115248>] """

Here * matches all nodes, so every node in the document is returned. The result is a list in which each element is an Element object, and its printed form shows the node name: html, body, div, ul, li, and a.
You can also specify the name of the nodes to match:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li')
print(result)
print(result[0])
# Output:
# [<Element li at 0x23fb219af08>, <Element li at 0x23fb219af48>, <Element li at 0x23fb219af88>,
#  <Element li at 0x23fb219afc8>, <Element li at 0x23fb21c5048>]
# <Element li at 0x23fb219af08>

To take out a single object, index the list directly, for example result[0].
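
Printing an Element only shows its tag and memory address. As a small sketch (the one-item HTML string is a shortened stand-in for test.html, chosen here for brevity), the element's tag, attributes, and children can be read like this:

```python
from lxml import etree

# Shortened stand-in for the article's sample document.
html = etree.HTML(
    '<ul><li class="item-0"><a href="link1.html">first item</a></li></ul>'
)

first_li = html.xpath('//li')[0]   # index the result list

tag = first_li.tag                 # node name
cls = first_li.get('class')        # attribute value, None if missing
link = first_li[0]                 # first child element (the a node)
link_text = link.text              # text directly inside the a node

print(tag, cls, link.tag, link_text)
```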

6、Child nodes

Use / or // to find child nodes or descendant nodes. To select all a nodes that are direct children of li nodes:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)

Here / selects direct children. To get all descendants, replace / with //.
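
The difference between / and // is easiest to see by starting from the ul node, which has no a node as a direct child. A minimal sketch, parsing the sample from a string rather than from test.html:

```python
from lxml import etree

sample = """<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""
html = etree.HTML(sample)

# / looks only one level down: ul has li children, not a children.
direct = html.xpath('//ul/a')

# // looks at every level below ul, so the a nodes inside li are found.
nested = html.xpath('//ul//a')

print(len(direct), len(nested))
```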

7、Parent node

If you know the child node, you can reach its parent with .. or with the parent:: axis:

# Get the class attribute of the parent of the a node whose href is link4.html
# Method 1: use .. to step up to the parent
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)
# Method 2: use the parent:: axis
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)
# Output: ['item-1']
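
Outside of XPath, lxml elements also offer a getparent() method. The sketch below, which uses a shortened stand-in for test.html, compares it with the .. step; which one to use is a matter of taste.

```python
from lxml import etree

# Shortened stand-in for the article's sample document.
html = etree.HTML(
    '<ul><li class="item-1"><a href="link4.html">fourth item</a></li></ul>'
)

# XPath route: step from the a node up to its parent, then read @class.
via_xpath = html.xpath('//a[@href="link4.html"]/../@class')

# Element API route: select the a node, then ask lxml for its parent.
a_node = html.xpath('//a[@href="link4.html"]')[0]
via_api = a_node.getparent().get('class')

print(via_xpath, via_api)
```

Note that the XPath route returns a list of attribute values, while the API route returns a single string.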

8、Attribute matching

Use the @ symbol to filter by attribute when matching:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-inactive"]')
print(result)
# Output: [<Element li at 0x2089793a3c8>]
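
When the attribute value comes from a Python variable, lxml lets you pass it as an XPath variable instead of building the expression with string formatting. A minimal sketch; the variable name cls is an arbitrary choice made here:

```python
from lxml import etree

sample = """<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""
html = etree.HTML(sample)

wanted = 'item-1'

# $cls is an XPath variable; lxml fills it from the keyword argument,
# so quotes inside the value cannot break the expression.
matches = html.xpath('//li[@class=$cls]/a/text()', cls=wanted)
print(matches)
```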

9、Text acquisition

There are two ways: select the node that directly holds the text and call text() on it, or use // to collect the text of all descendant nodes.

# Method 1: select the a node, then take its text
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)
# Method 2: collect the text of every descendant of the li node
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]//text()')
print(result)

The second method also picks up special characters, such as the line break left inside the li node that lxml closed automatically. The first method is recommended because it keeps the results clean.
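
If the // form is needed anyway, for example because the text is spread over several nested nodes, the whitespace-only strings can be filtered out in Python. A sketch of that cleanup, parsing the sample from a string:

```python
from lxml import etree

sample = """<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""
html = etree.HTML(sample)

# Method 1: text that sits directly inside the a nodes.
direct = html.xpath('//li[@class="item-0"]/a/text()')

# Method 2: every text node below the li nodes, whitespace included.
raw = html.xpath('//li[@class="item-0"]//text()')

# Drop strings that contain only whitespace and trim the rest.
cleaned = [t.strip() for t in raw if t.strip()]

print(direct)
print(raw)
print(cleaned)
```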

10、Attribute acquisition

In XPath syntax, the @ symbol is used in filters, and it can also fetch a node's attribute values directly:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)
# Output: ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
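
A crawler usually wants the link and its label together. One possible way, sketched here under the assumption that every a node has both an href and direct text, is to select the a nodes once and read both values from each element:

```python
from lxml import etree

sample = """<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
</div>
"""
html = etree.HTML(sample)

# Attribute values alone, as in the example above.
hrefs = html.xpath('//li/a/@href')

# href -> link text, built from the a elements themselves.
mapping = {a.get('href'): a.text for a in html.xpath('//li/a')}

print(hrefs)
print(mapping)
```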

11、Attribute multi-value matching

Sometimes an attribute of a node has more than one value. An exact comparison such as @class="li" no longer matches in that case, so use the contains() function instead:

from lxml import etree
text = ''' <li class="li li-first"><a href="link.html">first item</a></li> '''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)
# Output: ['first item']
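
One caveat: contains() is a plain substring test, so "li" also matches a class such as "link". The sketch below compares an exact match, contains(), and a commonly used stricter idiom that pads the attribute with spaces. The second li node, with class "link", is a hypothetical addition made to show the difference.

```python
from lxml import etree

# The li with class "link" is invented for this illustration.
html = etree.HTML(
    '<ul>'
    '<li class="li li-first"><a href="link.html">first item</a></li>'
    '<li class="link"><a href="other.html">other item</a></li>'
    '</ul>'
)

# Exact comparison fails because the attribute is "li li-first".
exact = html.xpath('//li[@class="li"]/a/text()')

# Substring test: matches "li li-first" and also "link".
loose = html.xpath('//li[contains(@class, "li")]/a/text()')

# Whole-word test: pad with spaces and look for " li ".
token = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(exact, loose, token)
```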

12、Multiple attribute matching

When a node has to be matched on several attributes at once, join the conditions with the and operator:

from lxml import etree
text = ''' <li class="li li-first" name="item"><a href="link.html">first item</a></li> '''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)
# Output: ['first item']

Extension: XPath operators
XPath expressions can combine conditions and values with the operators or, and, mod, div, |, +, -, *, =, !=, <, <=, >, and >=.
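
A few of the XPath operators can be tried directly on the sample document. This is a small sketch covering only or, mod, and >; the expressions are illustrations chosen here, not taken from the original article.

```python
from lxml import etree

sample = """<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""
html = etree.HTML(sample)

# or: either class value is accepted.
either = html.xpath('//li[@class="item-0" or @class="item-1"]/a/text()')

# mod: keep the li nodes at odd positions (1, 3, 5).
odd = html.xpath('//li[position() mod 2 = 1]/a/text()')

# >: keep the li nodes after the third one.
tail = html.xpath('//li[position() > 3]/a/text()')

print(either)
print(odd)
print(tail)
```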

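13、Select in order

When a match returns several nodes and you need only one of them, such as the second or the last, pass a position or a position function in square brackets. Note that XPath positions start at 1, not 0: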

from lxml import etree
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
html = etree.HTML(text)
# Get the first one
result = html.xpath('//li[1]/a/text()')
print(result)
# Get the last one
result = html.xpath('//li[last()]/a/text()')
print(result)
# Get the first two
result = html.xpath('//li[position()<3]/a/text()')
print(result)
# Get the third from last
result = html.xpath('//li[last()-2]/a/text()')
print(result)
# Output:
# ['first item']
# ['fifth item']
# ['first item', 'second item']
# ['third item']

XPath provides more than 100 functions, covering accessor, numeric, logical, node, and sequence processing. For details, see: http://www.w3school.com.cn/xpath/xpath_functions.asp
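
Not every XPath expression returns a list of nodes. As a brief sketch, the XPath 1.0 functions count(), string(), boolean(), and starts-with(), all of which lxml supports, return a number, a string, a boolean, and a filtered node list respectively:

```python
from lxml import etree

sample = """<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""
html = etree.HTML(sample)

# count() returns a float, not a list.
li_count = html.xpath('count(//li)')

# string() returns the text of the first matching node as a str.
first_text = html.xpath('string(//li[1]/a)')

# boolean() reports whether anything matched.
has_inactive = html.xpath('boolean(//li[@class="item-inactive"])')

# starts-with() filters on the beginning of an attribute value.
item_1 = html.xpath('//li[starts-with(@class, "item-1")]/a/text()')

print(li_count, first_text, has_inactive, item_1)
```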

14、Node axis selection

XPath provides many node axes for selection, including child, parent, and ancestor elements:

from lxml import etree
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
html = etree.HTML(text)
# Get all ancestor nodes
result = html.xpath('//li[1]/ancestor::*')
print(result)
# Get only the div ancestor
result = html.xpath('//li[1]/ancestor::div')
print(result)
# Get all attribute values of the current node
result = html.xpath('//li[1]/attribute::*')
print(result)
# Get the direct child a node whose href is link1.html
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
# Get descendant span nodes only (the a node is not included)
result = html.xpath('//li[1]/descendant::span')
print(result)
# Get the second node that follows the current node in document order
result = html.xpath('//li[1]/following::*[2]')
print(result)
# Get all sibling nodes after the current node
result = html.xpath('//li[1]/following-sibling::*')
print(result)
# Output:
# [<Element html at 0x231a8965388>, <Element body at 0x231a8965308>, <Element div at 0x231a89652c8>, <Element ul at 0x231a89653c8>]
# [<Element div at 0x231a89652c8>]
# ['item-0']
# [<Element a at 0x231a89653c8>]
# [<Element span at 0x231a89652c8>]
# [<Element a at 0x231a89653c8>]
# [<Element li at 0x231a8965308>, <Element li at 0x231a8965408>, <Element li at 0x231a8965448>, <Element li at 0x231a8965488>]
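
The axes above all look forward or upward. As a complementary sketch, preceding-sibling looks backward; the expressions below are illustrations chosen here, not taken from the original article. On a backward axis, a position predicate such as [1] counts from the current node, so it picks the nearest sibling.

```python
from lxml import etree

sample = """<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""
html = etree.HTML(sample)

# All li siblings that come before the third li.
before = html.xpath('//li[3]/preceding-sibling::li/a/text()')

# [1] on a backward axis means the closest sibling, here the second li.
nearest = html.xpath('//li[3]/preceding-sibling::li[1]/a/text()')

print(before)
print(nearest)
```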

More references:
Axis usage: http://www.w3school.com.cn/xpath/xpath_axes.asp
XPath usage: http://www.w3school.com.cn/xpath/index.asp
Python lxml usage: http://lxml.de

Copyright notice
This article was written by [Steal the mask and run away]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/197/202207131738590693.html