XPath ultra detailed summary
2022-07-16 07:55:00 【Steal the mask and run away】
XPath
XPath, short for XML Path Language, is a language for finding information in XML documents. It was originally designed for searching XML documents, but it works equally well for searching HTML documents, so a web crawler can use XPath to extract the information it needs.
1、XPath overview
XPath's selection capabilities are very powerful, and its path-selection expressions are very concise. It also provides more than 100 built-in functions for matching strings, numbers, and times and for processing nodes and sequences. Almost any node you want to locate can be selected with XPath.
Official documentation: https://www.w3.org/TR/xpath/
2、XPath Common rules

The rules used most often in this article are / (direct children), // (descendants), .. (the parent node), and @ (attributes). A typical XPath rule looks like this:
//title[@lang='eng']
This rule selects all nodes named title whose lang attribute has the value eng. The sections below use Python's lxml library to parse HTML with XPath.
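To make the rule concrete, here is a minimal sketch that runs it with lxml. The bookstore document is a hypothetical sample made up for this illustration, not something from the article.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration.
xml = """
<bookstore>
  <book><title lang="eng">Harry Potter</title></book>
  <book><title lang="fr">Le Petit Prince</title></book>
  <title lang="eng">Learning XML</title>
</bookstore>
"""
root = etree.fromstring(xml)

# // searches at any depth; [@lang='eng'] keeps only matching titles.
titles = root.xpath("//title[@lang='eng']/text()")
print(titles)  # ['Harry Potter', 'Learning XML']
```

The title with lang="fr" is skipped, and the match does not depend on how deeply a title is nested.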
3、 install
Windows (Python 3 environment): pip install lxml
Linux: pip install lxml
4、 Introductory example
from lxml import etree
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
# Build a parse tree; lxml repairs the unclosed li automatically
html = etree.HTML(text)
# tostring() returns bytes, so decode it to str before printing
result = etree.tostring(html)
print(result.decode('utf-8'))
First import the etree module from the lxml library, then declare a piece of HTML text and pass it to the HTML class for initialization, which constructs an object that can be queried with XPath. Note that the last li node in the HTML text is not closed; the etree module corrects the HTML automatically. Calling tostring() outputs the corrected HTML, but the result is of type bytes, so use decode() to convert it to str. The result is as follows:
<html><body><div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li></ul>
</div>
</body></html>
After processing, the closing tag of the li node has been completed, and the body and html nodes have been added automatically. You can also read a text file directly and parse it:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))
test.html contains the HTML code from the example above:
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
This time the output differs slightly: it has an extra DOCTYPE declaration (omitted from the listing below), which has no effect on parsing. The result is as follows:
<html><body><div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li></ul>
</div></body></html>
Using XPath in Python
from lxml import etree
# Method 1: parse an HTML string directly in Python code
text = """ <div> <ul> <li class="item-0"><a href="link1.html">first item</a></li> .... </ul> </div> """
resp_html = etree.HTML(text)
# Method 2: read an HTML file and parse it
html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))
5、 All nodes
Use an XPath rule that begins with // to select all nodes that meet the requirement:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)
# Output
""" [<Element html at 0x1d6610ebe08>, <Element body at 0x1d6610ebf08>, <Element div at 0x1d6610ebf48>, <Element ul at 0x1d6610ebf88>, <Element li at 0x1d6610ebfc8>, <Element a at 0x1d661115088>, <Element li at 0x1d6611150c8>, <Element a at 0x1d661115108>, <Element li at 0x1d661115148>, <Element a at 0x1d661115048>, <Element li at 0x1d661115188>, <Element a at 0x1d6611151c8>, <Element li at 0x1d661115208>, <Element a at 0x1d661115248>] """
Here * matches all nodes. The result is a list in which every element is of type Element, followed by the node name.
You can also match a specific node name:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li')
print(result)
print(result[0])
# Output
[<Element li at 0x23fb219af08>, <Element li at 0x23fb219af48>, <Element li at 0x23fb219af88>, <Element li at 0x23fb219afc8>, <Element li at 0x23fb21c5048>]
<Element li at 0x23fb219af08>
To take out one of the objects, index into the list directly, for example result[0].
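The printed Element objects only show memory addresses. As a small sketch of how to inspect them, the example below reads each element's tag and one attribute through lxml's tag property and get() method. The shortened sample text is an assumption made to keep the example self-contained.

```python
from lxml import etree

# Shortened copy of the sample markup, used here for illustration only.
text = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul>
</div>
"""
html = etree.HTML(text)

# Every matched node is an Element; .tag gives its node name.
tags = [el.tag for el in html.xpath('//*')]
print(tags)

# Index into the result list, then read an attribute with .get().
first_li = html.xpath('//li')[0]
print(first_li.tag, first_li.get('class'))
```

The tag list starts with html and body because lxml wraps the fragment in those nodes during parsing.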
6、 Child node
Use / or // to find child nodes or descendant nodes. To select all a nodes that are direct children of li nodes:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)
Here / selects direct children. To get all descendants, replace / with //.
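The difference between / and // is easiest to see on a query where they disagree. The sketch below uses a shortened copy of the sample markup (an assumption for self-containment): no a node is a direct child of ul, so only the // form finds anything.

```python
from lxml import etree

# Shortened copy of the sample markup, used here for illustration only.
text = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul>
</div>
"""
html = etree.HTML(text)

# / only looks one level down: a is a grandchild of ul, so nothing matches.
direct = html.xpath('//ul/a')
# // looks at every depth below ul, so both a nodes are found.
nested = html.xpath('//ul//a')

print(len(direct), len(nested))  # 0 2
```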
7、 Parent node
To find the parent of a known child node, use the .. step or the parent:: axis:
# Get the class attribute of the parent of the a node whose href attribute is link4.html
# Method 1
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)
# Method 2
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)
# Output: ['item-1']
8、 Attribute matching
Use the @ symbol to filter by attribute when matching:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-inactive"]')
print(result)
# Output: [<Element li at 0x2089793a3c8>]
9、 Text acquisition
There are two methods: the first is to select the node that directly contains the text and then get its text, and the second is to use //.
# Method 1
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)
# Method 2
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]//text()')
print(result)
The second method also returns special characters, such as the line break that ends up inside the auto-completed li node, so the first method is recommended: it keeps the results clean.
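The two methods can be compared side by side. The sketch below uses a shortened copy of the sample markup in which the last li is left unclosed on purpose, so the line break before the closing ul tag becomes text inside that li. The cleaned list is an extra step added for illustration, not part of the original article.

```python
from lxml import etree

# Shortened sample; the last li is deliberately left unclosed.
text = """<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>"""
html = etree.HTML(text)

# Method 1: go to the a node first, then take its text.
direct = html.xpath('//li[@class="item-0"]/a/text()')
# Method 2: take every text node anywhere below li.
nested = html.xpath('//li[@class="item-0"]//text()')
# If method 2 is needed, whitespace-only strings can be filtered out.
cleaned = [s.strip() for s in nested if s.strip()]

print(direct)
print(nested)
print(cleaned)
```

Method 2 returns one extra whitespace-only string from the repaired li, which is why the results look untidy unless they are filtered.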
10、 Property acquisition
In XPath syntax, the @ symbol acts as a filter inside a predicate; used as a path step, it returns the attribute value of the node directly:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)
# Output: ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
11、 Attribute multi value matching
Sometimes an attribute of a node has more than one value. An exact match on the attribute then fails, and the contains() function is needed:
from lxml import etree
text = ''' <li class="li li-first"><a href="link.html">first item</a></li> '''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)
# Output: ['first item']
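A short sketch of why contains() is needed here, and of one caveat: contains() is a substring test, so it also matches class names that merely include the letters "li". The second li with class "list-item" is a hypothetical addition for this illustration, and the concat/normalize-space expression is a commonly used XPath 1.0 idiom for matching a whole class token, not something taken from the article.

```python
from lxml import etree

# The second li is a made-up addition to show the substring pitfall.
text = '''
<ul>
<li class="li li-first"><a href="link.html">first item</a></li>
<li class="list-item"><a href="other.html">other item</a></li>
</ul>
'''
html = etree.HTML(text)

# Exact match fails: the attribute value is "li li-first", not "li".
exact = html.xpath('//li[@class="li"]/a/text()')
# Substring match succeeds, but also picks up "list-item".
loose = html.xpath('//li[contains(@class, "li")]/a/text()')
# Whole-token match: pad the value with spaces and look for " li ".
token = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(exact)  # []
print(loose)  # ['first item', 'other item']
print(token)  # ['first item']
```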
12、 Multiple attribute matching
When a node must be matched on several attributes at the same time, join the conditions with and:
from lxml import etree
text = ''' <li class="li li-first" name="item"><a href="link.html">first item</a></li> '''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)
# Output: ['first item']
Extension: XPath operators
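As a rough illustration of a few XPath operators beyond and, the sketch below tries or, a numeric comparison combined with and, mod, and the union operator |. The three-item sample is a shortened, well-formed copy of the article's markup, assumed here for self-containment.

```python
from lxml import etree

# Shortened, well-formed copy of the sample markup (illustration only).
text = '''
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)

# or: either condition may hold.
either = html.xpath('//li[@class="item-0" or @class="item-inactive"]/a/text()')
# Comparison operators combined with and: positions 2 and 3.
later = html.xpath('//li[position() > 1 and position() <= 3]/a/text()')
# mod: every odd-numbered li.
odd = html.xpath('//li[position() mod 2 = 1]/a/text()')
# |: union of two node sets, returned in document order.
union = html.xpath('//li[3]/a/text() | //li[1]/a/text()')

print(either)  # ['first item', 'third item']
print(later)   # ['second item', 'third item']
print(odd)     # ['first item', 'third item']
print(union)   # ['first item', 'third item']
```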
13、 Select in order
When a match returns multiple nodes and you need a particular one, such as the second or the last, select it by position:
from lxml import etree
text = ''' <div> <ul> <li class="item-0"><a href="link1.html">first item</a></li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-inactive"><a href="link3.html">third item</a></li> <li class="item-1"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a> </ul> </div> '''
html = etree.HTML(text)
# Get the first one
result = html.xpath('//li[1]/a/text()')
print(result)
# Get the last one
result = html.xpath('//li[last()]/a/text()')
print(result)
# Get the first two
result = html.xpath('//li[position()<3]/a/text()')
print(result)
# Get the third from last (last() is the fifth, so last()-2 is the third)
result = html.xpath('//li[last()-2]/a/text()')
print(result)
""" Output: ['first item'] ['fifth item'] ['first item', 'second item'] ['third item'] """
XPath provides more than 100 functions, including accessor, numeric, logical, node, and sequence functions. For details on what each one does, see: http://www.w3school.com.cn/xpath/xpath_functions.asp
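Note that lxml evaluates XPath 1.0, so only the XPath 1.0 subset of the functions on that reference page is available. The sketch below tries a few of them (count, string, starts-with, contains) on a shortened copy of the sample markup, which is an assumption made for self-containment.

```python
from lxml import etree

# Shortened, well-formed copy of the sample markup (illustration only).
text = '''
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)

# count() returns a number (a Python float), not a list.
n_items = html.xpath('count(//li)')
# string() returns the text content of the first matched node as a str.
first_text = html.xpath('string(//li[1]/a)')
# starts-with() tests the beginning of a string value.
hrefs = html.xpath('//li[starts-with(@class, "item-")]/a/@href')
# contains() can also be applied to text nodes.
second = html.xpath('//a[contains(text(), "second")]/@href')

print(n_items)     # 3.0
print(first_text)  # first item
print(hrefs)       # ['link1.html', 'link2.html', 'link3.html']
print(second)      # ['link2.html']
```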
14、 Node axis selection
XPath provides many node axes for selection, including the child, parent, and ancestor axes:
from lxml import etree
text = ''' <div> <ul> <li class="item-0"><a href="link1.html"><span>first item</span></a></li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-inactive"><a href="link3.html">third item</a></li> <li class="item-1"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a> </ul> </div> '''
html = etree.HTML(text)
# Get all ancestor nodes
result = html.xpath('//li[1]/ancestor::*')
print(result)
# Get the div ancestor node
result = html.xpath('//li[1]/ancestor::div')
print(result)
# Get all attribute values of the current node
result = html.xpath('//li[1]/attribute::*')
print(result)
# Get direct child a nodes whose href attribute is link1.html
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
# Get descendant nodes, restricted to span nodes (a nodes are excluded)
result = html.xpath('//li[1]/descendant::span')
print(result)
# Get the second node among all nodes that follow the current node
result = html.xpath('//li[1]/following::*[2]')
print(result)
# Get all sibling nodes after the current node
result = html.xpath('//li[1]/following-sibling::*')
print(result)
""" [<Element html at 0x231a8965388>, <Element body at 0x231a8965308>, <Element div at 0x231a89652c8>, <Element ul at 0x231a89653c8>] [<Element div at 0x231a89652c8>] ['item-0'] [<Element a at 0x231a89653c8>] [<Element span at 0x231a89652c8>] [<Element a at 0x231a89653c8>] [<Element li at 0x231a8965308>, <Element li at 0x231a8965408>, <Element li at 0x231a8965448>, <Element li at 0x231a8965488>] """
Further references:
Axes: http://www.w3school.com.cn/xpath/xpath_axes.asp
XPath usage: http://www.w3school.com.cn/xpath/index.asp
Python lxml usage: http://lxml.de
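As a complement to the axis queries in section 14, the sketch below prints tag names and attribute values instead of Element addresses, which makes it easier to see what each axis selects. The tags() helper is hypothetical, written only for this illustration, and the sample is a shortened copy of the article's markup.

```python
from lxml import etree

# Shortened, well-formed copy of the section 14 sample (illustration only).
text = '''
<div><ul>
<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
</ul></div>
'''
html = etree.HTML(text)


def tags(expr):
    """Hypothetical helper: run an XPath query and return the tag names."""
    return [el.tag for el in html.xpath(expr)]


ancestors = tags('//li[1]/ancestor::*')
descendants = tags('//li[1]/descendant::*')
second_following = tags('//li[1]/following::*[2]')
after = html.xpath('//li[1]/following-sibling::*/@class')
before = html.xpath('//li[3]/preceding-sibling::li/@class')

print(ancestors)         # ['html', 'body', 'div', 'ul']
print(descendants)       # ['a', 'span']
print(second_following)  # ['a']
print(after)             # ['item-1', 'item-inactive']
print(before)            # ['item-0', 'item-1']
```

The preceding-sibling axis is not used in the article; it is included here only as the mirror image of following-sibling.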