How to extract dates from web pages?
2022-06-24 22:10:00 【Blue92120】
News body text can be extracted with relatively high accuracy, but the release time is extracted with regular expressions, so the results are sometimes unsatisfactory.
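To illustrate why a regex-only approach can be fragile, here is a minimal sketch using only the standard library. The HTML snippet, the pattern, and the function name `naive_publish_time` are all made up for illustration and are not taken from any real extractor: a naive pattern simply returns the first date-like string on the page, which may belong to a sidebar rather than to the article.

```python
import re

# Hypothetical page snippet: a sidebar date appears before the real release time.
html = """
<div class="sidebar">Hot topics updated 2021-12-31</div>
<h1>Example headline</h1>
<span class="time">2022-03-09 08:30:00</span>
"""

# A naive pattern of the kind a regex-based extractor might use (an assumption).
DATE_PATTERN = re.compile(r'\d{4}-\d{2}-\d{2}(?: \d{2}:\d{2}:\d{2})?')


def naive_publish_time(page):
    """Return the first date-like string found in the page, or None."""
    match = DATE_PATTERN.search(page)
    return match.group(0) if match else None


first_hit = naive_publish_time(html)
print(first_hit)  # the sidebar date, not the article's release time
```

The pattern itself is fine; the problem is that it has no notion of which date on the page is the release time.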
Recently I found a third-party Python library called htmldate. In my tests, it extracts the release time of news articles more accurately. Let's see how to use it.
First, install it with pip:
python3 -m pip install htmldate
Then fetch the page source with Requests or Selenium:
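import requests
from htmldate import find_date

# Download the page and decode the response body as UTF-8
html = requests.get('https://www.kingname.info/2022/03/09/this-is-gnelist/').content.decode('utf-8')
# Extract the release date from the HTML source
date = find_date(html)
print(date)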
The result is shown in the figure below:

The article was indeed published on March 9.

Next, let's look at a NetEase News article, "Encourage each other, enhance friendship (Wonderful bloom) | Paralympic Games | Chinese delegation | Snowboarding | Win gold _ NetEase government affairs" [2]. Its release time is shown in the figure below:

Now let's fetch its source with Requests and extract the release time:

The release date is correct, but what happened to the time of day? To keep the hours, minutes, and seconds, add the outputformat parameter. Its value is a format string of the kind you would pass to datetime.strftime:
find_date(html, outputformat='%Y-%m-%d %H:%M:%S')
The result, with the time of day included, is shown in the figure.
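Since outputformat uses datetime.strftime conventions, the two format strings can be compared using only the standard library. This is a minimal sketch, and the timestamp is a made-up value for illustration, not the real release time of any article above.

```python
from datetime import datetime

# Hypothetical release time, used only to illustrate the format strings.
published = datetime(2022, 3, 9, 8, 30, 0)

DATE_ONLY = '%Y-%m-%d'               # date only, like the first output above
DATE_AND_TIME = '%Y-%m-%d %H:%M:%S'  # the format passed to outputformat above

date_only = published.strftime(DATE_ONLY)
with_time = published.strftime(DATE_AND_TIME)
print(date_only)
print(with_time)

# A string in a known format can be parsed back into a datetime object.
parsed = datetime.strptime(with_time, DATE_AND_TIME)
print(parsed.hour, parsed.minute, parsed.second)
```

Parsing the string back with datetime.strptime and the same format string is convenient when the extracted time needs to be compared or stored as a datetime rather than as text.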

Besides the page source, find_date also accepts a URL or an lxml DOM object. For example:
from lxml.html import fromstring
selector = fromstring(html)
date = find_date(selector)