Best practices for CDN access logs: quality and performance monitoring, and operational statistical analysis
2022-06-24 01:12:00 【Log service CLS assistant】
Author: v god
Introduction: Cloud Log Service (CLS) is a one-stop log data platform provided by Tencent Cloud. It covers log collection, log storage, log search, chart analysis, monitoring and alarms, and log shipping, and helps users handle business operations and maintenance, service monitoring, log auditing, and similar scenarios through logs.
A CDN is a very important piece of Internet infrastructure. Through a CDN, users can quickly access images, videos, and other resources on the network. Serving these requests generates a large volume of log data, and analyzing CDN access logs yields a lot of useful information for CDN quality and performance analysis, troubleshooting, client distribution, and user behavior analysis.
Prerequisite: CDN logs are already being collected into CLS; see the operation details.
What is a CDN?
A content delivery network (CDN) is a new layer of network architecture added on top of the existing Internet, made up of high-performance acceleration nodes distributed around the world. These nodes store business content according to caching policies. When a user requests a piece of content, the request is scheduled to the node closest to the user, and that node responds directly, which effectively reduces access latency and improves availability.
Traditional CDN log analysis
At present, CDN service providers usually offer basic monitoring metrics in real time, such as request counts and bandwidth. In many specific analysis scenarios, however, these default real-time metrics do not meet users' customized analysis needs. Users therefore usually download the raw CDN logs and carry out in-depth analysis and mining offline.
For interactive analysis scenarios such as real-time problem location and quick verification, building your own offline analysis cluster not only incurs significant O&M, development, and labor costs, but also cannot guarantee that data is available in real time; delays of more than half an hour are common. For scenarios such as CDN log alarms and troubleshooting, this approach is inflexible and cannot respond quickly to real-time interactive queries.
CDN-to-CLS solution
Tencent Cloud CDN is integrated with CLS. Users can ship CDN data to CLS in real time and then use the search and SQL analysis capabilities of CLS to meet personalized real-time log analysis needs in different scenarios:
- One-click log shipping
- Second-level analysis over tens of billions of logs
- Real-time log visualization with dashboards
- Real-time alarms within one minute
CDN log introduction
CDN log field description
| Field name | Original log type | Log service type | Description |
|---|---|---|---|
| app_id | Integer | long | Tencent Cloud account APPID |
| client_ip | String | text | Client IP |
| file_size | Integer | long | File size |
| hit | String | text | Cache HIT / MISS; hits at either a CDN edge node or a parent node are marked as HIT |
| host | String | text | Domain name |
| http_code | Integer | long | HTTP status code |
| isp | String | text | ISP (carrier) |
| method | String | text | HTTP method |
| param | String | text | Parameters carried in the URL |
| proto | String | text | HTTP protocol identifier |
| prov | String | text | Province of the ISP |
| referer | String | text | Referer information, the HTTP source address |
| request_range | String | text | Range parameter, the requested range |
| request_time | Integer | long | Response time in milliseconds: the time from when the node receives the request until it has sent all response packets to the client |
| request_port | String | long | Port on which the client and the CDN node establish the connection; - if not available |
| rsp_size | Integer | long | Number of bytes returned |
| time | Integer | long | Request time, UNIX timestamp in seconds |
| ua | String | text | User-Agent information |
| url | String | text | Request path |
| uuid | String | text | Unique identifier of the request |
| version | Integer | long | CDN real-time log version |
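To make the field types in the table concrete, here is a minimal Python sketch of a typed record covering a subset of the fields. The class name CdnLogRecord and the helper parse_record are hypothetical, and the sketch assumes each log entry arrives as a dictionary of string values; neither is part of CLS.

```python
from dataclasses import dataclass


@dataclass
class CdnLogRecord:
    """A subset of the CDN log fields listed in the table above."""
    host: str
    url: str
    client_ip: str
    http_code: int      # HTTP status code (long in CLS)
    hit: str            # cache result (text in CLS)
    request_time: int   # response time in milliseconds
    rsp_size: int       # number of bytes returned
    time: int           # request time, UNIX timestamp in seconds


def parse_record(raw: dict) -> CdnLogRecord:
    """Convert a raw entry whose values are all strings into typed fields."""
    return CdnLogRecord(
        host=raw["host"],
        url=raw["url"],
        client_ip=raw["client_ip"],
        http_code=int(raw["http_code"]),
        hit=raw["hit"],
        request_time=int(raw["request_time"]),
        rsp_size=int(raw["rsp_size"]),
        time=int(raw["time"]),
    )
```

Converting the numeric fields up front mirrors the long type used on the CLS side, which is what allows comparisons such as http_code >= 400 and sums over rsp_size in the analysis statements.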
1. CDN quality monitoring
Scenario 1: alarm when CDN access latency exceeds a threshold
Using a percentile from statistics (for example, the 99th-percentile latency) as the alarm trigger is more accurate than using the average or individual values: an average smooths away the latency of individual slow requests and fails to reflect the real situation. For example, the following query statement computes, over a one-day window (1440 minutes), the average latency, the 50th-percentile latency, and the 99th-percentile latency for each 5-minute interval.
* | select avg(request_time) as l, approx_percentile(request_time, 0.5) as p50, approx_percentile(request_time, 0.99) as p99, time_series(__TIMESTAMP__, '5m', '%Y-%m-%d %H:%i:%s', '0') as time group by time order by time desc limit 1440
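The difference between an average and a high percentile can be illustrated with a small Python sketch. It uses an exact nearest-rank percentile, whereas approx_percentile in the statement above returns an approximation; the sample latencies are made up for illustration.

```python
import math
from statistics import mean


def percentile(values, q):
    """Nearest-rank percentile: the smallest value with at least a fraction q of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request_time samples in milliseconds:
# 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [3000] * 2

avg = mean(latencies_ms)               # the slow requests are averaged away
p50 = percentile(latencies_ms, 0.50)   # the typical request
p99 = percentile(latencies_ms, 0.99)   # the slow tail
```

With a 100 ms threshold, the average of this sample (79.6 ms) stays below the threshold, while the 99th percentile (3000 ms) exceeds it, so only the percentile-based condition would report the slow requests.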
To alarm when the 99th-percentile latency exceeds 100 ms, and to show the affected domain name, url, and client_ip in the alarm message so that the error can be identified quickly, configure the alarm as follows:
* | select approx_percentile(request_time, 0.99) as p99
With multi-dimensional analysis configured, the alarm message shows the affected domain name, client IP, and url, which helps developers locate the problem quickly.
Once the alarm fires, the key information is delivered immediately through WeChat, WeCom (Enterprise WeChat), or SMS.
Scenario 2: alarm on a surge in resource access errors, notifying users when the period-over-period increase exceeds a threshold
A surge in page access errors often indicates that the back-end server behind the CDN has failed or that requests are overloading it. We can configure an alarm that monitors the period-over-period increase in the number of request errors within a time range (e.g., one minute) and notifies users when the increase exceeds a threshold.
Number of errors in the last minute
* | select * from (select * from (select * from (select date_trunc('minute', __TIMESTAMP__) as time,count(*) as errct where http_code>=400 group by time order by time desc limit 2)) order by time desc limit 1)
Number of errors in the minute before that
* | select * from (select * from (select * from (select date_trunc('minute', __TIMESTAMP__) as time,count(*) as errct where http_code>=400 group by time order by time desc limit 2)) order by time asc limit 1)
The trigger condition in the alarm policy is [number of errors in the last minute] - [number of errors in the minute before that] > the specified threshold:
$1.errct-$2.errct >100
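The comparison behind this alarm can be sketched in a few lines of Python. This is only an illustration of the logic, not how CLS evaluates alarms; the function names and the input shape (pairs of UNIX timestamp and status code) are assumptions.

```python
from collections import Counter


def errors_per_minute(records):
    """Count requests with http_code >= 400, keyed by the start of their minute (UNIX seconds)."""
    counts = Counter()
    for ts, http_code in records:
        if http_code >= 400:
            counts[ts - ts % 60] += 1
    return counts


def surge_detected(counts, threshold):
    """Compare the two most recent minutes that contain errors, as the two statements above do."""
    minutes = sorted(counts)
    if len(minutes) < 2:
        return False  # nothing to compare against yet
    previous, latest = minutes[-2], minutes[-1]
    return counts[latest] - counts[previous] > threshold


# Hypothetical traffic: 5 errors in the first minute, 150 in the second.
records = [(0, 500)] * 5 + [(60, 404)] * 150 + [(61, 200)] * 10
counts = errors_per_minute(records)
alarm = surge_detected(counts, threshold=100)
```

As in the statements above, minutes without any errors produce no row, so the comparison is between the two most recent minutes that actually contain errors.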
2. CDN quality and performance analysis
CDN logs contain rich information, and we can view the overall quality and performance of the CDN from multiple dimensions:
- Health
- Cache hit rate
- Average download speed
- Download count, download traffic, and speed by ISP
- Request latency distribution
Health
The percentage of all requests whose http_code is less than 500.
* | select round(sum(case when http_code<500 then 1.00 else 0.00 end) / cast(count(*) as double) * 100,1) as " Health "
Cache hit rate
Among requests whose http_code is less than 400, the percentage whose hit field is "hit".
http_code<400 | select round(sum(case when hit='hit' then 1.00 else 0.00 end) / cast(count(*) as double) * 100,1) as " cache hit rate "
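Both the health and the cache hit rate statements reduce to "matching requests divided by total requests, times 100, rounded to one decimal place". A minimal Python sketch of the same arithmetic follows; the helper names and the sample records are hypothetical.

```python
def percent(part, total):
    """Percentage rounded to one decimal place; 0.0 when there are no requests."""
    return round(part / total * 100, 1) if total else 0.0


def health(records):
    """Share of all requests whose status code is below 500."""
    ok = sum(1 for r in records if r["http_code"] < 500)
    return percent(ok, len(records))


def cache_hit_rate(records):
    """Share of cache hits among requests whose status code is below 400."""
    served = [r for r in records if r["http_code"] < 400]
    hits = sum(1 for r in served if r["hit"] == "hit")
    return percent(hits, len(served))


# Hypothetical sample of four requests.
sample = [
    {"http_code": 200, "hit": "hit"},
    {"http_code": 200, "hit": "miss"},
    {"http_code": 404, "hit": "miss"},
    {"http_code": 502, "hit": "miss"},
]
```

For this sample, health(sample) is 75.0 because only the 502 counts against it, and cache_hit_rate(sample) is 50.0 because the 404 and 502 responses are excluded from the denominator before hits are counted.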
Average download speed
Over a period of time, the average download speed is the total download volume divided by the total time taken.
* | select sum(rsp_size/1024.0) / sum(request_time/1000.0) as " Average download speed (KB/s)"
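The statement divides one sum by another rather than averaging per-request speeds. The Python sketch below shows the same calculation with explicit unit conversions (bytes to KB, milliseconds to seconds); the function name, the guard against a zero total time, and the sample data are assumptions added for illustration.

```python
def average_speed_kb_per_s(records):
    """Total kilobytes divided by total seconds, as in the statement above."""
    total_kb = sum(r["rsp_size"] / 1024.0 for r in records)
    total_s = sum(r["request_time"] / 1000.0 for r in records)
    if total_s == 0:
        return None  # no measurable transfer time; avoid dividing by zero
    return total_kb / total_s


# Hypothetical sample: one large slow download and one small fast one.
sample = [
    {"rsp_size": 10 * 1024 * 1024, "request_time": 100_000},  # 10 MB in 100 s
    {"rsp_size": 100 * 1024, "request_time": 100},            # 100 KB in 0.1 s
]
speed = average_speed_kb_per_s(sample)
```

For this sample the result is about 103 KB/s, dominated by the long transfer. Averaging the two per-request speeds (102.4 KB/s and 1000 KB/s) would instead give about 551 KB/s, which overstates what most of the transferred bytes experienced.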
Download count, download traffic, and speed by ISP
The principle is the same as above. The ip_to_provider function converts client_ip to the corresponding ISP.
* | select ip_to_provider(client_ip) as isp , sum(rsp_size)* 1.0 /(sum(request_time)+1) as " Download speed (KB/s)" , sum(rsp_size/1024.0/1024.0) as " Total downloads (MB)", count(*) as c group by isp order by c desc limit 10
Request latency distribution
Count requests by latency window. Choose window boundaries that suit the actual behavior of the application.
* | select case when request_time < 5000 then '~5s' when request_time < 6000 then '5s~6s' when request_time < 7000 then '6s~7s' when request_time < 8000 then '7~8s' when request_time < 10000 then '8~10s' when request_time < 15000 then '10~15s' else '15s~' end as latency , count(*) as count group by latency
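The CASE expression above is a lookup into a sorted list of upper bounds. A minimal Python sketch of the same bucketing, using the boundaries and labels from the statement, could look like this (the function names are hypothetical):

```python
from bisect import bisect_right
from collections import Counter

# Upper bounds in milliseconds and labels, mirroring the CASE expression above.
BOUNDS_MS = [5000, 6000, 7000, 8000, 10000, 15000]
LABELS = ["~5s", "5s~6s", "6s~7s", "7~8s", "8~10s", "10~15s", "15s~"]


def latency_bucket(request_time_ms):
    """Return the label of the window that contains the given latency."""
    return LABELS[bisect_right(BOUNDS_MS, request_time_ms)]


def latency_histogram(latencies_ms):
    """Count requests per latency window."""
    return Counter(latency_bucket(ms) for ms in latencies_ms)


# Hypothetical latencies in milliseconds.
hist = latency_histogram([100, 5500, 5600, 16000])
```

Keeping the boundaries in one list makes it easy to adjust the windows to the application: changing BOUNDS_MS and LABELS together is the equivalent of editing the WHEN branches in the statement.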
3. CDN troubleshooting
Requirements
Access errors are a major factor in service experience. When errors occur, we need to quickly determine the current error QPS and error ratio, which domain names and URIs are most affected, whether the errors are related to region or ISP, and whether they were caused by a new release.
Solution
View the distribution of 4xx and 5xx error codes
* | select http_code , count(*) as c where http_code >= 400 group by http_code order by c desc
In the error distribution chart, the main error is 404, which means that the requested file or content does not exist. In this case, check whether the resource has been deleted or corrupted.
For requests with http_code >= 400, we perform multi-dimensional analysis: for example, rank the top domain names and URIs, check error counts by province and ISP, and view the client distribution.
Domain name
* | select host , count(*) as count where http_code >= 400 group by host order by count desc limit 10
URL
* | select url , count(*) as count where http_code >= 400 group by url order by count desc limit 10
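The domain name and URL rankings are the same top-N count applied to different fields. A minimal Python sketch of that pattern follows, treating status codes of 400 and above as errors; the function name and the sample records are hypothetical.

```python
from collections import Counter


def top_errors(records, key, n=10):
    """Return the n most frequent values of `key` among requests with http_code >= 400."""
    counts = Counter(r[key] for r in records if r["http_code"] >= 400)
    return counts.most_common(n)


# Hypothetical sample records.
sample = [
    {"host": "a.example.com", "url": "/img/1.png", "http_code": 404},
    {"host": "a.example.com", "url": "/img/2.png", "http_code": 404},
    {"host": "b.example.com", "url": "/img/1.png", "http_code": 502},
    {"host": "b.example.com", "url": "/index.html", "http_code": 200},
]
by_host = top_errors(sample, "host", n=1)
by_url = top_errors(sample, "url", n=1)
```

Passing a different key ("host", "url", "ua", and so on) corresponds to changing the group by column in the statements, which is why the same query shape is reused for each dimension.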
Province and ISP analysis
* | select client_ip, ip_to_province(client_ip) as "province", ip_to_provider(client_ip) as "ISP" , count(*) as "Error count" where http_code >= 400 group by client_ip order by "Error count" DESC limit 100
Client distribution
* | select ua as "Client version", count(*) as "Error count" where http_code >= 400 group by ua order by "Error count" desc limit 10
As the figure shows, the errors are concentrated in the Safari client. After the problem was located, the cause turned out to be a bug in the new version that made resource access fail frequently in the Safari browser.
4. User behavior analysis
Requirements
- Where most users come from, and whether they are internal or external
- Which resources are popular with users
- Whether any users are downloading resources excessively, and whether that behavior is as expected
Solution
Access source analysis
* | select ip_to_province(client_ip) as province , count(*) as c group by province order by c desc limit 50
Top URLs by visits
http_code < 400 | select url ,count(*) as " Number of visits ", round(sum(rsp_size)/1024.0/1024.0/1024.0, 2) as " Total downloads (GB)" group by url order by " Number of visits " desc limit 100
Top domain names by download traffic, ranked by the download volume of each domain name
* | select host, sum(rsp_size/1024) as " Total downloads " group by host order by " Total downloads " desc limit 100
Top users by download volume
* | SELECT CASE WHEN ip_to_country(client_ip)=' Hong Kong ' THEN concat(client_ip, ' ( Hong Kong )') WHEN ip_to_province(client_ip)='' THEN concat(client_ip, ' ( Unknown IP )') WHEN ip_to_provider(client_ip)=' Intranet IP' THEN concat(client_ip, ' (Private IP )') ELSE concat(client_ip, ' ( ', ip_to_country(client_ip), '/', ip_to_province(client_ip), '/', if(ip_to_city(client_ip)='-1', 'Unknown city', ip_to_city(client_ip)), ' ',ip_to_provider(client_ip), ' )') END AS client, pv as " Total visits ", error_count as " Number of bad accesses " , throughput as " Total downloads (GB)" from (select client_ip , count(*) as pv, round(sum(rsp_size)/1024.0/1024/1024.0, 1) AS throughput , sum(if(http_code >= 400, 1, 0)) AS error_count from log group by client_ip order by throughput desc limit 100)
Top users by successful visits
* | SELECT CASE WHEN ip_to_country(client_ip)=' Hong Kong ' THEN concat(client_ip, ' ( Hong Kong )') WHEN ip_to_province(client_ip)='' THEN concat(client_ip, ' ( Unknown IP )') WHEN ip_to_provider(client_ip)=' Intranet IP' THEN concat(client_ip, ' (Private IP )') ELSE concat(client_ip, ' ( ', ip_to_country(client_ip), '/', ip_to_province(client_ip), '/', if(ip_to_city(client_ip)='-1', 'Unknown city', ip_to_city(client_ip)), ' ',ip_to_provider(client_ip), ' )') END AS client, pv as " Total visits ", (pv - success_count) as " Number of bad accesses " , throughput as " Total downloads (GB)" from (select client_ip , count(*) as pv, round(sum(rsp_size)/1024.0/1024/1024.0, 1) AS throughput , sum(if(http_code < 400, 1, 0)) AS success_count from log group by client_ip order by success_count desc limit 100)
PV and UV statistics: the trend of the number of visits and the number of unique client IPs over a period of time
* | select date_trunc('minute', __TIMESTAMP__) as time, count(*) as pv,count( distinct client_ip) as uv group by time order by time limit 1000
That covers how CDN access logs can be put to use. If you have more interesting logging practices, you are welcome to contribute and share them!