CDN access logs: best practices for quality and performance monitoring and operational statistics analysis
2022-06-24 01:12:00 【Log service CLS assistant】
Author: v god
Introduction: Cloud Log Service (CLS) is a cloud-native, one-stop log data platform provided by Tencent Cloud. It covers log collection, log storage, log search, chart analysis, monitoring and alarms, log shipping and other services, and helps users handle scenarios such as business operations and maintenance, service monitoring, and log auditing.
CDN is a very important piece of Internet infrastructure. Through a CDN, users can quickly access images, video and other resources on the network. These visits generate a large volume of log data, and analyzing CDN access logs yields a lot of useful information for CDN quality and performance analysis, troubleshooting, client distribution analysis, and user behavior analysis.
Prerequisite: CDN logs have already been collected into CLS; see the operation details.
What is CDN?
A content delivery network (CDN) is a new layer of network architecture added on top of the existing Internet, made up of high-performance acceleration nodes distributed around the world. These nodes store business content according to caching policies. When a user requests a piece of content, the request is scheduled to the node closest to the user, and that node responds directly, which reduces access latency and improves availability.
Traditional CDN log analysis
Today, CDN providers usually offer basic real-time monitoring metrics such as request count and bandwidth. In many specific analysis scenarios, however, these default metrics do not meet users' customized needs, so users typically download the raw CDN logs and analyze them in depth offline.
For interactive scenarios such as real-time problem location and fast verification, a self-built offline analysis cluster not only incurs substantial operations, development and labor costs, but also cannot guarantee data freshness: delays of more than half an hour are common. For scenarios such as CDN log alerting and troubleshooting, it is also inflexible and cannot respond quickly to real-time interactive queries.
CDN to CLS solution
Tencent Cloud CDN is integrated with CLS. Users can ship CDN data to CLS in real time and then use CLS search and SQL analysis to meet personalized, real-time log analysis needs in different scenarios:
- One-click log shipping
- Second-level analysis over tens of billions of logs
- Real-time log visualization with dashboards
- Real-time alarms within one minute
CDN Log Introduction
CDN Log field description
| Field name | Original log type | Log service type | Description |
|---|---|---|---|
| app_id | Integer | long | Tencent Cloud account APPID |
| client_ip | String | text | Client IP |
| file_size | Integer | long | File size |
| hit | String | text | Cache HIT / MISS. Hits at a CDN edge node or at a parent node are both marked as HIT |
| host | String | text | Domain name |
| http_code | Integer | long | HTTP status code |
| isp | String | text | ISP (carrier) |
| method | String | text | HTTP method |
| param | String | text | Parameters carried in the URL |
| proto | String | text | HTTP protocol identifier |
| prov | String | text | Province of the ISP |
| referer | String | text | Referer information: the HTTP source address |
| request_range | String | text | Range parameter: the requested range |
| request_time | Integer | long | Response time in milliseconds: the time from when the node receives the request until it has finished sending all response packets to the client |
| request_port | String | long | The port on which the client connects to the CDN node; `-` if there is none |
| rsp_size | Integer | long | Number of bytes returned |
| time | Integer | long | Request time as a UNIX timestamp, in seconds |
| ua | String | text | User-Agent information |
| url | String | text | Request path |
| uuid | String | text | Unique identifier of the request |
| version | Integer | long | CDN real-time log version |
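As a rough illustration of how these fields map to typed values, the Python sketch below converts one parsed log record (a dict of strings) into the types listed in the table. The helper name `coerce_record` and the sample values are hypothetical; only the field names and types come from the table.

```python
from datetime import datetime, timezone

# Fields the table lists with log service type `long`; everything else stays text.
# request_port is skipped here because it may be "-" when no port is available.
LONG_FIELDS = ("app_id", "file_size", "http_code", "request_time",
               "rsp_size", "time", "version")


def coerce_record(raw: dict) -> dict:
    """Return a copy of `raw` with the numeric fields converted to int."""
    record = dict(raw)
    for name in LONG_FIELDS:
        if name in record:
            record[name] = int(record[name])
    return record


# Hypothetical sample record, not taken from a real log.
sample = coerce_record({
    "host": "example.com", "http_code": "200", "hit": "hit",
    "request_time": "35", "rsp_size": "2048", "time": "1656000000",
})
# `time` is a UNIX timestamp in seconds; convert it to a UTC datetime.
requested_at = datetime.fromtimestamp(sample["time"], tz=timezone.utc)
print(sample["http_code"], sample["request_time"], requested_at.isoformat())
```

Keeping `request_time` and `rsp_size` as integers is what makes the aggregations later in this article (percentiles, sums, ratios) straightforward.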
1. CDN quality monitoring
Scenario 1: Alarm when CDN access latency exceeds a threshold
Using a percentile from statistics (for example, the 99th-percentile latency) as the alarm trigger is more accurate than using the average or individual values: with an average, a few slow requests get averaged away and the real situation is hidden. For example, the following query computes, over a one-day window (1440 minutes), the average latency, the 50th-percentile latency and the 99th-percentile latency for each 5-minute interval.
* | select avg(request_time) as l, approx_percentile(request_time, 0.5) as p50, approx_percentile(request_time, 0.99) as p99, time_series(__TIMESTAMP__, '5m', '%Y-%m-%d %H:%i:%s', '0') as time group by time order by time desc limit 1440
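To make the point about percentiles versus averages concrete, here is a minimal Python sketch with made-up latencies. It uses an exact nearest-rank percentile, whereas `approx_percentile` in the query above is an approximation; the function name `percentile` and the sample data are assumptions for illustration only.

```python
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    # Ceiling division with integers avoids floating-point rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [3000] * 2

print("avg:", mean(latencies_ms))
print("p50:", percentile(latencies_ms, 50))
print("p99:", percentile(latencies_ms, 99))
```

In this sample the average stays below a 100 ms threshold even though 2% of requests take 3 seconds, while the 99th percentile exposes the slow tail.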
Raise an alarm when the 99th-percentile latency exceeds 100 ms, and show the affected domain name, url and client_ip in the alarm message so that the error can be pinpointed quickly. The alarm query is as follows:
* | select approx_percentile(request_time, 0.99) as p99
By configuring multi-dimensional analysis, the alarm message shows the affected domain name, client IP and URL, helping developers locate the problem quickly.
Once the alarm is triggered, the key information is delivered immediately through WeChat, WeCom (Enterprise WeChat), or SMS.
Scenario 2: Alarm on a surge in resource access errors, notifying the user when the increase over the previous period exceeds a threshold
A surge in page access errors often indicates that the back-end server behind the CDN has failed or that requests are overloading it. We can configure an alarm that monitors the increase in request errors over a time window (e.g. one minute) and notifies the user when the increase over the previous window exceeds a threshold.
Number of errors in the most recent minute
* | select * from (select * from (select * from (select date_trunc('minute', __TIMESTAMP__) as time,count(*) as errct where http_code>=400 group by time order by time desc limit 2)) order by time desc limit 1)
Number of errors in the previous minute
* | select * from (select * from (select * from (select date_trunc('minute', __TIMESTAMP__) as time,count(*) as errct where http_code>=400 group by time order by time desc limit 2)) order by time asc limit 1)
The trigger condition in the alarm policy is 【Number of errors in the most recent minute】-【Number of errors in the previous minute】 > specified threshold
$1.errct-$2.errct >100
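The surge check can be sketched offline in Python as follows. This is an illustration under stated assumptions, not the CLS alarm engine: the function name `error_surge` and the sample records are hypothetical, and unlike the queries above (which take the two most recent minutes that contain errors) the sketch compares the latest minute with the minute immediately before it.

```python
from collections import Counter


def error_surge(records, threshold):
    """Return True when errors in the latest minute exceed the previous minute by more than `threshold`.

    `records` is an iterable of (unix_seconds, http_code) pairs.
    """
    # Count error responses (status >= 400) per one-minute bucket.
    per_minute = Counter(ts // 60 for ts, code in records if code >= 400)
    if not per_minute:
        return False
    latest = max(per_minute)
    # A minute with no errors is absent from the counter, so default to 0.
    return per_minute[latest] - per_minute.get(latest - 1, 0) > threshold


# Hypothetical traffic: 5 errors in one minute, then 150 errors in the next.
records = [(60, 500)] * 5 + [(120, 404)] * 150 + [(125, 200)] * 1000
print(error_surge(records, threshold=100))
```

Comparing adjacent minutes rather than raw counts keeps the alarm quiet for services with a steady background error rate.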
2. CDN quality and performance analysis
CDN logs contain a wealth of information, so we can examine the overall quality and performance of the CDN from several dimensions:
- Health
- Cache hit rate
- Average download speed
- Download count, download traffic and speed by ISP
- Request latency distribution
Health
The percentage of all requests whose http_code is less than 500.
* | select round(sum(case when http_code<500 then 1.00 else 0.00 end) / cast(count(*) as double) * 100,1) as "Health"
Cache hit rate
Among requests whose http_code is less than 400, the percentage whose hit field is "hit".
http_code<400 | select round(sum(case when hit='hit' then 1.00 else 0.00 end) / cast(count(*) as double) * 100,1) as "Cache hit rate"
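The two ratios above can be reproduced on a handful of records with the following Python sketch. The function name and the sample data are hypothetical; note that the hit rate is computed only over requests with a status code below 400, mirroring the search filter in the query.

```python
def health_and_hit_rate(records):
    """Return (health %, cache hit rate %), each rounded to one decimal place.

    `records` is a non-empty list of (http_code, hit) pairs, where `hit` is "hit" or "miss".
    """
    # Health: share of all requests that did not end in a 5xx status.
    health = 100.0 * sum(code < 500 for code, _ in records) / len(records)
    # Hit rate: share of "hit" among requests with status < 400 only.
    ok = [hit for code, hit in records if code < 400]
    hit_rate = 100.0 * sum(h == "hit" for h in ok) / len(ok)
    return round(health, 1), round(hit_rate, 1)


# Hypothetical sample: 8 successful requests (6 cache hits), one 404 and one 502.
records = [(200, "hit")] * 6 + [(200, "miss")] * 2 + [(404, "miss"), (502, "miss")]
print(health_and_hit_rate(records))
```

The sketch assumes at least one request with a status below 400; with no such request the hit rate is undefined and the division would fail.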
Average download speed
Over a period of time, the average download speed is the total amount downloaded divided by the total time taken.
* | select sum(rsp_size/1024.0) / sum(request_time/1000.0) as "Average download speed (KB/s)"
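The same ratio of sums, written as a small Python sketch with hypothetical values. `rsp_size` is in bytes and `request_time` in milliseconds, as in the field table; the function name is an assumption for illustration.

```python
def average_speed_kb_per_s(records):
    """Total kilobytes divided by total seconds.

    `records` is a list of (rsp_size_bytes, request_time_ms) pairs.
    """
    total_kb = sum(size / 1024.0 for size, _ in records)
    total_s = sum(ms / 1000.0 for _, ms in records)
    return total_kb / total_s


# Hypothetical sample: a 1000 KB response and a 10 KB response, one second each.
records = [(1024 * 1000, 1000), (1024 * 10, 1000)]
print(average_speed_kb_per_s(records))
```

Dividing the totals weights each request by how long it took, so long transfers dominate the result; averaging per-request speeds instead would give small, fast responses the same weight as large downloads.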
Download count, download traffic and speed by ISP
The principle is the same as above; the ip_to_provider function converts client_ip to the corresponding ISP.
* | select ip_to_provider(client_ip) as isp , sum(rsp_size)* 1.0 /(sum(request_time)+1) as "Download speed (KB/s)" , sum(rsp_size/1024.0/1024.0) as "Total downloads (MB)", count(*) as c group by isp order by c desc limit 10
Request latency distribution
Count requests per latency window; choose window boundaries that suit the actual behavior of the application.
* | select case when request_time < 5000 then '~5s' when request_time < 6000 then '5s~6s' when request_time < 7000 then '6s~7s' when request_time < 8000 then '7~8s' when request_time < 10000 then '8~10s' when request_time < 15000 then '10~15s' else '15s~' end as latency , count(*) as count group by latency
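The bucketing in the CASE expression above can be sketched in Python with a sorted list of upper bounds. The bounds and labels are copied from the query; the function name and the sample latencies are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

# Upper bounds in milliseconds and labels, mirroring the CASE expression above.
BOUNDS_MS = [5000, 6000, 7000, 8000, 10000, 15000]
LABELS = ["~5s", "5s~6s", "6s~7s", "7~8s", "8~10s", "10~15s", "15s~"]


def latency_bucket(request_time_ms):
    """Map a latency in milliseconds to its bucket label."""
    # bisect_right counts the bounds <= value, so exactly 5000 ms lands in "5s~6s",
    # matching the strict `request_time < 5000` comparison in the query.
    return LABELS[bisect_right(BOUNDS_MS, request_time_ms)]


# Hypothetical latencies in milliseconds.
histogram = Counter(latency_bucket(ms) for ms in [120, 4999, 5000, 7500, 20000])
print(histogram)
```

Changing the window boundaries for a different application only requires editing the two lists, which must stay the same length plus one label for the open-ended last bucket.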
3. CDN troubleshooting
Requirements
Access errors directly affect the service experience. When errors occur, we need to quickly determine the current error QPS and error ratio, which domain names and URIs are affected most, whether the errors correlate with region or ISP, and whether a newly released version caused them.
Solution
View the distribution of 4xx and 5xx error codes
* | select http_code , count(*) as c where http_code >= 400 group by http_code order by c desc
In the error distribution chart, the dominant error is 404, which means the requested file or content does not exist. In this case, check whether the resource has been deleted or corrupted.
For requests with http_code >= 400, we run a multi-dimensional analysis: rank the top domain names and URLs by error count, check the number of errors by province and ISP, and view the client distribution.
Domain name
* | select host , count(*) as count where http_code >= 400 group by host order by count desc limit 10
URL
* | select url , count(*) as count where http_code >= 400 group by url order by count desc limit 10
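The same "top N by dimension" pattern can be sketched in Python with `collections.Counter`, treating status codes of 400 and above as errors. The function name `top_errors` and the sample records are hypothetical.

```python
from collections import Counter


def top_errors(records, key, n=10):
    """Count error requests (http_code >= 400) grouped by `key`, most frequent first."""
    counts = Counter(r[key] for r in records if r["http_code"] >= 400)
    return counts.most_common(n)


# Hypothetical records with only the fields needed here.
records = [
    {"host": "a.example.com", "url": "/img/1.png", "http_code": 404},
    {"host": "a.example.com", "url": "/img/2.png", "http_code": 404},
    {"host": "b.example.com", "url": "/img/1.png", "http_code": 502},
    {"host": "b.example.com", "url": "/index.html", "http_code": 200},
]
print(top_errors(records, "host"))
print(top_errors(records, "url", n=1))
```

Passing a different `key` (for example `"ua"`) reuses the same aggregation for another dimension, which is the idea behind running one query per dimension above.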
Province and ISP analysis
* | select client_ip, ip_to_province(client_ip) as "province", ip_to_provider(client_ip) as "ISP" , count(*) as "Error count" where http_code >= 400 group by client_ip order by "Error count" DESC limit 100
Client distribution
* | select ua as "Client version", count(*) as "Error count" where http_code >= 400 group by ua order by "Error count" desc limit 10
The chart shows that errors are concentrated in Safari clients. After investigation, the cause turned out to be a bug in the new release that made resource access fail frequently in the Safari browser.
4. User behavior analysis
Requirements
- Where most users come from: internal or external sources
- Which resources are most popular with users
- Whether any users are downloading resources excessively, and whether that behavior is expected
Solution
Access source analysis
* | select ip_to_province(client_ip) as province , count(*) as c group by province order by c desc limit 50
Top visited URLs
http_code < 400 | select url ,count(*) as "Number of visits", round(sum(rsp_size)/1024.0/1024.0/1024.0, 2) as "Total downloads (GB)" group by url order by "Number of visits" desc limit 100
Top domains by download traffic: rank domain names by the amount of data downloaded
* | select host, sum(rsp_size/1024.0) as "Total downloads (KB)" group by host order by "Total downloads (KB)" desc limit 100
Top users by download volume
* | SELECT CASE WHEN ip_to_country(client_ip)=' Hong Kong ' THEN concat(client_ip, ' ( Hong Kong )') WHEN ip_to_province(client_ip)='' THEN concat(client_ip, ' ( Unknown IP )') WHEN ip_to_provider(client_ip)=' Intranet IP' THEN concat(client_ip, ' (Private IP )') ELSE concat(client_ip, ' ( ', ip_to_country(client_ip), '/', ip_to_province(client_ip), '/', if(ip_to_city(client_ip)='-1', 'Unknown city', ip_to_city(client_ip)), ' ',ip_to_provider(client_ip), ' )') END AS client, pv as "Total visits", error_count as "Number of bad accesses" , throughput as "Total downloads (GB)" from (select client_ip , count(*) as pv, round(sum(rsp_size)/1024.0/1024/1024.0, 1) AS throughput , sum(if(http_code >= 400, 1, 0)) AS error_count from log group by client_ip order by throughput desc limit 100)
Top users by successful requests
* | SELECT CASE WHEN ip_to_country(client_ip)=' Hong Kong ' THEN concat(client_ip, ' ( Hong Kong )') WHEN ip_to_province(client_ip)='' THEN concat(client_ip, ' ( Unknown IP )') WHEN ip_to_provider(client_ip)=' Intranet IP' THEN concat(client_ip, ' (Private IP )') ELSE concat(client_ip, ' ( ', ip_to_country(client_ip), '/', ip_to_province(client_ip), '/', if(ip_to_city(client_ip)='-1', 'Unknown city', ip_to_city(client_ip)), ' ',ip_to_provider(client_ip), ' )') END AS client, pv as "Total visits", (pv - success_count) as "Number of bad accesses" , throughput as "Total downloads (GB)" from (select client_ip , count(*) as pv, round(sum(rsp_size)/1024.0/1024/1024.0, 1) AS throughput , sum(if(http_code < 400, 1, 0)) AS success_count from log group by client_ip order by success_count desc limit 100)
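Leaving out the IP-to-location formatting, the per-client aggregation in the inner subquery can be sketched in Python as follows. The function name `top_clients`, the tuple layout and the sample addresses (taken from documentation-reserved IP ranges) are assumptions for illustration.

```python
from collections import defaultdict


def top_clients(records, n=100):
    """Aggregate per client_ip and return the `n` clients with the most downloaded bytes.

    `records` is an iterable of (client_ip, http_code, rsp_size_bytes) tuples.
    Each result row is (client_ip, total visits, bad accesses, total downloads in GB).
    """
    stats = defaultdict(lambda: {"pv": 0, "errors": 0, "bytes": 0})
    for ip, code, size in records:
        s = stats[ip]
        s["pv"] += 1
        s["errors"] += code >= 400  # True counts as 1
        s["bytes"] += size
    ranked = sorted(stats.items(), key=lambda kv: kv[1]["bytes"], reverse=True)
    return [(ip, s["pv"], s["errors"], round(s["bytes"] / 1024 ** 3, 1))
            for ip, s in ranked[:n]]


# Hypothetical sample: one client downloads 2 GB and hits one 404, another downloads 1 GB.
records = [
    ("203.0.113.7", 200, 2 * 1024 ** 3),
    ("203.0.113.7", 404, 0),
    ("198.51.100.9", 200, 1024 ** 3),
]
print(top_clients(records))
```

Sorting by successful request count instead of bytes gives the second ranking; a client near the top of one list but not the other is a natural candidate for a closer look.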
PV and UV statistics: the trend over time of the number of visits and the number of unique client IPs
* | select date_trunc('minute', __TIMESTAMP__) as time, count(*) as pv,count( distinct client_ip) as uv group by time order by time limit 1000
That concludes this look at how to analyze CDN access logs. If you have more interesting logging practices, you are welcome to contribute and share!
Previous articles:
CLB O&M and operations best practices: access log insights
[Tencent Cloud Log Service CLS] CLS in serverless applications, explained in detail
[Log Service CLS] Connecting Application Workflow ASW to CLS: practice sharing
[Log Service CLS] Accessing CLS through the API with Python (source code and detailed steps attached)
[Log Service CLS] Shipping Nginx access logs to Tencent Cloud Log Service