Best practices for CDN access logs: quality and performance monitoring and operational statistics

2022-06-24 01:12:00 Log service CLS assistant

Author: v god

Introduction: The cloud-native log service (Cloud Log Service, CLS) is a one-stop log data solution platform provided by Tencent Cloud. It covers log collection, log storage, log retrieval, chart analysis, monitoring and alarms, and log delivery, helping users handle business operations and maintenance, service monitoring, log auditing, and similar scenarios through logs.

A CDN is a very important piece of Internet infrastructure: through it, users can quickly access images, video, and other resources on the network. Serving these requests generates a large volume of log data. Analyzing CDN access logs yields a lot of useful information for CDN quality and performance analysis, troubleshooting, client distribution, and user behavior analysis.

Prerequisite: CDN logs have been collected into CLS. See the operation details.

What is a CDN?

A content delivery network (CDN) is a new layer of network architecture added on top of the existing Internet, made up of high-performance acceleration nodes all over the world. These nodes store business content according to defined caching policies. When a user requests a piece of business content, the request is dispatched to the node closest to the user and that node responds directly, which reduces access latency and improves availability.

Traditional CDN log analysis

Today, CDN service providers usually offer basic monitoring metrics in real time, such as request count and bandwidth. In many specific analysis scenarios, however, these default real-time metrics do not meet users' custom analysis needs, so users typically download the raw CDN logs and analyze them in depth offline.

For interactive analysis scenarios such as real-time problem location and fast verification, a self-built offline analysis cluster not only demands substantial operations, development, and labor costs, it also cannot guarantee timely data: delays of more than half an hour are common. For scenarios such as CDN log alerting and troubleshooting, this approach is inflexible and cannot respond quickly to real-time interactive queries.

CDN-to-CLS solution

Tencent Cloud CDN is integrated with the CLS log service. Users can deliver CDN data to CLS in real time and then use CLS retrieval and SQL analysis to meet custom real-time log analysis needs in different scenarios:

  • One-click log delivery
  • Second-level analysis over tens of billions of logs
  • Real-time log visualization with dashboards
  • Real-time alarms within one minute

Introduction to CDN logs

CDN log field description

Field name | Original log type | Log service type | Description
--- | --- | --- | ---
app_id | Integer | long | Tencent Cloud account APPID
client_ip | String | text | Client IP
file_size | Integer | long | File size
hit | String | text | Cache HIT / MISS; a hit at either a CDN edge node or a parent node is marked HIT
host | String | text | Domain name
http_code | Integer | long | HTTP status code
isp | String | text | Carrier (ISP)
method | String | text | HTTP method
param | String | text | Parameters carried in the URL
proto | String | text | HTTP protocol identifier
prov | String | text | Carrier province
referer | String | text | Referer information (HTTP source address)
request_range | String | text | Range parameter (requested range)
request_time | Integer | long | Response time in milliseconds: the time from when the node receives the request until it has sent all response packets to the client
request_port | String | long | Port on which the client connects to the CDN node; - if absent
rsp_size | Integer | long | Number of bytes returned
time | Integer | long | Request time as a UNIX timestamp, in seconds
ua | String | text | User-Agent information
url | String | text | Request path
uuid | String | text | Unique identifier of the request
version | Integer | long | CDN real-time log version
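
As a rough illustration of the type columns, the sketch below converts a raw record into the listed log service types and treats a request_port of - as missing. It assumes the record arrives as a dictionary of strings; the helper name parse_record and the sample values are hypothetical, not part of CLS.

```python
# Fields the field description lists with log service type "long"
# (request_port is handled separately because it can be "-").
LONG_FIELDS = {"app_id", "file_size", "http_code", "request_time", "rsp_size", "time", "version"}


def parse_record(raw):
    """Convert a dict of raw string values into typed values (hypothetical helper)."""
    parsed = {}
    for key, value in raw.items():
        if key in LONG_FIELDS:
            parsed[key] = int(value)
        elif key == "request_port":
            # The port is "-" when absent, so anything non-numeric becomes None.
            parsed[key] = int(value) if value.isdigit() else None
        else:
            parsed[key] = value  # text fields stay as strings
    return parsed


# Made-up sample record for illustration only.
record = parse_record(
    {"host": "example.com", "http_code": "200", "request_time": "35", "request_port": "-"}
)
```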

1. CDN quality monitoring

Scenario 1: alert when CDN access latency exceeds a threshold

A percentile from mathematical statistics (for example, the 99th-percentile latency) is a more accurate alarm trigger than an average or an individual value: with an average, the latency of the few slow requests is averaged away and the alarm no longer reflects the real situation. For example, the following query and analysis statement computes the average latency, the 50th-percentile latency, and the 99th-percentile latency for each 5-minute bucket of a one-day window (1440 minutes).

* | select avg(request_time) as l, approx_percentile(request_time, 0.5) as p50, approx_percentile(request_time, 0.99) as p99, time_series(__TIMESTAMP__, '5m', '%Y-%m-%d %H:%i:%s', '0') as time group by time order by time desc limit 1440
Access data statistics
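
To see why a percentile is a better trigger than the average, here is a minimal Python sketch on made-up latency samples. The numbers and the nearest-rank percentile helper are illustrative assumptions; this is not how CLS implements approx_percentile.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 97 fast requests (20 ms) and 3 very slow ones (3000 ms).
latencies_ms = [20] * 97 + [3000] * 3

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(mean_ms, p50_ms, p99_ms)
```

In this sample the mean sits far below what the slowest requests actually experience, while the 99th percentile exposes the tail directly, which is why the alarm keys on p99.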

Alert when the 99th-percentile latency exceeds 100 ms, and show the affected domain name, url, and client_ip in the alarm message so the error can be pinned down quickly. The alarm settings are as follows:

* | select approx_percentile(request_time, 0.99) as p99
Monitoring configuration

With multi-dimensional analysis configured, the alarm message shows the affected domain name, client IP, and url, helping developers locate the problem quickly.

Alarm multidimensional analysis configuration

Once the alarm is triggered, key information arrives right away through WeChat, Enterprise WeChat, or SMS.

Alarm message sending

Scenario 2: alert on a surge in resource access errors, notifying the user when the period-over-period increase exceeds a threshold

A surge in page access errors often means that the back-end servers behind the CDN have failed or are overloaded with requests. We can configure an alarm that monitors the period-over-period increase in the number of error requests within a time window (e.g. one minute) and notifies the user when the increase exceeds a threshold.

Number of errors in the most recent minute

* | select * from (select * from (select * from (select date_trunc('minute', __TIMESTAMP__) as time,count(*) as errct where http_code>=400 group by time order by time desc limit 2)) order by time desc limit 1)

Number of errors in the previous minute

* | select * from (select * from (select * from (select date_trunc('minute', __TIMESTAMP__) as time,count(*) as errct where http_code>=400 group by time order by time desc limit 2)) order by time asc limit 1)

The trigger condition in the alarm policy is 【number of errors in the most recent minute】-【number of errors in the previous minute】 > the specified threshold. With the first statement as $1 and the second as $2:

$1.errct-$2.errct >100
Monitoring task
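
The comparison between the two most recent minutes can be sketched in Python as follows. This is only an illustration of the logic under stated assumptions: records are hypothetical (UNIX timestamp in seconds, http_code) pairs, and the function names are made up. Like the queries, it only sees minutes that contain at least one error.

```python
from collections import Counter


def errors_per_minute(records):
    """Count requests with http_code >= 400, keyed by the start of their minute."""
    counts = Counter()
    for timestamp, http_code in records:
        if http_code >= 400:
            counts[timestamp - timestamp % 60] += 1
    return counts


def surge_detected(records, threshold):
    """True when the newest minute has more than `threshold` extra errors over the one before it."""
    counts = errors_per_minute(records)
    newest_two = sorted(counts, reverse=True)[:2]
    if len(newest_two) < 2:
        return False  # nothing to compare against yet
    latest, previous = newest_two
    return counts[latest] - counts[previous] > threshold


# Hypothetical data: 5 errors in the first minute, 200 in the next.
sample = [(10, 500)] * 5 + [(70, 404)] * 200 + [(75, 200)] * 50
print(surge_detected(sample, threshold=100))
```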

2. CDN quality and performance analysis

CDN logs contain a wealth of information, so we can look at overall CDN quality and performance from several dimensions:

  • Health
  • Cache hit rate
  • Average download speed
  • Download count, download traffic, and speed by carrier
  • Request latency distribution

Health

The percentage of all requests whose http_code is below 500.

* | select round(sum(case when http_code<500 then 1.00 else 0.00 end) / cast(count(*) as double) * 100,1) as "Health"
Health statistics

Cache hit rate

Among requests whose http_code is below 400, the percentage whose hit field is "hit".

http_code<400 | select round(sum(case when hit='hit' then 1.00 else 0.00 end) / cast(count(*) as double) * 100,1) as "Cache hit rate"
Cache hit rate statistics
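
The health and cache hit rate ratios can be checked by hand with a small Python sketch. The sample records and function names are hypothetical, and both functions assume a non-empty input.

```python
def health_pct(logs):
    """Percentage of all requests whose http_code is below 500."""
    healthy = sum(1 for record in logs if record["http_code"] < 500)
    return round(healthy / len(logs) * 100, 1)


def cache_hit_rate_pct(logs):
    """Among requests with http_code < 400, the percentage whose hit field is 'hit'."""
    served = [record for record in logs if record["http_code"] < 400]
    hits = sum(1 for record in served if record["hit"] == "hit")
    return round(hits / len(served) * 100, 1)


# Made-up sample: three successful requests, one 404, one 502.
sample_logs = [
    {"http_code": 200, "hit": "hit"},
    {"http_code": 200, "hit": "miss"},
    {"http_code": 304, "hit": "hit"},
    {"http_code": 404, "hit": "miss"},
    {"http_code": 502, "hit": "miss"},
]

print(health_pct(sample_logs), cache_hit_rate_pct(sample_logs))
```

Note that the two ratios use different denominators: health divides by all requests, while the hit rate divides only by requests that were served successfully.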

Average download speed

Over a period of time, the average download speed is the total download volume divided by the total time taken.

* | select sum(rsp_size/1024.0) / sum(request_time/1000.0) as "Average download speed (KB/s)"
Average download speed statistics
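
A short worked example, with made-up numbers, shows what "total volume divided by total time" means in practice and how it differs from averaging each request's own speed. The function name is hypothetical.

```python
def average_download_speed_kb_s(logs):
    """Total kilobytes returned divided by total seconds spent, as in the query."""
    total_kb = sum(record["rsp_size"] / 1024.0 for record in logs)
    total_seconds = sum(record["request_time"] / 1000.0 for record in logs)
    return total_kb / total_seconds


# Hypothetical requests: 2048 KB served in 1 s, and 1024 KB served in 3 s.
sample = [
    {"rsp_size": 2048 * 1024, "request_time": 1000},
    {"rsp_size": 1024 * 1024, "request_time": 3000},
]

overall = average_download_speed_kb_s(sample)

# For contrast: the plain mean of the two per-request speeds.
per_request = [(r["rsp_size"] / 1024.0) / (r["request_time"] / 1000.0) for r in sample]
mean_of_speeds = sum(per_request) / len(per_request)

print(overall, mean_of_speeds)
```

The overall figure weights each request by how long it took, so a slow, long-running transfer pulls the result down more than it would in a simple mean of per-request speeds.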

Download count, download traffic, and speed by carrier

The principle is the same as above; the ip_to_provider function converts client_ip to the corresponding carrier.

* | select ip_to_provider(client_ip) as isp, sum(rsp_size)* 1.0 /(sum(request_time)+1) as "Download speed (KB/s)", sum(rsp_size/1024.0/1024.0) as "Total downloads (MB)", count(*) as c group by isp order by c desc limit 10
Download speed and count by carrier

Request latency distribution

Count requests by latency window; choose windows that suit the application's actual latency profile.

* | select case when request_time < 5000 then '~5s' when request_time < 6000 then '5s~6s' when request_time < 7000 then '6s~7s' when request_time < 8000 then '7s~8s' when request_time < 10000 then '8s~10s' when request_time < 15000 then '10s~15s' else '15s~' end as latency, count(*) as count group by latency
Request delay distribution
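
The same windowing can be sketched in Python, which makes it easy to try different boundaries before putting them into a query. The boundaries mirror the statement above; the label strings, function names, and sample latencies are illustrative assumptions.

```python
from collections import Counter

# (exclusive upper bound in milliseconds, label); labels are illustrative.
BUCKETS = [
    (5000, "~5s"),
    (6000, "5s~6s"),
    (7000, "6s~7s"),
    (8000, "7s~8s"),
    (10000, "8s~10s"),
    (15000, "10s~15s"),
]


def latency_bucket(request_time_ms):
    """Return the label of the first window whose upper bound exceeds the latency."""
    for upper_bound, label in BUCKETS:
        if request_time_ms < upper_bound:
            return label
    return "15s~"


def latency_distribution(request_times_ms):
    """Count how many requests fall into each latency window."""
    return Counter(latency_bucket(t) for t in request_times_ms)


# Hypothetical latencies in milliseconds.
distribution = latency_distribution([100, 5000, 5999, 7500, 20000])
print(distribution)
```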

3. CDN troubleshooting

Requirements

Access errors have always been a major factor in service experience. When errors occur, we need to quickly determine the current error QPS and error ratio, which domain names and URIs are most affected, whether the errors correlate with region or carrier, and whether a newly released version caused them.

Solution

View the distribution of 4xx and 5xx error codes

* |  select  http_code , count(*) as c where http_code >= 400 group by http_code order by c  desc

The error distribution chart below shows that most errors are 404, meaning the requested file or content does not exist. In that case, check whether the resource has been deleted or corrupted.

Error request status distribution

For requests with http_code >= 400, we run a multi-dimensional analysis: rank the top domain names and URLs by error count, check error counts by province and carrier, and view the client distribution.

Domain name

* | select host, count(*) as count where http_code >= 400 group by host order by count desc limit 10

URL

* | select url, count(*) as count where http_code >= 400 group by url order by count desc limit 10

Province and carrier analysis

* | select client_ip, ip_to_province(client_ip) as "province", ip_to_provider(client_ip) as "Carrier", count(*) as "Errors" where http_code >= 400 group by client_ip order by "Errors" DESC limit 100

Client distribution

* | select ua as "Client version", count(*) as "Errors" where http_code >= 400 group by ua order by "Errors" desc limit 10
Client version

The figure shows that the errors are concentrated in Safari clients. After locating the problem, we found that a bug in the new version caused frequent resource-access failures in the Safari browser.
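
Each of these breakdowns follows one pattern: keep the error requests, group by one field, and take the top N. A minimal Python sketch of that pattern is shown below; the function name and sample records are hypothetical, and status 400 is counted as an error here.

```python
from collections import Counter


def top_errors(logs, field, n=10):
    """Rank the values of `field` by how many error requests (http_code >= 400) they have."""
    counts = Counter(record[field] for record in logs if record["http_code"] >= 400)
    return counts.most_common(n)


# Made-up sample records.
sample_logs = [
    {"host": "a.example.com", "ua": "Safari", "http_code": 404},
    {"host": "a.example.com", "ua": "Safari", "http_code": 404},
    {"host": "b.example.com", "ua": "Chrome", "http_code": 500},
    {"host": "a.example.com", "ua": "Chrome", "http_code": 200},
]

print(top_errors(sample_logs, "host"))
print(top_errors(sample_logs, "ua", n=1))
```

Swapping the field name is all it takes to move between the domain, URL, and client views, which is what makes this kind of drill-down quick during an incident.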

4. User behavior analysis

Requirements

  • Where most users come from, internal or external
  • Which resources are most popular with users
  • Whether any user is downloading resources excessively, and whether that behavior is expected

Solution

Access source analysis

* | select ip_to_province(client_ip) as province ,  count(*) as c group by province order by c desc limit 50
Access source analysis

Top URLs by visits

http_code < 400 | select url, count(*) as "Visits", round(sum(rsp_size)/1024.0/1024.0/1024.0, 2) as "Total downloads (GB)" group by url order by "Visits" desc limit 100
Access statistics

Top domain names by download traffic: rank domain names by total download volume

* | select host, sum(rsp_size)/1024.0 as "Total downloads (KB)" group by host order by "Total downloads (KB)" desc limit 100
Download traffic data sorting

Top users by download volume

* | SELECT CASE WHEN ip_to_country(client_ip)='Hong Kong' THEN concat(client_ip, ' (Hong Kong)') WHEN ip_to_province(client_ip)='' THEN concat(client_ip, ' (Unknown IP)') WHEN ip_to_provider(client_ip)='Intranet IP' THEN concat(client_ip, ' (Private IP)') ELSE concat(client_ip, ' (', ip_to_country(client_ip), '/', ip_to_province(client_ip), '/', if(ip_to_city(client_ip)='-1', 'Unknown city', ip_to_city(client_ip)), ' ', ip_to_provider(client_ip), ')') END AS client, pv as "Total visits", error_count as "Failed visits", throughput as "Total downloads (GB)" from (select client_ip, count(*) as pv, round(sum(rsp_size)/1024.0/1024/1024.0, 1) AS throughput, sum(if(http_code >= 400, 1, 0)) AS error_count from log group by client_ip order by throughput desc limit 100)
Downloads Top User statistics

Top users by successful visits

* | SELECT CASE WHEN ip_to_country(client_ip)='Hong Kong' THEN concat(client_ip, ' (Hong Kong)') WHEN ip_to_province(client_ip)='' THEN concat(client_ip, ' (Unknown IP)') WHEN ip_to_provider(client_ip)='Intranet IP' THEN concat(client_ip, ' (Private IP)') ELSE concat(client_ip, ' (', ip_to_country(client_ip), '/', ip_to_province(client_ip), '/', if(ip_to_city(client_ip)='-1', 'Unknown city', ip_to_city(client_ip)), ' ', ip_to_provider(client_ip), ')') END AS client, pv as "Total visits", (pv - success_count) as "Failed visits", throughput as "Total downloads (GB)" from (select client_ip, count(*) as pv, round(sum(rsp_size)/1024.0/1024/1024.0, 1) AS throughput, sum(if(http_code < 400, 1, 0)) AS success_count from log group by client_ip order by success_count desc limit 100)
Effective access to Top User statistics

PV and UV statistics: the trend in the number of visits (PV) and the number of distinct client IPs (UV) over a period of time

* | select date_trunc('minute', __TIMESTAMP__) as time, count(*) as pv,count( distinct client_ip) as uv group by time order by time limit 1000 
visit PV、UV Statistics
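
For readers less familiar with the two metrics, the sketch below computes them per minute in plain Python. It is an illustration only: it buckets on the log's time field (UNIX seconds) rather than on __TIMESTAMP__, and the function name and sample records are hypothetical.

```python
from collections import defaultdict


def pv_uv_per_minute(logs):
    """Return (minute_start, pv, uv) tuples in time order.

    pv is the number of requests in the minute; uv is the number of distinct client IPs.
    """
    page_views = defaultdict(int)
    visitors = defaultdict(set)
    for record in logs:
        minute = record["time"] - record["time"] % 60
        page_views[minute] += 1
        visitors[minute].add(record["client_ip"])
    return [(minute, page_views[minute], len(visitors[minute])) for minute in sorted(page_views)]


# Made-up sample using documentation-range IP addresses.
sample_logs = [
    {"time": 60, "client_ip": "192.0.2.1"},
    {"time": 61, "client_ip": "192.0.2.1"},
    {"time": 119, "client_ip": "192.0.2.2"},
    {"time": 120, "client_ip": "192.0.2.1"},
]

print(pv_uv_per_minute(sample_logs))
```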

That covers the ways to use CDN access logs. If you have other interesting logging practices, contributions are welcome!

One stop log data solution platform

Previous articles:

CLB operations best practices: deep insight into access logs

【Tencent Cloud Log Service CLS】CLS in serverless applications, explained

【Log Service CLS】Connecting the application workflow ASW to CLS: a practical walkthrough

【Log Service CLS】Accessing CLS through the API with Python (source code and detailed steps included)

【Log Service CLS】Shipping Nginx access logs to Tencent Cloud Log Service
