My Battle History with Redis Under Billion-Level Traffic
1. Background
One day I received feedback from an upstream caller: one of the Dubbo interfaces we provide was being circuit-broken for a short period at a fixed time every day, and the exception thrown was that the provider's Dubbo thread pool was exhausted. The interface currently serves 1.8 billion requests per day, with roughly 940K failing requests per day. That is where this optimization journey began.
2. Quick Response
2.1 Rapid Diagnosis
First, we went through the routine system monitoring (machine, JVM memory, GC, threads). There were slight spikes, but all within reason, and they did not line up with the times the errors were reported, so we set them aside for the moment.

Next came traffic analysis. We found a traffic surge at a fixed time every day, and the surge coincided exactly with the error times, so our preliminary judgment was that short bursts of heavy traffic were the trigger.
3. Finding the Performance Bottleneck
3.1 Interface Flow Analysis

3.1.1 Flow Chart

(The flow chart from the original post is not reproduced here.)

3.1.2 Process Analysis
After receiving a request, call the downstream interface, guarded by a Hystrix circuit breaker with a 500 ms timeout;

Using the data returned by the downstream interface, assemble the detail data: read the local cache first; on a local-cache miss, fall back to Redis; if Redis has nothing either, return directly while an asynchronous thread back-fills from the database (a sketch of this read path follows below).

If the downstream call in the first step fails, fall back to the bottom-line (degraded) data, which follows the same flow: local cache first, then Redis on a miss; if Redis has nothing, return directly while an asynchronous thread back-fills from the database.
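To make the read path concrete, here is a minimal sketch of the two-level lookup with asynchronous back-fill. It is an illustration only: the class, the ConcurrentHashMap stand-in for the local cache, the fixed thread pool, the 3600 s TTL and the loadFromDb stub are all assumptions, not the actual vivo code.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import redis.clients.jedis.JedisCluster;

    public class DetailReader {

        private final ConcurrentMap<String, String> localCache = new ConcurrentHashMap<>();
        private final JedisCluster jedis;                       // shared cluster client
        private final ExecutorService backfillPool = Executors.newFixedThreadPool(4);

        public DetailReader(JedisCluster jedis) {
            this.jedis = jedis;
        }

        public String getDetail(String key) {
            // 1. Local cache first
            String value = localCache.get(key);
            if (value != null) {
                return value;
            }
            // 2. Fall back to Redis on a local-cache miss
            value = jedis.get(key);
            if (value != null) {
                localCache.put(key, value);
                return value;
            }
            // 3. Redis miss: return directly; back-fill from the DB asynchronously
            backfillPool.submit(() -> {
                String fromDb = loadFromDb(key);                // hypothetical DAO call
                if (fromDb != null) {
                    jedis.setex(key, 3600, fromDb);             // TTL is an assumption
                    localCache.put(key, fromDb);
                }
            });
            return null;                                        // caller degrades gracefully
        }

        private String loadFromDb(String key) {
            return null;                                        // placeholder for the real query
        }
    }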
3.2 Performance Bottleneck Investigation

3.2.1 Downstream Interface Latency
The call chain showed that although the downstream interface's P99 spiked above 1 s at peak traffic, given the timeout settings (circuit-breaker timeout 500 ms, coreSize & maxSize = 50, average downstream latency under 10 ms) we judged that the downstream interface was not the crux of the problem. To remove its interference entirely and fail fast whenever the downstream service spikes, we tightened the circuit-breaker timeout to 100 ms and the Dubbo timeout to 100 ms.
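The article does not show the Hystrix wiring; the following is a minimal sketch of what the tightened settings might look like, assuming plain HystrixCommand usage. The group name, method names and fallback behavior are illustrative assumptions.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.HystrixCommandProperties;
    import com.netflix.hystrix.HystrixThreadPoolProperties;

    public class DownstreamCommand extends HystrixCommand<String> {

        protected DownstreamCommand() {
            super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("downstream"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                    .withExecutionTimeoutInMilliseconds(100))       // tightened from 500 ms
                .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                    .withCoreSize(50)
                    .withMaximumSize(50)));                         // coreSize & maxSize = 50
        }

        @Override
        protected String run() {
            return callDownstream();    // hypothetical downstream invocation
        }

        @Override
        protected String getFallback() {
            return null;                // routes the caller to the bottom-line data path
        }

        private String callDownstream() {
            return "ok";                // stand-in for the real Dubbo call
        }
    }

On the Dubbo side, the consumer timeout is normally tightened in the reference configuration, e.g. <dubbo:reference timeout="100" .../>.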
3.2.2 Detail Data: Local-Cache Miss, Falling Back to Redis
With the help of the call-chain platform, the first step was to analyze the Redis request traffic in order to estimate the local-cache hit rate. We found that the Redis traffic was twice the interface traffic, which by design should never happen. A code review turned up the bug in the logic: the code never read from the local cache at all, but fetched straight from Redis. The Redis maximum response time did show unreasonable spikes, and further analysis showed that the Redis response-time spikes matched the Dubbo P99 spikes almost exactly. It felt like we had found the cause, and we were quietly pleased.
3.2.3 Bottom-Line Data: Local-Cache Miss, Falling Back to Redis

This path checked out as normal.
3.2.4 Writing Request Results Back to Redis

Since this Redis cluster had already been resource-isolated, and no slow log showed up in the DB management console, we came up with many possible causes of the Redis slowdown at this point. But we subjectively dismissed everything else, fixated on the doubled Redis request traffic, and gave priority to the problem found in 3.2.2.
4. Solution
4.1 Fixing the Problem Located in 3.2.2

After the fix went live, the doubled Redis traffic was gone and the Redis maximum response time eased, but it was still not fully back to normal, which showed that the high-volume queries were not the root cause.
4.2 Redis Capacity Expansion
With the abnormal Redis traffic fixed but the problem still not fully resolved, all we could do was calm down and carefully work through the possible causes of slow Redis responses, along three main lines:
- slow queries exist;
- the Redis service itself is the performance bottleneck;
- the client configuration is unreasonable.
Following these leads one by one: we queried the Redis slow-query log and found no slow queries.
Using the call-chain platform to analyze slow Redis commands, now free of the interference from the traffic-induced slowness, we located the problem quickly: the bulk of the time-consuming requests were setex calls, and the occasional slow read always came right behind a setex. Since Redis executes commands on a single thread, we judged setex to be the prime culprit behind the Redis P99 spikes. After finding the specific statements and the specific business behind them, we first applied to expand the Redis cluster from 6 masters to 8 masters.
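As a toy illustration (not from the article) of the single-thread reasoning: one slow SETEX of a large value holds up every command queued behind it, so unrelated GETs inherit its latency under load. The address, key names and payload size below are assumptions.

    import redis.clients.jedis.Jedis;

    public class SetexLatencyProbe {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {   // address is an assumption
                byte[] big = new byte[4 * 1024 * 1024];          // 4 MB payload
                long t0 = System.nanoTime();
                jedis.setex("big:key".getBytes(), 60, big);      // slow write occupies the server
                System.out.printf("setex(4MB) took %d us%n", (System.nanoTime() - t0) / 1000);

                long t1 = System.nanoTime();
                jedis.get("some:other:key");                     // under load, queued behind such writes
                System.out.printf("get took %d us%n", (System.nanoTime() - t1) / 1000);
            }
        }
    }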
Judging by the results, the expansion had essentially no effect, which meant the Redis service itself was not the performance bottleneck. That left only one suspect: the client-side configuration.
4.3 Client Parameter Optimization

4.3.1 Connection Pool Optimization
With the Redis expansion ruled out, we turned to possible client-side problems, with two points of suspicion: first, that the client's connection management in Redis cluster mode had a bug; second, that the connection-pool parameters were set unreasonably. Source-code analysis and connection-pool parameter tuning proceeded in parallel.
4.3.1.1 Checking the Client's Connection Management for Bugs

Analysis of the client's connection-pool handling found no problem: as expected, the client caches one connection pool per slot, so the first hypothesis was ruled out. The (abridged) source is as follows.
1. setex

    public String setex(final byte[] key, final int seconds, final byte[] value) {
      return new JedisClusterCommand<String>(connectionHandler, maxAttempts) {
        @Override
        public String execute(Jedis connection) {
          return connection.setex(key, seconds, value);
        }
      }.runBinary(key);
    }

2. runBinary

    public T runBinary(byte[] key) {
      if (key == null) {
        throw new JedisClusterException("No way to dispatch this command to Redis Cluster.");
      }
      return runWithRetries(key, this.maxAttempts, false, false);
    }

3. runWithRetries

    private T runWithRetries(byte[] key, int attempts, boolean tryRandomNode, boolean asking) {
      if (attempts <= 0) {
        throw new JedisClusterMaxRedirectionsException("Too many Cluster redirections?");
      }
      Jedis connection = null;
      try {
        if (asking) {
          // TODO: Pipeline asking with the original command to make it
          // faster....
          connection = askConnection.get();
          connection.asking();
          // if asking success, reset asking flag
          asking = false;
        } else {
          if (tryRandomNode) {
            connection = connectionHandler.getConnection();
          } else {
            connection = connectionHandler.getConnectionFromSlot(JedisClusterCRC16.getSlot(key));
          }
        }
        return execute(connection);
      } finally {
        // retry/redirection handling elided in the original excerpt;
        // the connection is released back to the pool here
        releaseConnection(connection);
      }
    }

4. getConnectionFromSlot

    public Jedis getConnectionFromSlot(int slot) {
      JedisPool connectionPool = cache.getSlotPool(slot);
      if (connectionPool != null) {
        // It can't guaranteed to get valid connection because of node
        // assignment
        return connectionPool.getResource();
      } else {
        // It's abnormal situation for cluster mode, that we have just nothing
        // for slot, try to rediscover state
        renewSlotCache();
        connectionPool = cache.getSlotPool(slot);
        if (connectionPool != null) {
          return connectionPool.getResource();
        } else {
          // no choice, fallback to new connection to random node
          return getConnection();
        }
      }
    }
4.3.1.2 Analyzing the Connection Pool Parameters
After discussing with the middleware team and consulting the commons-pool2 official documentation, we adjusted the parameters as follows (the original post's parameter table is not reproduced here).
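A minimal sketch of the kind of adjustment described, assuming a standard JedisPoolConfig. Only maxWaitMillis = 200 ms is confirmed by the text below; every other value here is an illustrative assumption.

    import redis.clients.jedis.JedisPoolConfig;

    public final class TunedPoolConfig {
        public static JedisPoolConfig build() {
            JedisPoolConfig cfg = new JedisPoolConfig();
            cfg.setMaxTotal(50);              // assumed cap on connections per node
            cfg.setMaxIdle(50);               // assumed; keep idle capacity equal to maxTotal
            cfg.setBlockWhenExhausted(true);  // wait for a connection instead of failing at once
            cfg.setMaxWaitMillis(200);        // cap the borrow wait at 200 ms (from the article)
            cfg.setTestOnBorrow(false);       // assumed; avoids an extra PING per borrow
            return cfg;
        }
    }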
After the parameter adjustment, the number of requests taking more than 1 s dropped, though some remained; the degradation volume reported by the upstream fell from roughly 900K per day to about 60K. (Why requests over 200 ms still occurred with maxWaitMillis set to 200 ms is explained below.)
4.3.2 Further Optimization

The optimization could not stop there: how could we get all Redis write requests under 200 ms? The approach was still to tune client configuration parameters, which meant analyzing how Jedis acquires a connection.

Jedis acquires connections through commons-pool2; the (abridged) source of GenericObjectPool.borrowObject:
    final AbandonedConfig ac = this.abandonedConfig;
    if (ac != null && ac.getRemoveAbandonedOnBorrow() &&
            (getNumIdle() < 2) &&
            (getNumActive() > getMaxTotal() - 3)) {
        removeAbandoned(ac);
    }

    PooledObject<T> p = null;

    // Get local copy of current config so it is consistent for entire
    // method execution
    final boolean blockWhenExhausted = getBlockWhenExhausted();

    boolean create;
    final long waitTime = System.currentTimeMillis();

    while (p == null) {
        create = false;
        p = idleObjects.pollFirst();
        if (p == null) {
            p = create();
            if (p != null) {
                create = true;
            }
        }
        if (blockWhenExhausted) {
            if (p == null) {
                if (borrowMaxWaitMillis < 0) {
                    p = idleObjects.takeFirst();
                } else {
                    p = idleObjects.pollFirst(borrowMaxWaitMillis, TimeUnit.MILLISECONDS);
                }
            }
            if (p == null) {
                throw new NoSuchElementException(
                        "Timeout waiting for idle object");
            }
        } else {
            if (p == null) {
                throw new NoSuchElementException("Pool exhausted");
            }
        }
        if (!p.allocate()) {
            p = null;
        }

        if (p != null) {
            try {
                factory.activateObject(p);
            } catch (final Exception e) {
                try {
                    destroy(p);
                } catch (final Exception e1) {
                    // Ignore - activation failure is more important
                }
                p = null;
                if (create) {
                    final NoSuchElementException nsee = new NoSuchElementException(
                            "Unable to activate object");
                    nsee.initCause(e);
                    throw nsee;
                }
            }
            if (p != null && (getTestOnBorrow() || create && getTestOnCreate())) {
                boolean validate = false;
                Throwable validationThrowable = null;
                try {
                    validate = factory.validateObject(p);
                } catch (final Throwable t) {
                    PoolUtils.checkRethrow(t);
                    validationThrowable = t;
                }
                if (!validate) {
                    try {
                        destroy(p);
                        destroyedByBorrowValidationCount.incrementAndGet();
                    } catch (final Exception e) {
                        // Ignore - validation failure is more important
                    }
                    p = null;
                    if (create) {
                        final NoSuchElementException nsee = new NoSuchElementException(
                                "Unable to validate object");
                        nsee.initCause(validationThrowable);
                        throw nsee;
                    }
                }
            }
        }
    }

    updateStatsBorrow(p, System.currentTimeMillis() - waitTime);

    return p.getObject();
The general flow of acquiring a connection is:

Check for an idle connection; if one is available, return it directly, otherwise create one.

When creating, if the maximum number of connections would be exceeded, check whether another thread is currently creating a connection: if not, fail out of the create step directly; if so, wait up to maxWaitMillis (the other thread's creation may fail). If the maximum is not exceeded, create the connection (so the total wait to obtain a connection can exceed maxWaitMillis).

If creation did not succeed, check blockWhenExhausted: if it is false, throw a pool-exhausted exception; if it is true, check maxWaitMillis: if it is negative, block indefinitely; if it is positive, block for at most maxWaitMillis.

Finally, depending on the testOnBorrow / testOnCreate parameters, decide whether the connection needs validation before being returned.
By this flow, with maxWaitMillis currently set to 200, the maximum total blocking time should be 400 ms (two waiting steps of 200 ms each), and most of the time just 200 ms; there should be no spikes beyond 400 ms.

So the problem most likely lay in connection creation: creating a connection takes time, and how long is uncertain. We focused on whether that scenario was occurring, monitoring the Redis connection counts through the DB management console.
The connection-count monitoring (chart not reproduced here) showed that at certain minutes (9:00, 12:00, 19:00, ...) the number of Redis connections did climb, matching the Redis spike times almost exactly. It felt like (after everything we had tried, we no longer dared be certain) the problem was finally clear: when the traffic surge arrives, the available connections in the pool cannot meet demand, new connections have to be created, and requests end up waiting.
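If no such console is available, a rough stand-in (not what the article used) is to poll INFO clients on each master and log the connected_clients line; the node addresses and the 10 s sampling interval below are assumptions.

    import redis.clients.jedis.Jedis;

    public class ConnectionCountProbe {
        public static void main(String[] args) throws InterruptedException {
            // Illustrative master addresses; replace with the real cluster nodes.
            String[] masters = {"10.0.0.1:6379", "10.0.0.2:6379"};
            while (true) {
                for (String node : masters) {
                    String[] hp = node.split(":");
                    try (Jedis jedis = new Jedis(hp[0], Integer.parseInt(hp[1]))) {
                        // "INFO clients" output contains a "connected_clients:<n>" line
                        for (String line : jedis.info("clients").split("\r?\n")) {
                            if (line.startsWith("connected_clients")) {
                                System.out.println(node + " " + line);
                            }
                        }
                    }
                }
                Thread.sleep(10_000); // sample every 10 s
            }
        }
    }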
The idea, then, was to create the pool's connections at service startup and minimize new-connection creation afterwards. We modified the connection-pool parameter vivo.cache.depend.common.poolConfig.minIdle, and it turned out to have no effect at all!?

Nothing for it but to read the source. Under the hood Jedis manages connections with commons-pool2, so we examined part of the commons-pool2-2.6.2.jar source.

commons-pool2 2.6.2 source (GenericObjectPool constructor):
    public GenericObjectPool(final PooledObjectFactory<T> factory,
            final GenericObjectPoolConfig<T> config) {
        super(config, ONAME_BASE, config.getJmxNamePrefix());
        if (factory == null) {
            jmxUnregister(); // tidy up
            throw new IllegalArgumentException("factory may not be null");
        }
        this.factory = factory;
        idleObjects = new LinkedBlockingDeque<>(config.getFairness());
        setConfig(config);
    }
There was no initial-connection logic to be found, so we consulted the middleware team, who sent over the source below (from commons-pool2-2.4.2.jar): after the constructor body there is one extra call, to a startEvictor method!
1. Initialize the connection pool

    public GenericObjectPool(PooledObjectFactory<T> factory,
            GenericObjectPoolConfig config) {
        super(config, ONAME_BASE, config.getJmxNamePrefix());
        if (factory == null) {
            jmxUnregister(); // tidy up
            throw new IllegalArgumentException("factory may not be null");
        }
        this.factory = factory;
        idleObjects = new LinkedBlockingDeque<PooledObject<T>>(config.getFairness());
        setConfig(config);
        startEvictor(getTimeBetweenEvictionRunsMillis());
    }
Why the difference? We checked the jar packages: the versions differ. The middleware team had quoted v2.4.2, while the project actually uses v2.6.2. Analysis showed that one step of the startEvictor logic handles connection-pool warm-up.
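For completeness, our reading of the 2.6.x sources (worth verifying against the exact jar in use) is that the evictor is no longer started unconditionally from the constructor; it is started from the setter that setConfig invokes, and startEvictor only schedules the task when the period is positive, so with the default timeBetweenEvictionRunsMillis of -1 the Evictor, and with it the warm-up, never runs:

    // org.apache.commons.pool2.impl.BaseGenericObjectPool (2.6.x, abridged)
    public final void setTimeBetweenEvictionRunsMillis(final long timeBetweenEvictionRunsMillis) {
        this.timeBetweenEvictionRunsMillis = timeBetweenEvictionRunsMillis;
        startEvictor(timeBetweenEvictionRunsMillis);   // no-op unless the period is > 0
    }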
Connection pool warm-up source (commons-pool2 2.4.2):
1. startEvictor

    final void startEvictor(long delay) {
        synchronized (evictionLock) {
            if (null != evictor) {
                EvictionTimer.cancel(evictor);
                evictor = null;
                evictionIterator = null;
            }
            if (delay > 0) {
                evictor = new Evictor();
                EvictionTimer.schedule(evictor, delay, delay);
            }
        }
    }

2. Evictor

    class Evictor extends TimerTask {
        /**
         * Run pool maintenance. Evict objects qualifying for eviction and then
         * ensure that the minimum number of idle instances are available.
         * Since the Timer that invokes Evictors is shared for all Pools but
         * pools may exist in different class loaders, the Evictor ensures that
         * any actions taken are under the class loader of the factory
         * associated with the pool.
         */
        @Override
        public void run() {
            ClassLoader savedClassLoader = Thread.currentThread().getContextClassLoader();
            try {
                if (factoryClassLoader != null) {
                    // Set the class loader for the factory
                    ClassLoader cl = factoryClassLoader.get();
                    if (cl == null) {
                        // The pool has been dereferenced and the class loader
                        // GC'd. Cancel this timer so the pool can be GC'd as
                        // well.
                        cancel();
                        return;
                    }
                    Thread.currentThread().setContextClassLoader(cl);
                }

                // Evict from the pool
                try {
                    evict();
                } catch (Exception e) {
                    swallowException(e);
                } catch (OutOfMemoryError oome) {
                    // Log problem but give evictor thread a chance to continue
                    // in case error is recoverable
                    oome.printStackTrace(System.err);
                }

                // Re-create idle instances.
                try {
                    ensureMinIdle();
                } catch (Exception e) {
                    swallowException(e);
                }
            } finally {
                // Restore the previous CCL
                Thread.currentThread().setContextClassLoader(savedClassLoader);
            }
        }
    }

3. ensureMinIdle

    void ensureMinIdle() throws Exception {
        ensureIdle(getMinIdle(), true);
    }

4. ensureIdle

    private void ensureIdle(int idleCount, boolean always) throws Exception {
        if (idleCount < 1 || isClosed() || (!always && !idleObjects.hasTakeWaiters())) {
            return;
        }

        while (idleObjects.size() < idleCount) {
            PooledObject<T> p = create();
            if (p == null) {
                // Can't create objects, no reason to think another call to
                // create will work. Give up.
                break;
            }
            if (getLifo()) {
                idleObjects.addFirst(p);
            } else {
                idleObjects.addLast(p);
            }
        }
        if (isClosed()) {
            // Pool closed while object was being added to idle objects.
            // Make sure the returned object is destroyed rather than left
            // in the idle object pool (which would effectively be a leak)
            clear();
        }
    }
We aligned the jar version and added two parameters in the configuration center: vivo.cache.depend.common.poolConfig.timeBetweenEvictionRunsMillis (how often to scan the idle connections in the pool; connections idle longer than minEvictableIdleTimeMillis milliseconds are closed, until the pool is back down to minIdle connections) and vivo.cache.depend.common.poolConfig.minEvictableIdleTimeMillis (how long a pooled connection may remain idle, in milliseconds). After a service restart, the connection pool warmed up normally, and the problem was finally solved at the Redis level.
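A minimal sketch of the warm-up configuration that finally worked; the concrete values are illustrative, and the setters correspond to the vivo configuration keys above.

    import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

    public final class WarmPoolConfig {
        public static GenericObjectPoolConfig<Object> build() {
            GenericObjectPoolConfig<Object> cfg = new GenericObjectPoolConfig<>();
            cfg.setMinIdle(20);                             // connections kept warm (assumed value)
            cfg.setTimeBetweenEvictionRunsMillis(30_000);   // evictor period; must be > 0, or the
                                                            // Evictor (and ensureMinIdle) never runs
            cfg.setMinEvictableIdleTimeMillis(60_000);      // idle longer than this gets closed,
                                                            // down to minIdle connections
            return cfg;
        }
    }

With the period positive, the Evictor task fires periodically; each run calls ensureMinIdle(), which pre-creates connections up to minIdle, exactly the warm-up that the 2.6.2 constructor alone never triggers while the period sits at its default of -1.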
The optimization results (charts not reproduced here) showed the performance problem basically solved.
5. Summary
When a problem occurs online, the first consideration is restoring the business quickly to minimize impact, so for online services, prepare rate limiting, circuit breaking and degradation strategies in advance; then, when something goes wrong, a recovery path is ready at hand. Proficiency with the company's monitoring platforms (machine, service, interface, DB, and so on) determines how fast you can locate a problem; every developer should master them as a basic skill.

When Redis responds slowly, investigate three fronts first: the Redis cluster servers (machine load, slow queries on the service), the business code (bugs), and the client (whether the connection-pool configuration is reasonable). That covers the large majority of slow-Redis problems.

For warming up the connection pool on a cold start, different commons-pool2 versions behave differently, but none of them warm up unless the eviction parameters (timeBetweenEvictionRunsMillis together with minEvictableIdleTimeMillis) are configured. Read the commons-pool2 official documentation and know the common parameters well; it makes locating problems much faster.

The connection pool's default parameters are somewhat weak for high-traffic services and need tuning for high-traffic scenarios; if the business traffic is modest, the defaults are fine.

Analyze each problem on its own terms; when one approach is stuck, change perspective and attack the problem from multiple angles.
Author: vivo Internet Server Team - Wang Shaodong