My Battle History with Redis Under Billion-Level Traffic
1. Background
One day I received feedback from an upstream caller: one of the Dubbo interfaces we provide was being circuit-broken for a short period at a fixed time every day, and the exception thrown was that the provider's Dubbo thread pool was exhausted. The interface currently serves 1.8 billion requests per day, with roughly 940K failing requests per day. That is where this optimization journey began.
2. Quick Response
2.1 Rapid Diagnosis
First, we went through the routine system monitoring (machine, JVM memory, GC, threads). There were slight spikes, but all within reason, and they did not line up with the times the errors were reported, so we set them aside for the moment.

Next came traffic analysis. We found a traffic surge at a fixed time every day, and the surge coincided exactly with the error times, so our preliminary judgment was that short bursts of heavy traffic were the trigger.
3. Finding the Performance Bottleneck
3.1 Interface Flow Analysis

3.1.1 Flow Chart

(The flow chart from the original post is not reproduced here.)

3.1.2 Process Analysis
After receiving a request, call the downstream interface, guarded by a Hystrix circuit breaker with a 500 ms timeout;

Using the data returned by the downstream interface, assemble the detail data: read the local cache first; on a local-cache miss, fall back to Redis; if Redis has nothing either, return directly while an asynchronous thread back-fills from the database (a sketch of this read path follows below).

If the downstream call in the first step fails, fall back to the bottom-line (degraded) data, which follows the same flow: local cache first, then Redis on a miss; if Redis has nothing, return directly while an asynchronous thread back-fills from the database.
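To make the read path concrete, here is a minimal sketch of the two-level lookup with asynchronous back-fill. It is an illustration only: the class, the ConcurrentHashMap stand-in for the local cache, the fixed thread pool, the 3600 s TTL and the loadFromDb stub are all assumptions, not the actual vivo code.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import redis.clients.jedis.JedisCluster;

    public class DetailReader {

        private final ConcurrentMap<String, String> localCache = new ConcurrentHashMap<>();
        private final JedisCluster jedis;                       // shared cluster client
        private final ExecutorService backfillPool = Executors.newFixedThreadPool(4);

        public DetailReader(JedisCluster jedis) {
            this.jedis = jedis;
        }

        public String getDetail(String key) {
            // 1. Local cache first
            String value = localCache.get(key);
            if (value != null) {
                return value;
            }
            // 2. Fall back to Redis on a local-cache miss
            value = jedis.get(key);
            if (value != null) {
                localCache.put(key, value);
                return value;
            }
            // 3. Redis miss: return directly; back-fill from the DB asynchronously
            backfillPool.submit(() -> {
                String fromDb = loadFromDb(key);                // hypothetical DAO call
                if (fromDb != null) {
                    jedis.setex(key, 3600, fromDb);             // TTL is an assumption
                    localCache.put(key, fromDb);
                }
            });
            return null;                                        // caller degrades gracefully
        }

        private String loadFromDb(String key) {
            return null;                                        // placeholder for the real query
        }
    }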
3.2 Performance Bottleneck Investigation

3.2.1 Downstream Interface Latency
The call chain showed that although the downstream interface's P99 spiked above 1 s at peak traffic, given the timeout settings (circuit-breaker timeout 500 ms, coreSize & maxSize = 50, average downstream latency under 10 ms) we judged that the downstream interface was not the crux of the problem. To remove its interference entirely and fail fast whenever the downstream service spikes, we tightened the circuit-breaker timeout to 100 ms and the Dubbo timeout to 100 ms.
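The article does not show the Hystrix wiring; the following is a minimal sketch of what the tightened settings might look like, assuming plain HystrixCommand usage. The group name, method names and fallback behavior are illustrative assumptions.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.HystrixCommandProperties;
    import com.netflix.hystrix.HystrixThreadPoolProperties;

    public class DownstreamCommand extends HystrixCommand<String> {

        protected DownstreamCommand() {
            super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("downstream"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                    .withExecutionTimeoutInMilliseconds(100))       // tightened from 500 ms
                .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                    .withCoreSize(50)
                    .withMaximumSize(50)));                         // coreSize & maxSize = 50
        }

        @Override
        protected String run() {
            return callDownstream();    // hypothetical downstream invocation
        }

        @Override
        protected String getFallback() {
            return null;                // routes the caller to the bottom-line data path
        }

        private String callDownstream() {
            return "ok";                // stand-in for the real Dubbo call
        }
    }

On the Dubbo side, the consumer timeout is normally tightened in the reference configuration, e.g. <dubbo:reference timeout="100" .../>.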
3.2.2 Detail Data: Local-Cache Miss, Falling Back to Redis
With the help of the call-chain platform, the first step was to analyze the Redis request traffic in order to estimate the local-cache hit rate. We found that the Redis traffic was twice the interface traffic, which by design should never happen. A code review turned up the bug in the logic: the code never read from the local cache at all, but fetched straight from Redis. The Redis maximum response time did show unreasonable spikes, and further analysis showed that the Redis response-time spikes matched the Dubbo P99 spikes almost exactly. It felt like we had found the cause, and we were quietly pleased.
3.2.3 Bottom-Line Data: Local-Cache Miss, Falling Back to Redis

This path checked out as normal.
3.2.4 Writing Request Results Back to Redis

Since this Redis cluster had already been resource-isolated, and no slow log showed up in the DB management console, we came up with many possible causes of the Redis slowdown at this point. But we subjectively dismissed everything else, fixated on the doubled Redis request traffic, and gave priority to the problem found in 3.2.2.
4. Solution
4.1 Fixing the Problem Located in 3.2.2

After the fix went live, the doubled Redis traffic was gone and the Redis maximum response time eased, but it was still not fully back to normal, which showed that the high-volume queries were not the root cause.
4.2 Redis Capacity Expansion
With the abnormal Redis traffic fixed but the problem still not fully resolved, all we could do was calm down and carefully work through the possible causes of slow Redis responses, along three main lines:
- slow queries exist;
- the Redis service itself is the performance bottleneck;
- the client configuration is unreasonable.
Following these leads one by one: we queried the Redis slow-query log and found no slow queries.
Using the call-chain platform to analyze slow Redis commands, now free of the interference from the traffic-induced slowness, we located the problem quickly: the bulk of the time-consuming requests were setex calls, and the occasional slow read always came right behind a setex. Since Redis executes commands on a single thread, we judged setex to be the prime culprit behind the Redis P99 spikes. After finding the specific statements and the specific business behind them, we first applied to expand the Redis cluster from 6 masters to 8 masters.
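As a toy illustration (not from the article) of the single-thread reasoning: one slow SETEX of a large value holds up every command queued behind it, so unrelated GETs inherit its latency under load. The address, key names and payload size below are assumptions.

    import redis.clients.jedis.Jedis;

    public class SetexLatencyProbe {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {   // address is an assumption
                byte[] big = new byte[4 * 1024 * 1024];          // 4 MB payload
                long t0 = System.nanoTime();
                jedis.setex("big:key".getBytes(), 60, big);      // slow write occupies the server
                System.out.printf("setex(4MB) took %d us%n", (System.nanoTime() - t0) / 1000);

                long t1 = System.nanoTime();
                jedis.get("some:other:key");                     // under load, queued behind such writes
                System.out.printf("get took %d us%n", (System.nanoTime() - t1) / 1000);
            }
        }
    }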
Judging by the results, the expansion had essentially no effect, which meant the Redis service itself was not the performance bottleneck. That left only one suspect: the client-side configuration.
4.3 Client Parameter Optimization

4.3.1 Connection Pool Optimization
With the Redis expansion ruled out, we turned to possible client-side problems, with two points of suspicion: first, that the client's connection management in Redis cluster mode had a bug; second, that the connection-pool parameters were set unreasonably. Source-code analysis and connection-pool parameter tuning proceeded in parallel.
4.3.1.1 Checking the Client's Connection Management for Bugs

Analysis of the client's connection-pool handling found no problem: as expected, the client caches one connection pool per slot, so the first hypothesis was ruled out. The (abridged) source is as follows.
1. setex

    public String setex(final byte[] key, final int seconds, final byte[] value) {
      return new JedisClusterCommand<String>(connectionHandler, maxAttempts) {
        @Override
        public String execute(Jedis connection) {
          return connection.setex(key, seconds, value);
        }
      }.runBinary(key);
    }

2. runBinary

    public T runBinary(byte[] key) {
      if (key == null) {
        throw new JedisClusterException("No way to dispatch this command to Redis Cluster.");
      }
      return runWithRetries(key, this.maxAttempts, false, false);
    }

3. runWithRetries

    private T runWithRetries(byte[] key, int attempts, boolean tryRandomNode, boolean asking) {
      if (attempts <= 0) {
        throw new JedisClusterMaxRedirectionsException("Too many Cluster redirections?");
      }
      Jedis connection = null;
      try {
        if (asking) {
          // TODO: Pipeline asking with the original command to make it
          // faster....
          connection = askConnection.get();
          connection.asking();
          // if asking success, reset asking flag
          asking = false;
        } else {
          if (tryRandomNode) {
            connection = connectionHandler.getConnection();
          } else {
            connection = connectionHandler.getConnectionFromSlot(JedisClusterCRC16.getSlot(key));
          }
        }
        return execute(connection);
      } finally {
        // retry/redirection handling elided in the original excerpt;
        // the connection is released back to the pool here
        releaseConnection(connection);
      }
    }

4. getConnectionFromSlot

    public Jedis getConnectionFromSlot(int slot) {
      JedisPool connectionPool = cache.getSlotPool(slot);
      if (connectionPool != null) {
        // It can't guaranteed to get valid connection because of node
        // assignment
        return connectionPool.getResource();
      } else {
        // It's abnormal situation for cluster mode, that we have just nothing
        // for slot, try to rediscover state
        renewSlotCache();
        connectionPool = cache.getSlotPool(slot);
        if (connectionPool != null) {
          return connectionPool.getResource();
        } else {
          // no choice, fallback to new connection to random node
          return getConnection();
        }
      }
    }
4.3.1.2 Analyzing the Connection Pool Parameters
After discussing with the middleware team and consulting the commons-pool2 official documentation, we adjusted the parameters as follows (the original post's parameter table is not reproduced here).
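A minimal sketch of the kind of adjustment described, assuming a standard JedisPoolConfig. Only maxWaitMillis = 200 ms is confirmed by the text below; every other value here is an illustrative assumption.

    import redis.clients.jedis.JedisPoolConfig;

    public final class TunedPoolConfig {
        public static JedisPoolConfig build() {
            JedisPoolConfig cfg = new JedisPoolConfig();
            cfg.setMaxTotal(50);              // assumed cap on connections per node
            cfg.setMaxIdle(50);               // assumed; keep idle capacity equal to maxTotal
            cfg.setBlockWhenExhausted(true);  // wait for a connection instead of failing at once
            cfg.setMaxWaitMillis(200);        // cap the borrow wait at 200 ms (from the article)
            cfg.setTestOnBorrow(false);       // assumed; avoids an extra PING per borrow
            return cfg;
        }
    }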
After the parameter adjustment, the number of requests taking more than 1 s dropped, though some remained; the degradation volume reported by the upstream fell from roughly 900K per day to about 60K. (Why requests over 200 ms still occurred with maxWaitMillis set to 200 ms is explained below.)
4.3.2 Further Optimization

The optimization could not stop there: how could we get all Redis write requests under 200 ms? The approach was still to tune client configuration parameters, which meant analyzing how Jedis acquires a connection.

Jedis acquires connections through commons-pool2; the (abridged) source of GenericObjectPool.borrowObject:
    final AbandonedConfig ac = this.abandonedConfig;
    if (ac != null && ac.getRemoveAbandonedOnBorrow() &&
            (getNumIdle() < 2) &&
            (getNumActive() > getMaxTotal() - 3)) {
        removeAbandoned(ac);
    }

    PooledObject<T> p = null;

    // Get local copy of current config so it is consistent for entire
    // method execution
    final boolean blockWhenExhausted = getBlockWhenExhausted();

    boolean create;
    final long waitTime = System.currentTimeMillis();

    while (p == null) {
        create = false;
        p = idleObjects.pollFirst();
        if (p == null) {
            p = create();
            if (p != null) {
                create = true;
            }
        }
        if (blockWhenExhausted) {
            if (p == null) {
                if (borrowMaxWaitMillis < 0) {
                    p = idleObjects.takeFirst();
                } else {
                    p = idleObjects.pollFirst(borrowMaxWaitMillis, TimeUnit.MILLISECONDS);
                }
            }
            if (p == null) {
                throw new NoSuchElementException(
                        "Timeout waiting for idle object");
            }
        } else {
            if (p == null) {
                throw new NoSuchElementException("Pool exhausted");
            }
        }
        if (!p.allocate()) {
            p = null;
        }

        if (p != null) {
            try {
                factory.activateObject(p);
            } catch (final Exception e) {
                try {
                    destroy(p);
                } catch (final Exception e1) {
                    // Ignore - activation failure is more important
                }
                p = null;
                if (create) {
                    final NoSuchElementException nsee = new NoSuchElementException(
                            "Unable to activate object");
                    nsee.initCause(e);
                    throw nsee;
                }
            }
            if (p != null && (getTestOnBorrow() || create && getTestOnCreate())) {
                boolean validate = false;
                Throwable validationThrowable = null;
                try {
                    validate = factory.validateObject(p);
                } catch (final Throwable t) {
                    PoolUtils.checkRethrow(t);
                    validationThrowable = t;
                }
                if (!validate) {
                    try {
                        destroy(p);
                        destroyedByBorrowValidationCount.incrementAndGet();
                    } catch (final Exception e) {
                        // Ignore - validation failure is more important
                    }
                    p = null;
                    if (create) {
                        final NoSuchElementException nsee = new NoSuchElementException(
                                "Unable to validate object");
                        nsee.initCause(validationThrowable);
                        throw nsee;
                    }
                }
            }
        }
    }

    updateStatsBorrow(p, System.currentTimeMillis() - waitTime);

    return p.getObject();
The general flow of acquiring a connection is:

Check for an idle connection; if one is available, return it directly, otherwise create one.

When creating, if the maximum number of connections would be exceeded, check whether another thread is currently creating a connection: if not, fail out of the create step directly; if so, wait up to maxWaitMillis (the other thread's creation may fail). If the maximum is not exceeded, create the connection (so the total wait to obtain a connection can exceed maxWaitMillis).

If creation did not succeed, check blockWhenExhausted: if it is false, throw a pool-exhausted exception; if it is true, check maxWaitMillis: if it is negative, block indefinitely; if it is positive, block for at most maxWaitMillis.

Finally, depending on the testOnBorrow / testOnCreate parameters, decide whether the connection needs validation before being returned.
By this flow, with maxWaitMillis currently set to 200, the maximum total blocking time should be 400 ms (two waiting steps of 200 ms each), and most of the time just 200 ms; there should be no spikes beyond 400 ms.

So the problem most likely lay in connection creation: creating a connection takes time, and how long is uncertain. We focused on whether that scenario was occurring, monitoring the Redis connection counts through the DB management console.
The connection-count monitoring (chart not reproduced here) showed that at certain minutes (9:00, 12:00, 19:00, ...) the number of Redis connections did climb, matching the Redis spike times almost exactly. It felt like (after everything we had tried, we no longer dared be certain) the problem was finally clear: when the traffic surge arrives, the available connections in the pool cannot meet demand, new connections have to be created, and requests end up waiting.
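If no such console is available, a rough stand-in (not what the article used) is to poll INFO clients on each master and log the connected_clients line; the node addresses and the 10 s sampling interval below are assumptions.

    import redis.clients.jedis.Jedis;

    public class ConnectionCountProbe {
        public static void main(String[] args) throws InterruptedException {
            // Illustrative master addresses; replace with the real cluster nodes.
            String[] masters = {"10.0.0.1:6379", "10.0.0.2:6379"};
            while (true) {
                for (String node : masters) {
                    String[] hp = node.split(":");
                    try (Jedis jedis = new Jedis(hp[0], Integer.parseInt(hp[1]))) {
                        // "INFO clients" output contains a "connected_clients:<n>" line
                        for (String line : jedis.info("clients").split("\r?\n")) {
                            if (line.startsWith("connected_clients")) {
                                System.out.println(node + " " + line);
                            }
                        }
                    }
                }
                Thread.sleep(10_000); // sample every 10 s
            }
        }
    }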
The idea, then, was to create the pool's connections at service startup and minimize new-connection creation afterwards. We modified the connection-pool parameter vivo.cache.depend.common.poolConfig.minIdle, and it turned out to have no effect at all!?

Nothing for it but to read the source. Under the hood Jedis manages connections with commons-pool2, so we examined part of the commons-pool2-2.6.2.jar source.

commons-pool2 2.6.2 source (GenericObjectPool constructor):
    public GenericObjectPool(final PooledObjectFactory<T> factory,
            final GenericObjectPoolConfig<T> config) {
        super(config, ONAME_BASE, config.getJmxNamePrefix());
        if (factory == null) {
            jmxUnregister(); // tidy up
            throw new IllegalArgumentException("factory may not be null");
        }
        this.factory = factory;
        idleObjects = new LinkedBlockingDeque<>(config.getFairness());
        setConfig(config);
    }
There was no initial-connection logic to be found, so we consulted the middleware team, who sent over the source below (from commons-pool2-2.4.2.jar): after the constructor body there is one extra call, to a startEvictor method!
1. Initialize the connection pool

    public GenericObjectPool(PooledObjectFactory<T> factory,
            GenericObjectPoolConfig config) {
        super(config, ONAME_BASE, config.getJmxNamePrefix());
        if (factory == null) {
            jmxUnregister(); // tidy up
            throw new IllegalArgumentException("factory may not be null");
        }
        this.factory = factory;
        idleObjects = new LinkedBlockingDeque<PooledObject<T>>(config.getFairness());
        setConfig(config);
        startEvictor(getTimeBetweenEvictionRunsMillis());
    }
Why the difference? We checked the jar packages: the versions differ. The middleware team had quoted v2.4.2, while the project actually uses v2.6.2. Analysis showed that one step of the startEvictor logic handles connection-pool warm-up.
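For completeness, our reading of the 2.6.x sources (worth verifying against the exact jar in use) is that the evictor is no longer started unconditionally from the constructor; it is started from the setter that setConfig invokes, and startEvictor only schedules the task when the period is positive, so with the default timeBetweenEvictionRunsMillis of -1 the Evictor, and with it the warm-up, never runs:

    // org.apache.commons.pool2.impl.BaseGenericObjectPool (2.6.x, abridged)
    public final void setTimeBetweenEvictionRunsMillis(final long timeBetweenEvictionRunsMillis) {
        this.timeBetweenEvictionRunsMillis = timeBetweenEvictionRunsMillis;
        startEvictor(timeBetweenEvictionRunsMillis);   // no-op unless the period is > 0
    }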
Connection pool warm-up source (commons-pool2 2.4.2):
1. startEvictor

    final void startEvictor(long delay) {
        synchronized (evictionLock) {
            if (null != evictor) {
                EvictionTimer.cancel(evictor);
                evictor = null;
                evictionIterator = null;
            }
            if (delay > 0) {
                evictor = new Evictor();
                EvictionTimer.schedule(evictor, delay, delay);
            }
        }
    }

2. Evictor

    class Evictor extends TimerTask {
        /**
         * Run pool maintenance. Evict objects qualifying for eviction and then
         * ensure that the minimum number of idle instances are available.
         * Since the Timer that invokes Evictors is shared for all Pools but
         * pools may exist in different class loaders, the Evictor ensures that
         * any actions taken are under the class loader of the factory
         * associated with the pool.
         */
        @Override
        public void run() {
            ClassLoader savedClassLoader = Thread.currentThread().getContextClassLoader();
            try {
                if (factoryClassLoader != null) {
                    // Set the class loader for the factory
                    ClassLoader cl = factoryClassLoader.get();
                    if (cl == null) {
                        // The pool has been dereferenced and the class loader
                        // GC'd. Cancel this timer so the pool can be GC'd as
                        // well.
                        cancel();
                        return;
                    }
                    Thread.currentThread().setContextClassLoader(cl);
                }

                // Evict from the pool
                try {
                    evict();
                } catch (Exception e) {
                    swallowException(e);
                } catch (OutOfMemoryError oome) {
                    // Log problem but give evictor thread a chance to continue
                    // in case error is recoverable
                    oome.printStackTrace(System.err);
                }

                // Re-create idle instances.
                try {
                    ensureMinIdle();
                } catch (Exception e) {
                    swallowException(e);
                }
            } finally {
                // Restore the previous CCL
                Thread.currentThread().setContextClassLoader(savedClassLoader);
            }
        }
    }

3. ensureMinIdle

    void ensureMinIdle() throws Exception {
        ensureIdle(getMinIdle(), true);
    }

4. ensureIdle

    private void ensureIdle(int idleCount, boolean always) throws Exception {
        if (idleCount < 1 || isClosed() || (!always && !idleObjects.hasTakeWaiters())) {
            return;
        }

        while (idleObjects.size() < idleCount) {
            PooledObject<T> p = create();
            if (p == null) {
                // Can't create objects, no reason to think another call to
                // create will work. Give up.
                break;
            }
            if (getLifo()) {
                idleObjects.addFirst(p);
            } else {
                idleObjects.addLast(p);
            }
        }
        if (isClosed()) {
            // Pool closed while object was being added to idle objects.
            // Make sure the returned object is destroyed rather than left
            // in the idle object pool (which would effectively be a leak)
            clear();
        }
    }
We aligned the jar version and added two parameters in the configuration center: vivo.cache.depend.common.poolConfig.timeBetweenEvictionRunsMillis (how often to scan the idle connections in the pool; connections idle longer than minEvictableIdleTimeMillis milliseconds are closed, until the pool is back down to minIdle connections) and vivo.cache.depend.common.poolConfig.minEvictableIdleTimeMillis (how long a pooled connection may remain idle, in milliseconds). After a service restart, the connection pool warmed up normally, and the problem was finally solved at the Redis level.
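A minimal sketch of the warm-up configuration that finally worked; the concrete values are illustrative, and the setters correspond to the vivo configuration keys above.

    import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

    public final class WarmPoolConfig {
        public static GenericObjectPoolConfig<Object> build() {
            GenericObjectPoolConfig<Object> cfg = new GenericObjectPoolConfig<>();
            cfg.setMinIdle(20);                             // connections kept warm (assumed value)
            cfg.setTimeBetweenEvictionRunsMillis(30_000);   // evictor period; must be > 0, or the
                                                            // Evictor (and ensureMinIdle) never runs
            cfg.setMinEvictableIdleTimeMillis(60_000);      // idle longer than this gets closed,
                                                            // down to minIdle connections
            return cfg;
        }
    }

With the period positive, the Evictor task fires periodically; each run calls ensureMinIdle(), which pre-creates connections up to minIdle, exactly the warm-up that the 2.6.2 constructor alone never triggers while the period sits at its default of -1.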
The optimization results (charts not reproduced here) showed the performance problem basically solved.
5. Summary
When a problem occurs online, the first consideration is restoring the business quickly to minimize impact, so for online services, prepare rate limiting, circuit breaking and degradation strategies in advance; then, when something goes wrong, a recovery path is ready at hand. Proficiency with the company's monitoring platforms (machine, service, interface, DB, and so on) determines how fast you can locate a problem; every developer should master them as a basic skill.

When Redis responds slowly, investigate three fronts first: the Redis cluster servers (machine load, slow queries on the service), the business code (bugs), and the client (whether the connection-pool configuration is reasonable). That covers the large majority of slow-Redis problems.

For warming up the connection pool on a cold start, different commons-pool2 versions behave differently, but none of them warm up unless the eviction parameters (timeBetweenEvictionRunsMillis together with minEvictableIdleTimeMillis) are configured. Read the commons-pool2 official documentation and know the common parameters well; it makes locating problems much faster.

The connection pool's default parameters are somewhat weak for high-traffic services and need tuning for high-traffic scenarios; if the business traffic is modest, the defaults are fine.

Analyze each problem on its own terms; when one approach is stuck, change perspective and attack the problem from multiple angles.
Author: vivo Internet Server Team - Wang Shaodong