
Problems caused by Redis caching scenarios

2022-06-21 20:34:00 InfoQ



1.  Data consistency

We know that Redis is mainly used as a cache. As soon as you introduce a cache, whether it is local in-process memory or Redis, you have to deal with keeping the cached data in sync with the database.

In general the flow is: read from the cache first; if the data is there, return it immediately; if not, read it from the database, write it back to the cache, and return it, so that the next read request can be served from the cache.

This effectively reduces the pressure on the database, but if data in the database is modified or deleted, the cache has no way to perceive the change. The data in the cache and the data in the database then become inconsistent. How do we solve this?

There are several common solutions:

  • Update the cache first, then the database;
  • Update the database first, then the cache;
  • Delete the cache first, then update the database;
  • Update the database first, then delete the cache.

1.1 Update the cache first, then the database

This scheme is usually ruled out. If the cache is updated successfully but an exception occurs while updating the database, the cached data ends up completely inconsistent with the database, and the problem is hard to detect because the stale data is always present in the cache.

1.2 Update the database first, then the cache

This scheme is generally not considered either, for the same reason as the first: if the database update succeeds but the cache update fails, the data again becomes inconsistent.

More generally, updating the cache (rather than deleting it) is usually avoided, mainly for the following 2 reasons:

1. Concurrency issues

Suppose request A and request B both perform an update at the same time. The following interleaving can occur:

  • Thread A updates the database;
  • Thread B updates the database;
  • Thread B updates the cache;
  • Thread A updates the cache.

Request A should have updated the cache before request B, but because of network delays and the like, B updates the cache before A. The cache ends up holding A's older value, i.e. dirty data, so this scheme is not considered.

2. Business scenario problems

If your business writes to the database frequently but reads the data rarely, this scheme means the cache is updated over and over before anyone reads it, which is wasted work.

Besides, in more complex caching scenarios the cached value is not simply a value copied straight out of the database. For example, when a field of one table is updated, the corresponding cache may need to be recomputed by querying two other tables and combining the results.

Does that mean every modification of the database must be followed by an update of the corresponding cache? For simple cases, perhaps; but for caches that are expensive to compute, no. If the several tables behind one cached value are modified frequently, the cache is also recomputed frequently. The real question is: will this cache actually be read that often?

For example: a field of a table involved in the cache is changed 20 times, or 100 times, within 1 minute, so the cache is recomputed 20 or 100 times, yet the cache is read only once in that minute. Most of that work produces cold data.

If instead you just delete the cache, then within that 1 minute the cache is recomputed only once, when it is actually read, which cuts the overhead dramatically: the cache is only computed when the cache is used.

Deleting the cache instead of updating it is essentially a lazy computation idea: do not redo the expensive calculation on every write regardless of whether it will be used, but recompute only when the value is actually needed.

In the end, choosing between updating the cache and deleting (evicting) the cache mainly depends on how complex the cache is to update. If updating the cache is cheap, prefer updating it, which keeps the cache hit rate higher. If updating the cache is expensive, prefer deleting it. Deleting the cache is simple and its only side effect is one extra cache miss, so it is generally used as the default approach.

1.3 Delete the cache first, then update the database

This scheme also has problems, for the following reason:

  • Suppose there are two requests: request A (an update) and request B (a query);
  • Request A first deletes the data in Redis, then goes to update the database;
  • At this moment request B finds nothing in Redis, queries the value from the database, and writes it into Redis.

But at this point request A has not finished the update, or its transaction has not yet been committed, so request B read the old value from the database. The database and Redis are now inconsistent.

So what is the solution? The simplest one is the delayed double delete strategy, namely:

  • Delete the cache first;
  • Then write to the database;
  • Sleep for 1 second, then delete the cache again.

This removes any dirty cache data produced within that 1 second.

So how is this 1 second determined? How long should the sleep be?

For the case above, evaluate how long the read path of your own project takes, including its business logic. The sleep before the second delete should then be that read duration plus a few hundred ms. The goal is to make sure the read request has finished, so that the second delete can remove any dirty cache data the read request left behind.
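
As a rough illustration, here is a minimal sketch of delayed double delete in Java, written in the same pseudo-helper style as the mutex example later in this article (redis and db stand for an assumed Redis client and DAO layer; the 1-second sleep is the read-path estimate discussed above):

public void updateWithDoubleDelete(String key, Object newValue) {
    redis.del(key);                // 1. delete the cache first
    db.update(key, newValue);      // 2. then write the database
    try {
        Thread.sleep(1000);        // 3. wait roughly one read-request cycle (tune to your own read path)
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
    redis.del(key);                // 4. delete again to evict any stale value a concurrent read wrote back
}

In practice the second delete is usually handed to an asynchronous task or a delayed queue, so the write request does not block for the whole sleep interval.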

1.4 Update the database first, then delete the cache

This approach is known as the Cache Aside Pattern.

For read requests:

  • Read the cache first, then the database;
  • If the cached value exists, return it directly;
  • If it does not exist, read the database, put the result into the cache, and return the response.

For write requests:

  • Update the database first;
  • Then delete the cache.
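
To make the two paths concrete, here is a minimal Cache-Aside sketch using the Jedis client; loadFromDb and updateDb are hypothetical DAO calls standing in for your own persistence layer, and the TTL is an arbitrary example value:

import redis.clients.jedis.Jedis;

public class CacheAsideExample {
    private final Jedis jedis = new Jedis("localhost", 6379);
    private static final int TTL_SECONDS = 300;

    // read path: cache first, fall back to the database and backfill the cache
    public String read(String key) {
        String value = jedis.get(key);
        if (value != null) {
            return value;
        }
        value = loadFromDb(key);                  // assumed DAO call
        if (value != null) {
            jedis.setex(key, TTL_SECONDS, value); // backfill with an expiry
        }
        return value;
    }

    // write path: update the database first, then delete the cache
    public void write(String key, String value) {
        updateDb(key, value);                     // assumed DAO call
        jedis.del(key);
    }

    private String loadFromDb(String key) { /* query the database */ return null; }
    private void updateDb(String key, String value) { /* update the database */ }
}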

Can this approach still run into concurrency problems? It can:

Suppose there are two requests: request A performs a query and request B performs an update. The following can happen:

  • The cached value has just expired;
  • Request A queries the database and gets the old value;
  • Request B writes the new value to the database;
  • Request B deletes the cache;
  • Request A writes the old value it read into the cache.

But the probability of this happening is actually very low. For the sequence above to occur, the database write in step 3 would have to take less time than the database read in step 2, so that step 4 could happen before step 5.

In practice, database reads are much faster than writes, so it is very unlikely that step 3 finishes before step 2, and this race is very hard to hit.

Still, it is theoretically possible, so how do we handle it? There are usually two options:

  • Set an expiration time on the cache;
  • Delete the cache again asynchronously with a delay.

1.5 Problems with the delete step

So for Redis cache consistency the usual approach involves deleting the cache. Since there is a delete step, what happens if the delete itself fails? The stale data stays in the cache, and every subsequent query returns wrong data. How can this be solved?

There are generally two schemes:

Retry the delete through a message queue as compensation

  • Update the database first;
  • The delete against Redis fails with an error;
  • The failed key is sent to a message queue as the message body;
  • The application consumes the message from the queue;
  • It performs the Redis delete operation again.
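
A rough sketch of this compensation flow; mq.send and the consumer callback are placeholders for whichever message broker you use (RocketMQ, Kafka, RabbitMQ, ...), and redis / db are the same assumed helpers as in the pseudocode elsewhere in this article:

// write path: if the cache delete fails, hand the key over to a message queue for compensation
public void writeWithCompensation(String key, String value) {
    db.update(key, value);                      // 1. update the database first
    try {
        redis.del(key);                         // 2. try to delete the cache
    } catch (Exception e) {
        mq.send("cache-delete-retry", key);     // 3. on failure, publish the key for a later retry
    }
}

// consumer side: retry the delete until it succeeds (real code should cap the retry count)
public void onCacheDeleteMessage(String key) {
    try {
        redis.del(key);
    } catch (Exception e) {
        mq.send("cache-delete-retry", key);     // requeue for another attempt
    }
}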

The drawback of this solution is that it is intrusive: the retry logic is deeply coupled into the business code. An optimized variant relies on the fact that every MySQL update is recorded in the binlog, so we can subscribe to the MySQL binlog and operate on the cache from there instead.

To subscribe to the binlog you can use Alibaba's open-source framework canal; see this article for details:

Solving MySQL and Redis consistency with the canal framework

2.  Cache penetration

Cache penetration means querying data that simply does not exist: neither the cache layer nor the storage layer hits, and because nothing is found in the storage layer, nothing is written back to the cache layer either.

As a result, every request for such non-existent data goes all the way to the storage layer, and the cache no longer protects the backend storage. Cache penetration increases the load on the backend storage, which often cannot handle high concurrency and may even go down.

In the application you can separately count the total number of calls, cache-layer hits, and storage-layer hits; a large number of empty hits on the storage layer suggests a cache penetration problem.

2.1  Cause analysis

There are two basic causes of cache penetration:

1. Problems in your own business code or data

For example, suppose the ids in our database all start at 1 and increase, so every valid id is greater than 0. If someone requests id = -1, or a very large id that does not exist, and the parameter is never validated, every such request bypasses Redis and goes straight to the database, which finds nothing. Under high concurrency this can easily bring the database down.

2. Malicious attacks, crawlers, and the like causing a large number of empty hits

2.2  Solution

Cache penetration can be addressed in the following ways:

1. Add validation

For example, authenticate the user and do basic checks on the id; intercept id <= 0 directly.

2. Cache empty objects

Even though the data does not exist in the source, keep an empty object in the cache layer; subsequent accesses to this key are then served from the cache, which protects the backend data source.

But this raises 2 issues:

  • Caching null values means more keys in the cache layer and more memory (and if it is an attack, the problem gets worse). The usual mitigation is to give these entries a short expiration time so they are removed automatically.
  • The cache layer and the storage layer will be inconsistent for a while, which may affect the business. For example, with a 5-minute expiry, if the storage layer actually gains this data during that window, the cache still claims it is empty; a data-consistency scheme like those above can handle this.
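
A sketch of the empty-object approach, reusing the Jedis client and loadFromDb placeholder from the Cache-Aside sketch above; the sentinel string and the 60-second TTL are illustrative choices:

private static final String NULL_PLACEHOLDER = "__NULL__"; // sentinel meaning "known to be absent"

public String getWithNullCaching(String key) {
    String value = jedis.get(key);
    if (NULL_PLACEHOLDER.equals(value)) {
        return null;                              // known miss, do not touch the database
    }
    if (value != null) {
        return value;
    }
    value = loadFromDb(key);                      // assumed DAO call
    if (value == null) {
        jedis.setex(key, 60, NULL_PLACEHOLDER);   // cache the miss with a short expiry
    } else {
        jedis.setex(key, 300, value);
    }
    return value;
}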

3. Bloom filter

Before the cache layer and the storage layer are reached, store the set of existing keys in a Bloom filter ahead of time and use it as a first-level interceptor.

For example: a recommendation system has 4 billion user ids, and every hour the algorithm engineers recompute each user's recommendations from their historical behaviour and write them to the storage layer. Brand-new users have no historical behaviour, so their requests would penetrate the cache. To prevent this, put all user ids that have recommendation data into a Bloom filter. If the Bloom filter says a user id does not exist, the request never reaches the storage layer, which protects it to a large extent.

This method fits scenarios where the hit rate is low, the data set is relatively fixed, and real-time requirements are low (typically large data sets). The code is more complex to maintain, but the cache footprint is small.
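
A sketch of the Bloom-filter pre-check using Guava's com.google.common.hash.BloomFilter; the expected-insertions and false-positive-rate figures are illustrative, and in a multi-node deployment the filter is often kept in Redis (for example via the RedisBloom module) rather than in each JVM:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class RecommendationGuard {
    // sized for 100 million user ids with roughly 1% false positives
    private final BloomFilter<Long> knownUserIds =
            BloomFilter.create(Funnels.longFunnel(), 100_000_000L, 0.01);

    // called whenever recommendation data is (re)built for a user
    public void register(long userId) {
        knownUserIds.put(userId);
    }

    public String getRecommendations(long userId) {
        if (!knownUserIds.mightContain(userId)) {
            return null;                 // definitely has no data: skip the cache and the storage layer
        }
        return readThroughCache(userId); // normal cache -> storage lookup
    }

    private String readThroughCache(long userId) { /* cache-aside lookup */ return null; }
}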

3.  Cache breakdown

Cache breakdown refers to a hotspot key that receives a large amount of concurrent access. At the moment this key expires, the continuing high concurrency punches through the cache and hits the database directly.

To guard against it, set hot data to never expire, or protect the rebuild with a mutex.

3.1  Cause analysis

A key with an expiration time that carries high concurrency is hot data. In the window between this key expiring and the data being loaded back into the cache from MySQL, the flood of requests can overwhelm the database.

Note the difference: a cache avalanche is a large number of keys expiring at once, while cache breakdown is a single hot key expiring.

3.2  Solution

1. Use a mutex

A common industry practice is to use a mutex.

Simply put, when the cache misses (the value is found to be empty), do not load from the DB immediately. Instead, first use an atomic set-if-absent operation of the caching tool (for example Redis SETNX, or Memcache ADD) to set a mutex key. If that operation succeeds, load from the DB and rebuild the cache; otherwise, retry the whole get-from-cache path.

The pseudocode is as follows:

public String get(String key) {
    String value = redis.get(key);
    if (value == null) { // cache miss: the cached value has expired
        // set a 3-minute timeout on the mutex so that if the del below fails,
        // the lock still expires and the next cache miss can load from the DB
        if (redis.setnx(key_mutex, 1, 3 * 60) == 1) {
            // lock acquired: this thread rebuilds the cache
            value = db.get(key);
            redis.set(key, value, expire_secs);
            redis.del(key_mutex);
            return value;
        } else {
            // another thread is already loading from the DB and rebuilding the cache;
            // back off briefly, then retry reading the cache
            sleep(50);
            return get(key); // retry
        }
    } else {
        return value;
    }
}

2. Never expire

"Never expire" here has two meanings:

  • Seen from Redis, no expiration time is actually set on the key, which guarantees there is no hot-key expiry problem. This is physical non-expiration.
  • Functionally, if the key never expires, doesn't the data go stale? So we keep an expiration timestamp alongside the cached value; when a reader finds the value logically expired, a background asynchronous thread rebuilds the cache. This is logical non-expiration.

In practice this approach is very performance-friendly. Its only drawback is that while the cache is being rebuilt, the other threads (those not doing the rebuild) may read slightly stale data, which is tolerable for most Internet-facing features.
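
A sketch of the logical-expiration idea: the value stored in Redis carries its own expireAt timestamp, and a reader that notices the value is logically expired returns the old data while kicking off an asynchronous rebuild. The serialization and rebuild details are assumed placeholders:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LogicalExpireCache {
    private final ExecutorService rebuildPool = Executors.newFixedThreadPool(4);

    // wrapper actually stored in Redis: the real value plus a logical expiry timestamp
    static class CacheValue {
        String data;
        long expireAt; // epoch millis after which the value is considered stale
    }

    public String get(String key) {
        CacheValue cached = readFromRedis(key);         // assumed: GET + deserialize
        if (cached == null) {
            return rebuild(key);                        // first load is synchronous
        }
        if (cached.expireAt < System.currentTimeMillis()) {
            rebuildPool.submit(() -> rebuild(key));     // logically expired: rebuild in the background
        }
        return cached.data;                             // answer immediately, possibly with slightly stale data
    }

    private String rebuild(String key) {
        // assumed: load from the DB, recompute, write back with a fresh expireAt;
        // guard with a mutex so only one thread rebuilds at a time
        return null;
    }

    private CacheValue readFromRedis(String key) { return null; } // assumed helper
}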

4.  Cache avalanche

A cache avalanche happens when a large amount of cached data expires at the same moment while the query volume is huge, which puts enormous pressure on the database and may even bring it down.

4.1  Cause analysis

The cache layer carries a large share of the requests and effectively protects the storage layer. But if the cache layer cannot serve for some reason, for example a large portion of the cached data expires at the same time, then at that moment it is as if Redis were not there: every request reaches the storage layer, the number of calls to it skyrockets, and the storage layer can suffer cascading failures.

4.2  Solution

To prevent and mitigate cache avalanches, we can start from the following aspects:

  • Make the cache layer highly available. Like an airplane with multiple engines, a highly available cache layer keeps serving even if individual nodes, machines, or even an entire data centre go down; Redis Sentinel and Redis Cluster provide this high availability.
  • Set hot keys to never expire.
  • Stagger the expiration times of keys as much as possible (see the jitter sketch below).
  • Use a multi-level caching mechanism, for example Redis plus Memcache: request -> redis -> memcache -> db.
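
To stagger expiries, one common trick is to add random jitter to every TTL when the cache is populated, so keys written in the same batch do not all expire in the same instant. A small sketch (the base TTL and jitter range are arbitrary, jedis as in the earlier sketches):

import java.util.concurrent.ThreadLocalRandom;

// write the value with a base TTL plus up to 5 minutes of random jitter
public void setWithJitter(String key, String value) {
    int baseTtlSeconds = 30 * 60;                                          // 30 minutes
    int jitterSeconds = ThreadLocalRandom.current().nextInt(0, 5 * 60);    // 0..299 seconds
    jedis.setex(key, baseTtlSeconds + jitterSeconds, value);
}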

5. Hot Key

In Redis, a key that is accessed with very high frequency is called a hot key: hot data.

5.1 Causes

Hot-key problems have two main causes:

  • The volume of traffic from users is far larger than expected. Unexpected events in daily life can do this: for example, during Double 11 a heavily discounted popular item may be viewed or bought tens of thousands of times in a short window, producing a traffic spike on one key. Likewise, widely shared breaking news, hot comments, celebrity live streams and other read-heavy, write-light scenarios generate hot keys.
  • Requests concentrate beyond what a single server can handle. Data is sharded across servers, so all accesses to a given key land on the same host; once that traffic exceeds the limit of that server, a hot-key problem appears.

5.2 Harm

Traffic concentrates on one node until the physical network card limit is reached.

Too many requests crash the cache shard that holds the key, the database gets hit directly, and the failure cascades into a business-level avalanche.

As mentioned above, when requests for a hot key exceed the network bandwidth of the host it lives on, the over-concentrated traffic makes the other services on that server unavailable as well. If hot keys are too concentrated and their cached data exceeds the capacity of the shard, the cache shard crashes.

Once the cache service has crashed, the incoming requests fall through to the backing DB. The DB is comparatively weak and is easily penetrated by the large request volume, which further escalates into an avalanche and seriously degrades the whole system.

5.3  Solution

Once hot keys have been identified, they need special handling. There are usually the following 2 schemes:

1. Use a second-level (local) cache

You can use guava-cache or hcache to load hot keys into the JVM as a local cache. Accesses to these keys are then served directly from the local cache without touching Redis at all, which effectively protects the cache servers.
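
A sketch of a local second-level cache built with Guava's CacheBuilder: hot keys are served from the JVM heap and only fall through to Redis on a local miss (the size and TTL are illustrative, and note that the Guava loader must not return null):

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import redis.clients.jedis.Jedis;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

public class HotKeyLocalCache {
    private final Jedis jedis = new Jedis("localhost", 6379);
    private final Cache<String, String> localCache = CacheBuilder.newBuilder()
            .maximumSize(10_000)                    // bound the heap footprint
            .expireAfterWrite(10, TimeUnit.SECONDS) // short TTL keeps local copies reasonably fresh
            .build();

    public String get(String key) throws ExecutionException {
        // serve from the JVM if possible; otherwise load once from Redis and keep it locally
        return localCache.get(key, () -> jedis.get(key));
    }
}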

2. Key scattering

Split the hot key into multiple sub keys stored on different machines of the cache cluster; each sub key holds the same value as the original hot key. When querying the hot key, use a hashing or random scheme to pick one of the sub keys and access that cache node, so the hot traffic is spread across the sub keys.
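
A sketch of key scattering: the hot key is copied to N sub keys with numeric suffixes, reads pick a random suffix, and writes refresh every copy (the copy count and naming scheme are arbitrary choices):

import java.util.concurrent.ThreadLocalRandom;
import redis.clients.jedis.Jedis;

public class ScatteredHotKey {
    private static final int COPIES = 10; // number of sub keys spread over the cluster
    private final Jedis jedis = new Jedis("localhost", 6379);

    // read: pick one copy at random so the load is spread over several slots/nodes
    public String get(String hotKey) {
        int idx = ThreadLocalRandom.current().nextInt(COPIES);
        return jedis.get(hotKey + ":" + idx);
    }

    // write: refresh every copy so all sub keys hold the same value
    public void set(String hotKey, String value, int ttlSeconds) {
        for (int i = 0; i < COPIES; i++) {
            jedis.setex(hotKey + ":" + i, ttlSeconds, value);
        }
    }
}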

In reality, hot-key problems only appear under very high concurrency; an ordinary single-machine system never sees that much traffic, and a two-level cache architecture adds complexity, so whether it is worth it depends on the actual business scenario.

For example, a single Redis node can serve roughly 100k QPS. With Redis 5.0 we usually deploy a cluster of 3 masters and 6 replicas; a hot key maps to one hash slot, i.e. one master plus two replicas, which in theory can serve about 300k QPS. Leaving some buffer, 100k QPS on a single hot key is certainly fine. For higher loads, say millions of accesses per second, a local cache is necessary in addition to scaling out the Redis cluster.

6. Big Key

A bigkey is a key whose value occupies a comparatively large amount of memory. For example, a string value can hold up to 512 MB, and a list can hold up to 2^32 - 1 elements. Broken down by data structure, bigkeys fall into string-type bigkeys and non-string-type bigkeys.

String type: a single value that is too large. A value over 10 KB is commonly considered a bigkey, but the threshold also depends on the QPS involved.

Non-string types: hashes, lists, sets, and sorted sets whose element count is too large.

Bigkeys are unfriendly in terms of both space and time.

6.1 Finding bigkeys

The command redis-cli --bigkeys reports the distribution of bigkeys. In production, however, developers and operators usually prefer to define their own bigkey threshold and, more importantly, to find out exactly which keys are bigkeys, so the problem can be located, solved and optimized. To judge whether a key is a bigkey, run debug object key and look at the serializedlength attribute, which is the number of bytes of the value after serialization.

If there are many keys, scan + debug object will be slow; the Pipeline mechanism can speed it up. For data structures with many elements, debug object itself is slow and may block Redis, so if a replica node is available, consider running it there.

6.2 Harm

The harm of bigkeys shows up in three ways:

  • Uneven memory usage. For example, in a Redis Cluster, the nodes holding bigkeys use much more memory than the others.
  • Blocking and timeouts. Because Redis is single-threaded, operating on a bigkey takes longer, which increases the chance of blocking Redis.
  • Network congestion. Every fetch of a bigkey generates a lot of traffic. Suppose a bigkey is 1 MB and is fetched 1000 times per second: that is 1000 MB of traffic per second, a disaster for an ordinary gigabit NIC (about 128 MB/s in bytes). Moreover, servers are usually deployed with multiple instances per machine, so one bigkey can drag down the other instances as well, with dire consequences.

The existence of a bigkey is not always fatal. If the bigkey exists but is rarely accessed, only the uneven-memory problem remains, which is less important and less urgent than the other two. But if the bigkey is also a hot key, the damage is hard to overstate, so in development and operations we must watch out for bigkeys.

6.3  Solution

The main idea is to split: break the data stored under the big key (the big value) into value1, value2, ... valueN.

For example, if the big value is a large JSON document, use mset to scatter its contents across several keys on different instances, or store it as a hash where each field represents one attribute; then hget / hmget fetch only part of the value, and hset / hmset update only some attributes.

Likewise, if the big value is a huge list, split it into list_1, list_2, ... list_N; the same applies to the other data types.
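
A sketch of splitting one big JSON-like value into a Redis hash, so callers fetch only the fields they need instead of pulling the entire value over the network every time (the key pattern and field names are illustrative, jedis as in the earlier sketches):

import java.util.List;
import java.util.Map;

// store each attribute of the big object as a separate hash field
public void saveUserProfile(String userId, Map<String, String> attributes) {
    String key = "user:profile:" + userId;
    for (Map.Entry<String, String> entry : attributes.entrySet()) {
        jedis.hset(key, entry.getKey(), entry.getValue());
    }
    jedis.expire(key, 3600); // 1-hour expiry for the whole hash
}

// read only the attributes the caller actually needs
public List<String> readBasicInfo(String userId) {
    return jedis.hmget("user:profile:" + userId, "name", "avatar");
}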

7. Redis split brain

Split brain means that, in a master-slave cluster that is supposed to guarantee availability, two master nodes exist at the same time and both can accept write requests.

The main cause is usually a network problem: the Redis master ends up in a different network partition from the slave nodes and the sentinel cluster. Because the sentinels can no longer perceive the original master, they promote a slave to be the new master.

The most direct effect of split brain is that clients do not know which master to write to, so different clients write data to different masters. Worse, split brain can go on to cause data loss.

7.1 Split brain in a sentinel master-slave cluster

Suppose there are three servers: one master, two slaves, plus the sentinel mechanism.

In this environment, a network fluctuation suddenly cuts the master off from the normal network, but the master is in fact still running. The sentinels hold an election and promote one of the slaves to be the new master.

If at that moment App Server1 is still connected to the old master while App Server2 is connected to the new one, their data diverges. Once the sentinels can reach the old master again, they demote it to a slave and make it resynchronize from the new master (a full resynchronization), so the data written to the old master during the split brain is lost.

Solution

Configuring the following parameters mitigates split brain:

min-replicas-to-write 1 
min-replicas-max-lag 5

Together these parameters mean: at least 1 slave must be connected, and the replication lag must not exceed 5 seconds.

The first parameter requires at least 1 slave node: a write is only considered successful once the master has replicated it to at least 1 slave.

The second parameter says the replication lag must not exceed 5 seconds.

With these two parameters configured, if a split brain occurs, the original (isolated) master will reject client writes, which avoids losing large amounts of data.

7.2 Split brain in Redis Cluster

By default, split brain generally does not happen in a Redis Cluster, because the cluster uses a majority-based election mechanism, and when any one of the 16384 slots is not assigned to a node, the whole cluster becomes unavailable.

So when building a Redis Cluster, there should be at least 3 master nodes, and the number of voting nodes in the cluster should be odd.

Outside the defaults, for example when the number of masters is even or when cluster-require-full-coverage is turned off (when it is on, the whole cluster stops serving as soon as a node failure leaves the 16384 slots not fully covered), split brain can still occur. In that case the min-replicas-to-write and min-replicas-max-lag parameters can be used here as well.