
Problems caused by Redis caching scenarios

2022-06-21 20:34:00 InfoQ



1.  Data consistency

We know that Redis is mainly used as a cache. As soon as you introduce a cache, whether it is local in-process memory or Redis, you have to deal with keeping the cached data in sync with the database.

In general the flow is: read from the cache first; if the data is there, return it immediately; if not, read it from the database, write it back to the cache, and return it, so that the next read request can be served from the cache.

This effectively reduces the pressure on the database, but if data in the database is modified or deleted, the cache has no way to perceive the change. The data in the cache and the data in the database then become inconsistent. How do we solve this?

There are several common solutions:

  • Update the cache first, then the database;
  • Update the database first, then the cache;
  • Delete the cache first, then update the database;
  • Update the database first, then delete the cache.

1.1 Update the cache first, then the database

This scheme is usually ruled out. If the cache is updated successfully but an exception occurs while updating the database, the cached data ends up completely inconsistent with the database, and the problem is hard to detect because the stale data is always present in the cache.

1.2 Update the database first, then the cache

This scheme is generally not considered either, for the same reason as the first: if the database update succeeds but the cache update fails, the data again becomes inconsistent.

More generally, updating the cache (rather than deleting it) is usually avoided, mainly for the following 2 reasons:

1. Concurrency issues

Suppose request A and request B both perform an update at the same time. The following interleaving can occur:

  • Thread A updates the database;
  • Thread B updates the database;
  • Thread B updates the cache;
  • Thread A updates the cache.

Request A should have updated the cache before request B, but because of network delays and the like, B updates the cache before A. The cache ends up holding A's older value, i.e. dirty data, so this scheme is not considered.

2. Business scenario problems

If your business writes to the database frequently but reads the data rarely, this scheme means the cache is updated over and over before anyone reads it, which is wasted work.

Besides, in more complex caching scenarios the cached value is not simply a value copied straight out of the database. For example, when a field of one table is updated, the corresponding cache may need to be recomputed by querying two other tables and combining the results.

Does that mean every modification of the database must be followed by an update of the corresponding cache? For simple cases, perhaps; but for caches that are expensive to compute, no. If the several tables behind one cached value are modified frequently, the cache is also recomputed frequently. The real question is: will this cache actually be read that often?

For example: a field of a table involved in the cache is changed 20 times, or 100 times, within 1 minute, so the cache is recomputed 20 or 100 times, yet the cache is read only once in that minute. Most of that work produces cold data.

If instead you just delete the cache, then within that 1 minute the cache is recomputed only once, when it is actually read, which cuts the overhead dramatically: the cache is only computed when the cache is used.

Deleting the cache instead of updating it is essentially a lazy computation idea: do not redo the expensive calculation on every write regardless of whether it will be used, but recompute only when the value is actually needed.

In the end, choosing between updating the cache and deleting (evicting) the cache mainly depends on how complex the cache is to update. If updating the cache is cheap, prefer updating it, which keeps the cache hit rate higher. If updating the cache is expensive, prefer deleting it. Deleting the cache is simple and its only side effect is one extra cache miss, so it is generally used as the default approach.

1.3 Delete the cache first, then update the database

This scheme also has problems, for the following reason:

  • Suppose there are two requests: request A (an update) and request B (a query);
  • Request A first deletes the data in Redis, then goes to update the database;
  • At this moment request B finds nothing in Redis, queries the value from the database, and writes it into Redis.

But at this point request A has not finished the update, or its transaction has not yet been committed, so request B read the old value from the database. The database and Redis are now inconsistent.

So what is the solution? The simplest one is the delayed double delete strategy, namely:

  • Delete the cache first;
  • Then write to the database;
  • Sleep for 1 second, then delete the cache again.

This removes any dirty cache data produced within that 1 second.

So how is this 1 second determined? How long should the sleep be?

For the case above, evaluate how long the read path of your own project takes, including its business logic. The sleep before the second delete should then be that read duration plus a few hundred ms. The goal is to make sure the read request has finished, so that the second delete can remove any dirty cache data the read request left behind.
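
As a rough illustration, here is a minimal sketch of delayed double delete in Java, written in the same pseudo-helper style as the mutex example later in this article (redis and db stand for an assumed Redis client and DAO layer; the 1-second sleep is the read-path estimate discussed above):

public void updateWithDoubleDelete(String key, Object newValue) {
    redis.del(key);                // 1. delete the cache first
    db.update(key, newValue);      // 2. then write the database
    try {
        Thread.sleep(1000);        // 3. wait roughly one read-request cycle (tune to your own read path)
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
    redis.del(key);                // 4. delete again to evict any stale value a concurrent read wrote back
}

In practice the second delete is usually handed to an asynchronous task or a delayed queue, so the write request does not block for the whole sleep interval.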

1.4 Update the database first, then delete the cache

This approach is known as the Cache Aside Pattern.

For read requests:

  • Read the cache first, then the database;
  • If the cached value exists, return it directly;
  • If it does not exist, read the database, put the result into the cache, and return the response.

For write requests:

  • Update the database first;
  • Then delete the cache.
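
To make the two paths concrete, here is a minimal Cache-Aside sketch using the Jedis client; loadFromDb and updateDb are hypothetical DAO calls standing in for your own persistence layer, and the TTL is an arbitrary example value:

import redis.clients.jedis.Jedis;

public class CacheAsideExample {
    private final Jedis jedis = new Jedis("localhost", 6379);
    private static final int TTL_SECONDS = 300;

    // read path: cache first, fall back to the database and backfill the cache
    public String read(String key) {
        String value = jedis.get(key);
        if (value != null) {
            return value;
        }
        value = loadFromDb(key);                  // assumed DAO call
        if (value != null) {
            jedis.setex(key, TTL_SECONDS, value); // backfill with an expiry
        }
        return value;
    }

    // write path: update the database first, then delete the cache
    public void write(String key, String value) {
        updateDb(key, value);                     // assumed DAO call
        jedis.del(key);
    }

    private String loadFromDb(String key) { /* query the database */ return null; }
    private void updateDb(String key, String value) { /* update the database */ }
}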

Can this approach still run into concurrency problems? It can:

Suppose there are two requests: request A performs a query and request B performs an update. The following can happen:

  • The cached value has just expired;
  • Request A queries the database and gets the old value;
  • Request B writes the new value to the database;
  • Request B deletes the cache;
  • Request A writes the old value it read into the cache.

But the probability of this happening is actually very low. For the sequence above to occur, the database write in step 3 would have to take less time than the database read in step 2, so that step 4 could happen before step 5.

In practice, database reads are much faster than writes, so it is very unlikely that step 3 finishes before step 2, and this race is very hard to hit.

Still, it is theoretically possible, so how do we handle it? There are usually two options:

  • Set an expiration time on the cache;
  • Delete the cache again asynchronously with a delay.

1.5 Problems with the delete step

So for Redis cache consistency the usual approach involves deleting the cache. Since there is a delete step, what happens if the delete itself fails? The stale data stays in the cache, and every subsequent query returns wrong data. How can this be solved?

There are generally two schemes:

Retry the delete through a message queue as compensation

  • Update the database first;
  • The delete against Redis fails with an error;
  • The failed key is sent to a message queue as the message body;
  • The application consumes the message from the queue;
  • It performs the Redis delete operation again.
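
A rough sketch of this compensation flow; mq.send and the consumer callback are placeholders for whichever message broker you use (RocketMQ, Kafka, RabbitMQ, ...), and redis / db are the same assumed helpers as in the pseudocode elsewhere in this article:

// write path: if the cache delete fails, hand the key over to a message queue for compensation
public void writeWithCompensation(String key, String value) {
    db.update(key, value);                      // 1. update the database first
    try {
        redis.del(key);                         // 2. try to delete the cache
    } catch (Exception e) {
        mq.send("cache-delete-retry", key);     // 3. on failure, publish the key for a later retry
    }
}

// consumer side: retry the delete until it succeeds (real code should cap the retry count)
public void onCacheDeleteMessage(String key) {
    try {
        redis.del(key);
    } catch (Exception e) {
        mq.send("cache-delete-retry", key);     // requeue for another attempt
    }
}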

The drawback of this solution is that it is intrusive: the retry logic is deeply coupled into the business code. An optimized variant relies on the fact that every MySQL update is recorded in the binlog, so we can subscribe to the MySQL binlog and operate on the cache from there instead.

To subscribe to the binlog you can use Alibaba's open-source framework canal; see this article for details:

Solving MySQL and Redis consistency with the canal framework

2.  Cache penetration

Cache penetration means querying data that simply does not exist: neither the cache layer nor the storage layer hits, and because nothing is found in the storage layer, nothing is written back to the cache layer either.

As a result, every request for such non-existent data goes all the way to the storage layer, and the cache no longer protects the backend storage. Cache penetration increases the load on the backend storage, which often cannot handle high concurrency and may even go down.

In the application you can separately count the total number of calls, cache-layer hits, and storage-layer hits; a large number of empty hits on the storage layer suggests a cache penetration problem.

2.1  Cause analysis

There are two basic causes of cache penetration:

1. Problems in your own business code or data

For example, suppose the ids in our database all start at 1 and increase, so every valid id is greater than 0. If someone requests id = -1, or a very large id that does not exist, and the parameter is never validated, every such request bypasses Redis and goes straight to the database, which finds nothing. Under high concurrency this can easily bring the database down.

2. Malicious attacks, crawlers, and the like causing a large number of empty hits

2.2  Solution

Cache penetration can be addressed in the following ways:

1. Add validation

For example, authenticate the user and do basic checks on the id; intercept id <= 0 directly.

2. Cache empty objects

Even though the data does not exist in the source, keep an empty object in the cache layer; subsequent accesses to this key are then served from the cache, which protects the backend data source.

But this raises 2 issues:

  • Caching null values means more keys in the cache layer and more memory (and if it is an attack, the problem gets worse). The usual mitigation is to give these entries a short expiration time so they are removed automatically.
  • The cache layer and the storage layer will be inconsistent for a while, which may affect the business. For example, with a 5-minute expiry, if the storage layer actually gains this data during that window, the cache still claims it is empty; a data-consistency scheme like those above can handle this.
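
A sketch of the empty-object approach, reusing the Jedis client and loadFromDb placeholder from the Cache-Aside sketch above; the sentinel string and the 60-second TTL are illustrative choices:

private static final String NULL_PLACEHOLDER = "__NULL__"; // sentinel meaning "known to be absent"

public String getWithNullCaching(String key) {
    String value = jedis.get(key);
    if (NULL_PLACEHOLDER.equals(value)) {
        return null;                              // known miss, do not touch the database
    }
    if (value != null) {
        return value;
    }
    value = loadFromDb(key);                      // assumed DAO call
    if (value == null) {
        jedis.setex(key, 60, NULL_PLACEHOLDER);   // cache the miss with a short expiry
    } else {
        jedis.setex(key, 300, value);
    }
    return value;
}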

3. Bloom filter

Before the cache layer and the storage layer are reached, store the set of existing keys in a Bloom filter ahead of time and use it as a first-level interceptor.

For example: a recommendation system has 4 billion user ids, and every hour the algorithm engineers recompute each user's recommendations from their historical behaviour and write them to the storage layer. Brand-new users have no historical behaviour, so their requests would penetrate the cache. To prevent this, put all user ids that have recommendation data into a Bloom filter. If the Bloom filter says a user id does not exist, the request never reaches the storage layer, which protects it to a large extent.

This method fits scenarios where the hit rate is low, the data set is relatively fixed, and real-time requirements are low (typically large data sets). The code is more complex to maintain, but the cache footprint is small.
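
A sketch of the Bloom-filter pre-check using Guava's com.google.common.hash.BloomFilter; the expected-insertions and false-positive-rate figures are illustrative, and in a multi-node deployment the filter is often kept in Redis (for example via the RedisBloom module) rather than in each JVM:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class RecommendationGuard {
    // sized for 100 million user ids with roughly 1% false positives
    private final BloomFilter<Long> knownUserIds =
            BloomFilter.create(Funnels.longFunnel(), 100_000_000L, 0.01);

    // called whenever recommendation data is (re)built for a user
    public void register(long userId) {
        knownUserIds.put(userId);
    }

    public String getRecommendations(long userId) {
        if (!knownUserIds.mightContain(userId)) {
            return null;                 // definitely has no data: skip the cache and the storage layer
        }
        return readThroughCache(userId); // normal cache -> storage lookup
    }

    private String readThroughCache(long userId) { /* cache-aside lookup */ return null; }
}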

3.  Cache breakdown

Cache breakdown refers to a hotspot key that receives a large amount of concurrent access. At the moment this key expires, the continuing high concurrency punches through the cache and hits the database directly.

To guard against it, set hot data to never expire, or protect the rebuild with a mutex.

3.1  Cause analysis

A key with an expiration time that carries high concurrency is hot data. In the window between this key expiring and the data being loaded back into the cache from MySQL, the flood of requests can overwhelm the database.

Note the difference: a cache avalanche is a large number of keys expiring at once, while cache breakdown is a single hot key expiring.

3.2  Solution

1. Use a mutex

A common industry practice is to use a mutex.

Simply put, when the cache misses (the value is found to be empty), do not load from the DB immediately. Instead, first use an atomic set-if-absent operation of the caching tool (for example Redis SETNX, or Memcache ADD) to set a mutex key. If that operation succeeds, load from the DB and rebuild the cache; otherwise, retry the whole get-from-cache path.

The pseudocode is as follows:

public String get(String key) {
    String value = redis.get(key);
    if (value == null) { // cache miss: the cached value has expired
        // set a 3-minute timeout on the mutex so that if the del below fails,
        // the lock still expires and the next cache miss can load from the DB
        if (redis.setnx(key_mutex, 1, 3 * 60) == 1) {
            // lock acquired: this thread rebuilds the cache
            value = db.get(key);
            redis.set(key, value, expire_secs);
            redis.del(key_mutex);
            return value;
        } else {
            // another thread is already loading from the DB and rebuilding the cache;
            // back off briefly, then retry reading the cache
            sleep(50);
            return get(key); // retry
        }
    } else {
        return value;
    }
}

2. Never expire

"Never expire" here has two meanings:

  • Seen from Redis, no expiration time is actually set on the key, which guarantees there is no hot-key expiry problem. This is physical non-expiration.
  • Functionally, if the key never expires, doesn't the data go stale? So we keep an expiration timestamp alongside the cached value; when a reader finds the value logically expired, a background asynchronous thread rebuilds the cache. This is logical non-expiration.

In practice this approach is very performance-friendly. Its only drawback is that while the cache is being rebuilt, the other threads (those not doing the rebuild) may read slightly stale data, which is tolerable for most Internet-facing features.
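
A sketch of the logical-expiration idea: the value stored in Redis carries its own expireAt timestamp, and a reader that notices the value is logically expired returns the old data while kicking off an asynchronous rebuild. The serialization and rebuild details are assumed placeholders:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LogicalExpireCache {
    private final ExecutorService rebuildPool = Executors.newFixedThreadPool(4);

    // wrapper actually stored in Redis: the real value plus a logical expiry timestamp
    static class CacheValue {
        String data;
        long expireAt; // epoch millis after which the value is considered stale
    }

    public String get(String key) {
        CacheValue cached = readFromRedis(key);         // assumed: GET + deserialize
        if (cached == null) {
            return rebuild(key);                        // first load is synchronous
        }
        if (cached.expireAt < System.currentTimeMillis()) {
            rebuildPool.submit(() -> rebuild(key));     // logically expired: rebuild in the background
        }
        return cached.data;                             // answer immediately, possibly with slightly stale data
    }

    private String rebuild(String key) {
        // assumed: load from the DB, recompute, write back with a fresh expireAt;
        // guard with a mutex so only one thread rebuilds at a time
        return null;
    }

    private CacheValue readFromRedis(String key) { return null; } // assumed helper
}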

4.  Cache avalanche

A cache avalanche happens when a large amount of cached data expires at the same moment while the query volume is huge, which puts enormous pressure on the database and may even bring it down.

4.1  Cause analysis

The cache layer carries a large share of the requests and effectively protects the storage layer. But if the cache layer cannot serve for some reason, for example a large portion of the cached data expires at the same time, then at that moment it is as if Redis were not there: every request reaches the storage layer, the number of calls to it skyrockets, and the storage layer can suffer cascading failures.

4.2  Solution

To prevent and mitigate cache avalanches, we can start from the following aspects:

  • Make the cache layer highly available. Like an airplane with multiple engines, a highly available cache layer keeps serving even if individual nodes, machines, or even an entire data centre go down; Redis Sentinel and Redis Cluster provide this high availability.
  • Set hot keys to never expire.
  • Stagger the expiration times of keys as much as possible (see the jitter sketch below).
  • Use a multi-level caching mechanism, for example Redis plus Memcache: request -> redis -> memcache -> db.
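
To stagger expiries, one common trick is to add random jitter to every TTL when the cache is populated, so keys written in the same batch do not all expire in the same instant. A small sketch (the base TTL and jitter range are arbitrary, jedis as in the earlier sketches):

import java.util.concurrent.ThreadLocalRandom;

// write the value with a base TTL plus up to 5 minutes of random jitter
public void setWithJitter(String key, String value) {
    int baseTtlSeconds = 30 * 60;                                          // 30 minutes
    int jitterSeconds = ThreadLocalRandom.current().nextInt(0, 5 * 60);    // 0..299 seconds
    jedis.setex(key, baseTtlSeconds + jitterSeconds, value);
}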

5. Hot Key

In Redis, a key that is accessed with very high frequency is called a hot key: hot data.

5.1 Causes

Hot-key problems have two main causes:

  • The volume of traffic from users is far larger than expected. Unexpected events in daily life can do this: for example, during Double 11 a heavily discounted popular item may be viewed or bought tens of thousands of times in a short window, producing a traffic spike on one key. Likewise, widely shared breaking news, hot comments, celebrity live streams and other read-heavy, write-light scenarios generate hot keys.
  • Requests concentrate beyond what a single server can handle. Data is sharded across servers, so all accesses to a given key land on the same host; once that traffic exceeds the limit of that server, a hot-key problem appears.

5.2 Harm

Traffic concentrates on one node until the physical network card limit is reached.

Too many requests crash the cache shard that holds the key, the database gets hit directly, and the failure cascades into a business-level avalanche.

As mentioned above, when requests for a hot key exceed the network bandwidth of the host it lives on, the over-concentrated traffic makes the other services on that server unavailable as well. If hot keys are too concentrated and their cached data exceeds the capacity of the shard, the cache shard crashes.

Once the cache service has crashed, the incoming requests fall through to the backing DB. The DB is comparatively weak and is easily penetrated by the large request volume, which further escalates into an avalanche and seriously degrades the whole system.

5.3  Solution

Once hot keys have been identified, they need special handling. There are usually the following 2 schemes:

1. Use a second-level (local) cache

You can use guava-cache or hcache to load hot keys into the JVM as a local cache. Accesses to these keys are then served directly from the local cache without touching Redis at all, which effectively protects the cache servers.
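
A sketch of a local second-level cache built with Guava's CacheBuilder: hot keys are served from the JVM heap and only fall through to Redis on a local miss (the size and TTL are illustrative, and note that the Guava loader must not return null):

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import redis.clients.jedis.Jedis;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

public class HotKeyLocalCache {
    private final Jedis jedis = new Jedis("localhost", 6379);
    private final Cache<String, String> localCache = CacheBuilder.newBuilder()
            .maximumSize(10_000)                    // bound the heap footprint
            .expireAfterWrite(10, TimeUnit.SECONDS) // short TTL keeps local copies reasonably fresh
            .build();

    public String get(String key) throws ExecutionException {
        // serve from the JVM if possible; otherwise load once from Redis and keep it locally
        return localCache.get(key, () -> jedis.get(key));
    }
}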

2. Key scattering

Split the hot key into multiple sub keys stored on different machines of the cache cluster; each sub key holds the same value as the original hot key. When querying the hot key, use a hashing or random scheme to pick one of the sub keys and access that cache node, so the hot traffic is spread across the sub keys.
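
A sketch of key scattering: the hot key is copied to N sub keys with numeric suffixes, reads pick a random suffix, and writes refresh every copy (the copy count and naming scheme are arbitrary choices):

import java.util.concurrent.ThreadLocalRandom;
import redis.clients.jedis.Jedis;

public class ScatteredHotKey {
    private static final int COPIES = 10; // number of sub keys spread over the cluster
    private final Jedis jedis = new Jedis("localhost", 6379);

    // read: pick one copy at random so the load is spread over several slots/nodes
    public String get(String hotKey) {
        int idx = ThreadLocalRandom.current().nextInt(COPIES);
        return jedis.get(hotKey + ":" + idx);
    }

    // write: refresh every copy so all sub keys hold the same value
    public void set(String hotKey, String value, int ttlSeconds) {
        for (int i = 0; i < COPIES; i++) {
            jedis.setex(hotKey + ":" + i, ttlSeconds, value);
        }
    }
}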

In reality, hot-key problems only appear under very high concurrency; an ordinary single-machine system never sees that much traffic, and a two-level cache architecture adds complexity, so whether it is worth it depends on the actual business scenario.

For example, a single Redis node can serve roughly 100k QPS. With Redis 5.0 we usually deploy a cluster of 3 masters and 6 replicas; a hot key maps to one hash slot, i.e. one master plus two replicas, which in theory can serve about 300k QPS. Leaving some buffer, 100k QPS on a single hot key is certainly fine. For higher loads, say millions of accesses per second, a local cache is necessary in addition to scaling out the Redis cluster.

6. Big Key

A bigkey is a key whose value occupies a comparatively large amount of memory. For example, a string value can hold up to 512 MB, and a list can hold up to 2^32 - 1 elements. Broken down by data structure, bigkeys fall into string-type bigkeys and non-string-type bigkeys.

String type: a single value that is too large. A value over 10 KB is commonly considered a bigkey, but the threshold also depends on the QPS involved.

Non-string types: hashes, lists, sets, and sorted sets whose element count is too large.

Bigkeys are unfriendly in terms of both space and time.

6.1 Finding bigkeys

The command redis-cli --bigkeys reports the distribution of bigkeys. In production, however, developers and operators usually prefer to define their own bigkey threshold and, more importantly, to find out exactly which keys are bigkeys, so the problem can be located, solved and optimized. To judge whether a key is a bigkey, run debug object key and look at the serializedlength attribute, which is the number of bytes of the value after serialization.

If there are many keys, scan + debug object will be slow; the Pipeline mechanism can speed it up. For data structures with many elements, debug object itself is slow and may block Redis, so if a replica node is available, consider running it there.

6.2 Harm

The harm of bigkeys shows up in three ways:

  • Uneven memory usage. For example, in a Redis Cluster, the nodes holding bigkeys use much more memory than the others.
  • Blocking and timeouts. Because Redis is single-threaded, operating on a bigkey takes longer, which increases the chance of blocking Redis.
  • Network congestion. Every fetch of a bigkey generates a lot of traffic. Suppose a bigkey is 1 MB and is fetched 1000 times per second: that is 1000 MB of traffic per second, a disaster for an ordinary gigabit NIC (about 128 MB/s in bytes). Moreover, servers are usually deployed with multiple instances per machine, so one bigkey can drag down the other instances as well, with dire consequences.

The existence of a bigkey is not always fatal. If the bigkey exists but is rarely accessed, only the uneven-memory problem remains, which is less important and less urgent than the other two. But if the bigkey is also a hot key, the damage is hard to overstate, so in development and operations we must watch out for bigkeys.

6.3  Solution

The main idea is to split: break the data stored under the big key (the big value) into value1, value2, ... valueN.

For example, if the big value is a large JSON document, use mset to scatter its contents across several keys on different instances, or store it as a hash where each field represents one attribute; then hget / hmget fetch only part of the value, and hset / hmset update only some attributes.

Likewise, if the big value is a huge list, split it into list_1, list_2, ... list_N; the same applies to the other data types.
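
A sketch of splitting one big JSON-like value into a Redis hash, so callers fetch only the fields they need instead of pulling the entire value over the network every time (the key pattern and field names are illustrative, jedis as in the earlier sketches):

import java.util.List;
import java.util.Map;

// store each attribute of the big object as a separate hash field
public void saveUserProfile(String userId, Map<String, String> attributes) {
    String key = "user:profile:" + userId;
    for (Map.Entry<String, String> entry : attributes.entrySet()) {
        jedis.hset(key, entry.getKey(), entry.getValue());
    }
    jedis.expire(key, 3600); // 1-hour expiry for the whole hash
}

// read only the attributes the caller actually needs
public List<String> readBasicInfo(String userId) {
    return jedis.hmget("user:profile:" + userId, "name", "avatar");
}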

7. Redis split brain

Split brain means that, in a master-slave cluster that is supposed to guarantee availability, two master nodes exist at the same time and both can accept write requests.

The main cause is usually a network problem: the Redis master ends up in a different network partition from the slave nodes and the sentinel cluster. Because the sentinels can no longer perceive the original master, they promote a slave to be the new master.

The most direct effect of split brain is that clients do not know which master to write to, so different clients write data to different masters. Worse, split brain can go on to cause data loss.

7.1 Split brain in a sentinel master-slave cluster

Suppose there are three servers: one master, two slaves, plus the sentinel mechanism.

In this environment, a network fluctuation suddenly cuts the master off from the normal network, but the master is in fact still running. The sentinels hold an election and promote one of the slaves to be the new master.

If at that moment App Server1 is still connected to the old master while App Server2 is connected to the new one, their data diverges. Once the sentinels can reach the old master again, they demote it to a slave and make it resynchronize from the new master (a full resynchronization), so the data written to the old master during the split brain is lost.

Solution

Configuring the following parameters mitigates split brain:

min-replicas-to-write 1 
min-replicas-max-lag 5

Together these parameters mean: at least 1 slave must be connected, and the replication lag must not exceed 5 seconds.

The first parameter requires at least 1 slave node: a write is only considered successful once the master has replicated it to at least 1 slave.

The second parameter says the replication lag must not exceed 5 seconds.

With these two parameters configured, if a split brain occurs, the original (isolated) master will reject client writes, which avoids losing large amounts of data.

7.2 Split brain in Redis Cluster

By default, split brain generally does not happen in a Redis Cluster, because the cluster uses a majority-based election mechanism, and when any one of the 16384 slots is not assigned to a node, the whole cluster becomes unavailable.

So when building a Redis Cluster, there should be at least 3 master nodes, and the number of voting nodes in the cluster should be odd.

Outside the defaults, for example when the number of masters is even or when cluster-require-full-coverage is turned off (when it is on, the whole cluster stops serving as soon as a node failure leaves the 16384 slots not fully covered), split brain can still occur. In that case the min-replicas-to-write and min-replicas-max-lag parameters can be used here as well.