当前位置:网站首页>Memory consistency and cache consistency

Memory consistency and cache consistency

2022-06-23 15:49:00 User 4415180

  With CPU Improvement of design technology , In order to speed up program execution, there are many optimization techniques ,1. Assembly line technology , classical 5 Class assembly line ( Fingering , decoding , perform , Visiting and depositing , Write back to ).2. Multi launch technology , One cpu There can be multiple identical assembly line components in the , In this way, multiple instructions can be sent in one cycle , Implement instruction level parallelism .3. Out of order execution technology , To avoid pipeline interruption , Will not be relevant ( Data related , Control related ) The instructions of are put into one block for reordering , This allows unrelated instructions to be executed in parallel , Such as loop unrolling technology , Instruction dynamic scheduling technology , Branch prediction technology to avoid data risk and control risk , Make the assembly line as full as possible .cpu Faster and faster , Nor can deposit access hold back , So there is cache technology ,L1,L2,L3cache.

    These optimization techniques are used in single core cpu There will be no impact on the program , But in multicore cpu Write multithreaded programs without protection , There will be some unexpected results . For example, the following code

    public static int x,y,r1,r2;
	public static int count = 0;
	public static void main(String[] args) throws InterruptedException {
		do{
		Thread t1 = new Thread(()->{
				x = 1;
				r1 = y;
			});
		Thread t2 = new Thread(()->{
				y = 1;
				r2 = x;
			});
			t1.start();
			t2.start();
			t1.join();
			t2.join();
			count++;
			x = y = 0;
		}while(!(r1 == 0 && r2 == 0));
		System.out.printf("r1 = %d, r2 = %d, count = %d",r1,r2,count);
	}

  Two thread execution , If the program is executed in sequence, there are the following cases ,t1 Execute first t2 After execution ,t2 Execute first t1 After execution ,t1,t2 At the same time , be r1,r2 The following combinations will appear ,r1 = 0,r2 = 1;r1 = 1,r2 = 0; r1 = 1, r2 = 1. It's impossible for r1 = 0, r2 = 0 The situation of . But this program will appear , Because load Than store Access memory first , That is to say r1 = y stay x = 1 Before execution , r2 = x stay y = 1 Before execution . Why does this happen ? Let's get to know cpu The memory access instruction will be reordered ,LOAD —— LOAD Reorder ,STORE —— STORE Reorder ,LOAD —— STORE Reorder ,STORE —— LOAD Reorder , These are reordering that occurs without data correlation , That is to say, these four cases are different addresses for access , The same address will not be reordered , such as load x   load x It doesn't reorder , and load y load x May reorder .X86 cpu Will only appear STORE —— LOAD Reorder , Just like the above program , stay X86 Next, there will be r1 and r2 Also for 0.

  Memory order consistency :

      Also called strong consistency model , Sequence consistency means that the memory access sequence is the same as the program execution sequence , There will be no memory sort , This consistency is very programmer friendly , No additional protection code is required , Just like writing a single threaded program ,MIPS R10000 The sequential consistency model is implemented . There is a misunderstanding when we first think about the sequential consistency model , It is assumed that the sequential consistency model will not cause instruction rearrangement , Disorderly execution , Performance will be reduced , It's not like that .cpu Disorderly execution , But the delay technique of hiding sequential consistent memory access by speculation , The sequence of memory access is consistent with the procedure sequence . This is it. cpu Designers use the deferred commit function of the speculative processor to realize the sequence consistency under out of order execution . If the processor reorders the memory requests , The results of the new execution sequence are different from those seen under sequence consistency , The processor will undo this execution , Speculative restart after undo will affect some performance , But the probability of restart is very low , Of course cpu The design complexity is also higher . So out of order execution is not necessarily related to memory sorting , Everything depends on cpu How to design .

  TSO( Full storage sort ):

      TSO allow STORE —— LOAD Memory reordering , That is to say load The operation can be in store Happened before , such as X86 That's the model ,x86 No guessing techniques are used , Read / write instruction rearrangement only supports STORE-LOAD Reorder , But other non read / write instructions can be rearranged . This is the case with the above procedure ,X86 cpu There's a built-in fifo Of write buffer, For example store The operation will first write write buffer If in write buffer Refresh to cache When cache miss( Write miss ) It triggers cache Agreement of conformity , here load The operation will not be caused by store Operation occurs cache miss And jam , But will continue to carry out , therefore load May be accessed in store Before .TSO The model is relative SC The model design should be simple , Also friendly to programmers , Because only store load In one case, reordering will occur , Other situations and SC The model is the same , It's a compromise .

    TSO Model cpu When writing multithreaded synchronous code, you need to add memory barriers in some places , The function of memory barrier is to prevent memory sorting , The mandatory access sequence is the same as the program sequence , This will also cause pipeline congestion , Reduce performance .X86 Provides LFENCE, SFENCE, MFENCE Instructions act as memory barriers , Prevent instructions from crossing the memory barrier and executing first . such as LFENCE Will stop LFENCE The subsequent read instruction crosses LFENCE Execute first ,SFENCE Will stop SFENCE The subsequent write instruction crosses SFENCE Execute first , here TSO Model not defined STORE The operation will be performed first , however X86 Some special write instructions will be executed before other write instructions and reorder will occur , such as MOVNTI, MOVNTQ,MOVNTDQ, MOVNTPS, and  MOVNTPD There are also string operation instructions , There will be reordering , also CFLUSH,CLFLUSHOPT Reordering also occurs , These reorders require the use of SFENCE Instructions act as a barrier .MFENCE It is a universal barrier that is useful for both reading and writing . The above code can be found in x = 1,LFENCE, r1 = y Add a memory barrier between ,y = 1, LFENCE, r1 = y Add a memory barrier between , Corresponding to java Is the addition of variables volatile keyword .

TSO Model diagram

There is also a weak memory consistency model , Also called loose consistency model .

Relaxed Memory Consistency( Loose memory consistency model ):

        Weak consistency model , You can reorder the above four storage access methods , So it will be more difficult to write concurrent programs , Clear all possible consistency situations , stay ARM, IBM POWER, DEC ALPHA Wait for the system program to consider more . Common optimizations for loose consistency models are :

1. Not FIFO, Merge write buffers ,TSO Allow to use FIFO Write buffer , Improve performance by hiding some or all of the latency of committed storage . Even though FIFO Write buffer improves performance , However, a more optimized design will use merged non - FIFO Write buffer ( namely , Two stores that are not contiguous in program order can be written to the same entry in the buffer ,store store You can reorder ). however , Not FIFO Merging write buffers violates TSO, because TSO Require storage to appear in program order .

2. Support cpu Simple speculation ,  In systems with strongly consistent models ,cpu Before you are ready to submit , Execute speculatively out of sequence load. Support SC Of MIPS R10000 This feature is used to improve the performance . But support SC Conjecture of cpu Mechanisms must be included to check whether the speculation is correct , Even if there are few misguesses . R10000 By comparing the address of the cache block to be replaced with cpu Address lists that have been speculatively loaded but not yet submitted ( That is, the contents of the core load queue ) Compare to check . This mechanism increases the cost and complexity of hardware , It consumes extra power , And it may limit instruction level parallelism . But in a system with a loose memory consistency model ,cpu It can be executed in disorder load, There is no need to compare these loaded addresses with the addresses of incoming consistency requests . For loose consistency models , these load Not speculative .

Take an example of the loose consistency model

data1 data2 Initialize to 0,flag It's not equal to set,r2 and r3 Will all be equal to 0 Do you . according to SC and TSO,r2 and r3 It can't be equal to 0 Of , But under the loose consistency model r2 and r3 It may all be 0, Because the loose consistency model supports store store Rearrangement rule S3 Maybe it's better than S1 S2 Execute first , At this time data1 and data2 Also for 0,flag yes set be r2 and r3 All for 0. This obviously does not meet our expectations , So you need to be in S2 and S3 In between FENCE The memory barrier does not allow store store reorder , If not B1 Control dependence , Then it needs to be in L2 Add memory barrier before , Not allow load load reorder .

The execution sequence of this program has the following possibilities

FENCE After S3 You can't go ahead of time FENCE Pre execution , however FENCE Before store store You can reorder ,load load It's similar .

Another possibility is that FENCE The previous and subsequent operations are not reordered .

Comparison of three memory consistency models :

TSO Memory model and SC Memory model comparison :

perform : all SC It's all about execution TSO perform ,TSO Some implementation is not SC perform , therefore SC Execution is TSO A subset of the execution .

A good memory consistency model should have the following characteristics :

1. Programmability : A good model should ( relative ) Easy to write multithreaded programs .  For most users , The model should be intuitive , Even those who haven't read the details .  And it should be accurate .

2. performance : A good model should be at a reasonable power , Cost and other ways to promote high performance . It should provide implementers with a wide range of choices . such as MIPS R10000(SC Model ),x86(TSO Model ),ARM( Loose consistency model ), Their power consumption and performance need to be compromised .

3. Portability : A good model will be widely adopted or at least provide backward compatibility or the ability to adapt between models .

4. accuracy : A good model should be defined mathematically . Natural language is too vague , The user cannot understand .

It can be compared from the above four aspects SC and TSO:

1. Programmability :SC Is the most intuitive . TSO Very close to , Because it is similar to common programming idioms SC equally . However , Others are not SC Execution requires special instructions .SC A slight victory TSO.

2. performance : For simple cpu,TSO It can provide a comparison of SC Better performance , But the difference can be narrowed by speculation (CPU The design is more complicated ),TSO be better than SC.

3. Portability :SC Widely understood , and TSO Widely adopted .SC be better than TSO

4. accuracy :SC,TSO and x86-TSO Are precisely defined . Strike a level .

Loose consistency model :

1. Programmability : For those using SC For people who , Loose model programmability is acceptable . Deep understanding of loose models ( for example , Writing compilers and runtime systems ) It's very difficult , So the loose consistency model is that the programmability is inferior to SC and TSO Of

2. performance : A loose memory model can provide better performance than TSO Better performance , But for many core microarchitectures , Small differences .

3. Portability : The loose consistency model is relative to SC and TSO Many restrictions , Transplantation is relatively difficult .

4. accuracy : Many loose consistency models are informally defined , The accuracy is inferior to TSO and SC.

SC representative cpu:MIPS R10000

TSO representative cpu:x86

Relaxed Memory Models representative cpu:ARM, POWER, dec alpha, among dec alpha Is the weakest , Support Dependent loads reordered, That is, dependent loading will also reorder . Like this program

If you don't add a memory barrier ,p You may take it first y The address of , This may lead to i = 0, This will only happen in alpha Under a , Other cpu Do not support dependent load reordering .

原网站

版权声明
本文为[User 4415180]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/174/202206231502315130.html