Show you how to distinguish several kinds of parallelism
2022-06-22 02:10:00 [Huawei Cloud Developer Alliance]
Abstract: in practice, the main factors affecting parallel speedup are serial computation, parallel computation, and parallel overhead.
This article is shared from the Huawei Cloud community post "High-Performance Computing (2): Tall Buildings Rise from the Ground", by "I'm a big watermelon".
Storage
By physical organization, shared memory and distributed memory are the two basic storage models for parallel computers. In addition, distributed shared memory is an increasingly important storage model.

Instructions and data
- [Fine grain] By the number of instructions and data streams a parallel computer can execute simultaneously, parallel computers are divided into SIMD (Single-Instruction Multiple-Data) and MIMD (Multiple-Instruction Multiple-Data) machines.
- [Coarse grain] By the programs and data executed simultaneously, the notions of SPMD (Single-Program Multiple-Data) and MPMD (Multiple-Program Multiple-Data) parallel computers were proposed.
Based on how instructions and data are executed simultaneously, computer systems fall into four categories:
- Single instruction, single data (SISD)
- Single instruction, multiple data (SIMD)
- Multiple instruction, single data (MISD)
- Multiple instruction, multiple data (MIMD)

SISD
A single-instruction, single-data machine is a "single-CPU machine": it executes instructions on a single data stream, and in SISD the instructions execute sequentially.
On each CPU clock cycle, the CPU proceeds in the following order:
- Fetch: the CPU fetches data and instructions from a memory area (registers)
- Decode: the CPU decodes the instruction
- Execute: the operation is performed on the data, and the result is saved in another register
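The fetch-decode-execute cycle above can be sketched in a few lines of Python. This is a toy simulation, not any real ISA: the two-register machine and the `ADD`/`SUB` opcodes are hypothetical, chosen only to make the three steps visible.

```python
# Minimal sketch of a SISD fetch-decode-execute loop.
# The "ISA" here is hypothetical: named registers and ADD/SUB only.

def run(program, registers):
    """Execute instructions sequentially, one per simulated clock cycle."""
    pc = 0  # program counter
    while pc < len(program):
        instr = program[pc]       # Fetch: read the next instruction
        op, dst, src = instr      # Decode: split into opcode and operands
        if op == "ADD":           # Execute: apply the operation and
            registers[dst] += registers[src]  # store the result in a register
        elif op == "SUB":
            registers[dst] -= registers[src]
        pc += 1                   # strictly sequential: one instruction at a time
    return registers

regs = run([("ADD", "r0", "r1"), ("SUB", "r0", "r1")], {"r0": 5, "r1": 3})
```

Note there is exactly one instruction stream and one data stream: no two instructions are ever in flight at once, which is precisely what SISD means.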

The main elements of this architecture (the von Neumann architecture) are:
- Central memory unit: stores instructions and data
- CPU: fetches instructions/data from the memory unit, decodes the instructions, and executes them sequentially
- I/O system: the input and output streams of the program
Traditional single-processor computers are classic SISD systems. The figure below shows which units the CPU uses during the Fetch, Decode, and Execute steps:

MISD
In this model there are n processors, each with its own control unit, sharing the same memory unit. On every CPU clock cycle, the data fetched from memory is processed by all processors simultaneously, each following the instruction issued by its own control unit. The parallelism here is instruction-level parallelism: multiple instructions operate on the same data. Problem models that can make good use of this architecture are quite special, for example data encryption. MISD therefore sees little practical use and serves mostly as an abstract model.

SIMD
A SIMD computer consists of several independent processors, each with its own local memory for storing data. All processors work under a single instruction stream; specifically, there are n data streams, one per processor. All processors execute each step simultaneously, applying the same instruction to different data.
Many problems can be solved with a SIMD architecture. Another appealing feature is that algorithms for this architecture are easy to design, analyze, and implement. The limitation is that only problems decomposable into many small subproblems (which must be independent and executable by the same instructions in any order) can be solved this way. Many supercomputers were designed with this architecture, for example the Connection Machine (Thinking Machines, 1985) and the MPP (NASA, 1983). As we will see in Chapter 6 on GPU Python programming, advanced modern graphics processors (GPUs) contain many built-in SIMD processing units, so this architecture is widely used today.
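The key SIMD property, one instruction applied in lockstep across n data streams, can be illustrated in plain Python. This is only a conceptual sketch: real SIMD hardware performs the whole element-wise operation as a single instruction, whereas the loop here merely mimics the idea.

```python
# SIMD-style sketch: the SAME operation ("one instruction") is applied
# to every element of the data ("n data streams"). On SIMD hardware this
# would happen in lockstep in a single instruction; plain Python only
# models the semantics, not the parallel execution.

def simd_add(xs, ys):
    """One 'instruction' (addition) applied across n paired data streams."""
    return [x + y for x, y in zip(xs, ys)]

result = simd_add([1, 2, 3, 4], [10, 20, 30, 40])
```

The elements are independent and could be processed in any order by the same instruction, which is exactly the decomposability condition stated above.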
MIMD
In Flynn's taxonomy, this class of computer is the most widely used and also the most powerful. The architecture has n processors, n instruction streams, and n data streams. Each processor has its own control unit and local memory, which makes MIMD computationally more powerful than SIMD. Each processor works under the instruction stream issued by its independent control unit; the processors can therefore run different programs on different data, solving completely different subproblems or even a single large problem. In MIMD, parallelism is realized at the thread or process level, which also means the processors generally work asynchronously. Computers of this type are typically used for problems that lack a uniform structure and are unsuitable for SIMD. Many machines today use this architecture, for example supercomputers and computer networks. However, one issue must be kept in mind: asynchronous algorithms are very difficult to design, analyze, and implement.

Concurrency and parallelism

Types of parallelism

Distinguishing several kinds of parallelism


Programs, threads, processes, and hyper-threading
- Program: an ordered set of instructions. By itself it has no runtime meaning; it is just a static file on the computer's hard disk or other storage, such as a binary executable on Linux or an exe on Windows.
- Process: a resource-management entity maintained by the operating system at run time. A process has its own life cycle and reflects the entire dynamic course of a program running on a particular data set. It must be loaded into memory; double-clicking an exe starts a process.
- Thread: an entity within a process, a basic unit of execution smaller than a process that can run independently, and the basic unit the system schedules and dispatches. A thread owns essentially no system resources of its own, only the few resources essential to running (such as a program counter, a set of registers, and a call stack), but it shares all the resources of its process with the other threads of that process. Multiple threads of the same process can execute concurrently, which improves the utilization of system resources.
- Hyper-threading: hyper-threading uses special hardware support to present one physical core as two logical cores, letting a single CPU perform thread-level parallelism and work with multithreaded operating systems and software. Normally one core corresponds to one hardware thread; with hyper-threading you get, for example, 8 cores with 16 threads.
A perennial topic, the differences and relationship between threads and processes:
- Running a program involves at least one process, and a process contains at least one thread (the main thread).
- Threads are a finer-grained division than processes, so multithreaded programs have higher concurrency.
- A process is the system's independent unit of resource allocation and scheduling, while a thread is the basic unit of CPU dispatch. Multiple threads within the same process share its resources.
- Processes have separate memory units, so processes are independent of one another; multiple threads in the same process share memory. Threads can therefore communicate by reading and writing memory visible to them, whereas communication between processes requires message passing.
- Each thread has an entry point for execution, a sequential execution sequence, and an exit point, but a thread cannot execute alone: it must belong to a process, and the process controls the execution of its threads.
- Processes carry more state than threads, so creating or destroying a process costs much more than creating or destroying a thread. Processes therefore tend to be long-lived, while threads are dynamically spawned and merged as the computation proceeds.
- One thread can create and destroy another. Multiple threads within the same process share all the resources the process owns; at the same time, processes themselves can execute in parallel, further improving the utilization of system resources.
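The shared-memory point above can be demonstrated directly: in Python's standard `threading` module, threads of one process read and write the same objects, so they only need a lock for synchronization, whereas separate processes would have to exchange messages. A minimal sketch:

```python
# Sketch: threads in one process share memory directly, so they can
# communicate by mutating the same object (guarded by a lock).
# Separate processes would instead need message passing (e.g. a queue).
import threading

counter = {"value": 0}        # shared state, visible to every thread
lock = threading.Lock()

def worker(n):
    for _ in range(n):
        with lock:            # synchronize access to the shared counter
            counter["value"] += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                  # the main thread waits for its child threads
# all four threads updated the same dict: counter["value"] == 4000
```

Without the lock, the read-modify-write on the shared counter could interleave and lose updates, which is exactly the synchronization overhead discussed later.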
Thread binding
A computer system consists of one or more physical processors and memory. A running program divides its memory into two parts: a storage area for shared variables and a storage area for each thread's private variables. Thread binding ties a thread to a fixed processor, establishing a one-to-one mapping between threads and processors. Without binding, a thread may run on different processors in different time slices. Since each processor has its own multi-level cache, a thread that bounces between processors will see a low cache hit rate, and program performance suffers. By binding threads, the program achieves higher cache utilization and thus better performance. For how to bind threads in C++, see https://www.cnblogs.com/wenqiang/p/6049978.html
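The article links a C++ approach; as a language-neutral illustration of the same idea, Python's standard library exposes process-level CPU affinity on Linux. This is a sketch under the assumption of a Linux host (`os.sched_setaffinity` does not exist on macOS or Windows, so the code guards for it):

```python
# Sketch of CPU affinity (binding) via the Python standard library.
# os.sched_setaffinity is Linux-only, so we guard with hasattr; on
# platforms without it this sketch simply reports "unsupported" (None).
import os

def pin_to_cpu0():
    """Pin the current process to CPU 0 and return the resulting CPU set."""
    if not hasattr(os, "sched_setaffinity"):
        return None                # e.g. macOS/Windows: no affinity API here
    os.sched_setaffinity(0, {0})   # pid 0 means "the calling process"
    return os.sched_getaffinity(0) # the set of CPUs we may now run on

cpus = pin_to_cpu0()
```

After pinning, the scheduler will keep the process on CPU 0, so its working set stays warm in that core's caches, which is the cache-hit-rate argument made above.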

Parallel algorithm evaluation
In theory, n identical CPUs can provide n times the computing power.
In practice, parallel overhead keeps the total execution time from shrinking linearly. This overhead comes from:
- Thread creation and destruction, inter-thread communication, and synchronization between threads.
- Code that cannot be parallelized, so that part of the computation is completed by a single thread while the other threads sit idle.
- Contention for shared resources.
- Imbalanced workload distribution across CPUs and limited memory bandwidth, leaving one or more threads idle for lack of work or blocked waiting for a specific event.
Parallel speedup (speedup ratio)
Speedup is defined as the execution time of the sequential program divided by the execution time of the parallel program computing the same result:

$S = t_s / t_p$

Here $t_s$ is the serial execution time needed to complete the task on one CPU, and $t_p$ is the time needed to complete the task in parallel on n CPUs. Because both the serial execution time $t_s$ and the parallel execution time $t_p$ on n CPUs can be defined in several ways, five different speedups are distinguished: relative, actual, absolute, asymptotic actual, and asymptotic relative speedup.
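The definition is a single division; in code, with the serial time of 100 s and the four-CPU parallel time of 25 s used here being illustrative numbers, not measurements from the article:

```python
# Speedup as defined above: S = t_s / t_p.
def speedup(t_serial, t_parallel):
    """Serial execution time divided by parallel execution time."""
    return t_serial / t_parallel

s = speedup(100.0, 25.0)  # e.g. a 100 s serial run finishing in 25 s on 4 CPUs
```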
Parallel efficiency (efficiency)
In applications, the main factors affecting parallel speedup are serial computation, parallel computation, and parallel overhead. In general, the speedup is less than the number of CPUs. Occasionally, though, a curious phenomenon appears: the parallel program runs more than n times faster than the serial program. This is called superlinear speedup. It occurs when the data accessed by each CPU fits in its own cache; caches are smaller than main memory but much faster to read and write.
Another major criterion for evaluating a parallel algorithm is parallel efficiency, which represents the average speedup contributed per CPU when multiple CPUs compute in parallel:

$E = S / n$

The ideal parallel efficiency is 1, meaning all CPUs are working at full capacity. Usually the efficiency is less than 1 and decreases as the number of CPUs grows.
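Efficiency just divides the speedup by the CPU count; the 100 s / 25 s / 8-CPU figures below are illustrative, not from the article:

```python
# Parallel efficiency E = S / n: the average speedup contributed per CPU.
def efficiency(t_serial, t_parallel, n_cpus):
    return (t_serial / t_parallel) / n_cpus

e = efficiency(100.0, 25.0, 8)  # speedup 4 on 8 CPUs -> efficiency 0.5
```

An efficiency well below 1, as here, signals that overhead or serial code is leaving CPUs partly idle.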
Scalability
Scalability measures a parallel machine's ability to keep running efficiently; it expresses computing power (execution speed) proportional to the number of processors. If the problem size and the number of processors grow together, performance does not degrade.
Amdahl's law (Amdahl's law)
Amdahl's law is widely used in processor design and parallel algorithm design. It states that the maximum speedup a program can achieve is limited by its serial portion: $S = 1/(1-p)$, where $1-p$ is the serial fraction of the program. For example, if 90% of a program's code is parallel but 10% remains serial, then even with an unlimited number of processors the maximum achievable speedup is still only 10.
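The finite-processor form of Amdahl's law, $S(n) = 1/((1-p) + p/n)$, makes the ceiling concrete: as n grows, $p/n$ vanishes and the limit $1/(1-p)$ remains.

```python
# Amdahl's law with parallel fraction p on n processors:
#   S(n) = 1 / ((1 - p) + p / n)
# As n -> infinity, S approaches 1 / (1 - p): the serial ceiling.
def amdahl(p, n):
    return 1.0 / ((1.0 - p) + p / n)

limit = 1.0 / (1.0 - 0.9)   # 90% parallel code: at most a 10x speedup, ever
s_1000 = amdahl(0.9, 1000)  # even 1000 processors only approach that ceiling
```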
Gustafson's law (Gustafson's law)
Gustafson's law was derived by considering the following situations:
- As the problem size grows, the serial part of the program stays fixed.
- As the number of processors grows, each processor still performs the same amount of work.
Gustafson's law states the speedup as $S(P) = P - \alpha(P-1)$, where $P$ is the number of processors, $S$ the speedup, and $\alpha$ the non-parallelizable fraction. By contrast, Amdahl's law compares single-processor execution time with parallel execution time, so it assumes a fixed problem size: the overall workload does not change with the size of the machine (i.e. the number of processors). Gustafson's law complements Amdahl's law, which does not consider the total resources available for solving the problem; it shows that the time allowed for a parallel solution is best set by taking all computing resources into account and planning on that basis.
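Gustafson's formula is a one-liner; note how, unlike Amdahl's law, the achievable speedup keeps growing with P because the problem is assumed to scale with the machine:

```python
# Gustafson's law: S(P) = P - alpha * (P - 1), where alpha is the
# serial (non-parallelizable) fraction and P the number of processors.
def gustafson(P, alpha):
    return P - alpha * (P - 1)

s = gustafson(10, 0.1)  # 10 processors, 10% serial work -> speedup 9.1
```

With the same 10% serial fraction, Amdahl's fixed-size view caps the speedup at 10 no matter how many processors are added, while Gustafson's scaled-size view yields a speedup that grows roughly linearly in P.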