AVX instruction set accelerated matrix multiplication
2022-07-24 00:33:00 【Virgo programmer's friend】
A brief introduction to AVX
SIMD
SIMD (Single Instruction, Multiple Data) is a technique for exploiting data-level parallelism. One controller drives multiple processing units, performing the same operation on every element of a group of data at the same time. While a SIMD instruction executes, only one instruction stream is running at any moment; that is, SIMD provides no concurrency, only simultaneous computation over data.
On Intel x86 processors, the SIMD instruction-set extensions include MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, and AVX-512.
AVX
AVX is an extension of the SSE architecture. It widens SSE's 128-bit XMM registers into 256-bit YMM registers and extends floating-point operations to 256 bits, doubling computational throughput. In addition, AVX introduces three-operand instructions, which remove the need to copy a source operand before an operation overwrites it.
AVX2 extends most integer operations to 256 bits and is accompanied by FMA (Fused Multiply-Add) support, which improves throughput and reduces rounding error.
AVX-512 further widens the registers and instructions to 512 bits.
AVX instructions
For an introduction to the intrinsics, see: Crunching Numbers with AVX and AVX2 - CodeProject
Matrix multiplication based on AVX
Algorithm idea

Building on the above, matrix C must first be initialized to zero. To fully exploit AVX's single-instruction-multiple-data parallelism, we take the vector as the basic unit of computation: the element in row i, column j of matrix A is multiplied by the row vector j of matrix B, and the resulting vector is accumulated into row i of matrix C. Once every element of row i of A has been multiplied with its corresponding row of B and accumulated into row i of C, that row of C is complete.
Suppose the matrix elements are 32-bit floats. Since AVX stores vectors in 256-bit registers, one vector holds 8 elements. When the number of columns is not a multiple of 8, i.e. the number of elements in a row is not a multiple of 8, the trailing lanes of the last vector of each row should be set to 0. If instead we simply load past the end of the row with unaligned loads, we have to deal with the resulting overlap into adjacent memory.
Code
In the implementation, the element A[i][j] of matrix A is broadcast into an 8-lane vector vecA, while each row vector vecB of matrix B is read directly from the specified memory address. Because the number of columns may not be a multiple of 8 (for a matrix of float elements), the data may be unaligned, so loadu is used to load matrix data from memory.
For the multiply-accumulate itself: the product of two n-bit numbers can occupy up to 2n bits, so when two floats a and b are multiplied, the stored result is the float nearest to a*b, written round(a*b). In this algorithm each product is added into matrix C. An ordinary multiply-accumulate therefore computes round(round(a*b)+c), with two roundings, while the FMA operation computes round(a*b+c) with a single rounding. FMA is thus more accurate.
Here I do not zero the trailing lanes of the last vector, which causes a problem. Suppose matrix B is 7*6: when loadu reads a row of B, the 6 32-bit floats of that row are not enough to fill a 256-bit vector, so the instruction also reads the first two numbers of the following row. After the computation, the results for those two extra lanes are stored past the end of the corresponding row of C, i.e. into the first two elements of the next row. The same spill happens for every row, so from the second row to the last, the first two elements of each row of C are computed twice, and the first of the two computations uses data that does not belong to that row — it is wrong. The fix is simple: before computing each row of C, re-initialize that row's elements to 0, which erases the effect of the first, incorrect pass.
#include <immintrin.h>
#include <stdio.h>
#include <pthread.h>
#include <time.h>

// Compile with: gcc -O2 -mavx2 -mfma -pthread
#define MATRIX_SIZE 8192
#define NUM_THREAD 4

float matA[MATRIX_SIZE][MATRIX_SIZE];
float matB[MATRIX_SIZE][MATRIX_SIZE];
float matC[MATRIX_SIZE][MATRIX_SIZE];

/* Each thread computes a contiguous band of rows of C. The thread index is
   passed through the pthread argument; deriving it from an unsynchronized
   global counter (step++) would be a data race. */
void* multiplicationAVX(void* args) {
    int thread = (int)(long)args;
    __m256 vecA, vecB, vecC;
    for (int i = thread * MATRIX_SIZE / NUM_THREAD;
         i < (thread + 1) * MATRIX_SIZE / NUM_THREAD; i++) {
        // Re-initialize row i of C before accumulating into it.
        for (int j = 0; j < MATRIX_SIZE; j++) {
            matC[i][j] = 0.0f;
        }
        for (int j = 0; j < MATRIX_SIZE; j++) {
            vecA = _mm256_set1_ps(matA[i][j]);  // broadcast A[i][j] to 8 lanes
            for (int k = 0; k < MATRIX_SIZE; k += 8) {
                vecB = _mm256_loadu_ps(&matB[j][k]);  // unaligned 8-float load
                vecC = _mm256_loadu_ps(&matC[i][k]);
                // C[i][k..k+7] += A[i][j] * B[j][k..k+7], single rounding
                vecC = _mm256_fmadd_ps(vecA, vecB, vecC);
                _mm256_storeu_ps(&matC[i][k], vecC);
            }
        }
    }
    return NULL;
}

void* multiplicationNormal(void* args) {
    int thread = (int)(long)args;
    for (int i = thread * MATRIX_SIZE / NUM_THREAD;
         i < (thread + 1) * MATRIX_SIZE / NUM_THREAD; i++) {
        for (int j = 0; j < MATRIX_SIZE; j++) {
            for (int k = 0; k < MATRIX_SIZE; k++) {
                matC[i][j] += matA[i][k] * matB[k][j];
            }
        }
    }
    return NULL;
}

void createMatrix() {
    for (int i = 0; i < MATRIX_SIZE; i++) {
        for (int j = 0; j < MATRIX_SIZE; j++) {
            matA[i][j] = i + j * 2;
            matB[i][j] = i * 2 + j;
            matC[i][j] = 0.0f;
        }
    }
}

void printMatrix() {
    if (MATRIX_SIZE <= 16) {
        printf("Matrix A");
        for (int i = 0; i < MATRIX_SIZE; i++) {
            printf("\n");
            for (int j = 0; j < MATRIX_SIZE; j++) {
                printf("%f ", matA[i][j]);
            }
        }
        printf("\n\n");
        printf("Matrix B");
        for (int i = 0; i < MATRIX_SIZE; i++) {
            printf("\n");
            for (int j = 0; j < MATRIX_SIZE; j++) {
                printf("%f ", matB[i][j]);
            }
        }
        printf("\n\n");
        printf("Multiplying matrix A with B");
        for (int i = 0; i < MATRIX_SIZE; i++) {
            printf("\n");
            for (int j = 0; j < MATRIX_SIZE; j++) {
                printf("%f ", matC[i][j]);
            }
        }
    }
}

int main() {
    pthread_t threads[NUM_THREAD];
    struct timespec start, end;
    createMatrix();
    /* clock() sums CPU time across all threads, so use a monotonic wall
       clock to time the multithreaded run. */
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < NUM_THREAD; i++) {
        // pthread_create(&threads[i], NULL, multiplicationNormal, (void*)i);
        pthread_create(&threads[i], NULL, multiplicationAVX, (void*)i);
    }
    for (int i = 0; i < NUM_THREAD; i++) {
        pthread_join(threads[i], NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    printMatrix();
    printf("\n\nNumber of threads used -> %d\n", NUM_THREAD);
    printf("\nMatrix size -> %d\n", MATRIX_SIZE);
    printf("\nProgram running time (ms) -> %f\n\n",
           (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_nsec - start.tv_nsec) / 1e6);
}
Copyright notice: This article was originally written by iiYK. Reprinting is permitted; please credit the source: AVX instruction set accelerated matrix multiplication – iiYK