AVX instruction set accelerated matrix multiplication
2022-07-24 00:33:00 【Virgo programmer's friend】
A brief introduction to AVX
SIMD
SIMD (Single Instruction, Multiple Data) is a technique for exploiting data-level parallelism. One controller drives multiple processing units, performing the same operation on every element of a group of data at the same time. While a SIMD instruction executes, only one instruction stream is running at any moment; that is, SIMD provides no concurrency, only simultaneous computation over data.
On Intel x86 processors, the SIMD instruction-set extensions include MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, and AVX-512.
AVX
AVX is an extension of the SSE architecture. It widens SSE's 128-bit XMM registers into 256-bit YMM registers and extends floating-point operations to 256 bits, doubling computational throughput. In addition, AVX introduces three-operand instructions, which remove the need to copy a source operand before an operation overwrites it.
AVX2 extends most integer operations to 256 bits and is accompanied by FMA (Fused Multiply-Add) support, which improves throughput and reduces rounding error.
AVX-512 further widens the registers and instructions to 512 bits.
AVX instructions
For an introduction to the intrinsics, see: Crunching Numbers with AVX and AVX2 - CodeProject
Matrix multiplication based on AVX
Algorithm idea

Building on the above, matrix C must first be initialized to zero. To fully exploit AVX's single-instruction-multiple-data parallelism, we take the vector as the basic unit of computation: the element in row i, column j of matrix A is multiplied by the row vector j of matrix B, and the resulting vector is accumulated into row i of matrix C. Once every element of row i of A has been multiplied with its corresponding row of B and accumulated into row i of C, that row of C is complete.
Suppose the matrix elements are 32-bit floats. Since AVX stores vectors in 256-bit registers, one vector holds 8 elements. When the number of columns is not a multiple of 8, i.e. the number of elements in a row is not a multiple of 8, the trailing lanes of the last vector of each row should be set to 0. If instead we simply load past the end of the row with unaligned loads, we have to deal with the resulting overlap into adjacent memory.
Code
In the implementation, the element A[i][j] of matrix A is broadcast into an 8-lane vector vecA, while each row vector vecB of matrix B is read directly from the specified memory address. Because the number of columns may not be a multiple of 8 (for a matrix of float elements), the data may be unaligned, so loadu is used to load matrix data from memory.
For the multiply-accumulate itself: the product of two n-bit numbers can occupy up to 2n bits, so when two floats a and b are multiplied, the stored result is the float nearest to a*b, written round(a*b). In this algorithm each product is added into matrix C. An ordinary multiply-accumulate therefore computes round(round(a*b)+c), with two roundings, while the FMA operation computes round(a*b+c) with a single rounding. FMA is thus more accurate.
Here I do not zero the trailing lanes of the last vector, which causes a problem. Suppose matrix B is 7*6: when loadu reads a row of B, the 6 32-bit floats of that row are not enough to fill a 256-bit vector, so the instruction also reads the first two numbers of the following row. After the computation, the results for those two extra lanes are stored past the end of the corresponding row of C, i.e. into the first two elements of the next row. The same spill happens for every row, so from the second row to the last, the first two elements of each row of C are computed twice, and the first of the two computations uses data that does not belong to that row — it is wrong. The fix is simple: before computing each row of C, re-initialize that row's elements to 0, which erases the effect of the first, incorrect pass.
#include <immintrin.h>
#include <stdio.h>
#include <pthread.h>
#include <time.h>

// Compile with: gcc -O2 -mavx2 -mfma -pthread
#define MATRIX_SIZE 8192
#define NUM_THREAD 4

float matA[MATRIX_SIZE][MATRIX_SIZE];
float matB[MATRIX_SIZE][MATRIX_SIZE];
float matC[MATRIX_SIZE][MATRIX_SIZE];

/* Each thread computes a contiguous band of rows of C. The thread index is
   passed through the pthread argument; deriving it from an unsynchronized
   global counter (step++) would be a data race. */
void* multiplicationAVX(void* args) {
    int thread = (int)(long)args;
    __m256 vecA, vecB, vecC;
    for (int i = thread * MATRIX_SIZE / NUM_THREAD;
         i < (thread + 1) * MATRIX_SIZE / NUM_THREAD; i++) {
        // Re-initialize row i of C before accumulating into it.
        for (int j = 0; j < MATRIX_SIZE; j++) {
            matC[i][j] = 0.0f;
        }
        for (int j = 0; j < MATRIX_SIZE; j++) {
            vecA = _mm256_set1_ps(matA[i][j]);  // broadcast A[i][j] to 8 lanes
            for (int k = 0; k < MATRIX_SIZE; k += 8) {
                vecB = _mm256_loadu_ps(&matB[j][k]);  // unaligned 8-float load
                vecC = _mm256_loadu_ps(&matC[i][k]);
                // C[i][k..k+7] += A[i][j] * B[j][k..k+7], single rounding
                vecC = _mm256_fmadd_ps(vecA, vecB, vecC);
                _mm256_storeu_ps(&matC[i][k], vecC);
            }
        }
    }
    return NULL;
}

void* multiplicationNormal(void* args) {
    int thread = (int)(long)args;
    for (int i = thread * MATRIX_SIZE / NUM_THREAD;
         i < (thread + 1) * MATRIX_SIZE / NUM_THREAD; i++) {
        for (int j = 0; j < MATRIX_SIZE; j++) {
            for (int k = 0; k < MATRIX_SIZE; k++) {
                matC[i][j] += matA[i][k] * matB[k][j];
            }
        }
    }
    return NULL;
}

void createMatrix() {
    for (int i = 0; i < MATRIX_SIZE; i++) {
        for (int j = 0; j < MATRIX_SIZE; j++) {
            matA[i][j] = i + j * 2;
            matB[i][j] = i * 2 + j;
            matC[i][j] = 0.0f;
        }
    }
}

void printMatrix() {
    if (MATRIX_SIZE <= 16) {
        printf("Matrix A");
        for (int i = 0; i < MATRIX_SIZE; i++) {
            printf("\n");
            for (int j = 0; j < MATRIX_SIZE; j++) {
                printf("%f ", matA[i][j]);
            }
        }
        printf("\n\n");
        printf("Matrix B");
        for (int i = 0; i < MATRIX_SIZE; i++) {
            printf("\n");
            for (int j = 0; j < MATRIX_SIZE; j++) {
                printf("%f ", matB[i][j]);
            }
        }
        printf("\n\n");
        printf("Multiplying matrix A with B");
        for (int i = 0; i < MATRIX_SIZE; i++) {
            printf("\n");
            for (int j = 0; j < MATRIX_SIZE; j++) {
                printf("%f ", matC[i][j]);
            }
        }
    }
}

int main() {
    pthread_t threads[NUM_THREAD];
    struct timespec start, end;
    createMatrix();
    /* clock() sums CPU time across all threads, so use a monotonic wall
       clock to time the multithreaded run. */
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < NUM_THREAD; i++) {
        // pthread_create(&threads[i], NULL, multiplicationNormal, (void*)i);
        pthread_create(&threads[i], NULL, multiplicationAVX, (void*)i);
    }
    for (int i = 0; i < NUM_THREAD; i++) {
        pthread_join(threads[i], NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    printMatrix();
    printf("\n\nNumber of threads used -> %d\n", NUM_THREAD);
    printf("\nMatrix size -> %d\n", MATRIX_SIZE);
    printf("\nProgram running time (ms) -> %f\n\n",
           (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_nsec - start.tv_nsec) / 1e6);
}
Copyright notice: This article was originally written by iiYK. Reprinting is permitted; please credit the source: AVX instruction set accelerated matrix multiplication – iiYK