CUDA day 2: GPU cores and SM core components [easy to understand]

2022-07-24 09:09:00 【Full stack programmer webmaster】

Hello everyone, good to see you again. I'm your friend Quan Jun.

1. CUDA memory model

Each thread has its own private local memory. Each thread block has shared memory, which can be accessed by all threads in that block and whose lifetime matches that of the block.

In addition, all threads can access global memory, as well as two read-only memory spaces: constant memory and texture memory.

2. GPU core component: SM (Streaming Multiprocessor)

As with multithreading on a CPU, a kernel actually launches a large number of threads. Without multiple cores to back those threads, true parallelism at the physical level is impossible.

A GPU, however, contains a large number of CUDA cores, and making full use of them is what unlocks the GPU's parallel computing capability.

The core components of an SM include CUDA cores, shared memory, and registers. An SM can execute hundreds of threads concurrently; how many depends on the resources the SM has available.

3. SIMT (Single-Instruction, Multiple-Thread)

The basic execution unit is the warp. A warp contains 32 threads that execute the same instruction at the same time, but each thread has its own instruction address counter and register state, and its own independent execution path.

So although the threads in a warp start executing from the same program address, they may behave differently. When they reach a branch, for example, some threads may enter it while others do not, and the latter can only wait, because the GPU requires all threads in a warp to execute the same instruction in the same cycle. This warp divergence degrades performance.

In short, grids and thread blocks are only a logical partitioning: at the physical level, the threads of a kernel do not necessarily all run concurrently. Different grid and block configurations for a kernel therefore give different performance. In addition, because the basic execution unit of an SM is a warp of 32 threads, the block size is usually set to a multiple of 32.

4. A first CUDA example and its CMake configuration

#include <iostream>
#include <cuda_runtime.h>

// Print the main properties of the selected device.
void printDeviceProp(const cudaDeviceProp& devProp, int dev)
{
    std::cout << "Using GPU device " << dev << ": " << devProp.name << std::endl;
    std::cout << "Number of SMs: " << devProp.multiProcessorCount << std::endl;
    std::cout << "Shared memory per thread block: " << devProp.sharedMemPerBlock / 1024.0 << " KB" << std::endl;
    std::cout << "Maximum threads per thread block: " << devProp.maxThreadsPerBlock << std::endl;
    std::cout << "Maximum threads per SM: " << devProp.maxThreadsPerMultiProcessor << std::endl;
    std::cout << "Maximum warps per SM: " << devProp.maxThreadsPerMultiProcessor / devProp.warpSize << std::endl;
}

// Select the first device with compute capability 1.0 or higher.
// Returns true on success and leaves that device's properties in devProp.
bool initCUDA(cudaDeviceProp& devProp)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::cout << "No CUDA device found." << std::endl;
        return false;
    }

    int i;
    for (i = 0; i < count; i++)
    {
        if (cudaGetDeviceProperties(&devProp, i) == cudaSuccess && devProp.major >= 1)
        {
            printDeviceProp(devProp, i);
            break;
        }
    }

    if (i == count) {
        std::cout << "No device supports CUDA!" << std::endl;
        return false;
    }

    // Make the chosen device current; report success only if that works.
    return cudaSetDevice(i) == cudaSuccess;
}

int main()
{
    cudaDeviceProp devProp;

    if (initCUDA(devProp))
    {
        std::cout << "CUDA initialized successfully." << std::endl;
    }
    return 0;
}

CMakeLists.txt configuration

cmake_minimum_required(VERSION 3.1)
project(CUDA_Toturials)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} --std=c++11")

#set the default path for built executables to the "bin" directory
set(CMAKE_BUILD_TYPE Debug)

set(EXECUTABLE_OUTPUT_PATH ${PROJECT_SOURCE_DIR}/bin)
SET( LIBRARY_OUTPUT_PATH ${PROJECT_SOURCE_DIR}/lib)
LINK_DIRECTORIES( ${PROJECT_SOURCE_DIR}/lib)
INCLUDE_DIRECTORIES( ${PROJECT_SOURCE_DIR}/include )

# openMp for parallel
# find_package(OpenMP)
# if(OPENMP_FOUND)
# set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
# set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
# endif()

find_package(CUDA 8.0 REQUIRED)
include_directories(${CUDA_INCLUDE_DIRS})

# Set the CUDA compile flags (FindCUDA expects a semicolon-separated list, so pass them unquoted)
set(CUDA_NVCC_FLAGS -g -G)

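# build option: target architectures
# Note: GENCODE is only defined here. To apply it, append it to CUDA_NVCC_FLAGS,
# keeping only the architectures that the installed CUDA toolkit still supports
# (old targets such as sm_10 have been dropped from newer releases).
set(GENCODE -gencode=arch=compute_35,code=sm_35)
set(GENCODE ${GENCODE} -gencode=arch=compute_30,code=sm_30)
set(GENCODE ${GENCODE} -gencode=arch=compute_20,code=sm_20)
set(GENCODE ${GENCODE} -gencode=arch=compute_10,code=sm_10)
set(GENCODE ${GENCODE} -gencode=arch=compute_61,code=sm_61)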

# Generate the executable
cuda_add_executable(main src/main.cpp)

# add_subdirectory(src)

Publisher: Full stack programmer webmaster. When reprinting, please cite the source: https://javaforall.cn/126025.html Link to the original text: https://javaforall.cn

Copyright notice
This article was created by [Full stack programmer webmaster]. Please include a link to the original when reprinting. Thank you.
https://yzsam.com/2022/204/202207221702193134.html