CUDA day 2: GPU cores and SM core components [easy to understand]
2022-07-24 09:09:00 【Full stack programmer webmaster】
Hello everyone, good to see you again. I'm your friend, Quan Jun.
1. CUDA memory model
Each thread has its own private local memory. Each thread block has shared memory, which is visible to all threads in that block and has the same lifetime as the block.
In addition, all threads can access global memory, as well as two read-only memory spaces: constant memory and texture memory.
2. GPU core component: the SM (Streaming Multiprocessor)
As with multithreading on a CPU, a kernel actually launches a large number of threads. Without multiple cores to back them, those threads cannot run in parallel at the physical level.
A GPU has a large number of CUDA cores, and making full use of them is what unlocks the GPU's parallel computing power.
The core components of an SM include CUDA cores, shared memory, and registers. An SM can execute hundreds of threads concurrently, and the degree of concurrency depends on the resources the SM has.
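To make "concurrency depends on the SM's resources" concrete, here is a minimal host-side C++ sketch that estimates how many thread blocks can be resident on one SM. The struct SmLimits and the function residentBlocksPerSm are hypothetical names introduced for illustration, and the model is deliberately simplified: it considers only thread slots and shared memory, and ignores register usage and the hardware cap on blocks per SM.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical per-SM limits. On a real device, comparable values are reported
// in cudaDeviceProp (for example maxThreadsPerMultiProcessor).
struct SmLimits {
    int maxThreadsPerSm;         // thread slots available on one SM
    std::size_t sharedMemPerSm;  // bytes of shared memory on one SM
};

// Estimate how many thread blocks can be resident on one SM at the same time.
// Simplification: only thread slots and shared memory are considered.
int residentBlocksPerSm(const SmLimits& sm, int threadsPerBlock,
                        std::size_t sharedMemPerBlock) {
    if (threadsPerBlock <= 0) {
        return 0;
    }
    // Limit imposed by the number of thread slots.
    int byThreads = sm.maxThreadsPerSm / threadsPerBlock;
    if (sharedMemPerBlock == 0) {
        return byThreads;  // shared memory is not a constraint
    }
    // Limit imposed by the shared memory each block requests.
    int byShared = static_cast<int>(sm.sharedMemPerSm / sharedMemPerBlock);
    return std::min(byThreads, byShared);
}
```

With the assumed limits of 2048 threads and 64 KB of shared memory per SM, blocks of 256 threads that each request 16 KB of shared memory are limited to 4 resident blocks by shared memory, even though the thread slots alone would allow 8.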
3. SIMT (Single-Instruction, Multiple-Thread)
The basic execution unit is the warp, which contains 32 threads. These threads execute the same instruction at the same time, but each thread has its own instruction address counter and register state, and therefore its own independent execution path.
So although the threads in a warp start executing from the same program address, they may behave differently. For example, when the code reaches a branch, some threads may take the branch while others do not, and the threads that do not take it can only wait, because the GPU requires all threads in a warp to execute the same instruction in the same cycle. This warp divergence degrades performance.
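The cost of divergence can be illustrated with a small CPU-side model in C++. This is not GPU code, and passesForBranch is a hypothetical helper: it only counts how many serialized passes a 32-thread warp would need for a single if/else under the simple assumption that each side of the branch taken by at least one thread costs one pass. Real hardware scheduling is more involved.

```cpp
#include <array>

constexpr int kWarpSize = 32;  // threads per warp, as described above

// Count the serialized passes a warp needs for one if/else statement.
// takesBranch[i] says whether lane i takes the "if" side.
// If every lane agrees, one pass is enough; if the lanes disagree, the warp
// runs the "if" side and the "else" side one after the other (two passes),
// and the lanes not on the current side sit idle.
int passesForBranch(const std::array<bool, kWarpSize>& takesBranch) {
    bool anyTaken = false;
    bool anyNotTaken = false;
    for (bool taken : takesBranch) {
        if (taken) {
            anyTaken = true;
        } else {
            anyNotTaken = true;
        }
    }
    return (anyTaken ? 1 : 0) + (anyNotTaken ? 1 : 0);
}
```

Under this model, a condition that is uniform across the warp costs one pass, while a condition such as "lane index is even" splits the warp and costs two.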
In short, grids and thread blocks are only a logical partitioning; at the physical level, the threads of a kernel do not necessarily all run concurrently. Different grid and block configurations of a kernel therefore give different performance. In addition, because the SM's basic execution unit is a warp of 32 threads, the block size is usually set to a multiple of 32.
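The arithmetic behind the "multiple of 32" rule can be sketched in plain C++. The helper names below (roundUpToWarpMultiple, warpsPerBlock, blocksNeeded) are hypothetical and exist only for this illustration; they assume positive inputs.

```cpp
constexpr int kWarpSize = 32;  // threads per warp, as described above

// Round a requested block size up to the next multiple of the warp size.
int roundUpToWarpMultiple(int threads) {
    return (threads + kWarpSize - 1) / kWarpSize * kWarpSize;
}

// Number of warps a block of the given size occupies (a partial warp still
// occupies a whole warp).
int warpsPerBlock(int blockSize) {
    return (blockSize + kWarpSize - 1) / kWarpSize;
}

// Number of blocks needed to cover n elements with one thread per element.
int blocksNeeded(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}
```

For example, a block of 100 threads still occupies 4 warps, so 28 lanes of the last warp do no work; rounding the block size up to 128 keeps every lane usable. Covering 1000 elements with blocks of 256 threads takes 4 blocks.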
4. A first CUDA example and its CMake configuration
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
// Print the key properties of one CUDA device.
void printDeviceProp(cudaDeviceProp& devProp, int dev)
{
    std::cout << "Using GPU device " << dev << ": " << devProp.name << std::endl;
    std::cout << "Number of SMs: " << devProp.multiProcessorCount << std::endl;
    std::cout << "Shared memory per thread block: " << devProp.sharedMemPerBlock / 1024.0 << " KB" << std::endl;
    std::cout << "Maximum threads per thread block: " << devProp.maxThreadsPerBlock << std::endl;
    std::cout << "Maximum threads per SM: " << devProp.maxThreadsPerMultiProcessor << std::endl;
    // A warp holds 32 threads, so dividing by 32 gives the warp count.
    std::cout << "Maximum warps per SM: " << devProp.maxThreadsPerMultiProcessor / 32 << std::endl;
}
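// Select the first usable CUDA device and fill devProp with its properties.
// Returns true on success and false if no suitable device is found.
bool initCUDA(cudaDeviceProp& devProp)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        return false;
    }
    int i;
    for (i = 0; i < count; i++)
    {
        if (cudaGetDeviceProperties(&devProp, i) == cudaSuccess)
        {
            // Accept any device with compute capability 1.0 or higher.
            if (devProp.major >= 1)
            {
                printDeviceProp(devProp, i);
                break;
            }
        }
    }
    // The loop ran to the end without finding a usable device.
    if (i == count) {
        std::cout << "No device supporting CUDA was found!" << std::endl;
        return false;
    }
    cudaSetDevice(i);
    return true;
}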
int main()
{
    cudaDeviceProp devProp;
    if (initCUDA(devProp))
    {
        std::cout << "CUDA initialized successfully.\n" << std::endl;
    }
    //CHECK(cudaGetDeviceProperties(&devProp, dev));
    return 0;
}
- CMakeLists.txt configuration
cmake_minimum_required(VERSION 3.1)
project(CUDA_Toturials)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} --std=c++11")
set(CMAKE_BUILD_TYPE Debug)
# Set the default path for built executables to the "bin" directory
set(EXECUTABLE_OUTPUT_PATH ${PROJECT_SOURCE_DIR}/bin)
set(LIBRARY_OUTPUT_PATH ${PROJECT_SOURCE_DIR}/lib)
link_directories(${PROJECT_SOURCE_DIR}/lib)
include_directories(${PROJECT_SOURCE_DIR}/include)
# OpenMP for parallelism
# find_package(OpenMP)
# if(OPENMP_FOUND)
# set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
# set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
# endif()
find_package(CUDA 8.0 REQUIRED)
include_directories(${CUDA_INCLUDE_DIRS})
# Set up the CUDA compile configuration (-g and -G enable host and device debug info)
set(CUDA_NVCC_FLAGS "-g -G")
# Build options: target GPU architectures.
# Note: GENCODE is only defined here; it takes effect only if it is added to CUDA_NVCC_FLAGS,
# and recent CUDA toolkits no longer accept the oldest architectures (e.g. sm_10, sm_20).
set(GENCODE -gencode=arch=compute_35,code=sm_35)
set(GENCODE ${GENCODE} -gencode=arch=compute_30,code=sm_30)
set(GENCODE ${GENCODE} -gencode=arch=compute_20,code=sm_20)
set(GENCODE ${GENCODE} -gencode=arch=compute_10,code=sm_10)
set(GENCODE ${GENCODE} -gencode=arch=compute_61,code=sm_61)
# Generate the executable
cuda_add_executable(main src/main.cpp)
# add_subdirectory(src)
Publisher: Full stack programmer webmaster. Please credit the source when reprinting: https://javaforall.cn/126025.html Original link: https://javaforall.cn