Neon optimization 1: how to optimize software performance and reduce power consumption?
2022-06-27 05:26:00 【To know】
Background
Mobile and embedded products increasingly ship cutting-edge technology and often include complex algorithm models. When an algorithm's overhead is too high, the result is poor real-time performance, high power consumption, and similar problems, so on-device performance optimization is required.
How to reduce the time overhead of the algorithm code without changing the algorithm's results has become a problem that many engineers have to face.
Performance optimization basics: MCPS and MIPS
Before optimizing, settle on a concrete performance metric, namely MIPS or MCPS.
- MIPS: million instructions per second, the number of instructions (in millions) the program executes per second
- MCPS: million cycles per second, the number of cycles (in millions) the program consumes per second
The difference between MIPS and MCPS
- MIPS counts instructions, so software simulation and on-hardware measurement differ little across platforms.
- MCPS counts cycles. Because of hardware optimizations, MCPS can differ between platforms and can even be smaller than MIPS.
In software simulation, MIPS is generally no larger than MCPS, because the simulation tool RVDS models a CPI (cycles per instruction) of at least 1. On-hardware measurement yields the MCPS figure directly, and a good CPU can reach a CPI below 1, that is, several instructions per cycle. For details, see: link.
With a single-execution-unit processor, the best CPI attainable is 1. However, with a multiple-execution-unit processor, one may achieve even better CPI values (CPI < 1).
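The relationship between the two metrics can be made concrete with a small calculation: CPI is the cycle count divided by the instruction count, which equals MCPS divided by MIPS when both are taken over the same interval. The following is a minimal sketch; the function name `cpi_from_rates` and the numbers in the usage note are illustrative assumptions, not measurements from this article.

```c
#include <assert.h>

/* Cycles per instruction from two rates measured over the same interval.
 * Hypothetical helper, for illustration only. */
double cpi_from_rates(double mcps, double mips) {
    assert(mips > 0.0);   /* guard against division by zero */
    return mcps / mips;
}
```

For example, `cpi_from_rates(120.0, 100.0)` is about 1.2, the case where MCPS exceeds MIPS, while `cpi_from_rates(80.0, 100.0)` is about 0.8, the multiple-execution-unit case in which MCPS is smaller than MIPS.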
MIPS calculation demo
Directory structure:
- src
- main.c
- void test(int* arr, int len);
- vpu.h
- vpu.s
- main.c
Measurement code:
#include <stdio.h>

#define MIPS_COUNT_ARM_CORTEX
#define COUNT_NUM 1000 // number of measured iterations

#ifdef MIPS_COUNT_ARM_CORTEX
#include "v7_pmu.h"

#define MILLION_UNIT (1000000.f)
#define KILO_UNIT (1000.f)
#define FRAME_LEN_MS (10.f) // 10 ms per frame

unsigned int counter0;
unsigned int cycle_count1;
unsigned int cycle_count2;
unsigned int cur_time = 0;   // cycles spent in the current iteration
double cur_time_tmp = 0.0;   // current iteration converted to the reported unit
double avg_time = 0;
double avg_time_tmp = 0;     // running sum of the per-iteration values
unsigned int peak_time = 0;  // largest per-iteration cycle count seen so far
// cycles per frame -> millions per second: (1 / 10^6) / (frame length in seconds)
float cycle2mips_coef = (1 / MILLION_UNIT) / (FRAME_LEN_MS / KILO_UNIT); // unit: mips
#endif

int main(void) {
    int cnt = COUNT_NUM; // set manually
    while (cnt--) {
#ifdef MIPS_COUNT_ARM_CORTEX
        enable_pmu();               // Enable the PMU
        reset_ccnt();               // Reset the CCNT (cycle counter)
        reset_pmn();                // Reset the configurable counters
        pmn_config(0, 0x03);        // Configure counter 0 to count event code 0x03
        enable_ccnt();              // Enable CCNT
        enable_pmn(0);              // Enable counter 0
        counter0 = read_pmn(0);     // Read counter 0
        cycle_count1 = read_ccnt(); // Read the core cycle count before the test
#endif
        // test();
#ifdef MIPS_COUNT_ARM_CORTEX
        cycle_count2 = read_ccnt(); // Read the core cycle count after the test
        cur_time = cycle_count2 - cycle_count1;
        // CCNT counts core cycles, so the converted figure is cycle-based;
        // the demo reports it under the "mips" label.
        cur_time_tmp = (double)cur_time * cycle2mips_coef;
        avg_time_tmp += cur_time_tmp;
        if (cur_time > peak_time) {
            peak_time = cur_time;
        }
        printf("%.2f mips \n", cur_time_tmp);
#endif
    }
#ifdef MIPS_COUNT_ARM_CORTEX
    avg_time = avg_time_tmp / COUNT_NUM;
    printf("max %.2f mips \n", (float)peak_time * cycle2mips_coef);
    printf("avg %.2f mips \n", avg_time);
#endif
    return 0;
}
The measurement calls are usually placed immediately before and after the function under test, such as test(), which gives the MIPS cost of that function on its own. The cost can also be estimated by multiplying the overhead of the whole program by the share attributed to that function, but the calculation is less convenient and is not recommended here.
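The before-and-after placement can be wrapped in a small helper so that each function under test is measured the same way. This is a minimal sketch under stated assumptions: `measure_cycles`, `cycles_to_mcps`, and the `fake_*` stand-ins are hypothetical names introduced here, and the counter reader is assumed to take no arguments and return an `unsigned int`, as `read_ccnt()` appears to in the demo.

```c
#include <assert.h>

/* Hypothetical types: a cycle-counter reader and a function under test. */
typedef unsigned int (*cycle_reader_fn)(void);
typedef void (*workload_fn)(void);

/* Convert the cycles spent in one frame to millions of cycles per second. */
double cycles_to_mcps(unsigned int cycles, double frame_len_ms) {
    assert(frame_len_ms > 0.0);
    return ((double)cycles / 1e6) / (frame_len_ms / 1e3);
}

/* Read the counter immediately before and after the workload.
 * Unsigned subtraction stays correct across one counter wraparound. */
unsigned int measure_cycles(cycle_reader_fn read_cycles, workload_fn work) {
    unsigned int start = read_cycles();
    work();
    unsigned int end = read_cycles();
    return end - start;
}

/* Host-side stand-ins (assumptions) so the sketch runs without a PMU:
 * the fake workload simply advances a fake counter by 250000 cycles. */
static unsigned int fake_ticks = 0;
unsigned int fake_read_cycles(void) { return fake_ticks; }
void fake_work(void) { fake_ticks += 250000u; }
```

On the target, the real counter reader would be passed in place of `fake_read_cycles` and the function under test in place of `fake_work`. With the stand-ins, `measure_cycles(fake_read_cycles, fake_work)` returns 250000 cycles, which `cycles_to_mcps` converts to 25 million cycles per second for a 10 ms frame.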
Test tools and process
Tools required
Software simulation usually uses ARM's RVDS (RealView Development Suite), which simulates various cores and produces the overhead data.
On-hardware testing usually uses simpleperf, the profiling tool that ships with the Android platform: push the executable to the phone, run it, and sample the CPU in real time to obtain the actual overhead data, which can be drawn as what is commonly called a flame graph.
Software simulation and hardware measurement workflow
- Software simulation workflow
  - Install the RVDS software
  - Configure the project environment for the code
  - Get the code running
  - Write the overhead measurement code
  - Profile in the simulator
  - Identify the hotspot functions and the overhead baseline
  - Optimize the code
  - Measure the hotspot function overhead
- Hardware measurement workflow
  - Similar to the software simulation workflow
  - Simulate in software first, then measure on hardware
  - Software simulation cannot model real operation for overheads such as I/O reads and writes, so the hardware result takes precedence
Once the hotspot functions are known, you can optimize the relevant code and its use of the instruction set.
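The hotspot and baseline steps of the workflow come down to two small calculations: picking the most expensive entry from a profile, and comparing its cost before and after optimization. The sketch below assumes the profile is already available as an array of per-function cycle counts; the `profile_entry` type and both function names are hypothetical, and the figures in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical profile record: one function and its measured cycle count. */
typedef struct {
    const char *name;
    unsigned long cycles;
} profile_entry;

/* Return the index of the entry with the largest cycle count. */
size_t find_hotspot(const profile_entry *entries, size_t n) {
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (entries[i].cycles > entries[best].cycles) {
            best = i;
        }
    }
    return best;
}

/* Percentage reduction of the optimized cost relative to the baseline. */
double reduction_percent(unsigned long baseline, unsigned long optimized) {
    assert(baseline > 0);
    return 100.0 * ((double)baseline - (double)optimized) / (double)baseline;
}
```

Given entries with 1200, 5400, and 300 cycles, `find_hotspot` returns index 1. If optimization brings that function from 5400 to 4050 cycles, `reduction_percent(5400, 4050)` reports a 25% reduction against the baseline.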
Summary
This article covered the background and basic concepts of performance optimization, the MCPS and MIPS metrics used to quantify time cost, and the test tools and workflows for software simulation and hardware measurement. The next article shares NEON optimization cases and experience.