NEON Optimization 1: How to Optimize Software Performance and Reduce Power Consumption?
2022-06-27 05:26:00 【To know】
Background
Products for mobile terminals and embedded devices increasingly use cutting-edge techniques and often ship with complex algorithm models. Because these algorithms are expensive to run, they tend to suffer from poor real-time performance and high power consumption, so performance optimization on the device side is required.
How to reduce the running time of the algorithm code without changing the algorithm's results has become a problem that many engineers have to face.
The basis of performance optimization: MCPS and MIPS
First, before optimizing, settle on a concrete performance metric, namely MIPS or MCPS.
- MIPS: million instructions per second, the number of instructions (in millions) the program executes per second
- MCPS: million cycles per second, the number of CPU cycles (in millions) the program consumes per second
The difference between MIPS and MCPS
- MIPS counts instructions, so it differs little between software simulation and on-device (hardware) measurement, or between platforms.
- MCPS counts cycles. Because of hardware optimizations, MCPS can differ from platform to platform and can even be smaller than MIPS.
In software simulation, MIPS is generally no larger than MCPS, because the simulation tool RVDS models a CPI (cycles per instruction) of at least 1. On-device measurement yields the MCPS figure directly. On real hardware, a good CPU can achieve a CPI below 1, that is, more than one instruction per cycle. For details, see: link.
With a single-execution-unit processor, the best CPI attainable is 1. However, with a multiple-execution-unit processor, one may achieve even better CPI values (CPI < 1).
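The relationship between the two metrics can be written as CPI = MCPS / MIPS. The following is a minimal sketch in C of that arithmetic, assuming the counts are taken per frame of known length. The function names `to_millions_per_second` and `cpi` are illustrative and do not come from the original article.

```c
#include <assert.h>

/* Convert a per-frame count (cycles or instructions) into millions per second.
   frame_ms is the amount of real time one frame represents, in milliseconds. */
double to_millions_per_second(double count_per_frame, double frame_ms) {
    assert(frame_ms > 0.0);
    return (count_per_frame / 1e6) / (frame_ms / 1e3);
}

/* CPI = cycles / instructions, which is the same ratio as MCPS / MIPS. */
double cpi(double cycles, double instructions) {
    assert(instructions > 0.0);
    return cycles / instructions;
}
```

For example, 50,000 cycles and 40,000 instructions per 10 ms frame correspond to 5 MCPS, 4 MIPS, and a CPI of 1.25. If a multiple-issue core retired the same 40,000 instructions in 30,000 cycles, MCPS would drop to 3 and CPI to 0.75, which is the case where MCPS is smaller than MIPS.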
MIPS Calculation DEMO
Directory structure :
- src
- main.c
- void test(int* arr, int len);
- vpu.h
- vpu.s
- main.c
Measurement code:
#include <stdio.h>

#define MIPS_COUNT_ARM_CORTEX

#ifdef MIPS_COUNT_ARM_CORTEX
#include "v7_pmu.h"

#define MILLION_UNIT (1000000.f)
#define KILO_UNIT    (1000.f)
#define FRAME_LEN_MS (10.f)  // one iteration processes a 10 ms frame
#define COUNT_NUM    1000    // number of measured iterations

unsigned int counter0;
unsigned int cycle_count1;
unsigned int cycle_count2;
unsigned int cur_time = 0;   // cycles spent in the current iteration
double cur_time_tmp = 0.0;   // current iteration, converted by cycle2mips_coef
double avg_time = 0.0;
double avg_time_tmp = 0.0;   // running sum used for the average
unsigned int peak_time = 0;  // worst-case cycles over all iterations
// Cycles per frame -> millions per second: divide by 10^6, then by the frame length in seconds.
// The input is the cycle counter (CCNT), so strictly speaking this is a cycle-based (MCPS) figure.
float cycle2mips_coef = (1 / MILLION_UNIT) / (FRAME_LEN_MS / KILO_UNIT);
#endif

int main(void) {
    int cnt = COUNT_NUM;  // set manually
    while (cnt--) {
#ifdef MIPS_COUNT_ARM_CORTEX
        enable_pmu();                // Enable the PMU
        reset_ccnt();                // Reset the CCNT (cycle counter)
        reset_pmn();                 // Reset the configurable counters
        pmn_config(0, 0x03);         // Configure counter 0 to count event code 0x03
        enable_ccnt();               // Enable CCNT
        enable_pmn(0);               // Enable counter 0
        counter0 = read_pmn(0);      // Read counter 0
        cycle_count1 = read_ccnt();  // Read the core cycle count before the code under test
#endif
        // test();  // the code under test goes here
#ifdef MIPS_COUNT_ARM_CORTEX
        cycle_count2 = read_ccnt();  // Read the core cycle count after the code under test
        cur_time = cycle_count2 - cycle_count1;
        cur_time_tmp = (double)cur_time * cycle2mips_coef;
        avg_time_tmp += cur_time_tmp;  // accumulate without truncating to an integer
        if (cur_time > peak_time) {
            peak_time = cur_time;
        }
        printf("%.2f mips \n", cur_time_tmp);
#endif
    }
#ifdef MIPS_COUNT_ARM_CORTEX
    avg_time = avg_time_tmp / COUNT_NUM;
    printf("max %.2f mips \n", (double)peak_time * cycle2mips_coef);
    printf("avg %.2f mips \n", avg_time);
#endif
    return 0;
}
Place the measurement code directly around the function whose cost you want to know, for example immediately before and after test(), to obtain that function's MIPS on its own. The same figure can also be derived by multiplying the cost of the whole program by the function's share of it, but that calculation is less convenient and is not recommended here.
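The before-and-after pattern can be factored into a small helper so that any function can be measured the same way. The following is a hedged sketch rather than the author's code: `cycle_stats`, `measure`, and `average_cycles` are hypothetical names, and the cycle counter is passed in as a function pointer so that on target it could wrap read_ccnt(), while `fake_read_cycles` and `fake_work` are host-side stand-ins that only simulate a counter.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t (*cycle_reader_fn)(void);  /* hypothetical: on target, wraps read_ccnt() */
typedef void (*work_fn)(void *ctx);         /* the function under test */

typedef struct {
    uint32_t peak_cycles;   /* worst single run */
    uint64_t total_cycles;  /* sum over all runs */
    uint32_t runs;
} cycle_stats;

/* Run work(ctx) once and record how many cycles it took. */
void measure(cycle_reader_fn read_cycles, work_fn work, void *ctx, cycle_stats *stats) {
    assert(read_cycles && work && stats);
    uint32_t start = read_cycles();
    work(ctx);
    /* Unsigned subtraction stays correct across a single counter wraparound. */
    uint32_t elapsed = read_cycles() - start;
    stats->total_cycles += elapsed;
    stats->runs++;
    if (elapsed > stats->peak_cycles) {
        stats->peak_cycles = elapsed;
    }
}

double average_cycles(const cycle_stats *stats) {
    assert(stats && stats->runs > 0);
    return (double)stats->total_cycles / stats->runs;
}

/* Host-side stand-ins (assumptions, for trying the helper off target):
   the "work" simply advances a simulated counter by *ctx cycles. */
uint32_t fake_cycles = 0;
uint32_t fake_read_cycles(void) { return fake_cycles; }
void fake_work(void *ctx) { fake_cycles += *(uint32_t *)ctx; }
```

Keeping the peak alongside the average mirrors the demo above, since real-time budgets are usually set by the worst frame rather than the typical one.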
Test tools and processes
The tools needed
Software simulation usually uses ARM's RVDS (RealView Development Suite), which simulates various cores and reports the cost data.
On-device testing usually uses simpleperf, the profiling tool provided for the Android platform: push the executable to the phone, run it, and sample the CPU as it runs to obtain the actual cost data, which can be drawn as a chart commonly known as a flame graph.
Software simulation and on-device optimization process
- Software simulation process
  - Install RVDS
  - Configure the project environment for the code
  - Get the code to build and run
  - Write the cost measurement code
  - Run the simulation profile
  - Identify the hotspot functions and record the baseline cost
  - Optimize the code
  - Measure the cost of the hotspot functions again
- On-device process
  - Similar to the software simulation process
  - It is recommended to simulate in software first, then test on the device
  - Where costs such as I/O reads and writes are involved, software simulation cannot reproduce actual operation, so the on-device result takes precedence
Once the hotspot functions are known, you can optimize the relevant code and instruction usage.
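Picking the hotspot from a profile and estimating its cost from its share of the total can be sketched as follows. This is an illustration under stated assumptions, not the output format of RVDS or simpleperf: the `profile_entry` struct, the function names, and the idea of feeding in each function's share as a fraction are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *name;  /* function name as reported by the profiler */
    double share;      /* fraction of the total cost, 0.0 to 1.0 */
} profile_entry;

/* Estimate one function's cost as its share of the whole program's MCPS. */
double estimated_mcps(double total_mcps, double share) {
    assert(share >= 0.0 && share <= 1.0);
    return total_mcps * share;
}

/* Index of the entry with the largest share: the first optimization candidate. */
size_t hottest_index(const profile_entry *entries, size_t n) {
    assert(entries && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (entries[i].share > entries[best].share) {
            best = i;
        }
    }
    return best;
}
```

For a program that costs 200 MCPS in total, a function holding a 25% share is estimated at 50 MCPS. This is the share-based estimate the article mentions as the less convenient alternative to measuring the function directly.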
Summary
This article covered the background and basic concepts of performance optimization, the cost metrics MCPS and MIPS, and the test tools and processes for software simulation and on-device testing. The next article will share NEON optimization cases and experience.