
The new Tokio scheduler: a 10x performance improvement


The scheduler optimizations center on the following areas:

  • A new std::future task system
  • A better queue algorithm
  • Optimized message-passing patterns
  • An improved work-stealing algorithm
  • Less cross-thread synchronization
  • Fewer memory allocations
  • Fewer atomic reference-count operations

1 How the scheduler works

As soon as multiple threads are involved, the CPU's cache-coherence machinery comes into play, so cross-thread synchronization should be avoided as much as possible.

  • One queue, many processors
    • Advantages: simple to implement; tasks are scheduled fairly.
    • Drawbacks: every processor contends on the head of the single queue. Rust's asynchronous tasks are expected to run for very little time, so the overhead of contending on the queue becomes significant relative to the time spent actually executing a task.
  • Many processors, each with its own run queue
    • Uses many single-threaded schedulers; each processor has its own task queue, which in principle avoids the synchronization problem entirely.
    • In Rust's task model, however, any thread can submit a task to any queue, so some thread safety is still required:
      • either each processor's queue supports thread-safe insertion,
      • or each processor keeps two queues: a synchronized one and an unsynchronized one.
        • Advantages: synchronization is almost entirely avoided, so performance is higher.
        • Drawbacks: processors can end up with severely unbalanced loads.
  • Work-stealing scheduler
    • Each processor has its own task queue. When a processor runs out of work, it tries to steal tasks from a peer processor, and goes to sleep if the steal fails (a minimal sketch follows this list).
      • Advantages: avoids most synchronization overhead; avoids load imbalance.
      • Drawbacks: complex. The task queues must support stealing, which requires some cross-processor synchronization; if this is not implemented carefully, the cost of stealing can outweigh its benefit.
  • Summary
    • Minimize synchronization.
    • Work stealing is the algorithm of choice for general-purpose schedulers.
    • Processors operate largely independently of each other, but the steal operation does require some synchronization.
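To make the work-stealing idea concrete, here is a minimal, single-file sketch using only the Rust standard library. It is not Tokio's implementation (Tokio uses a custom fixed-size, mostly lock-free queue); the Worker type, the steal_from method, and the steal-half policy are simplifications for illustration.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

type Task = Box<dyn FnOnce() + Send>;

// One run queue per worker. A Mutex keeps the sketch simple; Tokio's real
// queue is a fixed-size, mostly lock-free structure.
struct Worker {
    local: Mutex<VecDeque<Task>>,
}

impl Worker {
    fn new() -> Arc<Self> {
        Arc::new(Worker { local: Mutex::new(VecDeque::new()) })
    }

    fn push(&self, task: Task) {
        self.local.lock().unwrap().push_back(task);
    }

    fn pop(&self) -> Option<Task> {
        self.local.lock().unwrap().pop_front()
    }

    // When our queue is empty, try to steal about half of a peer's tasks.
    fn steal_from(&self, peer: &Worker) -> Option<Task> {
        let mut theirs = peer.local.lock().unwrap();
        let n = (theirs.len() + 1) / 2; // steal roughly half, at least one
        if n == 0 {
            return None;
        }
        let mut ours = self.local.lock().unwrap();
        for _ in 0..n {
            if let Some(task) = theirs.pop_back() {
                ours.push_back(task);
            }
        }
        ours.pop_front()
    }
}

fn main() {
    let a = Worker::new();
    let b = Worker::new();
    for i in 0..4 {
        a.push(Box::new(move || println!("task {i}")));
    }
    // Worker `b` is idle, so it steals from `a` and runs whatever it gets.
    while let Some(task) = b.pop().or_else(|| b.steal_from(&a)) {
        task();
    }
}
```

In a real scheduler the stealer and the victim run on different threads, so the victim's queue must be safe to access concurrently; that is the synchronization cost the summary above refers to.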

2 The new Tokio scheduler

  • A new task system
    • std now ships the new std::future task system, which is lighter and more flexible than its predecessor (a minimal future using it is sketched after this list).
    • The Waker struct is smaller, which reduces copying overhead and lets more hot data fit in a cache line.
  • Better task queues
    • Each processor's local queue has a fixed size. When it fills up, tasks overflow into a global, multi-producer, multi-consumer queue. Processors still have to check the global queue, but far less often than the local one (see the run-queue sketch after this list).
      • Advantage: avoids the cost of growing the local queue; growing a double-ended queue is a relatively heavy operation.
      • Detail: the local queue is a fixed-size, single-producer, multi-consumer queue, which needs very little synchronization to work correctly.
  • Optimized message passing
    • When a task becomes runnable, it is stored in a "next task" slot instead of being pushed onto the back of the run queue. The processor checks this slot before checking the run queue (the run-queue sketch after this list includes the slot check).
      • Advantage: in message-passing scenarios, the message's receiver is scheduled immediately and is very likely to still hit the CPU cache.
  • Task stealing
    • When a processor's run queue is empty, it tries to steal tasks from a randomly chosen peer; if that peer has nothing to steal, it tries the next one.
      • Drawback: many processors tend to drain their run queues at roughly the same time, so they all attempt to steal at once and contend with each other. Randomizing which peer is tried first reduces the contention, but it is still significant.
      • Improvement: bound the number of processors performing steal operations concurrently. A processor that is trying to steal is in the "searching" state, and the concurrency is bounded with an atomic counter: a processor increments the counter before it starts searching and decrements it when it leaves the searching state (see the throttling sketch after this list).
  • Less cross-thread synchronization
    • The other key part of a work-stealing scheduler is notifying peers: a processor notifies a peer when it observes new tasks, and if the notified peer is asleep, it wakes up and steals work.
      • Drawback: too many notifications cause a thundering-herd problem.
      • Improvement: a processor only sends a notification when no processor is in the searching state. When a processor in the searching state finds new tasks, it first leaves the searching state and then notifies one more processor; processors already in the searching state receive no notifications. The notified processor steals half of the batch and then notifies another processor; that processor wakes up, finds tasks on the first two processors, steals half of them in turn, and so on, so the load is balanced quickly.
  • Fewer memory allocations
    • Only one memory allocation is made per task.
  • Fewer atomic reference counts
    • Each waker holds a reference count on the task handle. Waking a task calls clone on the task reference (an atomic increment) and pushes the reference onto the run queue; when the processor finishes running the task, it drops the reference (an atomic decrement). These atomic operations are cheap individually, but they add up.
      • Improvement: provide a wake method that takes ownership of the reference instead of cloning it (see the Waker example after this list). The scheduler then has to maintain a list of all outstanding tasks.
        • Difficulty: ensuring that the scheduler never removes a task from its list before the task has finished.
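The sketches referenced above follow. First, the new task system: a minimal hand-written future built on the std::future / std::task types, showing where the Waker (a small two-word handle: a data pointer plus a vtable pointer) gets cloned, stored, and used. The Flag and Ready names are invented for the example, and tokio (with the macros and rt features) is assumed as a dependency only to provide an executor.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};
use std::thread;
use std::time::Duration;

// Shared state between the future and the thread that completes it.
struct Flag {
    done: bool,
    waker: Option<Waker>,
}

struct Ready(Arc<Mutex<Flag>>);

impl Future for Ready {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut flag = self.0.lock().unwrap();
        if flag.done {
            Poll::Ready(())
        } else {
            // Store the (cheap-to-clone) Waker so the completing thread can
            // tell the executor to poll this task again.
            flag.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

#[tokio::main]
async fn main() {
    let flag = Arc::new(Mutex::new(Flag { done: false, waker: None }));
    let background = flag.clone();

    thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        let mut flag = background.lock().unwrap();
        flag.done = true;
        if let Some(waker) = flag.waker.take() {
            // wake() consumes the Waker (wake_by_ref would leave it in place).
            waker.wake();
        }
    });

    Ready(flag).await;
    println!("woken");
}
```

The executor only ever interacts with the task through this Waker, which is why making it smaller pays off on every wakeup.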
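Next, the run-queue layout: a fixed-size local queue that overflows into a shared global queue, plus the "next task" slot that is checked before the local queue. A rough sketch, not Tokio's code; Processor, schedule, next_task, and the capacity of 256 are illustrative choices, and the real local queue is also stealable, which this sketch omits.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

type Task = u64; // stand-in for a real task handle

const LOCAL_CAPACITY: usize = 256; // the local queue never grows past this

struct Processor {
    // "Next task" slot: checked before the local run queue, so that e.g. the
    // receiver of a just-sent message runs while its data is still hot in cache.
    next_task: Option<Task>,
    // Fixed-size local run queue, owned by this processor.
    local: VecDeque<Task>,
    // Global multi-producer, multi-consumer overflow queue, shared by all
    // processors and checked far less often than the local queue.
    global: Arc<Mutex<VecDeque<Task>>>,
}

impl Processor {
    fn schedule(&mut self, task: Task) {
        // A task made runnable on this processor goes into the slot; whatever
        // was already there is pushed onto the local queue.
        if let Some(displaced) = self.next_task.replace(task) {
            self.push_local(displaced);
        }
    }

    fn push_local(&mut self, task: Task) {
        if self.local.len() < LOCAL_CAPACITY {
            self.local.push_back(task);
        } else {
            // Local queue is full: overflow into the global queue.
            self.global.lock().unwrap().push_back(task);
        }
    }

    fn next(&mut self) -> Option<Task> {
        // Slot first, then the local queue, then (occasionally) the global queue.
        self.next_task
            .take()
            .or_else(|| self.local.pop_front())
            .or_else(|| self.global.lock().unwrap().pop_front())
    }
}

fn main() {
    let global = Arc::new(Mutex::new(VecDeque::new()));
    let mut p = Processor { next_task: None, local: VecDeque::new(), global };
    p.schedule(1);
    p.schedule(2); // displaces task 1 into the local queue
    assert_eq!(p.next(), Some(2)); // the slot wins over the queue
    assert_eq!(p.next(), Some(1));
    assert_eq!(p.next(), None);
}
```

Because the local queue can never grow, the overflow path replaces the deque-resizing cost mentioned in the list above.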
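Then, throttling the number of concurrently searching (stealing) processors with an atomic counter, which is also what keeps peer notifications rare. A hedged sketch: MAX_SEARCHING, the fixed limit of 2, and the function names are invented; Tokio ties the limit to the number of workers and does more bookkeeping around notification.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Number of workers currently in the "searching" (stealing) state.
static SEARCHING: AtomicUsize = AtomicUsize::new(0);

// Upper bound on concurrent stealers; a fixed illustrative value here.
const MAX_SEARCHING: usize = 2;

/// Try to enter the searching state. Returns false if the limit is reached,
/// in which case the worker should go to sleep instead of stealing.
fn start_searching() -> bool {
    let mut current = SEARCHING.load(Ordering::Relaxed);
    loop {
        if current >= MAX_SEARCHING {
            return false;
        }
        match SEARCHING.compare_exchange(
            current,
            current + 1,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => return true,
            Err(actual) => current = actual,
        }
    }
}

/// Leave the searching state (called whether or not the steal succeeded).
/// Returns true if this was the last searching worker; in that case the
/// caller notifies a peer about any new work it found, which is what keeps
/// notifications, and thus thundering herds, to a minimum.
fn finish_searching() -> bool {
    SEARCHING.fetch_sub(1, Ordering::AcqRel) == 1
}

fn main() {
    assert!(start_searching());
    assert!(start_searching());
    assert!(!start_searching()); // limit reached: this worker sleeps instead
    assert!(!finish_searching());
    assert!(finish_searching()); // last searcher out
}
```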
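Finally, the reference-counting point maps onto the std Waker API: wake(self) consumes the handle and can hand its reference straight to the scheduler, while wake_by_ref(&self) leaves the handle in place and typically forces an extra clone (an atomic increment/decrement pair). The Task type below is a stand-in, not Tokio's task structure; the std::task::Wake trait is used only to make the two paths visible.

```rust
use std::sync::Arc;
use std::task::{Wake, Waker};

// A stand-in for a scheduled task; a real scheduler would hold the future
// and a pointer back to the run queue here.
struct Task;

impl Wake for Task {
    // Called by Waker::wake(): the Arc is passed by value, so waking hands
    // the reference straight to the scheduler with no extra count churn.
    fn wake(self: Arc<Self>) {
        println!("wake: took ownership, refcount = {}", Arc::strong_count(&self));
    }

    // Called by Waker::wake_by_ref(): the caller keeps its reference, so the
    // scheduler must clone (an extra atomic increment) to hold on to the task.
    fn wake_by_ref(self: &Arc<Self>) {
        let _owned = self.clone(); // extra increment, decremented again on drop
        println!("wake_by_ref: cloned, refcount = {}", Arc::strong_count(self));
    }
}

fn main() {
    let task = Arc::new(Task);

    let waker: Waker = Waker::from(task.clone());
    waker.wake_by_ref(); // the Waker survives; the count is bumped and dropped
    waker.wake();        // consumes the Waker and its reference in one move

    println!("remaining refcount = {}", Arc::strong_count(&task));
}
```

This is the distinction the improvement above relies on: a wakeup that already owns a reference can move it into the run queue without touching the count.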

3 Fearless concurrency with Loom

Loom is a tool for testing concurrent code. Loom runs each test case many times, permuting the behaviors the code could encounter in a multi-threaded environment, and verifies that memory accesses, allocations, and frees are correct.
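For reference, this is roughly what a Loom test looks like (assuming the loom crate as a dependency; the version and test body are illustrative). loom::model reruns the closure under the possible interleavings of the threads spawned inside it, using Loom's own versions of Arc, atomics, and thread.

```rust
// Cargo.toml (dev-dependency; the version is illustrative):
// loom = "0.7"

#[cfg(test)]
mod tests {
    use loom::sync::atomic::{AtomicUsize, Ordering};
    use loom::sync::Arc;
    use loom::thread;

    #[test]
    fn concurrent_increments_are_not_lost() {
        // loom::model enumerates the relevant interleavings of the threads
        // spawned inside the closure and runs the body once per interleaving.
        loom::model(|| {
            let counter = Arc::new(AtomicUsize::new(0));

            let handles: Vec<_> = (0..2)
                .map(|_| {
                    let counter = counter.clone();
                    thread::spawn(move || {
                        counter.fetch_add(1, Ordering::SeqCst);
                    })
                })
                .collect();

            for handle in handles {
                handle.join().unwrap();
            }

            // If this assertion can fail under *any* interleaving,
            // Loom reports the failing execution.
            assert_eq!(counter.load(Ordering::SeqCst), 2);
        });
    }
}
```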

References

https://tokio.rs/blog/2019-10-scheduler
