Livestream Recap | Architecture of Koordinator, a Cloud-Native Colocation System (Full Slides Included)
2022-06-23 18:33:00 【InfoQ】
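Introduction to Colocation Technology and Its Evolution
- A resource priority and quality-of-service (QoS) model designed for colocation scenarios
- A stable and reliable resource overcommitment mechanism
- Fine-grained container resource orchestration and isolation
- Enhanced scheduling capabilities for multiple types of workloads
- Fast onboarding for complex types of workloads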
Introduction to Koordinator

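- Koord-Manager
  - SLO-Controller: provides the core control capabilities, including resource overcommitment, colocation SLO management, and fine-grained scheduling enhancements.
  - Recommender: provides elasticity capabilities for applications based on resource profiling.
  - Colocation Profile Webhook: simplifies use of the Koordinator colocation model by giving applications one-step onboarding and automatically injecting the relevant priority and QoS configuration.
- Koord extensions for Scheduler: scheduling enhancements for colocation scenarios.
- Koord descheduler: a flexible, extensible descheduling mechanism.
- Koord Runtime Proxy: acts as a proxy between the Kubelet and the runtime to meet resource management needs in different scenarios; it provides a pluggable registration framework and a mechanism for injecting resource parameters.
- Koordlet: guarantees Pod QoS on each node, providing fine-grained container metric collection as well as interference detection and adjustment strategies; it also supports a set of Runtime Proxy plugins for fine-grained injection of isolation parameters.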




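- Typical scenarios:
  - Prod + LS: typical online applications, which usually have strict latency requirements, need high resource quality, and also need a degree of resource elasticity.
  - Batch + BE: low-priority offline workloads in colocation scenarios, which tolerate reduced resource quality fairly well, for example batch Spark/MR jobs and AI training jobs.
- Enhancements of the typical scenarios:
  - Prod + LSR/LSE: more sensitive online applications that accept giving up resource elasticity in exchange for better determinism (for example, CPU binding) and have extremely strict latency requirements.
  - Mid/Free + BE: differs from "Batch + BE" mainly in how much resource quality is required.
- Atypical scenarios:
  - Mid/Batch/Free + LS: for low-priority online services, near-line computing, and AI inference tasks. Compared with big-data jobs, these tasks cannot accept very low resource quality and interfere less with other applications; compared with typical online services, they can tolerate somewhat lower resource quality, for example a certain degree of eviction.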
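The combinations listed above can be written down as a small lookup table. This is only an illustrative sketch: the `SCENARIOS` table and the `classify` helper are hypothetical names, not part of Koordinator, and any pair not listed above is simply reported as "unlisted".

```python
# Hypothetical lookup table for the priority/QoS combinations described above.
# The group names ("typical", "enhanced", "atypical") are labels chosen for
# this sketch only.
SCENARIOS = {
    "typical": {("Prod", "LS"), ("Batch", "BE")},
    "enhanced": {("Prod", "LSR"), ("Prod", "LSE"), ("Mid", "BE"), ("Free", "BE")},
    "atypical": {("Mid", "LS"), ("Batch", "LS"), ("Free", "LS")},
}


def classify(priority: str, qos: str) -> str:
    """Return the scenario group for a (priority, QoS) pair, or "unlisted"."""
    for group, combos in SCENARIOS.items():
        if (priority, qos) in combos:
            return group
    return "unlisted"


if __name__ == "__main__":
    for pair in [("Prod", "LS"), ("Prod", "LSR"), ("Free", "LS"), ("Prod", "BE")]:
        print(pair, "->", classify(*pair))
```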
Quick Start

# Spark Driver Pod example
apiVersion: v1
kind: Pod
metadata:
  labels:
    koordinator.sh/qosClass: BE
  ...
spec:
  containers:
  - args:
    - driver
    ...
    resources:
      limits:
        koordinator.sh/batch-cpu: "1000"
        koordinator.sh/batch-memory: 3456Mi
      requests:
        koordinator.sh/batch-cpu: "1000"
        koordinator.sh/batch-memory: 3456Mi
  ...
Key Technologies
Resource Overcommitment

# node info
allocatable:
  koordinator.sh/batch-cpu: 50k # milli-core
  koordinator.sh/batch-memory: 50Gi
# pod info
annotations:
  koordinator.sh/resource-limit: '{cpu: "5k"}'
resources:
  requests:
    koordinator.sh/batch-cpu: 5k # milli-core
    koordinator.sh/batch-memory: 5Gi
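To make the relationship between the node's batch allocatable and a pod's batch requests concrete, here is a minimal sketch of a fit check. It is not Koordinator's scheduler logic: `parse_quantity` and `fits` are hypothetical helpers, and the parser only understands the few suffixes used in this article rather than the full Kubernetes quantity grammar.

```python
# Minimal sketch: does a pod's batch request still fit into a node's batch
# allocatable? Helper names and the suffix table are assumptions made for
# illustration only.
SUFFIXES = {"k": 10**3, "Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_quantity(text: str) -> int:
    """Parse a quantity such as "5k" or "5Gi" into a plain integer."""
    # Try longer suffixes first so "Ki" is never mistaken for "k".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * SUFFIXES[suffix]
    return int(text)


def fits(allocatable: dict, already_requested: dict, pod_requests: dict) -> bool:
    """Return True if every resource the pod requests still fits on the node."""
    for name, quantity in pod_requests.items():
        free = parse_quantity(allocatable.get(name, "0")) - parse_quantity(
            already_requested.get(name, "0")
        )
        if parse_quantity(quantity) > free:
            return False
    return True


if __name__ == "__main__":
    node = {"koordinator.sh/batch-cpu": "50k", "koordinator.sh/batch-memory": "50Gi"}
    pod = {"koordinator.sh/batch-cpu": "5k", "koordinator.sh/batch-memory": "5Gi"}
    print(fits(node, {}, pod))
    print(fits(node, {"koordinator.sh/batch-cpu": "48k"}, pod))
```

With the values from the example above, the pod fits on an empty node, but not once 48k milli-cores of batch CPU have already been requested.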
Load-Aware Scheduling


Application Onboarding Management - ClusterColocationProfile
apiVersion: config.koordinator.sh/v1alpha1
kind: ClusterColocationProfile
metadata:
  name: colocation-profile-example
spec:
  namespaceSelector:
    matchLabels:
      koordinator.sh/enable-colocation: "true"
  selector:
    matchLabels:
      sparkoperator.k8s.io/launched-by-spark-operator: "true"
  qosClass: BE
  priorityClassName: koord-batch
  koordinatorPriority: 1000
  schedulerName: koord-scheduler
  labels:
    koordinator.sh/mutated: "true"
  annotations:
    koordinator.sh/intercepted: "true"
  patch:
    spec:
      terminationGracePeriodSeconds: 30
$ kubectl apply -f profile.yaml
$ kubectl label ns spark-job koordinator.sh/enable-colocation=true
$ # submit the Spark job; the Pods created by SparkOperator are co-located with other LS Pods.
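The way such a profile selects and mutates Pods can be sketched as follows. This is a simplified illustration, not the webhook's actual implementation: `match_labels` and `apply_profile` are hypothetical helpers, only `matchLabels` selectors are handled, and `koordinatorPriority` is left out because the article does not say where it is written on the Pod.

```python
import copy


def match_labels(selector: dict, labels: dict) -> bool:
    """True if every key/value in selector["matchLabels"] appears in labels."""
    wanted = selector.get("matchLabels", {})
    return all(labels.get(key) == value for key, value in wanted.items())


def apply_profile(profile_spec: dict, ns_labels: dict, pod: dict) -> dict:
    """Return a mutated copy of pod if the profile matches, else a plain copy."""
    pod = copy.deepcopy(pod)
    pod_labels = pod.get("metadata", {}).get("labels", {})
    if not match_labels(profile_spec.get("namespaceSelector", {}), ns_labels):
        return pod
    if not match_labels(profile_spec.get("selector", {}), pod_labels):
        return pod

    meta = pod.setdefault("metadata", {})
    labels = meta.setdefault("labels", {})
    labels.update(profile_spec.get("labels", {}))
    if "qosClass" in profile_spec:
        # Same label key as in the Spark Driver Pod example.
        labels["koordinator.sh/qosClass"] = profile_spec["qosClass"]
    meta.setdefault("annotations", {}).update(profile_spec.get("annotations", {}))

    spec = pod.setdefault("spec", {})
    for field in ("priorityClassName", "schedulerName"):
        if field in profile_spec:
            spec[field] = profile_spec[field]
    # Shallow merge of the patch; a real patch merge would be recursive.
    spec.update(profile_spec.get("patch", {}).get("spec", {}))
    return pod


if __name__ == "__main__":
    profile = {
        "namespaceSelector": {"matchLabels": {"koordinator.sh/enable-colocation": "true"}},
        "selector": {"matchLabels": {"sparkoperator.k8s.io/launched-by-spark-operator": "true"}},
        "qosClass": "BE",
        "priorityClassName": "koord-batch",
        "schedulerName": "koord-scheduler",
        "labels": {"koordinator.sh/mutated": "true"},
        "annotations": {"koordinator.sh/intercepted": "true"},
        "patch": {"spec": {"terminationGracePeriodSeconds": 30}},
    }
    ns = {"koordinator.sh/enable-colocation": "true"}
    pod = {"metadata": {"labels": {"sparkoperator.k8s.io/launched-by-spark-operator": "true"}}}
    print(apply_profile(profile, ns, pod))
```

Returning a copy rather than mutating in place keeps the sketch easy to test: a Pod in a namespace without the label comes back unchanged.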
QoS Enhancement - CPU Suppress

QoS Enhancement - Eviction Based on Resource Satisfaction

QoS Enhancement - CPU Burst



QoS Enhancement - Group Identity

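QoS Enhancement - Memory QoS
- Container memory limit: when a container's own memory usage (including page cache) approaches its limit, the kernel's memory reclaim subsystem is triggered, which degrades the performance of memory allocation and release for the application in the container.
- Node memory limit: when container memory is overcommitted (Memory Limit > Request) and the node runs short of memory, the kernel's global memory reclaim is triggered. This has a large performance impact and, in extreme cases, can cause the whole node to malfunction.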

Roadmap
Fine-grained CPU Orchestration


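- SameCore policy: better isolation, but little room for elasticity.
- Spread policy: moderate isolation, which can be improved with other isolation policies; used properly, it can outperform the SameCore policy; it leaves some room for elasticity.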
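One way to picture the difference between the two policies is as two orderings over a node's logical CPUs. The sketch below is an assumption about what the policy names mean, not Koordinator's implementation: "SameCore" is read as filling whole physical cores, "Spread" as taking one logical CPU per physical core first. The `cpu_to_core` mapping and all function names are hypothetical.

```python
from collections import defaultdict


def group_by_core(cpu_to_core: dict) -> dict:
    """Group logical CPU ids by the physical core they belong to."""
    cores = defaultdict(list)
    for cpu, core in sorted(cpu_to_core.items()):
        cores[core].append(cpu)
    return cores


def pick_same_core(cpu_to_core: dict, count: int) -> list:
    """Fill whole physical cores first, so sibling threads stay together."""
    picked = []
    for _, cpus in sorted(group_by_core(cpu_to_core).items()):
        for cpu in cpus:
            if len(picked) == count:
                return picked
            picked.append(cpu)
    return picked


def pick_spread(cpu_to_core: dict, count: int) -> list:
    """Take one logical CPU per physical core before reusing any core."""
    cores = [cpus for _, cpus in sorted(group_by_core(cpu_to_core).items())]
    picked = []
    depth = 0
    while len(picked) < count and any(depth < len(cpus) for cpus in cores):
        for cpus in cores:
            if depth < len(cpus) and len(picked) < count:
                picked.append(cpus[depth])
        depth += 1
    return picked


if __name__ == "__main__":
    # Hypothetical node: 4 physical cores with 2 hyper-threads each,
    # where logical CPUs n and n+4 share a core.
    topology = {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}
    print(pick_same_core(topology, 4))
    print(pick_spread(topology, 4))
```

Under this reading, a 4-CPU request occupies two full cores with SameCore, but touches all four cores with Spread, leaving each core's sibling thread available to other workloads.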

Resource Reservation
kind: Reservation
metadata:
  name: my-reservation
  namespace: default
spec:
  template: ... # a copy of the Pod's spec
  resourceOwners:
    controller:
      apiVersion: apps/v1
      kind: Deployment
      name: deployment-5b8df84dd
  timeToLiveInSeconds: 300 # 300 seconds
  nodeName: node-1
status:
  phase: Available
  ...
Fine-grained GPU Scheduling

Resource Recommendation

Community Building


- If you find a typo, try to fix it!
- If you find a bug, try to fix it!
- If you find some redundant code, try to remove it!
- If you find some test cases missing, try to add them!
- If you could enhance a feature, please DO NOT hesitate!
- If you find code implicit, try to add comments to make it clear!
- If you find code ugly, try to refactor that!
- If you can help to improve documents, it could not be better!
- If you find a document incorrect, just go ahead and fix it!
- ...

