Livestream Recap | The Architecture of Koordinator, a Cloud-Native Colocation System (Full Slides Included)
2022-06-23 18:33:00 【InfoQ】
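Introduction to Colocation and Its Development
- A resource priority and quality-of-service (QoS) model designed for colocation scenarios
- A stable and reliable resource overcommitment mechanism
- Fine-grained container resource orchestration and isolation mechanisms
- Enhanced scheduling capabilities for multiple workload types
- Fast onboarding for complex workload types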
Introduction to Koordinator

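- Koord-Manager
  - SLO-Controller: provides the core control capabilities, including resource overcommitment, colocation SLO management, and fine-grained scheduling enhancements.
  - Recommender: provides elasticity capabilities for applications, built around resource profiling.
  - Colocation Profile Webhook: simplifies use of the Koordinator colocation model by giving applications one-step onboarding and automatically injecting the relevant priority and QoS configuration.
- Koord extensions for Scheduler: scheduling enhancements for colocation scenarios.
- Koord descheduler: provides a flexible and extensible descheduling mechanism.
- Koord Runtime Proxy: acts as a proxy between the Kubelet and the container runtime to meet the resource management needs of different scenarios. It provides a plugin registration framework and a mechanism for injecting resource parameters.
- Koordlet: responsible for Pod QoS guarantees on each node. It provides fine-grained container metrics collection, interference detection, and adjustment strategies, and supports a set of Runtime Proxy plugins for injecting fine-grained isolation parameters.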




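- Typical scenarios:
  - Prod + LS: typical online applications, which usually have strict latency requirements and need high resource quality, while still requiring a degree of resource elasticity.
  - Batch + BE: low-priority offline workloads in colocation scenarios, which tolerate lower resource quality reasonably well, for example batch Spark/MR jobs and AI training jobs.
- Enhancements to the typical scenarios:
  - Prod + LSR/LSE: more sensitive online applications that can give up resource elasticity in exchange for better determinism (such as CPU core binding) and that have extremely strict latency requirements.
  - Mid/Free + BE: compared with "Batch + BE", the main difference is the level of resource quality required.
- Atypical scenarios:
  - Mid/Batch/Free + LS: low-priority online services, nearline computing, AI inference, and similar tasks. Compared with big-data tasks, they cannot accept very low resource quality and interfere relatively little with other applications. Compared with typical online services, they can tolerate somewhat lower resource quality, for example a certain degree of eviction.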
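The combinations listed above can be summarized as a small lookup table. The sketch below is purely illustrative: `SCENARIOS` and `describe` are hypothetical names and not part of any Koordinator API. Only the priority and QoS class names come from the list above.

```python
# Hypothetical lookup table; only the class names come from the article.
PRIORITIES = ("Prod", "Mid", "Batch", "Free")
QOS_CLASSES = ("LSE", "LSR", "LS", "BE")

SCENARIOS = {
    ("Prod", "LS"): "typical: online application",
    ("Batch", "BE"): "typical: low-priority offline workload",
    ("Prod", "LSR"): "enhanced: sensitive online application",
    ("Prod", "LSE"): "enhanced: sensitive online application",
    ("Mid", "BE"): "enhanced: offline workload with a different resource quality level",
    ("Free", "BE"): "enhanced: offline workload with a different resource quality level",
    ("Mid", "LS"): "atypical: low-priority online, nearline, or inference task",
    ("Batch", "LS"): "atypical: low-priority online, nearline, or inference task",
    ("Free", "LS"): "atypical: low-priority online, nearline, or inference task",
}


def describe(priority, qos):
    """Return the scenario for a priority/QoS pair, as listed in the article."""
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority}")
    if qos not in QOS_CLASSES:
        raise ValueError(f"unknown QoS class: {qos}")
    return SCENARIOS.get((priority, qos), "not listed in this article")


print(describe("Prod", "LS"))
print(describe("Batch", "LSR"))
```

Pairs that the list does not mention, such as Batch + LSR, fall through to the "not listed" default rather than being treated as invalid.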
Quick Start

# Spark Driver Pod example
apiVersion: v1
kind: Pod
metadata:
  labels:
    koordinator.sh/qosClass: BE
  ...
spec:
  containers:
  - args:
    - driver
    ...
    resources:
      limits:
        koordinator.sh/batch-cpu: "1000"
        koordinator.sh/batch-memory: 3456Mi
      requests:
        koordinator.sh/batch-cpu: "1000"
        koordinator.sh/batch-memory: 3456Mi
  ...
Key Technologies
Resource Overcommitment

# node info
allocatable:
  koordinator.sh/batch-cpu: 50k # milli-core
  koordinator.sh/batch-memory: 50Gi
# pod info
annotations:
  koordinator.sh/resource-limit: {cpu: "5k"}
resources:
  requests:
    koordinator.sh/batch-cpu: 5k # milli-core
    koordinator.sh/batch-memory: 5Gi
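To make the numbers in the snippet above concrete, the following sketch checks how many such pods would fit into the node's batch resources. It is a minimal illustration, not Koordinator's scheduler logic: `parse_quantity` and `max_pods_that_fit` are hypothetical helpers, and the parser handles only the suffixes that appear in this article.

```python
# Hypothetical helpers; they understand only the suffixes used in this article.
SUFFIXES = {"k": 1000, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_quantity(text):
    """Convert strings such as "50k" or "5Gi" into a plain integer."""
    for suffix, factor in SUFFIXES.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def max_pods_that_fit(allocatable, requests):
    """Number of identical pods that fit, limited by the scarcest resource."""
    return min(
        parse_quantity(allocatable[name]) // parse_quantity(amount)
        for name, amount in requests.items()
    )


node_allocatable = {
    "koordinator.sh/batch-cpu": "50k",      # milli-cores
    "koordinator.sh/batch-memory": "50Gi",
}
pod_requests = {
    "koordinator.sh/batch-cpu": "5k",       # milli-cores
    "koordinator.sh/batch-memory": "5Gi",
}

print(max_pods_that_fit(node_allocatable, pod_requests))  # 10
```

With the values shown, both CPU and memory allow ten pods, so neither resource is the sole bottleneck.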
Load-Aware Scheduling


Application Onboarding Management - ClusterColocationProfile
apiVersion: config.koordinator.sh/v1alpha1
kind: ClusterColocationProfile
metadata:
  name: colocation-profile-example
spec:
  namespaceSelector:
    matchLabels:
      koordinator.sh/enable-colocation: "true"
  selector:
    matchLabels:
      sparkoperator.k8s.io/launched-by-spark-operator: "true"
  qosClass: BE
  priorityClassName: koord-batch
  koordinatorPriority: 1000
  schedulerName: koord-scheduler
  labels:
    koordinator.sh/mutated: "true"
  annotations:
    koordinator.sh/intercepted: "true"
  patch:
    spec:
      terminationGracePeriodSeconds: 30
$ kubectl apply -f profile.yaml
$ kubectl label ns spark-job koordinator.sh/enable-colocation=true
$ # Submit the Spark job; the Pods created by SparkOperator are co-located with other LS Pods.
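The effect of the profile above can be pictured as a match-then-mutate step. The sketch below is an assumption-laden illustration, not the webhook's actual implementation: `labels_match` and `apply_profile` are hypothetical helpers, the profile is flattened into a plain dictionary, and only a subset of the fields is handled (`koordinatorPriority` and any resource-name translation are left out).

```python
import copy

# The example profile, flattened into plain dictionaries for illustration.
PROFILE = {
    "namespaceSelector": {"koordinator.sh/enable-colocation": "true"},
    "selector": {"sparkoperator.k8s.io/launched-by-spark-operator": "true"},
    "qosClass": "BE",
    "priorityClassName": "koord-batch",
    "schedulerName": "koord-scheduler",
    "labels": {"koordinator.sh/mutated": "true"},
    "annotations": {"koordinator.sh/intercepted": "true"},
    "patch": {"spec": {"terminationGracePeriodSeconds": 30}},
}


def labels_match(match_labels, labels):
    """True if every key/value pair in match_labels is present in labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


def apply_profile(profile, namespace_labels, pod):
    """Return a mutated copy of the pod if both selectors match, else the pod as-is."""
    if not labels_match(profile["namespaceSelector"], namespace_labels):
        return pod
    if not labels_match(profile["selector"], pod["metadata"].get("labels", {})):
        return pod
    mutated = copy.deepcopy(pod)
    meta = mutated["metadata"]
    meta.setdefault("labels", {}).update(profile["labels"])
    # The QoS class appears as a label in the Spark driver Pod example.
    meta["labels"]["koordinator.sh/qosClass"] = profile["qosClass"]
    meta.setdefault("annotations", {}).update(profile["annotations"])
    spec = mutated.setdefault("spec", {})
    spec["priorityClassName"] = profile["priorityClassName"]
    spec["schedulerName"] = profile["schedulerName"]
    spec.update(profile["patch"]["spec"])
    return mutated


pod = {
    "metadata": {"labels": {"sparkoperator.k8s.io/launched-by-spark-operator": "true"}},
    "spec": {},
}
namespace_labels = {"koordinator.sh/enable-colocation": "true"}
result = apply_profile(PROFILE, namespace_labels, pod)
print(result["spec"]["schedulerName"])  # koord-scheduler
```

A Pod in a namespace without the `koordinator.sh/enable-colocation` label is returned untouched, which mirrors why the namespace must be labeled before the Spark job is submitted.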
QoS Enhancement - CPU Suppress

QoS Enhancement - Eviction Based on Resource Satisfaction

QoS Enhancement - CPU Burst



QoS Enhancement - Group Identity

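QoS Enhancement - Memory QoS
- Container memory limit: when a container's own memory usage (including page cache) approaches its limit, the kernel's memory reclaim subsystem is triggered, which degrades the performance of memory allocation and release for applications in the container.
- Node memory limit: when container memory is overcommitted (Memory Limit > Request) and the node as a whole runs short of memory, the kernel's global memory reclaim is triggered. This has a significant performance impact and, in extreme cases, can make the whole node malfunction.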

Roadmap
Fine-grained CPU Orchestration


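- SameCore policy: better isolation, but little room for elasticity.
- Spread policy: moderate isolation, which can be improved with other isolation strategies. Used properly, it can deliver better performance than the SameCore policy, and it leaves some room for elasticity.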
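The difference between the two policies can be illustrated with a toy CPU topology. This is a sketch under stated assumptions, not Koordinator's allocator: the topology (four physical cores with two hyperthreads each), the CPU numbering, and the `same_core` and `spread` helpers are all hypothetical.

```python
# Hypothetical topology: physical core id -> logical CPU ids (hyperthreads).
TOPOLOGY = {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}


def same_core(topology, count):
    """Use every hyperthread of a physical core before moving to the next core."""
    ordered = [cpu for core in sorted(topology) for cpu in topology[core]]
    if count > len(ordered):
        raise ValueError("not enough logical CPUs")
    return ordered[:count]


def spread(topology, count):
    """Take one hyperthread from each physical core before reusing any core."""
    width = max(len(cpus) for cpus in topology.values())
    ordered = [
        topology[core][index]
        for index in range(width)
        for core in sorted(topology)
        if index < len(topology[core])
    ]
    if count > len(ordered):
        raise ValueError("not enough logical CPUs")
    return ordered[:count]


print(same_core(TOPOLOGY, 4))  # [0, 4, 1, 5]
print(spread(TOPOLOGY, 4))     # [0, 1, 2, 3]
```

In this toy model, SameCore gives a workload whole physical cores to itself, while Spread leaves the sibling hyperthread of each core free, which is where both the extra elasticity and the weaker isolation come from.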

Resource Reservation
kind: Reservation
metadata:
  name: my-reservation
  namespace: default
spec:
  template: ... # a copy of the Pod's spec
  resourceOwners:
    controller:
      apiVersion: apps/v1
      kind: Deployment
      name: deployment-5b8df84dd
  timeToLiveInSeconds: 300 # 300 seconds
  nodeName: node-1
status:
  phase: Available
  ...
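One way to read the Reservation object above is as a claim on a node's resources that a specific owner may use until the time-to-live runs out. The sketch below encodes that reading. It is an assumption about the intended semantics, since the feature is described here as planned work, and `is_usable` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# The example object as a plain dictionary; the elided template is omitted.
RESERVATION = {
    "spec": {
        "resourceOwners": {
            "controller": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": "deployment-5b8df84dd",
            }
        },
        "timeToLiveInSeconds": 300,
        "nodeName": "node-1",
    },
    "status": {"phase": "Available"},
}


def is_usable(reservation, pod_owner, created_at, now):
    """Assumed rule: Available phase, TTL not elapsed, and matching owner."""
    spec = reservation["spec"]
    if reservation["status"]["phase"] != "Available":
        return False
    expires_at = created_at + timedelta(seconds=spec["timeToLiveInSeconds"])
    if now >= expires_at:
        return False
    return spec["resourceOwners"]["controller"] == pod_owner


owner = {"apiVersion": "apps/v1", "kind": "Deployment", "name": "deployment-5b8df84dd"}
created = datetime(2022, 6, 23, 18, 0, 0, tzinfo=timezone.utc)

print(is_usable(RESERVATION, owner, created, created + timedelta(seconds=120)))  # True
print(is_usable(RESERVATION, owner, created, created + timedelta(seconds=300)))  # False
```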
Fine-grained GPU Scheduling

Resource Recommendation

Community Building


- If you find a typo, try to fix it!
- If you find a bug, try to fix it!
- If you find some redundant code, try to remove it!
- If you find some test cases missing, try to add them!
- If you could enhance a feature, please DO NOT hesitate!
- If you find code implicit, try to add comments to make it clear!
- If you find code ugly, try to refactor that!
- If you can help to improve documents, it could not be better!
- If you find a document incorrect, just go ahead and fix it!
- ...

