A Survey of CNN Encoders in Self-Supervised Speech Pre-Training Models
2022-07-25 09:27:00 [Haulyn5]
Preface
I have been reading the WavLM paper recently and noticed that its CNN encoder has 7 layers, described as follows.
“The convolutional encoder is composed of seven blocks of temporal convolution followed by layer normalization and a GELU activation layer. The temporal convolutions have 512 channels with strides (5,2,2,2,2,2,2) and kernel widths (10,3,3,3,3,2,2), resulting in each output representing about 25ms of audio strided by 20ms.” (Chen et al., 2022, p. 3)
This made me curious: is there any deeper meaning to hyperparameters such as the number of layers and the strides of these CNNs? Why this particular setting, and why not another? Yet most self-supervised speech pre-training papers I have browsed lately treat the CNN encoder in passing, briefly describing its role and parameters before moving on. At most, they note that it reduces the dimensionality of the high-dimensional speech signal, and that the given hyperparameters roughly emulate a 25 ms receptive field with a 20 ms stride, the parameters traditionally recommended in classical speech signal processing. Since the WavLM paper devotes an entire section to HuBERT, I suspected this CNN setting follows HuBERT, and a quick search confirmed that HuBERT uses exactly the same configuration. But HuBERT does not explain the choice either; after all, there are infinitely many CNN designs that yield a 25 ms receptive field. Since HuBERT in turn focuses on and compares against wav2vec 2.0, I dug a bit further, and now that I have gathered some data points, I am summarizing them in this blog post.
Main Text
In short, for the initial CNN encoder, WavLM follows HuBERT's setting, and HuBERT follows wav2vec 2.0's. From what I knew, wav2vec 2.0 builds directly on vq-wav2vec, but when looking into vq-wav2vec I found that it does not use the same CNN encoder.
The table below records the settings.
| Model | Parameters | Quote |
| --- | --- | --- |
| vq-wav2vec | 8 conv layers | “The encoder has 8 layers with 512 channels each, kernel sizes (10,8,4,4,4,1,1,1) and strides (5,4,2,2,2,1,1,1), yielding a total stride of 160. Each layer contains a convolution, followed by dropout, group normalization with a single group (Wu & He, 2018) and a ReLU non-linearity.” (Baevski et al., 2019, p. 4) |
| wav2vec 2.0 | 7 conv layers | “The feature encoder contains seven blocks and the temporal convolutions in each block have 512 channels with strides (5,2,2,2,2,2,2) and kernel widths (10,3,3,3,3,2,2).” (Baevski et al., 2020, p. 4) “This results in an encoder output frequency of 49 hz with a stride of about 20ms between each sample, and a receptive field of 400 input samples or 25ms of audio.” (Baevski et al., 2020, p. 4) |
| HuBERT | Same as wav2vec 2.0 | “The waveform encoder is identical for all the three configurations, which is composed of seven 512-channel layers with strides [5,2,2,2,2,2,2] and kernel widths [10,3,3,3,3,2,2]. The BERT encoder consists of many identical transformer blocks, whose parameters along with the parameter of the subsequent projection layer are specified in Table” (Hsu et al., 2021, p. 3) |
| WavLM | Same as HuBERT | “The convolutional encoder is composed of seven blocks of temporal convolution followed by layer normalization and a GELU activation layer. The temporal convolutions have 512 channels with strides (5,2,2,2,2,2,2) and kernel widths (10,3,3,3,3,2,2), resulting in each output representing about 25ms of audio strided by 20ms.” (Chen et al., 2022, p. 3) |
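As a sanity check on the quoted 25 ms / 20 ms figures, the total stride and receptive field of the seven-block stack can be derived from the strides and kernel widths alone. A minimal sketch in plain Python, assuming the 16 kHz sampling rate these models use:

```python
# Derive the total stride and receptive field of the
# wav2vec 2.0 / HuBERT / WavLM conv stack from its hyperparameters.
kernels = (10, 3, 3, 3, 3, 2, 2)
strides = (5, 2, 2, 2, 2, 2, 2)

receptive_field = 1  # in input samples
total_stride = 1
for k, s in zip(kernels, strides):
    # each layer widens the receptive field by (k - 1) steps
    # at the current input-level stride
    receptive_field += (k - 1) * total_stride
    total_stride *= s

sr = 16000  # Hz, the sampling rate assumed by all four models
print(receptive_field, total_stride)             # 400 320
print(receptive_field / sr * 1000, "ms window")  # 25.0 ms window
print(total_stride / sr * 1000, "ms hop")        # 20.0 ms hop
```

This matches the wav2vec 2.0 quote exactly: a receptive field of 400 samples (25 ms) with a hop of 320 samples (20 ms, i.e. the roughly 49 Hz output frame rate mentioned in the paper).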
I may add more later, including other models and the rationale behind these parameter settings (that part will have to wait until I understand it myself).
Update (2022-06-01)
However, when I actually ran the WavLM code, the CNN encoder did not seem to reduce dimensionality. For an input of shape (8, 80000) (i.e. batch_size=8, with 80,000 waveform samples per utterance), the model looks as follows. (Model information printed with the torchinfo library.)
```python
from torchinfo import summary
from wavlm.WavLM import WavLM, WavLMConfig, ConvFeatureExtractionModel

# (channels, kernel width, stride) for each of the seven conv blocks
feature_enc_layers = [(512, 10, 5),
                      (512, 3, 2),
                      (512, 3, 2),
                      (512, 3, 2),
                      (512, 3, 2),
                      (512, 2, 2),
                      (512, 2, 2)]

conv_feature_extractor = ConvFeatureExtractionModel(
    conv_layers=feature_enc_layers,
    dropout=0.0,
    mode='default',
    conv_bias=False,
)

summary(conv_feature_extractor, input_size=(8, 80000))
```
==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
ConvFeatureExtractionModel -- --
├─ModuleList: 1-1 -- --
│ └─Sequential: 2-1 [8, 512, 15999] --
│ │ └─Conv1d: 3-1 [8, 512, 15999] 5,120
│ │ └─Dropout: 3-2 [8, 512, 15999] --
│ │ └─Fp32GroupNorm: 3-3 [8, 512, 15999] 1,024
│ │ └─GELU: 3-4 [8, 512, 15999] --
│ └─Sequential: 2-2 [8, 512, 7999] --
│ │ └─Conv1d: 3-5 [8, 512, 7999] 786,432
│ │ └─Dropout: 3-6 [8, 512, 7999] --
│ │ └─GELU: 3-7 [8, 512, 7999] --
│ └─Sequential: 2-3 [8, 512, 3999] --
│ │ └─Conv1d: 3-8 [8, 512, 3999] 786,432
│ │ └─Dropout: 3-9 [8, 512, 3999] --
│ │ └─GELU: 3-10 [8, 512, 3999] --
│ └─Sequential: 2-4 [8, 512, 1999] --
│ │ └─Conv1d: 3-11 [8, 512, 1999] 786,432
│ │ └─Dropout: 3-12 [8, 512, 1999] --
│ │ └─GELU: 3-13 [8, 512, 1999] --
│ └─Sequential: 2-5 [8, 512, 999] --
│ │ └─Conv1d: 3-14 [8, 512, 999] 786,432
│ │ └─Dropout: 3-15 [8, 512, 999] --
│ │ └─GELU: 3-16 [8, 512, 999] --
│ └─Sequential: 2-6 [8, 512, 499] --
│ │ └─Conv1d: 3-17 [8, 512, 499] 524,288
│ │ └─Dropout: 3-18 [8, 512, 499] --
│ │ └─GELU: 3-19 [8, 512, 499] --
│ └─Sequential: 2-7 [8, 512, 249] --
│ │ └─Conv1d: 3-20 [8, 512, 249] 524,288
│ │ └─Dropout: 3-21 [8, 512, 249] --
│ │ └─GELU: 3-22 [8, 512, 249] --
==========================================================================================
Total params: 4,200,448
Trainable params: 4,200,448
Non-trainable params: 0
Total mult-adds (G): 98.14
==========================================================================================
Input size (MB): 2.56
Forward/backward pass size (MB): 1564.41
Params size (MB): 16.80
Estimated Total Size (MB): 1583.77
==========================================================================================

As the summary shows, each sample's output feature has 512 × 249 = 127,488 values, so the feature dimensionality per utterance actually increases. Although each convolution by itself downsamples along time, with enough filters the final output dimensionality can still exceed the input's. Concretely, for 16,000 Hz audio with a 25 ms receptive field and a 20 ms hop, the output frame rate is about 50 Hz, so once the number of filters exceeds roughly 16000/50 = 320, the output dimensionality surpasses that of the input.
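The frame counts in the summary can be reproduced with the standard output-length formula for an unpadded 1-D convolution, ⌊(L − k)/s⌋ + 1, which also makes the break-even channel count explicit. A small sketch, independent of the WavLM code:

```python
# Reproduce the frame counts from the torchinfo summary above and
# compare the encoder's output size with the raw waveform length.
kernels = (10, 3, 3, 3, 3, 2, 2)
strides = (5, 2, 2, 2, 2, 2, 2)
channels = 512

length = 80000  # waveform samples per utterance (16 kHz * 5 s)
for k, s in zip(kernels, strides):
    # unpadded Conv1d: L_out = floor((L_in - kernel) / stride) + 1
    length = (length - k) // s + 1

print(length)             # 249 frames, matching the last summary row
print(channels * length)  # 127488 output values, more than 80000 inputs
print(80000 / length)     # ~321: channel count above which output > input
```

So with 512 channels the output is larger than the input, and the intuition from the text holds: only below roughly 320 channels would this stack actually shrink the per-utterance dimensionality.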
References
[1] S. Chen et al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” arXiv:2110.13900 [cs, eess], Jan. 2022, Accessed: May 11, 2022. [Online]. Available: http://arxiv.org/abs/2110.13900
[2] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” arXiv:2106.07447 [cs, eess], Jun. 2021, Accessed: Apr. 28, 2022. [Online]. Available: http://arxiv.org/abs/2106.07447
[3] A. Baevski, S. Schneider, and M. Auli, “vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations,” presented at the International Conference on Learning Representations, Sep. 2019. Accessed: Apr. 01, 2022. [Online]. Available on OpenReview.
[4] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” arXiv:2006.11477 [cs, eess], Oct. 2020, Accessed: Mar. 11, 2022. [Online]. Available: http://arxiv.org/abs/2006.11477