Transformer variants (Sparse Transformer, Longformer, Switch Transformer)
2022-07-25 12:02:00 【Shangshanxianger】

Before you know it, the Transformer has gradually penetrated every field and has spawned a considerable number of variants, as shown in the figure above. A previous post covered the variants Star-Transformer and Transformer-XL; this article sorts out several more important ones, namely Sparse Transformer, Longformer, and Switch Transformer.

Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection
The standard Transformer has O(n^2) complexity, but is it really necessary for every element to attend to all elements in the sequence, and is there a way to simplify the mechanism? The "Sparse" in this paper's title means that only a few tokens participate in the attention distribution, which improves the concentration of the attention mechanism. Intuitively, a word is usually related to only a few other words, yet standard self-attention assigns weight to every word before aggregating; a natural idea is therefore to explicitly select a handful of elements and let the model focus only on those.
The model diagram is shown in the figure above: the leftmost path is the standard attention computation, the middle one is the Sparse implementation, and the rightmost is its execution diagram. The difference is an explicit Sparsification step inserted before the softmax: only the top-k important elements are kept. Concretely, first compute the inner products

$$P = \frac{QK^{T}}{\sqrt{d}}$$

then keep only the top-k entries of $P$ per row (the $M$ operation below) and set the rest directly to negative infinity, forcing attention onto just k elements:

$$A = \mathrm{softmax}(M(P, k))$$

Finally, multiply the scores back onto $V$:

$$C = AV$$

This makes the attention more concentrated. Such computation-reducing sparsity is also part of what lets models like GPT-3 be scaled up so aggressively while still achieving good results, but that is a topic for another post...
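To make the top-k masking concrete, here is a minimal PyTorch sketch of the three steps above. It is not the authors' released code; the tensor shapes and the value of k are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def explicit_sparse_attention(Q, K, V, k=8):
    """Top-k masked attention: only the k largest scores per query survive.

    Q, K, V: (batch, seq_len, d) tensors; k: number of keys each query attends to.
    """
    d = Q.size(-1)
    # P = QK^T / sqrt(d)
    P = torch.matmul(Q, K.transpose(-2, -1)) / (d ** 0.5)   # (batch, seq, seq)
    # M(P, k): keep the top-k scores per row, set the rest to -inf
    topk_vals, _ = P.topk(k, dim=-1)
    threshold = topk_vals[..., -1:]                          # k-th largest per row
    P_masked = P.masked_fill(P < threshold, float("-inf"))
    # A = softmax(M(P, k)), C = AV
    A = F.softmax(P_masked, dim=-1)
    return torch.matmul(A, V)

# Example: 2 sequences, 16 tokens, 64-dim representations (illustrative sizes)
Q = torch.randn(2, 16, 64); K = torch.randn(2, 16, 64); V = torch.randn(2, 16, 64)
out = explicit_sparse_attention(Q, K, V, k=4)   # (2, 16, 64)
```

Note that this dense sketch still computes the full score matrix before masking; it only illustrates the selection step, not an actual reduction in memory or FLOPs.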
- paper:https://arxiv.org/abs/1912.11637
- code:https://github.com/lancopku/Explicit-Sparse-Transformer

Longformer: The Long-Document Transformer
Longformer is another classic sparse-attention approach. It combines three strategies (a minimal mask sketch follows the list):
Sliding Window: as shown in figure (b) above, similar to a CNN, each token attends only within a fixed window of size w, i.e. w/2 tokens on each side, instead of attending to everything. The computational complexity drops to O(n × w), linear in the sequence length. Setting a different window size for each layer gives a good balance between efficiency and representational power.
Dilated sliding window: as shown in the figure above, similar to a dilated CNN, the receptive field can be enlarged further without increasing computational complexity. Likewise, using different dilation configurations on different heads of the multi-head attention lets the model attend to different local contexts, and the dilation reaches tokens that are much farther away.
Global Attention: as shown in figure (d), a few global tokens attend to (and are attended by) the whole sequence and can represent its overall characteristics, much like [CLS] in BERT. The overall complexity stays O(n).
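To make the sliding-window plus global pattern concrete, here is a minimal PyTorch sketch of the attention mask. This only illustrates the pattern, not Longformer's actual implementation (which uses banded matrix kernels and never materializes the full n×n mask); the window size and global positions are made-up values.

```python
import torch

def longformer_style_mask(seq_len, window=4, global_idx=(0,)):
    """Boolean attention mask: True = allowed to attend.

    Each token sees +/- window//2 neighbours; tokens in global_idx
    (e.g. a [CLS]-like position) attend to and are attended by everyone.
    """
    idx = torch.arange(seq_len)
    # Sliding window: |i - j| <= window // 2
    local = (idx[:, None] - idx[None, :]).abs() <= window // 2
    # Global attention rows and columns
    g = torch.zeros(seq_len, dtype=torch.bool)
    g[list(global_idx)] = True
    return local | g[:, None] | g[None, :]

mask = longformer_style_mask(seq_len=10, window=4, global_idx=(0,))
print(mask.int())  # banded matrix with a full first row and first column
```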
For full details, see the original paper:
- paper:https://arxiv.org/pdf/2004.05150.pdf
- code:https://github.com/allenai/longformer

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Unlike Sparse Attention, whose sparse operators struggle to exploit the performance of GPU/TPU hardware, Switch Transformer needs no sparse operators and adapts better to dense hardware such as GPUs and TPUs. Its main idea is to simplify sparse routing: in a Mixture-of-Experts (MoE) layer for natural language, sending each token representation to a single expert rather than several actually works better. The model architecture is shown in the figure above; the blue part in the middle is the key point of comparison. Each time, the router sends a token only to the single FFN with the largest routing score p, which greatly reduces the amount of computation.
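To illustrate the top-1 routing idea, here is a minimal PyTorch sketch. It is not the Mesh-TensorFlow implementation linked below: the capacity factor and load-balancing loss from the paper are omitted, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class SwitchFFN(nn.Module):
    """Switch-style MoE layer: each token is routed to exactly one expert FFN."""

    def __init__(self, d_model=64, d_ff=256, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (num_tokens, d_model)
        probs = self.router(x).softmax(dim=-1)  # router scores p
        top_p, expert_id = probs.max(dim=-1)    # top-1 expert per token
        out = torch.zeros_like(x)
        for e, ffn in enumerate(self.experts):
            sel = expert_id == e
            if sel.any():
                # Scale by the router probability so the router still
                # receives gradients through the selected expert.
                out[sel] = ffn(x[sel]) * top_p[sel].unsqueeze(-1)
        return out

layer = SwitchFFN()
tokens = torch.randn(32, 64)   # 32 tokens with d_model = 64
y = layer(tokens)              # (32, 64); each token passed through exactly one expert
```

The Python loop over experts is only for readability; in the real system each expert lives on its own device and tokens are dispatched to it in parallel.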
On the other hand, a large part of its success comes from a very good parallelization strategy that combines data parallelism, model parallelism, and expert parallelism. Specifically:
Looking at the model architecture, expert parallelism is parallelism between operators: each expert's FFN can use operator-level model parallelism internally, while the experts as a whole form multiple parallel FFN branches in the computation graph, i.e. inter-operator model parallelism. This yields lower communication overhead and improves parallel efficiency.
- paper:https://arxiv.org/abs/2101.03961
- code:https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/moe.py
The next article will continue with related content.