How to calculate the distance between texts: WMD
2022-06-25 08:25:00 【Happy little yard farmer】
WMD
Paper: From Word Embeddings To Document Distances
In 2015, the paper proposed Word Mover's Distance (WMD): a measure of document similarity based on word vectors. It computes a distance between sentences; the smaller the distance, the higher the similarity.
1. Why was WMD proposed?
Some existing methods for computing sentence/document similarity:
Unsupervised sentence similarity: the Jaccard similarity coefficient.
The Jaccard similarity coefficient measures the similarity between two sets, defined as the size of the intersection divided by the size of the union:
$$SIM(s_1,s_2) = \frac{\text{intersection}(words_1, words_2)}{\text{union}(words_1, words_2)}$$
The larger the Jaccard similarity coefficient, the more similar the sentences. The Jaccard distance measures the dissimilarity between two sets; it is the complement of the Jaccard similarity coefficient, defined as 1 minus the coefficient:
$$d(s_1,s_2)=1-SIM(s_1,s_2) = 1-\frac{\text{intersection}(words_1, words_2)}{\text{union}(words_1, words_2)}$$
The larger the Jaccard distance, the less similar the sentences.
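As a quick illustration, a minimal Python sketch of Jaccard similarity and distance over whitespace-tokenized word sets (the lowercase/whitespace tokenization is an assumption; a real pipeline would segment and normalize properly):

```python
def jaccard_similarity(s1: str, s2: str) -> float:
    """Jaccard similarity of two sentences over their word sets."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

def jaccard_distance(s1: str, s2: str) -> float:
    """Jaccard distance = 1 - Jaccard similarity."""
    return 1.0 - jaccard_similarity(s1, s2)

print(jaccard_similarity("obama speaks to the media",
                         "the president greets the press"))  # 0.125
```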
Vector space models: represent each sentence as a vector, then compute pairwise Euclidean distance, cosine similarity, or another metric.
Based on Word2Vec: use pretrained word-embedding vectors and represent a sentence by summing or averaging its word vectors element-wise.
Sentence modeling: methods include doc2vec (built on skip-gram/CBOW), autoencoder-based models, skip-thought sentence models, and so on. Sentence-vector methods preserve more of a sentence's semantic information and, to some extent, compensate for the shortcomings of bag-of-words.
The first problem to solve is how to represent a document, which can be done with:
- Bag-of-words (BOW) / TF-IDF: ignore grammar and word order, keep only word frequencies, and vectorize the text. The problem: if two sentences share no words, their vectors judge them completely unrelated, even though they may have the same meaning.
- LSI and LDA: group similar words into topics and represent a document as a probability distribution over topics. More coherent than BOW, but in practice they do not improve on BOW's performance. To compare two documents, compare the differences between their topic distributions.
- Word2Vec: map every word in the text into a word-vector space. This partially solves the semantics problem, since CBOW/Skip-Gram exploit context and the resulting vectors reflect semantic differences between words. A document/sentence is then represented as the average of all its word vectors, and document similarity is measured by pairwise Euclidean distance, cosine similarity, etc. (see the sketch after this list).
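A minimal sketch of the averaging approach, assuming a pretrained word2vec model loadable with gensim's `KeyedVectors` (the file name `word2vec.bin` and the simple tokenization are placeholders):

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumed pretrained embeddings; any word2vec-format file works here.
wv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

def sentence_vector(sentence: str) -> np.ndarray:
    """Average the vectors of all in-vocabulary words in the sentence."""
    vecs = [wv[w] for w in sentence.lower().split() if w in wv]
    return np.mean(vecs, axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = sentence_vector("obama speaks to the media")
v2 = sentence_vector("the president greets the press")
print(cosine_similarity(v1, v2))
```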
Word2Vec captures distances between words; WMD extends the idea to distances between documents. It uses a distance to reflect document similarity, modeling the distance between two documents as a combination of the semantic distances between their words: for example, compute the Euclidean distance between the word vectors of every pair of words across the two documents, then take a weighted sum.
2. How does WMD solve the problem?
WMD is solved as a special case of the Earth Mover's Distance (EMD) problem.
2.1 Define the problem
2.1.1 Normalized word frequency
Normalized word frequency (normalized bag-of-words, nBOW): if word $i$ occurs $c_i$ times in the text, its normalized frequency is:
$$d_i=\frac{c_i}{\sum_{j=1}^{n}c_j}$$
A word's importance is tied to how frequently it occurs (and is then normalized).
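A minimal sketch of the nBOW representation over an assumed shared vocabulary (the `vocab` list is a toy placeholder):

```python
from collections import Counter
import numpy as np

def nbow(tokens: list, vocab: list) -> np.ndarray:
    """Normalized bag-of-words: word counts divided by the total count."""
    counts = Counter(tokens)
    d = np.array([counts[w] for w in vocab], dtype=float)
    return d / d.sum()

vocab = ["obama", "speaks", "media", "president", "greets", "press"]
print(nbow("obama speaks to the media".split(), vocab))
# ≈ [0.33 0.33 0.33 0.   0.   0.  ]  (words outside vocab are dropped)
```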
2.1.2 Word movement cost
Word travel cost: words in WMD are represented by word2vec embeddings, so the Euclidean distance between them can be computed directly. The semantic distance $c(i,j)$ between word $i$ and word $j$ is the Euclidean distance:
$$c(i,j)=\|x_i-x_j\|_2=\sqrt{(x_i^1-x_j^1)^2+\cdots+(x_i^d-x_j^d)^2}$$
2.1.3 Document distance
Document distance, expressed intuitively: $\sum_{i,j}T_{ij}\cdot c(i,j)$, where the transport matrix entry $T_{ij}$ is the weight moved from word $i$ in document $d$ to word $j$ in the other document $d'$.
2.1.4 Constraints
Combining this with EMD, a classic transportation problem, the objective is to minimize the total moving cost between the two documents, with constraints that adjust the weights so that each word in document $d$ is matched, with varying weights, to all the words in document $d'$:
$$\min_{T\ge0}\sum_{i,j=1}^{n}T_{ij}\,c(i,j)$$
$$\text{s.t.}\quad \sum_{j=1}^{n}T_{ij}=d_i \quad \forall i\in\{1,\dots,n\},\qquad \sum_{i=1}^{n}T_{ij}=d_j' \quad \forall j\in\{1,\dots,n\}$$
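Since this is an ordinary linear program, a small sketch can solve it directly with `scipy.optimize.linprog`; the toy embeddings and documents below are illustrative assumptions, not the paper's setup:

```python
import numpy as np
from scipy.optimize import linprog

def wmd(X: np.ndarray, d: np.ndarray, d_prime: np.ndarray) -> float:
    """Solve the WMD linear program.

    X          : (dim, p) matrix of word vectors for the shared vocabulary
    d, d_prime : (p,) nBOW vectors of the two documents
    """
    p = X.shape[1]
    # C[i, j] = Euclidean distance between word vectors i and j.
    C = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    # Equality constraints on the flattened transport matrix T (row-major).
    A_eq = np.zeros((2 * p, p * p))
    for i in range(p):
        A_eq[i, i * p:(i + 1) * p] = 1   # sum_j T[i, j] = d[i]
        A_eq[p + i, i::p] = 1            # sum_i T[i, j] = d'[j]
    b_eq = np.concatenate([d, d_prime])
    # Default bounds are (0, None), which matches T >= 0.
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, method="highs")
    return float(res.fun)

# Toy example: 3 vocabulary words with 5-dimensional random embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
print(wmd(X, np.array([0.5, 0.5, 0.0]), np.array([0.0, 0.5, 0.5])))
```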
2.2 Fast calculation
Solving the WMD optimization problem has an average time complexity of $O(p^3 \log p)$, where $p$ is the number of unique words in the documents. When a dataset contains many documents or many unique words, solving the WMD problem becomes expensive.
Computing the WMD between two documents means solving a sizable linear program, so using it for k-NN search (the $k$ most similar documents) is time-consuming. Two optimizations are therefore proposed to reduce the amount of computation.
The paper introduces two lower bounds to filter documents, trading a little accuracy for lower time complexity: Word Centroid Distance (WCD) and Relaxed Word Moving Distance (RWMD).
2.2.1 WCD
WCD (Word Centroid Distance): by the triangle inequality $|a|+|b|\ge |a\pm b|\ge ||a|-|b||$, the WMD can be lower-bounded by the centroid distance $\|Xd-Xd'\|_2$; that is, the distance between the two documents' centroids is always smaller than their WMD. Using the constraints $\sum_j T_{ij}=d_i$ and $\sum_i T_{ij}=d_j'$:
$$\sum_{i,j=1}^{n}T_{ij}\,c(i,j)=\sum_{i,j=1}^{n}T_{ij}\,\|x_i-x_j\|_2 \ge \Big\|\sum_{i,j=1}^{n}T_{ij}(x_i-x_j)\Big\|_2 = \Big\|\sum_{i=1}^{n}d_i x_i-\sum_{j=1}^{n}d_j' x_j\Big\|_2 = \|Xd-Xd'\|_2$$
where $X\in \mathbb{R}^{d\times p}$ is the matrix of word vectors. Computing WCD takes $O(dp)$ time.
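A one-line numpy sketch of WCD, using the same `X`, `d`, `d_prime` conventions as the `wmd` sketch above:

```python
import numpy as np

def wcd(X: np.ndarray, d: np.ndarray, d_prime: np.ndarray) -> float:
    """Word Centroid Distance: distance between the documents' weighted centroids."""
    return float(np.linalg.norm(X @ d - X @ d_prime))
```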
2.2.2 RWMD
Although WCD is cheap to compute, its bound is too loose to approximate WMD well, so a tighter lower bound, RWMD (Relaxed Word Moving Distance), is used. RWMD requires two computations: starting from the WMD objective, remove one of the two constraints, solve for the minimum, and take the larger of the two minima as the approximation of WMD.
After a constraint is removed, the problem still has the character of the original WMD, and the optimal solution of the relaxed problem assigns each word's entire weight to the closest word by Euclidean distance.
For example, removing the second constraint gives:
$$\min_{T\ge0}\sum_{i,j=1}^{n}T_{ij}\,c(i,j) \qquad \text{s.t.}\quad \sum_{j=1}^{n}T_{ij}=d_i \quad \forall i\in\{1,\dots,n\}$$
The optimal solution: for each word in document $d$, find the closest word in the other document $d'$ and move all of its weight there, i.e.:
$$T_{ij}^*=\begin{cases} d_i & \text{if } j=\arg\min_j c(i,j)\\ 0 & \text{otherwise} \end{cases}$$
This gives a lower bound on WMD, $l_1(d,d')$.
Similarly, removing the first constraint gives:
$$\min_{T\ge0}\sum_{i,j=1}^{n}T_{ij}\,c(i,j) \qquad \text{s.t.}\quad \sum_{i=1}^{n}T_{ij}=d_j' \quad \forall j\in\{1,\dots,n\}$$
The optimal solution: for each word in document $d'$, find the closest word in document $d$ and receive all of its weight from that word, i.e.:
$$T_{ij}^*=\begin{cases} d_j' & \text{if } i=\arg\min_i c(i,j)\\ 0 & \text{otherwise} \end{cases}$$
This gives a second lower bound, $l_2(d,d')$.
Finally, $l_r(d,d')=\max\big(l_1(d,d'),\;l_2(d,d')\big)$ is the RWMD, with time complexity $O(p^2)$.
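A numpy sketch of RWMD under the same conventions; the masking that restricts each word's nearest neighbor to words actually present in the other document is an implementation detail assumed here:

```python
import numpy as np

def rwmd(X: np.ndarray, d: np.ndarray, d_prime: np.ndarray) -> float:
    """Relaxed WMD: the max of the two one-constraint relaxations."""
    C = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    # l1: every word of d sends its weight to the nearest word of d'.
    l1 = float(d @ C[:, d_prime > 0].min(axis=1))
    # l2: every word of d' receives its weight from the nearest word of d.
    l2 = float(d_prime @ C[d > 0, :].min(axis=0))
    return max(l1, l2)
```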
2.2.3 Prefetch and prune: speeding up k-NN
Combining the two bounds gives a pruning scheme. To find the $k$ nearest neighbors of a query document $q$, the two lower bounds can drastically reduce the number of exact WMD computations needed.
The k-NN procedure is as follows (see the sketch after this list):
- Use the cheapest bound, WCD, to compute the distance from $q$ to every document;
- Sort all documents by that distance in ascending order;
- Compute the exact WMD only for the first $k$ documents;
- Traverse the remaining documents, computing the RWMD between $q$ and each one. If this lower bound exceeds the largest WMD among the current $k$ nearest documents, prune the document immediately; otherwise compute its exact WMD and, if that distance beats any of the current top-$k$ WMD distances, update the $k$ nearest neighbors.
In this way, more than 95% of the dataset can be pruned, which is very effective.
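A minimal sketch of the prefetch-and-prune loop, reusing the `wmd`, `wcd`, and `rwmd` sketches above; the heap bookkeeping is an implementation choice, not prescribed by the paper:

```python
import heapq

def knn_wmd(X, q, docs, k):
    """k-NN by WMD with WCD prefetch and RWMD pruning.

    X    : (dim, p) word-vector matrix
    q    : (p,) nBOW vector of the query document
    docs : list of (p,) nBOW vectors
    """
    # 1) Sort all documents by the cheap WCD bound.
    order = sorted(range(len(docs)), key=lambda i: wcd(X, q, docs[i]))
    # 2) Exact WMD for the first k candidates; max-heap of (-dist, idx).
    heap = [(-wmd(X, q, docs[i]), i) for i in order[:k]]
    heapq.heapify(heap)
    # 3) For the rest, prune with RWMD before paying for an exact WMD.
    for i in order[k:]:
        kth = -heap[0][0]                 # current k-th smallest WMD
        if rwmd(X, q, docs[i]) >= kth:
            continue                      # bound cannot beat the top-k: prune
        dist = wmd(X, q, docs[i])
        if dist < kth:
            heapq.heapreplace(heap, (-dist, i))
    return sorted((-neg, i) for neg, i in heap)
```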
3. Advantages
- Excellent results: makes full use of word2vec's ability to transfer across domains
- Unsupervised: no labeled data required, so there is no cold-start problem
- Simple model: the only input needed is the word vectors; there are no hyperparameters
- Interpretable: the problem becomes a linear program with a global optimal solution
- Flexible: the importance of individual words can be adjusted by hand
- High retrieval accuracy
4. Shortcomings and directions for improvement
4.1 Shortcomings
- No word-order information is kept: WMD considers two sentences built from the same words in a different order (e.g., "fight repeatedly, lose repeatedly" vs. "lose repeatedly, fight repeatedly") to be semantically identical.
- Handles the OOV problem poorly: word vectors are trained offline, so out-of-vocabulary (OOV) words appear when the model is applied to an online service; after segmenting a user query, some words may have no corresponding word vector.
- Weak on negation: in trained embeddings, words with opposite meanings often have similar vectors, so two sentences with opposite semantics can still have a small WMD (a weakness shared by many similarity measures).
4.2 An improved algorithm: S-WMD
The WMD distance works very well for clustering-style tasks (k-NN: which documents are similar), but poorly for classification.
Some documents contain many similar words yet do not express the same semantics or topic. For example, "I love playing football" and "I like playing piano" look alike in sentence pattern and may fall into the same cluster, but if the labels are "sports" and "arts", WMD cannot classify them directly, because WMD carries no information that "football" is strongly related to "sports".
The paper "Supervised Word Mover's Distance" proposes an improved WMD algorithm: the trainable, penalized S-WMD.
The solution given in the paper is to add trainable, class-aware weights to the WMD distance.
Here $d_i$ gains a class-importance weight $w_i$ (the reweighted histogram is renormalized):
$$\tilde{d}_i=\frac{w_i d_i}{\sum_{j=1}^{n} w_j d_j}$$
The distances between words must also be adjusted (different classes call for different word distances), so a trainable parameter matrix $A$ is introduced:
$$c_A(i,j)=\|A(x_i-x_j)\|_2^2$$
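A small sketch of these two S-WMD ingredients, with `w` (word importances) and `A` (the linear metric) taken as already-learned parameters; this illustrates the formulas above, not the paper's training procedure:

```python
import numpy as np

def reweight(d: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Reweighted nBOW: scale word weights by learned importances, renormalize."""
    return (w * d) / (w * d).sum()

def metric_costs(X: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Pairwise word costs under the learned linear map A."""
    Z = A @ X                            # transformed word embeddings
    diff = Z[:, :, None] - Z[:, None, :]
    return (diff ** 2).sum(axis=0)       # squared Euclidean distances
```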
Directions for improvement:
- Interpretability
- Adding penalty terms
5. WMD applications
Applications of WMD:
- As a feature input: an unsupervised semantic-similarity feature.
- Computing the distance between texts.
6. WMD code implementation
Key points of a WMD implementation:
- Finding the nearest word
- Optimization: handling the weights
- Pruning: top-k ranking of documents
Some currently open-source WMD implementations:
- https://pypi.org/project/wmd/
- https://github.com/mkusner/wmd
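gensim (whose WMD tutorial is linked below) also exposes WMD directly via `KeyedVectors.wmdistance`; a minimal usage sketch, assuming a pretrained embedding file:

```python
from gensim.models import KeyedVectors

# Assumed pretrained embeddings; wmdistance also needs the POT/pyemd dependency.
wv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

s1 = "obama speaks to the media in illinois".split()
s2 = "the president greets the press in chicago".split()
print(wv.wmdistance(s1, s2))  # smaller distance = more similar
```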
7. References
- WMD tutorial: https://github.com/RaRe-Technologies/gensim/blob/c971411c09773488dbdd899754537c0d1a9fce50/docs/notebooks/WMD_tutorial.ipynb
- Code: https://github.com/mkusner/wmd
- https://zhuanlan.zhihu.com/p/84809907
- https://blog.csdn.net/weixin_40547993/article/details/89475630
- Supervised Word Mover's Distance, NIPS 2016 (the S-WMD improvement)