
How to calculate the distance between texts: WMD


WMD

Paper: From Word Embeddings To Document Distances

In 2015, this paper proposed the Word Mover's Distance (WMD): a measure of document similarity based on word vectors. It computes a distance between sentences or documents; the smaller the distance, the higher the similarity.

1. Why was WMD proposed?

Some existing methods for computing sentence/document similarity:

  • Unsupervised sentence similarity — the Jaccard similarity coefficient (see the first sketch after this list):

    The Jaccard similarity coefficient measures the similarity of two sets. It is defined as the number of elements in the intersection divided by the number of elements in the union:
    $$SIM(s_1,s_2) = \frac{\text{intersection}(words_1, words_2)}{\text{union}(words_1, words_2)}$$
    The larger the Jaccard similarity coefficient, the more similar the two sentences.

    The Jaccard distance measures the dissimilarity of two sets. It is the complement of the Jaccard similarity coefficient, i.e. 1 minus the coefficient:
    $$d(s_1,s_2)=1-SIM(s_1,s_2) = 1-\frac{\text{intersection}(words_1, words_2)}{\text{union}(words_1, words_2)}$$
    The larger the Jaccard distance, the less similar the two sentences.

  • Vector-space models: represent each sentence as a vector, then measure pairwise Euclidean distance, cosine similarity, or another metric.

    Based on Word2Vec: with pre-trained word-embedding vectors, represent a sentence by summing or averaging its word vectors element-wise.

    Sentence modeling: methods include doc2vec (built on skip-gram/CBOW), autoencoder-based models, skip-thought sentence models, and so on. Sentence-vector methods preserve sentence semantics better and, to some extent, make up for the shortcomings of the bag-of-words model.
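
A minimal sketch of the Jaccard baseline (plain Python; assumes the sentences are already tokenized into word lists):

```python
def jaccard_similarity(words1, words2):
    # |intersection| / |union| of the two token sets
    s1, s2 = set(words1), set(words2)
    return len(s1 & s2) / len(s1 | s2)

def jaccard_distance(words1, words2):
    # The complement: 1 - similarity
    return 1.0 - jaccard_similarity(words1, words2)

# Only the stopword "the" is shared, so similarity is low (0.125)
# even though the two sentences mean nearly the same thing.
print(jaccard_similarity("obama speaks to the media".split(),
                         "the president greets the press".split()))
```

The example shows exactly the weakness discussed below: overlap-based similarity misses paraphrases with disjoint vocabulary.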

The first problem to solve is document representation, which can be handled with:

  • Bag-of-words (BOW) / TF-IDF: ignore grammar and word order, consider only word frequency, and vectorize the text. The problem: if two sentences share no words, their vectors judge them completely unrelated, even when they have the same meaning.
  • LSI and LDA: group related words into topics and represent a document as a probability distribution over topics. More coherent than BOW, but this does not improve on BOW's empirical performance. To compare two documents, compare the difference between their topic distributions.
  • Word2Vec: map every word of the text into a word-vector space. This partly addresses semantics: CBOW/Skip-Gram take context into account, so word vectors reflect semantic differences between words. A document or sentence is then represented as the average of its word vectors, and document similarity is measured with Euclidean distance, cosine similarity, etc. (a sketch follows this list).
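
A minimal sketch of the Word2Vec averaging baseline, assuming `embeddings` is any dict-like mapping from word to NumPy vector (e.g., pre-trained word2vec loaded elsewhere):

```python
import numpy as np

def sentence_vector(words, embeddings):
    # Average the word vectors; words without an embedding (OOV) are skipped.
    vecs = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vecs, axis=0)

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```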

Word2Vec captures the distance between words; WMD was proposed to capture the distance between documents. The idea is to use a distance to reflect document similarity by combining the semantic distances between the words of the two documents: for example, compute the Euclidean distance between the word vectors of every pair of words across the two documents, then take a weighted sum.

2. How is the problem solved?

The problem WMD solves is a special case of the Earth Mover's Distance (EMD) problem.

2.1 Define the problem

2.1.1 Normalized word frequency

Normalized word frequency (normalized bag-of-words, nBOW): if word $i$ occurs $c_i$ times in the document, its normalized frequency is
$$d_i=\frac{c_i}{\sum_{j=1}^{n}c_j}$$
A word's importance is thus tied to its (normalized) frequency of occurrence.

2.1.2 Word movement cost

Word travel cost: words in WMD are represented by word2vec embeddings, so the Euclidean distance between two words can be computed directly. The semantic distance $c(i,j)$ between word $i$ and word $j$ is the Euclidean distance
$$c(i,j)=\|x_i-x_j\|_2=\sqrt{(x_i^1-x_j^1)^2+\cdots+(x_i^d-x_j^d)^2}$$
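
A sketch computing both quantities with NumPy; `embeddings` (word → vector) is again an assumed input:

```python
import numpy as np
from collections import Counter

def nbow(words):
    # d_i = c_i / sum_j c_j over the unique words of the document
    counts = Counter(words)
    total = sum(counts.values())
    vocab = sorted(counts)
    return vocab, np.array([counts[w] / total for w in vocab])

def cost_matrix(vocab1, vocab2, embeddings):
    # c(i, j): Euclidean distance between embeddings of word i and word j
    X1 = np.array([embeddings[w] for w in vocab1])  # n1 x dim
    X2 = np.array([embeddings[w] for w in vocab2])  # n2 x dim
    return np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)  # n1 x n2
```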

2.1.3 Document distance

Document distance is intuitively expressed as $\sum_{i,j}T_{ij}\cdot c(i,j)$.

Transport matrix $T$: $T_{ij}$ is the weight moved from word $i$ of document $d$ to word $j$ of the other document $d'$.

2.1.4 Constraints

Combining this with the classic transportation problem behind EMD, the objective is to minimize the total moving cost between the two documents, with constraints so that each word of document $d$ matches, with appropriate weights, all the words of the other document $d'$:
$$\min_{T\ge0}\sum_{i,j=1}^{n}T_{ij}\,c(i,j)$$
$$\text{s.t.}\quad \sum_{j=1}^{n}T_{ij}=d_i \quad \forall i\in\{1,\dots,n\}$$
$$\qquad\;\; \sum_{i=1}^{n}T_{ij}=d_j' \quad \forall j\in\{1,\dots,n\}$$
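
This is a small linear program, so a direct (if naive) sketch can hand it to SciPy's `linprog`; `d1`, `d2`, and `C` are the nBOW vectors and cost matrix from the previous sketch. A dedicated optimal-transport solver (e.g., POT or pyemd) would be the practical choice:

```python
import numpy as np
from scipy.optimize import linprog

def wmd(d1, d2, C):
    # min_T sum_ij T_ij * C_ij, s.t. row sums of T equal d1, column sums equal d2
    n, m = len(d1), len(d2)
    A_eq = np.zeros((n + m, n * m))          # T flattened row-major: T_ij -> i*m + j
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0     # sum_j T_ij = d1[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0              # sum_i T_ij = d2[j]
    b_eq = np.concatenate([d1, d2])
    res = linprog(C.reshape(-1), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun
```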

2.2 Fast calculation

Solving the WMD optimization problem has best average time complexity $O(p^3 \log p)$, where $p$ is the number of unique words in the documents. When a dataset has many documents or many unique words, solving WMD exactly becomes expensive.

Computing the WMD between two documents requires solving a sizable linear program, so using WMD for k-NN search (finding the k most similar documents) is time-consuming. The paper therefore proposes two lower bounds for filtering documents, trading a little accuracy for far less computation: the Word Centroid Distance (WCD) and the Relaxed Word Moving Distance (RWMD).

2.2.1 WCD

WCD (Word Centroid Distance): by the triangle inequality $|a|+|b|\ge |a\pm b|\ge ||a|-|b||$, the computation is relaxed to a lower bound on WMD — the centroid distance $\|Xd-Xd'\|_2$, as follows:

The "centroid" distance between two documents is always no greater than their WMD distance.

[Figure: derivation of the WCD lower bound, $\sum_{i,j}T_{ij}\,c(i,j)\ge\|Xd-Xd'\|_2$]

Here $X\in \mathbb{R}^{d\times p}$ is the matrix of word vectors; the time complexity of computing WCD is $O(dp)$.
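
A sketch of WCD; here each row of `X` is a word vector, so `d @ X` is the weighted centroid $Xd$ in the paper's column-matrix notation:

```python
import numpy as np

def wcd(X1, d1, X2, d2):
    # || X d - X d' ||_2: Euclidean distance between the two weighted centroids
    return float(np.linalg.norm(d1 @ X1 - d2 @ X2))
```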

2.2.2 RWMD

Although WCD is cheap to compute, its bound is too loose to approximate WMD well, so a tighter lower bound is used: RWMD (Relaxed Word Moving Distance). RWMD requires two computations: take the WMD objective, drop one of the two constraints, and solve for the minimum; do this once for each constraint, and use the larger of the two minima as the approximation to WMD.

After one constraint is removed, the problem keeps the character of the original WMD, and the optimal solution of the relaxed problem assigns each word's entire weight to the word closest to it in Euclidean distance.

For example, removing the second constraint gives:

$$\min_{T\ge0}\sum_{i,j=1}^{n}T_{ij}\,c(i,j) \qquad \text{s.t.}\quad \sum_{j=1}^{n}T_{ij}=d_i \quad \forall i\in\{1,\dots,n\}$$

The optimal solution of this relaxed problem: for each word of document $d$, find the closest word in the other document $d'$ and transfer all of its weight there, i.e.:

$$T_{ij}^*=\begin{cases} d_i & \text{if } j=\arg\min_{j} c(i,j)\\ 0 & \text{otherwise} \end{cases}$$

This gives the first lower bound on WMD, $l_1(d,d')$.

Similarly, removing the first constraint gives:
$$\min_{T\ge0}\sum_{i,j=1}^{n}T_{ij}\,c(i,j) \qquad \text{s.t.}\quad \sum_{i=1}^{n}T_{ij}=d_j' \quad \forall j\in\{1,\dots,n\}$$
The optimal solution of this problem: each word of document $d'$ receives its entire weight from the closest word of document $d$, i.e.:
$$T_{ij}^*=\begin{cases} d_j' & \text{if } i=\arg\min_{i} c(i,j)\\ 0 & \text{otherwise} \end{cases}$$
This gives the second lower bound, $l_2(d,d')$.

Finally, $l_r(d,d')=\max\big(l_1(d,d'),\;l_2(d,d')\big)$ is the RWMD; its time complexity is $O(p^2)$.
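
Because each relaxed problem simply sends every word's weight to its nearest counterpart, both bounds collapse to row/column minima of the cost matrix; a sketch reusing `C` from the earlier sketches:

```python
import numpy as np

def rwmd(d1, d2, C):
    l1 = float(d1 @ C.min(axis=1))  # each word of d moved to its nearest word in d'
    l2 = float(d2 @ C.min(axis=0))  # each word of d' served by its nearest word in d
    return max(l1, l2)
```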

2.2.3 Prefetch and prune: speeding up k-NN

The two lower bounds can be combined for pruning: to find the $k$ nearest neighbors of a query document $q$, they greatly reduce the number of exact WMD computations needed.

The k-NN procedure is as follows (a sketch follows the list):

  1. Using WCD, the cheapest bound, compute the distance from the query document $q$ to every document;
  2. Sort all documents by this distance in ascending order;
  3. Compute the exact WMD only for the first $k$ documents;
  4. Traverse the remaining documents, computing the RWMD between $q$ and each one. If it exceeds the largest WMD among the current $k$ nearest documents, prune the document directly; if it is smaller, compute the exact WMD and, if warranted, update the set of $k$ nearest documents.

In this way, more than 95% of the documents can be pruned on some datasets, which is very effective.
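
A sketch of the whole prefetch-and-prune search, wiring together the `wcd`, `rwmd`, and `wmd` sketches above; documents are assumed to be `(X, d)` pairs of embedding matrix and nBOW vector:

```python
import numpy as np

def pairwise_cost(X1, X2):
    # Euclidean distances between all rows of X1 and X2
    return np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)

def knn_wmd(query, docs, k):
    Xq, dq = query
    # Steps 1-2: rank every document by the cheap WCD bound.
    order = sorted(range(len(docs)),
                   key=lambda i: wcd(Xq, dq, *docs[i]))
    # Step 3: exact WMD for the k best WCD candidates.
    knn = sorted((wmd(dq, docs[i][1], pairwise_cost(Xq, docs[i][0])), i)
                 for i in order[:k])
    # Step 4: prune the rest with RWMD; pay for exact WMD only when needed.
    for i in order[k:]:
        Xi, di = docs[i]
        C = pairwise_cost(Xq, Xi)
        if rwmd(dq, di, C) >= knn[-1][0]:
            continue                      # bound exceeds current k-th distance: prune
        dist = wmd(dq, di, C)
        if dist < knn[-1][0]:
            knn[-1] = (dist, i)           # replace the current worst neighbor
            knn.sort()
    return knn                            # k (distance, doc_index) pairs
```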

3. Advantages

  • Excellent results: makes full use of word2vec's ability to transfer across domains
  • Unsupervised: no labeled data required, so there is no cold-start problem
  • Simple model: needs only word vectors as input, and has no hyperparameters
  • Interpretable: the problem is cast as a linear program with a global optimal solution
  • Flexible: the importance of individual words can be adjusted manually
  • High retrieval accuracy

4. Shortcomings and directions for improvement

4.1 Shortcomings

Shortcomings:

  • No word-order information is preserved: WMD treats two sentences built from the same words in a different order (e.g., "dog bites man" vs. "man bites dog") as semantically identical.
  • Poor handling of OOV words: word vectors are trained offline, so out-of-vocabulary (OOV) problems appear when the model serves online traffic — after a user query is segmented into words, some of them may have no corresponding word vector.
  • Poor handling of antonyms and negation: in trained embeddings, words with opposite meanings often have similar vectors, so two sentences with opposite meanings can end up with a small WMD. (This weakness is shared by many similarity measures.)

4.2 An improved algorithm: S-WMD

The WMD distance works very well for clustering-style retrieval (k-NN: which documents are similar), but much worse for classification.

Some documents contain many similar words yet do not express the same meaning or topic. For example, "I love playing football" and "I like playing piano" look alike in sentence pattern and may end up in the same cluster; but if the labels are "sports" and "arts", WMD clearly cannot classify them directly, because WMD incorporates no information that "football" is strongly related to "sports".

The paper "Supervised Word Mover's Distance" proposes an improvement to the WMD algorithm: a trainable, supervised variant, S-WMD.

The paper's solution adds trainable, category-aware word weights to the WMD distance:

[Figure: the S-WMD objective]

Here each $d_i$ is augmented with a learned category weight $w_i$:

[Figure: the reweighted word frequencies]

The distance between words must also be adjusted (word distances should differ by category), so a trainable parameter matrix $A$ is introduced to transform the word vectors before distances are computed.

[Figure: the S-WMD training algorithm]

Directions for improvement:

  • Interpretability
  • Adding penalty terms

5. Applications of WMD

Applications of WMD:

  • As a feature: an unsupervised semantic-similarity feature fed to downstream models.
  • Directly computing the distance between texts.

6. WMD code implementation

Key points of a WMD implementation:

  • Finding the nearest word for each word;
  • Optimization: handling the weights;
  • Pruning: top-k sorting of the documents.

Some open-source WMD implementations:

  • https://pypi.org/project/wmd/
  • https://github.com/mkusner/wmd
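
For day-to-day use, gensim also exposes WMD directly on pre-trained word vectors through `wmdistance` (an optimal-transport backend such as POT must be installed); a minimal usage sketch:

```python
import gensim.downloader as api

# Downloads the pre-trained Google News vectors (~1.6 GB) on first use.
model = api.load("word2vec-google-news-300")

s1 = "obama speaks to the media in illinois".split()
s2 = "the president greets the press in chicago".split()
print(model.wmdistance(s1, s2))  # smaller distance = more similar
```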

7. References

WMD tutorial: https://github.com/RaRe-Technologies/gensim/blob/c971411c09773488dbdd899754537c0d1a9fce50/docs/notebooks/WMD_tutorial.ipynb

Code: https://github.com/mkusner/wmd

https://zhuanlan.zhihu.com/p/84809907

https://blog.csdn.net/weixin_40547993/article/details/89475630

Supervised Word Mover's Distance (NIPS 2016)

Algorithm improvement: Supervised Word Mover's Distance


