How to calculate the distance between texts: WMD
2022-06-25 08:25:00 【Happy little yard farmer】
WMD
Paper: From Word Embeddings To Document Distances
In 2015, the paper proposed Word Mover's Distance (WMD): a measure of document similarity based on word vectors. It computes a distance between sentences; the smaller the distance, the higher the similarity.
1. Why was WMD proposed?
Some existing methods for computing sentence/document similarity:
Unsupervised sentence similarity: the Jaccard similarity coefficient.
The Jaccard similarity coefficient measures the similarity between two sets, defined as the size of the intersection divided by the size of the union:
$$SIM(s_1,s_2) = \frac{\text{intersection}(words_1, words_2)}{\text{union}(words_1, words_2)}$$
The larger the Jaccard similarity coefficient, the more similar the sentences. The Jaccard distance measures the dissimilarity between two sets; it is the complement of the Jaccard similarity coefficient, defined as 1 minus the coefficient:
$$d(s_1,s_2)=1-SIM(s_1,s_2) = 1-\frac{\text{intersection}(words_1, words_2)}{\text{union}(words_1, words_2)}$$
The larger the Jaccard distance, the less similar the sentences.
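As a quick illustration, a minimal Python sketch of Jaccard similarity and distance over whitespace-tokenized word sets (the lowercase/whitespace tokenization is an assumption; a real pipeline would segment and normalize properly):

```python
def jaccard_similarity(s1: str, s2: str) -> float:
    """Jaccard similarity of two sentences over their word sets."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

def jaccard_distance(s1: str, s2: str) -> float:
    """Jaccard distance = 1 - Jaccard similarity."""
    return 1.0 - jaccard_similarity(s1, s2)

print(jaccard_similarity("obama speaks to the media",
                         "the president greets the press"))  # 0.125
```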
Vector space models: represent each sentence as a vector, then compute pairwise Euclidean distance, cosine similarity, or another metric.
Based on Word2Vec: use pretrained word-embedding vectors and represent a sentence by summing or averaging its word vectors element-wise.
Sentence modeling: methods include doc2vec (built on skip-gram/CBOW), autoencoder-based models, skip-thought sentence models, and so on. Sentence-vector methods preserve more of a sentence's semantic information and, to some extent, compensate for the shortcomings of bag-of-words.
The first problem to solve is how to represent a document, which can be done with:
- Bag-of-words (BOW) / TF-IDF: ignore grammar and word order, keep only word frequencies, and vectorize the text. The problem: if two sentences share no words, their vectors judge them completely unrelated, even though they may have the same meaning.
- LSI and LDA: group similar words into topics and represent a document as a probability distribution over topics. More coherent than BOW, but in practice they do not improve on BOW's performance. To compare two documents, compare the differences between their topic distributions.
- Word2Vec: map every word in the text into a word-vector space. This partially solves the semantics problem, since CBOW/Skip-Gram exploit context and the resulting vectors reflect semantic differences between words. A document/sentence is then represented as the average of all its word vectors, and document similarity is measured by pairwise Euclidean distance, cosine similarity, etc. (see the sketch after this list).
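A minimal sketch of the averaging approach, assuming a pretrained word2vec model loadable with gensim's `KeyedVectors` (the file name `word2vec.bin` and the simple tokenization are placeholders):

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumed pretrained embeddings; any word2vec-format file works here.
wv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

def sentence_vector(sentence: str) -> np.ndarray:
    """Average the vectors of all in-vocabulary words in the sentence."""
    vecs = [wv[w] for w in sentence.lower().split() if w in wv]
    return np.mean(vecs, axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = sentence_vector("obama speaks to the media")
v2 = sentence_vector("the president greets the press")
print(cosine_similarity(v1, v2))
```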
Word2Vec captures distances between words; WMD extends the idea to distances between documents. It uses a distance to reflect document similarity, modeling the distance between two documents as a combination of the semantic distances between their words: for example, compute the Euclidean distance between the word vectors of every pair of words across the two documents, then take a weighted sum.
2. How does WMD solve the problem?
WMD is solved as a special case of the Earth Mover's Distance (EMD) problem.
2.1 Define the problem
2.1.1 Normalized word frequency
Normalized word frequency (normalized bag-of-words, nBOW): if word $i$ occurs $c_i$ times in the text, its normalized frequency is:
$$d_i=\frac{c_i}{\sum_{j=1}^{n}c_j}$$
A word's importance is tied to how frequently it occurs (and is then normalized).
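A minimal sketch of the nBOW representation over an assumed shared vocabulary (the `vocab` list is a toy placeholder):

```python
from collections import Counter
import numpy as np

def nbow(tokens: list, vocab: list) -> np.ndarray:
    """Normalized bag-of-words: word counts divided by the total count."""
    counts = Counter(tokens)
    d = np.array([counts[w] for w in vocab], dtype=float)
    return d / d.sum()

vocab = ["obama", "speaks", "media", "president", "greets", "press"]
print(nbow("obama speaks to the media".split(), vocab))
# ≈ [0.33 0.33 0.33 0.   0.   0.  ]  (words outside vocab are dropped)
```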
2.1.2 Word movement cost
Word travel cost: words in WMD are represented by word2vec embeddings, so the Euclidean distance between them can be computed directly. The semantic distance $c(i,j)$ between word $i$ and word $j$ is the Euclidean distance:
$$c(i,j)=\|x_i-x_j\|_2=\sqrt{(x_i^1-x_j^1)^2+\cdots+(x_i^d-x_j^d)^2}$$
2.1.3 Document distance
Document distance, expressed intuitively: $\sum_{i,j}T_{ij}\cdot c(i,j)$, where the transport matrix entry $T_{ij}$ is the weight moved from word $i$ in document $d$ to word $j$ in the other document $d'$.
2.1.4 Constraints
Combining this with EMD, a classic transportation problem, the objective is to minimize the total moving cost between the two documents, with constraints that adjust the weights so that each word in document $d$ is matched, with varying weights, to all the words in document $d'$:
$$\min_{T\ge0}\sum_{i,j=1}^{n}T_{ij}\,c(i,j)$$
$$\text{s.t.}\quad \sum_{j=1}^{n}T_{ij}=d_i \quad \forall i\in\{1,\dots,n\},\qquad \sum_{i=1}^{n}T_{ij}=d_j' \quad \forall j\in\{1,\dots,n\}$$
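Since this is an ordinary linear program, a small sketch can solve it directly with `scipy.optimize.linprog`; the toy embeddings and documents below are illustrative assumptions, not the paper's setup:

```python
import numpy as np
from scipy.optimize import linprog

def wmd(X: np.ndarray, d: np.ndarray, d_prime: np.ndarray) -> float:
    """Solve the WMD linear program.

    X          : (dim, p) matrix of word vectors for the shared vocabulary
    d, d_prime : (p,) nBOW vectors of the two documents
    """
    p = X.shape[1]
    # C[i, j] = Euclidean distance between word vectors i and j.
    C = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    # Equality constraints on the flattened transport matrix T (row-major).
    A_eq = np.zeros((2 * p, p * p))
    for i in range(p):
        A_eq[i, i * p:(i + 1) * p] = 1   # sum_j T[i, j] = d[i]
        A_eq[p + i, i::p] = 1            # sum_i T[i, j] = d'[j]
    b_eq = np.concatenate([d, d_prime])
    # Default bounds are (0, None), which matches T >= 0.
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, method="highs")
    return float(res.fun)

# Toy example: 3 vocabulary words with 5-dimensional random embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
print(wmd(X, np.array([0.5, 0.5, 0.0]), np.array([0.0, 0.5, 0.5])))
```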
2.2 Fast calculation
Solving the WMD optimization problem has an average time complexity of $O(p^3 \log p)$, where $p$ is the number of unique words in the documents. When a dataset contains many documents or many unique words, solving the WMD problem becomes expensive.
Computing the WMD between two documents means solving a sizable linear program, so using it for k-NN search (the $k$ most similar documents) is time-consuming. Two optimizations are therefore proposed to reduce the amount of computation.
The paper introduces two lower bounds to filter documents, trading a little accuracy for lower time complexity: Word Centroid Distance (WCD) and Relaxed Word Moving Distance (RWMD).
2.2.1 WCD
WCD (Word Centroid Distance): by the triangle inequality $|a|+|b|\ge |a\pm b|\ge ||a|-|b||$, the WMD can be lower-bounded by the centroid distance $\|Xd-Xd'\|_2$; that is, the distance between the two documents' centroids is always smaller than their WMD. Using the constraints $\sum_j T_{ij}=d_i$ and $\sum_i T_{ij}=d_j'$:
$$\sum_{i,j=1}^{n}T_{ij}\,c(i,j)=\sum_{i,j=1}^{n}T_{ij}\,\|x_i-x_j\|_2 \ge \Big\|\sum_{i,j=1}^{n}T_{ij}(x_i-x_j)\Big\|_2 = \Big\|\sum_{i=1}^{n}d_i x_i-\sum_{j=1}^{n}d_j' x_j\Big\|_2 = \|Xd-Xd'\|_2$$
where $X\in \mathbb{R}^{d\times p}$ is the matrix of word vectors. Computing WCD takes $O(dp)$ time.
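A one-line numpy sketch of WCD, using the same `X`, `d`, `d_prime` conventions as the `wmd` sketch above:

```python
import numpy as np

def wcd(X: np.ndarray, d: np.ndarray, d_prime: np.ndarray) -> float:
    """Word Centroid Distance: distance between the documents' weighted centroids."""
    return float(np.linalg.norm(X @ d - X @ d_prime))
```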
2.2.2 RWMD
Although WCD is cheap to compute, its bound is too loose to approximate WMD well, so a tighter lower bound, RWMD (Relaxed Word Moving Distance), is used. RWMD requires two computations: starting from the WMD objective, remove one of the two constraints, solve for the minimum, and take the larger of the two minima as the approximation of WMD.
After a constraint is removed, the problem still has the character of the original WMD, and the optimal solution of the relaxed problem assigns each word's entire weight to the closest word by Euclidean distance.
For example, removing the second constraint gives:
$$\min_{T\ge0}\sum_{i,j=1}^{n}T_{ij}\,c(i,j) \qquad \text{s.t.}\quad \sum_{j=1}^{n}T_{ij}=d_i \quad \forall i\in\{1,\dots,n\}$$
The optimal solution: for each word in document $d$, find the closest word in the other document $d'$ and move all of its weight there, i.e.:
$$T_{ij}^*=\begin{cases} d_i & \text{if } j=\arg\min_j c(i,j)\\ 0 & \text{otherwise} \end{cases}$$
This gives a lower bound on WMD, $l_1(d,d')$.
Similarly, removing the first constraint gives:
$$\min_{T\ge0}\sum_{i,j=1}^{n}T_{ij}\,c(i,j) \qquad \text{s.t.}\quad \sum_{i=1}^{n}T_{ij}=d_j' \quad \forall j\in\{1,\dots,n\}$$
The optimal solution: for each word in document $d'$, find the closest word in document $d$ and receive all of its weight from that word, i.e.:
$$T_{ij}^*=\begin{cases} d_j' & \text{if } i=\arg\min_i c(i,j)\\ 0 & \text{otherwise} \end{cases}$$
This gives a second lower bound, $l_2(d,d')$.
Finally, $l_r(d,d')=\max\big(l_1(d,d'),\;l_2(d,d')\big)$ is the RWMD, with time complexity $O(p^2)$.
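A numpy sketch of RWMD under the same conventions; the masking that restricts each word's nearest neighbor to words actually present in the other document is an implementation detail assumed here:

```python
import numpy as np

def rwmd(X: np.ndarray, d: np.ndarray, d_prime: np.ndarray) -> float:
    """Relaxed WMD: the max of the two one-constraint relaxations."""
    C = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    # l1: every word of d sends its weight to the nearest word of d'.
    l1 = float(d @ C[:, d_prime > 0].min(axis=1))
    # l2: every word of d' receives its weight from the nearest word of d.
    l2 = float(d_prime @ C[d > 0, :].min(axis=0))
    return max(l1, l2)
```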
2.2.3 Prefetch and prune: speeding up k-NN
Combining the two bounds gives a pruning scheme. To find the $k$ nearest neighbors of a query document $q$, the two lower bounds can drastically reduce the number of exact WMD computations needed.
The k-NN procedure is as follows (see the sketch after this list):
- Use the cheapest bound, WCD, to compute the distance from $q$ to every document;
- Sort all documents by that distance in ascending order;
- Compute the exact WMD only for the first $k$ documents;
- Traverse the remaining documents, computing the RWMD between $q$ and each one. If this lower bound exceeds the largest WMD among the current $k$ nearest documents, prune the document immediately; otherwise compute its exact WMD and, if that distance beats any of the current top-$k$ WMD distances, update the $k$ nearest neighbors.
In this way, more than 95% of the dataset can be pruned, which is very effective.
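A minimal sketch of the prefetch-and-prune loop, reusing the `wmd`, `wcd`, and `rwmd` sketches above; the heap bookkeeping is an implementation choice, not prescribed by the paper:

```python
import heapq

def knn_wmd(X, q, docs, k):
    """k-NN by WMD with WCD prefetch and RWMD pruning.

    X    : (dim, p) word-vector matrix
    q    : (p,) nBOW vector of the query document
    docs : list of (p,) nBOW vectors
    """
    # 1) Sort all documents by the cheap WCD bound.
    order = sorted(range(len(docs)), key=lambda i: wcd(X, q, docs[i]))
    # 2) Exact WMD for the first k candidates; max-heap of (-dist, idx).
    heap = [(-wmd(X, q, docs[i]), i) for i in order[:k]]
    heapq.heapify(heap)
    # 3) For the rest, prune with RWMD before paying for an exact WMD.
    for i in order[k:]:
        kth = -heap[0][0]                 # current k-th smallest WMD
        if rwmd(X, q, docs[i]) >= kth:
            continue                      # bound cannot beat the top-k: prune
        dist = wmd(X, q, docs[i])
        if dist < kth:
            heapq.heapreplace(heap, (-dist, i))
    return sorted((-neg, i) for neg, i in heap)
```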
3. Advantages
- Excellent results: makes full use of word2vec's ability to transfer across domains
- Unsupervised: no labeled data required, so there is no cold-start problem
- Simple model: the only input needed is the word vectors; there are no hyperparameters
- Interpretable: the problem becomes a linear program with a global optimal solution
- Flexible: the importance of individual words can be adjusted by hand
- High retrieval accuracy
4. Shortcomings and directions for improvement
4.1 Shortcomings
- No word-order information is kept: WMD considers two sentences built from the same words in a different order (e.g., "fight repeatedly, lose repeatedly" vs. "lose repeatedly, fight repeatedly") to be semantically identical.
- Handles the OOV problem poorly: word vectors are trained offline, so out-of-vocabulary (OOV) words appear when the model is applied to an online service; after segmenting a user query, some words may have no corresponding word vector.
- Weak on negation: in trained embeddings, words with opposite meanings often have similar vectors, so two sentences with opposite semantics can still have a small WMD (a weakness shared by many similarity measures).
4.2 An improved algorithm: S-WMD
The WMD distance works very well for clustering-style tasks (k-NN: which documents are similar), but poorly for classification.
Some documents contain many similar words yet do not express the same semantics or topic. For example, "I love playing football" and "I like playing piano" look alike in sentence pattern and may fall into the same cluster, but if the labels are "sports" and "arts", WMD cannot classify them directly, because WMD carries no information that "football" is strongly related to "sports".
The paper "Supervised Word Mover's Distance" proposes an improved WMD algorithm: the trainable, penalized S-WMD.
The solution given in the paper is to add trainable, class-aware weights to the WMD distance.
Here $d_i$ gains a class-importance weight $w_i$ (the reweighted histogram is renormalized):
$$\tilde{d}_i=\frac{w_i d_i}{\sum_{j=1}^{n} w_j d_j}$$
The distances between words must also be adjusted (different classes call for different word distances), so a trainable parameter matrix $A$ is introduced:
$$c_A(i,j)=\|A(x_i-x_j)\|_2^2$$
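A small sketch of these two S-WMD ingredients, with `w` (word importances) and `A` (the linear metric) taken as already-learned parameters; this illustrates the formulas above, not the paper's training procedure:

```python
import numpy as np

def reweight(d: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Reweighted nBOW: scale word weights by learned importances, renormalize."""
    return (w * d) / (w * d).sum()

def metric_costs(X: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Pairwise word costs under the learned linear map A."""
    Z = A @ X                            # transformed word embeddings
    diff = Z[:, :, None] - Z[:, None, :]
    return (diff ** 2).sum(axis=0)       # squared Euclidean distances
```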
Directions for improvement:
- Interpretability
- Adding penalty terms
5. WMD applications
Applications of WMD:
- As a feature input: an unsupervised semantic-similarity feature.
- Computing the distance between texts.
6. WMD code implementation
Key points of a WMD implementation:
- Finding the nearest word
- Optimization: handling the weights
- Pruning: top-k ranking of documents
Some currently open-source WMD implementations:
- https://pypi.org/project/wmd/
- https://github.com/mkusner/wmd
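gensim (whose WMD tutorial is linked below) also exposes WMD directly via `KeyedVectors.wmdistance`; a minimal usage sketch, assuming a pretrained embedding file:

```python
from gensim.models import KeyedVectors

# Assumed pretrained embeddings; wmdistance also needs the POT/pyemd dependency.
wv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

s1 = "obama speaks to the media in illinois".split()
s2 = "the president greets the press in chicago".split()
print(wv.wmdistance(s1, s2))  # smaller distance = more similar
```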
7. References
- WMD tutorial: https://github.com/RaRe-Technologies/gensim/blob/c971411c09773488dbdd899754537c0d1a9fce50/docs/notebooks/WMD_tutorial.ipynb
- Code: https://github.com/mkusner/wmd
- https://zhuanlan.zhihu.com/p/84809907
- https://blog.csdn.net/weixin_40547993/article/details/89475630
- Supervised Word Mover's Distance, NIPS 2016 (the S-WMD improvement)