当前位置:网站首页>Learning summary of spanquery source code

Learning summary of spanquery source code

2022-06-23 03:28:00 Call me Jiabao

General idea

SpanQuery The main idea of implementation :

SpanQuery->SpanWeight->SpanScorer.

SpanScorer Contains a Spans object , SpanScorer hold iterator() and twoPhraseIterator() Methods are entrusted to Spans object . Spans The class itself inherits DocIdSetIterator, in other words Spans The object itself represents a document inversion table , In addition to being an inverted table , Spans Class also implements nextStartPosition() /startPosition() /endPosition(), When matching a document , Through these three interfaces, you can traverse the matching position in the current document , For phrase matching .

SpanWeight Contains a getSpans(LeafReaderContext ctx, Postings requiredPostings) Abstract method of , Generate one for each segment Spans object , This object is in SpanScorer Objects contained in .

SpanNearQuery

ES Syntax example

{
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "field": "value1" } },
        { "span_term": { "field": "value2" } },
        { "span_term": { "field": "value3" } }
      ],
      "slop": 12,
      "in_order": false
    }
  }
}

Recall logic

Recall candidates

Through multiple sub spans Take the document obtained from the intersection as the candidate set .

Filtering stage

For many spans For each group of matching positions , according to inOrder Parameter values determine the use of NearSpansOrdered or NearSpansUnordered The algorithm of . See... Below for details Spans Class detailed explanation of the algorithm .

SpanOrQuery

ES Syntax example

{
  "query": {
    "span_or" : {
      "clauses" : [
        { "span_term" : { "field" : "value1" } },
        { "span_term" : { "field" : "value2" } },
        { "span_term" : { "field" : "value3" } }
      ]
    }
  }
}

Recall logic

Through multiple sub spans Get the recall document set from the union set . ( Through the heap ).

SpanNotQuery

ES Syntax example

{
  "query": {
    "span_not": {
      "include": {
        "span_term": { "field1": "hoya" }
      },
      "exclude": {
        "span_near": {
          "clauses": [
            { "span_term": { "field1": "la" } },
            { "span_term": { "field1": "hoya" } }
          ],
          "slop": 0,
          "in_order": true
        }
      }
    }
  }
}

Require matching documents to match include Of SpanQuery, And can not overlap matching exclude Of SpanQuery.

The recall logic is as follows :

Recall candidates

Use include Of SpanQuery Recall all hit documents as candidate sets .

Filtering stage 1

Use exclude Of SpanQuery Filter the documents in the candidate set , If the candidate document does not hit exclude Of SpanQuery, Is returned directly as a hit document . If the candidate document hits exclude Of SpanQuery, Then enter the next filtering stage .

Filtering stage 2

For simultaneous hits include and exclude Documents , Need to detect include and exclude The hit of position Whether there is overlap . Such as include Hit location is [0,5], exclude Hit location is [7,8] There is no coincidence . include Hit location is [0,5], exclude Hit location is [4,8], There is overlap . about include and exclude There are overlapping documents in the hit location , To filter out .

SpanContainingQuery

ES Syntax example

{
  "query": {
    "span_containing": {
      "little": {
        "span_term": { "title": "b" }
      },
      "big": {
        "span_near": {
          "clauses": [
            { "span_term": { "title": "a" } },
            { "span_term": { "title": "c" } }
          ],
          "slop": 5,
          "in_order": true
        }
      }
    }
  }
}

Recall logic

Recall candidates

adopt little and big Take the intersection set as the candidate set .

Filtering stage

For the stage 1 The results of the recall , need little Matching position The scope is big Within the matching range of .

For the example above , I.e. request to pass big Match the "a" and "c" At the same time , little Of "b" Must appear in "a" and "c" In the middle of the .

such as :

file a x x b c Can match , because b Appear in a and c In the middle of the .

file a x x c b Can't match , because b Not in a and c In the middle of the .

Matching position

We need to pay attention to , SpanContainingQuery The matching position is big The location of . What does that mean ? Let's look at the following example :

The assumptions are as follows query:

{
    "query": {
        "span_near": {
            "clauses": [
                {
                    "span_containing": {
                        "little": {
                            "span_term": {
                                "title": "b"
                            }
                        },
                        "big": {
                            "span_near": {
                                "clauses": [
                                    {
                                        "span_term": {
                                            "title": "a"
                                        }
                                    },
                                    {
                                        "span_term": {
                                            "title": "c"
                                        }
                                    }
                                ],
                                "slop": 5,
                                "in_order": true
                            }
                        }
                    }
                },
                {
                    "span_term": {
                        "title": "d"
                    }
                }
            ],
            "slop": 0,
            "in_order": true
        }
    }
}

Ask if you can match the document a x x b c d?

The answer is yes . The outermost span_near It is required to match at the same time span_containing and span_term(title=d) And there is no gap between the two (slop=0), We've already said , This span_containing Can be matched , span_term(title=d) Obviously, it can be matched by itself , So how to calculate the distance between them ? At this time, we need to know span_containing Actual matching position How much is it .

span_containing Specify the matching position yes big Of position, Corresponding to a x x b c, So distance d The distance is 0, So the whole query Can match documents .

As you can imagine , If span_containing Matching position No big It is little, So the document a x x b c d There's no match . Because if span_containing The match is little Of position, Then it is equivalent to matching the b, So distance d The distance is 1, slop=0 Can't be matched . If span_containing Matching position No big It is little, Then it becomes another kind of SpanQuery: SpanWithinQuery.

SpanWithinQuery

Recall logic and when used alone SpanContainingQuery Exactly the same . Only the matching position is different , SpanContainingQuery To match the position big Of , SpanWithinQuery To match the position little Of .

ES Syntax example

{
  "query": {
    "span_within": {
      "little": {
        "span_term": { "title": "b" }
      },
      "big": {
        "span_near": {
          "clauses": [
            { "span_term": { "title": "a" } },
            { "span_term": { "title": "c" } }
          ],
          "slop": 5,
          "in_order": true
        }
      }
    }
  }
}

Recall logic

Recall candidates

adopt little and big Take the intersection set as the candidate set .

Filtering stage

For the stage 1 The results of the recall , need little Matching position The scope is big Within the matching range of .

For the example above , I.e. request to pass big Match the "a" and "c" At the same time , little Of "b" Must appear in "a" and "c" In the middle of the .

such as :

file a x x b c Can match , because b Appear in a and c In the middle of the .

file a x x c b Can't match , because b Not in a and c In the middle of the .

Matching position

And SpanContainingQuery contrary , SpanWithinQuery The matching position is little The location of . For detailed explanation, please refer to SpanContainingQuery Explanation in .

SpanFieldMaskingQuery

ES Syntax example

{
  "query": {
    "span_near": {
      "clauses": [
        {
          "span_term": {
            "text": "quick brown"
          }
        },
        {
          "span_field_masking": {
            "query": {
              "span_term": {
                "text.stems": "fox"
              }
            },
            "field": "text"
          }
        }
      ],
      "slop": 5,
      "in_order": false
    }
  }
}

In order to support in SpanNearQuery/SpanOrQuery Using multiple search domains in query.

Can put a normal SpanQuery use SpanFieldMaskingQuery Wrap it up and specify a custom field field_x, So wrapped SpanQuery And other fields field_x Of SpanQuery Mixed use .

SpanFirstQuery

ES Syntax example

{
  "query": {
    "span_first": {
      "match": {
        "span_term": { "user.id": "kimchy" }
      },
      "end": 3
    }
  }
}

Wrap a SpanQuery, The matching position must be at the beginning of the document . say concretely , requirement Matching the position of endPosition<= Parameter specified end.

SpanMultiTermQuery

ES Syntax example

{
  "query": {
    "span_multi": {
      "match": {
        "prefix": { "user.id": { "value": "ki" } }
      }
    }
  }
}

Put one MultiTermQuery Bao Yaocheng SpanQuery, So that it can be used as SpanQuery Nesting uses .

Spans Class explanation

It realizes DocIdSetIterator, Inverted linked list used to represent documents , Added nextStartPosition() Used to traverse all of a document position.

int nextStartPosition()
void collect(SpanCollector collector) //  At present, it is mainly used to operate payload.

ConjunctionSpans

Multiple defined span The operation of taking intersection , Allow subclasses to pass twoPhaseCurrentDocMatches() Method to customize the matching rules for phrases .

NearSpansOrdered

Match phrases in order .

The core algorithm
@Override
  boolean twoPhaseCurrentDocMatches() throws IOException {
    assert unpositioned();
    oneExhaustedInCurrentDoc = false;
    while (subSpans[0].nextStartPosition() != NO_MORE_POSITIONS && !oneExhaustedInCurrentDoc) {
      if (stretchToOrder() && matchWidth <= allowedSlop) {
        return atFirstInCurrentDoc = true;
      }
    }
    return false;
  }


  @Override
  public int nextStartPosition() throws IOException {
    if (atFirstInCurrentDoc) {
      atFirstInCurrentDoc = false;
      return matchStart;
    }
    oneExhaustedInCurrentDoc = false;
    while (subSpans[0].nextStartPosition() != NO_MORE_POSITIONS && !oneExhaustedInCurrentDoc) {
      if (stretchToOrder() && matchWidth <= allowedSlop) {
        return matchStart;
      }
    }
    return matchStart = matchEnd = NO_MORE_POSITIONS;
  }
  
  private boolean stretchToOrder() throws IOException {
    Spans prevSpans = subSpans[0];
    matchStart = prevSpans.startPosition();
    assert prevSpans.startPosition() != NO_MORE_POSITIONS : "prevSpans no start position "+prevSpans;
    assert prevSpans.endPosition() != NO_MORE_POSITIONS;
    matchWidth = 0;
    for (int i = 1; i < subSpans.length; i++) {
      Spans spans = subSpans[i];
      assert spans.startPosition() != NO_MORE_POSITIONS;
      assert spans.endPosition() != NO_MORE_POSITIONS;
      if (advancePosition(spans, prevSpans.endPosition()) == NO_MORE_POSITIONS) {
        oneExhaustedInCurrentDoc = true;
        return false;
      }
      matchWidth += (spans.startPosition() - prevSpans.endPosition());
      prevSpans = spans;
    }
    matchEnd = subSpans[subSpans.length - 1].endPosition();
    return true; // all subSpans ordered and non overlapping
  }

Simply speaking , Is the fixed header node , Then detect the remaining nodes in turn .

matchWidth For the total distance to be moved , Total distance =

( The first 2 Node locations - The first 1 Node locations ) +

( The first 3 Node locations - The first 2 Node locations ) +

( The first 4 Node locations - The first 3 Node locations ) +

...

Finally, judge the total distance matchWidth<=slop that will do .

payload check Less recall problems

The algorithm slop Not for 0 And cooperate payload check There will be a problem when using :

If a document is :

china|1 bank|0.5 bank|1

We search :

{!payload_check f='TITLE_PAYLOADS' v='china bank' payloads='1 1' operator='phrase' slop=100 inOrder=true}

You can't find documents , because NearSpansOrdered For the first time, I will give 0,1 The subscript of this group of phrases , Then this group of phrases because bank Of payload=0.5 Unqualified , The second time we had expected NearSpansOrdered Should give 0,2 The subscript of this group of phrases , But it's not , because NearSpansOrdered Every time you call nextStartPosition() The first word must be called nextStartPosition() Of .

That is to say 0,1 If this set of subscripts does not match , that china The subscript of is going to go back . This is reasonable in a separate phrase query scenario , Because usually after the head node is fixed , If the nearest node cannot be selected from other nodes slop, It is even more impossible to match the farther one , But in joining payload check The scene is not , Nearby may be because payload check Do not conform to the , But the distant ones may be consistent , However NearSpansOrdered It doesn't give you a chance to test .

therefore , Specify inOrder=true And slop!=0 Scene , Make sure there are no duplicates in the document data term, Otherwise, there may be a risk of missed recall .

NearSpansUnordered

Unordered matching phrases

The core algorithm
 @Override
  boolean twoPhaseCurrentDocMatches() throws IOException {
    // at doc with all subSpans
    spanWindow.startDocument();
    while (true) {
      if (spanWindow.atMatch()) {
        atFirstInCurrentDoc = true;
        oneExhaustedInCurrentDoc = false;
        return true;
      }
      if (! spanWindow.nextPosition()) {
        return false;
      }
    }
  }

  @Override
  public int nextStartPosition() throws IOException {
    if (atFirstInCurrentDoc) {
      atFirstInCurrentDoc = false;
      return spanWindow.top().startPosition();
    }
    assert spanWindow.top().startPosition() != -1;
    assert spanWindow.top().startPosition() != NO_MORE_POSITIONS;
    while (true) {
      if (! spanWindow.nextPosition()) {
        oneExhaustedInCurrentDoc = true;
        return NO_MORE_POSITIONS;
      }
      if (spanWindow.atMatch()) {
        return spanWindow.top().startPosition();
      }
    }
  }


private class SpanTotalLengthEndPositionWindow extends PriorityQueue<Spans> {
    int totalSpanLength;
    int maxEndPosition;

    public SpanTotalLengthEndPositionWindow() {
      super(subSpans.length);
    }

    @Override
    protected final boolean lessThan(Spans spans1, Spans spans2) {
      return positionsOrdered(spans1, spans2);
    }

    void startDocument() throws IOException {
      clear();
      totalSpanLength = 0;
      maxEndPosition = -1;
      for (Spans spans : subSpans) {
        assert spans.startPosition() == -1;
        spans.nextStartPosition();
        assert spans.startPosition() != NO_MORE_POSITIONS;
        add(spans);
        if (spans.endPosition() > maxEndPosition) {
          maxEndPosition = spans.endPosition();
        }
        int spanLength = spans.endPosition() - spans.startPosition();
        assert spanLength >= 0;
        totalSpanLength += spanLength;
      }
    }

    boolean nextPosition() throws IOException {
      Spans topSpans = top();
      assert topSpans.startPosition() != NO_MORE_POSITIONS;
      int spanLength = topSpans.endPosition() - topSpans.startPosition();
      int nextStartPos = topSpans.nextStartPosition();
      if (nextStartPos == NO_MORE_POSITIONS) {
        return false;
      }
      totalSpanLength -= spanLength;
      spanLength = topSpans.endPosition() - topSpans.startPosition();
      totalSpanLength += spanLength;
      if (topSpans.endPosition() > maxEndPosition) {
        maxEndPosition = topSpans.endPosition();
      }
      updateTop();
      return true;
    }

    boolean atMatch() {
      boolean res = (maxEndPosition - top().startPosition() - totalSpanLength) <= allowedSlop;
      return res;
    }
  }
  
  

Simply speaking , The main idea of the algorithm is " Card boundary "+" Find time ".

This algorithm takes into account each sub span The length of , To understand thoughts , Let's not consider every child span The length of , Suppose the lengths are all 1, Then we can consider the following question :

Suppose the following documents are currently indexed :

"a b c d e f g h i j k"

We use it SpanNearQuery To search this document , The parameters are as follows :

  • phrase="b c e g h"
  • slop=?
  • inOrder=false

Ask if you can match the document , slop How small can the smallest be taken .

To answer this question , Let's draw a picture first , Index the document's term, position, And what we want to inquire about phrase Query for term All marked on it :

Indexes term

a

b

c

d

e

f

g

h

i

j

k

position

0

1

2

3

4

5

6

7

8

9

10

Whether it is a query term

x

x

x

x

x

It says , The main idea of the algorithm is " Card boundary "+" Find time ":

1. Card boundary

Find out term The leftmost position and the rightmost position , Jam these two boundaries , Then find the length :

  • Inquire about term On the far left is b, Subscript to be 1.
  • Inquire about term On the far right is h, Subscript to be 7.

Seek from b To h The total length of = 7-1+1=7 ( Why +1 Because h It also counts )

2. Find time

from b To h, If one of them term Does not belong to query term, It's a " room ", We need to find out from b To h Have a few moments . Obviously , Yes d and f Two " room ".

Because we are looking at pictures , It can be seen intuitively that there are 2 individual " room ", However, if you want to calculate 2 This value , Actually, you need to use :

 from b To h The total length of - Inquire about term Count =7-5=2.  ( Inquire about term Count =5 Because there is b,c,e,g,h 5 A query term)

Okay , Minimum slop Is equal to " room " The number of =2.

It can be seen that the core idea of this algorithm is very simple and intuitive , Our last " room " In fact, the calculation formula in the code corresponds to atMatch() Method :

	boolean atMatch() {
      boolean res = (maxEndPosition - top().startPosition() - totalSpanLength) <= allowedSlop;
      return res;
    }
    
maxEndPosition-top().startPosition() Is the total length of card boundary calculation .
totalSpanLength It's a query term Count .  But our inquiry term Because the length is 1,  So just count the numbers ,  For length is not 1 The situation of ,  In fact, the total length is calculated ,  That is to say totalSpanLength.

" Card boundary "+" Find time " Our algorithm is only for a group of query words position Of , Then each query word may have multiple position, So you need to maintain a heap , Match one group at a time position, Let the top of the pile ( At present position Minimum ) Query words for advance Go to the next position . Then for the new group position Keep using " Card boundary "+" Find time " The algorithm of .

payload check Less recall problems

stay slop Not for 0 And cooperate payload check When you use it , inOrder=false The algorithm will also cause the problem of less recall . Reason and inOrder=true The situation is similar .

原网站

版权声明
本文为[Call me Jiabao]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/01/202201181651505529.html