当前位置:网站首页>Learning summary of spanquery source code
Learning summary of spanquery source code
2022-06-23 03:28:00 【Call me Jiabao】
General idea
SpanQuery The main idea of implementation :
SpanQuery->SpanWeight->SpanScorer.
SpanScorer Contains a Spans object , SpanScorer hold iterator() and twoPhraseIterator() Methods are entrusted to Spans object . Spans The class itself inherits DocIdSetIterator, in other words Spans The object itself represents a document inversion table , In addition to being an inverted table , Spans Class also implements nextStartPosition() /startPosition() /endPosition(), When matching a document , Through these three interfaces, you can traverse the matching position in the current document , For phrase matching .
SpanWeight Contains a getSpans(LeafReaderContext ctx, Postings requiredPostings) Abstract method of , Generate one for each segment Spans object , This object is in SpanScorer Objects contained in .
SpanNearQuery
ES Syntax example
{
"query": {
"span_near": {
"clauses": [
{ "span_term": { "field": "value1" } },
{ "span_term": { "field": "value2" } },
{ "span_term": { "field": "value3" } }
],
"slop": 12,
"in_order": false
}
}
}Recall logic
Recall candidates
Through multiple sub spans Take the document obtained from the intersection as the candidate set .
Filtering stage
For many spans For each group of matching positions , according to inOrder Parameter values determine the use of NearSpansOrdered or NearSpansUnordered The algorithm of . See... Below for details Spans Class detailed explanation of the algorithm .
SpanOrQuery
ES Syntax example
{
"query": {
"span_or" : {
"clauses" : [
{ "span_term" : { "field" : "value1" } },
{ "span_term" : { "field" : "value2" } },
{ "span_term" : { "field" : "value3" } }
]
}
}
}Recall logic
Through multiple sub spans Get the recall document set from the union set . ( Through the heap ).
SpanNotQuery
ES Syntax example
{
"query": {
"span_not": {
"include": {
"span_term": { "field1": "hoya" }
},
"exclude": {
"span_near": {
"clauses": [
{ "span_term": { "field1": "la" } },
{ "span_term": { "field1": "hoya" } }
],
"slop": 0,
"in_order": true
}
}
}
}
}Require matching documents to match include Of SpanQuery, And can not overlap matching exclude Of SpanQuery.
The recall logic is as follows :
Recall candidates
Use include Of SpanQuery Recall all hit documents as candidate sets .
Filtering stage 1
Use exclude Of SpanQuery Filter the documents in the candidate set , If the candidate document does not hit exclude Of SpanQuery, Is returned directly as a hit document . If the candidate document hits exclude Of SpanQuery, Then enter the next filtering stage .
Filtering stage 2
For simultaneous hits include and exclude Documents , Need to detect include and exclude The hit of position Whether there is overlap . Such as include Hit location is [0,5], exclude Hit location is [7,8] There is no coincidence . include Hit location is [0,5], exclude Hit location is [4,8], There is overlap . about include and exclude There are overlapping documents in the hit location , To filter out .
SpanContainingQuery
ES Syntax example
{
"query": {
"span_containing": {
"little": {
"span_term": { "title": "b" }
},
"big": {
"span_near": {
"clauses": [
{ "span_term": { "title": "a" } },
{ "span_term": { "title": "c" } }
],
"slop": 5,
"in_order": true
}
}
}
}
}Recall logic
Recall candidates
adopt little and big Take the intersection set as the candidate set .
Filtering stage
For the stage 1 The results of the recall , need little Matching position The scope is big Within the matching range of .
For the example above , I.e. request to pass big Match the "a" and "c" At the same time , little Of "b" Must appear in "a" and "c" In the middle of the .
such as :
file a x x b c Can match , because b Appear in a and c In the middle of the .
file a x x c b Can't match , because b Not in a and c In the middle of the .
Matching position
We need to pay attention to , SpanContainingQuery The matching position is big The location of . What does that mean ? Let's look at the following example :
The assumptions are as follows query:
{
"query": {
"span_near": {
"clauses": [
{
"span_containing": {
"little": {
"span_term": {
"title": "b"
}
},
"big": {
"span_near": {
"clauses": [
{
"span_term": {
"title": "a"
}
},
{
"span_term": {
"title": "c"
}
}
],
"slop": 5,
"in_order": true
}
}
}
},
{
"span_term": {
"title": "d"
}
}
],
"slop": 0,
"in_order": true
}
}
} Ask if you can match the document a x x b c d?
The answer is yes . The outermost span_near It is required to match at the same time span_containing and span_term(title=d) And there is no gap between the two (slop=0), We've already said , This span_containing Can be matched , span_term(title=d) Obviously, it can be matched by itself , So how to calculate the distance between them ? At this time, we need to know span_containing Actual matching position How much is it .
span_containing Specify the matching position yes big Of position, Corresponding to a x x b c, So distance d The distance is 0, So the whole query Can match documents .
As you can imagine , If span_containing Matching position No big It is little, So the document a x x b c d There's no match . Because if span_containing The match is little Of position, Then it is equivalent to matching the b, So distance d The distance is 1, slop=0 Can't be matched . If span_containing Matching position No big It is little, Then it becomes another kind of SpanQuery: SpanWithinQuery.
SpanWithinQuery
Recall logic and when used alone SpanContainingQuery Exactly the same . Only the matching position is different , SpanContainingQuery To match the position big Of , SpanWithinQuery To match the position little Of .
ES Syntax example
{
"query": {
"span_within": {
"little": {
"span_term": { "title": "b" }
},
"big": {
"span_near": {
"clauses": [
{ "span_term": { "title": "a" } },
{ "span_term": { "title": "c" } }
],
"slop": 5,
"in_order": true
}
}
}
}
}Recall logic
Recall candidates
adopt little and big Take the intersection set as the candidate set .
Filtering stage
For the stage 1 The results of the recall , need little Matching position The scope is big Within the matching range of .
For the example above , I.e. request to pass big Match the "a" and "c" At the same time , little Of "b" Must appear in "a" and "c" In the middle of the .
such as :
file a x x b c Can match , because b Appear in a and c In the middle of the .
file a x x c b Can't match , because b Not in a and c In the middle of the .
Matching position
And SpanContainingQuery contrary , SpanWithinQuery The matching position is little The location of . For detailed explanation, please refer to SpanContainingQuery Explanation in .
SpanFieldMaskingQuery
ES Syntax example
{
"query": {
"span_near": {
"clauses": [
{
"span_term": {
"text": "quick brown"
}
},
{
"span_field_masking": {
"query": {
"span_term": {
"text.stems": "fox"
}
},
"field": "text"
}
}
],
"slop": 5,
"in_order": false
}
}
}In order to support in SpanNearQuery/SpanOrQuery Using multiple search domains in query.
Can put a normal SpanQuery use SpanFieldMaskingQuery Wrap it up and specify a custom field field_x, So wrapped SpanQuery And other fields field_x Of SpanQuery Mixed use .
SpanFirstQuery
ES Syntax example
{
"query": {
"span_first": {
"match": {
"span_term": { "user.id": "kimchy" }
},
"end": 3
}
}
} Wrap a SpanQuery, The matching position must be at the beginning of the document . say concretely , requirement Matching the position of endPosition<= Parameter specified end.
SpanMultiTermQuery
ES Syntax example
{
"query": {
"span_multi": {
"match": {
"prefix": { "user.id": { "value": "ki" } }
}
}
}
}Put one MultiTermQuery Bao Yaocheng SpanQuery, So that it can be used as SpanQuery Nesting uses .
Spans Class explanation
It realizes DocIdSetIterator, Inverted linked list used to represent documents , Added nextStartPosition() Used to traverse all of a document position.
int nextStartPosition() void collect(SpanCollector collector) // At present, it is mainly used to operate payload.
ConjunctionSpans
Multiple defined span The operation of taking intersection , Allow subclasses to pass twoPhaseCurrentDocMatches() Method to customize the matching rules for phrases .
NearSpansOrdered
Match phrases in order .
The core algorithm
@Override
boolean twoPhaseCurrentDocMatches() throws IOException {
assert unpositioned();
oneExhaustedInCurrentDoc = false;
while (subSpans[0].nextStartPosition() != NO_MORE_POSITIONS && !oneExhaustedInCurrentDoc) {
if (stretchToOrder() && matchWidth <= allowedSlop) {
return atFirstInCurrentDoc = true;
}
}
return false;
}
@Override
public int nextStartPosition() throws IOException {
if (atFirstInCurrentDoc) {
atFirstInCurrentDoc = false;
return matchStart;
}
oneExhaustedInCurrentDoc = false;
while (subSpans[0].nextStartPosition() != NO_MORE_POSITIONS && !oneExhaustedInCurrentDoc) {
if (stretchToOrder() && matchWidth <= allowedSlop) {
return matchStart;
}
}
return matchStart = matchEnd = NO_MORE_POSITIONS;
}
private boolean stretchToOrder() throws IOException {
Spans prevSpans = subSpans[0];
matchStart = prevSpans.startPosition();
assert prevSpans.startPosition() != NO_MORE_POSITIONS : "prevSpans no start position "+prevSpans;
assert prevSpans.endPosition() != NO_MORE_POSITIONS;
matchWidth = 0;
for (int i = 1; i < subSpans.length; i++) {
Spans spans = subSpans[i];
assert spans.startPosition() != NO_MORE_POSITIONS;
assert spans.endPosition() != NO_MORE_POSITIONS;
if (advancePosition(spans, prevSpans.endPosition()) == NO_MORE_POSITIONS) {
oneExhaustedInCurrentDoc = true;
return false;
}
matchWidth += (spans.startPosition() - prevSpans.endPosition());
prevSpans = spans;
}
matchEnd = subSpans[subSpans.length - 1].endPosition();
return true; // all subSpans ordered and non overlapping
}Simply speaking , Is the fixed header node , Then detect the remaining nodes in turn .
matchWidth For the total distance to be moved , Total distance =
( The first 2 Node locations - The first 1 Node locations ) +
( The first 3 Node locations - The first 2 Node locations ) +
( The first 4 Node locations - The first 3 Node locations ) +
...
Finally, judge the total distance matchWidth<=slop that will do .
payload check Less recall problems
The algorithm slop Not for 0 And cooperate payload check There will be a problem when using :
If a document is :
china|1 bank|0.5 bank|1
We search :
{!payload_check f='TITLE_PAYLOADS' v='china bank' payloads='1 1' operator='phrase' slop=100 inOrder=true}You can't find documents , because NearSpansOrdered For the first time, I will give 0,1 The subscript of this group of phrases , Then this group of phrases because bank Of payload=0.5 Unqualified , The second time we had expected NearSpansOrdered Should give 0,2 The subscript of this group of phrases , But it's not , because NearSpansOrdered Every time you call nextStartPosition() The first word must be called nextStartPosition() Of .
That is to say 0,1 If this set of subscripts does not match , that china The subscript of is going to go back . This is reasonable in a separate phrase query scenario , Because usually after the head node is fixed , If the nearest node cannot be selected from other nodes slop, It is even more impossible to match the farther one , But in joining payload check The scene is not , Nearby may be because payload check Do not conform to the , But the distant ones may be consistent , However NearSpansOrdered It doesn't give you a chance to test .
therefore , Specify inOrder=true And slop!=0 Scene , Make sure there are no duplicates in the document data term, Otherwise, there may be a risk of missed recall .
NearSpansUnordered
Unordered matching phrases
The core algorithm
@Override
boolean twoPhaseCurrentDocMatches() throws IOException {
// at doc with all subSpans
spanWindow.startDocument();
while (true) {
if (spanWindow.atMatch()) {
atFirstInCurrentDoc = true;
oneExhaustedInCurrentDoc = false;
return true;
}
if (! spanWindow.nextPosition()) {
return false;
}
}
}
@Override
public int nextStartPosition() throws IOException {
if (atFirstInCurrentDoc) {
atFirstInCurrentDoc = false;
return spanWindow.top().startPosition();
}
assert spanWindow.top().startPosition() != -1;
assert spanWindow.top().startPosition() != NO_MORE_POSITIONS;
while (true) {
if (! spanWindow.nextPosition()) {
oneExhaustedInCurrentDoc = true;
return NO_MORE_POSITIONS;
}
if (spanWindow.atMatch()) {
return spanWindow.top().startPosition();
}
}
}
private class SpanTotalLengthEndPositionWindow extends PriorityQueue<Spans> {
int totalSpanLength;
int maxEndPosition;
public SpanTotalLengthEndPositionWindow() {
super(subSpans.length);
}
@Override
protected final boolean lessThan(Spans spans1, Spans spans2) {
return positionsOrdered(spans1, spans2);
}
void startDocument() throws IOException {
clear();
totalSpanLength = 0;
maxEndPosition = -1;
for (Spans spans : subSpans) {
assert spans.startPosition() == -1;
spans.nextStartPosition();
assert spans.startPosition() != NO_MORE_POSITIONS;
add(spans);
if (spans.endPosition() > maxEndPosition) {
maxEndPosition = spans.endPosition();
}
int spanLength = spans.endPosition() - spans.startPosition();
assert spanLength >= 0;
totalSpanLength += spanLength;
}
}
boolean nextPosition() throws IOException {
Spans topSpans = top();
assert topSpans.startPosition() != NO_MORE_POSITIONS;
int spanLength = topSpans.endPosition() - topSpans.startPosition();
int nextStartPos = topSpans.nextStartPosition();
if (nextStartPos == NO_MORE_POSITIONS) {
return false;
}
totalSpanLength -= spanLength;
spanLength = topSpans.endPosition() - topSpans.startPosition();
totalSpanLength += spanLength;
if (topSpans.endPosition() > maxEndPosition) {
maxEndPosition = topSpans.endPosition();
}
updateTop();
return true;
}
boolean atMatch() {
boolean res = (maxEndPosition - top().startPosition() - totalSpanLength) <= allowedSlop;
return res;
}
}
Simply speaking , The main idea of the algorithm is " Card boundary "+" Find time ".
This algorithm takes into account each sub span The length of , To understand thoughts , Let's not consider every child span The length of , Suppose the lengths are all 1, Then we can consider the following question :
Suppose the following documents are currently indexed :
"a b c d e f g h i j k"
We use it SpanNearQuery To search this document , The parameters are as follows :
- phrase="b c e g h"
- slop=?
- inOrder=false
Ask if you can match the document , slop How small can the smallest be taken .
To answer this question , Let's draw a picture first , Index the document's term, position, And what we want to inquire about phrase Query for term All marked on it :
Indexes term | a | b | c | d | e | f | g | h | i | j | k |
|---|---|---|---|---|---|---|---|---|---|---|---|
position | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Whether it is a query term | x | x | x | x | x |
It says , The main idea of the algorithm is " Card boundary "+" Find time ":
1. Card boundary
Find out term The leftmost position and the rightmost position , Jam these two boundaries , Then find the length :
- Inquire about term On the far left is b, Subscript to be 1.
- Inquire about term On the far right is h, Subscript to be 7.
Seek from b To h The total length of = 7-1+1=7 ( Why +1 Because h It also counts )
2. Find time
from b To h, If one of them term Does not belong to query term, It's a " room ", We need to find out from b To h Have a few moments . Obviously , Yes d and f Two " room ".
Because we are looking at pictures , It can be seen intuitively that there are 2 individual " room ", However, if you want to calculate 2 This value , Actually, you need to use :
from b To h The total length of - Inquire about term Count =7-5=2. ( Inquire about term Count =5 Because there is b,c,e,g,h 5 A query term)
Okay , Minimum slop Is equal to " room " The number of =2.
It can be seen that the core idea of this algorithm is very simple and intuitive , Our last " room " In fact, the calculation formula in the code corresponds to atMatch() Method :
boolean atMatch() {
boolean res = (maxEndPosition - top().startPosition() - totalSpanLength) <= allowedSlop;
return res;
}
maxEndPosition-top().startPosition() Is the total length of card boundary calculation .
totalSpanLength It's a query term Count . But our inquiry term Because the length is 1, So just count the numbers , For length is not 1 The situation of , In fact, the total length is calculated , That is to say totalSpanLength." Card boundary "+" Find time " Our algorithm is only for a group of query words position Of , Then each query word may have multiple position, So you need to maintain a heap , Match one group at a time position, Let the top of the pile ( At present position Minimum ) Query words for advance Go to the next position . Then for the new group position Keep using " Card boundary "+" Find time " The algorithm of .
payload check Less recall problems
stay slop Not for 0 And cooperate payload check When you use it , inOrder=false The algorithm will also cause the problem of less recall . Reason and inOrder=true The situation is similar .
边栏推荐
- Troubleshooting and solution of error 400 in easygbs video platform
- Why can only a small number of condition type prices be maintained in me12 of SAP mm?
- Decentralized networks are not decentralized
- New configuration of Alipay
- How does easyplayer embed a video snapshot into a demo?
- How to make distribution box label
- Analysis of China's integrated circuit industry chain in 2021: huge downstream market demand [figure]
- TRTC zero foundation -- Video subscription on the code
- Goframe framework (RK boot): enable tls/ssl
- C. Differential Sorting
猜你喜欢

Analysis on the development of China's satellite navigation industry chain in 2021: satellite navigation is fully integrated into production and life, and the satellite navigation industry is also boo
![Analysis on the development prospect of China's brain computer interface industry in 2021: wide application prospect, sustained and rapid growth of market scale [figure]](/img/84/192d152ceb760264b6b555b321f129.jpg)
Analysis on the development prospect of China's brain computer interface industry in 2021: wide application prospect, sustained and rapid growth of market scale [figure]
![Analysis of the number of urban residents covered by basic medical insurance, their treatment and medical treatment in other places in China in 2021 [figure]](/img/81/4d3cb059f700dd9243645e64023be7.jpg)
Analysis of the number of urban residents covered by basic medical insurance, their treatment and medical treatment in other places in China in 2021 [figure]
![Analysis on demand and market scale of China's steamed stuffed bun industry in 2020 [figure]](/img/4b/dd272f98b89a157180bf68570d2763.jpg)
Analysis on demand and market scale of China's steamed stuffed bun industry in 2020 [figure]

Detailed discussion on modular architecture design of MCU firmware

Gakataka student end to bundle Version (made by likewendy)
![[quick view] Analysis on the development status and future development trend of the global and Chinese diamond cultivation industry in 2021 [figure]](/img/f1/972a760459a6d599b5681aa634df09.jpg)
[quick view] Analysis on the development status and future development trend of the global and Chinese diamond cultivation industry in 2021 [figure]

Jmeter- (V) simulated user concurrent login for interface test
![Analysis of China's integrated circuit industry chain in 2021: huge downstream market demand [figure]](/img/de/d73805aaf4345ca3d2a7baf85aab8d.jpg)
Analysis of China's integrated circuit industry chain in 2021: huge downstream market demand [figure]

Analysis on the development of duty-free industry in Hainan Province in 2021: the implementation of the new policy makes the duty-free market in Hainan more "prosperous" [figure]
随机推荐
Simply use the pagoda to build WordPress
Micro build low code to realize user login and registration
Troubleshooting and solution of error 400 in easygbs video platform
Batch generation of Codabar codes using Excel files
How to configure the domain name with low code of micro build
How to generate code-11 barcode in batch through TXT file
Why don't I suggest you use "! = null" to judge empty space?
YouTube security scenarios
JS event delegation (event agent)
Mybatties plus batch warehousing
New configuration of Alipay
Know res.send() and res.end() of Express
[Alibaba middleware technology series] "Nacos technology" service registration and discovery related principle analysis
The primary level of SAP retail uses the transaction code wrfmatcopy to create commodity master data
What are the advantages of the completely free and open source flutter? How to learn about flutter?
PHP composer yii2 installation
6 values in JS are 'false'
TRTC zero foundation -- Video subscription on the code
DAAS architecture and Implementation (I)
Build a weather forecast applet using a widget