ES can finally find brother Wukong!
2022-06-25 07:09:00 【Wukong chat architecture】

This is Wukong's 90th original article.
Author | Wukong Chats Architecture
Source | Wukong Chats Architecture (ID: PassJava666)
Elasticsearch (ES for short) ships with many built-in tokenizers, but they do not handle Chinese well. For example, a search for Brother Wukong (悟空哥) finds nothing, so we need a third-party Chinese word-segmentation plugin.
Brother Wukong has dug into the ik Chinese word-segmentation plugin, and I hope the walkthrough below helps.
The main contents of this article are as follows:

1 How tokenization works in ES
1.1 The concept of a tokenizer in ES
In ES, a tokenizer receives a stream of characters, splits it into individual tokens, and then outputs a token stream.
ES provides many built-in tokenizers, which can be combined to build custom analyzers.
1.2 How the standard tokenizer works
For example, the standard tokenizer splits text whenever it encounters whitespace. A tokenizer is also responsible for recording the order or position of each term (used for phrase and word-proximity queries) and the character offsets of each word (used for highlighting matched content).
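You can see these positions and offsets directly in the _analyze response. Here is a minimal sketch, assuming a local ES instance (output abridged):
POST _analyze
{
"analyzer": "standard",
"text": "Brother Wukong"
}
The response lists each token together with its character offsets and position:
{
"tokens": [
{ "token": "brother", "start_offset": 0, "end_offset": 7, "type": "<ALPHANUM>", "position": 0 },
{ "token": "wukong", "start_offset": 8, "end_offset": 14, "type": "<ALPHANUM>", "position": 1 }
]
}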
1.3 Example: English and punctuation segmentation
The query example is as follows:
POST _analyze
{
"analyzer": "standard",
"text": "Do you know why I want to study ELK? 2 3 33..."
}
Query results:
do, you, know, why, i, want, to, study, elk, 2, 3, 33
You can see from the results that:
(1) Punctuation marks do not produce tokens.
(2) Numbers are kept as tokens.

1.4 Example: Chinese word segmentation
However, this tokenizer is not friendly to Chinese: it splits Chinese text into individual characters. For example, the query below splits 悟空聊架构 (Wukong Chats Architecture) into the single characters 悟, 空, 聊, 架, 构, while the expected segmentation is 悟空 (Wukong), 聊 (chats), 架构 (architecture).
POST _analyze
{
"analyzer": "standard",
"text": " Wukong chat structure "
}
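With the standard analyzer, the result (token list only, a sketch assuming the text above) is one token per Chinese character:
悟, 空, 聊, 架, 构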

We can install the ik tokenizer plugin to get proper Chinese word segmentation.
2 Installing the ik tokenizer
2.1 Where to get the ik tokenizer
The ik tokenizer releases are available at:
https://github.com/medcl/elasticsearch-analysis-ik/releases
First check your ES version. I installed 7.4.2, so I chose 7.4.2 for the ik tokenizer as well. Open the ES address in a browser to check:
http://192.168.56.10:9200/
{
"name" : "8448ec5f3312",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "xC72O3nKSjWavYZ-EPt9Gw",
"version" : {
"number" : "7.4.2",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "2f90bbf7b93631e52bafb59b3b049cb44ec25e96",
"build_date" : "2019-10-28T20:40:44.881551Z",
"build_snapshot" : false,
"lucene_version" : "8.2.0",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
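If you prefer the command line, a quick sketch assuming ES is reachable at the same address returns the same JSON:
curl http://192.168.56.10:9200/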

2.2 Ways to install the ik tokenizer
2.2.1 Option 1: install the ik tokenizer inside the container
Enter the plugins directory inside the ES container:
docker exec -it < Containers id> /bin/bash
Download the ik tokenizer archive:
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
Unzip the ik archive:
unzip <archive name>
Delete the downloaded archive:
rm -rf *.zip
2.2.2 Option 2: install the ik tokenizer via the mapped folder
Go to the mapped plugins folder:
cd /mydata/elasticsearch/plugins
Download the release archive:
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
Unzip the ik archive:
unzip <archive name>
Delete the downloaded archive:
rm -rf *.zip
2.2.3 Option 3: upload the archive to the mapped directory with Xftp
First connect to the virtual machine with the XShell tool (see the earlier article "02. Quickly set up a Linux environment - a must-have for operations"), then use Xftp to copy the downloaded archive to the virtual machine.

3 Unzip the ik tokenizer archive into the container
If the unzip tool is not installed, install it first:
apt install unzip
Unzip the ik tokenizer into an ik folder under the current directory.
Command format: unzip <ik tokenizer archive> -d ./ik
Example:
unzip ELK-IKv7.4.2.zip -d ./ik

Modify the folder permissions to be readable and writable:
chmod -R 777 ik/
Delete the ik tokenizer archive:
rm ELK-IKv7.4.2.zip
4 Verify the ik tokenizer installation
Enter the container:
docker exec -it < Containers id> /bin/bash
List the Elasticsearch plugins:
elasticsearch-plugin list
The following result shows that the ik tokenizer is installed. Simple, isn't it?
ik

Then exit the Elasticsearch container and restart it:
exit
docker restart elasticsearch
5 Using the ik Chinese tokenizer
The ik tokenizer has two segmentation modes:
smart segmentation mode (ik_smart)
maximum-combination segmentation mode (ik_max_word)
Let's first look at the smart segmentation mode. For example, segmenting 一个小星星 ("a little star") gives two words: 一个 ("a") and 小星星 ("little star").
Enter the following query in the Dev Tools console:
POST _analyze
{
"analyzer": "ik_smart",
"text": " A little star "
}
The result: the text is split into 一个 and 小星星.
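An abridged sketch of the response, assuming the query above (offsets count the 5 characters of 一个小星星):
{
"tokens": [
{ "token": "一个", "start_offset": 0, "end_offset": 2, "position": 0 },
{ "token": "小星星", "start_offset": 2, "end_offset": 5, "position": 1 }
]
}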

Now look at the maximum-combination segmentation mode. Enter the following query:
POST _analyze
{
"analyzer": "ik_max_word",
"text": " A little star "
}
一个小星星 is split into 6 tokens, including 一个, 一, 小星星, 小星, and 星星.

Let's look at another Chinese segmentation example. Searching for 悟空哥聊架构 ("Brother Wukong chats about architecture"), the expected result is three words: 悟空哥 (Brother Wukong), 聊 (chats), 架构 (architecture).
The actual result is four words: 悟, 空哥 (Brother Kong), 聊, 架构. The ik tokenizer splits 悟空哥 apart because it treats 空哥 as a word. So we need to let the ik tokenizer know that 悟空哥 is one word that should not be split. How do we do that?
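You can reproduce the problem with a query like this sketch (ik_smart is assumed here):
POST _analyze
{
"analyzer": "ik_smart",
"text": "悟空哥聊架构"
}
With the default dictionary, the token list contains 悟 and 空哥 as separate tokens instead of 悟空哥.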

6 Custom word-segmentation thesaurus
6.1 How to customize the thesaurus
Plan
Create a new thesaurus file, then specify the path of the thesaurus file in the ik tokenizer configuration file. You can specify a local path or a remote server file path. Here we use the remote-server-file approach, because it supports hot updates (when the file on the server is updated, ik reloads the thesaurus).
Modify the configuration file
The ik tokenizer configuration file path inside the container is:
/usr/share/elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml
You can also modify it through the mapped file on the host, at this path:
/mydata/elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml
Edit the configuration file:
vim /mydata/elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml
The contents of the configuration file are as follows:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer Extended configuration </comment>
<!-- Users can configure their own extended dictionary here -->
<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
<!-- Users can configure their own extended stop word dictionary here -->
<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
<!-- Users can configure the remote extended dictionary here -->
<entry key="remote_ext_dict">location</entry>
<!-- Users can configure the remote extension stop word dictionary here -->
<entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
</properties>
Modify the remote_ext_dict property value to point to a remote file path on a web server, such as http://www.xxx.com/ikwords.text.
Here we set up our own nginx environment and put the thesaurus file in the nginx root directory.
6.2 Set up the nginx environment
Plan: first pull the nginx image, then start an nginx container, copy the nginx configuration files out to the host, delete the original nginx container, and finally restart nginx with mapped folders.
Set up the nginx environment with Docker.
docker run -p 80:80 --name nginx -d nginx:1.10
Copy the nginx container's configuration files to a conf folder under the mydata directory:
cd /mydata
docker container cp nginx:/etc/nginx ./conf
Create an nginx directory under mydata:
mkdir nginx
Move the conf folder into the nginx mapped folder:
mv conf nginx/
Stop and delete the original nginx container:
docker stop nginx
docker rm < Containers id>
Start a new container
docker run -p 80:80 --name nginx \
-v /mydata/nginx/html:/usr/share/nginx/html \
-v /mydata/nginx/logs:/var/log/nginx \
-v /mydata/nginx/conf:/etc/nginx \
-d nginx:1.10
Visit the nginx service:
192.168.56.10
A 403 Forbidden, nginx/1.10.3 response means the nginx service started normally. The 403 occurs because there are no files under the nginx service yet.
Create a new html file in the nginx html directory:
cd /mydata/nginx/html
vim index.html
hello passjava
Revisit the nginx service.
The browser prints hello passjava, which shows the nginx service page is working.
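A quick command-line check, a sketch assuming the same host, should return the page content:
curl http://192.168.56.10
hello passjava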
Create the ik word-segmentation thesaurus file:
cd /mydata/nginx/html
mkdir ik
cd ik
vim ik.txt
Enter 悟空哥 (Brother Wukong) in the file, one custom word per line, and save it.
Access the thesaurus file:
http://192.168.56.10/ik/ik.txt
The browser may show some garbled characters; you can ignore that for now. It means the thesaurus file is accessible.
Modify the ik tokenizer configuration:
cd /mydata/elasticsearch/plugins/ik/config
vim IKAnalyzer.cfg.xml
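The key change is to point remote_ext_dict at the thesaurus file served by nginx. A minimal sketch of the modified entry, assuming nginx is reachable at 192.168.56.10 as set up above:
<entry key="remote_ext_dict">http://192.168.56.10/ik/ik.txt</entry>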

Restart the elasticsearch container, and configure it to start automatically whenever the machine restarts:
docker restart elasticsearch
docker update elasticsearch --restart=always
Query the segmentation results again.
You can see that 悟空哥聊架构 is now split into three words: 悟空哥 (Brother Wukong), 聊 (chats), 架构 (architecture), which shows the custom thesaurus entry 悟空哥 has taken effect.
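A sketch of the verification query, assuming ik_smart and the custom thesaurus above:
POST _analyze
{
"analyzer": "ik_smart",
"text": "悟空哥聊架构"
}
Abridged expected tokens: 悟空哥, 聊, 架构.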

- END -


This article is from the WeChat official account Wukong Chats Architecture (PassJava666).