KingbaseES (Jincang database) plug-in: ftutilx
2022-06-25 11:07:00 【Thousands of sails pass by the side of the sunken boat_】
1. Plug-in introduction
ftutilx is a KingbaseES extension used mainly to extract text content from files stored as byte streams in blob columns. The blob value may hold a file in pdf, doc, docx, wps, xls, xlsx, ppt, or pptx format. ftutilx does not support encrypted files.
2. Adding the plug-in
Before using ftutilx, add it to shared_preload_libraries in the kingbase.conf file and restart the KingbaseES database, then create the extension.
shared_preload_libraries = 'ftutilx' # (change requires restart)
CREATE EXTENSION ftutilx;
3. Parameter configuration
ftutilx.max_string_length
Maximum length of the extraction result. The default value is 128M. A new value takes effect immediately after it is set.
ftutilx.jvm_option_string
JVM initialization parameters. The default value is "-Xmx1024m,-Xms1024m,-Xmn256m,-XX:MetaspaceSize=64m,-XX:MaxMetaspaceSize=128m,-XX:CompressedClassSpaceSize=256m". This parameter only takes effect when the session process calls the extracttext function for the first time and creates the JVM. Setting it again afterwards has no effect.
Under the database's default extension loading mechanism, once an extension has been created in one session, a new session does not load the extension's dynamic library at startup. The library is loaded only when a function of the extension is called for the first time, so setting the extension's parameters in a new session before that point fails. The solution is to modify either shared_preload_libraries or session_preload_libraries in the database configuration file so that its value includes ftutilx. The ftutilx dynamic library is then loaded as soon as a new session starts, and the extension parameters can be set.
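The value of ftutilx.jvm_option_string is easy to mistype when only one or two options need to change. The following is a minimal Python sketch of a client-side helper that replaces individual options in such a string. It assumes that options are separated by commas, as the default value suggests. The function name override_jvm_options and the constant DEFAULT_JVM_OPTIONS are hypothetical and not part of ftutilx.

```python
# Default value of ftutilx.jvm_option_string as quoted in the text above.
DEFAULT_JVM_OPTIONS = (
    "-Xmx1024m,-Xms1024m,-Xmn256m,-XX:MetaspaceSize=64m,"
    "-XX:MaxMetaspaceSize=128m,-XX:CompressedClassSpaceSize=256m"
)


def override_jvm_options(option_string, overrides):
    """Return option_string with selected options replaced.

    Assumption: options are separated by commas, as in the default value.
    overrides maps an option prefix (for example "-Xmx") to the complete
    replacement option (for example "-Xmx512m"). An override whose prefix
    matches no existing option is appended at the end.
    """
    options = [opt.strip() for opt in option_string.split(",") if opt.strip()]
    remaining = dict(overrides)
    result = []
    for opt in options:
        replaced = False
        for prefix in list(remaining):
            if opt.startswith(prefix):
                # Swap the existing option for its replacement, keeping order.
                result.append(remaining.pop(prefix))
                replaced = True
                break
        if not replaced:
            result.append(opt)
    # Overrides that matched nothing are appended as new options.
    result.extend(remaining.values())
    return ",".join(result)


if __name__ == "__main__":
    # Example: halve the heap size and keep every other default option.
    print(override_jvm_options(DEFAULT_JVM_OPTIONS,
                               {"-Xmx": "-Xmx512m", "-Xms": "-Xms512m"}))
```

Because the parameter is only read when the JVM is created, a string built this way has to be applied in the session before the first call to extracttext.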
4. Using ftutilx
The ftutilx plug-in provides the extracttext function, which extracts the contents of a file stored in a blob column. extracttext takes one blob argument holding the file contents and returns the extracted text as a value of type text.
CREATE TABLE tab (title text, body blob);
INSERT INTO tab VALUES ('test.doc', blob_import('/home/test/data.doc'));
SELECT title, length(extracttext(body)) FROM tab;
4.1. Using ftutilx together with full-text search
Because extracting the content of electronic documents is slow, you can improve full-text search performance by adding a stored column to the table that holds either the extracted text or the word position list (tsvector).
Option 1:
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION zhparsercfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION zhparsercfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
CREATE EXTENSION ftutilx;
CREATE TABLE tab (title text, body blob);
ALTER TABLE tab ADD COLUMN content text GENERATED ALWAYS AS (extracttext(body)) STORED;
CREATE INDEX tab_idx ON tab USING GIN (to_tsvector('zhparsercfg', content));
INSERT INTO tab VALUES ('test.doc', blob_import('/home/test/data.doc'));
SELECT title FROM tab WHERE to_tsvector('zhparsercfg', content) @@ to_tsquery(' journal ');
Option 2:
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION zhparsercfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION zhparsercfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
CREATE EXTENSION ftutilx;
CREATE TABLE tab (title text, body blob);
ALTER TABLE tab ADD COLUMN tab_idx_col tsvector GENERATED ALWAYS AS (to_tsvector('zhparsercfg', extracttext(body))) STORED;
CREATE INDEX tab_idx ON tab USING GIN (tab_idx_col);
INSERT INTO tab VALUES ('test.doc', blob_import('/home/test/data.doc'));
SELECT title FROM tab WHERE tab_idx_col @@ to_tsquery(' journal ');
4.2. Notes
1) ftutilx depends on the jre-1.8.0 runtime environment. After deployment, the LD_LIBRARY_PATH environment variable must include the path of the jre-1.8.0 libjvm.so.
2) The ftutilx.max_string_length parameter sets the maximum length of the extraction result. However, tsvector currently supports at most (1M-1), so when extracttext is combined with to_tsvector, the size of the word segmentation result cannot exceed (1M-1).
3) ftutilx needs to create a JVM, and the JVM uses a fairly large amount of memory. Lowering -Xmx in ftutilx.jvm_option_string limits the JVM's memory footprint, but a value that is too small causes the JVM to run out of memory when parsing large files.
4) With the full-text search schemes above, on a system with little memory you need to limit the number of session processes that insert data in parallel so that system memory is not exhausted.
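Notes 3) and 4) can be combined into a rough sizing rule: each inserting session creates its own JVM, so the number of parallel sessions should be chosen from the memory budget and the -Xmx setting. The Python sketch below illustrates this on the client side. It is not part of ftutilx. The names parse_xmx_mb, max_parallel_sessions, and insert_all are hypothetical, and insert_one stands for any caller-supplied function that opens its own database session and inserts one file. The estimate counts only the JVM heap plus an overhead the caller supplies, so actual memory use may be higher.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def parse_xmx_mb(option_string):
    """Return the -Xmx value, in MB, from a comma-separated JVM option string."""
    match = re.search(r"-Xmx(\d+)([kKmMgG])", option_string)
    if match is None:
        raise ValueError("no -Xmx option found")
    value, unit = int(match.group(1)), match.group(2).lower()
    factor = {"k": 1 / 1024, "m": 1, "g": 1024}[unit]
    return value * factor


def max_parallel_sessions(memory_budget_mb, option_string, overhead_mb=0):
    """Estimate how many inserting sessions fit into memory_budget_mb.

    Assumption: one session needs at most its JVM heap (-Xmx) plus
    overhead_mb for everything else (metaspace, the session process itself).
    overhead_mb is a guess supplied by the caller. At least 1 is returned.
    """
    per_session_mb = parse_xmx_mb(option_string) + overhead_mb
    return max(1, int(memory_budget_mb // per_session_mb))


def insert_all(paths, insert_one, max_sessions):
    """Call insert_one(path) for every path with at most max_sessions workers.

    insert_one is a hypothetical caller-supplied function that opens its own
    session and runs the INSERT. Results are returned in input order.
    """
    with ThreadPoolExecutor(max_workers=max_sessions) as pool:
        return list(pool.map(insert_one, paths))
```

For example, with the default -Xmx1024m and a 4096 MB budget for JVM heaps, max_parallel_sessions(4096, "-Xmx1024m") gives 4, and passing that value as max_sessions keeps no more than four inserts running at once.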
5. Uninstalling the plug-in
DROP EXTENSION ftutilx;