The ftutilx plug-in for the KingbaseES (Jincang) database

2022-06-25 11:07:00 Thousands of sails pass by the side of the sunken boat_


1. Plug-in introduction


ftutilx is a KingbaseES extension. It is mainly used to extract text content from blob fields that store files as byte streams. The blob field may contain files in pdf, doc, docx, wps, xls, xlsx, ppt, and pptx formats. The ftutilx plug-in does not support encrypted files.



2. Adding the plug-in


Before using ftutilx, add it to shared_preload_libraries in the kingbase.conf file and restart the KingbaseES database.

shared_preload_libraries = 'ftutilx' # (change requires restart)
Then connect to KingbaseES and create the extension:
CREATE EXTENSION ftutilx;


3. Parameter configuration


ftutilx.max_string_length
The maximum length of the extraction result. The default value is 128M. A change to this parameter takes effect immediately.
ftutilx.jvm_option_string
The JVM initialization options. The default value is "-Xmx1024m,-Xms1024m,-Xmn256m,-XX:MetaspaceSize=64m,-XX:MaxMetaspaceSize=128m,-XX:CompressedClassSpaceSize=256m". This parameter is read only when the JVM is created, which happens the first time the extracttext function is called in a session process. Changing it after that point has no effect.
Under the database's default extension loading mechanism, an extension's dynamic library is not loaded when a new session starts, even if the extension has already been created. It is loaded only when an interface in the extension is called for the first time. As a result, setting the extension's parameters in a new session before that first call has no effect. The solution is to add ftutilx to either shared_preload_libraries or session_preload_libraries in the database configuration file, so that the ftutilx dynamic library is loaded as soon as a new session starts and its parameters can then be set.
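
The preloading workaround can be checked from a fresh session. The statements below are a minimal sketch, assuming that kingbase.conf already lists ftutilx in session_preload_libraries, that KingbaseES accepts the PostgreSQL-style SHOW and SET statements for these parameters, and that the parameters may be changed at session level. The -Xmx and -Xms values are illustrative only.

```sql
-- Assumes kingbase.conf contains a line such as:
--   session_preload_libraries = 'ftutilx'
-- and that this session was started after the configuration took effect.

-- Confirm that the library is listed for preloading.
SHOW session_preload_libraries;

-- With the library preloaded, the extension parameters are already known
-- to the session and can be inspected before any extracttext call.
SHOW ftutilx.max_string_length;
SHOW ftutilx.jvm_option_string;

-- Assumption: the JVM options can be set per session. This must happen
-- before the first extracttext call, because the JVM is created then.
SET ftutilx.jvm_option_string = '-Xmx512m,-Xms512m';
```

If the SHOW statements fail with an unrecognized-parameter error, the library was most likely not preloaded for the session.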


 

4. Using ftutilx


The ftutilx plug-in provides the extracttext function, which extracts the contents of files stored in blob fields. extracttext accepts a blob argument holding a file's contents and returns the extracted content as a value of type text.

CREATE TABLE tab (title text, body blob);
INSERT INTO tab VALUES ('test.doc', blob_import('/home/test/data.doc'));
SELECT title, length(extracttext(body)) FROM tab;
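
To spot-check what was actually extracted, rather than only its length, a short preview of each document can be selected. This is a sketch that reuses the tab table from the example above and assumes the PostgreSQL-style left function is available in KingbaseES. The 200-character cutoff is arbitrary.

```sql
-- Show the first 200 characters of the text extracted from each file.
-- Each row triggers a full extraction, so this can be slow on large tables.
SELECT title,
       left(extracttext(body), 200) AS preview
FROM tab;
```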

4.1. Using ftutilx together with full-text search
Because extracting the content of electronic documents is slow, you can improve full-text search performance by adding a stored column to the table. The column holds either the extracted text or the list of word positions.
Scheme 1:
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION zhparsercfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION zhparsercfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;

CREATE EXTENSION ftutilx;
CREATE TABLE tab (title text, body blob);

ALTER TABLE tab ADD COLUMN content text GENERATED ALWAYS AS (extracttext(body)) STORED;
CREATE INDEX tab_idx ON tab USING GIN (to_tsvector('zhparsercfg', content));

INSERT INTO tab VALUES ('test.doc', blob_import('/home/test/data.doc'));

SELECT title FROM tab WHERE to_tsvector('zhparsercfg', content) @@ to_tsquery('journal');
Scheme 2:
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION zhparsercfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION zhparsercfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;

CREATE EXTENSION ftutilx;
CREATE TABLE tab (title text, body blob);

ALTER TABLE tab ADD COLUMN tab_idx_col tsvector GENERATED ALWAYS AS (to_tsvector('zhparsercfg', extracttext(body))) STORED;
CREATE INDEX tab_idx ON tab USING GIN (tab_idx_col);

INSERT INTO tab VALUES ('test.doc', blob_import('/home/test/data.doc'));

SELECT title FROM tab WHERE tab_idx_col @@ to_tsquery('journal');


4.2. Notes
1) ftutilx depends on the jre-1.8.0 runtime environment. After deployment, the LD_LIBRARY_PATH system environment variable must include the path to the jre-1.8.0 libjvm.so.
2) The ftutilx.max_string_length parameter configures the maximum length of the extraction result. However, a tsvector currently supports at most (1M-1) bytes, so when extracttext is combined with to_tsvector, the size of the word segmentation result cannot exceed (1M-1) bytes.
3) ftutilx needs to create a JVM, which occupies a considerable amount of memory. Adjusting -Xmx in ftutilx.jvm_option_string limits the JVM's memory footprint, but too small a -Xmx value causes the JVM to raise an out-of-memory exception when parsing large files.
4) With the full-text search schemes above, on a system with little memory you need to limit the number of session processes that insert data in parallel, since each session process creates its own JVM and system memory could otherwise be exhausted.
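
For note 2, one way to anticipate documents that may hit the tsvector limit is to look at the size of the extracted text before building the index. The query below is a rough sketch that reuses the tab table from the earlier examples and assumes the PostgreSQL-style octet_length function is available. The byte size of the text is only a proxy: the actual tsvector size depends on the word segmentation result and can be smaller or larger.

```sql
-- List the ten documents with the largest extracted text, in bytes.
-- Documents near or above 1M bytes are the ones most likely to fail
-- in to_tsvector and are worth checking by hand.
SELECT title,
       octet_length(extracttext(body)) AS extracted_bytes
FROM tab
ORDER BY extracted_bytes DESC
LIMIT 10;
```

In Scheme 1, where the extracted text is already stored in the content column, octet_length(content) gives the same figure without running the extraction again.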


 

5. Uninstalling the plug-in

DROP EXTENSION ftutilx;
