PaddleNLP UIE relationship extraction model [executive relationship extraction as an example]
2022-07-25 14:22:00 【Ting】
Review of previous projects:
PaddleNLP and UIE model in practice: entity extraction tasks [taxi data, express bills]
Application practice: an all-in-one classification model [PaddleHub, Finetune, prompt]
Link to this project: just fork it and it can be reproduced directly.
0. Background introduction
This project demonstrates how to fine-tune the model with a small number of samples to complete relationship extraction.
Dataset overview:
Executive dataset demo:
Ma Yun, from Hangzhou, Zhejiang Province, is one of the main founders of Alibaba Group. He currently serves as chairman of the board and CEO of Alibaba Group. He was the first mainland entrepreneur to appear on the cover of Forbes in the 50 years since the magazine was founded, and was once selected as a future global leader.
Ren Zhengfei is the founder and president of Huawei, a private telecommunications equipment enterprise in mainland China. His theory and practice of corporate "crisis management" have had a wide impact both inside and outside the industry.
Ma Huateng is one of the main founders of Tencent and now serves as chairman of the board and CEO of Tencent Holdings. A native Shenzhen entrepreneur, he majored in computer science and applications at Shenzhen University and obtained a Bachelor of Science degree there in 1993.
Robin Li is the founder, chairman and CEO of Baidu and is fully responsible for Baidu's strategic planning and operational management. After years of development, Baidu has firmly held more than 70% of the Chinese search engine market.
Lei Jun: in August 2012, Xiaomi, the company he invested in and founded, officially released the Xiaomi mobile phone.
Liu Qiangdong, from Suyu District, Suqian City, Jiangsu Province, is the CEO of JD.com (360buy). He graduated from the Sociology Department of Renmin University of China in 1996.
Liu Chuanzhi, a famous Chinese entrepreneur and investor, former chairman of Lenovo Holdings Limited and chairman of the board of directors of Lenovo Group Co., Ltd.
{"id":1845,"text":" Ma Yun is from Hangzhou, Zhejiang Province , One of the main founders of Alibaba Group . Currently, he is the chairman and CEO of Alibaba Group , He is 《 Forbes 》 The magazine was founded 50 The first mainland entrepreneur to become the cover person for many years , Once elected as the future global leader .","entities":[{"id":945,"label":" The person's name ","start_offset":0,"end_offset":2},{"id":946,"label":" company ","start_offset":10,"end_offset":16}],"relations":[{"id":11,"from_id":945,"to_id":946,"type":" senior executive "}]}
{"id":1846,"text":" Ren Zhengfei is a private telecommunications equipment enterprise in Chinese Mainland - The founder and President of Huawei . He is about business “ Crisis management ” Its theory and practice have had a wide impact both inside and outside the industry .","entities":[{"id":949,"label":" The person's name ","start_offset":0,"end_offset":3},{"id":950,"label":" company ","start_offset":19,"end_offset":23}],"relations":[{"id":13,"from_id":949,"to_id":950,"type":" senior executive "}]}
{"id":1847,"text":" ma , He is one of the main founders of Tencent and now serves as the chairman and CEO of the company's holding board . As a native entrepreneur in Shenzhen , He majored in computer and application in Shenzhen University , On 1993 Obtained a Bachelor of Science degree from Shenzhen University in .","entities":[{"id":954,"label":" The person's name ","start_offset":0,"end_offset":3},{"id":955,"label":" company ","start_offset":5,"end_offset":7}],"relations":[{"id":16,"from_id":954,"to_id":955,"type":" senior executive "}]}
{"id":1848,"text":" Robin Lee is the founder, chairman and CEO of Baidu , Be fully responsible for the strategic planning and operation management of Baidu company , After years of development , Baidu has firmly occupied the Chinese search engine more than 7 Market share achieved .","entities":[{"id":932,"label":" The person's name ","start_offset":0,"end_offset":3},{"id":933,"label":" company ","start_offset":4,"end_offset":8},{"id":934,"label":" company ","start_offset":25,"end_offset":29}],"relations":[{"id":6,"from_id":932,"to_id":933,"type":" senior executive "}]}
{"id":1849,"text":" Lei jun , 2012 year 8 In June, Xiaomi company, which it invested and founded, officially released Xiaomi mobile phone .","entities":[{"id":941,"label":" The person's name ","start_offset":0,"end_offset":2},{"id":942,"label":" company ","start_offset":17,"end_offset":21}],"relations":[{"id":9,"from_id":941,"to_id":942,"type":" senior executive "}]}
1. Data loading
When annotating data, be careful not to mislabel the relations; see the previous article for detailed annotation teaching.
doccano_file: the data annotation file exported from doccano.
save_dir: directory where the training data is stored; defaults to the data directory.
negative_ratio: maximum negative example ratio. This parameter is only valid for extraction-type tasks; properly constructed negative examples can improve model performance. The number of negative examples is related to the actual number of labels: maximum number of negative examples = negative_ratio * number of positive examples (e.g., with negative_ratio=5 and 4 positive examples, at most 20 negatives are built). This parameter only applies to the training set and defaults to 5. To keep the evaluation metrics accurate, the validation and test sets are built with all negative examples by default.
splits: proportions of the training, validation and test sets when splitting the dataset. The default [0.8, 0.1, 0.1] splits the data 8:1:1 into training, validation and test sets.
task_type: task type; extraction and classification tasks are supported.
options: category labels for a classification task; only valid for classification-type tasks. Defaults to ["positive", "negative"].
prompt_prefix: prompt prefix declared for a classification task; only valid for classification-type tasks. Defaults to "sentiment tendency".
is_shuffle: whether to shuffle the dataset; defaults to True.
seed: random seed; defaults to 1000.
separator: separator between the entity category/evaluation dimension and the classification label; only valid for entity-level/aspect-level classification tasks. Defaults to "##".
import os
import time
import argparse
import json

import numpy as np

from utils_1 import set_seed, convert_ext_examples


def do_convert():
    set_seed(args.seed)

    tic_time = time.time()
    if not os.path.exists(args.input_file):
        raise ValueError("Please input the correct path of doccano file.")

    if not os.path.exists(args.save_dir):
        os.makedirs(args.save_dir)

    if len(args.splits) != 0 and len(args.splits) != 3:
        raise ValueError("Only []/ len(splits)==3 accepted for splits.")

    if args.splits and sum(args.splits) != 1:
        raise ValueError(
            "Please set correct splits, sum of elements in splits should be equal to 1."
        )

    with open(args.input_file, "r", encoding="utf-8") as f:
        raw_examples = f.readlines()

    def _create_ext_examples(examples, negative_ratio=0, shuffle=False):
        # Convert doccano records into UIE prompt examples (entity prompts plus
        # relation prompts) and optionally shuffle them.
        entities, relations = convert_ext_examples(examples, negative_ratio)
        examples = [e + r for e, r in zip(entities, relations)]
        if shuffle:
            indexes = np.random.permutation(len(examples))
            examples = [examples[i] for i in indexes]
        return examples

    def _save_examples(save_dir, file_name, examples):
        count = 0
        save_path = os.path.join(save_dir, file_name)
        with open(save_path, "w", encoding="utf-8") as f:
            for example in examples:
                for x in example:
                    f.write(json.dumps(x, ensure_ascii=False) + "\n")
                    count += 1
        print("\nSave %d examples to %s." % (count, save_path))

    if len(args.splits) == 0:
        # No split requested: everything goes into train.txt.
        examples = _create_ext_examples(raw_examples, args.negative_ratio,
                                        args.is_shuffle)
        _save_examples(args.save_dir, "train.txt", examples)
    else:
        if args.is_shuffle:
            indexes = np.random.permutation(len(raw_examples))
            raw_examples = [raw_examples[i] for i in indexes]

        i1, i2, _ = args.splits
        p1 = int(len(raw_examples) * i1)
        p2 = int(len(raw_examples) * (i1 + i2))

        # Negative examples are only constructed with negative_ratio for the
        # training split; dev/test keep all negatives for accurate metrics.
        train_examples = _create_ext_examples(
            raw_examples[:p1], args.negative_ratio, args.is_shuffle)
        dev_examples = _create_ext_examples(raw_examples[p1:p2])
        test_examples = _create_ext_examples(raw_examples[p2:])

        _save_examples(args.save_dir, "train.txt", train_examples)
        _save_examples(args.save_dir, "dev.txt", dev_examples)
        _save_examples(args.save_dir, "test.txt", test_examples)

    print('Finished! It takes %.2f seconds' % (time.time() - tic_time))


if __name__ == "__main__":
    # yapf: disable
    parser = argparse.ArgumentParser()

    parser.add_argument("--input_file", default="./data/data.json", type=str, help="The data file exported from doccano platform.")
    parser.add_argument("--save_dir", default="./data", type=str, help="The path to save processed data.")
    parser.add_argument("--negative_ratio", default=5, type=int, help="Used only for the extraction task: number of negative samples = negative_ratio * number of positive samples.")
    parser.add_argument("--splits", default=[0.8, 0.1, 0.1], type=float, nargs="*", help="The ratio of samples in datasets. [0.6, 0.2, 0.2] means 60% samples used for training, 20% for evaluation and 20% for test.")
    # NOTE: argparse's type=bool treats any non-empty string as True.
    parser.add_argument("--is_shuffle", default=True, type=bool, help="Whether to shuffle the labeled dataset, defaults to True.")
    parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.")

    args = parser.parse_args()
    # yapf: enable

    do_convert()
!python preprocess.py --input_file ./data/gaoguan.jsonl \
--save_dir ./data/ \
--negative_ratio 5 \
--splits 0.85 0.15 0 \
--seed 1000
Converting doccano data...
100%|██████████████████████████████████████████| 8/8 [00:00<00:00, 21358.65it/s]
Adding negative samples for first stage prompt...
100%|██████████████████████████████████████████| 8/8 [00:00<00:00, 88534.12it/s]
Constructing relation prompts...
Adding negative samples for second stage prompt...
100%|██████████████████████████████████████████| 8/8 [00:00<00:00, 24070.61it/s]
Converting doccano data...
100%|██████████████████████████████████████████| 2/2 [00:00<00:00, 25420.02it/s]
Adding negative samples for first stage prompt...
100%|██████████████████████████████████████████| 2/2 [00:00<00:00, 45839.39it/s]
Constructing relation prompts...
Adding negative samples for second stage prompt...
100%|██████████████████████████████████████████| 2/2 [00:00<00:00, 30066.70it/s]
Converting doccano data...
0it [00:00, ?it/s]
Adding negative samples for first stage prompt...
0it [00:00, ?it/s]
Save 64 examples to ./data/train.txt.
Save 6 examples to ./data/dev.txt.
Part of the converted output:
{"content": " Chief architect of Netease , ding 1997 year 6 In September, Netease was founded , Netease has developed from a private enterprise with more than a dozen people to today's nearly 3000 Well known Internet technology companies whose employees are publicly listed in the United States .", "result_list": [{"text": " ding ", "start": 12, "end": 14}], "prompt": " The person's name "}
{"content": " Chief architect of Netease , ding 1997 year 6 In September, Netease was founded , Netease has developed from a private enterprise with more than a dozen people to today's nearly 3000 Well known Internet technology companies whose employees are publicly listed in the United States .", "result_list": [{"text": " Netease company ", "start": 0, "end": 4}, {"text": " Netease company ", "start": 23, "end": 27}], "prompt": " company "}
{"content": " Chief architect of Netease , ding 1997 year 6 In September, Netease was founded , Netease has developed from a private enterprise with more than a dozen people to today's nearly 3000 Well known Internet technology companies whose employees are publicly listed in the United States .", "result_list": [{"text": " ding ", "start": 12, "end": 14}, {"text": " ding ", "start": 12, "end": 14}], "prompt": " Netease executives "}
{"content": " Robin Lee is the founder, chairman and CEO of Baidu , Be fully responsible for the strategic planning and operation management of Baidu company , After years of development , Baidu has firmly occupied the Chinese search engine more than 7 Market share achieved .", "result_list": [{"text": " Robin Li ", "start": 0, "end": 3}], "prompt": " The person's name "}
{"content": " Robin Lee is the founder, chairman and CEO of Baidu , Be fully responsible for the strategic planning and operation management of Baidu company , After years of development , Baidu has firmly occupied the Chinese search engine more than 7 Market share achieved .", "result_list": [{"text": " Baidu company ", "start": 4, "end": 8}], "prompt": " company "}
{"content": " Robin Lee is the founder, chairman and CEO of Baidu , Be fully responsible for the strategic planning and operation management of Baidu company , After years of development , Baidu has firmly occupied the Chinese search engine more than 7 Market share achieved .", "result_list": [{"text": " Robin Li ", "start": 0, "end": 3}], "prompt": " Baidu executives "}
2. Model training
import argparse
import time
import os
from functools import partial

import paddle
from paddle.utils.download import get_path_from_url
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import ErnieTokenizer

from model import UIE
from utils_1 import set_seed, convert_example, reader, evaluate, create_dataloader, SpanEvaluator
from visualdl import LogWriter


def do_train():
    paddle.set_device(args.device)
    rank = paddle.distributed.get_rank()
    if paddle.distributed.get_world_size() > 1:
        paddle.distributed.init_parallel_env()

    set_seed(args.seed)

    hidden_size = 768
    # NOTE: this URL points at the uie-base weights; it should match --model.
    url = "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/model_state.pdparams"
    tokenizer = ErnieTokenizer.from_pretrained('ernie-3.0-base-zh')
    model = UIE('ernie-3.0-base-zh', hidden_size)

    if args.init_from_ckpt is not None:
        pretrained_model_path = args.init_from_ckpt
    else:
        pretrained_model_path = os.path.join(args.model, "model_state.pdparams")
        if not os.path.exists(pretrained_model_path):
            get_path_from_url(url, args.model)
    state_dict = paddle.load(pretrained_model_path)
    model.set_dict(state_dict)
    print("Init from: {}".format(pretrained_model_path))

    if paddle.distributed.get_world_size() > 1:
        model = paddle.DataParallel(model)

    train_ds = load_dataset(
        reader,
        data_path=args.train_path,
        max_seq_len=args.max_seq_len,
        lazy=False)
    dev_ds = load_dataset(
        reader,
        data_path=args.dev_path,
        max_seq_len=args.max_seq_len,
        lazy=False)

    trans_func = partial(
        convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len)

    train_data_loader = create_dataloader(
        dataset=train_ds,
        mode='train',
        batch_size=args.batch_size,
        trans_fn=trans_func)
    dev_data_loader = create_dataloader(
        dataset=dev_ds,
        mode='dev',
        batch_size=args.batch_size,
        trans_fn=trans_func)

    optimizer = paddle.optimizer.AdamW(
        learning_rate=args.learning_rate, parameters=model.parameters())

    criterion = paddle.nn.BCELoss()
    metric = SpanEvaluator()

    # Initialize the VisualDL loggers.
    writer = LogWriter("./log/scalar_test")
    writer1 = LogWriter("./log/scalar_test1")

    loss_list = []
    global_step = 0
    best_step = 0
    best_f1 = 0
    tic_train = time.time()
    for epoch in range(1, args.num_epochs + 1):
        for batch in train_data_loader:
            input_ids, token_type_ids, att_mask, pos_ids, start_ids, end_ids = batch
            start_prob, end_prob = model(input_ids, token_type_ids, att_mask,
                                         pos_ids)
            start_ids = paddle.cast(start_ids, 'float32')
            end_ids = paddle.cast(end_ids, 'float32')
            # UIE predicts per-token start/end probabilities; both heads are
            # trained with binary cross entropy and the losses are averaged.
            loss_start = criterion(start_prob, start_ids)
            loss_end = criterion(end_prob, end_ids)
            loss = (loss_start + loss_end) / 2.0
            loss.backward()
            optimizer.step()
            optimizer.clear_grad()
            loss_list.append(float(loss))

            global_step += 1
            if global_step % args.logging_steps == 0 and rank == 0:
                time_diff = time.time() - tic_train
                loss_avg = sum(loss_list) / len(loss_list)
                writer.add_scalar(tag="train/loss", step=global_step, value=loss_avg)  # Record loss
                print(
                    "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s"
                    % (global_step, epoch, loss_avg,
                       args.logging_steps / time_diff))
                tic_train = time.time()

            if global_step % args.valid_steps == 0 and rank == 0:
                # save_dir = os.path.join(args.save_dir, "model_%d" % global_step)
                # if not os.path.exists(save_dir):
                #     os.makedirs(save_dir)
                # save_param_path = os.path.join(save_dir, "model_state.pdparams")
                # paddle.save(model.state_dict(), save_param_path)

                precision, recall, f1 = evaluate(model, metric, dev_data_loader)
                writer1.add_scalar(tag="train/precision", step=global_step, value=precision)
                writer1.add_scalar(tag="train/recall", step=global_step, value=recall)
                writer1.add_scalar(tag="train/f1", step=global_step, value=f1)
                print("Evaluation precision: %.5f, recall: %.5f, F1: %.5f" %
                      (precision, recall, f1))
                if f1 > best_f1:
                    print(
                        f"best F1 performance has been updated: {best_f1:.5f} --> {f1:.5f}"
                    )
                    best_f1 = f1
                    save_dir = os.path.join(args.save_dir, "model_best")
                    save_best_param_path = os.path.join(save_dir,
                                                        "model_state.pdparams")
                    paddle.save(model.state_dict(), save_best_param_path)
                tic_train = time.time()


if __name__ == "__main__":
    # yapf: disable
    parser = argparse.ArgumentParser()

    parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for training.")
    # parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.")
    parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.")
    parser.add_argument("--train_path", default="./data/train.txt", type=str, help="The path of train set.")
    parser.add_argument("--dev_path", default="./data/dev.txt", type=str, help="The path of dev set.")
    parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.")
    parser.add_argument("--max_seq_len", default=512, type=int, help="The maximum input sequence length. Sequences longer than this will be truncated, sequences shorter will be padded.")
    # This parameter determines the amount of training.
    parser.add_argument("--num_epochs", default=1, type=int, help="Total number of training epochs to perform.")
    # parser.add_argument("--num_epochs", default=100, type=int, help="Total number of training epochs to perform.")
    parser.add_argument("--seed", default=1000, type=int, help="Random seed for initialization.")
    parser.add_argument("--logging_steps", default=1, type=int, help="The interval steps to logging.")
    parser.add_argument("--valid_steps", default=2, type=int, help="The interval steps to evaluate model performance.")
    parser.add_argument('--device', choices=['cpu', 'gpu'], default="cpu", help="Select which device to train the model on.")
    parser.add_argument("--model", choices=["uie-base", "uie-tiny"], default="uie-tiny", type=str, help="Select the pretrained model for few-shot learning.")
    # Path for model parameter initialization.
    parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of model parameters for initialization.")

    args = parser.parse_args()
    # yapf: enable

    # Override --device with whatever device is actually available.
    args.device = paddle.device.get_device()

    do_train()
!python finetune.py \
--train_path "./data/train.txt" \
--dev_path "./data/dev.txt" \
--save_dir "./checkpoint" \
--learning_rate 1e-5 \
--batch_size 8 \
--max_seq_len 512 \
--num_epochs 50 \
--model "uie-base" \
--seed 1000 \
--logging_steps 10 \
--valid_steps 50 \
--device "gpu"
Some of the training output is shown below (the full output has been folded):
global step 640, epoch: 80, loss: 0.00002, speed: 3.99 step/s
global step 650, epoch: 82, loss: 0.00002, speed: 3.87 step/s
global step 660, epoch: 83, loss: 0.00002, speed: 3.99 step/s
global step 670, epoch: 84, loss: 0.00002, speed: 4.03 step/s
global step 680, epoch: 85, loss: 0.00002, speed: 4.00 step/s
global step 690, epoch: 87, loss: 0.00002, speed: 3.88 step/s
global step 700, epoch: 88, loss: 0.00002, speed: 3.99 step/s
Evaluation precision: 1.00000, recall: 0.85714, F1: 0.92308
global step 710, epoch: 89, loss: 0.00002, speed: 3.99 step/s
global step 720, epoch: 90, loss: 0.00002, speed: 4.00 step/s
global step 730, epoch: 92, loss: 0.00002, speed: 3.86 step/s
global step 740, epoch: 93, loss: 0.00002, speed: 3.97 step/s
global step 750, epoch: 94, loss: 0.00002, speed: 3.99 step/s
global step 760, epoch: 95, loss: 0.00002, speed: 3.99 step/s
global step 770, epoch: 97, loss: 0.00002, speed: 3.86 step/s
global step 780, epoch: 98, loss: 0.00002, speed: 3.98 step/s
global step 790, epoch: 99, loss: 0.00002, speed: 4.00 step/s
global step 800, epoch: 100, loss: 0.00001, speed: 4.00 step/s
Evaluation precision: 1.00000, recall: 0.85714, F1: 0.92308
A GPU environment is recommended; otherwise out-of-memory errors may occur. In a CPU environment, you can change the model to uie-tiny and adjust batch_size appropriately.
To improve accuracy, set --num_epochs higher for longer training.
Configurable parameters:
train_path: training set file path.
dev_path: validation set file path.
save_dir: model checkpoint path; defaults to ./checkpoint.
learning_rate: learning rate; defaults to 1e-5.
batch_size: batch size; adjust it according to your GPU memory, and lower it appropriately if memory runs out. Defaults to 16.
max_seq_len: maximum text length; inputs longer than the maximum length are automatically split. Defaults to 512.
num_epochs: number of training epochs; defaults to 100.
model: model to fine-tune; uie-base and uie-tiny are available. Defaults to uie-base.
seed: random seed; defaults to 1000.
logging_steps: interval (in steps) between log prints; defaults to 10.
valid_steps: interval (in steps) between evaluations; defaults to 100.
device: device used for training; cpu or gpu.
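The training script above logs the loss and the precision/recall/F1 curves to ./log via VisualDL. A sketch for viewing the curves in the browser (assuming VisualDL is installed, as the import in finetune.py already requires):

!visualdl --logdir ./log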
3. Model evaluation
!python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path ./data/dev.txt \
--batch_size 8 \
--max_seq_len 512
[2022-07-25 00:17:12,509] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best'.
W0725 00:17:12.535902 1512 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0725 00:17:12.540369 1512 gpu_resources.cc:91] device: 0, cuDNN Version: 7.6.
[2022-07-25 00:17:16,896] [ INFO] - -----------------------------
[2022-07-25 00:17:16,897] [ INFO] - Class Name: all_classes
[2022-07-25 00:17:16,897] [ INFO] - Evaluation Precision: 1.00000 | Recall: 0.85714 | F1: 0.92308
**model_path**: folder path of the model to evaluate; the path must contain the model weight file model_state.pdparams and the configuration file model_config.json.
**test_path**: test set file used for evaluation.
**batch_size**: batch size; adjust it according to your machine. Defaults to 16.
**max_seq_len**: maximum text length; inputs longer than the maximum length are automatically split. Defaults to 512.
**model**: model to use; uie-base, uie-medium, uie-mini, uie-micro and uie-nano are available. Defaults to uie-base.
**debug**: whether to enable debug mode, which evaluates each positive example category separately. This mode is only for model debugging and is off by default.
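When the aggregate metrics look suspicious, per-category scores help locate the problem. A sketch that re-runs the evaluation in debug mode, reusing the flags documented above:

!python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ./data/dev.txt \
    --batch_size 8 \
    --max_seq_len 512 \
    --debug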
4. Prediction
# Relationship extraction
from pprint import pprint
import json

from paddlenlp import Taskflow


def openreadtxt(file_name):
    # Read the input file line by line.
    data = []
    with open(file_name, 'r', encoding='UTF-8') as file:
        for row in file.readlines():
            data.append(row)
    return data


data_input = openreadtxt('./input/test.txt')

# Schema: for each "company" entity, extract its "senior executive" relation.
schema = {"company": "senior executive"}
few_ie = Taskflow('information_extraction', schema=schema, batch_size=16, task_path='./checkpoint/model_best')

results = few_ie(data_input)

# 'w+' creates the file if it does not exist and truncates it before writing.
with open("./output/result.txt", "w+", encoding='UTF-8') as f:
    for result in results:
        # json.dumps escapes Chinese to ASCII by default; ensure_ascii=False keeps the real characters.
        line = json.dumps(result, ensure_ascii=False)
        f.write(line + "\n")
print("Data results have been exported")

for idx, text in enumerate(data_input):
    print(data_input[idx], results[idx])
Input file:
Huang Zheng, born in 1980 in Hangzhou, Zhejiang, is the founder of Pinduoduo. He graduated from Zhejiang University and holds a master's degree from the University of Wisconsin-Madison.
The founder of Bilibili is Xu Yi. Xu Yi was the earliest founder of Bilibili but has always stayed behind the scenes and is not particularly public. He used to be a member of AcFun's bullet-screen community, then set up his own website modeled on AcFun, and is now a director.
Output:
Huang Zheng, born in 1980 in Hangzhou, Zhejiang, is the founder of Pinduoduo. He graduated from Zhejiang University and holds a master's degree from the University of Wisconsin-Madison.
{'company': [{'text': 'Pinduoduo', 'start': 16, 'end': 19, 'probability': 0.935215170074585, 'relations': {'senior executive': [{'text': 'Huang Zheng', 'start': 0, 'end': 2, 'probability': 0.9996391253586268}]}}]}
The founder of Bilibili is Xu Yi. Xu Yi was the earliest founder of Bilibili but has always stayed behind the scenes and is not particularly public. He used to be a member of AcFun's bullet-screen community, then set up his own website modeled on AcFun, and is now a director. {'company': [{'text': 'Bilibili', 'start': 0, 'end': 6, 'probability': 0.7246855227849665, 'relations': {'senior executive': [{'text': 'Xu Yi', 'start': 11, 'end': 13, 'probability': 0.9985462800938478}]}}]}
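The nested result dictionaries above can be flattened into (company, executive) pairs. A minimal sketch based on the output structure shown (the key names follow the schema used in this project):

for text, result in zip(data_input, results):
    for company in result.get("company", []):
        # Each company span carries its extracted relations under "relations".
        for executive in company.get("relations", {}).get("senior executive", []):
            print(company["text"], "->", executive["text"])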
5. Summary
UIE (Universal Information Extraction): Yaojie Lu et al. proposed UIE, a unified framework for universal information extraction, at ACL 2022. The framework models entity extraction, relationship extraction, event extraction and sentiment analysis tasks in a unified way, giving different tasks good transfer and generalization ability. Drawing on this paper, PaddleNLP trained and open-sourced the first Chinese universal information extraction model, UIE, on top of the knowledge-enhanced pre-trained model ERNIE 3.0. The model supports extracting key information without restricting the industry domain or the extraction targets, enables zero-shot rapid cold start, and has excellent few-shot fine-tuning ability to quickly adapt to specific extraction targets.
Advantages of UIE
Easy to use: users can specify extraction targets in natural language, and the corresponding information can be extracted from the input text without any training. It works out of the box and covers all kinds of information extraction needs.
High efficiency: previous information extraction techniques required large amounts of labeled data to guarantee quality. Open-domain information extraction enables zero-shot or few-shot extraction, which greatly reduces the dependence on labeled data, lowering cost while improving results.
Leading performance: open-domain information extraction has been applied in many scenarios and achieves excellent performance on a variety of tasks.
This post mainly shared relationship extraction through this case to round out the demo project. Interested readers can try cross-task extraction, as well as multi-entity and multi-relation extraction; a sketch of a multi-relation schema follows below.
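For multi-relation extraction, Taskflow accepts a schema that maps a subject entity type to a list of relation prompts, extracting several relations per subject in one pass. A minimal sketch; the extra "founder" label here is hypothetical and, like "senior executive", must match the labels your model was trained with:

from paddlenlp import Taskflow

# One subject type, several relation prompts (the "founder" label is illustrative).
schema = {"company": ["senior executive", "founder"]}
ie = Taskflow('information_extraction', schema=schema, task_path='./checkpoint/model_best')
print(ie("Huang Zheng, born in 1980 in Hangzhou, Zhejiang, is the founder of Pinduoduo."))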
On the open-source datasets I have evaluated so far, F1 is between 85% and 90%; given the difficulty of the datasets, this is generally in line with expectations. Feel free to leave a message if you run into problems.
My blog: https://blog.csdn.net/sinat_39620217?type=blog