[Knowledge Graph] Practice: A Question Answering System Based on a Medical Knowledge Graph (Part 3): Rule-Based Question Classification
2022-07-25 16:38:00 【科皮子菊】
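Background
Building on the previous parts, we can assume that a knowledge base capable of answering medical questions is now in place. In a pipeline-style QA system, the first step after receiving a question is usually to classify it, so that each kind of question can be handled and answered in a more targeted way. This question classification is often also called intent recognition. Intent recognition, or question classification, is essentially text classification, so it can be handled with traditional machine learning or deep learning algorithms. When labeled data is lacking, however, rules may well be the most practical option, and that is what the original project does.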
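Rule-based question classification
The module that loads the knowledge graph data into the database also provides an entity export function, and the exported files are lists of entity names. The original source code additionally provides a list of negation words, deny.txt, which I have also placed in the dict folder. Together these files are the feature words used for rule-based classification. The main purpose of question classification is to prepare for the next steps: parsing the question according to its category and searching for the answer.
Now we can design the question classification class in KGQAMedicine\question_classify\rule_question_classify.py
To check quickly whether a question contains words from the feature dictionaries, we use the ahocorasick package. Install it with: pip install pyahocorasick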
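To make clear what such an automaton does, here is a minimal pure-Python sketch of Aho-Corasick matching. It is only an illustration under my own assumptions: the function names build_automaton and find_all are hypothetical, the three sample words are not taken from the project's dictionaries, and pyahocorasick itself is a compiled extension with its own API.

```python
from collections import deque


def build_automaton(words):
    """Build a tiny Aho-Corasick automaton: goto table, failure links, outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # dictionary words that end at each state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: compute failure links and merge outputs along them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(automaton, text):
    """Return (end_index, word) for every dictionary word found in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits.extend((i, word) for word in out[state])
    return hits


automaton = build_automaton(["感冒", "流鼻涕", "鼻涕"])
print(find_all(automaton, "感冒了老流鼻涕"))
```

The text is scanned once regardless of how many words are in the dictionary, which is why this approach suits large entity lists. Note that overlapping words such as 流鼻涕 and 鼻涕 are both reported, so the caller has to decide which one to keep.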
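The first step of question classification is to check whether the question mentions any entity stored in the graph database. If it does not, no relevant query can be made and no answer can be given.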
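As a rough sketch of this first step, the snippet below looks up entities with a plain substring scan and keeps only the longest match when one entity word is nested inside another. WORD_KINDS and find_entities are hypothetical names, and the four entries are made-up samples rather than the project's real dictionary.

```python
# Hypothetical mini-dictionary: entity word -> entity kinds (not the project's real data).
WORD_KINDS = {
    "感冒": ["disease"],
    "鼻涕": ["symptom"],
    "流鼻涕": ["symptom"],
    "蜂蜜": ["food"],
}


def find_entities(question, word_kinds=WORD_KINDS):
    """Return {entity: kinds} for entities in the question, keeping only the longest matches."""
    matched = [word for word in word_kinds if word in question]
    # Drop a word when it is contained in another, longer matched word.
    kept = [w for w in matched if not any(w != other and w in other for other in matched)]
    return {w: word_kinds[w] for w in kept}


print(find_entities("最近老流鼻涕怎么办?"))
print(find_entities("今天天气怎么样"))
```

An empty result means the question mentions nothing the graph knows about, so classification can stop early.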
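The rule-based classification relies mainly on keyword matching. The following question types are supported:
| Question type | Meaning | Example question |
|---|---|---|
| disease_symptom | Symptoms of a disease | 乳腺癌的症状有哪些? (What are the symptoms of breast cancer?) |
| symptom_disease | Possible diseases for a known symptom | 最近老流鼻涕怎么办? (I keep getting a runny nose lately; what should I do?) |
| disease_cause | Cause of a disease | 为什么有的人会失眠? (Why do some people suffer from insomnia?) |
| disease_accompany | Complications of a disease | 失眠有哪些并发症? (What complications does insomnia have?) |
| disease_not_food | Foods to avoid with a disease | 失眠的人不要吃啥? (What should people with insomnia not eat?) |
| disease_do_food | Foods recommended for a disease | 耳鸣了吃点啥? (What should I eat for tinnitus?) |
| food_not_disease | Diseases for which a food should be avoided | 哪些人最好不要吃蜂蜜? (Who had better not eat honey?) |
| food_do_disease | Diseases a food is good for | 鹅肉有什么好处? (What are the benefits of goose meat?) |
| disease_drug | Drugs for a disease | 肝病要吃啥药? (What medicine should be taken for liver disease?) |
| drug_disease | Diseases a drug can treat | 板蓝根颗粒能治啥病? (What can Banlangen granules treat?) |
| disease_check | Examinations needed for a disease | 脑膜炎怎么才能查出来? (How can meningitis be detected?) |
| check_disease | Diseases an examination can detect | 全血细胞计数能查出啥来? (What can a complete blood count detect?) |
| disease_prevent | Preventive measures | 怎样才能预防肾虚? (How can kidney deficiency be prevented?) |
| disease_lasttime | Treatment duration | 感冒要多久才能好? (How long does a cold take to clear up?) |
| disease_cureway | Treatment methods | 高血压要怎么治? (How is high blood pressure treated?) |
| disease_cureprob | Probability of cure | 白血病能治好吗? (Can leukemia be cured?) |
| disease_easyget | Population susceptible to a disease | 什么人容易得高血压? (Who is prone to high blood pressure?) |
| disease_desc | Description of a disease | 糖尿病 (Diabetes) |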
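The idea behind the table can be sketched as a small rule table: a label fires when the question contains one of its trigger words and also mentions an entity of the required kind. This is a simplified illustration only; RULES holds a hypothetical, trimmed-down subset of keywords, and the function name classify here is a stand-in, not the project's class.

```python
# Hypothetical, trimmed-down rule table: (keywords, required entity kind, label).
RULES = [
    (["症状", "表现"], "disease", "disease_symptom"),
    (["原因", "为什么", "为啥"], "disease", "disease_cause"),
    (["预防", "防止"], "disease", "disease_prevent"),
    (["治啥", "有什么用"], "drug", "drug_disease"),
]


def classify(question, entity_kinds):
    """Return every label whose keywords and entity kind both match the question."""
    labels = [
        label
        for keywords, kind, label in RULES
        if kind in entity_kinds and any(kw in question for kw in keywords)
    ]
    # Fall back to a plain description when a disease is named but no rule fires.
    if not labels and "disease" in entity_kinds:
        labels.append("disease_desc")
    return labels


print(classify("乳腺癌的症状有哪些?", ["disease"]))
print(classify("糖尿病", ["disease"]))
```

Because every rule is checked independently, one question can receive several labels, which is also how the full keyword lists behave.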
The implementation is as follows:
import os

import ahocorasick
import tqdm

from utils.config import SysConfig


class RuleQuestionClassifier(object):
    # Feature words for each entity kind, filled in from the dict files at init time.
    disease_feature_words = []
    department_feature_words = []
    check_feature_words = []
    drug_feature_words = []
    food_feature_words = []
    producer_feature_words = []
    symptom_feature_words = []
    region_feature_words = set()
    deny_feature_words = []
    # Question keywords (trigger words) for each question type.
    symptom_qwds = ['症状', '表征', '现象', '症候', '表现']
    cause_qwds = ['原因', '成因', '为什么', '怎么会', '怎样才', '咋样才', '怎样会', '如何会', '为啥', '为何', '如何才会', '怎么才会', '会导致', '会造成']
    acompany_qwds = ['并发症', '并发', '一起发生', '一并发生', '一起出现', '一并出现', '一同发生', '一同出现', '伴随发生', '伴随', '共现']
    food_qwds = ['饮食', '饮用', '吃', '食', '伙食', '膳食', '喝', '菜', '忌口', '补品', '保健品', '食谱', '菜谱', '食用', '食物']
    drug_qwds = ['药', '药品', '用药', '胶囊', '口服液', '炎片']
    prevent_qwds = ['预防', '防范', '抵制', '抵御', '防止', '躲避', '逃避', '避开', '免得', '逃开', '避掉', '躲开', '躲掉', '绕开',
                    '怎样才能不', '怎么才能不', '咋样才能不', '咋才能不', '如何才能不', '怎样才不', '怎么才不', '咋样才不', '咋才不',
                    '如何才不', '怎样才可以不', '怎么才可以不', '咋样才可以不', '咋才可以不', '如何可以不', '怎样才可不', '怎么才可不',
                    '咋样才可不', '咋才可不', '如何可不']
    lasttime_qwds = ['周期', '多久', '多长时间', '多少时间', '几天', '几年', '多少天', '多少小时', '几个小时', '多少年']
    cureway_qwds = ['怎么治疗', '如何医治', '怎么医治', '怎么治', '怎么医', '如何治', '医治方式', '疗法', '咋治', '怎么办', '咋办']
    cureprob_qwds = ['多大概率能治好', '多大几率能治好', '治好希望大么', '几率', '几成', '比例', '可能性', '能治', '可治', '可以治', '可以医']
    easyget_qwds = ['易感人群', '容易感染', '易发人群', '什么人', '哪些人', '感染', '染上', '得上']
    check_qwds = ['检查', '检查项目', '查出', '测出', '试出']
    belong_qwds = ['属于什么科', '属于', '什么科', '科室']
    cure_qwds = ['治疗什么', '治啥', '治疗啥', '医治啥', '治愈啥', '主治啥', '主治什么', '有什么用', '有何用', '用处', '用途',
                 '有什么好处', '有什么益处', '有何益处', '用来', '用来做啥', '用来作甚', '需要', '要']

    def __init__(self):
        self.region_actree = None
        self.word_kind_dict = None
        self._init()

    @staticmethod
    def _load_line_file(file_path):
        print(f"load file {file_path}")
        data_list = []
        with open(file_path, 'r', encoding='utf8') as reader:
            for line in reader:
                if not line.strip():
                    continue
                data_list.append(line.strip())
        return data_list
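
    def _init(self):
        # Load the dictionary files. "symptoms.txt" is stored under the
        # "symptom_feature_words" attribute that the rest of the class reads.
        file_list = ["disease", "department", "check", "drug", "food", "producer", "symptoms", "deny"]
        for file_name in file_list:
            data_list = self._load_line_file(os.path.join(SysConfig.DATA_DICT_DIR, file_name + ".txt"))
            attr_name = "symptom" if file_name == "symptoms" else file_name
            setattr(self, attr_name + "_feature_words", data_list)
            # Deny words are negation cues, not entities, so keep them out of the entity set.
            if file_name != "deny":
                self.region_feature_words.update(data_list)
        # Build the Aho-Corasick automaton over all entity words.
        self.region_actree = self._get_actree(list(self.region_feature_words))
        # Map every entity word to the entity kinds it belongs to.
        self._build_word_kind_dict()
        print("object init over")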

    def _build_word_kind_dict(self):
        """Map each entity word to the list of entity kinds it appears under."""
        kinds = ["disease", "department", "check", "drug", "food", "symptom", "producer"]
        # Sets make the membership tests below O(1) instead of scanning each list.
        kind_words = {kind: set(getattr(self, kind + "_feature_words")) for kind in kinds}
        word_kind_dict = {}
        for word in tqdm.tqdm(self.region_feature_words, desc='building word kind dict'):
            word_kind_dict[word] = [kind for kind in kinds if word in kind_words[kind]]
        self.word_kind_dict = word_kind_dict

    @staticmethod
    def _get_actree(key_list):
        actree = ahocorasick.Automaton()
        for index, word in enumerate(key_list):
            actree.add_word(word, (index, word))
        actree.make_automaton()
        return actree

    def classify(self, question):
        classify_res = {}
        medical_dict = self.check_query(question)
        if not medical_dict:
            return {}
        classify_res['args'] = medical_dict
        region_word_kinds = []
        for kinds in medical_dict.values():
            region_word_kinds.extend(kinds)
        question_kinds = []
        # disease symptom
        self.sub_classify(self.symptom_qwds, question, 'disease', region_word_kinds, question_kinds, "disease_symptom")
        # symptom disease
        self.sub_classify(self.symptom_qwds, question, 'symptom', region_word_kinds, question_kinds, "symptom_disease")
        # disease cause
        self.sub_classify(self.cause_qwds, question, 'disease', region_word_kinds, question_kinds, "disease_cause")
        # disease accompany
        self.sub_classify(self.acompany_qwds, question, 'disease', region_word_kinds, question_kinds,
                          "disease_accompany")
        # disease food
        if self.check_words(self.food_qwds, question) and 'disease' in region_word_kinds:
            deny_status = self.check_words(self.deny_feature_words, question)
            if deny_status:
                question_kind = "disease_not_food"
            else:
                question_kind = "disease_do_food"
            question_kinds.append(question_kind)
        # food disease
        if self.check_words(self.food_qwds + self.cure_qwds, question) and 'food' in region_word_kinds:
            deny_status = self.check_words(self.deny_feature_words, question)
            if deny_status:
                question_kind = 'food_not_disease'
            else:
                question_kind = 'food_do_disease'
            question_kinds.append(question_kind)
        # disease drug
        self.sub_classify(self.drug_qwds, question, 'disease', region_word_kinds, question_kinds, "disease_drug")
        # drug disease
        self.sub_classify(self.cure_qwds, question, 'drug', region_word_kinds, question_kinds, "drug_disease")
        # disease check
        self.sub_classify(self.check_qwds, question, 'disease', region_word_kinds, question_kinds, "disease_check")
        # check disease
        self.sub_classify(self.check_qwds + self.cure_qwds, question, 'check', region_word_kinds, question_kinds,
                          "check_disease")
        # disease prevent
        self.sub_classify(self.prevent_qwds, question, 'disease', region_word_kinds, question_kinds, "disease_prevent")
        # disease last time
        self.sub_classify(self.lasttime_qwds, question, 'disease', region_word_kinds, question_kinds,
                          "disease_lasttime")
        # disease cure way
        self.sub_classify(self.cureway_qwds, question, 'disease', region_word_kinds, question_kinds, "disease_cureway")
        # disease cure prob
        self.sub_classify(self.cureprob_qwds, question, 'disease', region_word_kinds, question_kinds,
                          "disease_cureprob")
        # disease easy get
        self.sub_classify(self.easyget_qwds, question, 'disease', region_word_kinds, question_kinds, "disease_easyget")
        # fallbacks when no rule fired
        if question_kinds == [] and 'disease' in region_word_kinds:
            question_kinds.append('disease_desc')
        if question_kinds == [] and 'symptom' in region_word_kinds:
            question_kinds.append('symptom_disease')
        classify_res['question_kinds'] = question_kinds
        return classify_res

    def sub_classify(self, kind_qkwds, question, key, region_word_kinds, question_kinds, kind_type):
        if self.check_words(kind_qkwds, question) and (key in region_word_kinds):
            question_kinds.append(kind_type)

    @staticmethod
    def check_words(kws, question):
        for kw in kws:
            if kw in question:
                return True
        return False

    def check_query(self, question):
        """Find entity words in the question and return {word: kinds}."""
        # Each automaton hit is (end_index, (insert_index, word)).
        region_feature_words = [value[1] for _, value in self.region_actree.iter(question)]
        # A word is "inner" if it is a proper substring of another matched word,
        # whichever of the two was reported first (e.g. "鼻涕" inside "流鼻涕").
        inner_words = {
            wi for wi in region_feature_words
            if any(wi != wj and wi in wj for wj in region_feature_words)
        }
        return {
            word: self.word_kind_dict.get(word)
            for word in region_feature_words if word not in inner_words
        }
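The food-related branches of classify hinge on whether a negation word is present. The sketch below isolates that decision so it can be tried on its own. It is a simplification under my own assumptions: FOOD_WORDS and DENY_WORDS are short hypothetical lists (the project loads its negation words from dict/deny.txt), and food_question_kind is not a function from the project.

```python
# Hypothetical short lists; the project loads its negation words from dict/deny.txt.
FOOD_WORDS = ["吃", "喝", "食物", "忌口"]
DENY_WORDS = ["不要", "不能", "不宜", "忌"]


def food_question_kind(question, entity_kind):
    """Pick the food-related label for a question about a 'disease' or 'food' entity."""
    if entity_kind not in ("disease", "food"):
        return None
    if not any(word in question for word in FOOD_WORDS):
        return None
    # A negation word flips the label from the "do" variant to the "not" variant.
    negated = any(word in question for word in DENY_WORDS)
    if entity_kind == "disease":
        return "disease_not_food" if negated else "disease_do_food"
    return "food_not_disease" if negated else "food_do_disease"


print(food_question_kind("失眠的人不要吃啥?", "disease"))
print(food_question_kind("耳鸣了吃点啥?", "disease"))
```

Plain substring matching is easy to reason about, but it also means any negation word anywhere in the question flips the label, even when it does not modify the food verb.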
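Test results:
The results are broadly in line with expectations. Of course, one could also use named entity recognition to extract the target entities and a deep learning model to classify the questions, which would improve the generalization and recall of question classification. Optimizing with deep learning, however, also means that a large amount of labeled data is needed.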