[Knowledge Graph] Practice: A question answering system based on a medical knowledge graph (Part 2): Graph data preparation and import

2022-07-24 02:24:00 【Coriander Chrysanthemum】

Previous article:

[Knowledge Graph] Practice: A question answering system based on a medical knowledge graph (Part 1): Project introduction and environment preparation

Background

The previous article covered the environment preparation for the system. This article covers how the graph data is obtained. The data was mainly crawled from http://jib.xywy.com/.

Environment preparation

The original plan was to also go through the data crawling code, so the following configuration was set up.

The database chosen for data storage is MongoDB. As usual, I install it in a container with Docker; for the steps, see: Docker install MongoDB. Then install the MongoDB driver: pip install pymongo. Because the code needs to connect to the database, a configuration file is created here, following the earlier practice.
KGQAMedicine/data/config.ini

[neo4j]
host=http://192.168.56.101
port=7474
user=neo4j
password=root
[mongodb]
host=http://192.168.56.101
port=27017
user=admin
password=123456
[sys]
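
As a minimal sketch of how the [mongodb] values above could be turned into a connection string for pymongo's MongoClient: the helper name build_mongodb_uri is hypothetical, and stripping the http:// prefix is an assumption made here because a MongoDB URI expects a bare host name rather than an HTTP URL.

```python
from configparser import ConfigParser
from urllib.parse import quote_plus, urlparse

# Same values as the [mongodb] section of config.ini
SAMPLE_INI = """
[mongodb]
host=http://192.168.56.101
port=27017
user=admin
password=123456
"""


def build_mongodb_uri(parser):
    """Build a mongodb:// URI from the [mongodb] section (hypothetical helper)."""
    section = parser["mongodb"]
    host = section["host"]
    # Drop a scheme such as http:// if present; fall back to the raw value otherwise
    hostname = urlparse(host).hostname or host
    # Percent-encode credentials so special characters stay valid in the URI
    user = quote_plus(section["user"])
    password = quote_plus(section["password"])
    return f"mongodb://{user}:{password}@{hostname}:{section.getint('port')}"


parser = ConfigParser()
parser.read_string(SAMPLE_INI)
uri = build_mongodb_uri(parser)
print(uri)
```

The resulting string could then be passed to pymongo.MongoClient(uri); that call is left out so the sketch runs without a database.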

KGQAMedicine/utils/config.py

#!/usr/bin/python
# -*- coding: UTF-8 -*-
"""
@author: juzipi
@file: config.py
@time: 2022/07/20
@description:
"""
from configparser import ConfigParser


class SysConfig(object):
    __doc__ = """ system config """

    # Singleton: globally unique instance
    def __new__(cls, *args, **kwargs):
        if not hasattr(SysConfig, '_instance'):
            SysConfig._instance = object.__new__(cls)
        return SysConfig._instance

    config_parser = ConfigParser()
    config_parser.read("./data/config.ini")
    # neo4j
    NEO4J_HOST = config_parser.get("neo4j", 'host')
    NEO4J_PORT = int(config_parser.get("neo4j", 'port'))
    NEO4J_USER = config_parser.get("neo4j", 'user')
    NEO4J_PASSWORD = config_parser.get('neo4j', 'password')
    # mongodb
    MONGODB_HOST = config_parser.get("mongodb", 'host')
    MONGODB_PORT = int(config_parser.get("mongodb", 'port'))
    MONGODB_USER = config_parser.get("mongodb", 'user')
    MONGODB_PASSWORD = config_parser.get('mongodb', 'password')
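
One thing to keep in mind with this class: "./data/config.ini" is resolved against the current working directory, and ConfigParser.read silently skips files it cannot find, so a wrong working directory only shows up later as a missing-section error. A minimal sketch of a loader that fails early instead (load_config is a hypothetical helper, not part of the project):

```python
import tempfile
from configparser import ConfigParser
from pathlib import Path


def load_config(path):
    """Read an INI file and raise if it could not be read (hypothetical helper)."""
    parser = ConfigParser()
    # read() returns the list of files it actually parsed
    loaded = parser.read(path, encoding="utf8")
    if not loaded:
        raise FileNotFoundError(f"config file not found: {path}")
    return parser


with tempfile.TemporaryDirectory() as tmp:
    ini = Path(tmp) / "config.ini"
    ini.write_text("[neo4j]\nhost=http://192.168.56.101\nport=7474\n", encoding="utf8")
    parser = load_config(ini)
    neo4j_port = parser.getint("neo4j", "port")
    # A missing file is not an error for ConfigParser.read itself
    missing_ok = ConfigParser().read(Path(tmp) / "missing.ini")

print(neo4j_port, missing_ok)
```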

In addition, the requests and lxml libraries are used to crawl data and parse pages.

However, the original project is quite old, so the pages it crawled may have changed. The data source website is given below; if you are interested in the data, you can crawl it from that site yourself. Here we use the data that was already crawled and processed in the original project.

Data sources

The data in the project comes from the xywy.com (寻医问药) website. Note that the website itself states that its content cannot be used as a basis for diagnosis or medical treatment. The website page looks like this:
[Image: screenshot of the website page]
I put the data from the original project into KGQAMedicine/data/medicial.json and added its path to the configuration file. The [sys] section in the config.ini listing above is shown empty; build_graph.py reads the path as SysConfig.DATA_ORIGIN_PATH. Judging by its format, the data in this file appears to have been exported from MongoDB.
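
The import code in the next section reads this file one line at a time and calls json.loads on each line, which implies one JSON object per line. A minimal sketch of that layout follows. The field names are taken from the import code, while the records themselves are invented placeholders, not real data, and read_records is a hypothetical helper.

```python
import io
import json

# Two placeholder records, one JSON object per line (values invented for illustration)
SAMPLE = (
    '{"name": "disease A", "symptom": ["symptom 1", "symptom 2"], '
    '"cure_department": ["internal medicine", "respiratory medicine"]}\n'
    '{"name": "disease B", "cure_department": ["surgery"]}\n'
)


def read_records(reader):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in reader:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(read_records(io.StringIO(SAMPLE)))
for record in records:
    # Optional fields may be missing, so fall back to an empty list
    print(record["name"], record.get("symptom", []), record["cure_department"])
```

Not every record carries every field, which is why the import code checks each key with `in` before using it.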

Graph data import

The rewritten code for this part is as follows:

KGQAMedicine/get_data/build_graph.py

import json
import os
import tqdm
from py2neo import Graph, Node
from utils.config import SysConfig


class MedicalGraph(object):

    def __init__(self):
        self.data_path = SysConfig.DATA_ORIGIN_PATH
        self.graph = Graph(SysConfig.NEO4J_HOST + ":" + str(SysConfig.NEO4J_PORT), auth=(SysConfig.NEO4J_USER,
                                                                                         SysConfig.NEO4J_PASSWORD))
        self.raw_graph_data = None

    def _read_nodes(self):
        # 7 types of entity nodes in total
        drugs = []  # drugs
        foods = []  # foods
        checks = []  # medical checks
        departments = []  # departments
        producers = []  # drug producers (branded products)
        diseases = []  # diseases
        symptoms = []  # symptoms
        disease_infos = []  # per-disease attribute dicts
        # Relationships between entity nodes
        relation_department_department = []  # department - parent department
        relation_diseases_noteat = []  # disease - foods to avoid
        relation_diseases_doeat = []  # disease - suitable foods
        relation_diseases_recommandeat = []  # disease - recommended foods
        relation_diseases_commonddrug = []  # disease - commonly used drugs
        rels_recommanddrug = []  # disease - popular drugs
        rels_check = []  # disease - checks
        relation_drug_producer = []  # producer - drug

        rels_symptom = []  # disease - symptoms
        rels_acompany = []  # disease - complications
        rels_category = []  # disease - department

        with open(self.data_path, 'r', encoding='utf8') as reader:
            for data in tqdm.tqdm(reader, desc=f"reading {self.data_path} file"):
                disease_dict = {}
                data_json = json.loads(data)
                disease = data_json['name']
                disease_dict['name'] = disease
                diseases.append(disease)
                disease_dict['desc'] = ''
                disease_dict['prevent'] = ''
                disease_dict['cause'] = ''
                disease_dict['easy_get'] = ''
                disease_dict['cure_department'] = ''
                disease_dict['cure_way'] = ''
                disease_dict['cure_lasttime'] = ''
                disease_dict['symptom'] = ''
                disease_dict['cured_prob'] = ''

                if 'symptom' in data_json:
                    symptoms += data_json['symptom']
                    for symptom in data_json['symptom']:
                        rels_symptom.append([disease, symptom])

                if 'acompany' in data_json:
                    for acompany in data_json['acompany']:
                        rels_acompany.append([disease, acompany])

                if 'desc' in data_json:
                    disease_dict['desc'] = data_json['desc']

                if 'prevent' in data_json:
                    disease_dict['prevent'] = data_json['prevent']

                if 'cause' in data_json:
                    disease_dict['cause'] = data_json['cause']

                if 'get_prob' in data_json:
                    disease_dict['get_prob'] = data_json['get_prob']

                if 'easy_get' in data_json:
                    disease_dict['easy_get'] = data_json['easy_get']

                if 'cure_department' in data_json:
                    cure_department = data_json['cure_department']
                    if len(cure_department) == 1:
                        rels_category.append([disease, cure_department[0]])
                    if len(cure_department) == 2:
                        big = cure_department[0]
                        small = cure_department[1]
                        relation_department_department.append([small, big])
                        rels_category.append([disease, small])

                    disease_dict['cure_department'] = cure_department
                    departments += cure_department

                if 'cure_way' in data_json:
                    disease_dict['cure_way'] = data_json['cure_way']

                if 'cure_lasttime' in data_json:
                    disease_dict['cure_lasttime'] = data_json['cure_lasttime']

                if 'cured_prob' in data_json:
                    disease_dict['cured_prob'] = data_json['cured_prob']

                if 'common_drug' in data_json:
                    common_drug = data_json['common_drug']
                    for drug in common_drug:
                        relation_diseases_commonddrug.append([disease, drug])
                    drugs += common_drug

                if 'recommand_drug' in data_json:
                    recommand_drug = data_json['recommand_drug']
                    drugs += recommand_drug
                    for drug in recommand_drug:
                        rels_recommanddrug.append([disease, drug])

                if 'not_eat' in data_json:
                    not_eat = data_json['not_eat']
                    for _not in not_eat:
                        relation_diseases_noteat.append([disease, _not])
                    foods += not_eat

                    # do_eat and recommand_eat are expected alongside not_eat;
                    # .get() avoids a KeyError if a record lacks them
                    do_eat = data_json.get('do_eat', [])
                    for _do in do_eat:
                        relation_diseases_doeat.append([disease, _do])
                    foods += do_eat

                    recommand_eat = data_json.get('recommand_eat', [])
                    for _recommand in recommand_eat:
                        relation_diseases_recommandeat.append([disease, _recommand])
                    foods += recommand_eat

                if 'check' in data_json:
                    check = data_json['check']
                    for _check in check:
                        rels_check.append([disease, _check])
                    checks += check
                if 'drug_detail' in data_json:
                    drug_detail = data_json['drug_detail']
                    producer = [i.split('(')[0] for i in drug_detail]
                    relation_drug_producer += [[i.split('(')[0], i.split('(')[-1].replace(')', '')] for i in drug_detail]
                    producers += producer
                disease_infos.append(disease_dict)
        return (set(drugs), set(foods), set(checks), set(departments), set(producers), set(symptoms),
                set(diseases), disease_infos,
                rels_check, relation_diseases_recommandeat, relation_diseases_noteat, relation_diseases_doeat,
                relation_department_department, relation_diseases_commonddrug, relation_drug_producer,
                rels_recommanddrug, rels_symptom, rels_acompany, rels_category)

    def create_graph_nodes(self):
        if self.raw_graph_data is None:
            self.raw_graph_data = self._read_nodes()
        Drugs, Foods, Checks, Departments, Producers, Symptoms, Diseases, disease_infos = self.raw_graph_data[: 8]
        self.create_diseases_nodes(disease_infos)
        self.create_node('Drug', Drugs)
        self.create_node('Food', Foods)
        self.create_node('Check', Checks)
        self.create_node('Department', Departments)
        self.create_node('Producer', Producers)
        self.create_node('Symptom', Symptoms)

    def create_node(self, label, nodes):
        for node_name in tqdm.tqdm(nodes, desc=f"creating {label} nodes"):
            node = Node(label, name=node_name)
            self.graph.create(node)

    def create_diseases_nodes(self, disease_infos):
        """Create the central Disease nodes of the knowledge graph.

        :param disease_infos: list of per-disease attribute dicts
        :return: None
        """
        for disease_dict in tqdm.tqdm(disease_infos, desc="creating diseases nodes"):
            node = Node("Disease", name=disease_dict['name'], desc=disease_dict['desc'],
                        prevent=disease_dict['prevent'], cause=disease_dict['cause'],
                        easy_get=disease_dict['easy_get'], cure_lasttime=disease_dict['cure_lasttime'],
                        cure_department=disease_dict['cure_department'],
                        cure_way=disease_dict['cure_way'], cured_prob=disease_dict['cured_prob'])
            self.graph.create(node)

    def create_graph_relations(self):
        if self.raw_graph_data is None:
            self.raw_graph_data = self._read_nodes()
        (rels_check, rels_recommandeat, rels_noteat, rels_doeat, rels_department, rels_commonddrug,
         rels_drug_producer, rels_recommanddrug, rels_symptom, rels_acompany, rels_category) = self.raw_graph_data[8:]
        self.create_relationship('Disease', 'Food', rels_recommandeat, 'recommand_eat', 'Recommended recipes')
        self.create_relationship('Disease', 'Food', rels_noteat, 'no_eat', 'Avoid eating')
        self.create_relationship('Disease', 'Food', rels_doeat, 'do_eat', 'Suitable for eating')
        self.create_relationship('Department', 'Department', rels_department, 'belongs_to', 'Belongs to')
        self.create_relationship('Disease', 'Drug', rels_commonddrug, 'common_drug', 'Common drugs')
        self.create_relationship('Producer', 'Drug', rels_drug_producer, 'drugs_of', 'Produces drugs')
        self.create_relationship('Disease', 'Drug', rels_recommanddrug, 'recommand_drug', 'Well-reviewed drugs')
        self.create_relationship('Disease', 'Check', rels_check, 'need_check', 'Diagnostic checks')
        self.create_relationship('Disease', 'Symptom', rels_symptom, 'has_symptom', 'Symptoms')
        self.create_relationship('Disease', 'Disease', rels_acompany, 'acompany_with', 'Complications')
        self.create_relationship('Disease', 'Department', rels_category, 'belongs_to', 'Department')

    def create_relationship(self, start_node, end_node, edges, rel_type, rel_name):
        """Create relationships between existing nodes.

        :param start_node: label of the start node
        :param end_node: label of the end node
        :param edges: list of [start_name, end_name] pairs
        :param rel_type: relationship type
        :param rel_name: value of the relationship's name property
        :return: None
        """
        # Deduplicate the edges
        set_edges = []
        for edge in edges:
            set_edges.append('###'.join(edge))
        for edge in tqdm.tqdm(set(set_edges),
                              desc=f"building edge {start_node} - {end_node} rel type {rel_type} rel name {rel_name}"):
            edge = edge.split('###')
            p = edge[0]
            q = edge[1]
            query = "match(p:%s),(q:%s) where p.name='%s'and q.name='%s' create (p)-[rel:%s{name:'%s'}]->(q)" % (
                start_node, end_node, p, q, rel_type, rel_name)
            try:
                self.graph.run(query)
            except Exception as e:
                print(e)

    @staticmethod
    def _write(file_path, data_list):
        with open(file_path, 'w', encoding='utf8') as writer:
            writer.write("\n".join(data_list))

    def export_data_dict(self):
        if self.raw_graph_data is None:
            self.raw_graph_data = self._read_nodes()
        Drugs, Foods, Checks, Departments, Producers, Symptoms, Diseases = self.raw_graph_data[: 7]
        self._write(os.path.join(SysConfig.DATA_DICT_DIR, "drug.txt"), list(Drugs))
        self._write(os.path.join(SysConfig.DATA_DICT_DIR, "food.txt"), list(Foods))
        self._write(os.path.join(SysConfig.DATA_DICT_DIR, "check.txt"), list(Checks))
        self._write(os.path.join(SysConfig.DATA_DICT_DIR, "department.txt"), list(Departments))
        self._write(os.path.join(SysConfig.DATA_DICT_DIR, "producer.txt"), list(Producers))
        self._write(os.path.join(SysConfig.DATA_DICT_DIR, "symptoms.txt"), list(Symptoms))
        self._write(os.path.join(SysConfig.DATA_DICT_DIR, "disease.txt"), list(Diseases))
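
Note that create_relationship interpolates node names into the Cypher text with %s, so a name containing a single quote yields an invalid query (the error is caught and printed, and the edge is skipped). One possible alternative is to pass the names as query parameters. The sketch below only builds the query text and the parameter dicts, so it runs without a database. The helper names are hypothetical, and the $name parameter syntax assumes a Neo4j version that supports it.

```python
def build_relationship_query(start_label, end_label, rel_type):
    """Return Cypher text with $-parameters for the node names (hypothetical helper)."""
    # Labels and relationship types cannot be Cypher parameters, so they are
    # interpolated; only pass trusted, hard-coded values for them.
    return (
        f"MATCH (p:{start_label}), (q:{end_label}) "
        "WHERE p.name = $p_name AND q.name = $q_name "
        f"CREATE (p)-[rel:{rel_type} {{name: $rel_name}}]->(q)"
    )


def dedupe_edges(edges):
    """Remove duplicate [start, end] pairs; tuples are hashable, lists are not."""
    return sorted({tuple(edge) for edge in edges})


# Placeholder edges; the apostrophe would break the %s-formatted query
edges = [["disease A", "drug O'X"], ["disease A", "drug O'X"], ["disease B", "drug Y"]]
unique_edges = dedupe_edges(edges)
query = build_relationship_query("Disease", "Drug", "common_drug")
params = [{"p_name": p, "q_name": q, "rel_name": "Common drugs"} for p, q in unique_edges]

for param in params:
    # With a live connection this would become something like
    # graph.run(query, param), assuming py2neo's Graph.run accepts a
    # parameter dict in the installed version.
    print(query, param)
```

Deduplicating with a set of tuples also avoids the '###' join and split round trip, which would misbehave if a name ever contained that separator.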

This program is mainly a rewrite of the original one. Because importing into the database takes a long time, I did not run the database import part; I only exported the entities to the KGQAMedicine/data/dict directory. Interested readers can try running the import. Using the data imported in the first article, running MATCH p=()-->() RETURN p LIMIT 200 against the graph database gives the following result:

[Image: graph returned by the query]
There is still a lot of content.