Medical Knowledge Organization System-Based Definition Generation and Visualization

Background: Definition extraction and/or generation is an important information extraction task and has proven useful in many applications, such as intelligent question answering systems, especially in the big data era. Most current research on definition extraction relies on lexico-syntactic patterns or word lattices to identify definitional sentences, and usually performs poorly because the syntactic structures and word lattices in real-world documents and texts are noisy and variable. Methods: This paper presents a straightforward approach that generates definitions for medical terminology from the definitional relations in a well-developed Medical Knowledge Organization System (MKOS). This largely improves the accuracy and reliability of the results, because the relations it depends on have already been reviewed and verified by the editors and domain experts of the MKOS. In addition, two approaches to definition visualization were adopted in theory and implemented in practice to help users intuitively understand the inherent nature of the generated definitions; we name this “definition visualization” for the first time. Results: To evaluate and verify the performance of the proposed methods, a large amount of test data from well-known MKOS was collected for the experiments. The experimental results support our approaches, showing statistical values suitable for human reading and ordinary file storage, as well as promising precision and feedback from domain experts. Conclusion: The proposed approaches are able to generate precise definitions based on the existing MKOS and also convey the inherent nature and meaning of the generated definitions by means of two graphic diagrams.

been adopted in several languages, for instance English [8], Chinese [9], German [10], Portuguese [11] and so on. The implementations are either semi-automatic or fully automatic. In the former group, patterns are usually pregenerated from simple sentence forms, such as X refers to Y, X is defined as Y, or X is a Y. Most recent work belongs to the latter group [12][13][14][15][16]. Unlike regular expression-based hard-matching patterns, Cui et al. [12,13] showed that soft patterns could model language variations probabilistically to extract definitional sentences. They later presented a new approach [14] to learn the typical linguistic forms of definitions and proposed a genetic algorithm to learn the relative importance of these forms [15], which could learn rules similar to those derived by a human linguistic expert and rank candidate definitions in order of confidence. Navigli et al. proposed a generalization, Word-Class Lattices, learned from a Wikipedia dataset to model textual definitions [8], which compared favorably with other algorithms in that it requires no parameter tuning and can handle quite complex tasks (as in real-world documents).
Recently, Anke et al. proposed a supervised approach that used only syntactic features derived from dependency relations [16]; the problem was modeled as a classification task in which each sentence had to be classified as definitional or not.
They obtained promising results in comparison with one well-known supervised method and one unsupervised method. However, because definitional sentences in real-world documents and texts often occur in highly variable syntactic structures and definitional patterns are inherently very noisy, methods of this kind end up uncompetitive, with low recall and precision. In contrast to the above algorithms, we present a straightforward approach that generates definitions from the definitional relations in a well-developed MKOS. This largely improves the accuracy and reliability of the results, because the relations it depends on have already been reviewed and verified by the editors of the MKOS, as well as by the domain experts involved in its development.
In addition, to help users better understand the inherent nature of the generated definitions, we implement two approaches from scientific and information visualization that intuitively convey the definition of a term as graphic diagrams; we are the first to refer to this as "definition visualization".

Differentiating Definitional from Non-Definitional Relationships in SNOMED CT
Among the large number of concepts and relationships, most relationships are used to define and represent the meaning of SNOMED CT concepts in the following 9 hierarchy branches: Clinical finding concepts, Procedure concepts, Evaluation procedure concepts, Specimen concepts, Body structure concepts, Pharmaceutical/biologic product concepts, Situation with explicit context concepts, Event concepts and Physical object concepts. Taking the clinical finding concept as an example, the set of definitional relationships is shown in Table 1, sorted by importance rank.
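The selection step can be sketched as a simple filter followed by a sort on rank. This is a minimal illustration only: the list `DEFINITIONAL_TYPES` below is a hypothetical, partial stand-in for Table 1 (it contains only relationship types that appear in the example definitions in this paper, in an assumed order), and `replaced by` is used as an assumed example of a non-definitional relationship type.

```python
# Hypothetical, partial stand-in for Table 1: relationship types in an
# assumed order of importance (the full list and ranks are in Table 1).
DEFINITIONAL_TYPES = [
    "is a",
    "finding site",
    "associated morphology",
    "causative agent",
    "pathological process",
]
RANK = {rel_type: i for i, rel_type in enumerate(DEFINITIONAL_TYPES)}


def select_definitional(relationships):
    """Keep only definitional (type, linked concept) pairs, sorted by rank.

    sorted() is stable, so pairs of the same type keep their input order.
    """
    kept = [rel for rel in relationships if rel[0] in RANK]
    return sorted(kept, key=lambda rel: RANK[rel[0]])


example = [
    ("replaced by", "some other concept"),  # assumed non-definitional type
    ("pathological process", "neoplastic process"),
    ("is a", "breast lump"),
    ("finding site", "breast structure"),
]
selected = select_definitional(example)
for rel_type, concept in selected:
    print(rel_type, "->", concept)
```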

Generating Term Definition from Definitional Relationships in SNOMED CT
With the definitional relationships from SNOMED CT, we can generate the definition of a specific concept or term after further organization. For example, there are 9 definitional relationships for breast neoplasm (a kind of clinical finding) in SNOMED CT; see Table 2. With these definitional relationships and the names of the concepts linked to clinical finding concepts, we can generate definitions for these terms following the rules below:
Therefore, the definition of breast neoplasm is further modified and optimized as: Breast neoplasm is a breast lump, disorder of breast, neoplastic disease, neoplasm of trunk, neoplasm of thorax; It has finding site breast structure, trunk structure; It has associated morphology neoplasm; It has pathological process neoplastic process.
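The shape of the generated sentence can be illustrated with a short sketch. The generation rules themselves are not reproduced here, so the function below is inferred only from the breast neoplasm example: relationships are grouped by type, linked concepts within one type are separated by commas, and clauses are joined with "; It has". The function name `generate_definition` and the input format are assumptions made for illustration, not the authors' implementation.

```python
from collections import OrderedDict


def generate_definition(term, relationships):
    """Build a definition sentence from (relation type, linked concept) pairs.

    Assumes the pairs are already restricted to definitional relationships
    and ordered by importance rank, with "is a" first.
    """
    grouped = OrderedDict()
    for rel_type, concept in relationships:
        grouped.setdefault(rel_type, []).append(concept)

    clauses = []
    for rel_type, concepts in grouped.items():
        joined = ", ".join(concepts)  # several linked concepts, one type
        if rel_type == "is a":
            clauses.append(f"{term[0].upper()}{term[1:]} is a {joined}")
        else:
            clauses.append(f"It has {rel_type} {joined}")
    return "; ".join(clauses) + "."


breast_neoplasm = [
    ("is a", "breast lump"),
    ("is a", "disorder of breast"),
    ("is a", "neoplastic disease"),
    ("is a", "neoplasm of trunk"),
    ("is a", "neoplasm of thorax"),
    ("finding site", "breast structure"),
    ("finding site", "trunk structure"),
    ("associated morphology", "neoplasm"),
    ("pathological process", "neoplastic process"),
]
sentence = generate_definition("breast neoplasm", breast_neoplasm)
print(sentence)
```

With the 9 relationships listed above, the sketch reproduces the breast neoplasm sentence given in the text.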

Theory and Approaches for Medical Definition Visualization
Scientific and information visualization is a branch of computer dynamically and interactively [20].
In this research, in order to intuitively convey the definition of terminology, we implement the node-link and the right-oriented tree diagrams based on the layout of D3.js (D3.layout). Here, we presume that the node-link tree layout highlights the relationships between the root and the leaf nodes with a ragged appearance
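As a rough illustration of the data preparation such diagrams need, the sketch below converts the definitional relationships of one term into a nested name/children structure of the kind that D3 tree layouts typically take as input. The three-level arrangement (term, relationship type, linked concept) and the function name `definition_to_tree` are assumptions for illustration; the exact structure behind the figures is not specified here.

```python
import json
from collections import OrderedDict


def definition_to_tree(term, relationships):
    """Nest (relation type, linked concept) pairs under the term.

    Level 1 is the term (root), level 2 the relationship types,
    level 3 the linked concepts (leaves).
    """
    grouped = OrderedDict()
    for rel_type, concept in relationships:
        grouped.setdefault(rel_type, []).append(concept)
    return {
        "name": term,
        "children": [
            {"name": rel_type, "children": [{"name": c} for c in concepts]}
            for rel_type, concepts in grouped.items()
        ],
    }


tree = definition_to_tree(
    "breast neoplasm",
    [
        ("is a", "breast lump"),
        ("is a", "disorder of breast"),
        ("finding site", "breast structure"),
        ("finding site", "trunk structure"),
        ("associated morphology", "neoplasm"),
        ("pathological process", "neoplastic process"),
    ],
)
# The JSON text can be saved to a file and loaded by the visualization page.
print(json.dumps(tree, indent=2))
```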

Experimental Setup
We conducted experiments on the dataset of the clinical finding branch of SNOMED CT, which contains a total of 97,646 clinical finding concepts and 2,338,582 relationships, grouped into 224 types of relationships. As shown above, there are 17 definitional types of relationships (Table 1), involving 717,234 definitional relationships. We use the clinical finding hierarchy branch because it primarily consists of disease names (e.g., pneumonia) and the common signs and symptoms (e.g., fever, cough) known to mankind. Clinical finding is therefore the branch of greatest importance to the public, as it is closely tied to everyday healthcare. In addition, clinical finding acts as an application that reuses concepts from other branches, such as body structure, substance (e.g., drugs), procedure and so on. In other words, it is not possible to represent or describe the definitional content of a clinical finding without explicitly or implicitly referring to anatomical concepts or chemical entities. For example, pneumonia must take for granted the existence of the lung structure and also the fully-defined diagnosis.

Measures
To assess the performance of our algorithm, we will calculate the following statistical measures:

Results and Discussion
Using the proposed approaches for definition generation and visualization, we obtain 100,332 definitions of clinical finding terms. Table 3 reports the results of the statistical measures. The longest definition contains 1035 characters, while the shortest contains 23. The average length is 249 characters, which suits both human reading and ordinary file storage (e.g., 255 characters for common text in software such as Microsoft Excel and Access).
In terms of the number of definitional relationships each term has, the maximum is 1148, while the minimum is 1. The average is 7, which indicates that the definitional relationships used to generate most definitions carry considerable information for fully defining the terms of interest.
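Statistics of this kind (maximum, minimum and average definition length, and the same for the number of definitional relationships per term) can be computed as in the following sketch. The function name `summarize` and the record format are hypothetical, and the two records are simply the example definitions quoted in this paper, used as toy input rather than as experimental data.

```python
from statistics import mean


def summarize(records):
    """records: list of (definition sentence, number of relationships).

    Returns (max, min, mean) for sentence length in characters and for
    the number of definitional relationships per term.
    """
    lengths = [len(sentence) for sentence, _ in records]
    counts = [count for _, count in records]
    return {
        "length": (max(lengths), min(lengths), mean(lengths)),
        "relationships": (max(counts), min(counts), mean(counts)),
    }


breast = (
    "Breast neoplasm is a breast lump, disorder of breast, neoplastic disease, "
    "neoplasm of trunk, neoplasm of thorax; It has finding site breast structure, "
    "trunk structure; It has associated morphology neoplasm; "
    "It has pathological process neoplastic process."
)
canicola = (
    "Canicola fever is a leptospirosis; It has causative agent bacteria, "
    "leptospira interrogans, serogroup canicola; "
    "It has pathological process contagious disease, infectious process."
)
# Toy input: 9 and 5 definitional relationships respectively.
stats = summarize([(breast, 9), (canicola, 5)])
print(stats)
```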
We  the generated definitional sentence having insufficient information.
The other reason (20.6%) is the primitive (not fully-defined) definition, which is inadequate to uniquely distinguish the meaning of the concept from that of other similar concepts. However, an unaccepted definition can still be used as an annotation or comment, which might serve as a reference for a specific explanation of the term.
has pathological process allergic process; It has definitional manifestation allergic reaction to drug, immune system finding.
In total, this sentence contains 10 definitional relationships grouped into 6 kinds. It is quite difficult for a human reader to grasp the key points quickly, but the visualization makes them apparent (Figure 3).
b) There is a comma inside the related concept. According to rule 4 of definition generation, when one relationship type has many individuals (linked concepts), we insert commas into the definition sentence to separate them. Coincidentally, the name of a linked concept may itself contain a comma. It is then hard for the user to segment the sentence correctly and to know where to pause without the aid of definition visualization. Taking canicola fever as an example, we generate its definition sentence as: Canicola fever is a leptospirosis; It has causative agent bacteria, leptospira interrogans, serogroup canicola; It has pathological process contagious disease, infectious process. Here, Figure 4 intuitively tells us that "bacteria" and "leptospira interrogans, serogroup canicola" are the causative agents of canicola fever, rather than "bacteria", "leptospira interrogans" and "serogroup canicola".
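The ambiguity can be made concrete with a few lines of code. In this sketch, which uses only the canicola fever example above, the structured list holds two causative agents, but splitting the generated clause on commas yields three apparent items. The variable names are illustrative.

```python
# Structured data: two linked concepts, one of which contains a comma.
causative_agents = ["bacteria", "leptospira interrogans, serogroup canicola"]

# Clause as it appears in the generated definition sentence.
prefix = "It has causative agent "
clause = prefix + ", ".join(causative_agents)

# A reader (or a naive parser) segmenting on commas sees three items.
naive_items = clause[len(prefix):].split(", ")

print(clause)
print(len(causative_agents), "linked concepts in the source data")
print(len(naive_items), "items after splitting on commas:", naive_items)
```

A tree diagram avoids the problem because each linked concept is drawn as a single node, regardless of the punctuation in its name.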

Financial Support
This research is based on work supported (in part) by the National Social Science Fund of China (No. 20BTQ062). The authors also sincerely thank the editors and the anonymous reviewers, who offered valuable suggestions and comments that helped improve the quality of the manuscript.

Conflicts of Interest
There are no conflicts of interest.