Branch II


Branch II. The semi-structured sections of the discharge summaries primarily have an “attribute-value” structure (Figure 4), so they are easier to process than the unstructured sections. EpiDEA therefore does not run the complete NLP pipeline in this branch, but uses a rule-based approach to extract the relevant information.

[Figure: An example of semi-structured sections]

Step 1. Rule-based semi-structured text parser. Relevant terms are identified and extracted using EpSO-based parsing rules. The extraction is performed in the UIMA framework with the EpSO ontology and a rule-based annotator (see Algorithm 2). After the EpSO ontology is loaded, regular-expression rules for semi-structured text are generated for each class from its name, its labels, its subclasses (direct or indirect), and the labels of those subclasses. The generated rules capture variant expressions of the class and are applied to each plain text file to extract, split and annotate the attribute-value pairs. For instance, the rules generated for “SemiologicalZone” take into account variant expressions for “SemiologicalZone” and for its subclasses, including “EpileptogenicZone” and “SeizureOnsetZone.”
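The rule generation described in Step 1 can be illustrated with a minimal Python sketch. It is not the EpiDEA implementation, which runs inside UIMA against the real EpSO ontology. The ONTOLOGY dictionary, its label strings, the "attribute: value" line layout and the function names collect_terms, build_rule and annotate are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for the loaded EpSO ontology (labels are assumed):
# class name -> its labels and its direct subclasses.
ONTOLOGY = {
    "SemiologicalZone": {
        "labels": ["Semiological Zone"],
        "subclasses": ["EpileptogenicZone", "SeizureOnsetZone"],
    },
    "EpileptogenicZone": {"labels": ["Epileptogenic Zone"], "subclasses": []},
    "SeizureOnsetZone": {"labels": ["Seizure Onset Zone"], "subclasses": []},
}


def collect_terms(cls, ontology):
    """Return the names and labels of cls and of all its direct or indirect subclasses."""
    terms, stack, seen = [], [cls], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        terms.append(current)
        terms.extend(ontology[current]["labels"])
        stack.extend(ontology[current]["subclasses"])
    return terms


def build_rule(cls, ontology):
    """Build one regular expression matching 'attribute: value' lines for cls."""
    terms = sorted(set(collect_terms(cls, ontology)), key=len, reverse=True)
    # Allow any amount of whitespace between the words of a multi-word label.
    alternatives = "|".join(
        r"\s+".join(re.escape(word) for word in term.split()) for term in terms
    )
    return re.compile(
        rf"^[ \t]*(?P<attribute>{alternatives})[ \t]*:[ \t]*(?P<value>.+?)[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )


def annotate(text, cls, ontology):
    """Extract and split the attribute-value pairs that the rule for cls matches."""
    rule = build_rule(cls, ontology)
    return [(m.group("attribute"), m.group("value")) for m in rule.finditer(text)]
```

Under these assumptions, calling annotate on a text containing the line "Epileptogenic Zone: left temporal" with the class "SemiologicalZone" yields the pair ("Epileptogenic Zone", "left temporal"), because the rule for the parent class also covers its subclasses and their labels.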

Step 2. Post-processing. The values of sex, age, epileptogenic zone and etiology are extracted from the corresponding “attribute-value” pairs. Sex appears as “Female,” “F,” “Male” or “M,” and these values are normalized to “Female” and “Male.” Age appears either as a bare integer or as an integer followed by one of the following strings: “years old,” “year-old,” “years old.,” “years-old,” “year old,” “years,” “y.o.”; in both cases only the integer is extracted and integrated.
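The normalization in Step 2 can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the EpiDEA code. The names SEX_MAP, AGE_RULE, normalize_sex and normalize_age are hypothetical, and returning None for unrecognized values is an assumption.

```python
import re

# Variants listed in the text, mapped to their normalized form.
SEX_MAP = {"female": "Female", "f": "Female", "male": "Male", "m": "Male"}

# An integer, optionally followed by one of the listed age suffixes.
AGE_RULE = re.compile(
    r"^\s*(\d+)\s*"
    r"(?:years old\.?|year-old|years-old|year old|years|y\.o\.)?\s*$",
    re.IGNORECASE,
)


def normalize_sex(value):
    """Map 'F', 'Female', 'M' or 'Male' to 'Female' or 'Male'; None if unrecognized."""
    return SEX_MAP.get(value.strip().lower())


def normalize_age(value):
    """Return only the integer part of an age value; None if the value does not match."""
    match = AGE_RULE.match(value)
    return int(match.group(1)) if match else None
```

With these definitions, normalize_sex("F") gives "Female" and normalize_age("34 years old") gives 34, while a value such as "three years" is left unmatched rather than guessed.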

[Figure: Pseudocode for the ontology and rule-based annotator for semi-structured text]

The results from Branches I and II are integrated and stored in the PRISM knowledge base, which is implemented using a MySQL relational database.