MedLEE natural language processor

We have designed a model that uses XML to embed structured, encoded information in a textual report, and we have implemented a natural language processing procedure that automatically transforms clinical reports into valid XML output consistent with the model. Associating structured output with portions of the original report adds significant functionality: applications can use the structured component of the XML output for highly specific retrieval and then highlight the relevant information, thereby facilitating manual review. For example, a special browser could highlight diagnoses, procedures performed, medications given, or pertinent history to assist the user in reading a report.


MedLEE can produce output in XML format, which is easily transformed into HL7 CDA. Significant phrases in the text are marked by a content tag, which is given a unique identification number. Following the text, one or more entry elements define concepts and link them back to the marked items; entries may be nested to any depth to represent complex semantic structures (a detailed example appears below). As an illustration of this linked structure, consider three regions in the text identified as A, B, and C, with B having a subregion identified as B1; the three entry items refer back to the text using these same identifiers. Each entry has a code that links it with a standardized coding scheme, e.g., the Unified Medical Language System (UMLS), the International Classification of Diseases, 10th Revision (ICD-10), or SNOMED CT. Using this approach, documents can represent coded information at the gross structural level (sections and fields) as well as at the fine structural level (medical concepts and their modifiers).
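
The linked structure described above can be sketched in Python. This is a minimal illustration, not MedLEE's actual output schema: only the element names content and entry come from the description; the wrapper elements (report, text, structured), the attribute names (id, idref, code), the sample text, and the placeholder code values are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: regions A, B (with subregion B1), and C are marked in
# the text, and three top-level entries point back at them. Everything except
# the <content> and <entry> element names is an assumption for illustration.
SAMPLE = """
<report>
  <text><content id="A">Heart size is normal</content>. <content id="B"><content id="B1">Moderately</content> enlarged spleen</content>. <content id="C">No pleural effusion</content>.</text>
  <structured>
    <entry idref="A" code="CODE-A"/>
    <entry idref="B" code="CODE-B">
      <entry idref="B1" code="CODE-B1"/>
    </entry>
    <entry idref="C" code="CODE-C"/>
  </structured>
</report>
"""


def linked_text(root):
    """Map each content id to the text it marks, including nested regions."""
    return {c.get("id"): "".join(c.itertext()) for c in root.iter("content")}


def resolve_entries(root):
    """Return (depth, idref, code, marked text) for every entry, depth-first."""
    by_id = linked_text(root)
    rows = []

    def walk(entry, depth):
        idref = entry.get("idref")
        rows.append((depth, idref, entry.get("code"), by_id[idref]))
        for child in entry.findall("entry"):  # direct children only
            walk(child, depth + 1)

    for top in root.find("structured").findall("entry"):
        walk(top, 0)
    return rows


root = ET.fromstring(SAMPLE)
rows = resolve_entries(root)
for depth, idref, code, text in rows:
    print("  " * depth + f"{idref} [{code}] -> {text!r}")
```

Because the entries refer to the text only through identifiers, the original wording of the report stays intact while the structured part can be nested independently of how the marked regions are laid out.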

Methods

MedLEE is written in Quintus Prolog and can run on most Unix and Windows platforms. It takes an average of 3 seconds to process a complete radiologic report on a Sun Ultra 1 Model 170 workstation with a clock speed of 167 MHz. MedLEE requires 32 MB of RAM and 4 MB of disk space.

[Figure: MedLEE components]

The first component of MedLEE is the preprocessor, which delineates the sentences of the report. Lexical lookup is then performed to identify and categorize the single words and multiword phrases in each sentence. The output of this component is a list of word positions, where each position is associated with a word or multiword phrase in the report. For example, if the sentence "spleen appears to be moderately enlarged" were at the beginning of the report, it would be represented as the list [1,2,5,6], where position 1 is associated with "spleen", position 2 with the multiword phrase "appears to be", position 5 with "moderately", and position 6 with "enlarged". The remainder of the list of word positions would be associated with the remaining words in the report.
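
The word-position list from this example can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the mini-lexicon, its category labels, and the longest-match-first lookup strategy are hypothetical and are not taken from MedLEE's actual lexicon or preprocessor.

```python
# Hypothetical mini-lexicon: phrase (tuple of words) -> category label.
# The labels are placeholders, not MedLEE's actual categories.
LEXICON = {
    ("spleen",): "bodyloc",
    ("appears", "to", "be"): "certainty",
    ("moderately",): "degree",
    ("enlarged",): "finding",
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in LEXICON)


def lexical_lookup(sentence, start=1):
    """Return (position, phrase, category) triples for one sentence.

    Positions are 1-based word offsets; a multiword phrase is recorded at
    the position of its first word. Longest match first is an assumption.
    """
    words = sentence.lower().split()
    result = []
    i = 0
    while i < len(words):
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in LEXICON:
                break
        else:
            # No lexicon match: fall back to the single word.
            n, phrase = 1, (words[i],)
        result.append((start + i, " ".join(phrase), LEXICON.get(phrase, "unknown")))
        i += n
    return result


triples = lexical_lookup("spleen appears to be moderately enlarged")
positions = [pos for pos, _, _ in triples]
print(positions)  # [1, 2, 5, 6]
```

Positions 3 and 4 are absent from the list because the words "to" and "be" are absorbed into the multiword phrase that starts at position 2.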

The second component is the parser. It uses the grammar and the categories assigned to the phrases of a sentence to recognize well-formed syntactic and semantic patterns in the sentence and to generate intermediate forms. The target form generated by the parser for the sample sentence "spleen is moderately enlarged" would be the following frame:

Objective: To design a document model that provides reliable and efficient access to clinical information in patient reports for a broad range of clinical applications, and to implement an automated method using natural language processing that maps textual reports to a form consistent with the model.

Methods: A document model that encodes structured clinical information in patient reports while retaining the original contents was designed using the Extensible Markup Language (XML), and a document type definition (DTD) was created. An existing natural language processor (NLP) was modified to generate output consistent with the model. Two hundred reports were processed using the modified NLP system, and the resulting XML output was validated using an XML validating parser.

Results: The modified NLP system successfully processed all 200 reports. The output for one report was invalid; the output for the other 199 reports was valid XML consistent with the DTD.

Conclusions: Natural language processing can be used to automatically create an enriched document that contains a structured component whose elements are linked to portions of the original textual report. This integrated document model provides a representation in which documents containing specific information can be accurately and efficiently retrieved by querying the structured components. If manual review of the documents is desired, the salient information in the original reports can also be identified and highlighted. An XML tagging model has the additional benefit that software tools for manipulating XML documents are readily available.
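
The retrieve-then-highlight workflow described in the conclusions can be sketched as follows. This is a hedged illustration, not MedLEE's actual schema or any existing browser: apart from the content and entry element names, the wrapper elements, the attribute names (id, idref, code), the sample reports, the placeholder code CODE-X, and the `**` highlight markers are all assumptions.

```python
import xml.etree.ElementTree as ET

# Two hypothetical reports in an assumed schema; only r1 carries CODE-X.
REPORTS = {
    "r1": (
        '<report><text><content id="A">Enlarged spleen</content>. '
        "Lungs are clear.</text>"
        '<structured><entry idref="A" code="CODE-X"/></structured></report>'
    ),
    "r2": "<report><text>Lungs are clear.</text><structured/></report>",
}


def highlight(doc_xml, code):
    """Return the report text with regions linked to `code` wrapped in **,
    or None if no entry in the structured component carries that code."""
    root = ET.fromstring(doc_xml)
    wanted = {e.get("idref") for e in root.iter("entry") if e.get("code") == code}
    if not wanted:
        return None

    def render(elem):
        # Rebuild the text recursively so nested content regions are handled.
        out = elem.text or ""
        for child in elem:
            out += render(child) + (child.tail or "")
        if elem.tag == "content" and elem.get("id") in wanted:
            out = f"**{out}**"
        return out

    return render(root.find("text"))


def search(reports, code):
    """Query the structured components, then highlight the matching text."""
    hits = {}
    for name, doc_xml in reports.items():
        marked = highlight(doc_xml, code)
        if marked is not None:
            hits[name] = marked
    return hits


hits = search(REPORTS, "CODE-X")
print(hits)
```

The query touches only the structured entries, and the original text is consulted only for reports that match, which mirrors the split between retrieval and manual review.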

[1]