Branch I

The discharge summary reports used in this work contain two distinct, interleaved formats that are processed separately. Branch I uses the components available in cTAKES to analyze free text, whereas Branch II uses specialized rules to extract information from the semi-structured attribute-value sections of the reports.

Branch I. This branch uses the first stages of the cTAKES pipeline, followed by three specialized components, to process free text from the unstructured sections of the discharge summary.

Step 1. Sentence Detection. EpiDEA re-uses the cTAKES sentence boundary detector module, which extends the OpenNLP maximum entropy sentence detector tool. The module uses punctuation marks, including periods, question marks, and exclamation marks, to identify the end of a sentence[8].
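To make the idea of punctuation-based boundaries concrete, here is a minimal Python sketch. It is a plain rule-based stand-in, not the maximum entropy model that cTAKES and OpenNLP actually use, and the function name `split_sentences` is hypothetical.

```python
import re

# Assumption: a sentence ends at '.', '?' or '!' followed by whitespace.
# The real cTAKES/OpenNLP detector is a trained maximum entropy model;
# this regex only illustrates the punctuation cue it relies on.
BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    """Split text after sentence-final punctuation, dropping empty pieces."""
    return [s for s in BOUNDARY_RE.split(text.strip()) if s]
```

A rule like this breaks on abbreviations such as "Dr. Smith", which is one reason a statistical detector is preferred in practice.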

Step 2. Tokenization. The cTAKES tokenizer used in EpiDEA splits text into tokens at spaces and punctuation marks. Special cases such as dates and acronyms are handled by merging the affected tokens back together[8].
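The split-then-merge behavior described in this step can be sketched as follows. This is an illustration only, not the cTAKES tokenizer: the date and acronym patterns and the function name `tokenize` are assumptions chosen for the example.

```python
import re

# First pass: every run of word characters or single punctuation mark.
PIECE_RE = re.compile(r"\w+|[^\w\s]")

# Hypothetical special cases that should end up as one token.
SPECIAL_RES = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),  # dates such as 3/14/2011
    re.compile(r"\b(?:[A-Za-z]\.){2,}"),         # dotted acronyms such as b.i.d.
]

def tokenize(text):
    """Split on spaces and punctuation, then merge pieces inside special spans."""
    special = sorted(
        (m.start(), m.end()) for rx in SPECIAL_RES for m in rx.finditer(text)
    )
    tokens = []
    skip_until = 0  # pieces starting before this offset were already merged
    for piece in PIECE_RE.finditer(text):
        if piece.start() < skip_until:
            continue
        span = next((s for s in special if s[0] == piece.start()), None)
        if span:
            tokens.append(text[span[0]:span[1]])  # merged special-case token
            skip_until = span[1]
        else:
            tokens.append(piece.group())
    return tokens
```

For example, `tokenize("Seen on 3/14/2011 for b.i.d. dosing.")` keeps `3/14/2011` and `b.i.d.` as single tokens while the final period remains a separate token.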

Step 3. Part-of-speech (POS) tagging. EpiDEA uses normalization rules to reconcile variations in lexical properties of the tokens as part of the preprocessing step. Hence, the output of the tokenization step can be directly used by the cTAKES POS tagger, which is a wrapper around the OpenNLP POS module.

Step 4. Shallow parsing. The cTAKES shallow parser is re-used in EpiDEA to tag noun phrases.

Step 5. Epilepsy Named Entity Recognition (Ep-NER). The output of the shallow parser is further filtered and annotated with EpSO classes (see Algorithm 1). Only the noun phrases extracted by cTAKES are considered. For each ontology class in EpSO, regular expression rules for unstructured text are created and applied to the noun phrases. Noun phrases that match a rule are annotated with the corresponding ontology class.

[Figure: Pseudocode for ontology and rule based annotator for unstructured text]
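As a rough illustration of the rule-based annotation in Step 5, the Python sketch below applies one regular expression per ontology class to a list of noun phrases. The class names come from the text of this page, but the patterns, the `EPSO_RULES` table, and the `annotate` function are hypothetical and are not the rules used in EpiDEA.

```python
import re

# Illustrative rules only: EpSO class name -> regular expression.
EPSO_RULES = {
    "EEGPattern": re.compile(r"\b(?:spikes?|sharp waves?|slowing)\b", re.I),
    "DrugBrandName": re.compile(r"\b(?:keppra|lamictal)\b", re.I),
    "DrugIngredient": re.compile(r"\b(?:levetiracetam|lamotrigine)\b", re.I),
}

def annotate(noun_phrases):
    """Return (noun phrase, EpSO class) pairs for every rule that matches."""
    annotations = []
    for phrase in noun_phrases:
        for epso_class, pattern in EPSO_RULES.items():
            if pattern.search(phrase):
                annotations.append((phrase, epso_class))
    return annotations
```

Noun phrases that match no rule, such as "the patient", are simply left out of the result, which corresponds to the filtering described above.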

Step 6. Negation detection. To detect negation and uncertainty, we adapt the NegEx algorithm developed by Chapman et al.[18], which identifies negation in clinical text. The trigger lists for negation and uncertainty detection are based on the existing negation triggers and on observation of the discharge summaries (see Table 2). For each EpSO-annotated noun phrase, the adapted NegEx algorithm is applied to the sentence in which the noun phrase occurs to determine whether the EpSO-related term is negated or uncertain.

Category                  | Negation                                | Uncertainty
Pre trigger terms         | no, no evidence of, no abnormal, etc.   | probably, possible, likely, might have, etc.
Post trigger terms        | is ruled out, unlikely, etc.            | suspected, suspicious, etc.
Pseudo trigger terms      | no change, no significant change, etc.  | NA
Conjunction trigger terms | but, reason, for etc.                   | suffers from, suffered, etc.
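The way these trigger categories interact can be sketched in Python. This is a simplified illustration in the spirit of NegEx, not the adapted algorithm used in EpiDEA: it uses only the example terms listed in the table, treats only "but" as a conjunction, ignores the token window that NegEx applies, and the function name `classify` is hypothetical.

```python
import re

# Example trigger terms taken from the table above (lists are not exhaustive).
NEG_PRE = ["no evidence of", "no abnormal", "no"]
NEG_POST = ["is ruled out", "unlikely"]
UNC_PRE = ["probably", "possible", "likely", "might have"]
UNC_POST = ["suspected", "suspicious"]
PSEUDO = ["no significant change", "no change"]
CONJ_RE = re.compile(r"\bbut\b")  # assumption: only "but" ends a trigger's scope

def _contains(terms, text):
    """True if any term occurs in text as a whole word or phrase."""
    return any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in terms)

def classify(sentence, phrase):
    """Label an annotated phrase as 'negated', 'uncertain', or 'affirmed'."""
    s = sentence.lower()
    for pseudo in PSEUDO:
        s = s.replace(pseudo, " ")  # pseudo triggers must not count as negation
    idx = s.find(phrase.lower())
    if idx == -1:
        raise ValueError("phrase not found in sentence")
    # Keep only the text between the phrase and the nearest conjunction.
    before = CONJ_RE.split(s[:idx])[-1]
    after = CONJ_RE.split(s[idx + len(phrase):])[0]
    if _contains(NEG_PRE, before) or _contains(NEG_POST, after):
        return "negated"
    if _contains(UNC_PRE, before) or _contains(UNC_POST, after):
        return "uncertain"
    return "affirmed"
```

In "No change in medication, but she had a seizure yesterday.", the pseudo trigger "no change" is blanked out and the conjunction "but" cuts off everything before it, so "seizure" is labeled as affirmed rather than negated.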

Step 7. Post-processing.

  • EEG pattern. A patient may have multiple EEG patterns mentioned in different places in the report. The EpSO class “EEGPattern” and its subclasses are used to extract these patterns, which are then deduplicated and joined with commas.
  • Current or past antiepileptic medications. Dose and time information are not considered. Medication mentions, whether drug brand names or ingredient names, are extracted using all the subclasses of the EpSO classes “DrugBrandName” and “DrugIngredient”. To determine whether a medication is current or past, the beginning and ending indexes of the annotated medication are compared with those of the headers “Past Antiepileptic Medication” and “Current Antiepileptic Medication.”
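Both post-processing rules can be sketched in a few lines of Python. The function names are hypothetical, and the header rule is an assumption: the text says only that medication and header offsets are compared, so this sketch assigns each medication to the nearest header that ends before the medication begins.

```python
HEADER_STATUS = {
    "Past Antiepileptic Medication": "past",
    "Current Antiepileptic Medication": "current",
}

def merge_eeg_patterns(patterns):
    """Join extracted EEG patterns with commas, keeping first-seen order."""
    # dict.fromkeys drops duplicates while preserving insertion order.
    return ", ".join(dict.fromkeys(patterns))

def medication_status(report, med_span):
    """Return 'current', 'past', or None for a (start, end) medication span."""
    med_start, _med_end = med_span
    best_end, status = -1, None
    for header, label in HEADER_STATUS.items():
        pos = report.find(header)
        while pos != -1:
            header_end = pos + len(header)
            # Assumed rule: the closest header ending before the mention wins.
            if best_end < header_end <= med_start:
                best_end, status = header_end, label
            pos = report.find(header, header_end)
    return status
```

A medication that appears before either header gets `None`, which leaves the decision to the caller instead of guessing a status.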

Also see Branch II