cTAKES


cTAKES (clinical Text Analysis and Knowledge Extraction System) is a multi-component pipeline that combines rule-based and machine learning approaches to extract knowledge from clinical free text[8].

cTAKES builds on the Apache Unstructured Information Management Architecture (UIMA) framework[15] and the OpenNLP toolkit[16]. It consists of six components that are invoked sequentially to create an annotated dataset: (1) sentence boundary detector, (2) tokenizer, (3) normalizer, (4) part-of-speech (POS) tagger, (5) shallow parser, and (6) named entity recognition (NER) annotator.
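
The sequential invocation of the six components can be illustrated with a small Python sketch. This is not the cTAKES API (cTAKES itself is a Java system built on UIMA). Every name here (Document, run_pipeline, the stage functions, TOY_POS, TOY_DICTIONARY) is hypothetical, and each stage is a deliberately naive stand-in for the real component. The only point is that each stage reads the annotations left by the previous stages and adds its own.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Text plus the annotations accumulated by each stage (hypothetical type)."""
    text: str
    annotations: dict = field(default_factory=dict)


def detect_sentences(doc):
    # (1) Naive sentence boundary detection: split after ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", doc.text.strip())
    doc.annotations["sentences"] = [p for p in parts if p]
    return doc


def tokenize(doc):
    # (2) Split each sentence into word and punctuation tokens.
    doc.annotations["tokens"] = [
        re.findall(r"\w+|[^\w\s]", s) for s in doc.annotations["sentences"]
    ]
    return doc


def normalize(doc):
    # (3) Toy normalization: lowercase only.
    doc.annotations["normalized"] = [
        [t.lower() for t in sent] for sent in doc.annotations["tokens"]
    ]
    return doc


# Assumption: a tiny lookup table standing in for a trained POS model.
TOY_POS = {"patient": "NN", "denies": "VBZ", "chest": "NN", "pain": "NN"}


def tag_pos(doc):
    # (4) Look up a tag for each normalized token.
    doc.annotations["pos"] = [
        [TOY_POS.get(t, "UNK") for t in sent]
        for sent in doc.annotations["normalized"]
    ]
    return doc


def shallow_parse(doc):
    # (5) Group runs of consecutive NN tokens into noun-phrase chunks.
    chunks = []
    for toks, tags in zip(doc.annotations["normalized"], doc.annotations["pos"]):
        sent_chunks, current = [], []
        for tok, tag in zip(toks, tags):
            if tag == "NN":
                current.append(tok)
            elif current:
                sent_chunks.append(" ".join(current))
                current = []
        if current:
            sent_chunks.append(" ".join(current))
        chunks.append(sent_chunks)
    doc.annotations["chunks"] = chunks
    return doc


# Assumption: a one-entry dictionary used for illustration only.
TOY_DICTIONARY = {"chest pain": "sign_or_symptom"}


def recognize_entities(doc):
    # (6) Keep the chunks that match a dictionary entry.
    doc.annotations["entities"] = [
        (chunk, TOY_DICTIONARY[chunk])
        for sent in doc.annotations["chunks"]
        for chunk in sent
        if chunk in TOY_DICTIONARY
    ]
    return doc


PIPELINE = [detect_sentences, tokenize, normalize, tag_pos, shallow_parse,
            recognize_entities]


def run_pipeline(text):
    """Invoke the six stages in order on one document."""
    doc = Document(text)
    for stage in PIPELINE:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    result = run_pipeline("Patient denies chest pain. No fever.")
    print(result.annotations["entities"])
```

Because later stages depend on earlier annotations (the chunker needs POS tags, the entity annotator needs chunks), the order of the list matters in this sketch, which mirrors the sequential invocation described above.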

The downloadable version of cTAKES was trained on GENIA, the Penn TreeBank (PTB), and a corpus of 273 clinical notes derived from the Mayo Clinic Electronic Medical Record (EMR)[8].

cTAKES has an extensible architecture, which allows specialized NLP systems to be implemented for specific clinical domains.
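
As a rough illustration of what an extensible architecture means in practice, the following Python sketch shows a pipeline that accepts additional components without any change to the existing ones. It is an assumption-laden toy, not cTAKES or UIMA code: Pipeline, whitespace_tokenizer, and dose_annotator are hypothetical names, and the dosage pattern is a simplistic example of a domain-specific annotator.

```python
import re


class Pipeline:
    """Hypothetical container that runs components in the order they were added."""

    def __init__(self, components=None):
        self.components = list(components or [])

    def add(self, component):
        # Extending the pipeline does not require changing existing components.
        self.components.append(component)
        return self

    def run(self, text):
        annotations = {}
        for component in self.components:
            component(text, annotations)
        return annotations


def whitespace_tokenizer(text, annotations):
    # Generic component: split on whitespace.
    annotations["tokens"] = text.split()


def dose_annotator(text, annotations):
    # Domain-specific component (illustrative): find milligram dosages.
    annotations["doses"] = re.findall(r"\b\d+(?:\.\d+)?\s?mg\b", text)


if __name__ == "__main__":
    pipeline = Pipeline([whitespace_tokenizer]).add(dose_annotator)
    print(pipeline.run("Aspirin 81 mg daily."))
```

In this sketch a new clinical domain is supported by writing one more function with the same signature and calling add, which is the general idea behind plugging specialized annotators into a shared pipeline.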