Heuristic sample selection to minimize reference standard training set for a part-of-speech tagger.
Part-of-speech (POS) tagging represents an important first step for most medical natural language processing (NLP) systems. The majority of current statistically based POS taggers are trained using a general English corpus. Consequently, these systems perform poorly on medical text. Annotated medical corpora are difficult to develop because of the time and labor required. We investigated a heuristic-based sample selection method to minimize annotated corpus size for retraining a Maximum Entropy (ME) POS [...]
Author(s): Liu, Kaihong; Chapman, Wendy; Hwa, Rebecca; Crowley, Rebecca S
DOI: 10.1197/jamia.M2392