Similarity-Based Unsupervised Spelling Correction Using BioWordVec: Development and Usability Study of Bacterial Culture and Antimicrobial Susceptibility Reports

Authors
Kim, Taehyeong; Han, Sung Won; Kang, Minji; Lee, Se Ha; Kim, Jong-Ho; Joo, Hyung Joon; Sohn, Jang Wook
Issue Date
Feb-2021
Publisher
JMIR PUBLICATIONS, INC
Keywords
spelling correction; natural language processing; bacteria; electronic health record
Citation
JMIR MEDICAL INFORMATICS, v.9, no.2
Indexed
SCIE
SCOPUS
Journal Title
JMIR MEDICAL INFORMATICS
Volume
9
Number
2
URI
https://scholar.korea.ac.kr/handle/2021.sw.korea/50034
DOI
10.2196/25530
ISSN
2291-9694
Abstract
Background: Existing bacterial culture test results for infectious diseases are written in unrefined text, which causes many problems, including typographical errors and stop words. Effective spelling correction is needed to ensure the accuracy and reliability of data for the study of infectious diseases, including medical terminology extraction. When a dictionary is available, spelling correction algorithms based on edit distance are efficient. In the absence of a dictionary, however, traditional algorithms that rely only on edit distance have limitations.
Objective: In this research, we propose a similarity-based spelling correction algorithm that uses pretrained word embeddings from BioWordVec. The method relies on a character-level N-gram-based distributed representation learned without supervision, rather than on the existing rule-based approach. In other words, we propose a framework that detects and corrects typographical errors when no dictionary is in place.
Methods: For detected typographical errors not mapped to Systematized Nomenclature of Medicine (SNOMED) clinical terms, a group of correction candidates with high similarity, taking edit distance into account, was generated using word embeddings pretrained on the clinical database. A grid search over the embedding matrix, in which the vocabulary is arranged in descending order of frequency, was used to find candidate groups of similar words. The correction candidates were then ranked by word frequency, and the typographical errors were corrected according to that ranking.
Results: Bacterial identification words were extracted from 27,544 bacterial culture and antimicrobial susceptibility reports, and 16 types of spelling errors and 914 misspelled words were found. The similarity-based spelling correction algorithm using BioWordVec proposed in this research corrected 12 types of typographical errors and achieved an F1 score of 97.48% across all spelling errors.
Conclusions: This tool corrected spelling errors effectively in the absence of a dictionary, based on bacterial identification words in bacterial culture and antimicrobial susceptibility reports. The method will help build a high-quality, refined database from the vast text data in electronic health records.
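The Methods summary can be illustrated with a minimal sketch. It is not the authors' implementation: a toy character trigram count vector stands in for the pretrained BioWordVec embedding, the grid search over the frequency-ordered embedding matrix is replaced by a plain scan of the vocabulary, and the names `correct`, `vocab_freq`, `min_sim`, and `max_dist`, as well as the threshold values and word frequencies, are assumptions made only for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n=3):
    """Character n-gram counts of a word padded with boundary markers."""
    padded = f"<{word.lower()}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def edit_distance(s, t):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]


def correct(word, vocab_freq, min_sim=0.5, max_dist=2):
    """Return the best-ranked correction, or the word itself if none qualifies."""
    word = word.lower()
    if word in vocab_freq:
        return word
    query = char_ngrams(word)
    # Candidate group: similar in n-gram space AND close in edit distance.
    candidates = [
        v for v in vocab_freq
        if cosine(query, char_ngrams(v)) >= min_sim
        and edit_distance(word, v) <= max_dist
    ]
    if not candidates:
        return word
    # Rank by frequency (highest first); break ties by smaller edit distance.
    return max(candidates, key=lambda v: (vocab_freq[v], -edit_distance(word, v)))


# Hypothetical frequencies, for illustration only.
vocab_freq = {"escherichia": 120, "staphylococcus": 95,
              "streptococcus": 80, "klebsiella": 60}
print(correct("eschericia", vocab_freq))
print(correct("staphylococus", vocab_freq))
```

In a setting closer to the one described above, the trigram vectors would presumably be replaced by vectors from the pretrained embedding, and the thresholds would be tuned on the data rather than fixed in advance.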
Appears in Collections
College of Medicine > Department of Medical Science > 1. Journal Articles
College of Engineering > School of Industrial and Management Engineering > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Sohn, Jang Wook
College of Medicine (Department of Medical Science)