When you go look at the database that actually underlies an electronic medical record, sometimes you see weird things. Our product stores its clinical notes in Rich Text Format (RTF), which is probably pleasing to the reader’s eye but makes the notes unwieldy to run through our natural language processing pipeline. (Disclaimer: I’m aware that formatting could contain useful information. We’re not at a stage where we can make use of it.)
My educated guess is that the desktop client for our EHR uses the Windows Rich Edit control, and just dumps its output into the database. Be that as it may, we need to turn this RTF into clean text to be able to use it later. You’d think it’s straightforward, but you’d be wrong.
We avoid Not Invented Here syndrome as much as possible, so we went looking for prebuilt solutions that would let us turn RTF into plain text. They also had to be cross-platform, as part of our workflow is UNIX-based. We found two:
- The Java libraries (RTFEditorKit, part of Swing)
- UnRTF
Both have advantages and disadvantages. UnRTF seems more robust, but requires running a separate process and capturing its output. The Java libraries are extremely simple to use, but crash occasionally. We started out trying the Java libraries. Here’s my first attempt at some Jython code that will transform RTF into text (we favor Python around here):
from javax.swing.text import DefaultStyledDocument
from javax.swing.text.rtf import RTFEditorKit
from java.io import StringReader

def cleanup_one_RTF_note(RTFText):
    # Parse the RTF into an in-memory Swing document, then extract its plain text.
    dummy_document = DefaultStyledDocument()
    fake_file = StringReader(RTFText)
    try:
        RTFEditorKit().read(fake_file, dummy_document, 0)
        the_text = dummy_document.getText(0, dummy_document.getLength())
    except:
        # Catch-all: if anything goes wrong while parsing, return an empty string.
        the_text = ""
    return the_text.strip()
Why is that catch-all try/except there? Well, this code worked… except when it crashed inexplicably. So we went looking at the RTF itself, and there’s a WTF waiting there. Some of the RTF in the database is malformed. RTFEditorKit can’t handle it gracefully, so it throws an exception.
For attempt #2, we tried parsing the RTF and fixing the most glaring issues, such as unbalanced braces. Unfortunately, this seems to require writing a recursive descent parser, which, even with the excellent pyparsing, seems like a bit much just to clean up some text.
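Detecting unbalanced braces is much easier than repairing them. Here is a rough sketch of such a check, which is not what we run in production. The function name brace_imbalance is made up for illustration, and it assumes that every backslash escapes exactly the next character, so it will miscount braces inside embedded binary data.

```python
def brace_imbalance(rtf_text):
    """Return (unclosed_groups, stray_closers) for an RTF string.

    Hypothetical helper: it only counts group braces and does not
    attempt to parse or repair the RTF.
    """
    depth = 0
    stray_closers = 0
    i = 0
    n = len(rtf_text)
    while i < n:
        ch = rtf_text[i]
        if ch == "\\":
            # A backslash starts a control word or control symbol. Skip the
            # next character so \{, \} and \\ are not counted as group braces.
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                # A closing brace with no open group to match.
                stray_closers += 1
            else:
                depth -= 1
        i += 1
    return depth, stray_closers
```

A result of (0, 0) means the braces at least pair up. Anything else tells you the note is malformed before you hand it to a parser, but it does not tell you where the missing brace belongs, which is the hard part.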
Our current workflow consists of trying to parse the RTF using RTFEditorKit. If it doesn’t work, we pass the text off to UnRTF. Kludgey, but it seems to work and we can move on.
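For the curious, the fallback logic looks roughly like the sketch below. This is an illustration rather than our exact code: unrtf_to_text and convert_with_fallback are names invented here, the sketch assumes the unrtf executable is on the PATH, and it relies only on UnRTF's --text output option. UnRTF's text output may start with a header block, which this sketch does not strip.

```python
import os
import subprocess
import tempfile


def unrtf_to_text(rtf_text):
    """Convert RTF to text by running the external unrtf program.

    Assumes an `unrtf` executable is available on the PATH.
    """
    # Hand the input to unrtf as a file path via a temporary file.
    handle, path = tempfile.mkstemp(suffix=".rtf")
    try:
        with os.fdopen(handle, "w") as tmp:
            tmp.write(rtf_text)
        output = subprocess.check_output(["unrtf", "--text", path])
    finally:
        os.remove(path)
    return output.decode("utf-8", "replace").strip()


def convert_with_fallback(rtf_text, converters):
    """Return the first non-empty result from the converters, else ''."""
    for convert in converters:
        try:
            text = convert(rtf_text)
        except Exception:
            # This converter could not cope with the input; try the next one.
            continue
        if text:
            return text
    return ""
```

With the Jython function from earlier, the call would be convert_with_fallback(note, [cleanup_one_RTF_note, unrtf_to_text]). Because cleanup_one_RTF_note returns an empty string when RTFEditorKit fails, an empty result is treated the same as an exception and the note moves on to UnRTF.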