Mine That Record

June 21, 2011

PostgreSQL configuration on big machines

Filed under: Code — Dr. H @ 2:45 pm

Configuring PostgreSQL to perform well on big machines with gobs of resources is a bit of a challenge. Here are the relevant parts of our own postgresql.conf, in the hope that someone finds them useful.

Relevant hardware specs: a 32-core machine, Ubuntu 10.10 64-bit server, 64 GB of RAM, and two drive arrays (one with 8 HDDs, one with 15), running PostgreSQL 8.4.8. The actual data lives on the second (larger but slower) array; the WAL lives on the first (smaller but faster) array.

Whatever is not listed here sticks to the PostgreSQL defaults.

#------------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#------------------------------------------------------------------------------

# - Memory -
shared_buffers = 16GB                   # min 128kB
                                        # (change requires restart)
temp_buffers = 512MB                    # min 800kB
#max_prepared_transactions = 0          # zero disables the feature
                                        # (change requires restart)
# Note:  Increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
# It is not advisable to set max_prepared_transactions nonzero unless you
# actively intend to use prepared transactions.
work_mem = 256MB                        # min 64kB
maintenance_work_mem = 1GB              # min 1MB
max_stack_depth = 7680                  # min 100kB

# - Asynchronous Behavior -
effective_io_concurrency = 16           # 1-1000. 0 disables prefetching
#------------------------------------------------------------------------------
# WRITE AHEAD LOG
#------------------------------------------------------------------------------

# - Settings -

#fsync = on                             # turns forced synchronization on or off
#synchronous_commit = on                # immediate fsync at commit
#wal_sync_method = fsync                # the default is the first option
                                        # supported by the operating system:
                                        #   open_datasync
                                        #   fdatasync
                                        #   fsync
                                        #   fsync_writethrough
                                        #   open_sync
#full_page_writes = on                  # recover from partial page writes
wal_buffers = 256MB                     # min 32kB
                                        # (change requires restart)
#wal_writer_delay = 200ms               # 1-10000 milliseconds

#commit_delay = 0                       # range 0-100000, in microseconds
#commit_siblings = 5                    # range 1-1000

# - Checkpoints -

checkpoint_segments = 64                # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min              # range 30s-1h
#checkpoint_completion_target = 0.5     # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s               # 0 disables

These settings make much better use of our machine than the defaults, which are roughly ten years old and were meant for small, slow machines. YMMV.
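
If you want to double-check what the running server actually picked up, a quick query against pg_settings does the trick. Here's a minimal sketch using psycopg2 (the connection string is a placeholder; adjust it for your setup):

import psycopg2

conn = psycopg2.connect("dbname=postgres")   # placeholder DSN; use your own
cur = conn.cursor()
cur.execute("""
    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem',
                   'wal_buffers', 'checkpoint_segments',
                   'effective_io_concurrency')
""")
for name, setting, unit in cur.fetchall():
    print "%s = %s %s" % (name, setting, unit or '')
cur.close()
conn.close()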

July 7, 2010

Data Fakehouse

Filed under: Code — Dr. H @ 1:24 am

Sharing data with other institutions is a problem. HIPAA makes it difficult and cumbersome, and some patients fear for their privacy. Giving data to students is encumbered for the same reasons. And, frankly, much of it is due to fear: what if a student stores it on his or her laptop, unencrypted, and then loses it?

There is no simple answer to these questions. We take great care in protecting patient data. We spend money, time, and lots of elbow grease making sure it stays where it’s supposed to. Yet our students must learn, and we must do research.

Enter the Data Fakehouse.

This is a simple set of scripts that creates a simulated data warehouse. It makes a lot of simplifying assumptions, but it's good enough for some kinds of research (especially into data mining techniques) and for teaching purposes. Here are some of the more relevant assumptions:

1. All diseases are chronic.
2. Care is episodic. In other words, this is an encounter-based setting, like an outpatient clinic.
3. Patients have a condition from the start, or they don’t. Conditions don’t appear during the course of care.
4. There’s a standard set of labs that is ordered every single time a patient with a condition visits. You can think of vitals as ‘labs’ if that helps.
5. The number of potential conditions is small. This can be increased easily, if necessary.
6. All lab values are normally distributed, both the normal and abnormal ones.
7. We know the ground truth about whether a patient has a condition or not (great for computing sensitivity and specificity!).

To run it, you’ll need PostgreSQL. I only tested it on the 8.3 series, but pretty much any version greater than 8.0 should work. You’ll also need Python 2.x (2.5 or greater should do the trick) and psycopg2. You will need to edit the create_db.sh script to fit your system, but with a minimal amount of UNIX experience it will be straightforward.
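
To give a flavor of assumptions 6 and 7 (normally distributed values plus known ground truth), here's a minimal sketch of how the fake values might be generated. This is not the actual Fakehouse code, and the condition names and numbers are made up:

import random

CONDITIONS = ['cond_a', 'cond_b']      # hypothetical condition names
NORMAL_DIST = (100.0, 10.0)            # mean, sd of a 'normal' lab value
ABNORMAL_DIST = (140.0, 15.0)          # mean, sd when the condition is present

def fake_patient(prevalence=0.2):
    # Ground truth is fixed up front: a patient either has a condition
    # from the start or doesn't (assumptions 3 and 7).
    return dict((c, random.random() < prevalence) for c in CONDITIONS)

def fake_lab_value(has_condition):
    # Every lab value is drawn from a normal distribution (assumption 6).
    mean, sd = ABNORMAL_DIST if has_condition else NORMAL_DIST
    return random.gauss(mean, sd)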

Please let me know if you find it useful!

April 15, 2010

New edition of MetaMap Tools

Filed under: Code,Metamap — Dr. H @ 3:15 pm

My little metamap_tools script is all grown up! It can now coordinate 8 instances of MetaMap running simultaneously, and handle stuck processes gracefully. It has been running flawlessly for over 23 hours nonstop and has processed over a million and a half lines of text.

I’m currently running it under Ubuntu 10.04, and using MetaMap 2009.

March 24, 2010

Lazarus

Filed under: Code,General musings — Dr. H @ 3:15 am

Over the weekend I got to play with one of my favorite old technologies: Object Pascal, in the form of the Free Pascal Compiler and the Lazarus project. Lazarus is an Open Source reimplementation of what is, IMHO, the best RAD tool of all time: Delphi (originally Borland Delphi, then CodeGear, then GodKnowsWhat, then SomeOtherCompany, currently Embarcadero Delphi).

I saved my money for a LONG time to buy Delphi 1.0 for Windows 3.1, and I loved it to death. It ran rings around Visual Basic in ease of use, power, speed, and elegance; it was 100% object-oriented, and its class library was both very comprehensive and extremely well designed.

Unfortunately, Borland, sorry, Embarcadero prices it out of the reach of ordinary people, especially if you want a version that can connect to a database. Lazarus is rough around the edges and glitchy, but:

  1. It works,
  2. It’s multiplatform, and
  3. It’s free and Free

So after struggling for a while I got Lazarus running, and the old Delphi love came rushing back. There's no better way to throw a quick user interface together for a project, and (if you need it) the immense power and flexibility are still there.

Plus, you gotta love programming in Pascal in 2010!

February 4, 2010

UIMA + MetaMap

Filed under: Code,Metamap — Dr. H @ 3:30 pm

There is a UIMA annotator for MetaMap! This made my day; I won’t have to write my own like I was planning to.

UIMA

Filed under: Classification,General musings — Dr. H @ 3:29 pm

Thanks to a great presentation put together by Chuck I have a clearer understanding of what UIMA is all about.

  1. UIMA is not just Apache UIMA. UIMA stands for Unstructured Information Management Architecture and is, in fact, a standard.
  2. UIMA can handle more than just text documents. It works just fine on audio and video.
  3. The most common implementation is Apache UIMA.
  4. UIMA revolves around configurable workflows (including conditional workflows) in which documents are passed to Annotators. Each annotator can examine the document and any annotations already attached to it, and then append its own annotations.
  5. The order in which annotators execute is therefore important, because an annotator may depend on the output of another annotator.
  6. An annotator can annotate whatever it deems important and is free to ignore the rest of the record.
  7. This workflow is fed from a “factory” that can generate documents however it deems appropriate: crawling the web, reading a database, reading files… in other words, the “reading documents” part of the workflow is isolated from the rest.
  8. The end product of the workflow is consumed by a consumer class that can do whatever it wants with the annotations. Typical choices are to write them to a file or insert them into a database.

There, that’s UIMA in a nutshell. For our purposes we’re interested in two UIMA pipelines: medKAT/P and cTAKES.
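
To make the moving parts easier to picture, here's a toy sketch of that annotator-pipeline idea. This is plain Python, not the Apache UIMA API (which is Java), and the annotator shown is entirely made up:

class Annotator(object):
    # One stage of the pipeline: sees the document plus everything
    # annotated so far, and may add its own annotations.
    def annotate(self, text, annotations):
        raise NotImplementedError

class MedicationSectionFinder(Annotator):   # hypothetical example annotator
    def annotate(self, text, annotations):
        if 'Medications:' in text:
            annotations.append(('section', 'medications'))

def run_pipeline(documents, annotators, consumer):
    # 'documents' can come from anywhere (web crawl, database, files...);
    # annotators run in order, since later ones may depend on earlier output;
    # the consumer decides what to do with the finished annotations.
    for doc_id, text in documents:
        annotations = []
        for annotator in annotators:
            annotator.annotate(text, annotations)
        consumer(doc_id, annotations)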

December 17, 2009

R.T.F.?

Filed under: Code,Exploring the depths — Dr. H @ 4:06 pm

When you go look at the database that actually underlies an electronic medical record, sometimes you see weird things. Our product stores its clinical notes in Rich Text Format (RTF), which is probably pleasing to the reader’s eye, but makes them unwieldy to run through our natural language processing pipeline. (Disclaimer: I’m aware that formatting could contain useful information. We’re not at a stage where we can make use of it.)

My educated guess is that the desktop client for our EHR uses the Windows Rich Edit control, and just dumps its output into the database. Be that as it may, we need to turn this RTF into clean text to be able to use it later. You’d think it’s straightforward, but you’d be wrong.

We avoid Not Invented Here syndrome as much as possible. We went looking for prebuilt solutions that allowed us to turn RTF into TXT. They also had to be multiplatform, as part of our workflow is UNIX-based. We found two:

  1. The Java libraries (RTFEditorKit)
  2. UnRTF

Both have advantages and disadvantages. UnRTF seems more robust, but requires running a separate process and capturing its output. The Java libraries are extremely simple to use, but crash occasionally. We started out trying the Java libraries. Here’s my first attempt at some Jython code that will transform RTF into text (we favor Python around here):

from javax.swing.text import DefaultStyledDocument
from javax.swing.text.rtf import RTFEditorKit
from java.io import StringReader

def cleanup_one_RTF_note(RTFText):
    # Wrap the RTF string in a reader and let Swing's RTFEditorKit parse
    # it into a styled document, then pull out just the plain text.
    dummy_document = DefaultStyledDocument()
    fake_file = StringReader(RTFText)
    try:
        RTFEditorKit().read(fake_file, dummy_document, 0)
        the_text = dummy_document.getText(0, dummy_document.getLength())
    except:
        # Malformed RTF makes RTFEditorKit throw; see below.
        the_text = ""
    return the_text.strip()

Why is that try/catchall except there? Well, this code worked… except when it crashed inexplicably. So we went looking at the RTF itself, and there’s a WTF waiting there. Some of the RTF in the database is malformed.  RTFEditorKit can’t handle it gracefully, so it throws an exception.

Attempt #2 involved trying to parse the RTF ourselves and fix the most glaring issues, like unbalanced braces. Unfortunately this appears to require writing a recursive descent parser, which, even with the excellent pyparsing, is a bit much just to clean up some text.

Our current workflow consists of trying to parse the RTF using RTFEditorKit. If it doesn’t work, we pass the text off to UnRTF. Kludgey, but it seems to work and we can move on.
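
For the curious, the fallback amounts to something like the sketch below. This is not our production code; it assumes unrtf is on the PATH, that your version supports the --text output mode, and that any header comments unrtf prepends are acceptable or stripped afterwards:

import os
import subprocess
import tempfile

def cleanup_with_unrtf(RTFText):
    # Write the raw RTF to a temporary file and let the external unrtf
    # program convert it to plain text.
    handle, path = tempfile.mkstemp(suffix='.rtf')
    try:
        os.write(handle, RTFText)
        os.close(handle)
        proc = subprocess.Popen(['unrtf', '--text', path],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        out, err = proc.communicate()
        return out.strip()
    finally:
        os.remove(path)

def cleanup_note(RTFText):
    # Try RTFEditorKit first (function defined above); fall back to UnRTF.
    text = cleanup_one_RTF_note(RTFText)
    if not text:
        text = cleanup_with_unrtf(RTFText)
    return text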

December 3, 2009

Data cleaning cost

Filed under: Exploring the depths — Dr. H @ 9:15 pm

In the end, getting the data described in my previous post took me roughly an entire day. During that day I had to pore over the data schema for about twelve different tables, and I created four views. I'm reasonably proficient with SQL, so the views aren't horrible and they perform adequately. But it took me a long time to get them right.

All told, I wrote 37 lines of very carefully hand-tuned SQL. It runs and it does what it's supposed to do. I also know that there are patients on the medication I'm looking for who are not in the resulting data set. But for a first pass it was good.

A little birdie told me that a large academic institution spends $100k cleaning data for each study it does on its data warehouse. I believe it. We must get this down to a reasonable level by improving the tools. Otherwise, the promise of simple, fast research on existing data will go unfulfilled.

November 26, 2009

This isn’t that easy

Filed under: Exploring the depths — Dr. H @ 12:49 am

I spent most of the afternoon trying to get one measurement from our database for patients who take a certain medication. Since our data warehouse (i2b2) does not support this, I had to delve into a backup of the SQL database that hosts the original records.

The schema for our clinical database is reasonable, and a person with a working knowledge of SQL and some relational database experience, like myself, can grasp it quickly. However, the way the schema is used is flabbergasting, and I'm not completely sure whether it's our installation, our users, or the end-user software. For example, when people enter a medication order erroneously, I would expect one of two sensible things to happen:

  1. The medication order is deleted and an appropriate entry reflecting this is added to the audit log (I’m not sure if our EHR supports this), or
  2. The medication order is followed by an immediate cancellation order (which our EHR supports).

However, our clinical system, in its infinite wisdom, stores these occurrences as discontinuation orders. Yes, the kind that means “Mrs. Smith? You know that medication you are taking? Please stop taking it.” So we have patients legitimately taking drugs and then stopping after infinitesimal, or zero, amounts of time. So instead of relying on cancellation orders we have to do date arithmetic in SQL, and hope that we are interpreting what we see in the data tables correctly.
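
The date arithmetic boils down to something like the query below. The table and column names are hypothetical (the real schema looks nothing like this), and the one-day threshold is arbitrary; the point is just the "ignore orders that were stopped almost immediately" filter:

# Hypothetical, PostgreSQL-flavored table and column names; the real
# schema is different. Orders with no discontinuation at all would
# need a LEFT JOIN on top of this.
REAL_STOPS_ONLY = """
    SELECT o.patient_id,
           o.medication_id,
           o.start_date,
           d.order_date AS stop_date
    FROM   med_orders o
    JOIN   med_orders d
           ON  d.patient_id    = o.patient_id
           AND d.medication_id = o.medication_id
           AND d.order_type    = 'discontinue'
    WHERE  d.order_date - o.start_date > INTERVAL '1 day'
"""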

Don’t get me started on the measurements. “The measurement’s wrong? Why, let’s just enter a new measurement and leave the wrong one there! Flags and deleting are for sissies!”

November 25, 2009

MetaMap Tools update

Filed under: Code,Metamap — Dr. H @ 8:36 pm

My first edition of metamap_tools had several problems and inefficiencies. I'll keep updating it as I use and refine it. It currently creates a new process every 500 lines and, instead of keeping track of line IDs itself, relies on MetaMap to do so.

The current edition is much, much less prone to die suddenly from a broken pipe and in fact I haven’t seen it die… yet.
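
The batching itself is nothing exotic. Roughly speaking (this is a simplified sketch, not the actual metamap_tools code, and the MetaMap command line is left as a placeholder):

import subprocess

BATCH_SIZE = 500
METAMAP_CMD = ['metamap']   # placeholder; real flags depend on your installation

def process_in_batches(lines):
    # Start a fresh MetaMap process for every batch of lines rather than
    # keeping one long-lived process (and its pipe) around forever.
    for start in range(0, len(lines), BATCH_SIZE):
        batch = lines[start:start + BATCH_SIZE]
        proc = subprocess.Popen(METAMAP_CMD,
                                stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE)
        out, _ = proc.communicate('\n'.join(batch) + '\n')
        yield out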

 
