by Rachel Shadoan

We wrote the project proposal before we saw the data we’d be working with. That was our first mistake.

It’s not unusual to write a proposal before having access to a client’s data, especially for experienced consultants. However, this project was a first-of-its-kind collaboration between our university and a large regional hospital, and we were grad students so green that lawn fertilizer companies wanted to replicate our hue for their ads. It was not, therefore, a project that should have been embarked upon without a discovery phase.

Unfortunately, I didn’t learn about discovery until three years later when Kronda Adair of Karvel Digital introduced the idea to me, so onward we pressed with our proposal.

The project seemed straightforward enough, for all that it was a unique collaboration. My partner and I were to liberate the data from the hospital’s operations database, clean it up, and produce de-identified data sets for university researchers. Exporting a database, we reasoned, was simple: at most, it would take a few small scripts. The bulk of the work would go towards cleaning the data once it was exported. We’d host the de-identified data sets on a university server, but the work itself would take place on servers in the hospital’s server room. This setup provided an extra layer of insurance for the hospital: they could end the project at any time just by walking into the server room and turning the project servers off.

Two years later, after installing the servers, procuring sufficient insurance to be allowed to even see the data, and much waiting, we were granted access.

We knew that the database was legacy government software, written in an implementation of MUMPS (Massachusetts General Hospital Utility Multi-Programming System). MUMPS is used primarily in hospital applications; I suspect the original developers feel very clever about the name. Its primary selling point is a transparently integrated key-value database. So much for our plan to export the data directly into Postgres.

We did not want to learn MUMPS. We especially did not want to learn a proprietary implementation of MUMPS. This feeling intensified when we saw the documentation. There were instructions for interacting with the database system by entering MUMPS commands into a terminal; the trouble was that the documentation also reproduced the hundreds of pages of console spew that a command might produce. A page number for the page featuring the next command would have been too convenient; instead, we had to flip through page after page of console spew until we happened across it. It was as though the documentation requirement in the original contract for developing the software included a page quota but imposed no quality restrictions.

After weeks of digging through PDFs downloaded from increasingly difficult-to-locate government and vendor websites, we found what we thought would be our salvation: documentation for the hospital software’s SQL compatibility layer. There were even JDBC (Java Database Connectivity) bindings! We’d planned to be scripting in Python, but Java was our wheelhouse. We’d export the data using Java, plunk it into Postgres, no big deal.

It might have been no big deal, too, had the JDBC bindings been completely implemented. Instead, we spent additional weeks trying to find workarounds for methods in the JDBC bindings that were either entirely unimplemented or failed inexplicably. When we finally coaxed the software to give us the metadata through the JDBC bindings, we learned that there were 26,000 tables.

Twenty-six thousand. Many of the tables had hundreds of columns in addition to having millions of rows.

The tables were spread across ten or so namespaces, with two namespaces holding the majority of the data. Many of the table names were duplicated across the two largest namespaces, but it was unclear whether the data the tables contained was duplicated as well. We filed that as a Problem for Future Us, and concentrated on completing the export. We had all the metadata; now we could select full tables at a time and put them directly in Postgres.

As it turned out, most of the tables contained entire columns that could not be exported using the JDBC bindings. When a select included one of those columns, the entire select failed, and there was no way to tell which columns would cause a failure until the export actually failed. We took to calling them “column bombs”.

There were a few tables we could successfully export that way, and we attempted to put them directly in Postgres. Unfortunately, the column names from the hospital database exceeded Postgres’s maximum identifier length of 63 bytes (set by the compile-time constant NAMEDATALEN). In hindsight, we should have abandoned this approach around the time we found ourselves custom-compiling a version of Postgres that allowed column names of 128 characters or more.

To deal with the tables containing column bombs, we began exporting tables one column at a time. Putting tables into Postgres one column at a time is one of the least efficient ways to build tables. Even processing tables in parallel, the data export was taking forever—nearly two weeks to remove the data from the hospital database and import it to Postgres.
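The column-at-a-time workaround can be sketched in a few lines. This is a minimal Python sketch, not our actual export code: `fetch_column` and the toy `fake_fetch` are hypothetical stand-ins for the real JDBC calls, and the table and column names are invented.

```python
def export_table(table, columns, fetch_column, failed_columns):
    """Export a table column by column, recording 'column bombs' instead of aborting."""
    exported = {}
    for col in columns:
        try:
            # One select per column: a bomb takes out only that column's export.
            exported[col] = fetch_column(table, col)
        except Exception:
            failed_columns.append((table, col))  # revisit these later
    return exported

# Toy stand-in for the real database call; "DIAGNOSIS" plays the column bomb.
def fake_fetch(table, col):
    if col == "DIAGNOSIS":
        raise RuntimeError("inexplicable failure")
    return [f"{table}.{col}.{row}" for row in range(3)]

failures = []
data = export_table("PATIENT", ["ID", "NAME", "DIAGNOSIS"], fake_fetch, failures)
# data holds ID and NAME; failures records which (table, column) pairs blew up.
```

The point of the failure list is that the export keeps moving: a bomb costs you one column, not the whole table.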

Then there was the problem of the row bombs. Much like the column bombs, which made entire columns unselectable, the row bombs appeared without warning. A row bomb made it impossible to retrieve any cell after it in a column; the only workaround was to re-select the remaining rows in a separate selection. The hospital database was not at all optimized for this kind of access, and the transfer process became more and more inefficient as we worked around these errors.
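The row-bomb workaround can be sketched the same way. Assume, hypothetically, that a failed selection at least tells you which row it died on; the export then resumes just past it with a fresh selection. All names here are invented.

```python
class RowBomb(Exception):
    """Raised by the (hypothetical) fetch when it hits an unreadable row."""
    def __init__(self, row):
        super().__init__(f"row bomb at row {row}")
        self.row = row

def export_column(fetch_range, n_rows, bad_rows):
    """Collect one column's cells, re-selecting past each row bomb.

    fetch_range(start, end) stands in for the real SELECT over rows
    [start, end); it either returns their cells or raises RowBomb with
    the index of the first unreadable row.
    """
    cells = []
    start = 0
    while start < n_rows:
        try:
            cells.extend(fetch_range(start, n_rows))
            break
        except RowBomb as bomb:
            cells.extend(fetch_range(start, bomb.row))  # rows before the bomb are fine
            bad_rows.append(bomb.row)
            start = bomb.row + 1  # resume in a separate selection
    return cells

# Toy column of 8 rows where row 4 is a bomb.
def fake_fetch(start, end):
    for row in range(start, end):
        if row == 4:
            raise RowBomb(row)
    return list(range(start, end))

bad_rows = []
cells = export_column(fake_fetch, 8, bad_rows)
```

Each bomb costs an extra round trip, which is one reason the transfer got slower and slower as bombs accumulated.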

We decided to switch tactics. If, as we hypothesized, most of the data was duplicated across the two main namespaces, we could export only one copy of it and halve the amount of data we needed to put in Postgres. Thus, the Deduplicator was born. Using the Akka framework to handle our scalability concerns, we started building an application to reduce the data to its smallest version. Conceptually, the Deduplicator was awesome: nested-data un-rollers and bi-directional maps that kept track of every occurrence of any given value throughout the whole data set. It would make it easy to answer our client’s favorite kind of question: “Is such-and-such in the data?” In practice, though, much of the functionality we needed from Akka had not yet been released. Six months later, the Deduplicator was still a beautiful, unproven dream, and we had thousands of lines of code that turned out to be nearly impossible to debug.

Our stakeholders were getting antsy. We could hardly blame them–the project had turned out to be dramatically more involved than we had expected. We had burned too much time trying to solve the general problem of de-duplicating the data. The researchers were uninterested in the technical challenges of manipulating such a large, gnarly data set ensnared in such a persnickety tangle of software. They wanted analysis, and we needed to deliver.

We gave up on Postgres and instead exported the data to serialized Java objects, one column at a time, maintaining lists of the columns, tables, and rows that failed. We wrote a script to do a simple layer of de-duplication based on tables and columns with the same name in different namespaces. We partially denormalized the data and put it in Elasticsearch, which turned out to be the best choice we made during the entire project.
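That simple, name-based de-duplication pass is easy to sketch. This is an illustrative Python sketch with made-up namespace and table names, not our actual script, and it keys only on table names; the real pass also compared column names across namespaces before assuming two tables were duplicates.

```python
def deduplicate(namespaces):
    """Keep one copy of each table name across namespaces.

    namespaces maps namespace -> {table_name: rows}. Tables that share a
    name in a later namespace are assumed to be duplicates and dropped.
    """
    kept, dropped = {}, []
    for ns, tables in namespaces.items():
        for name, rows in tables.items():
            if name in kept:
                dropped.append((ns, name))  # same name seen already: assume duplicate
            else:
                kept[name] = rows
    return kept, dropped

# Made-up namespaces: PATIENT appears in both, so only one copy is kept.
namespaces = {
    "NS_A": {"PATIENT": [1, 2, 3], "VISIT": [4, 5]},
    "NS_B": {"PATIENT": [1, 2, 3], "LAB": [6]},
}
kept, dropped = deduplicate(namespaces)
```

A crude heuristic like this is exactly the "solve the specific problem first" trade-off: it shrank the export without waiting on the general-purpose Deduplicator.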

Elasticsearch, and its accompanying web front-end Kibana, finally made the data usable. Elasticsearch’s full-text search even made the unstructured text in the data, like physicians’ notes, queryable. Kibana gave our primary stakeholder at the hospital an interface to explore the data directly. Analyses that I thought would take years of effort became feasible in a few weeks. In short, Elasticsearch was a miracle for this project (though not enough of a miracle to save me from the Baby Matching Drama of 2016).1

This project is still ongoing; we are currently working on a new version of the export software. This time, rather than using the SQL compatibility layer, we’re using a Telnet client that allows us to export the data without forcing a transformation to relational data. Using this method, there are fewer than 4,000 “tables”2, rather than the 26,000 we had been working with3. The new process also freezes our copy of the hospital database during the export to ensure that no new writes happen during the transfer. The data will be written in plaintext to local git repos on the hospital servers so that each transfer has a commit hash and we have an auditable trail as the data changes over time. From there, the data is added to Elasticsearch.

When we started this project, we came in as research scientists, expecting the challenges to be primarily in analysis and data cleaning. We expected to be writing machine learning algorithms to build profiles of patients, and to be doing statistical analysis of the impact of climate change on various health concerns. In short, we were completely unprepared for what we were getting into. In the five years since we began, we learned a lot about software development and client management. Here’s the advice I wish I could give Past Me:

1. Solve the specific problem FIRST. The general solution is often more interesting, but your clients do not care about the general problem. They care about their problem. They are unlikely to be sympathetic to delays encountered when you realize the general solution is a lot harder than you expected it to be. You can always iterate later, producing a general solution informed by your knowledge of the specific problem.

2. Do not re-invent the wheel. If you re-invent the wheel, you have to maintain the wheel, and you do not want that job. My goal now is to strip as much of our own software out of this project as possible. Everything that we can implement using mature software maintained by people who aren’t me, we do. It’s easy for a new developer to think, “It will take me less time to make this feature than it will take me to learn someone else’s library”. That is almost always false. Someone else’s library is very likely to account for corner cases that you haven’t even conceived of yet. Read the documentation, use someone else’s library, then send them money, kind words, and pull requests.

3. Don’t transform key-value data into relational data (or vice versa) unless you have a compelling reason to do so. Using the SQL compatibility layer seemed like a good idea to us because it allowed us to use tools we were very familiar with, but in practice the transformation caused far more problems than it solved.

4. Any data transform process you make must be version-controlled, repeatable, and fully auditable. You need to be able to identify the exact software that produced any given version of the data. The same goes for any analysis.

5. Universities are unconcerned about your consultancy receiving checks on time, regardless of how much you need them to make rent. Do not be afraid to remind them before the invoice is due, and to hound them when payments are late. Note that universities often work on Net 45 or Net 60 payment terms, so late from your perspective may not be late from theirs. Which brings us to number 6.

6. Never work without a contract and a statement of work that stipulates exactly what you will be implementing, who will own the software, and what the payment schedule is. A lawyer specializing in contracts for freelancers and small businesses can help you with that; Katie Lane of Work Made For Hire helped us immensely in the early days of our consultancy.

Finally, and most importantly:

7. NEVER SUBMIT A PROPOSAL FOR A DATA PROJECT WITHOUT HAVING SEEN THE DATA. Ideally, set up a discovery period in which the client pays you to investigate their data and infrastructure and produce a proposal for the work they want done. If you have not seen the data yourself, you are not able to give an accurate estimate of how much time the project will take and how much it will cost. Following this one rule would have made this project much smoother for everyone.

Working with data can be incredibly rewarding—a treasure hunt for meaning in our digital trails. It is not, however, without its pitfalls. In data science we talk a fair amount about bad analysis; we pay less attention to the engineering challenges. This does our discipline a disservice, especially given that a slipshod approach to the underlying data pipelines can contribute to bad analysis. Learn from our mistakes so you can make new and different ones of your own!

Rachel Shadoan is the co-founder and CEO of Akashic Labs, a Portland-based research and development consultancy, where she specializes in combining research methodologies to provide rich and accurate answers to technology’s pressing questions. When she’s not playing with data, she can be found optimizing her balcony for planting space and live-tweeting episodes of Mister Rogers’ Neighborhood.

  1. After we successfully exported the data, I was asked to make a data set of mothers and their babies for researchers who were exploring how mothers’ health and economic factors impact the health of their babies. Unfortunately, not only was there no simple way to identify whether a patient had given birth in the hospital, there was also nothing in the 26,000 tables that directly linked mothers to their babies. Building this data set required developing a way to identify patients who had given birth in the hospital, a way to identify patients who had been born in the hospital, and then matching babies to their mothers based on a variety of features. This work is ongoing, but you can read more about the challenges we encountered here
  2. Key-value databases don’t exactly have tables. The program with the telnet interface actually considers them “files”. They’re similar to tables, but have “multiples”, which are fields that can hold lists of values, including nested lists. 
  3. This is not a typo. Key-value databases can store data with one-to-many and many-to-many relationships more efficiently than relational databases can. For instance, in a key-value database, a patient’s record could contain a list of medications. Representing that in a relational database requires a patient table, a medication table, and a join table that maps patients to medications. When converting from a key-value database to a relational database, an increase in the number of tables is expected. However, an increase of 6.5x the number of tables is highly unusual.
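To make footnote 3 concrete, here is a toy Python sketch (all names invented) of what normalizing even one nested field looks like: a single key-value patient record with a list of medications becomes three relational tables.

```python
# Hypothetical patient records, the way a key-value store might hold them:
patients = {
    "p1": {"name": "Ada", "medications": ["aspirin", "insulin"]},
    "p2": {"name": "Grace", "medications": ["aspirin"]},
}

def to_relational(patients):
    """Flatten one nested field into three relational 'tables'."""
    # Patient table: one row per patient.
    patient_rows = [(pid, rec["name"]) for pid, rec in patients.items()]
    # Medication table: one row per distinct medication, with a surrogate id.
    med_names = sorted({m for rec in patients.values() for m in rec["medications"]})
    med_rows = list(enumerate(med_names))
    med_ids = {name: mid for mid, name in med_rows}
    # Join table mapping patients to medications.
    join_rows = [(pid, med_ids[m])
                 for pid, rec in patients.items()
                 for m in rec["medications"]]
    return patient_rows, med_rows, join_rows

patient_rows, med_rows, join_rows = to_relational(patients)
```

Every nested or repeated field in the key-value source forces a split like this, which is how a database with a few thousand "files" can balloon into 26,000 relational tables.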