Taking the Elbow Grease Out of Scrubbing Sensitive Data from Medical Records

Science Highlights
May 31, 2009
Example of de-identified nursing notes.

Computerized medical records have slowly begun taking the place of paper records in many large health care facilities across the United States and are gaining popularity even among smaller practices. Many of the benefits of electronic health records (EHRs) are closely tied to direct health care delivery to the patient (e.g., coordination of care, continuous quality measurement, and reduction of medical errors through monitoring). However, secondary uses of EHRs – such as analysis, research, quality and safety measurement, and public health – are equally important. "Clinical data provide a potential treasure-trove of information that can help us develop a better understanding of diseases, improve the ways in which we treat them, and make the medical care process more efficient," says Dr. Peter Szolovits, Professor of Computer Science and Engineering at MIT and collaborator on medical informatics research funded by the National Institutes of Health's National Institute of Biomedical Imaging and Bioengineering (NIBIB).

These important secondary uses of the data are made possible by pooling EHR data from large numbers of patients into a common database. Dr. Roger Mark, Professor in Health Sciences and Technology and Electrical Engineering, and his colleagues at MIT, Beth Israel Deaconess Medical Center, and Philips Healthcare have developed one such database, the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC II) database. It contains more than 30,000 ICU records, each of which includes detailed physiologic data, lab reports, medication and treatment records, and unformatted-text clinical progress notes and discharge summaries. MIMIC II supports retrospective clinical studies as well as the development of new monitoring algorithms for tracking and even predicting the clinical state of patients.

Much valuable data are in the form of unformatted-text medical notes (e.g., nursing notes, discharge summaries, and x-ray reports). According to Prof. Mark, "The narrative (free text) portions of a medical record are critical for fully understanding the case. While free text is very efficient in transmitting meaning to a human reader, it presents major challenges to researchers who wish to abstract key concepts in machine-readable form."

Although rich in information, free text carries the risk of containing information that could be used to identify a specific person, such as names, birth dates, and other descriptive details. Removing protected health information (PHI) such as potential patient identifiers from unformatted-text narratives in an efficient and accurate manner presents another hurdle for researchers. In addition to the Health Insurance Portability and Accountability Act (HIPAA) and other legal requirements, there are ethical considerations. Potential negative effects of revealing patient identity include discrimination in employment and insurance as well as social stigma. For all of these reasons, the MIMIC II database had to be scrubbed of all PHI, including PHI found in free text, before it could be made available to researchers. "We need to be able to work with these data without putting the privacy of the patients and the confidentiality of the data at serious risk," adds Prof. Szolovits. "Our work on de-identification of narrative text is a step toward this goal."

Scrubbing the Data

Manually removing PHI from EHR narratives is a costly and time-consuming process that is prone to error. For example, a consensus of two human de-identifiers has been shown to identify only 94% of all instances of PHI in the text. To reduce costs, save time, and make the scrubbing of large volumes of EHR data more efficient and accurate, MIT researchers developed de-identification software. In addition to HIPAA-specific PHI, the software also removes other identifying health information (e.g., references to ethnicity and common holidays that may indicate dates of events or the cultural or ethnic background of the patient). In a test of 1,836 nursing notes (about 300,000 words), the software did not miss any patient names and missed only one full date and one age over 89. The new software outperformed a single human de-identifier and performed just as well as a consensus of two human de-identifiers.
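
To give a rough sense of what automated scrubbing involves, the following is a minimal illustrative sketch in Python. It is not the MIT software: the patterns, the short holiday list, the `[**...**]` placeholder format, and the `scrub` function with its `known_names` parameter are all assumptions made for this example. It masks numeric dates, ages over 89, a few holiday names, and any names supplied by the caller.

```python
import re

# All patterns below are hypothetical and far simpler than a real de-identifier.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")                # e.g. 3/14/2008
AGE_RE = re.compile(r"\b(\d{2,3})[- ]year[- ]old\b", re.IGNORECASE)  # e.g. 92-year-old
HOLIDAYS = ("Christmas", "Thanksgiving", "Easter")                   # illustrative list only
HOLIDAY_RE = re.compile(r"\b(" + "|".join(HOLIDAYS) + r")\b", re.IGNORECASE)


def scrub(text, known_names=()):
    """Replace a few kinds of identifying text with bracketed placeholders."""
    # Mask numeric dates.
    text = DATE_RE.sub("[**DATE**]", text)

    # Mask only ages above 89; younger ages are left unchanged.
    def mask_age(match):
        if int(match.group(1)) > 89:
            return "[**AGE OVER 89**]"
        return match.group(0)

    text = AGE_RE.sub(mask_age, text)

    # Mask holiday names, which can hint at dates or cultural background.
    text = HOLIDAY_RE.sub("[**HOLIDAY**]", text)

    # Mask names supplied by the caller (assumed to come from a patient list).
    for name in known_names:
        pattern = r"\b" + re.escape(name) + r"\b"
        text = re.sub(pattern, "[**NAME**]", text, flags=re.IGNORECASE)
    return text


note = "Pt John Smith, 92-year-old male, admitted 3/14/2008 after Thanksgiving."
print(scrub(note, known_names=["John", "Smith"]))
```

A sketch like this would miss misspelled names, dates written out in words, and many other identifiers, which is part of why the test results reported above are notable.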

To facilitate research in critical care and medical decision support, the extensive MIMIC II database has been made available to the research community on PhysioNet – a resource that offers free Web-based access to large databases of recorded physiologic signals and related open-source software (www.physionet.org). The NIBIB and NIH’s National Institute of General Medical Sciences (NIGMS) fund PhysioNet. "The natural language processing research community also may find the narrative sections of MIMIC II to be a useful research corpus," explains Prof. Mark.

Additional Safeguards for Electronic Health Records

"The richness of the detail found in narrative portions of medical records raises the possibility that, in unusual circumstances, the identity of an individual could be discovered by correlating information in the [de-identified] medical record with available public records," explains Prof. Mark. "As an imaginary example, news reports of a collision between a Segway® operated by an inebriated 75-year-old woman and a police car might be correlated with de-identified textual data in MIMIC II that mentions Segway®, thus revealing the patient’s identity." Although such cases are likely to be very rare, investigators who want to use MIMIC II data must sign a data use agreement (DUA) promising that they will not attempt to identify subjects. The DUA also specifies that the researcher will notify MIMIC II collaborators if data are discovered to have escaped de-identification.

The use of the de-identification software in conjunction with health information systems not only addresses some important legal and ethical concerns related to sharing health information, but also helps allay patients’ fears of data misuse. When all shared health information is de-identified, researchers have access to the data without sacrificing the privacy and peace of mind of the patients themselves.
