Children’s Hospital of Philadelphia and its Research Institute have the ability to explore pediatric data better than almost anywhere else in the world to solve challenging problems in child health. In his new role as associate vice president and chief research informatics officer at the Research Institute, Jeff Pennington sees a “don’t miss” window opening: CHOP has the right tools, infrastructure, people, and skills in place at the right time to launch Arcus, an integrated data science platform.
In essence, Arcus will produce a more holistic picture of pediatric health and disease by building a one-of-a-kind library based on all of the data generated at CHOP over the course of patients’ clinical encounters and research study visits throughout childhood and adolescence. We sat down with Pennington to hear how expanding our in-house library science expertise will allow researchers to take advantage of rapidly advancing computational biology and machine learning methods to make entirely new kinds of breakthroughs.
It’s fascinating to hear how science careers often take serendipitous twists. How did your career evolve from studying biology to becoming a top expert in data and technology at the Research Institute?
I was going to be a veterinarian when I grew up, which was based on my love of science and animals and nature. I worked with small and large animals, including racehorses, but I figured out that path wasn’t for me. I had no idea what to do next, so I fell into a startup company that helped businesses make effective use of their data. It seemed completely accidental at the time, but that became the defining theme for the rest of my career: how to help people make productive use of data and technology.
The transition from reading animal behaviors to talking terabytes must have been a challenging learning curve. How did you adapt to a new field of interest?
I was fortunate to have some really excellent technical mentors who threw me into the deep end and helped me swim. I learned how to program and build databases, and I also had strategic-thinking mentors who helped me see the big picture.
My next job was at a dot-com search engine company, where I figured out how internet searches represented patterns in people’s interests and behavior. That was my first exposure to big data, where there was more data than you could possibly organize.
From there, I went to a biotechnology company as a software engineer when the dot-com boom went bust. I was thrilled to put my life’s interest in the biological sciences together with the technical tools and strategic thinking that I had learned. Again, with the help of great mentors, I gained a lot of experience and expertise in how to use technology to unlock biomedical data. This was at the time of early genomics and gene expression data, before the Human Genome Project was completed. We knew more about how somebody’s genes were behaving than we could make sense of, so my job was to link clinical and pre-genomic data.
And then your career trajectory and the emergence of big data science synced up when you came to the Research Institute in 2007 as an analyst. A decade later, how is CHOP in a unique position to make our wealth of pediatric data more useful for exploring child health and disease?
CHOP is a very special place because we are a digital health enterprise that has been using a single electronic health record since 2010. In another 10 years, we’ll have digital health clinical data on a cohort of 18-year-olds, many of whom will also have participated in research. We will have a lifespan picture of the progression of child health, diagnosis, and treatment. The better we are at understanding child health and also diagnosing and predicting disease, the longer and happier children’s lives are going to be. We’re heading into a 10-year window when we could do amazing things.
With about 2.6 million patients on record at CHOP, that’s an ambitious goal. How do you even get started?
Of those patients, we see about 500,000 a year clinically, which generates hundreds of terabytes of data from MRIs, CTs, bedside monitors, laboratory results, and more that are products of us operating as a pediatric hospital. At the Research Institute, right now, we have six petabytes of research data, a lot of it genomic and molecular biology data. All combined, our data is too large, too heterogeneous, and too complex for a traditional IT approach like a data warehouse or database.
About five years ago, it hit me that librarians have been wrestling with this problem of how to organize the infinite amount of information that humans have been producing since the Middle Ages. That has been the organizing principle of Arcus, a program that I’m responsible for and really excited about. It combines library science, data education, applied data science, data privacy, cloud computing, and data management.
At a fundamental level, Arcus is a library of our clinical and research data sets linked by cross references. We’re systematically enriching that data to increase its value and potential. For example, Arcus will label all 2.6 million patients in CHOP’s electronic health record with a standardized description based on the Human Phenotype Ontology, a structured and controlled vocabulary that describes patients’ symptoms in a way that is useful for research.
Tell us more about Arcus and your team from the Department of Biomedical and Health Informatics. They’ve accomplished a great deal of foundational work so far.
With the right team, you can do anything, so much of our startup effort has been recruiting highly skilled and passionate people. This fantastic team is coming together by working on pilot projects in many areas of research, and we are looking forward to engaging with the broader research community in the future.
Arcus includes an educational component to help our research community become more capable and confident in how to use these complex and large data sets. We have incredibly smart people at the Research Institute, and Arcus will give them the creative and intellectual horsepower to take advantage of this library of enriched data.
Another essential part of Arcus is that we will achieve these very broad ambitions to collect and link vast quantities of data on a vulnerable population — children — in an ethical way. We work hand in hand with Research Family Partners, the Office of General Counsel, the Office of Research Compliance and Regulatory Affairs, the Privacy Office, and the Institutional Review Board to meet our patients’ expectations of their right to privacy.
One of our first hires on the Arcus team was a privacy expert who helps us understand and navigate our ethical, legal, and compliance requirements. The team also includes librarians, archivists, data scientists, bioinformaticians, clinical data analysts, software engineers, data services and technology experts, and education specialists.
Arcus sounds like an unprecedented approach. How do you envision it will enable next-generation computational biology, machine learning, and translational research?
If we can be good library scientists, then we can be amazing data scientists. Data scientists spend 80 percent of their time acquiring and organizing data, and 20 percent of their time actually analyzing it. We need to flip that so our researchers can ask and answer questions about the broad spectrum of conditions that our patients are affected by — and how those conditions change over time as these children grow and develop. Arcus opens the door to entirely new research uses of pediatric data.
For example, our investigators, especially our early career investigators, are passionate about asking multidisciplinary, composite questions. Let’s say a team of immunologists observes a possible connection between a specific immunologic response to an acute disease and a child’s genetics. Using Arcus, they can explore the data and let it tell them if they have a testable hypothesis. And because we’ve linked clinical to research data in the library, the study team is in a much better position to understand their starting point and begin to quickly explore their ideas.
Arcus is the bricks and foundation to link all of CHOP’s research and clinical data and help the Research Institute to make a dramatic leap forward in how we use computational methods to make the kinds of discoveries that we’ve never made before.