Start new collaborations. Find valuable data. Create cohorts that can seed new research endeavors. These are some of the key drivers for Arcus, an internal strategic program designed for researchers to more intuitively navigate clinical and research data produced by Children’s Hospital of Philadelphia and the Research Institute. Essentially, that means making the promising wealth of data in Arcus discoverable.
In this Cornerstone post, meet Spencer Lamm, MLIS, supervisor of Library Science within the Department of Biomedical and Health Informatics (DBHi), who is adopting the standards and practices used to manage large volumes of data at places like NASA to make the new Arcus Archives a long-term source of reliable, reproducible data for CHOP researchers.
Your library scientist role with Arcus is far from traditional. If you had to choose a few words to sum up your position, what would they be?
Findable, reusable, trustworthy research data.
That makes the process sound so simple! Yet, in fact, ingesting data into Arcus and archiving it is a huge endeavor. Give us a sense of scale in terms of potential Arcus content.
We’ve been working with Research Information Systems to learn about their Isilon storage solution, where many labs manage their data, so we know it holds roughly six petabytes of research data. And that number is growing at a rapid rate. Certainly not all of this data will be appropriate for Arcus, and collaboration with labs is required prior to bringing data into Arcus, but that gives a sense of the scale. In addition, while we have reliable info about that data since it’s centralized, we’re also very aware there is valuable research data elsewhere across CHOP: on local hard drives, in slide libraries, in audio collections, etc.
Clinical and research data collection is ongoing, so Arcus will always keep expanding and become more robust. What are the key steps you’re taking now to establish this solid informatics infrastructure at CHOP for repeat use of high quality data?
The first focus for the Library Science team is to manage the data we bring into Arcus, from ingestion to access, in such a way that it remains reusable long-term. It should be available not only for the researchers who deposited it, but also for future researchers who will look to answer new research questions by linking and leveraging the broad range of data in the archives.
What this means is that we have to do more than just transfer a lab’s raw or secondary data into Arcus. We need to work with researchers to understand why and how it was created and what tools, software, and contextual information were used to generate it. Having that full picture and structuring the content in Arcus in ways that maintain the necessary relationships between, say, a custom piece of software and the dataset it was used to create, will allow Arcus to assure contributors their data is well-curated and also respond to requests for that data in ways that will support reproducing or repurposing it for new research.
We’re accomplishing this by using digital archives principles and practices, specifically those that have helped design large-scale data archives at places like NASA, to establish the processes for ingesting, managing, and preserving data for long-term access.
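As a rough illustration of the relationships described above, the sketch below records which software produced a dataset alongside the context in which it was created. This is not how Arcus is implemented; the class names, field names, and sample values (`Software`, `Dataset`, `generated_by`, `lab-aligner`, and so on) are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Software:
    """A tool used to generate data (hypothetical model, not the Arcus schema)."""
    name: str
    version: str


@dataclass
class Dataset:
    """A deposited dataset plus the context needed to reuse it."""
    identifier: str
    description: str
    # Tools used to generate this dataset (hypothetical field name).
    generated_by: List[Software] = field(default_factory=list)
    # Free-text note on why and how the data was created.
    context: str = ""


def datasets_generated_by(datasets, software_name):
    """Return identifiers of datasets produced with the named software."""
    return [
        d.identifier
        for d in datasets
        if any(s.name == software_name for s in d.generated_by)
    ]


# Illustrative records only; none of these exist in Arcus.
aligner = Software(name="lab-aligner", version="1.2.0")
archive = [
    Dataset("ds-001", "Aligned reads", generated_by=[aligner], context="Pilot cohort"),
    Dataset("ds-002", "Raw audio recordings"),
]

print(datasets_generated_by(archive, "lab-aligner"))
```

Keeping the software reference on the dataset record itself means a later request for the data can be answered together with the tool and version needed to reproduce it.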
What are some of the other data management challenges you encounter as the Arcus team embarks on linking and navigating these diverse systems to produce quality science?
We are establishing a research data catalog, the core function of which is to answer the question: “What data do we have at CHOP?” While answering this question is highly valuable, one of our goals is to build on that by identifying ways that a discovery layer on top of Arcus can map and expose relationships across datasets and research efforts that are currently hidden because the data is in silos.
We’re beginning user research that will help us formalize our understanding of researchers’ needs relative to Arcus. This research will inform much of what we do with the catalog. It will help us understand what types of connections between datasets researchers are looking for and what metadata we can expose in the catalog to help them identify the right resource. It will also help us develop novel ways to present this information and spur the kind of serendipitous discovery that leads to new collaborations.
Undergirding all of this, we’re establishing a descriptive metadata schema based on publicly available standards from the NIH and other sources that will allow researchers to find data by familiar medical and biological search terms.
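To make the idea of finding data by familiar terms concrete, here is a minimal sketch of a catalog lookup over descriptive metadata. It assumes each record carries a list of subject terms; the record fields, the sample entries, and the `find_by_term` helper are hypothetical, and the terms shown are placeholders rather than entries from any particular standard.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CatalogRecord:
    """One catalog entry (hypothetical fields, not the Arcus schema)."""
    title: str
    contributing_lab: str
    subject_terms: Tuple[str, ...]


def find_by_term(records, term):
    """Return records tagged with the given subject term, ignoring case."""
    wanted = term.strip().lower()
    return [
        r for r in records
        if wanted in (t.lower() for t in r.subject_terms)
    ]


# Placeholder records for illustration only.
catalog = [
    CatalogRecord("Sleep study recordings", "Lab A", ("asthma", "sleep")),
    CatalogRecord("Tumor slide library", "Lab B", ("neuroblastoma", "histology")),
]

for record in find_by_term(catalog, "Neuroblastoma"):
    print(record.title, "from", record.contributing_lab)
```

Drawing the subject terms from a shared, published vocabulary is what lets two labs that never coordinated still show up under the same search.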
Tell us about some of the unique resources at CHOP and the evolving technology capabilities you use to overcome these challenges.
One of the key tools we’re using to ensure that data in Arcus is trustworthy and available long-term is cloud computing. Cloud technologies offer Arcus users a number of valuable services, and for the Library Science team, they serve as a data storage and management solution that will give contributing labs and PIs confidence that their data is preserved in accord with industry standards for digital archiving, such as the National Digital Stewardship Alliance’s Levels of Digital Preservation. With the help of the Department of Biomedical and Health Informatics DevOps team, we’ve been able to meet these requirements, such as ensuring that archived data is stored redundantly in two geographic regions to guard against catastrophic events.
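Redundant storage is only useful if the copies stay identical, so one common companion practice is a fixity check that compares checksums across copies. The sketch below is an assumption about how such a check could look, not a description of the Arcus setup: it uses two local file paths as stand-ins for copies held in two regions, and the function names `sha256_of` and `replicas_match` are hypothetical. A real cloud deployment would typically rely on the storage provider's own tooling.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a file's SHA-256 digest, reading in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicas_match(primary_path, secondary_path):
    """Return True when both copies have the same SHA-256 digest."""
    return sha256_of(primary_path) == sha256_of(secondary_path)
```

Calling `replicas_match("region_a/scan.tif", "region_b/scan.tif")` (placeholder paths) would return `False` as soon as either copy is altered or corrupted, which is the signal to restore from the intact one.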
In what other ways do you collaborate with team members and principal investigators interested in using Arcus to advance its greater use in research?
Collaboration with PIs and data managers has been key for the Library Science team as we establish the data contribution process to Arcus. That relationship involves working closely with individual labs to better understand their data and learn about access and reuse requirements. We also work with Dianna Reuter, JD, Arcus privacy and security analyst, who reviews the potential contribution for privacy, legal, and security considerations.
We show them how Arcus relieves the burden many labs carry of spending valuable time managing their data, files, and folders over the long term. Data that’s valuable for preservation and reuse will be well-organized and findable. One researcher described having an internal solution to manage these needs as a “paradigm shift.”
From a library scientist’s perspective, what excites you the most about this one-of-a-kind project?
I personally have always been the most excited to design systems that help people get the best possible resources and answers to their questions. Arcus is a unique opportunity for our team to take the professional practice and history of libraries and archives and apply it, transform it, and sometimes reinvent it to address new types of questions with content that just hasn’t been available in this way before.
Not only are we supporting researchers in achieving their immediate goals and advancing their careers, we’re also hopefully sparking unprecedented opportunities to discover new knowledge and breakthroughs in child health by establishing a foundation for future pediatric researchers to build on the Institute’s legacy of scientific success.
(This post is the third in a series exploring how Arcus team members are using their expertise to find innovative ways to expedite the scientific process and uncover novel research opportunities at CHOP. Read more about Jeff Pennington and Dianna Reuter, JD.)