NREL Provides Guidelines for Creating Next-Generation Data Ecosystem

Nov. 22, 2021

The Research Data Infrastructure (RDI) behind the High-Throughput Experimental Materials Database (HTEM-DB), publicly available at htem.nrel.gov, is a set of custom data tools that collect, process, and store experimental data and metadata, enabling the HTEM-DB repository for inorganic thin-film materials data collected during combinatorial experiments at NREL. Illustration by Joelynn Schroeder, NREL

Scientists at the National Renewable Energy Laboratory (NREL) are helping pave the way for the next generation of data-driven, AI-enabled materials science.

In a new article in the journal Patterns, the authors describe a decade-long effort to build a modern data ecosystem to support close interaction between computational sciences and materials sciences. They want people to know that this is not an easy feat, but also that the importance of such a system cannot be overstated.

The article, “Research Data Infrastructure for High-Throughput Experimental Materials Science,” appears as the cover story in Patterns and provides a blueprint for developing future research data infrastructure designed to increase the integration of experimental and data research in materials science.

“For machine learning to make significant contributions to materials science, algorithms must ingest and learn from high-quality, large-volume datasets,” said Andriy Zakutayev, a senior materials scientist and co-author of this study. “NREL’s Research Data Infrastructure (RDI) described in this article provides such a dataset by organizing experimental materials data into the High-Throughput Experimental Materials Database (HTEM-DB).”

In addition to Zakutayev, the article was written by Kevin Talley, Robert White, Nick Wunder, Matthew Eash, Marcus Schwarting, Dave Evenson, John Perkins, William Tumas, Kristin Munch, and Caleb Phillips, all of whom did this work as part of NREL’s Materials, Chemical, and Computational Science (MCCS) Directorate.

Databases are the cornerstone of modern data-driven materials science, enabling materials discovery by summarizing crystal structures and predicted properties for tens of thousands of calculated materials. To provide access to large amounts of experimental data, NREL in 2018 opened to the public the HTEM-DB, which allows researchers to discover experimental materials with useful properties. This experimental dataset includes material synthesis conditions, chemical composition, crystal structure, and physical properties. The dataset initially contained 140,000 samples, with more than half available to the public, and now holds in excess of 320,000 samples. These numbers are on par with computational materials databases such as the Department of Energy-funded Materials Project.

Key to the success of the HTEM-DB is NREL’s Research Data Infrastructure (RDI), a data management system. The RDI is integrated into the laboratory workflow and catalogs data from experiments conducted at NREL over the past decade.

“The sustained effort to develop research data infrastructure at NREL continues to pay dividends,” noted Bill Tumas, NREL’s associate laboratory director for MCCS and a strong advocate of materials discovery and development. “By establishing various RDI components, integrating them together, and providing them to researchers, a complete data workflow has been implemented that curates valuable data in HTEM-DB for future use in machine learning studies.”

The data tools that form the RDI, such as the data warehouse, the metadata collector, the data extraction process, and the HTEM-DB itself, are all considered critical to its success and have been in use at NREL for the past decade. The researchers said they hope that, by describing these RDI components in the article, the tools will serve as best practices for other institutions to follow.

In materials science and its quest for the new, researchers take two distinct approaches: testing a hypothesis through experimentation, and sifting through data to analyze the connections within it. Each of these approaches, experiment-driven and data-driven, has its own requirements, but they can be integrated into a single process. The experimental researchers need tools to analyze and learn from the data, while the data researchers need large, diverse, high-quality datasets. Both, however, need access to previously obtained data and a repository for new data, so the inputs and outputs of experiment-driven and data-driven materials research overlap strongly.

Those requirements, the researchers noted, motivated the creation of both the HTEM-DB and the RDI, which collects, processes, and stores experimental data and metadata. The RDI improves the efficiency and accuracy of experimental research data handling and allows machine learning methods to sift through a wide range of materials. The authors make clear the immense value for interdisciplinary science in building a common framework for carefully curated experimental data and metadata.

“Building the research data infrastructure at NREL was done from the bottom up, with multiple contributions from many people over the time span of more than a decade. We learned a lot from this process,” said Kristin Munch, one of the early developers of the RDI. “Ideally, future RDI systems could be engineered with some top-down focus as well, to ensure data consistency and provenance over time.”

The top-down approach, however, would require a substantial upfront investment in hardware, network installation, and software development, as well as continued spending on maintenance and improvement.

“An interesting third option would be to develop the RDI framework from the top down using external funding at one institution with prior experience in this, like NREL, and then customize from the bottom up to be most useful for other external research labs outside of NREL,” said Caleb Phillips, the senior data scientist on this study. “But no matter the funding mechanism, research data infrastructure investments like these are critical to advancing modern data-driven science at the intersection of material science and many other fields.”

Financial support for operating and improving the HTEM and RDI came from NREL’s Laboratory Directed Research and Development program. Data Warehouse prototyping was originally funded by the U.S. Department of Energy’s Office of Energy Efficiency and Renewable Energy.

Learn more about materials science and computational science research at NREL.