Data Lake Provides One-Stop Shop for Open Data
The Open Energy Data Initiative (OEDI) Data Lake Hosts High-Value Data Sets and Analytics Tools in One Public Location, Expanding Access and Accelerating Innovation
July 1, 2020
Historically, the research community has accessed public data sets from the U.S. Department of Energy (DOE) or its national laboratories by downloading them from a website onto a personal computer.
While that approach works for small data sets like Excel files, it is not possible with data that has fine temporal and spatial resolutions like NREL's National Solar Radiation Data Base (NSRDB) or Wind Integration National Dataset (WIND) Toolkit. A single file from those databases exceeds the available space on a standard computer hard drive.
The only way to access data of such magnitude is either by running it through a high-performance computer or by downloading much smaller subsets to work with individually, and neither of those options allows people outside of NREL to perform large-scale analyses. That was the motivation behind the Open Energy Data Initiative (OEDI), a DOE-sponsored effort to build a cloud-hosted Data Lake of high-value energy research data sets developed by DOE programs and national laboratories.
"The OEDI Data Lake turns the paradigm on its head. Instead of people bringing data to themselves, the Data Lake brings people to the data," said Michael Rossol, an NREL software engineer who manages the 500 terabytes of data that make up the NSRDB and WIND Toolkit and is helping to lead the development of the Data Lake at NREL on behalf of DOE.
Through the Data Lake, large, complex data sets and analytic tools from across the DOE complex are becoming available in one location for the first time, opening the door for unprecedented accessibility and collaboration.
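For a sense of the scale involved, the short sketch below estimates how long it would take to download the 500 terabytes cited above over a single network link. The link speeds are illustrative assumptions rather than figures from NREL, and the function name is hypothetical.

```python
# Back-of-envelope estimate: time to download a data set over a fixed link.
# The link speeds used below are illustrative assumptions, not NREL figures.

def download_days(size_terabytes: float, link_megabits_per_s: float) -> float:
    """Return the days needed to move size_terabytes at a sustained link speed."""
    size_bits = size_terabytes * 1e12 * 8              # decimal terabytes -> bits
    seconds = size_bits / (link_megabits_per_s * 1e6)  # bits / (bits per second)
    return seconds / 86_400                            # 86,400 seconds per day


if __name__ == "__main__":
    for mbps in (100, 1_000):
        print(f"{mbps} Mbit/s: {download_days(500, mbps):.0f} days")
```

Under these assumptions, a sustained 100 Mbit/s connection would need roughly 463 days and a 1 Gbit/s connection roughly 46 days, before accounting for local storage.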
What Is a Data Lake?
A "data lake" is a massive collection of curated and diverse data sets that is open to anyone with internet access and is often supported by cloud-computing vendors. Data flows into the "lake" from a variety of sources, users perform analyses, and new insights from streamlined and filtered data can flow out of the lake or flow back into the lake for use by other analysts. The process then continues, accelerating research and development and creating opportunities for collaboration.
Through partnerships with cloud vendors including Google, Microsoft, and Amazon, the OEDI Data Lake lives in the cloud. Currently, it includes NREL's PV Rooftop Data set, the NSRDB, Lawrence Berkeley National Laboratory's Tracking the Sun data set, the WIND Toolkit, and the U.S. Wave data set from DOE's Water Power Technologies Office.
Without the OEDI Data Lake, those data sets would not be accessible to the public, let alone in one place. Even within a national laboratory, data sets often "live" in different places, whether that is on a server in a high-performance computing system or within individual research teams' databases.
"Now that we've created an infrastructure in the cloud, there's an ecosystem for people to mash up multiple data sets—including ones they contribute—to improve analysis and consider new questions," Rossol said.
The Data Lake welcomes all open energy data, and the team is working on adding more data sets, including the Utility Rate Database, NREL's PV Array data, and the PoroTomo data set from DOE's Geothermal Technologies Office. The Data Lake is intended to be a tool for all of DOE's Office of Energy Efficiency & Renewable Energy, and the long-term dream is to expand it into a government-wide tool that any federal office can use to access and add open energy data.
Leveraging the Power of Cloud Computing
In addition to hosting high-value data sets, the OEDI Data Lake provides access to cloud computing through Amazon Web Services. Users can take advantage of analytic tools and machine learning to directly analyze the data where it lives.
"Cloud computing is one of the biggest promises of the Data Lake that we think will really advance innovation," Rossol said. "Because you're not using your own system's resources, it's possible to spin up a compute instance 500 times larger than your laptop."
Researchers can store their results in the cloud and choose to contribute them back into the data lake for others to build upon, breaking down data silos and allowing for more rounds of collaborative analysis.
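The idea of analyzing data where it lives can be sketched in a few lines. The example below is only an illustration, not the OEDI tooling. The reader function is a hypothetical stand-in for a cloud service that returns one slice of a large data set at a time, and the sample values are made up. The point is that only a small summary, here a mean, needs to leave the place where the data is stored.

```python
from typing import Iterable, Iterator, List


def read_in_chunks(values: List[float], chunk_size: int) -> Iterator[List[float]]:
    """Hypothetical stand-in for a remote reader that yields one slice at a time."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def running_mean(chunks: Iterable[List[float]]) -> float:
    """Compute the mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no data to average")
    return total / count


if __name__ == "__main__":
    # Made-up sample values standing in for a much larger time series.
    sample = [0.0, 120.5, 480.0, 710.25, 650.0, 300.0, 40.0, 0.0]
    print(running_mean(read_in_chunks(sample, chunk_size=3)))
```

The same pattern applies to other summaries, such as sums, counts, or extremes, that can be updated one chunk at a time.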
A History of Open Energy Data
The OEDI Data Lake is the next innovation in a long history of open energy data at NREL.
"Opening up data really just opens up innovation for everybody," said Debbie Brodt-Giles, who manages the Data Analysis Tools & Applications Group at NREL and is leading the development of the Data Lake at NREL. She helped launch the OpenEI website in 2008.
At that time, open energy data for public use was a new concept. The open data repositories that did exist were often hard to navigate. OpenEI allowed people, for the first time, to easily search for specific data sets, click a link to download the data, and upload their own data sets.
OpenEI is currently being revamped into the OEDI Catalog, a single location that will host links to data sets in the cloud or a repository, as well as code examples such as GitHub repositories or Python scripts. More data and resources will become available in the OEDI Catalog over time.
"The OEDI Data Lake doesn't just open up access to data," Brodt-Giles said. "It provides an avenue for future multi-lab, multi-institution research because we have a way to seamlessly collaborate on the same platform. When we bring together the power of cloud computing and the brain power of the energy community, the potential can be limitless."