Transcript of the "Transportation Secure Data Center: Project and Procedures Overview" Webinar
This is a text version of the webinar—titled "Transportation Secure Data Center: Project and Procedures Overview"—that was originally presented by Jeff Gonder of the National Renewable Energy Laboratory (NREL) in September 2012.
Hello, my name is Jeff Gonder, and I'm the project lead for the Transportation Secure Data Center, or TSDC. In this webinar, I will give an overview of the TSDC project and of the procedures that were developed to operate the TSDC.
Note that the website is given here (www.nrel.gov/vehiclesandfuels/secure_transportation_data.html), and it will also be given at the end of the presentation along with my contact information. The TSDC is hosted by the National Renewable Energy Laboratory's Center for Transportation Technologies and Systems.
I'd like to begin this webinar by making two acknowledgements. The first is to Evan Burton, a database and GIS analyst who supports the TSDC and helped put this webinar together. I would also like to acknowledge Elaine Marconi of the Department of Transportation, whose Office of Planning in the Federal Highway Administration has provided the majority of the support for this project.
This webinar is divided into two sections. In the first half of the webinar, I'll give a general project summary and background information along with example uses of the data sets available in the TSDC. In the second half, I'll give details on the TSDC's procedures for data archiving, processing, and secure access along with the background rationale that led to the creation of these procedures.
I'll begin with a high-level summary of the Transportation Secure Data Center, which is that the TSDC serves as a secure area for archiving and accessing detailed transportation data. A prime example of such detailed data is GPS travel profiles collected from transportation surveys or studies. The detailed GPS location information, particularly when it's combined with other demographic details from the survey or study, can allow someone to identify an individual from the data set who is supposed to remain anonymous; because of this, access to this type of data must be controlled.
Motivation for creation of the TSDC stemmed originally from the fact that more and more travel surveys and studies were including GPS as part of their data collection, and the technology has become more accurate and less expensive. Even so, the data collection can cost a significant amount of money, so it made sense to maximize the use of the data that was collected.
The high spatial and temporal resolution of the collected GPS data makes it very valuable for both traditional and non-traditional applications. The idea of a secure central repository to maximize research returns, while safeguarding participants' privacy, was championed several years ago in a National Research Council report. This approach benefits data providers by giving them a secure place to archive their data and also by relieving them from having to respond to data sharing requests from individuals.
Data users benefit by having access to data that may not otherwise be available and by having a central location through which they can go look for data.
As mentioned earlier, the TSDC is operated by NREL's Transportation Center, and most of the support for it to date has come from the Federal Highway Office of Planning. There has been additional cost-sharing support provided by the Federal Highway Office of Operations, NREL internal funding, and the Department of Energy's Vehicle Technologies Program.
With respect to stakeholders who either provide data for archiving or access data from the TSDC, the table summarizes many of the applications and types of organizations interested in the data. These include traditional areas, such as travel modeling and transit planning by metropolitan planning organizations or state DOTs, as well as emerging applications, such as alternative vehicle energy analysis by research laboratories and auto manufacturers.
The overall approach to setting up and operating the TSDC has been guided by two goals: first and foremost, to maintain the anonymity of individual participants in the archived data sets, and secondly, to maximize the research usability of the data within the privacy-protection constraints.
To help support oversight, we established from the outset an advisory committee including both data providers and users from private industry, academia, and government. Input from this group helped guide development of the operating procedures that I will discuss in more detail in the second half of this webinar.
One key element of the TSDC structure is its division into three distinct areas. The first area is a secure enclave for raw data in which we store and process the data and allow no external access. In the second public-download area, we post cleansed versions of the data sets, which are freely available for download. The final controlled-access area is restricted to users who complete an application process to work with versions of the data that include spatial details.
This area is accessible through remote web connections, but users are restricted to operating within the provided environment and cannot remove data. I will give more details on each of these areas in the next section.
Many researchers have already started benefiting from the data provided through the TSDC, and the references (shown on the screen) list several publications from the first half of this year that we know have drawn on TSDC data.
As mentioned a couple of slides ago, the data is useful for a large number of potential applications. I will briefly highlight two uses of this data from the publication list that fall into the non-traditional category of vehicle energy analysis. Since these are also the two examples from the list in which I was involved, I'm able to share an excerpt from each paper.
The first study that I'll mention looked at the influence of driving style on vehicle fuel use in order to inform feedback techniques that could help teach drivers how to improve their fuel economy. The TSDC data that supported this study came from a Texas Department of Transportation GPS-enhanced household travel survey that was conducted in Austin and San Antonio and included about 800 vehicles in the GPS component.
The second-by-second nature of this data allowed us to extract details about the vehicles' acceleration rates and speeds and to predict their fuel use with the aid of a vehicle model. In the most fuel-efficient trips, drivers maintained speeds between 30 and 60 miles per hour and avoided hard acceleration.
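As a minimal sketch of the kind of processing described here, the following Python derives acceleration from a 1 Hz speed trace by differencing consecutive samples and counts the seconds of hard acceleration. The function names, the sample trace, and the 3 mph/s threshold are illustrative assumptions, not values or code from the study, and the vehicle model used for fuel prediction is not represented.

```python
def accelerations_mph_per_s(speeds_mph):
    """Finite-difference acceleration (mph/s) from a 1 Hz speed trace."""
    return [later - earlier for earlier, later in zip(speeds_mph, speeds_mph[1:])]


def summarize_trip(speeds_mph, hard_accel_mph_per_s=3.0):
    """Return the mean speed and the number of hard-acceleration seconds.

    hard_accel_mph_per_s is an illustrative threshold, not one from the study.
    """
    accels = accelerations_mph_per_s(speeds_mph)
    hard_seconds = sum(1 for a in accels if a > hard_accel_mph_per_s)
    mean_speed = sum(speeds_mph) / len(speeds_mph)
    return {"mean_speed_mph": mean_speed, "hard_accel_seconds": hard_seconds}


# Made-up ten-second speed trace (mph), one sample per second.
trip = [0.0, 4.0, 9.0, 15.0, 20.0, 24.0, 27.0, 29.0, 30.0, 30.0]
summary = summarize_trip(trip)
print(summary)
```

Summaries like these could then be compared across trips to see which speed ranges and acceleration patterns go with lower predicted fuel use.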
The next study that I'll mention examined the potential influence of driving conditions on batteries in plug-in electric vehicles. This effort used the TSDC data set provided by the Puget Sound Regional Council from that organization's study of the influence of hypothetical road tolling on driver behavior.
This effort, known as the Traffic Choices Study or TCS, collected GPS data from over 400 vehicles for an extended period of time and included a three-month control period of baseline data collection. Because the battery wear rate in electric vehicles depends strongly on how far vehicles are driven from day to day, it's helpful to examine the variability of driving distances over time from such a longitudinal data set.
Figure six (from this paper excerpt) shows the driving distance distribution from three TCS longitudinal profiles as compared with the cross-section taken from the single-day National Household Travel Survey. That concludes the first section of this webinar.
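To make the idea of day-to-day variability concrete, here is a small sketch that totals one vehicle's trip distances per day and reports the mean and standard deviation of the daily totals. The record layout, day labels, and mileages are invented for illustration and are not TCS data.

```python
from collections import defaultdict
from statistics import mean, pstdev


def daily_distances(trips):
    """Sum trip distances per day.

    trips: iterable of (day_label, miles) pairs for a single vehicle.
    """
    totals = defaultdict(float)
    for day, miles in trips:
        totals[day] += miles
    return dict(totals)


# Made-up trips for one vehicle over three days.
trips = [
    ("day-1", 12.0), ("day-1", 8.0),
    ("day-2", 45.0),
    ("day-3", 5.0), ("day-3", 10.0),
]
per_day = daily_distances(trips)
values = list(per_day.values())
print(per_day)
print("mean:", mean(values), "std dev:", pstdev(values))
```

A single-day survey would capture only one of these daily totals per vehicle, which is why a longitudinal profile is more informative about variability.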
In the second section, I will review the procedures used in the TSDC and how those procedures were established.
As background for this section on procedures development for the TSDC, I wanted to mention that the TSDC is one of several transportation data centers hosted at NREL. These are the Alternative Fuels and Advanced Vehicles Data Center, the Hydrogen Secure Data Center, and the Commercial Fleet Data Center. The Hydrogen Secure Data Center was used as the initial model for several of the procedures for the TSDC because it had experience with protecting sensitive data on hydrogen and fuel cell vehicles and disseminating the data in an aggregated format. The Commercial Fleet Data Center deals with GPS data much as the TSDC does, but focuses on applications for commercial vehicles.
The data archiving and storing procedures were taken directly from those in place for the Hydrogen Secure Data Center. They begin with establishing an agreement with the data provider, after which the data is received at NREL and loaded onto a secure server. The server is housed in a building secured by badge access and an on-site security force, and access to the specific room where the data is stored is limited to data center staff. Data backups are then maintained both digitally and with a regular tape backup stored in a separate location to protect the data in the event of fire or other disaster.
Once the data is loaded on the TSDC server, NREL performs processing to standardize the formatting, to remove explicitly identifying information that may have been inadvertently left in when the provider provided the data, and to perform quality control to clean up points that may be errant from the GPS recordings.
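The specific quality-control rules are not spelled out in this overview. One common approach, sketched below purely as an assumption, is to drop any GPS point whose implied speed from the last accepted point is physically implausible. The 70 m/s limit, the function names, and the coordinates are hypothetical, and the sketch assumes the first point of each trace is valid.

```python
import math

EARTH_RADIUS_M = 6371000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def drop_errant_points(points, max_speed_mps=70.0):
    """Keep points whose implied speed from the last kept point is plausible.

    points: list of (t_seconds, lat, lon) tuples sorted by time.
    max_speed_mps is an illustrative limit, not a TSDC parameter.
    """
    if not points:
        return []
    kept = [points[0]]
    for t, lat, lon in points[1:]:
        t0, lat0, lon0 = kept[-1]
        dt = t - t0
        if dt <= 0:
            continue  # duplicate or out-of-order timestamp
        if haversine_m(lat0, lon0, lat, lon) / dt <= max_speed_mps:
            kept.append((t, lat, lon))
    return kept


# Made-up trace: the third point jumps several kilometers in one second.
raw = [
    (0, 30.0000, -97.0000),
    (1, 30.0001, -97.0000),
    (2, 30.0500, -97.0000),
    (3, 30.0003, -97.0000),
]
cleaned = drop_errant_points(raw)
print(cleaned)
```

Keeping both the raw and the filtered trace lets a user apply a different correction later.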
The data is packaged to support analyses at different levels, providing aggregated data and also summarizing the data broken out by household, by vehicle, and by time. The data is maintained in both uncorrected and corrected formats in case the user wants to provide their own correction procedures.
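A rough sketch of what summaries "by household, by vehicle, and by time" could look like, assuming each trip is a record with hypothetical `household`, `vehicle`, `hour`, and `miles` fields. The field names and values are invented and do not reflect the TSDC file layout.

```python
from collections import defaultdict


def summarize(trips, key):
    """Trip count and total miles grouped by one field of each trip record."""
    out = defaultdict(lambda: {"trips": 0, "miles": 0.0})
    for trip in trips:
        group = out[trip[key]]
        group["trips"] += 1
        group["miles"] += trip["miles"]
    return dict(out)


# Made-up trip records; field names are assumptions for illustration.
trips = [
    {"household": "H1", "vehicle": "H1-V1", "hour": 8, "miles": 6.5},
    {"household": "H1", "vehicle": "H1-V2", "hour": 8, "miles": 3.0},
    {"household": "H1", "vehicle": "H1-V1", "hour": 17, "miles": 7.0},
    {"household": "H2", "vehicle": "H2-V1", "hour": 17, "miles": 12.5},
]
by_household = summarize(trips, "household")
by_vehicle = summarize(trips, "vehicle")
by_hour = summarize(trips, "hour")
print(by_household)
print(by_vehicle)
print(by_hour)
```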
In addition to performing processing on the data as it comes into the TSDC, the NREL team adds reference data that can be helpful for those making use of the data sets. This could include high-level information, such as the context of the original study, which is important to understand before attempting any secondary use of the data.
It could also include vehicle details, such as vehicle class and fuel economy, and, in the controlled-access area, spatial-reference information such as demographic and land-use information available from UrbanSIM, Census 2010, or the American Community Survey, or USGS elevation information, which can be helpful for incorporating road grade into the data.
In developing the TSDC's procedures for accessing the data, we referenced best-practice examples from the other NREL data centers, as well as from data centers outside of NREL. One useful example came from the National Household Travel Survey, or NHTS, process for granting access to enhanced information in the NHTS DOT file.
This procedure was informative, because the enhanced details of the NHTS DOT file contain geographic information that could be used to identify a participant, which is the same basic concern with the TSDC data.
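As a simple illustration of how elevation can be turned into road grade, the sketch below computes percent grade for each segment as rise over run between consecutive samples. The input layout and sample values are assumptions. A real workflow would first look up elevation for each GPS point and would likely smooth the result.

```python
def road_grade_percent(distances_m, elevations_m):
    """Grade (%) for each segment between consecutive samples.

    distances_m: cumulative horizontal distance along the route, in meters.
    elevations_m: elevation at each sample, in meters.
    Segments with no forward movement get None instead of a grade.
    """
    grades = []
    for i in range(1, len(distances_m)):
        run = distances_m[i] - distances_m[i - 1]
        rise = elevations_m[i] - elevations_m[i - 1]
        grades.append(None if run <= 0 else 100.0 * rise / run)
    return grades


# Made-up route: the vehicle is stationary between the third and fourth samples.
distances = [0.0, 100.0, 200.0, 200.0, 300.0]
elevations = [1600.0, 1602.0, 1601.0, 1601.0, 1606.0]
grades = road_grade_percent(distances, elevations)
print(grades)
```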
With input from the TSDC advisory group as well as the models of the NHTS methodology and other data centers, we decided to create two distinct levels of data access, as mentioned earlier: a public area where a cleansed version of the data set is freely available, and a controlled-access area with higher levels of data detail. This structure limits the number of accounts in the controlled-access area to those with a legitimate need to work with a higher level of data detail. It also provides broad access to the lower-detail data, so researchers can conduct a variety of analyses with data that doesn't need the same level of protection as that in the controlled-access area.
The steps taken to provide a cleansed version of the data in the public-download area include removing latitude and longitude from the point data so that no precise spatial information can be connected with the individual profiles. We also remove other potentially identifying details such as vehicle model in cases where rare vehicle models might enable identifying a participant.
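The cleansing steps described above could be sketched roughly as follows: drop the coordinate fields from every point, and blank out the vehicle model whenever too few distinct vehicles share it. The field names, the two-vehicle threshold, and the sample records are hypothetical. The actual TSDC rules for what counts as a rare model are not stated here.

```python
from collections import defaultdict


def cleanse_for_public_release(points, min_vehicles_per_model=2):
    """Return copies of the point records without coordinates or rare models.

    points: list of dicts with (assumed) keys vehicle_id, vehicle_model, lat, lon.
    min_vehicles_per_model is an illustrative threshold, not a TSDC rule.
    """
    vehicles_by_model = defaultdict(set)
    for p in points:
        vehicles_by_model[p["vehicle_model"]].add(p["vehicle_id"])

    cleansed = []
    for p in points:
        public = {k: v for k, v in p.items() if k not in ("lat", "lon")}
        if len(vehicles_by_model[p["vehicle_model"]]) < min_vehicles_per_model:
            public["vehicle_model"] = None  # suppress a potentially identifying model
        cleansed.append(public)
    return cleansed


# Made-up records: "Model B" is driven by only one vehicle.
points = [
    {"vehicle_id": "V1", "vehicle_model": "Model A", "speed_mph": 30.0, "lat": 30.1, "lon": -97.1},
    {"vehicle_id": "V2", "vehicle_model": "Model A", "speed_mph": 42.0, "lat": 30.2, "lon": -97.2},
    {"vehicle_id": "V3", "vehicle_model": "Model B", "speed_mph": 55.0, "lat": 30.3, "lon": -97.3},
]
public_points = cleanse_for_public_release(points)
print(public_points)
```

The function builds new records rather than editing the originals, so the detailed version stays available for the controlled-access area.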
We add back in supplemental information—such as vehicle class and fuel economy—that can be important for a variety of different analyses, and then we have a simple point-and-click user registration and usage agreement before anyone can access even that cleansed public-download data.
As mentioned on the previous slide, the NHTS procedures for accessing the NHTS DOT file provided a useful example due to the similar level of data sensitivity. In collaboration with our advisory group, we sought to establish procedures that at least met those established by the NHTS, and the table on the following two slides will highlight the specific procedures that we put in place.
The first protocol for gaining access to the more detailed TSDC data, which is similar to the analogous NHTS example, is to complete a data use and disclaimer agreement for accessing the data. This includes confidential data protection legal language and an explicit pledge not to attempt to identify an individual participant from the data set.
Going beyond the requirements of the NHTS analog, only a single user is granted access for each application, and the application requires a signature from the applicant and also from either a university advisor or a line manager. The next requirement, which is again similar to the NHTS point of reference, is for the user to provide a document describing the analysis that they would like to conduct and explaining why they need to access data in the TSDC controlled-access area versus completing the analysis with a different data set.
The last line on this comparison table lists a "condition of use for cyber resources" form; this form is unique to NREL. It's a requirement by our Cyber Security Office in order to gain access to any NREL cyber resources.
The final two protocols summarized on this slide describe protections above and beyond those of our reference point. The first is that our advisory group reviews the application and provides their recommendation whether or not to grant the applicant access based on their description of the analysis required and the required signatures on the various agreements.
An example of why an applicant might not be allowed access is if it's determined that the analysis they want to conduct, or a similar analysis, could be completed with data outside of the controlled-access area. The final section covers technical controls put in place in the environment for working with the data. As mentioned earlier, this environment is accessible through a remote web terminal and provides a dedicated environment in which users can work with the data.
Within this environment, data transfer back to the user's home network or computer is prohibited and the user is restricted to using software packages that are provided within the environment.
We do allow users to bring externally developed code into the environment, though NREL audits any such code. Likewise, if the user has other software packages that they would like to provide for their own use in the environment, we can work with them on that.
Similarly, once a user has completed their analysis and saved their results, as long as they are in an aggregated form, they can request that NREL send the data out of the secure area to them. NREL will audit this data and provide it back to the user via FTP or email based on the size of the data.
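The criteria NREL applies when auditing results are not described in this overview. Purely as a hypothetical illustration of what an automated pre-check on an aggregated table might look like, the sketch below flags rows that carry record-level columns or that summarize very few vehicles. The column names, the minimum group size of 5, and the sample numbers are all assumptions.

```python
# Hypothetical column names that would indicate record-level rather than aggregated data.
FORBIDDEN_COLUMNS = {"lat", "lon", "latitude", "longitude", "vehicle_id", "household_id"}


def audit_aggregated_table(rows, count_column="n_vehicles", min_count=5):
    """Return a list of human-readable problems; an empty list means no flags.

    min_count is an illustrative minimum group size, not an NREL rule.
    """
    problems = []
    for i, row in enumerate(rows):
        leaked = FORBIDDEN_COLUMNS & set(row)
        if leaked:
            problems.append(f"row {i}: contains record-level columns {sorted(leaked)}")
        if row.get(count_column, 0) < min_count:
            problems.append(f"row {i}: fewer than {min_count} vehicles in group")
    return problems


# Made-up results table with invented numbers.
results = [
    {"speed_bin_mph": "30-40", "n_vehicles": 120, "mean_mpg": 31.2},
    {"speed_bin_mph": "70-80", "n_vehicles": 2, "mean_mpg": 24.8},
]
problems = audit_aggregated_table(results)
print(problems)
```

A check like this would only support, not replace, a manual review of what leaves the secure area.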
This mix of restricted access, legal agreements, and technical controls is designed to meet our overall stated goal of protecting the data while still creating an environment where legitimate research can be conducted.
This concludes the introductory webinar on the background and overview of the TSDC as well as its operating procedures. I invite you to visit our website where you can read more about the project, download the cleansed public data that we have available there, or read our fact sheet or paper that we have published on the data center. You can also sign up to receive email updates when we have new data posted or other pertinent announcements.
You may also feel free to email me directly at Jeff.Gonder@NREL.gov if you have further questions or if you're interested in contributing data, in learning about ways to support the TSDC, or if you'd like to apply for access to the detailed spatial data through our controlled-access environment.
Thank you very much for your interest.