Transcript of the "Overview of the Transportation Secure Data Center" Webinar
We host the Transportation Secure Data Center (TSDC) here at NREL. The background and rationale for the TSDC are that high-resolution survey data and GPS travel profiles can be really valuable for research purposes, but are sensitive due to privacy concerns because the original survey participants are supposed to be anonymous.
In 2007, a National Research Council report suggested a secure enclave or secure data center as the best approach to resolving this dilemma. Providing a centralized repository would maximize the benefit of the public funding that went into collecting the data while removing the burden of providing the data from the entities that collected the data.
That's what led to the formation of the TSDC. It started up around 2009 with some internal NREL funding and has been supported since then by the Department of Transportation and the Department of Energy (DOE). As a DOE lab, NREL has a history of hosting other similar secure data centers, and we also use this data for energy analysis. That's how we came to be the host of the TSDC.
NREL hosts a variety of data centers related to transportation. The first in the list (the Alternative Fuels Data Center) serves as an information clearinghouse, and the others are secure data centers that have sensitive data in one form or another, starting with the National Fuel Cell Technology Evaluation Center (NFCTEC). While commercial partners and automakers are sensitive about their vehicle and fueling station information, DOE wants to monitor the progress of the technology, so NREL hosts the secure NFCTEC data center for that data.
Other secure data centers include the TSDC, which I'm talking about here, and Fleet DNA, which is analogous to the TSDC but for the commercial vehicle side; there's medium- and heavy-duty drive cycle information from vehicles operating in commercial fleets. In addition to evaluating advanced technology vehicles in those fleets, we are interested in understanding travel behavior and drive cycle aggressiveness and how these affect the energy efficiency of commercial fleets. The last one (FleetDash) is specifically a business intelligence tool for federal fleets.
This slide gives an example of some of the energy-focused uses of GPS driving profiles that include the origin and destination locations as well as specific speed and acceleration data for the drive profiles. We are able to use those, combined with models of conventional vehicles or advanced technology vehicles (such as hybrids or electric vehicles) to look at energy efficiency over various driving profiles.
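To make the idea of running a drive profile through a vehicle model concrete, here is a minimal Python sketch that sums positive tractive energy over a second-by-second speed trace using a standard road-load equation (inertia, aerodynamic drag, rolling resistance). It is not the vehicle model NREL uses; the function name and every parameter value (mass, drag area, rolling resistance coefficient) are illustrative assumptions.

```python
def tractive_energy_kwh(speeds_mps, dt_s=1.0, mass_kg=1500.0,
                        cd_a_m2=0.7, crr=0.009, rho=1.2, g=9.81):
    """Positive tractive energy (kWh) over a speed trace sampled every dt_s seconds.

    All vehicle parameters are hypothetical placeholders, not values from the TSDC.
    """
    energy_j = 0.0
    for v0, v1 in zip(speeds_mps, speeds_mps[1:]):
        v = 0.5 * (v0 + v1)              # mean speed over the interval
        a = (v1 - v0) / dt_s             # mean acceleration over the interval
        rolling = crr * mass_kg * g if v > 0 else 0.0
        force_n = mass_kg * a + 0.5 * rho * cd_a_m2 * v * v + rolling
        power_w = force_n * v
        if power_w > 0:                  # ignore braking intervals (no regeneration)
            energy_j += power_w * dt_s
    return energy_j / 3.6e6              # joules to kWh


steady = tractive_energy_kwh([10.0] * 5)
with_spike = tractive_energy_kwh([10.0, 10.0, 30.0, 10.0, 10.0])
```

In this sketch a single mis-recorded speed sample (`with_spike`) raises the energy estimate well above the steady-speed case, which is why the quality of the underlying speed trace matters for this kind of analysis.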
Here I have a link to a visual for what some of these data look like. This animation shows individual vehicle trajectories over the course of a travel day (just a snapshot over a rush-hour period), and then the video loops. This is a data set of about 1,600 people's daily travel profiles around Atlanta.
Obviously, beyond energy-related uses, there are lots of other uses for these data, and we make the data available for external users to come in. One example is looking at day-to-day destination variation for a data set of households in California's Bay Area, and looking, on a global scale, at the variation in their destinations. This was a wearable sub-sample from this survey. You can see that there's obviously a lot of travel around the Bay Area, California, and the West Coast, but this data set includes national and international destination points for the trips as well. There are lots of other applications, such as looking at travel time variability or day-to-day mode choice variability, particularly with the multi-day versions of the data sets.
So, a little more about the TSDC's operating procedures. Again, our goal is to maintain a balanced focus on privacy protection, first and foremost, while maximizing the usability of these data within those constraints. We have an advisory committee that provides oversight support for our procedure development. The committee, which includes data collectors, providers, and end users, has representation across private industry as well as academia and various areas of the government.
In contrast to some data centers, such as the Census Research Data Center Program, that require users to physically go and work with the sensitive data in a designated secure environment facility, we set up the TSDC as a virtual data center where you don't have to physically travel to work in the secure environment. I'll get into that shortly.
As far as the data archiving, we establish agreements with the original collectors of the data, and then load and store the data on the servers here at NREL, maintaining security and backing up the original data sets. We also do a fair amount of data processing. These data sets were collected from a variety of different studies with similar types of information, but with different naming conventions, structures, etc. And so we put them all into a common organization for handling on our side and making them available.
We remove anything explicitly identifiable (if the original data providers fail to do so before sending the data to us), and then we do quality control, particularly with GPS drive cycle trajectories. If we are going to use these drive cycle trajectories for energy analyses, a spurious or mis-recorded acceleration could have a big impact. So, we developed procedures for filtering out errant points in those drive cycles.
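As a rough illustration of filtering errant points out of a drive cycle, the following Python sketch flags samples whose implied acceleration, measured from the last accepted sample, exceeds a limit. This is not the published NREL filtration method; the function names and the 5 m/s² threshold are assumptions chosen only for the example.

```python
from typing import List

# Hypothetical limit; the transcript does not state the threshold actually used.
MAX_ACCEL_MPS2 = 5.0


def flag_spurious_points(speeds_mps: List[float], dt_s: float = 1.0,
                         max_accel: float = MAX_ACCEL_MPS2) -> List[bool]:
    """Return one flag per sample; True marks an implausible jump in speed.

    Each sample is compared with the most recent accepted sample, so a single
    bad point does not cause the good point after it to be flagged as well.
    The first sample is assumed to be valid.
    """
    flags = [False] * len(speeds_mps)
    last_good = 0  # index of the most recent accepted sample
    for i in range(1, len(speeds_mps)):
        elapsed_s = (i - last_good) * dt_s
        accel = (speeds_mps[i] - speeds_mps[last_good]) / elapsed_s
        if abs(accel) > max_accel:
            flags[i] = True
        else:
            last_good = i
    return flags


def drop_flagged(speeds_mps: List[float], flags: List[bool]) -> List[float]:
    """Keep only the samples that were not flagged."""
    return [v for v, bad in zip(speeds_mps, flags) if not bad]


trace = [10.0, 10.5, 11.0, 30.0, 11.5, 12.0]
flags = flag_spurious_points(trace)
cleaned = drop_flagged(trace, flags)
```

Dropping a point leaves a gap in the one-second spacing, so a fuller treatment would probably interpolate across removed samples rather than simply deleting them.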
We also add reference information; a couple of these images show typical temperature information across the country and across a year. We add commercial road grade data sets, and have developed techniques for appending road grade from USGS digital elevation model data, using filtering routines that we developed and calibrated against some of the commercial road grade data products. And then, for those surveys that include demographic information, we add information on land use and so forth.
This shows a snapshot of a paper where we summarized the GPS data filtration method used to filter out points of spurious acceleration or data dropout. This also shows an illustration of the matching we do to the underlying road network. Between the data filtering and the logic on the connectivity between links, we're able to do a pretty decent job of attaching data points to the road network, even in situations with complex interchanges and overlapping links on the underlying road map.
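For readers unfamiliar with deriving grade from elevation, here is a minimal Python sketch: smooth the elevation samples along a route, then divide rise by run between consecutive points. The moving-average filter, the window size, and the function names are assumptions made for illustration; the transcript does not describe the actual filtering routines.

```python
def moving_average(values, window=3):
    """Centered moving average that shrinks the window at the ends of the list."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def road_grade(elev_m, dist_m, window=3):
    """Grade (rise over run) between consecutive points along a route.

    elev_m: elevation at each point, in meters
    dist_m: cumulative distance along the route at each point, in meters
    """
    smooth = moving_average(elev_m, window)
    grades = []
    for i in range(1, len(smooth)):
        run = dist_m[i] - dist_m[i - 1]
        rise = smooth[i] - smooth[i - 1]
        grades.append(rise / run if run > 0 else 0.0)
    return grades


# A steady 1% climb sampled every 100 m; window=1 disables smoothing.
grades = road_grade([100.0, 101.0, 102.0, 103.0], [0.0, 100.0, 200.0, 300.0], window=1)
```

Smoothing before differencing matters because small elevation errors become large grade errors when the distance between points is short.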
We established two distinct methods for data access. The first is a public website with an area for downloading cleansed data sets for people interested in aspects of the data that don't require sensitive location information. These data include high-level information on total driving distances as well as second-by-second time and speed information, without the latitude and longitude locations associated with the data points. As mentioned previously, such data even without the precise location information can be useful for many applications, such as drive cycle-based energy analysis. Other identifying details that our advisory committee recommended removing include vehicle model, which could be a little too precise with some types of vehicles.
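The cleansing step described above can be pictured with a small Python sketch that strips location and vehicle-model fields from each record while keeping time and speed. The field names are hypothetical; the actual TSDC schema is not given in this transcript.

```python
# Hypothetical field names; the real TSDC column names are not stated here.
SENSITIVE_FIELDS = {"latitude", "longitude", "vehicle_model"}


def cleanse(records):
    """Return copies of the records with the sensitive fields removed."""
    return [
        {key: value for key, value in record.items() if key not in SENSITIVE_FIELDS}
        for record in records
    ]


raw = [
    {"time_s": 0, "speed_mps": 0.0, "latitude": 33.75, "longitude": -84.39,
     "vehicle_model": "example"},
    {"time_s": 1, "speed_mps": 1.2, "latitude": 33.75, "longitude": -84.39,
     "vehicle_model": "example"},
]
public = cleanse(raw)
```

The original records are left unchanged, which mirrors keeping the detailed data in the secure environment while publishing only the reduced copy.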
To access the cleansed data sets, we require users to fill out a point-and-click user registration form that includes an agreement where the user pledges not to attempt to identify individuals (even though, in principle, it shouldn't be possible to do so from these data sets).
As I alluded to before, we've established what we call our secure portal environment for working with the more detailed and spatial data for which there is a higher level of sensitivity. We modeled the secure portal environment after secure data centers where you go in and work with the data there. But in this case, it's a virtual connection so you can connect to a virtual machine here at NREL that hosts this environment. The application packet for accessing this area is posted on our website (www.nrel.gov/tsdc). The application packet includes a data use and disclaimer agreement, developed with our legal office here at NREL to include data protection legal language and an explicit pledge not to identify any individuals from the data set. This must be signed prior to gaining access. Each individual user who would like an account in the secure environment needs to go through this process, which necessitates a signature from the applicant as well as the applicant's supervisor.
We also ask for an analysis description documenting the proposed analysis and specifically why accessing the data in the secure portal environment is necessary to complete that analysis (and why the cleansed data sets or other available data sets are insufficient to conduct the analysis). The application packet also includes a form that must be filled out before using NREL's cyber resources.
After the applicant fills out and submits the required forms, the advisory group (and in some cases, specific data providers) will review the application materials and provide a recommendation on data access.
Once an individual is granted access to the environment, they're prohibited from transferring data in or out: they can't share data via their local clipboard or local drive access, and internet access is disabled. We provide software packages within the environment for users to interact with the data. As part of the analysis description document, we ask for a description of anything that the user would like to remove in the way of aggregated results. Then, after the user completes the analysis and generates the aggregated results, we review them for consistency with the approved application, confirm that they cannot be used to identify individuals, and send those aggregated data tables or images to the user.
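The transcript does not say how aggregated results are reviewed. One common disclosure-control heuristic, shown here purely as an assumption and not as the TSDC's procedure, is to flag any group in an output table that is built from too few underlying records. The threshold and function name below are hypothetical.

```python
from collections import Counter

# Hypothetical minimum group size; not a stated TSDC rule.
MIN_CELL_COUNT = 10


def small_cells(group_labels, min_count=MIN_CELL_COUNT):
    """Return the groups (and their sizes) that fall below the minimum count.

    group_labels holds one label per underlying record, for example the zone
    or household category that the record would be aggregated into.
    """
    counts = Counter(group_labels)
    return {label: n for label, n in counts.items() if n < min_count}


labels = ["zone_a"] * 12 + ["zone_b"] * 3
needs_review = small_cells(labels)
```

A table with an empty `needs_review` result would pass this particular check, though a human review of the kind described above would still be needed.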
This slide provides a snapshot of some of the examples of data sets hosted in the TSDC. A lot of these are GPS add-ons to larger metropolitan planning organizations' travel survey efforts, where the GPS subsample helps them get at trip underreporting (in addition to providing the aforementioned benefits of detailed GPS data collection). One of the bigger surveys we have is from the California State-Wide Household Travel Survey. The full sample includes roughly 43,000 households, with geocoded trip ends.
Again, for more information on the TSDC, I'd welcome anyone to visit the website where you can learn more about the project. We have fact sheets and publications that go into more details on the TSDC. You can also sign up to receive e-mail updates about the TSDC. My e-mail address is posted there for potential project partners. We also have a dedicated TSDC@NREL.gov e-mail address for people with specific data questions or user-support issues.
Questions and Answers
Question: What are the options if the researcher wants to use custom code or software to perform work in the secure portal environment?
If users have code or a script they've developed outside of the environment that they would like to use inside the environment, they can just send it to us, and we can load it into their individual workspace within the environment. The environment includes software tools for querying the database and GIS software packages, including one commercial package (ArcGIS). We also have QGIS and Python GIS linkages that work in the environment. Because we're making this environment openly available, we are limited as to the commercial software packages we can provide; so again, ArcGIS is the one commercial package. But we also offer statistical analysis software (such as R) and various reference maps from free and open sources. Hopefully, those are sufficient for meeting our users' needs; however, if there are other open source software packages that a user would like to use in the environment, we can usually add them.
Question: What are reasons that an application to the secure portal environment might get declined?
Typically the only time this occurs is when someone proposes to conduct analysis that they did not realize could be conducted with the data posted on the public site. Or, the applicant is requesting to take out data that could (on its own or in combination with data from the public download site) be used to identify an individual. In most cases, applicants are able to revise their research proposal so that, for instance, the data they take out is sufficiently aggregated to alleviate such concerns.