Transcript of the "Transportation Secure Data Center Demonstration" Webinar
This is a text version of the webinar—titled "Transportation Secure Data Center Demonstration"—that was originally presented by Jeff Gonder and Evan Burton of the National Renewable Energy Laboratory (NREL) in March 2013.
My name is Jeff Gonder. I am a senior engineer at NREL and the project lead for the Transportation Secure Data Center project.
On our website, there are a couple other things I wanted to point out. We recorded a webinar that has been posted for a few months now that gives a more detailed overview of the Transportation Secure Data Center and the operating procedures we use. The focus of the webinar today is really more of a demonstration. I did all the talking for the current one that's posted on the overview of operating procedures, so I'm going to hand it over to Evan Burton, who is a data analyst and database manager for the Transportation Secure Data Center project at NREL. He will say a couple of slides of introduction and then dive into the demonstration part.
For our agenda today, we'll go through a couple slides of general introduction, then Evan will dive into more detailed content and we'll go into some demonstration videos.
Thank you, Jeff. I'm Evan Burton. I do the data management and a lot of the development for the TSDC. I will walk through the introduction and procedures and will give you a brief overview of the processing we do for the data and then dive into the videos.
So to begin, the Transportation Secure Data Center's goal is to securely archive and provide public access to detailed transportation data. The detailed transportation data usually consists of a survey of some sort with a GPS component. The GPS component allows us to do vehicle analysis on the data sets, as well as a number of other things.
Most data sets like this aren't publicly available due to the expense to collect, the difficulty in interpreting the raw data, as well as the privacy issues associated with locating a person over periods of time. So what we've tried to do is set up a structure that allows us to provide this data to the public while maintaining the privacy of the participants.
So, a little bit of background. Jeff and I are both from NREL. Most of the data center operates at NREL, while most of the funding comes from the Federal Highway Administration (FHWA), specifically the Office of Planning. Along with oversight from FHWA, DOE, and NREL, we have included representatives from industry, academia, and government to help define our procedures and to ensure that we're not violating privacy or going beyond the bounds of what data we can provide.
There are a number of example uses for this data—any type of transportation research has a broad spectrum of applications, and this can be applied in some way to the majority of them. So for example, you can do transit-demand modeling with this. You can do evacuation planning with this. You can analyze the spread of disease with this kind of data. You can do any other analysis looking at movement and where people are with the data we house.
The security procedures we have in place for the data that we're working with—first, we establish a memorandum of understanding (MOU) with the provider of the GPS data set. Once we've established that MOU, they transfer their data sets to us in whatever format they have them in. We convert them to our format and transfer them to our master server, which is in our main building. The server room, which you can see in the picture at the bottom right, is room-key access only. So you have to have a security badge to get in, and an escort is required.
So we store our full data sets and do all of our processing on these servers, and then transfer them out to the different areas where we provide the data for public access. In this area we do all back-ups to ensure we don't lose any of the data sets that are provided to us.
Once we transfer all the data onto the servers, we go through a data-processing routine that we've been developing over time and it's slowly becoming more mature. You'll begin to see that in the data sets we put out more recently.
So what we do is divide the processing routine into two different steps. The first step is a drive cycle processing routine, which has been developed over the years at NREL for all of our vehicle-based data. We feed the data through the drive cycle processing; it filters it, breaks it up into trips, micro trips, and days, and then spits out calculations. To handle the additional data sets that come with most of the survey data we are provided, we have set up additional processing routines that handle vehicle configuration information, the original data set's trip-identification tables, and, in some cases, trip tables for the wearable devices that accompany a study.
So to get through drive cycle processing—or to get through all the processing steps—we ask six basic questions of the study. The most important question is: Is the data sampled at 0.25 hertz or above? That is, is a sample collected at least every four seconds? If it's below that, we still feed it through the additional processing, but the data cannot complete our drive cycle processing, so we don't generate any sequencing or calculation results on our side for it.
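To make the threshold concrete, here is a minimal sketch of that sample-rate screen—a data set qualifies for drive cycle processing only if its typical spacing between GPS samples is four seconds or less (0.25 Hz or faster). The function name and the use of the median are illustrative assumptions, not NREL's actual implementation.

```python
# Hypothetical sketch of the 0.25 Hz sample-rate screen described above.
# A study qualifies for drive cycle processing only if samples arrive at
# least every four seconds.

def meets_sample_rate(timestamps_s, max_gap_s=4.0):
    """Return True if the typical spacing between samples is <= max_gap_s."""
    if len(timestamps_s) < 2:
        return False
    gaps = [b - a for a, b in zip(timestamps_s, timestamps_s[1:])]
    gaps.sort()
    median_gap = gaps[len(gaps) // 2]  # median spacing is robust to dropouts
    return median_gap <= max_gap_s

# A 1 Hz trace qualifies; a 10-second trace does not.
print(meets_sample_rate([0, 1, 2, 3, 4]))   # True
print(meets_sample_rate([0, 10, 20, 30]))   # False
```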
Most of the additional questions deal with the additional portions of the data sets provided to us. So we ask: Is there additional data left? If the answer is yes, we move on and say: Is vehicle configuration indicated? If so, we go through that processing and move on to trips, pass the raw data through unfiltered trip processing, and move on to the wearable portions. I'll go more into detail about this in a second.
The drive cycle processing begins with point filtration. The raw point data we're getting is GPS based, so the speed, while usable, has some noise. We apply a filtration method that identifies outlying accelerations and speeds and also identifies where there are gaps in recording relative to the sample rate. So we interpret all that, get rid of the noise, and generate a pretty smooth speed profile for the vehicle, which you can see in the graph below.
In this process we flag anything that we change and we also convert the lat/long and data provided with the GPS sample to a geometric point representation, allowing us to map the data. After it goes through filtration, we move to sequencing.
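As a rough illustration of that filtration idea—and only an illustration, not NREL's actual filter—the sketch below flags samples whose implied acceleration is physically implausible, replaces them by interpolating between neighbors, and records a flag for every point it changed. The 15 mph/s threshold is an assumed value.

```python
# A minimal sketch (not the actual NREL filter) of GPS speed filtration:
# flag points whose implied acceleration is implausible and smooth them
# by interpolating between neighbors. Threshold is an assumption.

def filter_speeds(speeds_mph, dt_s=1.0, max_accel_mph_per_s=15.0):
    """Return (filtered_speeds, flags); flags mark points that were changed."""
    out = list(speeds_mph)
    flags = [False] * len(out)
    for i in range(1, len(out) - 1):
        accel = abs(out[i] - out[i - 1]) / dt_s
        if accel > max_accel_mph_per_s:
            out[i] = (out[i - 1] + out[i + 1]) / 2.0  # smooth over the outlier
            flags[i] = True
    return out, flags

# The 90 mph spike in an otherwise ~30 mph trace is treated as noise.
speeds, flags = filter_speeds([30, 31, 90, 33, 34])
print(speeds)  # [30, 31, 32.0, 33, 34]
print(flags)   # [False, False, True, False, False]
```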
Sequencing refers to the order in which events occur, or how the vehicle behaved over time. Our sequence definitions are: a micro trip, which is when a vehicle goes from 0 miles per hour to 0 miles per hour (the consecutive zeros that follow a stop are appended to the beginning of the next micro trip); a trip, which ends any time no data is collected for three minutes; and a day, which runs from 4:00 a.m. to 4:00 a.m. to account for people driving at midnight.
A hierarchy exists between the sequences. If you look over to the right, there are several graphs. Each speed profile is given a color and broken into discrete categories by day, trip, and micro trip. You start with the day, and you see it is all one color; you have a full vehicle day. In the trips for this day, there are five separate trips illustrated by five separate colors. You move all the way down to micro trips, and you see a wide array of colors, meaning that there are multiple micro trips within a trip. But a micro trip can also be a single trip, and furthermore a trip can be a vehicle day. So potentially a single micro trip could be a single vehicle day—it's very unlikely, but it occurs.
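The micro-trip definition above—zero speed to zero speed, with trailing zeros attached to the start of the next micro trip—can be sketched as follows. This is a simplification of the actual TSDC sequencing logic, using a plain speed list rather than timestamped GPS records.

```python
# Illustrative sketch of micro-trip sequencing: a micro trip runs from one
# zero-speed sample to the next, and consecutive zeros after a stop begin
# the following micro trip. Simplified relative to the real TSDC routine.

def split_micro_trips(speeds):
    """Split a speed trace (mph values) into micro trips (lists of speeds)."""
    trips, current = [], []
    moving_seen = False
    for v in speeds:
        current.append(v)
        if v > 0:
            moving_seen = True
        elif moving_seen:  # back at 0 mph after moving: close the micro trip
            trips.append(current)
            current, moving_seen = [], False
    if current:
        trips.append(current)  # trailing samples, e.g. idling at the end
    return trips

# Two micro trips; the second zero of the "0, 0" pair starts the next one.
print(split_micro_trips([0, 10, 20, 0, 0, 15, 0]))
```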
Once it is sequenced, we then move on to calculations.
Generating results—in calculations we take the sequences, break up the point data, order it by time, and then draw a line representing the path of travel for that sequence. In the top right, you see a grouping of point data with speed indicated using color. What it shows is the vehicle traveling along, taking a right, taking another right, going down the street, making a U-turn, and coming back down. We build the line using that path of travel, ordered by time—what you see below is a line segment generated from the points. The triangles along the line segment illustrate the instantaneous direction of travel for the sequence. For each sequence, we generate more than 250 attributes related to the speed profile of the vehicle in that sequence. You can see a map at the bottom of several vehicle days where one vehicle day has been selected; the information relating to it is within the window below. And you can see there is data associated with each feature in our database.
So that is our drive cycle processing routine. It ends up with 250 calculations for each of the three sequencing scales we have.
The additional TSDC processing that I touched on earlier includes three separate steps, one of which is duplicated for both vehicle and person trips. The first one is an EPA vehicle match, where we take the vehicle configuration data (if provided) and link it to a vehicle type as well as the average fuel economy, making as close a match as we can to the EPA database.
For the person data, we upload it and assign a unique person identifier; we assign our own identifiers but maintain a link between the old and new identifiers. We upload each person into the database and then (using person data, trip data, person-trip data, or vehicle-trip data) feed it through unfiltered trip processing, which is similar to our calculations—it draws a line but doesn't generate the 250-plus results. It produces a set of quality-control statistics to assess the original data set. These steps all depend on how expansive the data provided in the original study is.
We maintain a link between the original study and what we do with the data using the identifiers that we assign as well as the identifiers assigned by the study. So for example, in our Atlanta data set, each household is identified with a sample number; the sample number might be 8,042. There could be multiple vehicles within that household, and the vehicles are enumerated relative to the households. So if there are two vehicles in household 8,042, you would see vehicle number 1 and vehicle number 2. Using the combination of the two, they are converted to an original vehicle identifier that corresponds to our NREL vehicle identifier, acting as a crosswalk between original and processed data.
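That crosswalk can be made concrete with a small sketch: the study's household sample number and per-household vehicle number combine into an original vehicle identifier, which we then map to an NREL-assigned identifier. The ID format and the NREL ID values here are made up for illustration.

```python
# Hypothetical illustration of the identifier crosswalk described above.
# Household sample number + per-household vehicle number -> original
# vehicle ID -> NREL vehicle ID. Formats and values are illustrative.

def original_vehicle_id(sample_number, vehicle_number):
    """Combine the study's household and vehicle numbers into one identifier."""
    return f"{sample_number}-{vehicle_number}"

# Build the crosswalk between original and NREL identifiers.
crosswalk = {}
next_nrel_id = 1
for sample, veh in [(8042, 1), (8042, 2), (8043, 1)]:
    crosswalk[original_vehicle_id(sample, veh)] = next_nrel_id
    next_nrel_id += 1

# Two vehicles in household 8042 stay distinct but remain traceable.
print(crosswalk)  # {'8042-1': 1, '8042-2': 2, '8043-1': 3}
```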
That might be a little confusing, and I can touch on it in the question section, but we definitely want to make sure that we’re providing the original studies and adding value to those studies with the intent of allowing comparison across studies—normalizing data to make it usable for comparative analysis.
Our master database feeds data to our different areas. We process all the data, and these are our base data sets shown on the map—a data set that's up there on the map might not be represented in all access areas. And depending on agreements with MPOs or data providers, we provide different levels of access.
As I mentioned earlier, some of them have wearable add-ons. The Atlanta and Chicago data sets—that we currently have—contain wearable add-ons. They all have survey data included with them to tell you about the individuals as well as vehicles and any other diary information recorded with the study.
So how do you access this data? We have two separate areas where you can go to get the data in various forms. The first is our cleansed download-data area, which is a website that contains anonymized versions of the original data sets as well as our processed results. To anonymize them, we remove any sort of spatial reference or any identification of where that person is—we remove any sort of classification information that can be used to locate or identify a person. So if it's a home or work or other trip, we provide it. If it's to this elementary school, we don't provide it. We then post the data in a useable format for people to download. Anybody can download the cleansed data following registration. You can see our URLs right there in the sub-bullets.
So the second access area—which provides full access to the data sets—is a secure portal with controlled access. It is a VMware client that you can download onto your computer and then work through it on a virtual computer housed on our servers at NREL. So what you're essentially doing is logging onto our servers to access the data, but we control what you can take out to make sure that no data can be removed without having to go through us first.
In this environment we've provided several tools for working with the data, most of which are open source. We've also included some proprietary GIS software. Once a user has gone through the process of applying for and gaining access to the secure-access area, they can remove aggregate results after completing their analysis, once we ensure that the aggregates don't contain any identifying information.
I'll let Evan continue with demoing the public-access data-download area and then we'll move on to the controlled-access area.
Right now, we're moving on to some videos.
So what we're going to walk you through here is registration and our main website for data download of public data sets. On our website there is some general information about us—fact sheets that you can download. But if you go ahead and click on the data itself, you'll come to this screen where it asks for registration. If you've pre-registered, you'll have a password—it is sent to your email immediately after registration and you'll be able to log on. What you see on the right is our data-use disclaimer—by checking that box, you're agreeing to it. Once you go through that process and can log in, you can go in here, download data sets, and download the original travel survey final report. We have data dictionaries that explain all the attributes associated with each zip file you can download, and we have each of our studies in here as you move along. I think right now we have four different studies, one of which has nine different subsets.
If you click on here, you can download the zip file and unzip it—that's just straightforward web access for cleansed versions of our data.
Moving on to the controlled-access area, which is much more interesting and where you have mostly full access to the data (we've removed names, addresses, stuff like that).
Just to interrupt one second—this is Jeff. The previous webinar that we have posted goes through in more detail the process for applying for access to this controlled-access area, but this is where we have spatial data and so additional protection is required. But after a fairly painless application process, you're given an account to log in. Again, I'll refer you back to that other webinar for more details about our procedures and how you go about applying.
After you go through the application process and are accepted, we give you access to our controlled-access area, where you first download the VMware software that we provide. Through VMware, you can log on to our virtual desktop. At the same place you download the software, we provide you with a user name and password to log in. Once you go through the log-in, a screen will pop up asking if you want to log on to the desktop. If you were previously logged on and got disconnected, you can go back here and it will say “reconnect to desktop” and it will save your session.
Once you hit connect, a window will appear on your desktop, which is what you see now. There is a user agreement when you log in—it's the same thing that goes along with our public-access website. You're just agreeing not to try to identify anybody.
So once you log in here, you see this desktop with a bunch of tools and a bunch of storage areas. The first storage area I'll talk about is the P drive. The P drive is where we share data from our studies, so these are subsets of our master database. Each of those folders is protected, depending on what you requested access to. So if you ask to look at the ARC data set, you can only see the ARC folder.
And the ones we're looking at right now are in a “my documents” folder, which is where users can actually save and store their information. The P drive is write-protected; the “my documents” folder is where you can work with subsets of our master database. Moving forward, to connect to the actual database—because the database is separate from the desktop itself (it's an extra security precaution we have in place)—you have to open a PuTTY window, which connects this desktop to our database. Any time you want to work directly with the database, you first have to click on that shortcut—it's on your desktop and it'll open these windows. Once those windows are open, you can minimize them and move on to accessing the database through PG Admin, which is our graphical user interface that works directly with the database.
You go ahead and connect to the database, and depending on your user name, you'll be allowed access. Right here we just have a test account set up and it's asking for the password. You click through that warning (assuming you get the password) and you can see all of our databases. Right now we have Atlanta Regional Commission (ARC), Chicago Metropolitan Agency for Planning (CMAP), Texas Department of Transportation, Southern California Association of Governments, and Puget Sound Regional Council (PSRC) posted up here. What we'll mainly focus on right now is ARC and CMAP—that is what I'll be covering for the remainder of this presentation.
So once you're in PG Admin, this is where you can perform SQL queries. What you see is a table that is part of a schema. You can think of a schema and a table as being similar to a folder and a file; it's just the structure that is defined, and it doesn't have any storage-capacity issues.
You can click through all these and see all the tables. For each table, there are hundreds of attributes, and we have added comments (within the section that you see on the bottom right) with all the information that you want to know about each attribute.
So that's our general data access—how you begin to work with the database. What I'll move to now are some simple SQL queries that demonstrate how you might want to work with the data.
So again, you see on the screen our PuTTY shell—you have to be connected to that to be able to connect to the database. In PG Admin, what we're going to be working with are ARC and CMAP. The tables I'm going to query in this example are the trip results, the points table, the day results, and then the additional census data we provide—specifically, the roads table.
Using those tables and using this interface, you can build and populate simple queries. Go ahead and write them out or have the software write them for you. We have lots of attributes to use. You can go in, check boxes, do whatever you want, and now you can see that SQL language is populated and you can run it. So this is hitting the CMAP data set and returning data for us to view.
On the P drive I've provided example codes for users to work with. If somebody is unfamiliar with the software or the programming language we provide, we go ahead and post it up here for you to begin to work with and begin to understand what is available. Most of the software we have for you is well documented online, so I would refer you there if you have any questions regarding SQL or QGIS or any of this stuff I'm going to cover.
In our example code, we have several SQL queries set up. The first one extracts the hour of day for each trip, the total distance traveled for each trip, the number of stops on each trip, and the total average speed. It orders the results by vehicle ID (VID) and provides all the results. You can export those results from the database to a CSV file in your “my documents” folder—I'll show you this several times in this demo. What you're seeing is data being extracted from our secure database so that you can work with it like it's on your desktop.
A second query we're working with is a little more complicated—it has a “where” statement. This is asking to return all the days where the total distance traveled is greater than 100 miles. This might be useful for something like looking at the range of electric vehicles or something along those lines. So, again, we're going to export the data set as a CSV and then work with it using Excel or any of the other stuff we provide.
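The queries above run against the TSDC PostgreSQL databases from inside the secure desktop; the sketch below uses Python's built-in sqlite3 with a tiny made-up day-results table just to illustrate the same "where" query pattern. The table and column names are assumptions, not the actual TSDC schema.

```python
# Stand-in for the long-distance day query described above, using an
# in-memory SQLite table. Table/column names and values are illustrative,
# not the TSDC schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE day_results (vid INTEGER, day_date TEXT, total_miles REAL)"
)
conn.executemany(
    "INSERT INTO day_results VALUES (?, ?, ?)",
    [(1, "2008-03-01", 12.4), (1, "2008-03-02", 140.2), (2, "2008-03-01", 310.7)],
)

# Return all vehicle days where total distance traveled exceeds 100 miles,
# ordered by vehicle ID -- useful, e.g., for electric-vehicle range studies.
rows = conn.execute(
    "SELECT vid, day_date, total_miles FROM day_results "
    "WHERE total_miles > 100 ORDER BY vid"
).fetchall()
print(rows)  # [(1, '2008-03-02', 140.2), (2, '2008-03-01', 310.7)]
```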
So one final query I'm going to walk through—a spatial query using the roads and the points data. What this is doing is saying that for vehicle 458 in our CMAP database, return all roads within about 50 feet of those points, which is indicated by the ST_DWithin function. It will return several options indicating the line ID from the census roads data as well as all the speed information that it is linked to.
The ST_DWithin is a PostGIS function; we'll get into that in a second. Once you go through this, you will see all your results in your “my documents” folder. What you're seeing is a file containing the data in a different format from the database, but the same data.
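PostGIS's ST_DWithin(a, b, d) returns true when two geometries lie within distance d of each other. This standalone sketch mimics that idea in plain Python for points on a flat plane (units assumed to be feet); the real query runs inside the database against the census roads geometry.

```python
# Plain-Python analogue of the ST_DWithin proximity test: keep road
# vertices within ~50 feet of a GPS point. Coordinates and units are
# illustrative; PostGIS does this against real geometries in the database.
import math

def d_within(p, q, dist):
    """True if points p and q (x, y tuples) are within dist of each other."""
    return math.hypot(p[0] - q[0], p[1] - q[1]) <= dist

gps_point = (100.0, 200.0)
road_vertices = [(110.0, 205.0), (500.0, 900.0), (130.0, 240.0)]
nearby = [v for v in road_vertices if d_within(gps_point, v, 50.0)]
print(nearby)  # [(110.0, 205.0), (130.0, 240.0)]
```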
What I have mocked up, to give you an example of something you might be able to do with that data, is just a simple Excel graph that illustrates the data we just extracted. On the left graph, you see the total miles traveled along the X axis and the average speed along the Y axis. And as you might assume, as the total distance traveled increases, the average speed of the trip increases.
The stops versus average speed on the right-hand side are a little different. But, again, this is just an example to show you how to pull data out of the database and then work with it in Excel—nothing too fancy here. I'm sure everybody is very well versed in what you can do in Excel and its limitations.
So moving on to something a little bit more interesting than just your standard SQL, we have Python. So this will walk you through the Python package, which will generate similar graphs to what I just showed. The Python software or Python environment we have set up is part of Python XY, a scientific package. We use the Spyder interface to access all the different Python functions and programming language functions that come with Python. So Spyder is opening. On the left-hand side, you see where to put in your scripts or call in your saved scripts. On the top of the right-hand side, you see some information where (if you're using a function) it will tell you what variables are associated and what you can put into that function and what that function does. And the bottom-right here shows you the results of the query or the results of a script.
So what we're going to do is load a saved mocked-up example that I have set up for us. Again, it's on the P drive and the example code is under the Python XY section. We're going to generate some scatter plots using those same queries from before. This code at the very top shows you we're calling in several Python extra packages and then we're using the query that you see below to send it to the database and return the data.
The connection information is below that query, and we use the Psycopg Python package. This specific query is hitting the CMAP database. All we're doing (instead of writing that data out to a CSV file) is pulling the data into virtual memory and then manipulating it using Python.
So that's what all this code is down here—we're just taking the data from the database and placing it in lists. The code below that takes the lists and returns a Matplotlib graphic; Matplotlib is a Python package that allows us to graph the data.
So we're going to spit out two very simple graphs here—same ones that we spit out in the Excel documents; you see the same distributions.
The next query we set up is something a little bit more exciting and something that's not easily done in Excel. What this section will do is iterate over the hour of day indicator and plot each hour of day using a different color for each of those graphs. You can see different portions of the sample and it gives you a little more information to go on. And you can potentially build this into things like animations. With Python, the sky's the limit with what you can do.
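The hour-of-day iteration described above boils down to grouping trip records by start hour so each hour can be drawn as its own colored series. This sketch shows just the grouping step with made-up trip records; in the demo, the records come from the CMAP and ARC databases and are plotted with Matplotlib.

```python
# Sketch of grouping trips by hour of day so each hour gets its own color
# on the scatter plot. Trip records here are made up; the real data comes
# from the TSDC databases.
from collections import defaultdict

# (hour_of_day, total_miles, avg_speed_mph) -- illustrative trip records
trips = [(7, 12.1, 28.0), (7, 3.4, 18.5), (12, 45.0, 52.3), (17, 8.8, 21.0)]

by_hour = defaultdict(list)
for hour, miles, speed in trips:
    by_hour[hour].append((miles, speed))

# Each key would become one colored series on the plot, e.g. with
# Matplotlib: ax.scatter(xs, ys, color=colormap(hour / 23.0)).
print(sorted(by_hour))  # [7, 12, 17]
```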
So, again, we have the same two graphs, but this time it's assigned a color gradient based on hour of day. Not much to see there, but it shows you the power of Python relative to Excel. So what we're going to do now is connect it to separate databases simultaneously and run the same exact script for both of them and generate four different graphs. As you can see, we're connecting to ARC and CMAP simultaneously—sending the same queries, returning the data, handling it the same way, but plotting it in two different ways.
So what's about to pop up are four different graphics with the same color gradient as before, but the top will show you the Chicago data, the spread of data across the graphs, and then the bottom is the Atlanta data set.
So what this really shows is that the processing we apply to the data allows you to do comparisons across data sets. While they have all the original data provided for the study—the analysis and all the stuff that comes along with it—we want to be able to compare across all the data or all the data packages we receive, which is what this allows you to do. And what you're seeing right now is that you can save these graphics as images and put them in a report and extract them or we can extract them and transfer them to you if need be.
So that covers Python more or less. There is a lot more that can be done with it, but that gives you the basics. And finally, the fifth and final video I'm going to walk through shows you what you can do with GIS within the controlled-access area.
So again, you need to have your Putty tunnel open. You double click on Quantum GIS, which is open-source GIS software that allows us to connect directly to the database, whereas most GIS software might not be set up to do that. What you see here is—you open it and we have all of our databases and you can connect to them. All it does is ask you for your username and password, which we provide.
And what we've done in the example folder that I mentioned several times—we've added a QGIS map document that works on both CMAP and ARC. But because the two data sets are separated into two databases, you have to log in twice, which is what you'll see me do. If you're provided that information or if you apply to access each of the databases together, your username and password will work for both. If you're denied access, they won't work. It'll only work for the ones you've asked for—so say you ask for Atlanta, you'll be able to see Atlanta.
In this map, you see all the states colored by the number of households. We have census data to back up more advanced analysis (seen in map). What I'm going to do here is zoom in to our Atlanta data set or at least the census data that we have to go along with it.
This is place information with households—number of households in each place. And then you drive down and you can see that we have the road data for that area of interest.
Moving to the Chicago data set, you see the same things. We can zoom in and you'll see that we have the roads present for people to work with and do analysis with. All this is fed directly from our data server. If you want to directly access the data from the database, you can click on that—the blue cylinder that I just clicked on—and it will pop up here, assuming you've logged in. And you have all your options for what you can call to the map. You click on that option and you can build a “where” clause from that table.
So what we're going to do is duplicate the query that I performed earlier where we only want to return days of travel where the distance is greater than 150 miles. You can build these queries by clicking through the fields (they'll populate the clause area) or you can pull samples in and do anything you want—that was just a test to make sure your query runs. So we'll go ahead and run it and it will pull in all the day results where the vehicle traveled over 150 miles into the map.
You can change the coloring on the map—I'll go ahead and assign a color gradient based on the total average speed of travel. You can define the classes and the color gradient assigned.
So once you do that, you can change what your classification system is and manipulate anything you want. You click “apply” and then click “okay,” and what you'll see is all those vehicle days over 150 miles assigned a color gradient by average speed of travel. We'll go ahead and zoom in on that. If you want to look at the data associated with those features, you can use the “identify” tool and select one of the features—what will come up is all the attributes associated with that feature as well as the values relating to those attributes—so what you're doing is interactively accessing and viewing the data.
You can then select portions of this data set. We're going to select several days by using the bounding box and then export those days as a Shape file, which is basically the standard GIS file format.
So you can go ahead and save that in your “my documents” folder. And once that is saved, you can add metadata—you can give notes on what you're exporting and different things like that. What you can do through QGIS is pretty similar to what you can do through PG Admin where you can view the data and take out chunks as needed. Or, you can just work within the interface we provided you.
So we've exported a Shape file from that selection. The last thing I'll go over is the ArcMap software, which is proprietary GIS software that we have for users. The reason we don't use it as our main software is because it doesn't directly access our open-source database and requires an intermediary (like QGIS) to get data to and from it. Without moving up to the ArcGIS Server software packages, we can't really get the data size that we want to function well using it.
What we're going to do is call in a Shape file we created and add it to the map. This functions very similarly to QGIS, but the big difference is that there are a whole bunch of tools built into this that users can access—there is a search tool here where you can put in a keyword and it will give you the tool that you want.
So if you wanted to run a buffer analysis, you put in “buffer.” This window pops up and you identify the feature you want to work with, fill out the rest of the field, and run the analysis.
So I'll go ahead and do another keyword search; just the word “select” returns pages and pages of options for tools to use and each of those tools has an explanation.
So that is a general overview of what we have for users and the background on our projects. And that does it for our demo portion.
So before we open it up to questions, I'll recap since Evan went through a lot of material very quickly.
We started out talking about the web downloadable-data area where you can submit (through a simple form) a request for a log-in account for that area and get it instantaneously and download data that you can work with that has no spatial information attached to it. So the privacy issues with the data have been addressed that way.
But for analyses for which that data is insufficient, we have this secure-access portal for controlled access to data that you can get an account to by filling out an application and then you can connect to it from the comfort of your own desk remotely. We have controls in place to prevent data download, but have lots of tools on the site itself for working with the data—Evan went through a number of those tools.
There are a few others on there that we didn't spend a lot of time on, but you could see there were example pieces of code to get you started with a number of different tools. And in addition to those tools already there, if you have a tool or a reference data set that you would like to use with some of the data that we're providing, we can allow you the opportunity to load that in so that it's accessible through your account for working with the data.
So with that little recap, I'd like to open the floor for questions.
So this is Elaine. I don't have a question; I have a comment. Evan did go through a lot of material. I think what's important to say is that once you apply for permission to use the secure data center and you're approved, then you're not just thrown into this and just, well, here is the sample code and go at it. Evan will be available to assist you if you're not familiar with some of these packages. He's not going to do GIS 101, but in terms of familiarizing people through the system, he's able to provide that assistance.
Yes, that is correct. And to identify Elaine—Elaine is at the Federal Highway Administration and has been one of the project managers on the DOE side of this effort.
If anybody ever has any questions I should be able to answer them very quickly. I've built all these data sets from the ground up (or from source data up), so I'm very familiar with all of it and shouldn't have any problems helping you out.
Are there any other questions as far as the tools or anything that's available here?
Yeah, question from Ford: Do the studies include both warm weather and cold weather data sets?
The Puget Sound specifically covers a very long period of time, so you definitely have multiple different seasons as part of that data set. Evan, can you click over to the Chicago data?
So the Chicago data was mentioned earlier—the sample for each vehicle is seven days, but it was a rolling study with data collection between March and November. So for different vehicles, the data was collected at different time periods. Information on when the data was collected is available along with the data, so you could parse that out.
Another question from Ford: Do you have a good distribution of time during the day? Because if you drive at 11:00 versus rush hour, you get different driving profiles.
Yes, we certainly do. All of these studies collected at least one full day of travel, and in many cases more than one full day of travel. And so you have whenever that travel occurred and there are certainly vehicles that drive at all different times during the day. You can plot out distributions of when vehicles were driven; that is something that we've done ourselves with analyzing the data.
Yes, if you think back to the Python example I showed, you could see that it broke all the travel up into hour of day. And you could see that there was a pretty good distribution across time of day as well. We've looked at stuff by hour of week, minute of week, and second of week, and we have a pretty good sample across those times. It is the months where I think you get a little problem, because we only had seven days of sample, or two days of sample, or so on.
So because the GPS data has the detailed date and timestamp, you could pull in a weather data set by a specific day and add it to the TSDC and then join it so you would know if it was snowy, icy, or rainy.
Yes, that's right.
Here at Ford we're going to have to sign off. We're thankful for what you've shown and pretty impressed with how you've gone about it.
Okay, great. Thank you.
Thank you all for participating. Again, feel free to contact us—you can go to the website or contact me directly at Jeff.Gonder@nrel.gov. Feel free to send us a note if you have any follow-up questions or you're interested in applying for data access or contributing data sets.