Solution for Data Security Challenges Faced by Smart Grid Evolution - Video Text Version
Below is the text version for the Solution for the Data Security Challenges Faced by Smart Grid Evolution video.
Erfan Ibrahim: Good morning. This is Erfan Ibrahim of the Smart Grid Educational Series from NREL, and today we have Venkat Subramanian from Dataguise, who is going to be speaking about the data security challenges in the smart grid and also discussing some possible solutions for those challenges. So, I wanted to start by talking a little bit about data security at a high level, and then we can get into the discussion with Dataguise. The security of data is important at three particular instances. One is where the data is being generated, wherever that happens to be; the second is the transmission of the data as it goes from where it's generated to where it's stored; and the third is where it's stored – whether in small quantities or large quantities.
And in each one of those instances, there are certain security controls that need to be in place in order to make sure that the sensitive data has confidentiality; that the data that was generated is actually the one that's being used later – in other words, has integrity; that the data is available to the business applications when it's needed; that there is reliability of that data over long periods of time; and that there is accountability of who generated that data or modified it after it was generated. So, all of these non-functional attributes of data security are very important, and all of them matter at each of those three phases – from generation to transmission to storage. The challenge that we have in the smart grid is that all of a sudden, so many players have entered the smart grid ecosystem that we are moving away from the highly centralized, vertically integrated model – where one entity, namely the utility, owned all the data – to an ecosystem that has, even within the same utility, different business units dealing with generation, transmission, and distribution. Added to that is the complexity of independent power producers, aggregators, and then, of course, the customers, who are now becoming prosumers rather than consumers. So, in this cocktail of players, there is a lot of data exchange occurring with different levels of cybersecurity adherence and implementation.
Compliance exists in this energy area for bulk systems – like generation and transmission [inaudible] – but the moment you get into distribution, IPPs, and aggregators, regulation occurs at the state level with public utility commissions, city governments, and co-ops. And there isn't any consistency across the nation in the implementation of security controls. So, it is the responsibility of each data owner to go beyond compliance, actually focus on security and reliability, think of the appropriate technologies and business processes, and implement the appropriate policies, so that all these non-functional attributes of security – namely confidentiality, integrity, availability, reliability, accountability – are respected and consistently implemented. I have been bringing you a variety of technologies that address different aspects of data security. We had Agile BQ present, and they showed a way that you can encrypt with a method that is a lot more secure than even AES.
We had Illusive Networks presenting on how you can go to the next generation of deception, where attackers cannot tell whether assets are real or not. And we have also had presentations on how to secure data in large quantities. Now, we have to think about this. If you just do bulk encryption of data, it is not very efficient, because with block encryption, whenever you want to run a query of any kind, you have to decrypt entire blocks to get access to the data. So, that's not very efficient.
And selective column encryption is a method that has been around for 10-15 years, but that only applies to structured data, not unstructured data. So, there are these challenges that exist. But how do you make applications that are smart but nimble? How do you get data to the right people at the right time? And, at the same time, how do you make sure that the wrong people don't get their hands on it – or at least have to jump through any number of hoops before they can get to the crown jewels?
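The trade-off just described can be sketched in code. Below is a minimal Python illustration: only the sensitive column is encrypted, so a query over the non-sensitive column touches no ciphertext at all, whereas whole-blob encryption would force a full decrypt first. The XOR "cipher" here is a toy, not real cryptography (a production system would use something like AES-GCM), and the field names are invented for illustration.

```python
import hashlib

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive a pseudo-random keystream from key+nonce (toy, NOT secure)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def xor_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR with a keystream (same call encrypts and decrypts)."""
    ks = keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

KEY = b"demo-key"

# Selective column encryption: only the sensitive column is ciphertext.
rows = []
for i, (meter_id, kwh) in enumerate([("MTR-001", 12.4), ("MTR-002", 8.1), ("MTR-003", 15.9)]):
    nonce = i.to_bytes(4, "big")
    rows.append({
        "meter_id_enc": xor_cipher(KEY, nonce, meter_id.encode()),  # encrypted cell
        "nonce": nonce,
        "kwh": kwh,  # left in the clear, so queries need no decryption
    })

# A query over the non-sensitive column touches no ciphertext at all:
high_usage = [r for r in rows if r["kwh"] > 10]

# Only when an authorized user needs the identity is one cell decrypted:
r = high_usage[0]
meter = xor_cipher(KEY, r["nonce"], r["meter_id_enc"]).decode()
print(len(high_usage), meter)  # 2 MTR-001
```

With block encryption of the whole file, that same usage query would first have had to decrypt every row, including the meter identities nobody asked for.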
So, Venkat here today is going to be talking about a new approach to granular control of data for security, so that people who are authorized get the information that they need, and others who are trying to get access to information don't get anything they shouldn't. So, with that background, I'd like to welcome Venkat Subramanian, who is the CTO of Dataguise, a Fremont, California data security company. Venkat has extensive experience with data, having worked at Oracle and in other functions. So, Venkat – welcome.
Venkat Subramanian: Thank you, Erfan. So, I'm assuming everybody can see the screen, and let's get started. So, just to kind of level set, I thought I'd start with a couple of basic things here. So, the operation of the grid – starting from generation, then transmission over high voltage, then getting into the substations, onto the feeder lines, and through the local transformer into the consumer's household, for instance. And obviously, on the [inaudible] side, the voltage is stepped up for high-voltage transmission, and then finally, the voltage is lowered to 120 to 240 volts for household consumption and so on and so forth.
This basic layout has been around for the longest time. Now, let's look at that in the context of the traditional power grid. It's a simple flow from generation to delivery – almost like a single direction in which everything travels across this network, if you will. The controls are centralized.
Communications are compartmentalized, and, for instance, if a relay were to trip somewhere causing disruption, notifications sometimes had to come from the affected consumer. There may not have been built-in systems that would raise an alarm about it. I'm not saying that was universally true, but there were certain segments where that was the case. And fixing was manual as well. So, with few options, it could actually take a while.
Inspection of the health of the system and its various parts was also manual. In this process, data collected was transmitted over to a central location such as the utility, which would handle it from there for planning purposes – working out the details as to who's going to supply how much energy and so on and so forth. So, it kind of worked all right, as long as no disruptive event happened. As I mentioned, the manual nature of how some of these disruptive events [inaudible] had to be handled just meant that disruptions could affect not just households, but also industries and so on and so forth. Of course, we cannot forget the great blackout in the Northeast a while ago, which snowballed into covering an entire segment of the Northeast region.
So, there was nothing built in to handle such a situation in the traditional power grid. Now, moving on to the so-called smart grid. The idea here is to turn the network into an interconnected system with multi-way communications built in – new technologies with the smarts to handle disruptions, and automation. So, you're moving from a highly centralized environment, as Erfan was talking about in his introduction, to more of a distributed system: integrating a whole lot of small power generators of various forms, including renewables – even consumer rooftops – and making them part of this, where some of these are not even predictable sources of energy, and making the entire system work.
Those are real challenges for the smart grid, but it actually opens up possibilities that didn't exist with the traditional grid. And the goal is, of course, efficient delivery of energy in a distributed system, with real-time communications and controls across the grid. In this context, even a household can be an independent provider of power, and there are possibilities that these consumers, over a period of time, may even be able to trade power amongst themselves – using, maybe, something like blockchain technology. Some of these things are being discussed, and some people are actually running pilots to see how this will all work out. As I'm talking about this, I do need to mention something here.
You may be able to tell by this time that I'm not from the energy sector. I'm a data guy. I'm a data guy from Dataguise – that makes sense. And I also just realized that I hadn't, you know, given enough of my background. So, let me just take a moment here.
So, I'm an engineer. I've worked with databases for the longest time. I started off with Ingres – I don't know how many of you with database backgrounds remember Ingres; it's no longer around – and worked there for six years, 13 years ago or so. So, I kind of cut my teeth in data, and I sometimes introduce myself as a lifer on data. My focus today is going to be about data, so if you think of it in this fashion – it's going to be a perspective from a data-centric standpoint.
So, let's just quickly look at the smart grid's energy resources. I touched on a couple of things in the previous slide. So, the conventional energy sources – which cover much of the generated energy today – and obviously, there is a move towards other distributed energy resources and renewables and so on and so forth. And when you see these percentages – for instance, coal at 33 percent and so on – obviously, those numbers have declined rapidly, and of course, the cost of natural gas being what it is, that has kind of taken the stage, and maybe it's a step towards lower-carbon-footprint ways of generating electricity and so on. So, I'm not going to go into that.
That's not the focus of my presentation. On the distributed energy resources side, of course, there are the small combustion turbines and so on – solar, wind, and so on – and also, it looks like nuclear reactors may make a comeback in some way, with small nuclear reactors that can probably even be buried under substations, so you don't have to generate in one place and deal with all the energy lost in transmission and so on and so forth. So, the idea with the smart grid is to allow these kinds of smaller sources to also provide energy into the grid. In the process, we now have far more players than we had in the past. That, in and of itself, is a challenge, simply because with fewer, larger players, it's easier to develop processes – to make a consistent system across the board and make it all function.
With distributed energy resources, where many of these are coming up on almost a daily or weekly basis, there are challenges around what kind of data they should be collecting and how they're going to communicate. There may already be some standards that have been created, but certainly, those need to evolve as well. So, moving on – quickly, the benefits. I don't want to spend too much time here – the folks on the line know far more than I do. Obviously, there's the reliability of power, and also being able to handle demand in a real-time fashion, given the amount of communication, the sensors, and the load data that's being sent.
It's not that you always need to use day-old, month-old, or year-old data to project what you need. You have it today, and planning is something you can literally do hour to hour. So, that's one advantage. The second one is that it allows newer providers to come in with a lower carbon footprint – we already mentioned a couple of them from the alternative energy sources. And then – the lower cost to consumers, of course.
So, in the past, it was all one-way. The utility provided the service; the consumers consumed and paid whatever the bill told them to pay. Now, with the smart grid, consumers also have access to information about their utilization and the different cost factors, depending on the time of day at which you use electricity, and so on and so forth. The reason I'm mentioning all of this is that now we have far more players wanting to interact with the system, and that poses another level of challenge. That means now you have to share the content – obviously, you're not sharing the same content with everyone – but that's really not the entirety [inaudible].
So, [inaudible], as meaningful as they are, are shared with different sections of the people who participate in the grid, and each of those use cases has its own set of data that you would consider important – to Erfan's point, because these non-functional attributes are required for maintaining the fidelity and integrity of the system. How that would actually be done is a challenge that the smart grid brings up. So, along with the benefits that we've been talking about, there are certainly challenges, and I'm going to gravitate a little bit now towards the challenges from a data perspective. So, now, about smart innovations and data. With sensors at every level – I was talking to Erfan a little while ago, and he was talking about having sensors in the high-tension transmission line so that you can actually start to sense the sag of the line.
And also, with the big-data type of processing that one can do now, you can project at what point – it's not just sensing what it is today, but actually using the history of data for a particular segment of the line to project when that sag may be reaching a threshold that's considered not optimal. Then a lot of planning could happen around how to handle that, and maybe, if that had been present in the early 2000s, that entire blackout in the Northeast could have been avoided [inaudible]. But anyway – so then, with SCADA controlling the devices and acquiring huge amounts of data, we're now talking about a volume of data that's never been seen in the energy sector. Every day, more and more devices and sensors are getting added, each one of them generating data. They're not generating data once a day or once a month – sometimes several times within an hour, or within a minute, based on what they [inaudible]. If it has to do with a substation, it probably needs more like a microsecond interval.
Because you don't want a fault to occur at a substation and not be handled for the next 10 minutes, because that's the next time the sensor comes alive to sense what's going on. And then, there's certainly variety, with the various different elements that are part of the system, each one creating a different type of data. And the velocity – this is what I was also referring to – which is the number of these things. IoT is becoming a big deal, and everybody's talking about it, and particularly in the energy sector, it makes a lot of sense. Because all the way up to and including the households, the industries, and whoever the major consumers of electricity are – having these sensors helps in planning and reacting to changes in demand, and also reacting in near real time to any kind of disruptions that may happen, so that you can bypass the portion that is down and still continue to provide electricity to the consumer at the other end.
And also, there is varying quality. This is a challenge all the way around, because if data of a similar nature doesn't have the same quality, how do you make sense across the data that's collected by the various segments and various players? That could also pose challenges to properly designing and running the grid. So, as I've been talking, the thing that keeps coming back to my mind is that this whole energy sector – particularly the [inaudible] of the smart grid – is becoming a data-driven enterprise. It's all about data – from the sensors, from users. Users are all looking at their own usage and doing things with it to optimize their bills, for instance, and so on and so forth. So, in a way, IT is meeting OT.
Operational technology is meeting information technology. So, operational intelligence, analytical predictions, and so on are things that IT is good at, and they're required here. SCADA controls a whole part of it that generates quite a lot of data, and historical information, data warehousing, and doing critical analytics again becomes part of what is possible. And then, there's demand-side management – the consumer-facing part – I've been talking about how consumers have access to their own energy usage and so on and so forth. Given all of that, the danger one could project is kind of a reality, in the sense that the system is vulnerable: there are so many points of entry, there are so many devices at play, and so much of it is controlled not completely centrally as it was in the past – but at least for each segment, there's a centrality to how it is done – and obviously, any kind of bad actor getting in there could send the wrong signal to do bad things. And I was actually just going through the kind of [inaudible], and I was a little surprised to learn that the energy sector, you know, leads in this respect.
Nothing to write home about, but that's because of the open nature of the system. It seems like the bad actors find that this may be a good place to play the game. So, as we discussed, it's obvious that data is your biggest asset. It's also becoming your biggest vulnerability. So, how do we balance the two, so that you can get a smart grid that works but, at the same time, is secure – where the integrity is maintained, the availability that Erfan was talking about is present, the fidelity of the content is maintained, and so on and so forth? So, how can one secure the smart grid?
There are a couple of things that Erfan talked about in terms of communication and collection and so on. As important as they are, they're not going to be the focus of my presentation today. It's more about storing the data, retaining the data, and how to control who has access to which parts of the content and so on and so forth. So, obviously, with securing the smart grid, there are regulations for standard ways of handling the data and the controls, and then there are general best practices for each entity that is part of the smart grid. And then there's normalizing the data – again, when there are such differences in the way the data's collected and shared, all the players have to share the data in some form for them to operate together as one unit from a customer standpoint, so that normalization is required.
Then there's ensuring the fidelity of the data – and this needs to be done by each player, with strong authentication and authorization mechanisms. These are just standard operating procedures, right? You want to limit the number of people who can authenticate into a particular system, and then you control what they can see through authorization mechanisms. The next one – and again, Erfan talked about this as well – is using encryption as a mechanism to selectively encrypt content – and I'll get a little more into the details of what is actually done in terms of how much of the data can be shared – and using the roles that individuals play to control exactly what they get to see, through role-based access controls, usually called RBAC. And then, it's not enough to just talk about these as mechanisms that we can use; the workforce – which is actually building the grid and making it operate in an efficient fashion – needs to be trained and developed.
So, it needs to be more policy-, process-, and tools-driven, so that training is one part and the system often guides the workforce to do the right thing. And then, finally, consistency: different players that are part of the system have differently set security postures. That actually causes a challenge, because the strength of the system is really limited by the weakest link.
So, that means that everybody needs to raise their level, and there needs to be consistency across the board, so that there is no such thing as a weakest link. The goal is, ultimately, of course, to ensure the integrity and the availability of the system. So, regulatory compliance is one way that government and other agencies try to bring about consistency. I've just put up a few regulatory compliance regimes that are present across different industries – HIPAA for health care; PII, for generally Personally Identifiable Information, which also applies to the energy sector, especially from the utility standpoint, in as much as how they deal with consumers and so on. And of course, PCI DSS – the Payment Card Industry Data Security Standard – and so on.
But what's of interest to you is NERC CIP, which tells you exactly what is expected of you in the manner in which you operate and handle data and so on and so forth. So, the vision is to practice secure business. Right. So, what does that really mean?
The vision is the ability of the enterprise to safely and responsibly leverage the value of its data assets, because data becomes the major contributor to the larger ways in which the smart grid is going to operate. In this context, I have to say that even a grocery store chain today is a data-driven company. So, the energy sector being where it is, it's absolutely very, very important to acknowledge that that's the case and deal with it appropriately. So, what does safely leveraging the value of the data assets do? It helps gain new insights and better operations, eliminates breach exposure, and drives the integrity of the entire system.
Now, in addition to that, there are IT-level changes going on as well, and those also add to the challenges, because of the amount of data that we're talking about and also the level of access that is required to be nimble and take advantage of the data that's currently available – which wasn't a few years ago – as it becomes important. So, what used to be an IT-led system-of-record reporting environment is now becoming much more democratized. The business units themselves want to be able to get access to data in real time, to slice and dice data and mine as much of it as they can – and in the case of control systems, this is important because of the decentralized environment. Far more entities need access to actually make the grid function. So, that's one.
The second one is from an analytics standpoint: the moment the word "analytics" got mentioned, the names that popped into our minds were Teradata, Netezza, and so on and so forth. They're very, very traditional data warehousing environments. The difficulty with them is that for the schema you use, you have to have some notion as to how you're going to use the data, so the schema reflects that. In a manner of speaking, you were kind of straitjacketing yourself: if new ways of slicing and dicing the data came up, it would actually be harder for you to implement them. The second part of it, of course, is that they were very expensive.
So, to be able to do more with the data, you need more analytical capability, but going in the same direction as previously – with the Teradatas of the world – actually made it a bigger challenge in terms of cost and other related matters. So, the open-source, open systems – Hadoop and so on, and I'll mention a couple more technologies in that regard – that's one of the other data transformations that's happening in the manner in which data is stored, retrieved, and utilized. Now, we've talked about two things: one, more people need access; and two, there are more ways of slicing and dicing the data through open technologies. Both of those add up to needing far more capacity in compute power. As I'm sure many of you who are aware of how this works will recall, from the time some compute resources are required to the time they are procured and made available, it sometimes takes weeks or months.
And that doesn't necessarily address your needs in a timely fashion. But even after that is done, the problem is that some of these requirements for compute power may not be 24 hours a day, 7 days a week; the demand actually comes and goes. So, you may end up with a system that's set up for handling the maximum capacity – the spikes that could happen in terms of compute power and so on – but the rest of the time, it sits idle, not used. And, on top of that, this entire IT infrastructure management is not a core competency for pretty much anybody on this line, or in most of the other vertical industries that IT talks to. So, that is kind of nudging people to more seriously consider the public cloud.
But there's one concern that transcends all three transformations that I've talked about – democratizing the data, moving off of the traditional warehouses into more open systems, and now moving off premises into the cloud – particularly, in this last case, the data used to be within your four walls, [inaudible] secure, versus putting it up in a public cloud. All of these have one common concern, and that's security. The reason I brought that up is that for all the data we're handling in the context of the smart grid, there are also other IT-related changes happening in terms of storing, retrieving, analyzing, reporting on, and acting on the data. Anytime we talk about a secure environment in an IT setup, the first thing that comes up is always perimeter security, because the first thing that pops into your mind is the hacker who's trying to break into your system from the outside and get to the data that you have [inaudible]. But when you look at the bulk of the breaches that have happened, most of them happened because of bona fide users from within the system who acted inadvertently – or maybe, for some reason, they were unhappy with their situation, and they did something stupid.
So, it's those insiders that actually cause more breaches than the outside people. That doesn't mean that you don't want to protect yourself from outside hackers – that's not the point. You need that as well to protect the environment, with all the capabilities that we can put into perimeter security. I know Erfan's organization, for instance, is actually testing and making sure that these perimeter-level protections are strong enough for the environment, and also that the communications that happen between these entities are done over secure communication channels, with best practices around them. But beyond that, you then need to look at the data that's actually stored, because that's where the crown jewels are.
You need to protect it – but protecting it doesn't mean that you lock it in Fort Knox and throw away the key. You have a huge and increasing need for the data. So, you need to find a way that balances the need to democratize access with the need for security. One other thing that Erfan mentioned is the whole file-level/volume-level encryption and decryption. Efficiency, of course, is one consideration.
The other consideration is – just as a simple example – if a particular file contains element type one and element type two, both of which are considered sensitive for your purposes, and a user needs access to the content but only has access to one or neither of those two types, then the user cannot be provided access to the entirety of the content. That means, again, tying your hands behind your back in terms of how well you can share the content. There's not enough granularity built into the system to let you more freely share the content, with the right content visible to the right people. So, that's important. So, in a nutshell, perimeter-level security is important.
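The element-type scenario just described – a file holding two kinds of sensitive elements and a user cleared for only one of them – can be sketched as a cell-level, role-based filter: instead of granting or denying the whole file, redact exactly the element types a role is not cleared for. This is a minimal Python illustration; the role names and field names are invented, not taken from any actual utility schema.

```python
# Cell-level sharing: redact only the fields a role is not cleared for,
# rather than withholding the entire record or file.
ROLE_CLEARANCE = {
    "billing":  {"customer_id", "usage_kwh"},   # may see identity + usage
    "grid_ops": {"usage_kwh", "feeder_id"},     # may see usage + topology
}

def redact(record: dict, role: str) -> dict:
    """Return the record with every field the role cannot see replaced by '***'."""
    allowed = ROLE_CLEARANCE.get(role, set())  # unknown roles see nothing
    return {k: (v if k in allowed else "***") for k, v in record.items()}

record = {"customer_id": "C-1042", "usage_kwh": 12.4, "feeder_id": "F-7"}

print(redact(record, "billing"))   # feeder_id redacted
print(redact(record, "grid_ops"))  # customer_id redacted
```

Both roles receive the same record, yet each sees only what it is cleared for – the all-or-nothing sharing problem goes away once protection is applied per cell rather than per file.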
Volume- and file-level encryption is a good thing to have, because if somebody were to break in – jumping through multiple hoops, as Erfan mentioned, and getting to the content – having entire files or volumes encrypted is one level of protection. But when it comes to retrieval, access, and sharing, that level of protection is an all-or-nothing proposition, as I think I've explained. So, we need more granular protection. Having talked through all of these things so far, the question that begs an answer is: so, what should we do? Right?
So, at the cell level – and by "cell," I mean individual elements; for example, element one and element two that are sensitive in a particular file – you can identify where those are, and then protect those assets appropriately through masking and encryption – and I'll go over exactly what those mean and how they work in a couple of minutes. Once you can protect those assets appropriately, you can then provide controlled access – but don't stop at just providing access; monitor the use of the content. Who's accessing what, when, and for what purpose, and so on, so that you also have some knowledge for the accountability purposes that Erfan talked about at the top of this discussion. If you can do those things, what you've essentially done – and the result of it – is that you've enabled employees and trusted partners – and in the smart grid, there are many, many partners working with one another – to make data-driven decisions.
You have enabled that to happen. So, effectively, you've lowered the risk of sharing and accessing data, and you've increased the value of the process. We've been talking in generalities about what Secure Business Execution is all about and a few ways of doing it. Let's now dig a little deeper into what each of these things means.
As far as data is concerned, there are four main things that one needs to do. One is: know your data. That is – where are the crown jewels? Where is the critically important data stored, and in what files and so on is it contained? Where is the density?
Which are the hot files in terms of the density of this content, and so on and so forth? So, know your files. Know the content. The second one is that it's not enough to know the files themselves; you also need to know who has access to them. There could always be some mismatches about who has what access.
The wrong people have been provided access; the right people don't have it; and so on – so, you need to be able to correct those authorization issues. The third one: with just those first two, you know where the sensitive content is. Given that it's all in clear-text mode, you can only provide access to a certain number of individuals or entities, and that's what you're doing. But that doesn't give you the maximum value of the data. As I mentioned before, everything is becoming data-driven, and in the process, more and more people need access.
So, as in the example of element one and element two, that granular level of access is important, and it's provided in the form of data protection. There are two ways of doing it – masking and encryption – and in a couple of minutes, I'm going to talk a little more about those. So, now you know your data, you know who's authorized to access it, and you've protected it – and with controlled [inaudible] protection, you share the content.
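The two protection mechanisms just named behave differently: masking is one-way and often format-preserving (good for sharing), while encryption is reversible for authorized users. Here is a minimal sketch of the masking side, using a made-up SSN-style value; real products use vetted algorithms and key management rather than anything this simple.

```python
import hashlib

def mask(value: str, keep_last: int = 4) -> str:
    """Irreversible, format-preserving masking: keep separators and the
    last few digits, replace the rest with 'X'."""
    total_digits = sum(c.isdigit() for c in value)
    to_hide = total_digits - keep_last
    out, seen = [], 0
    for c in value:
        if c.isdigit():
            seen += 1
            out.append("X" if seen <= to_hide else c)
        else:
            out.append(c)  # keep dashes/spaces so the format survives
    return "".join(out)

def pseudonymize(value: str, salt: str = "demo-salt") -> str:
    """One-way but consistent token: the same input always maps to the
    same token, so joins still work, yet the original is not recoverable."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

print(mask("123-45-6789"))  # XXX-XX-6789
```

Masked or pseudonymized data can be shared widely because it cannot be reversed; encryption is the complement, used where an authorized user must eventually recover the original value.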
Now you have a good, secure system for handling the data. But, as the saying goes, you trust the system, but then you gotta verify. You know, somebody almost 35 years ago used that phrase – trust, but verify. And that's where monitoring comes in. And I'm gonna talk more about that in just a minute as well.
So, let's actually look at what detection really means. As Erfan talked about, there are structured storage mediums such as relational databases and so on, where entire columns are pretty much homogenous in terms of the content they have. So, a particular column may have a customer ID. Another column may have a credit card number, and so on and so forth. And protecting that, by and large, is not that difficult, because the DBA who designed the schema to begin with would know where those sensitive elements are, and so they can [inaudible] the right level of access control on it.
But the bulk of the data today – especially what you're dealing with that's coming from all these different sensors and/or SCADA feeds from various sources – and we talked about the variety of data and so on – is increasingly unstructured. Unstructured causes a big challenge. If it's structured, I can just sample 1,000, 10,000, or 100,000 rows of data, and given that the content tends to be homogenous within a column, you can tell what it is. You don't have to read a billion records to do that. But when it comes to unstructured, every token in every file needs to be examined to understand what the content is.
So, that needs a different level of capability than the database part, because the context is not always clear. The data can be presented in multiple different forms. As you know, credit card numbers can be written as running digits or with separators, so when you look at the text, you see them split into multiple tokens and so on and so forth. How can you make sense of such content – put together the right set of tokens – to then be able to identify where those elements are?
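To make the token-reassembly problem concrete, here's a minimal sketch (not from the talk – the pattern and names are illustrative) that recognizes a 16-digit card-like number whether it's written as running digits or split into groups by spaces or dashes:

```python
import re

# Hypothetical sketch: match 16 digits even when split into tokens by
# spaces or dashes, as happens in unstructured text.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){15}\d\b")

def find_card_like(text):
    """Return candidate card numbers with the separators stripped."""
    return [re.sub(r"[ -]", "", m.group(0)) for m in CARD_PATTERN.finditer(text)]

print(find_card_like("Paid with 4111 1111 1111 1111 yesterday"))
# ['4111111111111111']
print(find_card_like("Card 4111-1111-1111-1111 on file"))
# ['4111111111111111']
```

A real product would of course handle many more formats and element types, but the idea is the same: normalize the tokens first, then classify.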
This is a challenge. So, you should be able to find the content from the patterns of the strings. The next one is – there are words like August and June and so on which can be part of many different data elements. They can be part of an address – as a street name. They can be part of a name – first or last. They can also be part of a date.
There are other data elements that could be part of many other things. So, just taking a token and matching it to one type of element is not necessarily going to tell you exactly what the overall entity is. That's where defining the grammar comes in – saying, "I have a first name and a last name," or "There's a last name, then a first name." Names themselves have about 12 different formats in which they can be written. Being able to tell that from the grammar – that's important.
Addresses – around the world, there are about 22 standard known address formats. Again, being able to find them – and here's the other challenge: addresses don't necessarily get written one after the other, as one piece to be discovered. If you were to write a formal letter to someone, you would have the street address in one line, and at the end of that line, you might say something about the location you're writing from; then you'd have address line two – the city and state – and at the other end of that same line, you might have a date, and so on and so forth. A human being looking at that can quickly see which address elements are bunched together – one below the other – and which are off at the other end. But a computer reading left to right and top to bottom is going to find all these things interwoven amongst various different elements.
So, being able to separate those out, make sense of them, and call out what an address is – that's a big challenge. It can be done using some form of grammar. Then there are also patterns in context. There are lots of elements – the simplest one that I always take as an example is a zip code. Five-digit numbers can be found aplenty in any kind of unstructured content, and some of them may even match known, valid zip codes and so on and so forth.
So, the fact that a five-digit number happens to match a valid zip code is not enough to call it a zip code. You may want it to be discovered with a dependency on other element types that can confirm for you that that's what it is, because you really want to avoid false positives. If you get a lot of false positives, that in and of itself throws you off in terms of how widely you can share the content, because a false positive is going to limit access if you don't do this right. Well, since I mentioned false positives, I also have to mention false negatives.
When doing all of this, you should avoid false negatives even more than false positives, because false negatives obviously give you a false sense of security. So, anyway, back to the discussion – being able to find other elements that could be present as dependencies. In the case of zip codes, you can say, "I'll only flag one if some other address element is present in its vicinity." And then there are patterns in combinations.
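The zip-code dependency idea can be sketched as follows; the context cues and window size here are illustrative assumptions, not any vendor's actual rules:

```python
import re

ZIP_RE = re.compile(r"\b\d{5}\b")
# Hypothetical context cues: street words and state abbreviations.
CONTEXT_CUES = re.compile(r"\b(St|Street|Ave|Avenue|Rd|Road|CA|CO|NY|TX)\b")

def find_zip_codes(text, window=30):
    """Flag a 5-digit token as a zip code only if an address cue
    appears within `window` characters on either side of it."""
    hits = []
    for m in ZIP_RE.finditer(text):
        nearby = text[max(0, m.start() - window): m.end() + window]
        if CONTEXT_CUES.search(nearby):
            hits.append(m.group(0))
    return hits

print(find_zip_codes("Ship to 123 Main Street, Denver, CO 80202"))  # ['80202']
print(find_zip_codes("Invoice number 80202 was paid in full"))      # []
```

The second call shows the point of the dependency: the same five digits are ignored when nothing address-like surrounds them.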
I use financial service companies as an example because anybody – I, for instance – can create a million credit card numbers which are all valid. I can create one for American Express and Visa and so on and so forth. It's really not a problem. By themselves, they carry some level of risk, but not very much. But if those are found together with a cardholder name or expiration date or the CVV code, now we're really talking about sensitive content being present.
So, one could define such combinations and say, "Don't report credit card numbers found in isolation. Only report when you find them in proximity to these names and CVV codes and so on and so forth." You should be able to define that. That's a requirement as well, especially with unstructured content, given the amount of data that you're collecting.
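As a concrete illustration of that combination rule – the Luhn check is the standard validity test for card numbers the speaker alludes to, while the keyword list is an illustrative assumption:

```python
def luhn_valid(number):
    """Luhn checksum: True for a well-formed card number."""
    digits = [int(d) for d in number][::-1]
    total = sum(digits[0::2]) + sum(sum(divmod(2 * d, 10)) for d in digits[1::2])
    return total % 10 == 0

def report_card(number, surrounding_text):
    """Report only when a valid card number appears near corroborating
    fields (keyword list is a made-up stand-in for a real policy)."""
    cues = ("cardholder", "cvv", "expiry", "expiration")
    return luhn_valid(number) and any(c in surrounding_text.lower() for c in cues)

print(report_card("4111111111111111", "Cardholder: J. Smith, CVV 123"))  # True
print(report_card("4111111111111111", "random log line with digits"))    # False
```

This captures the speaker's point exactly: a valid number alone is cheap to generate, so it's only reported when the surrounding context raises the stakes.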
E-commerce – we're working with an e-commerce site right here in the Bay Area – they crossed the one-petabyte mark a little while ago, and it's still counting. They just keep chugging away. In some sense, the more data they generate and [inaudible] store, the harder it sometimes becomes to even find a use for it without knowing the content. This is where the detection part is important as well. And then there's domain knowledge.
Certain verticals have certain keywords that better describe what is actually happening. Say it's an EMR – an Electronic Medical Record, where a doctor has just examined a patient and is describing exactly what he or she observed – there are certain terminologies they tend to use. In that context, it's possible to find certain elements more accurately. Given that vertical, using the [inaudible] associated with it, and finding elements suspected to be of a certain type – confirming that they are, or rejecting that they're not – is part of how you do that within [inaudible]. And then, obviously, there are standard elements.
We've been talking about credit card numbers and names and addresses, dates, and so on and so forth. These are pretty standard. But pretty much every industry has certain sets of elements that are sensitive to it. And even within an industry, every entity tends to have its own way of managing data and its own view of what it considers sensitive and privacy related. So, you need the ability to add your own set of elements that you want to find and know exactly where they're located.
And then the next one is machine learning. With all the things I've talked about for improving the ability to accurately identify sensitive elements within the content you're dealing with, you could even tune the system to say, "I'm going to avoid false negatives at all costs." As I mentioned before, false negatives give you a false sense of security. But, in the process, you may end up matching things that only remotely look like sensitive elements, so then you get false positives – and we talked about various ways to decrease that number. But to bring it down to an acceptable number, you usually need a different tool, which is where the machine learning part comes in.
And – maybe you can think of this as a plug for a product – this is one of the things that we actually use to minimize false positives. Any product in this industry that addresses this particular need of identifying and detecting sensitive elements has this one common downside, which is false positives. The product itself is going to be generically developed across the board for multiple industries and multiple customers, obviously, and so it cannot limit what it's going to look for. What makes sense in the context of a financial institution cannot simply be applied in the energy sector with the expectation of good results.
So, it's obviously something that is [inaudible]. The data of each individual customer and each of the entities we were talking about differs; your needs may differ. That's where machine learning really comes in. Any product that you're looking at comes with the ability to find sensitive elements wherever they may be hidden, because that's the whole idea.
If you knew where they were, you wouldn't need these tools. With the amount of data being collected, it's not humanly possible to eyeball it all and figure it out. So, you really need an application that's going to do this in an automated fashion, but it has to be accurate to the [inaudible]. That's where the machine learning comes in. Again, take this for a pitch maybe, but how we do this with machine learning is that we use a set of data that the customer provides to us which is representative of the content they have, and use clustering technologies to group it into clusters; then, the next step is mining the keywords from those clusters.
So, that bag of words going with a sensitive element is now one thing you use to identify it. The reason this is important is that a generic product is going to flag whatever it thinks is a sensitive element, even with some built-in abilities to avoid false positives – but there will still be false positives. So, by doing this clustering and coming up with that set, you can also have the customer look through the content that's been identified and point out which of those are actually false. And whatever's been marked as false gives you a better signature to then go back and use to avoid identifying those falsely as [inaudible].
What you do then is turn that into a training set that marks these occurrences as false, and then you train the detection tools to avoid them. What you've essentially done is taken a broad-based approach to find sensitive elements of any type you suspect may be there, and then used the customer's input based on the content – the inspection, the feedback about what is right and what is wrong, what is true and what is false – to train the tool to do the right thing in that context. So, now you have a tool that, to begin with, had a broad capability to identify sensitive elements, but has now been trained to handle data in the way that best suits your needs. We've talked enough about unstructured content. Let me just spend a minute on structured content – typically, you know, relational databases.
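A toy sketch of that feedback loop, with a crude word-based signature standing in for the clustering and keyword mining described above (a real system would use proper machine-learning models; every name here is illustrative):

```python
import re

def context_signature(text, start, end, width=20):
    """A crude signature for a detection: the words immediately
    surrounding the matched span."""
    around = text[max(0, start - width):start] + text[end:end + width]
    return tuple(sorted(re.findall(r"[a-z]+", around.lower())))

def train_false_positive_filter(samples):
    """samples: (text, start, end, is_true_hit) tuples marked by the customer."""
    return {context_signature(t, s, e) for t, s, e, ok in samples if not ok}

def keep_detection(fp_signatures, text, start, end):
    """Suppress detections whose context matches a known false positive."""
    return context_signature(text, start, end) not in fp_signatures

# The customer marks "order id 12345 shipped" as a false positive:
marked = [("order id 12345 shipped", 9, 14, False)]
fps = train_false_positive_filter(marked)
print(keep_detection(fps, "order id 67890 shipped", 9, 14))  # False (filtered)
print(keep_detection(fps, "zip is 12345 in Denver", 7, 12))  # True (kept)
```

The second detection survives because its context differs, which is exactly the behavior the talk describes: the customer's "this one is false" feedback generalizes to similar occurrences without suppressing genuinely different ones.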
As I mentioned before, most of the content within relational databases – the column content I'm talking about – tends to be homogenous. So, you can just use sampling as a mechanism, as long as you read enough rows that you don't end up processing just null values but get non-null values – because the data is king, and that's what's gonna tell you the whole story. And obviously, the column name gives you some sense of what it could be – such as address, name, and so on. But do keep in mind that we start off with the intention of using these columns for specific purposes, and over a period of time, we kind of diverge from that and end up using some of the columns for purposes other than what was intended at the beginning.
So, analyzing the data and making the call is what's important – not just looking at the column name alone. From column names, a DBA would tell you which columns contain addresses, which contain social security numbers and credit cards, and so on, and right from that schema alone would claim to tell you, with 100 percent certainty and accuracy, exactly what the column contents are – but usage drifts. And that's only one part of the problem, because just because we call relational databases a structured type of data store, it doesn't mean they don't contain any unstructured content. For instance, we work with an insurance company that maintains the details of incidents – like accidents – and somebody describing an accident could include names, driver's license numbers, and other information about the parties involved.
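The sampling idea can be sketched like this, with a hypothetical table whose column name says nothing about the sensitive content inside it – the point being that the data, not the schema, makes the call:

```python
import re
import sqlite3

# Hypothetical repurposed column: named "notes", actually holding SSNs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (notes TEXT)")
conn.executemany("INSERT INTO customers VALUES (?)",
                 [("123-45-6789",), (None,), ("987-65-4321",), (None,)])

# Sample enough non-null rows; skip nulls, as the talk advises.
rows = conn.execute(
    "SELECT notes FROM customers WHERE notes IS NOT NULL LIMIT 1000"
).fetchall()

ssn_like = re.compile(r"^\d{3}-\d{2}-\d{4}$")
sample = [r[0] for r in rows]
share = sum(bool(ssn_like.match(v)) for v in sample) / len(sample)
print(f"{share:.0%} of sampled values look like SSNs")
# 100% of sampled values look like SSNs
```

Classifying by the sampled values rather than the column name catches exactly the kind of usage drift the speaker warns about.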
So, different rows – different incident reports, all going into this common column – can include different sets of sensitive elements, and some may not contain any. So, even in structured stores, you still need the ability to detect sensitive elements in unstructured content as well. In a nutshell, that's the detection part of it. It's pretty complex, and it needs to be for it to be accurate. There's also one more requirement here.
You can make this extremely complex, but the data comes in at a particular velocity, and if detection takes its own sweet time, it may not be able to keep up with the workload – which is where the different ways of handling it come in. Obviously, the provider of the solution has to optimize this, particularly for the environment in which it's running for a particular customer, but it may also need more [inaudible] to handle more of the content – not in a single-[inaudible] fashion, but concurrently, across multiple data within that data source. That's where the technology [inaudible] comes in – multiple ways of handling data and so on. And in some instances, it's not enough for this data classification to happen after the data's been stored. For instance, Dataguise actually has some customers – I can think of two in particular – where credit card numbers were coming in and going to be stored in the data lake in a Hadoop environment.
If they allowed that to happen, then that Hadoop environment would become subject to PCI compliance requirements, and that could be an additional cost for them. So, being able to find these in-flight, as the data is being ingested, and doing the right thing – masking it or removing it or encrypting it and so on – is a way of not only detecting the data, but also, for business purposes, curating it in a way that suits their needs. So, we talked about detection – let's take a look at what that really means. Here we have in front of you a formal document, and if you were to run the detection that I'm talking about – let's say, in this case, you're interested in street addresses, names, phone numbers, e-mail IDs, and so on – the detection capability should be able to find those. Obviously, in your environment, you may have other types of data in such documents and files, and you may need to find those too. These are all part of standard products – from somebody like us or any other vendor – and anything you look at should have the extensibility to add whatever is of interest to you.
So, that was unstructured. Let's take a quick look at structured content – just looking at a relational database. If you were to run this – I've highlighted the columns that are actually sensitive. Here's an address. Here's Cardholder ID, which in this case turns out to be a social security number, and the name – you know, first and last – and so on and so forth.
So, that's detection. We also talked about not only knowing where the sensitive elements are, but also knowing who has access to them. That's where authorization and entitlement come in. You really need the ability to know who has access to the directories and files that contain elements of interest to you – which columns, in which tables, in which databases, contain sensitive elements. And also – are these elements in clear text?
Are they encrypted or not? Because for certain business use cases, you may decide to leave them in clear text. So, knowing in what form the data is present in its data stores – that's important. And of course, knowing who has access to it, and especially in what mode. Who has access to the database, to this particular table, to this particular column?
Or [Break in audio]. So, just briefly, we talked about the need for detecting sensitive elements in unstructured content, [inaudible due to poor audio] to be able to find them. Unstructured in particular is a challenge. And then, we also talked about structured content in relational databases, where the tendency is for data to be mostly homogenous.
But, to be very clear, that doesn't necessarily mean that all of it is structured in this fashion – there can be unstructured content there too. I talked about an insurance company maintaining incident information in relational tables, where there could be people's names and phone numbers and so on. So, going ahead from there – let's look at unstructured content. You're seeing a formal letter, and you're interested in finding the sensitive elements in it – the names and addresses and so on and so forth.
The detection capability finds those. And I mentioned before that one of the challenges with addresses is that they can be written across multiple lines. In this example, there is additional content on the right which would throw off a typical computer application, but having the ability to find these kinds of elements in whatever form they're expressed is an important requirement for a detection capability. In the relational database, again, I'm just showing you examples of homogenous content – first names, last names, social security numbers, full addresses, and so on and so forth. In the interest of time, I'm going through these a little faster than I was previously, but –
Erfan Ibrahim: Venkat, can you put it in full-screen mode, please? Thank you.
Venkat Subramanian: Oh, sorry. I should have done that. Thank you for pointing that out. So, that's the detection part of it. Now that you know where the sensitive elements are and with what density, the next thing is you want to know who is authorized to access them.
For these directories and files, or tables and columns, you want to know if the content is in clear text or encrypted or masked – encryption and masking being protection mechanisms that I'll talk about in just a minute. You want to know in what form the data is present, and also who has access to this content. Typically, in UNIX file systems, for instance, it's the POSIX-level file system permissions – owner, group, and other – and so forth. That may not be granular enough, so platforms are now adding access control lists on top of that to provide even more granular information as to who has access in what mode.
Knowing that level of detail opens up the possibility of aligning the right users to the right content more effectively. So, that's entitlement and authorization. The third one is that entitlement and authorization align the right people with the right data, but that still constrains how widely you can share the content. Because everything in these files and databases is in clear text, providing access to a particular file gives the user access to the entire [inaudible]. That means, conversely, you can only provide a user access to files whose entirety that user may see; otherwise, you cannot.
So, that constrains you, and it probably doesn't serve the business use cases you're putting the data to – which is where protection helps. With protection, you can do masking as one option. Masking is usually a one-way scheme where you cannot go back to the original content, which is one of its selling points. Giving somebody access to masked data does not reveal the original content at all.
And there are a couple of typical use cases for masking. Test/dev is one, where you have a new version of an application and you want to test it. What better way to test it than with production data, which includes all the idiosyncrasies and nuances of that environment? Synthetic data typically cannot provide the same kind of content to exercise your application. The second use case is that you can also monetize the data. There are customers of ours who do that – a medical insurance company provides information to pharmaceuticals about what region of the country is purchasing what kind of medication and so on. Obviously, the names, addresses, and anything else that would point to individuals are all stripped from the content through masking, but the region of the individual need not be – you can mask without losing the region, even though the address is no longer the real one.
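A minimal sketch of consistent, one-way masking, which is what makes masked data usable for test/dev: the same original always maps to the same replacement, but the mapping cannot be reversed without the secret salt. The replacement pool and salt here are made up; real products generate far richer like content:

```python
import hashlib

# Illustrative replacement pool (an assumption, not a real product's data).
REPLACEMENT_NAMES = ["Alex Morgan", "Jamie Lee", "Sam Carter", "Robin Hayes"]

def mask_name(original, salt=b"per-deployment-secret"):
    """Deterministically pick a replacement by hashing the original,
    so every occurrence of a name masks the same way."""
    digest = hashlib.sha256(salt + original.encode()).digest()
    return REPLACEMENT_NAMES[digest[0] % len(REPLACEMENT_NAMES)]

print(mask_name("Jane Doe") == mask_name("Jane Doe"))  # True
print(mask_name("Jane Doe"))  # same replacement on every run
```

Determinism preserves referential integrity across files and tables (joins still work), while the salted hash keeps the masking one-way.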
So, that's a second use case. The third one – and this is becoming more popular – is that for many [inaudible] types of workloads, you don't necessarily need access to the sensitive elements, as long as the content fully represents – statistically and otherwise – the underlying original content. The reason these are important is that as the need to provide access to more and more people comes up, you should have multiple options, and masking is an extremely safe option wherever it's appropriate.
There's a credit card company that uses our product while developing, honing, and refining their detection capability. Rather than have an army of people doing that work on real data, they do it on realistic data that fully represents the real data underneath. So, until it goes into production, they're working on what is effectively production-grade test data, but without the exposure risk. As an extension of masking, there is also redaction. Some columns may simply not be necessary in the environment in which the data's shared.
In structured data, that means you can nullify them. In unstructured content, though, nullification is not an option, because nullified content doesn't tell you whether content existed there and was removed, or whether content never existed there at all. Replacing it with Xs, for instance, tells the story better. On to encryption. Encryption, through role-based access control, gives you the ability to control access to the real content, and it can essentially be done in three different ways. We already talked about the file [inaudible] level – which platforms are actually doing – but we also talked about how that can be an all-or-nothing proposition, which Erfan talked about as well.
Once the application or the user opens the file, the entirety of the file is visible to the user. By being more granular and encrypting at the element level, the viewer can see the rest of the content – the parts that are not that important – while the important elements stay protected, and the user may not get to see any of them if they're not authorized. But there are also use cases where element level may be a little too granular – some customers of ours believe that the sensitive elements in a particular record make the entire record sensitive, not just the elements alone. So, that option gives you control over the level at which you encrypt, and at which you provide access.
As far as encryption capabilities go, Erfan also mentioned another company that presented an encryption mechanism. Not quite a pitch for a product, but I have to mention that our product comes with standard AES encryption capabilities, and it can integrate with any other – call it [inaudible], call it Agile PQ, call it Voltage – we can work with pretty much any of them. That level of flexibility is important, because large enterprises tend to standardize on different technologies, so any new solution that comes in should have the flexibility to fit into the environment. As the saying goes, you don't want to cut your feet to fit the shoe – the solution should work in your environment without you having to make changes to it. Now, on the protection capabilities, I'm showing you exactly the same file that we started with on the unstructured side – sorry; we're just starting with masking.
Sorry. In this case, you're replacing content with like content. And I also talked about doing this in a consistent fashion – [inaudible] being replaced with "College Door," and every occurrence of [inaudible] going forward being replaced with "College Door" to maintain that consistency. On to encryption. Again, this is AES encryption, so it looks a little like gobbledygook.
But you can see that all of the sensitive elements that were detected have been encrypted. It could be AES; it could be any other [inaudible] library of your choice. Now, in the database use case, I'm looking at masked content. We saw those entire rows of first names, last names, social security numbers, and so on, and here you see that they have been replaced with like content – names with names, social security numbers with valid but different social security numbers, and addresses with addresses. Structured content can also be encrypted in certain technologies.
In this case, I'm gonna show you Hadoop, where a lot of users moving from relational database environments tend to arrive with the expectation of using their existing applications and SQL to access the content. Hive is one mechanism – and there are others – that provides such access. So, here is a use case where the sensitive elements have been encrypted in the structured format and loaded into a Hive table in Hadoop, and different users are accessing the same content. Here is an example of a user with no access privileges to see social security numbers, names, and so on – that query returned the encrypted content as the result. The exact same query run by a user with access privileges actually gets to see the real content behind it.
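The role-based reveal behavior just described can be sketched in miniature. Base64 stands in for real AES purely so the example runs with the standard library – it is not encryption – and the role name is an illustrative assumption:

```python
import base64

# Hypothetical privileged role; in a real system this comes from RBAC policy.
PRIVILEGED = {"analyst_pii"}

# The value at rest is stored in protected form (base64 as an AES stand-in).
stored = base64.b64encode(b"123-45-6789").decode()

def query_ssn(user_role):
    """Privileged users see the clear value; everyone else sees the
    protected form as-is, like the Hive queries in the talk."""
    if user_role in PRIVILEGED:
        return base64.b64decode(stored).decode()
    return stored

print(query_ssn("analyst_pii"))  # 123-45-6789
print(query_ssn("reporting"))    # MTIzLTQ1LTY3ODk=
```

As the speaker notes next, the protected view is not necessarily useless: the unprivileged result can still be counted, joined, or grouped on.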
And this is not to say that the encrypted view is unusable – that level of access may be sufficient for some use cases. So, you balance users' need to access the content against protecting the data, in a much more fine-grained fashion, and provide access at the right level to the right users. So, we've talked about encryption and masking.
I just want to make sure it's not looked at as an either/or proposition. On the left-hand side, you see a [inaudible] use case; on the right-hand side are the elements typically referred to as sensitive and privacy related. This one is about smart metering, where you're essentially using the data for core purposes like projecting how much electricity users in particular areas are consuming. You really don't need the names, addresses, and phone numbers – anything that reveals personally identifiable information – but you may say that the address alone is needed for this purpose, and decide, "I'm gonna encrypt the address so it's not visible, and mask the rest so I don't have to apply access privileges, because for this use case they're not needed." So, what I'm saying is – a judicious combination of masking and encryption may be the best option for a given use case.
Okay. Here is another example – say, clinical trials. In this case, somebody has determined that just the name and the medical test need to be encrypted; the rest can all be masked. Keep in mind that masking does not mean blindly slapping a new value on top of another value. For clinical trial purposes, the age of the individual and so on is important, so even though you're masking the date of birth, you may still want to keep the age of the individual essentially intact.
So, it could be some random date within, say, 90 days on either side of the birth date – whatever is meaningful for that specific use case. So, we've now talked about knowing your data, knowing who has access, and protecting and sharing the data in a controlled fashion. The next thing is that you want to monitor who is accessing the content, who's making changes, and so on and so forth – and you probably want to do this for a couple of reasons. One is – you really do want to know, because these are [inaudible].
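The date-of-birth masking just described might be sketched as follows – the ±90-day window comes straight from the example, and everything else is illustrative:

```python
import random
from datetime import date, timedelta

def mask_birth_date(dob, max_shift_days=90, seed=None):
    """Shift a date of birth by a random amount within +/- max_shift_days:
    the exact date is hidden, but the subject's age stays essentially
    intact, as the clinical-trial example requires."""
    rng = random.Random(seed)
    shift = rng.randint(-max_shift_days, max_shift_days)
    return dob + timedelta(days=shift)

original = date(1980, 6, 15)
masked = mask_birth_date(original, seed=42)
print(abs((masked - original).days) <= 90)  # True
```

The bounded shift is the key design choice: it keeps age-based analysis valid while breaking the link to the individual's real birth date.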
The second one is – you certainly want to catch any kind of breach operation as it gets started and pick the individuals out. So, that's a possibility. This is one of the things we have worked on, so let me explain how it works. The data classification tells us exactly where the sensitive elements are and with what density. So, as an enterprise, you can define your policies around that – because typical monitoring capabilities light up like a Christmas tree; they report on all and sundry, and then it's your job to figure out which alerts are actionable and which are not.
That, in and of itself, is a huge amount of work. But by using the classification information in a more targeted fashion to define these alert policies, you get alerts that are actionable by definition. And you can set alerts based on combinations of data types, density, and also the users and groups coming in, and so on and so forth. As a product company, I would say the first iteration does exactly what I just talked about, but one of the things we've worked on – and it's imminent for release – is the ability to build profiles of users accessing the content over a period of time, so that we know from what IP ranges, at what time of day, what kind of content, at what volume, and in what mode they access it. We build that up over a period of time and then use that information to trigger [inaudible] anomalies.
The anomalies can be figured out using the user's own history to see if there's any deviation there. But there are also classes of users – groups that have very similar responsibilities and very similar profiles of data access – so comparing the user's behavior to the group that the user belongs to can also surface [inaudible] anomalies. The third one is that it's not just the group the user belongs to – across the enterprise, simply in the manner in which individuals do their work, there may also be some commonality, and so there's a way of clustering those to say that these users seem to have a similar profile when it comes to data access. Using all three mechanisms helps you find any anomalous situation. So, in a way – there's a point that Erfan was making at the top of the discussion – people may have, you know, jumped through all the hoops, gotten into the data, in this case maybe even gotten through the decryption part, and gained access to the real data in clear text and so on and so forth, but then any kind of anomalous behavior is gonna make that individual stick out like a sore thumb. So, a breach that may be just in the offing could actually be nipped right there in the bud. And there are two sides to the breach.
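The per-user baseline idea above can be illustrated with a toy sketch. Everything here is an assumption for illustration – the class name, using bytes-per-session as the only profiled signal, and the three-sigma threshold; the product described builds much richer profiles (IP ranges, time of day, access mode, peer groups):

```python
from collections import defaultdict
from statistics import mean, pstdev

class AccessProfiler:
    """Learn each user's typical access volume and flag large deviations."""

    def __init__(self, threshold_sigmas: float = 3.0):
        self.history = defaultdict(list)   # user -> bytes accessed per session
        self.threshold = threshold_sigmas

    def record(self, user: str, bytes_accessed: int) -> None:
        self.history[user].append(bytes_accessed)

    def is_anomalous(self, user: str, bytes_accessed: int) -> bool:
        baseline = self.history[user]
        if len(baseline) < 5:              # too little history to judge
            return False
        mu, sigma = mean(baseline), pstdev(baseline)
        return bytes_accessed > mu + self.threshold * max(sigma, 1.0)

profiler = AccessProfiler()
for _ in range(10):
    profiler.record("alice", 1_000)        # alice normally reads ~1 KB
assert profiler.is_anomalous("alice", 1_000) is False
assert profiler.is_anomalous("alice", 5_000_000) is True   # sudden bulk read
```

The peer-group and enterprise-wide clustering comparisons mentioned above extend the same idea: pool the histories of users with similar roles and compare against that pooled baseline.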
In the enterprises that I talk to, for instance, it's not about if a breach happens. The companies also want to be prepared for when a breach happens – how do they react. As you would observe with all of the news that's been around about the different breaches that have happened, it often takes companies months to figure out that a breach has happened. But once they've figured it out, it takes them more weeks and months to figure out who has been affected, because that was not something that they knew about. So, if you are actually able to know your [inaudible] and turn on monitoring as to who's accessing, in what time frame, from what IP, and so on and so forth, you're able to find anomalies and act fast.
And with the IP and the time and so on and so forth, you may even be able to pinpoint the individual. It could even be a bona fide user right within your company doing this, which is much harder. A hacker coming in from the outside tends to stand out a bit more, like a sore thumb, but it's actually much harder when it's somebody from within your ranks. So, that's one side of the breach. On the other side of the breach, since you knew which of the content was accessed and which of those had sensitive elements and so on and so forth, you're able to better determine the damage, better determine who may have been affected, and take corrective action.
So, breach detection need not take months, and certainly, after the breach has been detected, the actions that have to be taken also do not have to take months. Ideally, this starts with using the information from the [inaudible] that you've done, and then you can do this across multiple repositories – databases, Hadoop systems, and so on and so forth. This particular slide is more indicative of exactly where we are with it, and we also have cloud support – Google cloud storage and Azure and so on and so forth – but obviously, it's not showing all of the rest of the list – relational databases and others. Those are actually on the roadmap. And we already talked about using the profile to determine anomalous behavior. Quickly, in a nutshell, this is kind of how it works.
The detection picks up all the metadata about the content, and then, based on the policies that have been set, the part of the metadata that pertains to them gets into the monitoring with the data manager. And so, as user access happens, using complex event processing, we're able to compare with the known profile of the user, figure out if there's any anomalous situation, and alert on that. One thing I should also mention is – why stop with alerts? There could be thresholds beyond which you may want to take immediate action as well, so that's also in the works: if there's a very strong suspicion that this could be a breach in the making, then it may be better to cut off access and then do the auditing and figure out exactly what you want to do. So, that's on the monitoring.
So, the goal here is – in the smart grid situation – you have data that's collected and stored within an entity, and data that is shared between entities. We talked about consistency in the approach and the posture from a security standpoint, because we also talked about the fact that the security of the whole system is only as strong as the weakest link in it. So, having a common and consistent approach matters – and it's not just about training individuals; it really should be a process and tools that guide them along. And finally, it's all about the data.
It's all about knowing where the data is and who has access, but better still, being able to monitor – and the monitoring also tells you who has made the changes and so on and so forth. You [inaudible] audit that as well. So, in the interest of time, I'm gonna finish this in just a minute. Obviously, whatever we talked about, you should apply across multiple data stores. And there's a lot of movement from on-premises IT infrastructure into the cloud.
And again, those technologies need to be covered by any solution that you're looking at. I'm assuming that some of you are already thinking about the public cloud in your roadmap. So, I did have an example, but I'm going to stop here, Erfan, so there's time for questions and answers.
Erfan Ibrahim: Yes. Thank you very much, Venkat, and once again, my apologies for an act of nature that disrupted our electricity in Mountain View. And the irony is that the laptop had a battery and I was on a cellphone and it had a battery – the only thing that didn't have a battery was the wireless access point, and that disconnected us. But, whatever – maybe I should invest in a UPS system for the apartment.
Okay. So, I would like – for those of you who are still online – to post a couple of questions. We have Yosir Geriah who asks – and we'll just go for about 10 minutes, because I want to be cognizant of your schedule – what are the tools to audit this data?
Venkat Subramanian: So, the word "audit" has several different meanings in this context. If by audit we mean tracking who is doing what to the data, that's one of the things that you should be able to do. But, especially given the amount of data that you are collecting – and it is ever increasing with IoT and so on – if it is purely looking across the entirety of the data access, it's going to be impractical, which is why identifying your crown jewels, and then making sure that you're aware of who's accessing that content, is an important part of it. By being able to set those alerts on who's accessing what – even when it's authorized use; of course, if there's unauthorized use, or more than one attempt is made, obviously that's going to raise an alert by itself and action needs to be taken – simply having the alerts turned on means you can see exactly who's accessing and who's made changes and so on and so forth.
That will help you figure out, if ever there's a disruption in the system that could be caused by user activity, who may have actually caused the disruption. It could just be inadvertent, but just knowing that actually helps.
Erfan Ibrahim: That's great. Any other questions online? So, while people post questions, let me ask you, Venkat – what types of other technology companies are you approaching so that your technology can be integrated with other capabilities to help verticals like health care, telecom, financial services, and even the energy sector address their data security challenges? What are [Break in audio] natural partners?
Venkat Subramanian: So, first of all, the platform vendors are natural partners. When you talk about Hadoop – we are in partnership with Hortonworks and Cloudera and MapR and so on and so forth. The platform [inaudible] provide the environment for people to store and retrieve data, but it's not their core competency to do the level of data-centric security that we do. I mention that first because, unlike relational databases – which have matured, been around a long time, and have really stringent security built into the product, and where entry and exit for retrieval is just through the one door, which makes that possible – these newer technologies came from the business requirements side and didn't think much about security beyond that.
So, that's one area where we work very, very closely with them. The second one is, once you have done the encryption, for instance, the requirement should not be that I will decrypt it and then point my application at it – whatever analytical tool you're using. And there are also data prep tools that do the groundwork for users to set up the environment for [inaudible] – their queries and so on and so forth. So, we partner with those to make sure the content is decrypted in real time – because I didn't get to talk about how many different ways encryption can be done, and this is one which is much more natural as far as the use itself. The third one is, we're also partnering with some customers.
I talked about monitoring. We work with a very large e-commerce site that is actually interested in doing this, so rather than build this just because it sounds like a good thing to do from a [inaudible] perspective, we actually built it to suit the customer's needs. There are a couple of other customers as well – another customer is actually moving to the cloud, and we're working with them to enable that to happen in a fashion that suits their needs. And here's the thing – everybody feels comfortable with the data residing within the four walls of your own IT infrastructure. Moving it to the public cloud seems like something that's alien, and also seems unsafe.
First of all – one comment here – public clouds are far more secure as a platform than the IT infrastructure that customers tend to have. But the visceral reaction one has is the other way around. So, we are working with Amazon and Google and Azure, and these are not just the kind of partnerships any small independent [inaudible] would have – we are actually building a few things to be part of the ecosystem. For instance, on [Inaudible], once you have done the classification, they spin up a cluster and move the content in there for analytical purposes.
We're building a data broker so that as a user – say, Erfan – comes in and spins up a cluster and pulls data, rather than blindly move all the data and then restrict access, the data broker will only move the content that Erfan can see, so that we don't waste resources moving data back and forth. So, those are just examples.
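The broker pattern – filter to what the user is entitled to before moving anything, rather than copying everything and restricting access afterward – might look roughly like this. The access-policy table, field names, and user names here are invented for the sketch, not part of any real product:

```python
# Assumed, simplified policy: which fields each user may read.
ACL = {
    "erfan":   {"meter_id", "reading_kwh", "timestamp"},
    "analyst": {"meter_id", "reading_kwh", "timestamp", "customer_name"},
}

def broker_copy(records: list, user: str) -> list:
    """Return only the fields the user is entitled to see."""
    allowed = ACL.get(user, set())
    return [{k: v for k, v in rec.items() if k in allowed} for rec in records]

rows = [{"meter_id": 7, "reading_kwh": 3.2,
         "timestamp": "2016-02-01T00:00", "customer_name": "A. Smith"}]
filtered = broker_copy(rows, "erfan")
assert "customer_name" not in filtered[0]   # sensitive field never moved
assert filtered[0]["meter_id"] == 7
```

Besides enforcing access, this avoids the cost of moving data into the cluster only to lock it down afterward.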
Erfan Ibrahim: Okay. The next question – I think I can answer this. "What are the challenges in smart grid versus traditional IT cloud?" This was asked by Yosir Geriah of [Inaudible]. The challenge is that in smart grid, you have a geographically distributed infrastructure, and increasingly, we are getting data being produced further and further out in the field, usually in unmanned facilities or just out in the open, with free access for people to reach the electronics. This is very different from the traditional IT cloud model, where the data assets are kept in data centers with physical security and limited access to those assets.
So, from a cyber security perspective and a cyber-physical systems perspective, the layout is very different. Additionally, the IT cloud usually operates on very high-speed networks. You have gigabit Ethernet, and you have the concept of an enterprise service bus at the application layer, where data can be exchanged at very high speeds back and forth between systems. And, in the case of the cloud, you have virtualization, so you have multiple machines and the ability to use processing capability across multiple machines to support your application – versus the smart grid, where a lot of processing is occurring on physical devices. Many times, the networking is mesh networking without very high-speed connectivity, so the layout is very different.
Now, in securing the data, you still use the best practices that Venkat has described – everything from perimeter security to the granular controls that you put on the data itself. But understanding the use cases and understanding the layout of smart grid and IT [inaudible] will help you understand what the threats are. The threats are very different for the two. So, Venkat, do you want to add anything to that?
Venkat Subramanian: Oh, no. You're pretty thorough with it. It's fine.
Erfan Ibrahim: Okay. Next, Ravi Kumar Ageli asks, "Can you give some examples of what types of complex events are involved in complex event processing (CEP)?"
Venkat Subramanian: Yeah. So, a user accessing a particular file – I'm gonna take a simple example – in and of itself may not be a problem, and accessing five of those may not, in itself, be a problem. But the mode in which they do it, the volume of data that's being moved, the place that they're coming from, and so on and so forth – it's the sequence of these events taken together that indicates much more than looking at each one in isolation. 'Cause in isolation, each one of those may seem like an innocuous action that the user's profile may even match, but it's that combination that actually indicates there's an anomaly. That's why I talked about complex event processing.
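A toy version of such a rule shows how a sequence of individually innocuous events combines into an alert. The class, the thresholds, and the "unusual IP" signal are illustrative assumptions, not a description of any specific product:

```python
from collections import deque

class FileAccessCorrelator:
    """Toy CEP rule: one file read is fine; several reads from an unusual
    IP within a short window together constitute an anomaly."""

    def __init__(self, usual_ips, max_reads: int = 5, window_seconds: float = 60.0):
        self.usual_ips = set(usual_ips)
        self.max_reads = max_reads
        self.window = window_seconds
        self.events = deque()              # (timestamp, source_ip)

    def observe(self, timestamp: float, ip: str) -> bool:
        """Record one read event; return True if an alert should fire."""
        self.events.append((timestamp, ip))
        # Drop events that have aged out of the correlation window.
        while self.events and timestamp - self.events[0][0] > self.window:
            self.events.popleft()
        unusual = [e for e in self.events if e[1] not in self.usual_ips]
        return len(unusual) > self.max_reads

cep = FileAccessCorrelator(usual_ips={"10.0.0.1"}, max_reads=3)
assert cep.observe(0, "8.8.8.8") is False   # each event alone is innocuous
assert cep.observe(1, "8.8.8.8") is False
assert cep.observe(2, "8.8.8.8") is False
assert cep.observe(3, "8.8.8.8") is True    # the combination trips the rule
```

Real engines correlate many more dimensions – volume, mode of access, time of day – but the principle is the same: the pattern across events carries the signal, not any single event.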
Erfan Ibrahim: Very good. So, at this time, we don't have any more questions, so what I'd like to do is ask – Venkat, do you have any final comments that you would like to share?
Venkat Subramanian: The only comment is – obviously, I'm not from this sector, but I did the best I could to learn as much as I could, though certainly it's an extremely small percentage of what needs to be known. So, to the extent possible, I tried to map it to exactly where this matches your environment, and if there are any specific examples or specific environments you have questions about – how things could be handled – I would be more than happy to follow up. My e-mail address is Venkat@Dataguise.com. More than happy to take any questions and respond.
Erfan Ibrahim: Yes. Do you have a slide with your e-mail address on it that you could go to, so they could see it? Is it on the deck?
Venkat Subramanian: Let me do this.
Erfan Ibrahim: No problem.
Venkat Subramanian: No, I'm just doing that. One second.
Erfan Ibrahim: Okay. Very good. So, I wanted to share some final comments about this before I talk about the next webinar. Venkat has shown you that data is not a monolith and the value of data varies depending on which data you have. And you have to have tools that give you granular control of the data and who accesses it, and then, have the ability to monitor who is accessing it.
So, all of that is very important. The more traditional techniques of just block encryption or just selective column encryption do add value, but in the smart grid, with so much unstructured data coming in, it's important to have this capability. So, Venkat, I really appreciate you pushing the envelope on how to secure data, and this will be very valuable in smart grid because we are also moving from a highly centralized data architecture to a highly distributed architecture. Yet the best practices for securing data don't change. We still have to apply the same due diligence, whether we have data in one central repository or all over the place. You still have to have perimeter security.
You still have to have granular control of the data, and also monitoring. So, thank you very much, Venkat, for your presentation today. Our next presentation is going to be on March the 3rd, when we're going to have Singularity Networks presenting, and they're gonna talk about a platform they have developed which really streamlines communications and access to resources. So, I'm really looking forward to their presentation on March the 3rd. I was able to preserve the recording of the first part of this presentation, and now we have the second part, so you're gonna actually get two links.
And you'll be able to see this entire presentation, and you will also receive the slides in PDF format as soon as Venkat sends them to me. Venkat, please make sure that the version you're sending has your e-mail address on it. And, with that, I will now – oh, we do have another question here. Okay. "May I ask a question? There may be different formats of data generated by the different vendor-dependent applications. So, how can you achieve seamless interoperation across the applications in smart grid?"
I can answer that question. The first part that Venkat talked about was the normalization of data, and that's critical, because the object models are so different across the different protocols in use that you can't really make much sense of the data unless you normalize it and bring it into a format where tools like Dataguise can work on it. So, Venkat, do you want to add some more?
Venkat Subramanian: No, no. If the change is really not happening at the source, you're probably gonna need some kind of a gateway that could make that happen, so that the two entities communicating each see the data in the form that's native to their environment. But in the long run, it might actually make sense to have more common ways of expressing it. The second point is that even if these are set up differently, the product should be able to find the variety of ways in which the data is expressed – to find them and map them to the particular types they all belong to. That's part of any kind of detection capability, and it's also a way you can improve the quality of the data.
Because somebody has to identify the various forms, and if you're using a tool to do that for you after the fact, then that has to be done by the tool in terms of finding them all and mapping them to the one. From a Dataguise perspective, we don't do anything with data quality management, but we certainly can be a part of that process, because we can identify the common elements in their various forms, and maybe there are tools that exist that can take care of modifying them to the particular type and improving the quality.
Erfan Ibrahim: All right. Well, thank you very much. We've gone 17 minutes over our time, but because of the disruption, I think that's understandable. So, Venkat, thank you very much. At this time, I'm gonna end the recording and then the webinar.
I appreciate everyone's patience and your interest in the subject. Have a good one. Bye-bye.