Lately we are being asked more and more about micro-segmentation driven by predictive analytics, especially when we look at network-level data for our customers. The promise of real-time changes, based on patterns surfaced by predictive analytics algorithms, is just too attractive for the CIO or CISO to ignore.
At the root of the request is the promise of AI: solving massively complex problems that would normally take an army of engineers. What organizations seek is to put their investment in data to work, defining the usage patterns that depict typical user and system communications across the infrastructure, as a path to better security.
For the IT department the concept of segmentation is an effective and simple one: create a walled-off environment that grants access only to the users whose job functions require the systems where the data they need is housed. Simply create an Access Control List (ACL). Easy stuff, right? I can hear every network administrator out there right now saying, “this is kindergarten-level IT, man.” Ok, that’s fair, and in an organization with 50 or 100 users it’s easy enough to do. But what happens when you have 1M+ devices, 30K+ servers, 500,000 users, and 1,000 applications that were created to support the business, by business owners who assumed internet-level communications were always there and wide open? Oh, and the person needing the data may change day after day. It gets complicated and almost unmanageable, right? Scale is the great humbling equalizer in IT.
Enough about the craziness we see every day around who needs access to data and applications and who doesn’t. If creating a micro-segmentation strategy seems insurmountable and just too unwieldy to be manually administered, then let’s apply some data science using AI concepts and see what can be accomplished in the short term; let the data do the work for you.
Start by gathering pipelines of data from the various sources that are doing active and passive discovery of devices on your network. Make sure each source is tapping into the key information available to it. Pour that data into a single source of truth for your exercise. There are two key data-warehousing approaches, the Inmon and Kimball methodologies; use whichever best suits your environment and resources. A popular method we have used lately is streaming continuously with Apache Kafka into Azure or AWS. Weigh the data you choose against your budget, network performance (in the case of queries to your gear), other sources that may already be gathering the same information (try not to duplicate effort), and your initial business goal. If you try to grab all the possible data the first time around, you may never finish what you set out to accomplish, either because the costs become prohibitive or you don’t have the pipes to support it. Your goal is simply to get your data in one place so you can feed the engines of analysis.
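As a concrete sketch of the streaming step, here is roughly what publishing discovery records to a Kafka topic can look like in Python. The record fields, the topic name, and the use of the `kafka-python` client are all assumptions for illustration; adapt them to whatever your scanners actually emit.

```python
import json
import time

# Hypothetical discovery record shape -- the field names here are
# assumptions, not a fixed schema.
def to_kafka_message(record: dict) -> tuple:
    """Serialize a discovery record for a Kafka topic.

    Keyed by device MAC so every observation of a given device lands in
    the same partition and stays ordered.
    """
    key = record["mac"].encode("utf-8")
    payload = dict(record, observed_at=record.get("observed_at", time.time()))
    return key, json.dumps(payload, sort_keys=True).encode("utf-8")

# Sending with the kafka-python client would look roughly like this
# (requires a running broker, so it is commented out here):
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="broker:9092")
# key, value = to_kafka_message({"mac": "aa:bb:cc:dd:ee:ff", "source": "nmap"})
# producer.send("device-discovery", key=key, value=value)
```

Keying messages by a device identifier keeps per-device ordering intact, which matters once you start reasoning about how a device changes over time.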
You’ll want to know what kind of device each “thing” is, who’s using it, and what’s running on it (operating system, software, version numbers, location, etc.). Create a single source of truth containing not just what’s on the network today, but how it’s changing over time. Think big data. The linchpin of this data is going to be how you uniquely identify each device. The different sources doing discovery will each bring a piece of the puzzle and you need to put it all together to get an accurate picture.
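The correlation step above can be sketched in a few lines: each discovery source contributes partial facts about a device, and you join them on a normalized unique identifier. Using the MAC address as that linchpin, and these particular field names, are illustrative assumptions.

```python
from collections import defaultdict

def normalize_mac(mac: str) -> str:
    """Canonicalize MAC formatting so sources agree on device identity."""
    return mac.lower().replace("-", ":")

def merge_sources(*sources):
    """Fold records from several discovery feeds into one view per device.

    Later sources fill in missing fields but do not overwrite facts an
    earlier source already supplied.
    """
    devices = defaultdict(dict)
    for feed in sources:
        for record in feed:
            mac = normalize_mac(record["mac"])
            for field, value in record.items():
                devices[mac].setdefault(field, value)
            devices[mac]["mac"] = mac  # keep the normalized form
    return dict(devices)
```

For example, a DHCP feed might know the hostname while a scanner knows the operating system; after merging, both facts hang off the same device key.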
Start with a single question, such as: where are the users who access my application? Having a single question will keep your focus on the right data. You will likely discover much more than the answer to that one question, but keeping focus is key. A visual analytics tool can also be very helpful here as you begin to explore your data. What do you notice based on time of day, day of the week, type of device, or location? The goal is to see what sticks out up front. Knowing the baselines in your data, and whether they match the key business drivers, will earn you the support you need to develop your program.
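Before reaching for a full visual analytics tool, even a quick histogram answers the time-of-day question. This sketch buckets access events by hour; the event shape (a Unix timestamp under `ts`) is an assumption for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

def accesses_by_hour(events):
    """Count access events per hour of day (0-23), using UTC."""
    hist = Counter()
    for event in events:
        ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
        hist[ts.hour] += 1
    return hist

def busiest_hour(events):
    """Return the hour with the most accesses, or None if no events."""
    hist = accesses_by_hour(events)
    return max(hist, key=hist.get) if hist else None
```

A spike at 03:00 for an application whose users all work business hours is exactly the kind of thing that should stick out up front.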
Now that you have your data in one place, feed the engines by pouring it into an AI platform. TensorFlow, for example, supports time series forecasting to identify patterns, and Keras lets you iterate on the model in rapid-prototype fashion. Interesting patterns will emerge: when users connect, for how long, and how much they consume, among other valuable information. Even those pesky applications that seem to be accessed by a random employee at a random time begin to tell a story. Is the data accessed by a user or by another system? Traffic patterns will emerge that reveal typical usage, helping you answer the big questions and build out your micro-segmentation strategy. Once you have the hang of it you can even create what-if scenarios by constructing your own datasets and inserting them into the model.
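A minimal sketch of the forecasting prep: reshape a traffic series (say, bytes per hour for one application) into supervised windows, where each input is the last `lookback` values and the target is the next one. The window length and the layer sizes in the commented Keras model are placeholder assumptions, not tuned values.

```python
import numpy as np

def make_windows(series, lookback):
    """Turn a 1-D series into (X, y) pairs for supervised forecasting.

    Each row of X holds `lookback` consecutive values; y is the value
    that immediately follows that window.
    """
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback])
    return np.asarray(X, dtype=np.float32), np.asarray(y, dtype=np.float32)

# A rapid-prototype Keras forecaster on top of these windows would look
# roughly like this (requires TensorFlow, so commented out here):
# from tensorflow import keras
# model = keras.Sequential([
#     keras.layers.Input(shape=(lookback,)),
#     keras.layers.Dense(32, activation="relu"),
#     keras.layers.Dense(1),
# ])
# model.compile(optimizer="adam", loss="mse")
# model.fit(X, y, epochs=50, verbose=0)
```

The same windowing function also supports the what-if scenarios mentioned above: hand-build a synthetic series, window it, and see what the model predicts.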
Micro-segmentation will be most effective for your business if you have the right data about what’s happening on your network before trying to implement it. Not only will the data help you figure out exactly what needs to be done, it will also show how the strategy is working for you. Letting the data work for you is a much more effective approach; it allows you to learn from history rather than being doomed to repeat it.