This guest post was contributed by Nilesh Dherange, CTO at Gurucul.
Today, organizations are under constant threat from cybercriminals looking to make money or steal sensitive data through social engineering, ransomware, brute force attacks on a network, and much more. To combat these threats, organizations and their Security Operations Center (SOC) teams rely heavily on solutions such as Security Information and Event Management (SIEM) systems for threat detection, investigation, and response (TDIR).
But as attackers work to hide their objectives using new resources and tactics, having comprehensive data sets is paramount for SOC success. This includes collecting as much data as possible from any and all architecture and data sources. This forms the foundation that enables SOC teams to better process, extract, monitor, and analyze the security-relevant data that is required to reduce manual efforts and speed response.
In this article, we'll explore some data challenges experienced by the SOC, the role data plays in making better security decisions, three key elements of data (the 3 Vs), and the difference between rules-based and trained Machine Learning. Let’s dive in.
The SOC Data Challenge
Networking and security stacks have changed dramatically over the past five years, making it even more complex to gather and analyze data. A key challenge has been the mass migration to the cloud, which has affected security tool scalability and compliance (such as GDPR), made misconfigurations more of a concern, and introduced new complexities around multi-cloud and decentralized architectures. It has also produced less reliable data, resulting in even more alerts flooding the SOC. Another challenge is that the cost of data ingestion has risen. This puts organizations in the dangerous position of having to trade off cost savings against visibility. And finally, rules-based architectures have constrained the data that security solutions accept as input, throwing away what could be valuable data.
How Data Enables Better Decision Making
The fact is, without a full complement of data, the SOC may struggle to identify whether a threat is real. Time is of the essence, and missing data is like a missing piece of a puzzle. Once a threat is confirmed, it’s up to the SOC teams to investigate it and take remediation actions. Investigation is a manual, complicated task that often requires coordination with other teams to get the necessary data. The better the data and telemetry that threat analysts can get automatically, and the more consolidated it is, the easier their job will be and the faster the investigation will go.
More context around threats accelerates the investigation process and ultimately makes the organization safer. Additionally, reporting for security compliance usually involves pulling data from a variety of solutions. Collecting all those data sources in one solution saves time and money.
Incomplete data means blind spots in the network where attackers and threats can lurk undetected. Overall, more data means better decisions and better security if (and this is a big “if”) the threat detection tools have good enough analytics to filter out false positives and present the data in a way that’s useful without being overwhelming.
The 3 Vs of Data:
When talking about data in threat detection, it can be helpful to look at the 3 Vs of data: Variety, Volume, and Velocity.
Variety – Collecting a variety of different types of data from across the network is one of the key elements that gives a threat detection system the context it needs to generate accurate alerts. The traditional rules-based approach for threat detection relies on typical data sources that include log data, packet capture data, Netflow data, and perhaps endpoint data (when it comes to XDR). These different data sources often require custom parsers that have yet to be built, so having a TDIR system that can ingest, analyze and sort them out-of-the-gate, without the need for custom parsers, is critical to maximize immediate visibility into threats and eliminate blind spots.
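To make the idea of ingesting varied sources concrete, here is a minimal sketch of normalizing two different data types into one common event shape so they can be analyzed together. The input formats, field names, and the functions `from_auth_log` and `from_netflow` are all hypothetical and chosen only for illustration; a real TDIR product would handle far more formats and would not necessarily work this way.

```python
import json
from datetime import datetime, timezone

def from_auth_log(line: str) -> dict:
    # Hypothetical log format: "<ISO timestamp> <user> <source IP> <outcome>"
    ts, user, src_ip, outcome = line.split()
    return {
        "time": datetime.fromisoformat(ts),
        "source": "auth_log",
        "entity": user,
        "src_ip": src_ip,
        "detail": outcome,
    }

def from_netflow(record: str) -> dict:
    # Hypothetical JSON flow record with a start time in epoch seconds
    r = json.loads(record)
    return {
        "time": datetime.fromtimestamp(r["start"], tz=timezone.utc),
        "source": "netflow",
        "entity": r["src"],
        "src_ip": r["src"],
        "detail": f'{r["bytes"]} bytes to {r["dst"]}',
    }

events = [
    from_auth_log("2024-01-01T10:00:05+00:00 alice 10.0.0.7 FAILED"),
    from_netflow('{"start": 1704103200, "src": "10.0.0.7", '
                 '"dst": "203.0.113.9", "bytes": 52000}'),
]

# One timeline across both sources gives the analyst shared context.
timeline = sorted(events, key=lambda e: e["time"])
for event in timeline:
    print(event["time"].isoformat(), event["source"], event["entity"], event["detail"])
```

Once every source maps to the same fields, events that share a source IP can be viewed side by side regardless of where they came from.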
Volume – In an ideal world, the more data that gets ingested, the better (and faster) the SOC can detect threats, and the more targeted the response can be. Detecting threats is like piecing together a puzzle from a box of mixed puzzle pieces in real time. The more contextual information that can be provided to a SOC analyst, the faster that puzzle can be solved. However, rules-based ML threat detection tools, like a flow chart, only accept a fixed set of predetermined inputs (i.e., data sources). This doesn’t deliver the visibility required to effectively find all existing threats, let alone new ones. As a result, SOCs are looking to leverage trained ML threat detection tools, which process large volumes of data faster (no conditions need to be met), learn from that data, and deliver true real-time information back to teams.
Unfortunately, how solutions are priced can impact this decision. For example, many SIEM products charge based on the amount of data ingested, which makes it more difficult for cost-conscious organizations to bring in the data they really need.
Velocity – The rate at which a threat detection system can ingest data also matters. Having a system that can ingest large amounts of data quickly, while also analyzing and separating what data goes where, is a huge aspect of successful TDIR. This is where trained ML has an advantage again, because of its ability to analyze and deliver real-time information without needing to meet conditions. Not having the confines associated with rules-based data ingestion allows a threat detection solution to gather all data, extract the security information needed, turn it into metadata, and analyze it for more contextual awareness of events and risks.
Rules-Based Threat Detection Vs. ML:
As I’ve touched upon, the type of analytics used in a threat detection solution makes a big difference in its ability to process data. To elaborate, over the last several years the industry has used ML as an umbrella term, oftentimes lumping rules-based analytics in with trained ML, which learns about the environment and how to better use data over time. This has created confusion about what each approach can actually do when it comes to threat detection.
Rules-based ML doesn't require a lot of inputs and is based on a prebuilt flow chart or pattern of defined inputs for defined outputs. Essentially, the security software is looking for a specific threat based on predetermined inputs. The problem with rules-based threat detection is that it doesn't adapt to variants of different threats, meaning hackers or threat actors can make slight changes in malware to bypass threat detection systems.
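The bypass problem can be illustrated with a small sketch. It assumes the simplest possible rule, an exact match against a list of known-bad file hashes; the payload bytes and the names `KNOWN_BAD_HASHES` and `rule_matches` are invented for illustration and do not represent any particular product's rule engine.

```python
import hashlib

# Hypothetical signature list: the rule only knows the exact sample it was written for.
KNOWN_BAD_HASHES = {hashlib.sha256(b"malicious-payload-v1").hexdigest()}

def rule_matches(payload: bytes) -> bool:
    # Fixed input (a file hash) mapped to a fixed output (match / no match)
    return hashlib.sha256(payload).hexdigest() in KNOWN_BAD_HASHES

original = b"malicious-payload-v1"
variant = b"malicious-payload-v2"  # a one-byte change by the attacker

print("original detected:", rule_matches(original))
print("variant detected:", rule_matches(variant))
```

The original sample is caught, but the slightly altered variant produces a different hash and passes unnoticed until someone writes a new rule for it.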
Threat detection systems that utilize trained ML, on the other hand, embrace and allow for adaptability, using behavior learned over time. The key to proper, effective threat detection with ML is the amount of data and context it’s given. By providing the system with more data, the ML can have more context and ultimately provide more effective and accurate alerts for SOC teams. This is where trained ML has the upper hand on rules-based systems and proves to be a more effective approach for SOC teams. Given the context of threat detection, this differentiation is important. Rules-based threat detection is a step-by-step process for a defined output (with specific data types), while ML takes any data and learns from it to provide context to make better decisions.
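As a rough illustration of learning behavior over time, the sketch below builds a per-entity baseline from observed values and flags large deviations from it. This is a deliberately simple statistical stand-in (a z-score against learned history), not a description of how any vendor's trained ML works; the class `BaselineDetector`, the threshold, and the sample numbers are all assumptions made for the example.

```python
from statistics import mean, stdev

class BaselineDetector:
    """Learns what is normal for each entity from the data it is given."""

    def __init__(self, threshold: float = 3.0, min_samples: int = 5):
        self.history: dict[str, list[float]] = {}
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, entity: str, value: float) -> None:
        # More observations means a better-informed baseline.
        self.history.setdefault(entity, []).append(value)

    def is_anomalous(self, entity: str, value: float) -> bool:
        samples = self.history.get(entity, [])
        if len(samples) < self.min_samples:
            return False  # not enough context yet to judge
        mu, sigma = mean(samples), stdev(samples)
        if sigma == 0:
            return value != mu
        return abs(value - mu) / sigma > self.threshold

detector = BaselineDetector()
# Hypothetical daily upload volumes (MB) for one user
for mb in [100, 110, 95, 105, 98, 102]:
    detector.observe("alice", mb)

print("typical day flagged:", detector.is_anomalous("alice", 104))
print("unusual day flagged:", detector.is_anomalous("alice", 900))
```

No one had to write a rule saying "900 MB is suspicious for alice"; the threshold falls out of her own history, and it shifts as more data is observed.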
Data plays a pivotal role in threat detection today. Without it, organizations and their SOCs lack the visibility needed to identify existing and new threats. As we’ve reviewed, the ability to ingest data often comes down to having the right solution, prioritizing visibility over cost, and using the right analytics and ML models to process that data for teams. However, complete visibility still has its challenges as infrastructure and architectures change. But having a more complete understanding of the issues, options, and possibilities can help teams to better evaluate how they approach their security posture and the security tools they add to their stack.