Using an internally developed machine-learning model trained on log data, the information security team for a French bank found it could detect three new types of data exfiltration that rules-based security appliances did not catch.
Carole Boijaud, a cybersecurity engineer with Credit Agricole Group Infrastructure Platform (CA-GIP), will take the stage at next week’s Black Hat Europe 2022 conference to detail the research into the technique, in a session entitled, “Thresholds Are for Old Threats: Demystifying AI and Machine Learning to Enhance SOC Detection.” The team took daily summary data from log files, extracted interesting features from the data, and used that to find anomalies in the bank’s Web traffic.
The research focused on how to better detect data exfiltration by attackers, and resulted in the identification of attacks that the company's previous system failed to detect, she says.
“We implemented our own simulation of threats, of what we wanted to see, so we were able to see what we could identify in our own traffic,” she says. “When we didn’t detect [a specific threat], we tried to figure out what is different, and we tried to understand what was going on.”
As machine learning has become a buzzword in the cybersecurity industry, some companies and academic researchers are still making headway in experimenting with their own data to find threats that might otherwise hide in the noise. Microsoft, for example, used data collected from the telemetry of 400,000 customers to identify specific attack groups and, using those classifications, predict future actions of the attackers. Other firms are using machine-learning techniques, such as genetic algorithms, to help detect accounts on cloud computing platforms that have too many permissions.
There are a variety of benefits to analyzing your own data with a homegrown system, says Boijaud. Security operations centers (SOCs) gain a better understanding of their network traffic and user activity, and security analysts can gain more insight into the threats attacking their systems. While Credit Agricole has its own platform group to manage infrastructure, handle security, and conduct research, even smaller enterprises can benefit from applying machine learning and data analysis, Boijaud says.
“Developing your own model is not that expensive and I’m convinced that everyone can do it,” she says. “If you have access to the data, and you have people who know the logs, they can create their own pipeline, at least in the beginning.”
Finding the Right Data Points to Monitor
The cybersecurity engineering team used a data-analysis technique known as clustering to identify the most important features to track in their analysis. Among the features deemed most significant were the popularity of domains, the number of times systems reached out to specific domains, and whether the request used an IP address or a standard domain name.
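The daily, per-machine feature summary described above might look something like the following sketch. The field names and features here are hypothetical illustrations; CA-GIP has not published its actual pipeline or feature set.

```python
import pandas as pd

# Hypothetical daily proxy-log records: one row per outbound Web request.
logs = pd.DataFrame({
    "host":   ["ws-01", "ws-01", "ws-02", "ws-02", "ws-02"],
    "domain": ["example.com", "10.0.0.5", "example.com",
               "cdn.example.net", "10.0.0.5"],
})

# Flag requests made to a raw IP address rather than a DNS name.
logs["is_ip"] = logs["domain"].str.fullmatch(r"\d{1,3}(\.\d{1,3}){3}")

# "Popularity" of a domain = how many distinct hosts contact it.
popularity = logs.groupby("domain")["host"].nunique().rename("domain_popularity")
logs = logs.join(popularity, on="domain")

# One feature vector per machine per day, covering the kinds of signals
# mentioned in the talk: request counts, domain popularity, IP-vs-name usage.
features = logs.groupby("host").agg(
    n_requests=("domain", "size"),
    n_distinct_domains=("domain", "nunique"),
    ip_request_ratio=("is_ip", "mean"),
    min_domain_popularity=("domain_popularity", "min"),
)
print(features)
```

A table like `features` is what a clustering or anomaly-detection step would then consume, one row per monitored machine.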
“Based on the representation of the data and the fact that we have been monitoring the daily behavior of the machines, we have been able to identify those features,” says Boijaud. “Machine learning is about mathematics and models, but one of the important facts is how you choose to represent the data and that requires understanding the data and that means we need people, like cybersecurity engineers, who understand this field.”
After selecting the features that are most significant in classifications, the team used a technique known as “isolation forest” to find the outliers in the data. The isolation forest algorithm organizes data into several logical trees based on their values, and then analyzes the trees to determine the characteristics of outliers. The approach scales easily to handle a large number of features and is relatively light, processing-wise.
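A minimal sketch of that step, using scikit-learn's `IsolationForest` on a toy feature matrix (the feature values below are invented for illustration; the team's real model and data are not public):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Toy daily feature vectors: [requests to a domain, domain popularity].
normal = rng.normal(loc=[50, 300], scale=[10, 40], size=(200, 2))
# One machine hammering a rarely seen domain -- an exfiltration-like outlier.
suspicious = np.array([[900.0, 2.0]])
X = np.vstack([normal, suspicious])

# The forest isolates points by random splits; points that need few splits
# to isolate (i.e., sit far from the bulk of the data) score as outliers.
model = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
labels = model.fit_predict(X)  # -1 = outlier, 1 = inlier

print("flagged rows:", np.where(labels == -1)[0])
```

Because each tree only needs a sample of the data and a subset of split values, the method stays cheap even as the number of features grows, which matches the "relatively light, processing-wise" point above.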
The initial efforts resulted in the model learning to detect three types of exfiltration attacks that the company would not otherwise have detected with existing security appliances. Overall, about half the exfiltration attacks could be detected with a low false-positive rate, Boijaud says.
Not All Network Anomalies Are Malicious
The engineers also had to find ways to determine which anomalies indicated malicious attacks and which represented nonhuman but benign traffic. Advertising tags and requests sent to third-party tracking servers were also caught by the system, as they tend to match the definitions of anomalies, but they could be filtered out of the final results.
Automating the initial analysis of security events can help companies more quickly triage and identify potential attacks. By doing the research themselves, security teams gain additional insight into their data and can more easily determine what is an attack and what may be benign, Boijaud says.
CA-GIP plans to expand the analysis approach to use cases beyond detecting exfiltration through Web attacks, she says.