Machine Learning and Data Security
The need for better data security was recently put high on the global cybersecurity agenda by the EU General Data Protection Regulation (GDPR), which took effect on May 25, 2018. The regulation requires all companies that process the personal data of EU citizens to adopt more secure approaches to managing customer data, protecting it against accidental loss, unlawful destruction, theft, and unauthorized disclosure.
Notwithstanding these regulatory efforts, though, the state of data security across the world leaves much to be desired. According to Thales Security, 67 percent of enterprises worldwide experience at least one major data breach or network attack annually. As more digital innovations end up in the wrong hands, hacker attacks are becoming increasingly sophisticated and destructive. As a result, more companies are pinning their hopes on AI/ML cybersecurity innovation. But how can Machine Learning actually be leveraged to improve cybersecurity, and data security in particular? This article tries to answer that question.
Use Cases for Machine Learning in Cybersecurity and Data Security
Many experts cast doubt on ML’s capacity to radically improve data security. Common reasons for this skepticism are the difficulty of obtaining high-quality security data for training ML algorithms, data privacy concerns associated with training ML models, the ‘black-box’ nature of ML algorithms, which can backfire in sensitive security contexts, and the growing adaptability of attackers. Even though many of these concerns are legitimate, it is more productive to focus on the positive rather than the negative side of things. There is no doubt that ML-based cybersecurity is making a real difference, and its fruits are already tangible. Below is a short summary of the cybersecurity domains that have been positively affected by the ongoing AI revolution.
Malware and Spyware Detection
Malware and spyware are installed on the user’s computer to obtain sensitive private data like digital credentials (logins and passwords) for identity theft, credit card data, and other personal records. Until very recently, software companies used digital signatures to protect against malware; however, this approach is becoming less effective. Hackers have found ways to bypass signature-based protection by making slight modifications to malware and taking advantage of the largely reactive approach adopted by anti-virus providers. Under these circumstances, more companies are now turning to ML-based malware detection.
An ML-based approach to malware detection typically uses supervised Machine Learning, which derives a malware detection model from a vast array of labeled malware examples (Trojan horses, spyware, etc.). To create such a model, we first need to collect a training data set from available malware examples, cleanse it, and prepare it for training. We can then feed this data to a learning algorithm designed to identify recurring patterns and features in the training set. If we have done everything right, the ML algorithm will come up with an abstract representation of malware patterns that can then be employed to detect new malware. A similar approach is already used effectively for spam detection based on vast amounts of labeled spam examples.
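The workflow above can be sketched with a toy example. The minimal Naive Bayes spam classifier below is illustrative only - the messages, labels, and word-frequency features are all invented, and a production detector would train on far richer features extracted from real samples:

```python
from collections import Counter
import math

# Toy labeled training set: (message, label). In a real pipeline these would
# be features extracted from millions of real spam/malware samples.
TRAINING_DATA = [
    ("win free prize now", "spam"),
    ("claim your free reward", "spam"),
    ("urgent prize claim now", "spam"),
    ("meeting agenda attached", "ham"),
    ("project status report", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

def train(data):
    """Count word frequencies per class -- the 'recurring patterns' the text describes."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in data:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Naive Bayes: pick the class maximizing log P(class) + sum log P(word|class)."""
    vocab = {w for counter in word_counts.values() for w in counter}
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            # Laplace smoothing so unseen words don't zero out the probability
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

word_counts, class_counts = train(TRAINING_DATA)
print(classify("free prize now", word_counts, class_counts))          # spam
print(classify("status report attached", word_counts, class_counts))  # ham
```

The same counting-and-scoring pattern generalizes from spam words to malware features such as API call sequences or byte n-grams.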
The Tel Aviv-based security startup Deep Instinct is a good illustration of how these approaches can be used in practice. The Israeli company builds deep neural networks that learn malware patterns directly from the malware’s code instead of trying to figure out virus signatures or using other heuristics. Looking into the code itself for malware patterns is a paradigm shift: it moves us from high-level, human-comprehensible explanations of malware to low-level machine representations based on subtle malware features that are almost impossible for human experts to notice.
Data Protection in the Cloud
Data protection in the public cloud has been a challenging task for cloud providers due to the cloud’s inherently shared nature, which stems from the way data centers and networks are virtualized, and its exposure to the Internet.
Cloud storage illustrates the challenge well. Conventional cloud security systems have to rely on many hard-coded rules, continuous monitoring, and manual intervention to secure stored data. This approach, however, has become less effective due to the exponentially growing volumes of data stored in the public cloud. ML-based insight generation, predictive analytics, and automated control are seen by many as a powerful alternative to these older techniques.
One example of this ML innovation is Amazon Macie - an ML system used by Amazon to secure data in its S3 storage. The system dynamically analyzes all attempts to access private data and flags anomalies such as large amounts of data being downloaded, unusual login attempts, or data transferred to an unexpected location. In addition, Macie classifies the sensitivity of the data using various metadata fields, file content, and source code. Using these data sensitivity scores, the system runs regular security checks on the most sensitive data and alerts owners when data is breached or accidentally exposed. Amazon’s security paradigm for data protection is more dynamic and flexible than earlier approaches. It also takes advantage of AI-generated insights and predictive modeling to prevent attacks rather than just react to them.
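The kind of anomaly flagging described above can be sketched in a few lines. Everything here - the access log, the baseline statistics, and the z-score threshold - is invented for illustration and bears no relation to Macie's actual internals:

```python
import statistics

# Hypothetical access history: (user, location, MB downloaded).
HISTORY = [
    ("alice", "Dublin", 120), ("alice", "Dublin", 150),
    ("alice", "Dublin", 130), ("alice", "Dublin", 110),
    ("alice", "Dublin", 140),
]

def build_baseline(history):
    """Summarize 'normal' behavior: typical download size and known locations."""
    sizes = [size for _, _, size in history]
    return {
        "mean": statistics.mean(sizes),
        "stdev": statistics.stdev(sizes),
        "locations": {loc for _, loc, _ in history},
    }

def flag_anomalies(event, baseline, z_threshold=3.0):
    """Return the list of reasons an access event deviates from the baseline."""
    _, location, size = event
    reasons = []
    z = (size - baseline["mean"]) / baseline["stdev"]
    if z > z_threshold:
        reasons.append("unusually large download")
    if location not in baseline["locations"]:
        reasons.append("unexpected location")
    return reasons

baseline = build_baseline(HISTORY)
print(flag_anomalies(("alice", "Dublin", 135), baseline))      # [] -- normal
print(flag_anomalies(("alice", "Pyongyang", 9000), baseline))  # both flags raised
```

A real system would learn many more behavioral dimensions and update the baseline continuously, but the principle - flag deviations from a learned norm rather than match hard-coded rules - is the same.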
ML and Behavioral Revolution in Data Security
To protect digital customer data, companies have traditionally used account credentials and various security checks on names, addresses, emails, dates of birth, security questions, etc. More recently, two-factor authentication, fingerprints, and iris scans were added to the list. Today, with advances in Machine Learning, we can put yet another important variable - user behavior - into the security equation. As it turns out, behavior makes a big difference to security. In contrast to personal data, which can be accessed via public records, purchased on the dark web, or stolen, user behavior cannot be easily obtained or mimicked.
This makes ML security systems that model user behavior the next revolution in data security. A user’s purchase patterns, common login locations, and even browsing preferences can be used by ML models to build a powerful behavioral model that distinguishes between the genuine user and a malicious hacker trying to impersonate them. For example, the London-based software company Onfido has demonstrated how to use this innovation in identity verification. It developed a Facial Check that prompts users to film themselves performing random movements. Then, using Machine Learning, the system compares the recorded video with the image of the user’s face extracted from the user’s identity document. This way, the system ensures that it is the genuine user, and not an impostor, who is trying to access the account or website.
Behavioral intelligence is now also widely used in online payment protection. For example, in 2016 Mastercard introduced its Decision Intelligence system, which applies Machine Learning to evaluate the trustworthiness of transactions. The technology behind DI examines how a specific card account is used over time to detect normal and abnormal spending patterns. It leverages details such as customer value segmentation, location, merchant, device data, time of day, and type of purchase to assess behavioral conformity and score a transaction’s risk exposure, preventing malicious transactions.
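A toy risk scorer in the spirit of the system described above might look as follows. The features, weights, and profile are all invented for illustration - Mastercard's Decision Intelligence learns its scoring from historical transaction data rather than using fixed rules like these:

```python
def risk_score(txn, profile):
    """Score a transaction 0.0 (conforms to habit) to 1.0 (highly suspicious)."""
    score = 0.0
    if txn["location"] not in profile["usual_locations"]:
        score += 0.4  # purchase far from the cardholder's usual places
    if txn["hour"] not in profile["usual_hours"]:
        score += 0.2  # purchase at an unusual time of day
    if txn["amount"] > 3 * profile["avg_amount"]:
        score += 0.3  # amount far above the cardholder's typical spend
    if txn["merchant_category"] not in profile["usual_categories"]:
        score += 0.1  # unfamiliar type of purchase
    return min(score, 1.0)

# Hypothetical behavioral profile built from a card's transaction history.
PROFILE = {
    "usual_locations": {"London"},
    "usual_hours": set(range(8, 23)),  # typically shops 08:00-22:59
    "avg_amount": 40.0,
    "usual_categories": {"grocery", "transport"},
}

normal = {"location": "London", "hour": 12, "amount": 35.0,
          "merchant_category": "grocery"}
odd = {"location": "Lagos", "hour": 3, "amount": 900.0,
       "merchant_category": "electronics"}

print(risk_score(normal, PROFILE))  # 0.0
print(risk_score(odd, PROFILE))     # 1.0
```

The real system replaces these hand-picked weights with ones learned from millions of labeled transactions, which is what lets it adapt as spending behavior changes.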
As this article demonstrates, ML can significantly enhance digital data security by learning user behavior, identifying normal and abnormal access patterns, and finding low-level patterns in malicious software. However, before reaching widespread adoption, ML-based cybersecurity has to overcome at least two challenges:
The risk of hacking ML security algorithms. If attackers manage to figure out how an algorithm is set up and where it takes its training data from, they can feed it misleading data that skews the algorithm’s perspective - a technique known as data poisoning. If this happens, the algorithm can become a zombie AI in the hacker’s hands. To overcome this challenge, human operators will need to play a greater role in overseeing how ML security works and intervening when something goes wrong. This, however, requires developing better interfaces for human-AI interaction and better interoperability of ML solutions.
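A toy example makes the risk of misleading training data concrete. The one-feature "detector" below is entirely hypothetical: it learns a benign-traffic threshold from training samples, and an attacker who can inject fake "benign" samples stretches that threshold until a real attack slips under it:

```python
import statistics

def fit_threshold(benign_sizes, margin=2.0):
    """'Train' the detector: flag anything beyond mean + margin * stdev of benign traffic."""
    return statistics.mean(benign_sizes) + margin * statistics.stdev(benign_sizes)

# Clean training data: request sizes observed during normal operation.
clean_data = [100, 110, 95, 105, 90]
threshold = fit_threshold(clean_data)
print(5000 > threshold)  # True: a 5000-byte attack request is flagged

# The attacker slips oversized 'benign' samples into the training feed,
# inflating both the mean and the spread of the learned baseline.
poisoned_data = clean_data + [4000, 5000, 6000]
poisoned_threshold = fit_threshold(poisoned_data)
print(5000 > poisoned_threshold)  # False: the same attack now goes undetected
```

Real poisoning attacks target far more complex models, but the failure mode is the same: a model is only as trustworthy as the data it learns from, which is why human oversight of training pipelines matters.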
The risk that cybersecurity AI is exploited by hackers to create malware. Technology is a double-edged sword, and all the more so in the case of Artificial Intelligence. We know, for example, that modern Deep Learning approaches such as Generative Adversarial Networks (GANs) work by mimicking real-world data. Hypothetically, hackers could use such GANs to design malware that is hard for other algorithms to detect, leading us into an era of battle between ‘good’ and ‘bad’ AI. Indeed, researchers from the cloud security company Cyxtera recently created an ML-based phishing attack generator trained on over 100 million effective historical attacks, optimizing it for the generation of effective scam links and emails. They found that, using the generator, attackers could bypass ML-based detection systems more than 15 percent of the time, whereas they normally succeed in only 0.3 percent of their attempts.
These are just a few of the challenges the cybersecurity community needs to address to speed up the adoption of ML-based data security. The expected benefits, however, are clearly worth the effort.