Regular Expressions Are Hard
How to avoid common pitfalls.
From insufficient security fixes to ReDoS, regular expressions are hard to get right. Yet, they are integral to modern software security and development. Hopefully this article helps you avoid common pitfalls before it's too late!
Let's begin with a story of how an innocent regex change led to a security vulnerability.
A few months ago, I was looking into a particular rich text editor when I noticed that it supported an interesting integration: PlantUML. This project allows users to write UML as code, and have a web server turn that code into a graphical UML diagram.
What immediately caught my eye was how utterly complex the project was, given its seemingly simple use case. The more complex any software is, the more difficult it is to ensure its security. The Preprocessing page of the PlantUML documentation reveals a treasure trove of built-in functions with interesting security implications.
While most of the sensitive functions like %getenv were blocked in the default security profile of the web server, %load_json was not. This allowed us to read any local JSON file and confirm whether any file exists on the filesystem. This turned out to be an oversight (and was assigned CVE-2023-3431), since the !include and !theme directives (which also enable local file reading) were subject to the security profile checks. Additionally, %load_json can also fetch from a URL, so SSRF was present as well.
Great, so only PlantUML instances running on the default security profile were vulnerable, right? I wanted to see if I could break one of the higher-security modes, so I looked into the ALLOWLIST mode. With this profile, only allowlisted URLs can be reached, limiting the impact of SSRF.
Taking a closer look at where this check is performed (source), we see that each URL is first cleaned through cleanPath, then checked against the allowlist with startsWith.
Normally, a startsWith check allows a trivial bypass via the user information part of a URL, which carries basic authentication credentials. However, PlantUML attempts to remove the user information portion of the URL before performing the startsWith check.
This should have been a good thing, but the regular expression used ruined everything. The regex captures the user information in the 2nd group and the actual host in the 3rd group, and it assumes that the user information part only ever contains characters matching [-_0-9a-zA-Z]+. So if we use https://plantuml.com@evil.com, there is no match! In fact, the regex fails to perform its intended function at all, since the format for user information in URLs is <username>:<password>@<host> and the regex does not allow :.
So, back to basics: a simple https://allowlisted.url.com@evil.com bypass would allow us to reach any arbitrary URL.
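PlantUML's actual pattern isn't reproduced here, but a sketch modeled on the described character class shows the failure mode. Everything below is illustrative: the regex is a stand-in for the real PATTERN_USERINFO, and the allowlist entry is a placeholder.

```ts
// Illustrative sketch: a userinfo-stripping regex modeled on the described
// character class. This is a stand-in, not PlantUML's actual PATTERN_USERINFO.
const USERINFO_PATTERN = /^(https?:\/\/)([-_0-9a-zA-Z]+@)?(.*)$/;

const allowlisted = "https://allowlisted.url.com";
const url = "https://allowlisted.url.com@evil.com";

// The "user information" here is "allowlisted.url.com", which contains dots,
// so the optional second group never matches and nothing gets stripped.
const cleaned = url.replace(USERINFO_PATTERN, "$1$3");

console.log(cleaned);                          // unchanged: https://allowlisted.url.com@evil.com
console.log(cleaned.startsWith(allowlisted));  // true -> passes the prefix-based allowlist check
console.log(new URL(url).hostname);            // "evil.com" -> the host actually contacted
```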
But how did this vulnerability come about? In attempting to fix a previous issue, the PATTERN_USERINFO was changed, introducing the limited set of characters that would match the user information part of the URL.
Regular expressions are hard to get right. But more importantly, don't reinvent the wheel! Java already comes with a URL class that has been tried and tested to perform standards-compliant URL parsing.
Using the getHost method, one can get the hostname of the URL, ignoring other parts of the URL like user information. This hostname can then be matched against the whitelist. Simple!
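As a minimal sketch of the same idea (using the WHATWG URL API available in Node.js and browsers rather than Java's java.net.URL, and with a placeholder allowlist), the check boils down to comparing parsed hostnames:

```ts
// Minimal sketch of a hostname-based allowlist check using the WHATWG URL API.
// The allowed host below is a placeholder, not PlantUML's configuration.
const ALLOWED_HOSTS = new Set(["allowlisted.url.com"]);

function isAllowed(raw: string): boolean {
  let parsed: URL;
  try {
    parsed = new URL(raw);
  } catch {
    return false; // unparseable URLs are rejected outright
  }
  // Compare the parsed hostname, never a string prefix of the raw URL.
  return parsed.protocol === "https:" && ALLOWED_HOSTS.has(parsed.hostname);
}

console.log(isAllowed("https://allowlisted.url.com/diagram"));          // true
console.log(isAllowed("https://allowlisted.url.com@evil.com/payload")); // false
```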
Don't make your life harder. Regex-based whitelists and blacklists are hard to get right.
Bug bounty programs often classify Denial of Service (DoS) issues as out of scope. This is even more likely for ReDoS issues, a specific subclass of DoS that exploits the fact that most regex implementations can be pushed into pathological cases where matching becomes extremely slow.
This is due to a regex engine feature called backtracking, an algorithmic technique that brute-forces every combination in order to solve a problem. When a "dead end" is encountered, the algorithm simply retraces its steps to previous nodes and explores other unvisited nodes.
In the context of regular expressions, these dead ends are simply non-matches. Take the regex ^(a+)+$. A non-match would be aaaaX. Because of the nested quantifiers, the engine tries every possible way of splitting the run of a's between the inner and outer groups before giving up, so backtracking becomes exponential as more a's are added. This is called catastrophic backtracking.
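A quick way to feel this is to time the pattern against non-matching inputs of increasing length. This is a rough sketch; the exact numbers depend on the regex engine and machine, but the growth is exponential.

```ts
// Rough sketch: timing /^(a+)+$/ against non-matching inputs of growing length.
const evil = /^(a+)+$/;

for (const n of [20, 24, 28]) {
  const input = "a".repeat(n) + "X"; // the trailing "X" guarantees a non-match
  const start = Date.now();
  evil.test(input);
  console.log(`${n} a's: ${Date.now() - start} ms`); // each extra "a" roughly doubles the time
}
```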
In a bug bounty program a while back, I found an exposed Elasticsearch API that allowed me to run any query on the Elasticsearch instance. The Elasticsearch data wasn't particularly sensitive, so I had to find another way to escalate the impact.
I found this post on the Elasticsearch forum which I thought was pretty interesting.
A developer mentioned that "it is possible for a sufficiently determined malicious user to write searches that overwhelm the Elasticsearch cluster and bring it down". Huh.
I couldn't find any online resources on how to do this, so I had to try to figure it out myself. Eventually, when exploring the Elasticsearch documentation, I found the scripting module. The scripts were well-sandboxed, and I couldn't find a way to escalate this into an RCE.
Thankfully, though, Painless scripts allow us to run regular expressions. I ran a script that simply checks whether aaaaaaa fulfills the regex /a{0,7}{0,10}$/. Since the server URL-encoded many characters, I could only work with the {...} quantifiers to increase the time complexity.
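The exact request isn't reproduced here, but something along these lines illustrates the idea. This is only a sketch: it assumes a script_fields search, that Painless regular expressions are enabled on the cluster, and a placeholder index name.

```ts
// Sketch of a search request that evaluates a regex inside a Painless script.
// Assumes Painless regexes are enabled; "some-index" is a placeholder.
const body = {
  size: 1,
  query: { match_all: {} },
  script_fields: {
    redos: {
      script: {
        lang: "painless",
        source: "'aaaaaaa' ==~ /a{0,7}{0,10}$/", // ==~ is Painless's regex match operator
      },
    },
  },
};

const res = await fetch("http://localhost:9200/some-index/_search", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(body),
});

const result = await res.json();
console.log(result.took, "ms"); // the "took" field shows how long the search ran
```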
In this script, only 23 steps are needed by the regex algorithm to find the match. But when the last character in the test string is changed to an x, the string aaaaaax will cause the algorithm to take 74,380 steps instead.
Now, what if we used /((a{0,7}){0,7}){0,10}$/? Regex101 detects catastrophic backtracking and gives up.
In this particular program, the difference in computational complexity becomes very noticeable once we look at the took attribute of the response JSON.
I ended up reporting a query that caused the server to take more than a minute to respond. This was around a 3,000x amplification from the original query time of 20ms.
Clearly, we could continue adding more {0,7} to the regex to strengthen the payload until it crashes the Elasticsearch service. To respect the program rules against performing DoS attacks, I did not test any payloads stronger than this.
Alas, this adventure only served to fuel my ongoing gripe with the status quo of classifying all DoS issues as out of scope in bug bounty programs.
It's understandable that companies and organizations don't want people spamming their infrastructure to report DoS issues. But when you have an exponential amplification vector, it's dangerous to ignore.
The traditional CIA model covers three areas, and Availability is one of them. What's important is leverage: how much leverage does the attacker have over your resources?
If an attacker can use a single HTTP request to bring your server to its knees, that's a very high-leverage DoS vector. On the other hand, if the attacker has to use significant resources, like a botnet, to stage a DDoS, then that's a very low-leverage DoS vector. I definitely agree that low-leverage DoS vectors should be out of scope!
I hope that more people adopt this view and become more accepting of DoS vulnerabilities in application security. In this particular case, it is definitely a high-leverage vulnerability that deserves attention. Unfortunately, the team will likely never fix it.
Node.js is a single-threaded, event-driven JavaScript runtime. Simply put, everything happens on a single "event loop" thread. When something happens (such as a new request), a callback is triggered. This fills up the event queue, which the event loop works to clear.
All of this is to say: if a regular expression is being tested in JavaScript, nothing else can happen until the test is complete. For example, if a Node.js web server is handling a request and using a regex to validate one of the request parameters, no other requests can go through until this is done.
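As a minimal sketch of that blocking behaviour (the server, route, and parameter below are hypothetical; the pattern is the catastrophic one from earlier):

```ts
// Minimal sketch: a regex check on a request parameter blocks the entire event loop.
// The route and parameter are hypothetical; the pattern is the earlier /^(a+)+$/.
import * as http from "node:http";

const NAME_PATTERN = /^(a+)+$/;

http
  .createServer((req, res) => {
    const url = new URL(req.url ?? "/", "http://localhost");
    const name = url.searchParams.get("name") ?? "";

    // While this test runs, no other request on this process can be served.
    const valid = NAME_PATTERN.test(name);

    res.end(valid ? "ok" : "invalid name");
  })
  .listen(3000);

// A request like /?name=aaaaaaaaaaaaaaaaaaaaaaaaaaaaX stalls every other client
// until the regex engine finishes backtracking.
```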
This has client-side implications as well: there is no way to push the work off to a background thread to keep the UI responsive, because everything has to happen in the event loop.
It gets even more interesting when modern desktop apps are built on frameworks like Electron that run on Node.js. Recently, I came across a very complex URL validation regex in an Electron app.
This regex was tested every time the user clicked on a link, so sending a crafted link to a victim could cause their application to hang for several seconds.
Client-side DoS vectors like these are also hard to ignore — who doesn't remember the viral WhatsApp bug that crashed the app whenever someone sent an evil message?
Writing regular expressions is hard. On one hand, security-oriented regular expressions need to be able to catch all edge cases and potential bypasses. On the other, we need to avoid overly complex regular expressions that introduce ReDoS vectors.
On delicate attack surfaces like URL parsing, it's almost always better to go with existing parsers instead of trying to reinvent the wheel. For instance, JavaScript's URL class is WHATWG URL Standard compliant, which means that it parses URLs exactly how a standards-compliant browser would.
It is way better to prevent XSS, for example, by using new URL(dirty).protocol === 'javascript:' instead of trying to use a regular expression to catch javascript: URLs, simply because there are many ways to write the same URL. Your custom regex might catch javascript:alert(), but does it catch all of the following URLs? It might be hard to say.
JAVASCRIPT:alert()
\x00javascript:alert()
java\r\nscript:alert()
java\tscript:alert()
javascript\r\n:alert()
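The WHATWG parser, on the other hand, normalizes all of these before you ever look at the protocol. A quick sketch (the alert() payloads simply mirror the list above):

```ts
// Sketch: the WHATWG URL parser normalizes scheme tricks that a naive regex would miss.
const candidates = [
  "JAVASCRIPT:alert()",
  "\x00javascript:alert()",
  "java\r\nscript:alert()",
  "java\tscript:alert()",
  "javascript\r\n:alert()",
];

for (const dirty of candidates) {
  // The scheme is lowercased, leading control characters are trimmed,
  // and tabs/newlines are removed before parsing.
  console.log(JSON.stringify(dirty), "->", new URL(dirty).protocol); // "javascript:" every time
}
```

Every candidate parses to the javascript: protocol, so the single protocol check covers the whole list.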
If regular expressions have to be used, it's often wise to avoid the following patterns to prevent ReDoS (a sketch of safer rewrites follows the list):
Nesting quantifiers, like (a+)+
Non-mutually exclusive alternations, like (.|a)*
Non-mutually exclusive quantifiers in sequence, like a.*?b.*?c
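For illustration, here is one way the three examples above could be rewritten. These are sketches under the assumption of simple, single-character delimiters; real patterns need the same treatment applied to their own structure.

```ts
// Sketch: ReDoS-safe rewrites of the patterns above. Each accepts the same inputs
// as the original but cannot backtrack catastrophically.
const runs = /^a+$/;              // instead of the earlier ^(a+)+$
const anything = /^.*$/;          // instead of ^(.|a)*$  ("a" is already covered by ".")
const sequence = /a[^b]*b[^c]*c/; // instead of a.*?b.*?c, when b and c are single characters

console.log(runs.test("aaaa"));                // true
console.log(runs.test("a".repeat(40) + "X"));  // false, and it fails fast
console.log(anything.test("any single line")); // true
console.log(sequence.test("a--b--c"));         // true
```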
I hope that this has been an interesting read!