URL-OVERLONG-UTF8
| Test ID | MAL-URL-OVERLONG-UTF8 |
| Category | Malformed Input |
| Expected | 400 or close |
What it sends
A GET request with raw overlong UTF-8 bytes in the URL path. The bytes 0xC0 0xAF are an overlong encoding of / (U+002F).
GET /\xC0\xAF HTTP/1.1\r\n
Host: localhost:8080\r\n
\r\n
The two bytes after / are 0xC0 0xAF, an illegal two-byte UTF-8 sequence that a lenient decoder would map to the ASCII forward slash character.
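As a minimal sketch of how such a request could be produced, the Python below builds the request bytes shown above and can optionally write them to a plain TCP socket. The helper names build_request and send_raw are hypothetical, and the host and port only mirror the Host header in the example. A raw socket is assumed here because many HTTP client libraries percent-encode or refuse such a path before it reaches the wire.

```python
import socket


def build_request(host: str = "localhost", port: int = 8080) -> bytes:
    """Return the raw request bytes with 0xC0 0xAF unescaped in the path."""
    # The path is assembled as bytes so that nothing percent-encodes it.
    path = b"/" + bytes([0xC0, 0xAF])
    return (
        b"GET " + path + b" HTTP/1.1\r\n"
        + b"Host: " + f"{host}:{port}".encode("ascii") + b"\r\n"
        + b"\r\n"
    )


def send_raw(host: str = "localhost", port: int = 8080, timeout: float = 5.0) -> bytes:
    """Send the request and return the first chunk of the server's answer.

    An empty result means the server closed (or reset) the connection,
    or did not answer within the timeout.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, port))
        try:
            return sock.recv(4096)
        except (socket.timeout, ConnectionResetError):
            return b""
```

A response whose status line starts with HTTP/1.1 400, or an empty result from send_raw, would both match the expected outcome of 400 or close.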
What the RFC says
Raw bytes 0xC0 and 0xAF are not valid URI characters. URI paths are limited to pchar:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
– RFC 3986 Section 3.3
All unreserved and sub-delims characters are ASCII (0x21-0x7E). Bytes 0xC0 and 0xAF fall outside this range and are not percent-encoded, so they violate the URI grammar.
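The grammar check described above can be sketched as a byte-level scan. This is an illustrative validator, not any particular server's parser: the function name first_invalid_path_byte is hypothetical, and the sketch assumes the input is just the path portion of the request-target (no query string).

```python
import string
from typing import Optional

UNRESERVED = set((string.ascii_letters + string.digits + "-._~").encode("ascii"))
SUB_DELIMS = set(b"!$&'()*+,;=")
HEX_DIGITS = set(string.hexdigits.encode("ascii"))
# Bytes that may appear literally in a path: pchar (minus pct-encoded) plus "/".
LITERAL_OK = UNRESERVED | SUB_DELIMS | set(b":@/")


def first_invalid_path_byte(path: bytes) -> Optional[int]:
    """Return the index of the first byte that violates the path grammar, or None."""
    i = 0
    while i < len(path):
        b = path[i]
        if b == 0x25:  # "%" must be followed by exactly two hex digits
            pair = path[i + 1:i + 3]
            if len(pair) != 2 or not all(c in HEX_DIGITS for c in pair):
                return i
            i += 3
        elif b in LITERAL_OK:
            i += 1
        else:
            return i  # raw byte outside the allowed ASCII set
    return None
```

For the request in this test, first_invalid_path_byte(b"/\xc0\xaf") reports index 1, the 0xC0 byte. The percent-encoded spelling /%C0%AF passes this check, since it is syntactically valid pct-encoded; only a UTF-8 validity check catches that form.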
Additionally, RFC 3629 requires rejection of overlong encodings:
“Implementations of the decoding algorithm above MUST protect against decoding invalid sequences.” — RFC 3629 Section 3
The bytes 0xC0 0xAF are an overlong UTF-8 encoding of U+002F (forward slash /), which must be encoded as the single byte 0x2F.
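To illustrate what rejecting the sequence looks like in practice, the sketch below uses Python's built-in UTF-8 codec, whose default strict error handling refuses the bytes. The helper name decodes_as_utf8 is hypothetical, and whether a given server's decoder behaves the same way is exactly what this test probes.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True only if the bytes are well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


overlong_slash = bytes([0xC0, 0xAF])  # overlong form used in this test
shortest_slash = bytes([0x2F])        # the only valid encoding of U+002F
```

Here decodes_as_utf8(overlong_slash) is False while decodes_as_utf8(shortest_slash) is True, so the overlong form is never treated as equivalent to /.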
Why it matters
Overlong UTF-8 sequences encode characters using more bytes than necessary. If a server decodes 0xC0 0xAF as / during path resolution, an attacker can slip a slash past path traversal filters that only inspect the undecoded bytes (e.g., ..%c0%af.. becomes ../..). This was the basis of the infamous IIS Unicode directory traversal exploit (CVE-2000-0884).
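The bypass can be sketched with a deliberately unsafe decoder placed after a naive filter. Both functions are hypothetical illustrations of the failure mode and are not a reconstruction of IIS's code; the example path is likewise made up.

```python
def lenient_decode(data: bytes) -> str:
    """Deliberately UNSAFE two-byte decoder: it skips the shortest-form check."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0xC0 <= b <= 0xDF and i + 1 < len(data) and (data[i + 1] & 0xC0) == 0x80:
            # Combine 5 bits from the lead byte with 6 bits from the continuation.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            out.append(chr(b))  # every other byte is passed through unchanged
            i += 1
    return "".join(out)


def naive_filter_allows(raw_path: bytes) -> bool:
    """Hypothetical traversal filter that only looks for the literal bytes '../'."""
    return b"../" not in raw_path


slash = bytes([0xC0, 0xAF])
raw = b"/scripts/.." + slash + b".." + slash + b"winnt"
decoded = lenient_decode(raw)
```

The filter sees no literal ../ in raw and lets it through, yet decoded is /scripts/../../winnt. The check and the interpretation disagree about what the bytes mean, which is the core of the vulnerability.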
Deep Analysis
Relevant ABNF
request-line = method SP request-target SP HTTP-version
origin-form = absolute-path [ "?" query ]
segment = *pchar
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
RFC Evidence
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
– RFC 3986 Section 3.3
“Implementations of the decoding algorithm above MUST protect against decoding invalid sequences.” – RFC 3629 Section 3
“The security threat is very real. […] a widespread virus attacking Web servers in 2001 relied on the mishandling of overlong UTF-8 sequences to compromise vulnerable systems.” – RFC 3629 Section 10
Chain of Reasoning
The raw bytes are not valid URI characters. Bytes 0xC0 and 0xAF both fall outside the ASCII range used by pchar. Neither matches unreserved (limited to ALPHA, DIGIT, -, ., _, ~, all below 0x7F), sub-delims, :, @, or pct-encoded (requires a leading %). The request-target violates the URI grammar at the byte level, independent of any UTF-8 interpretation.
The bytes form an overlong UTF-8 encoding of /. In standard UTF-8, U+002F (forward slash) is encoded as the single byte 0x2F. The two-byte sequence 0xC0 0xAF uses the 110xxxxx 10xxxxxx pattern with the value bits 00000 101111 = 0x2F. This is an overlong encoding: it uses 2 bytes where 1 byte suffices.
Overlong sequences MUST be rejected. RFC 3629 Section 3 requires that implementations “MUST protect against decoding invalid sequences.” Overlong encodings are explicitly invalid because they violate the shortest-form requirement. A conforming UTF-8 decoder must not accept 0xC0 0xAF as equivalent to 0x2F.
CVE-2000-0884 exploited this pattern. Microsoft IIS on Windows decoded overlong UTF-8 sequences in URLs, allowing ..%c0%af.. to be interpreted as ../... This enabled remote directory traversal, giving attackers access to files outside the web root. RFC 3629 Section 10 explicitly references this class of attack, noting “a widespread virus attacking Web servers in 2001” exploited overlong UTF-8 mishandling.
Two layers of defense apply. First, the bytes fail the URI grammar (they are not valid pchar), so a strict URI parser will reject the request before any UTF-8 decoding. Second, even if a server attempts UTF-8 decoding, RFC 3629 mandates rejection of the overlong sequence. A server that accepts this request has failed at both layers.
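The bit arithmetic in the reasoning above can be checked directly. The sketch below extracts the payload bits from a two-byte pattern and applies the shortest-form rule; the function names are hypothetical, and only the two-byte case is covered.

```python
def two_byte_value(lead: int, cont: int) -> int:
    """Extract the 11 payload bits from a 110xxxxx 10xxxxxx pair."""
    if (lead & 0xE0) != 0xC0 or (cont & 0xC0) != 0x80:
        raise ValueError("not a two-byte UTF-8 bit pattern")
    # 5 bits from the lead byte, then 6 bits from the continuation byte.
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def is_overlong_two_byte(lead: int, cont: int) -> bool:
    """A two-byte form is overlong when the value would fit in one byte (< 0x80)."""
    return two_byte_value(lead, cont) < 0x80
```

For 0xC0 0xAF the lead byte contributes 00000 and the continuation byte 101111, so two_byte_value(0xC0, 0xAF) equals 0x2F and is_overlong_two_byte returns True. Any pair starting with 0xC0 or 0xC1 yields a value below 0x80, which is why those two lead bytes never appear in valid UTF-8.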
Sources
- RFC 3986 Section 3.3 — URI path and pchar grammar
- RFC 3629 Section 3 — UTF-8 decoding requirements
- RFC 3629 Section 10 — UTF-8 security considerations
- CVE-2000-0884 — IIS Unicode directory traversal