Daniel Lemire's blog
How does your URL parser handle Unicode?

Most strings in software today are Unicode strings, which means that they can include mathematical symbols, emojis and so forth. There are many different versions of the letter ‘M’: for example, the Roman letter M (U+004D) is semantically different from the Roman numeral Ⅿ (U+216F), even though the two often have the same visual representation. John Cook has an interesting post on Unicode steganography: you can use this ambiguity to hide messages in plain view. For example, if you need to warn someone that you are in danger, you could send a text that contains the Roman numeral Ⅿ instead of the letter. Most people reading the text would not notice the difference.
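To see that the two characters really are distinct, you can ask Python's standard unicodedata module for their code points and names. This is only an illustrative sketch; the variable names are my own.

```python
import unicodedata

latin_m = "\u004D"    # the Roman (Latin) capital letter M
numeral_m = "\u216F"  # the Roman numeral for one thousand

# Print the code point and the official Unicode name of each character.
for ch in (latin_m, numeral_m):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The characters may look alike, but they are different strings.
print(latin_m == numeral_m)
```

The comparison is false, which is why a parser that does not normalize treats the two spellings of a domain as different hosts.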

What about URLs like Microsoft.com? If you replace a Roman letter with the Roman numeral, is it still the same domain?

It is. To comply with the WHATWG URL specification, URL parsers must normalize the host name, which involves, among other things, mapping compatibility characters such as the Roman numeral Ⅿ to their plain Roman-letter equivalents.
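As a rough illustration of what this normalization does to our example, here is a sketch that applies Unicode compatibility normalization (NFKC) and then lowercases the result. The helper name approximate_host_normalization is hypothetical, and this is only an approximation: the processing required by the WHATWG specification relies on dedicated mapping tables, validity checks and Punycode encoding, none of which appear here.

```python
import unicodedata

def approximate_host_normalization(host: str) -> str:
    """Hypothetical helper: approximate host normalization.

    NFKC replaces compatibility characters (such as the Roman numeral
    U+216F) with their plain equivalents; lower() then folds the case.
    This is NOT the full WHATWG host-processing algorithm.
    """
    return unicodedata.normalize("NFKC", host).lower()

# The last character of the input is the Roman numeral U+216F.
print(approximate_host_normalization("microsoft.co\u216F"))  # microsoft.com
```

A compliant parser should be preferred in practice; the sketch only shows why the two spellings end up as the same domain.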

But do they? Do the URL parsers actually do this hard work? Let us check.

Java. I could not get the standard Java library to return the host: getHost() simply returns null.
String url = "https://microsoft.coⅯ";
URI uri = new URI(url);
String host = uri.getHost();

C#. The .NET library seems to just return the domain as-is, with the Roman numeral.
string url = "https://microsoft.coⅯ";
Uri uri = new Uri(url);
string host = uri.Host;

PHP. The standard PHP interpreter just returns the domain as-is, with the Roman numeral.
$url = "https://microsoft.coⅯ";
$parsed_url = parse_url($url);
if ($parsed_url === false) {
 echo "URL could not be parsed.";
} else {
 $host = $parsed_url['host'];
}


Go. Go also does not do normalization.
urlString := "https://microsoft.coⅯ"
parsedURL, err := url.Parse(urlString)
if err != nil {
        fmt.Println("URL could not be parsed:", err)
        return
}
host := parsedURL.Host

Python. You guessed it: no normalization. It happily returns the Roman numeral.
import urllib.parse
url = "https://microsoft.coⅯ"
parsed_url = urllib.parse.urlparse(url)
host = parsed_url.netloc

JavaScript. JavaScript does it correctly. It will convert https://microsoft.coⅯ to https://microsoft.com.
const url = "https://microsoft.coⅯ";
const urlObj = new URL(url);
const host = urlObj.hostname;

C++. C++ does not have a standard URL parser, but if you use the ada URL parser, you will get correct results. If you are using the Node.js runtime environment, the underlying parser is the C++ ada URL parsing library.
auto url = ada::parse("https://microsoft.coⅯ");
if (!url) { /* failure */ }
std::string_view host = url->get_host();

