Posts Tagged ‘confusables’

Microsoft wins legal dispute over Bing.com IDN lookalike

November 3rd, 2010 by

A couple of years ago I tried registering IDNs (Internationalized Domain Names) that were visually identical or similar to popular sites like mozilla.org, bing.com, and google.com. What I found was that I wasn’t the only one doing this. For me, it was just a way to demonstrate the possibilities for visual spoofing in modern user-agents, similar to what we saw in 2005 with the paypal.com spoof.

I don’t think this recent legal decision made the news anywhere, but Microsoft filed a complaint that the registered domain name www.bıng.com was confusingly similar to its www.bing.com brand. In case it’s hard to see, the issue is the dotless ‘i’ in the lookalike domain: the registrant used Unicode character U+0131 LATIN SMALL LETTER DOTLESS I in place of the usual U+0069 LATIN SMALL LETTER I in bing.com.
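To make the difference concrete, here is a small Python sketch that lists the code points in each label and the ASCII form that ends up in DNS. It relies on Python's built-in `idna` codec, which follows the older IDNA2003 rules, so treat it as an illustration rather than a statement of how any particular registrar or browser processes the name. The helper names `describe` and `to_ascii` are my own.

```python
import unicodedata

real = "bing"
fake = "b\u0131ng"  # U+0131 LATIN SMALL LETTER DOTLESS I in place of "i"

def describe(label):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in label]

def to_ascii(label):
    """ASCII form of a single label via Python's IDNA2003 codec."""
    return label.encode("idna").decode("ascii")

for label in (real, fake):
    # Print the ASCII form first, then one line per character.
    print(to_ascii(label))
    for code, name in describe(label):
        print("   ", code, name)
```

The two labels look alike on screen, but the lookalike encodes to an `xn--` label while the real one stays plain ASCII.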

Microsoft won the case on valid merits, and as far as I know there was no harm done. That is, I haven’t heard any news of a phishing attack that used this domain name. It’s easy to imagine the extent of harm possible through a phishing/luring/schmoozing/whatever attack that uses confusing IDNs across email clients, web browsers, and other user-agents. A well-thought-out attack could be surprisingly effective.

Detecting malicious URL obfuscation techniques in spam

October 12th, 2010 by

URLs offer loads of fun for pranks, hacks, and spam.  The reasons are numerous and inherent in their structural and visual complexity.  Add IDNs to the mix and the fun-factor doubles.  But this isn’t about IDNs.  Symantec recently noted that spammers are using the soft hyphen character to obfuscate URLs and bypass anti-spam filters.

It’s a neat trick that plays on the widely divergent implementations of this specific character.  In Unicode the soft hyphen is U+00AD, but its inconsistent handling in browsers and email clients stems from confusion over how it is specified in other character sets such as ISO-8859-1, as well as in HTML 4.
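As a rough illustration of why a filter misses these URLs, the Python sketch below compares a naive substring match against one that first strips invisible format characters. The blocklist entry and function names are hypothetical, and dropping every character in Unicode category Cf is a blunt assumption on my part: it catches U+00AD, but it also removes joiners that are meaningful in some scripts.

```python
import unicodedata

BLOCKLIST = ("example.com",)  # hypothetical blocked domain

def strip_invisible(text):
    """Drop format characters (category Cf), which includes U+00AD."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

def naive_match(text):
    """Plain substring check, as a simple content filter might do."""
    return any(bad in text for bad in BLOCKLIST)

def hardened_match(text):
    """Same check, applied after invisible characters are removed."""
    return naive_match(strip_invisible(text))

spam = "Visit http://exam\u00adple.com/ today"
print(naive_match(spam))     # False: the soft hyphen splits the match
print(hardened_match(spam))  # True
```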

The fun shouldn’t stop with soft hyphens though.  There seem to be many interesting ways content inspection filters could be bypassed using characters with special meanings and others with special transformative properties.  I haven’t taken the time to do any thorough testing here, but my IDN and IRI spoofing test page has some examples of what I’m talking about.  If you think of the test cases as plain string content instead of IDNs, you can imagine some of the other ways in which content filters might be confused.

Looking at the Normalization tests on that page one can see that valid Unicode characters like the Ⓞ get normalized (as hyperlinks) to a Latin small letter ‘o’ by Web browsers through a standard process defined by IDNA2003, namely stringprep with a nameprep profile applied.  That’s just the tip of the iceberg, and still more possibilities for abuse exist.
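The effect is easy to reproduce in Python. The sketch below is only an approximation of nameprep: it applies NFKC normalization followed by case folding, which covers the circled-letter case but skips the rest of the stringprep profile (prohibited characters, bidi checks, and so on). The function name `approx_nameprep` is my own.

```python
import unicodedata

def approx_nameprep(label):
    """Rough stand-in for nameprep: NFKC normalization plus case folding."""
    return unicodedata.normalize("NFKC", label).casefold()

circled = "\u24c4"  # the circled letter O shown above
print(unicodedata.name(circled))
print(approx_nameprep(circled))              # o
print(approx_nameprep("g\u24c4\u24c4gle"))   # google
```

A filter that compares raw strings would see two unrelated values here, while the browser resolves both to the same name.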

These issues are why we created the UCAPI library for detecting string confusability.  I wonder how many content inspection products are looking at strings in this way?

IDNA2008 hits the standards track – visually confusing strings remain a threat

August 31st, 2010 by

After many years of engineering effort, the Internationalizing Domain Names in Applications (IDNA) protocol received a major update to its original 2003 standard. Although named IDNA2008, it hit the standards track in August 2010. Section “4.4 Visually Confusable Characters” of RFC 5890 is worth quoting:

It is worth noting that there are no comprehensive technical solutions to the problems of confusable characters. One can reduce the extent of the problems in various ways, but probably never eliminate it.

Taken out of context this may sound hopeless, but the RFC goes on to reference Unicode TR36 as providing a set of suggestions for mitigating string confusability. It’s in this vein that Casaba has built UCAPI, which provides an implementation of the Unicode Consortium’s suggestions as well as defensive techniques drawn from our own experience.
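One of the techniques in the Unicode security reports is to reduce each string to a “skeleton” and compare skeletons instead of raw strings. The Python sketch below shows the idea with a tiny, hand-picked mapping table. It is not UCAPI’s API, and the table is an assumption for illustration only; a real check would load the full confusables data published with UTS #39.

```python
import unicodedata

# Hand-picked sample of confusable mappings, for illustration only.
CONFUSABLES = {
    "\u0131": "i",  # LATIN SMALL LETTER DOTLESS I
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
}

def skeleton(text):
    """Decompose, map each character to its prototype, decompose again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def confusable(a, b):
    """True when two different strings share the same skeleton."""
    return a != b and skeleton(a) == skeleton(b)

print(confusable("bing", "b\u0131ng"))          # True
print(confusable("paypal", "p\u0430yp\u0430l")) # True
print(confusable("bing", "bang"))               # False
```

A registrar or filter could run a check like this against a list of protected names before accepting a new registration or passing a URL through.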

I can imagine that we will one day see a widespread attack that leverages string confusability. Or maybe we won’t see it, because it will blend in so well as to be undetectable.

New registrations of Internationalized Domain Names are expected to increase radically over time now that ICANN has opened up Unicode and IDN support for ccTLDs as well as gTLDs. As more TLDs are provisioned in native scripts, many more internationalized domain names are expected to be registered under them.

What are registrars doing now to protect customers from lookalike attacks on their brands? Is it their responsibility? If not, whose is it? Many organizations, including ICANN, are making suggestions, but is anyone listening?