Archive for the ‘Unicode’ Category

Handling Unicode when marshalling from .Net to a platform invoke

April 22nd, 2008 by

By default, the .Net runtime will marshal a string (and string fields in a value type) as an LPStr to a platform invoke (p/invoke) function. The .Net framework and runtime handle strings as UTF-16. That's a two-byte code unit representing a single Unicode 'code point' in the Basic Multilingual Plane, or more familiarly, a single character; code points beyond that plane take two code units (a surrogate pair). An LPStr, on the other hand, is a pointer to a null-terminated array of ANSI characters, so in order to convert, the runtime will perform a best-fit conversion to the system's ANSI code page, classically windows-1252. The best-fit table for that code page is published here:

http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

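This might not be so surprising to people in tune with Unicode, but it can lead to huge security problems when a security filter runs before the conversion. For example, if you're performing HTML filtering or file canonicalization, you need to do so after the conversion to LPStr, because a character that passes the filter can best-fit to a dangerous one: the table above maps the fullwidth less-than sign (U+FF1C) to a plain '<'.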
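To make the ordering problem concrete, here is a minimal sketch. It does not call the marshaller at all: the class name, the method names, and the two-entry mapping table are all made up for illustration, with the two mappings taken from the fullwidth entries of the best-fit table linked above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class BestFitOrdering
{
    // Hypothetical two-entry excerpt of the best-fit table, hard-coded for
    // illustration. The real conversion is done by the runtime, not by this code.
    static readonly Dictionary<char, char> BestFitExcerpt = new Dictionary<char, char>
    {
        { '\uFF1C', '<' }, // FULLWIDTH LESS-THAN SIGN
        { '\uFF1E', '>' }, // FULLWIDTH GREATER-THAN SIGN
    };

    // Stand-in for the marshaller's best-fit step.
    public static string SimulateBestFit(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            char mapped;
            sb.Append(BestFitExcerpt.TryGetValue(c, out mapped) ? mapped : c);
        }
        return sb.ToString();
    }

    // Naive filter: reject anything containing an ASCII angle bracket.
    public static bool LooksSafe(string s)
    {
        return s.IndexOf('<') < 0 && s.IndexOf('>') < 0;
    }

    static void Main()
    {
        string input = "\uFF1Cscript\uFF1E"; // fullwidth brackets around "script"

        Console.WriteLine(LooksSafe(input));                  // True: filter ran too early
        Console.WriteLine(LooksSafe(SimulateBestFit(input))); // False: filter ran after conversion
    }
}
```

The filter that runs on the original UTF-16 string sees nothing wrong, while the same filter run on the converted form catches the brackets.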

This default marshalling behavior is documented at: http://msdn2.microsoft.com/en-us/library/system.runtime.interopservices.marshalasattribute(VS.71).aspx

To deal with this properly and more safely, you can use the MarshalAsAttribute class to specify an LPWStr type instead of an LPStr. For example:

[MarshalAs(UnmanagedType.LPWStr)]

Because LPWStr is a pointer to a null-terminated array of Unicode characters, this ensures the Unicode code points are preserved across the marshalling.
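In context, the attribute goes on the string parameter of the p/invoke declaration. The sketch below is only an assumption about how you might wire it up: it borrows lstrlenW from kernel32.dll as a convenient wide-character function to call, and the class name and the NativeLength method name are made up for illustration.

```csharp
using System;
using System.Runtime.InteropServices;

static class WideStringInterop
{
    // lstrlenW is the wide-character (UTF-16) string-length function in kernel32.dll.
    // The LPWStr annotation makes the runtime pass the string as UTF-16,
    // so no best-fit conversion to an ANSI code page takes place.
    [DllImport("kernel32.dll", EntryPoint = "lstrlenW", ExactSpelling = true)]
    internal static extern int NativeLength([MarshalAs(UnmanagedType.LPWStr)] string text);

    static void Main()
    {
        string input = "\uFF1Cscript\uFF1E"; // fullwidth < and >, 8 UTF-16 code units

        // kernel32.dll only exists on Windows, so guard the call.
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
        {
            Console.WriteLine(NativeLength(input));
        }
    }
}
```

The same attribute can be applied to a string field in a struct that is passed to native code.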

I18N input validation whitelist filter with System.Globalization and GetUnicodeCategory

April 24th, 2007 by

Maybe you’re building internationalized code and wondering how to build a whitelist filter that will support all the different character sets you’re planning to support. If you support more than ten, especially some of the larger East Asian sets, this might seem like an unwieldy or tricky process.
Well, luckily it’s easier than most people would think. Building a good input validation filter can be simplified with .Net’s GetUnicodeCategory. But use the CharUnicodeInfo.GetUnicodeCategory method from the System.Globalization namespace, as the other one on System.Char looks like it may become the secondary of the two.

With GetUnicodeCategory you can simply build a whitelist supporting the character categories you want to allow. So get away from thinking you have to write a regex filter and list out all the character ranges you want to allow in each character set. It’s much simpler than that!

The Unicode standard assigns every character to one of 30 general categories. They make sense too, for example Control (Cc), Lowercase Letter (Ll), Uppercase Letter (Lu), Math Symbol (Sm). So for example you might want to only allow letters, decimal digits, nonspacing marks, and dash and connector punctuation in your whitelist. This could be achieved with the following snippet:


// requires: using System.Globalization;
char cUntrustedInput = 'a'; // the untrusted user input (placeholder value)
UnicodeCategory cTestCategory = CharUnicodeInfo.GetUnicodeCategory(cUntrustedInput);
if (cTestCategory == UnicodeCategory.LowercaseLetter ||
    cTestCategory == UnicodeCategory.UppercaseLetter ||
    cTestCategory == UnicodeCategory.DecimalDigitNumber ||
    cTestCategory == UnicodeCategory.TitlecaseLetter ||
    cTestCategory == UnicodeCategory.OtherLetter ||
    cTestCategory == UnicodeCategory.NonSpacingMark ||
    cTestCategory == UnicodeCategory.DashPunctuation ||
    cTestCategory == UnicodeCategory.ConnectorPunctuation)
{
    // character looks safe, continue
}
else
{
    // character is not allowed, fail
}

Not too bad, eh?
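In practice you will be checking a whole string rather than a single char. Here is one way that could look, as a sketch: the class name and the IsAllowed method are made up for illustration, and the allowed set is just the eight categories from the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class CategoryWhitelist
{
    // The same eight categories as the single-character snippet.
    static readonly HashSet<UnicodeCategory> Allowed = new HashSet<UnicodeCategory>
    {
        UnicodeCategory.LowercaseLetter,
        UnicodeCategory.UppercaseLetter,
        UnicodeCategory.DecimalDigitNumber,
        UnicodeCategory.TitlecaseLetter,
        UnicodeCategory.OtherLetter,
        UnicodeCategory.NonSpacingMark,
        UnicodeCategory.DashPunctuation,
        UnicodeCategory.ConnectorPunctuation,
    };

    // Returns true only if every UTF-16 code unit falls in an allowed category.
    public static bool IsAllowed(string untrustedInput)
    {
        if (untrustedInput == null)
        {
            return false;
        }
        foreach (char c in untrustedInput)
        {
            if (!Allowed.Contains(CharUnicodeInfo.GetUnicodeCategory(c)))
            {
                return false;
            }
        }
        return true;
    }

    static void Main()
    {
        Console.WriteLine(IsAllowed("user_name-42")); // True
        Console.WriteLine(IsAllowed("日本語"));        // True (OtherLetter)
        Console.WriteLine(IsAllowed("<script>"));     // False ('<' is MathSymbol)
    }
}
```

Because this walks the string one char at a time, any character outside the Basic Multilingual Plane shows up as two surrogates and gets rejected. If you need those characters, the GetUnicodeCategory(string, int) overload is the place to start.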