[Mono-bugs] [Bug 464128] char* should be in ANSI encoding when passed to C runtime rather than Unicode

bugzilla_noreply at novell.com bugzilla_noreply at novell.com
Thu Oct 14 06:39:01 EDT 2010


https://bugzilla.novell.com/show_bug.cgi?id=464128

https://bugzilla.novell.com/show_bug.cgi?id=464128#c5


Kornél Pál <kornelpal at gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
       InfoProvider|kornelpal at gmail.com         |

--- Comment #5 from Kornél Pál <kornelpal at gmail.com> 2010-10-14 10:38:59 UTC ---
Created an attachment (id=394922)
 --> (http://bugzilla.novell.com/attachment.cgi?id=394922)
Utf8AnsiConflictTest.cs

Although Jon is on the right track, the current bug report refers to native C
code rather than to marshaling in managed code. (For Jon's example, I think the
solution is to set UnixEncoding.Instance to Encoding.Default on Windows.)

A very important difference between Linux (and Unix) and Windows is that Linux
uses char* to represent strings while Windows uses wchar_t*. Windows interprets
wchar_t* as UTF-16 (earlier Windows versions used UCS-2, which predates
UTF-16).

The encoding of char* on Linux may vary by system, but most recent
distributions and installations use UTF-8. (File names, for example, may use
different encodings, which can cause problems, but that is another story.)
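To illustrate (a minimal sketch, assuming a glibc-based Linux system), the
charset that char* is expected to use under the current locale can be queried
with nl_langinfo:

#include <locale.h>
#include <langinfo.h>
#include <stdio.h>

int main(void)
{
    /* Adopt the user's locale settings; with the plain "C" locale the
     * reported charset would be ANSI_X3.4-1968 (ASCII) instead. */
    setlocale(LC_ALL, "");

    /* On most recent Linux distributions this prints "UTF-8". */
    printf("char* charset for this locale: %s\n", nl_langinfo(CODESET));
    return 0;
}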

Windows has a system setting referred to as the ANSI code page that specifies
what charset char* is encoded in. It is important to note that the ANSI code
page is never UTF-8; it is always a legacy, non-standard Microsoft code page
such as Windows-1252. (TextInfo.ANSICodePage has a nice database of the ANSI
code pages of locales.)
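For illustration, a minimal Windows-only sketch that queries the active ANSI
code page; on a Western European installation this typically prints 1252:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* GetACP() returns the system ANSI code page identifier,
     * e.g. 1252 (Windows-1252) on Western European installs. */
    UINT acp = GetACP();
    printf("ANSI code page: %u\n", acp);
    return 0;
}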

Furthermore, nothing on Windows is actually stored as char* (except the content
of text files). When you call an API that takes char*, the string gets
converted to wchar_t* using an ANSI to UTF-16 conversion. Even file names are
stored in Unicode on NTFS and VFAT.
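Conceptually, every narrow ...A entry point behaves roughly like the following
sketch (MyCreateFileA is a hypothetical name, not the real implementation): it
converts the char* argument from the ANSI code page to UTF-16 and forwards it
to the ...W function:

#include <windows.h>

/* Rough sketch of what a narrow-string API does internally: convert the
 * char* argument from the ANSI code page to UTF-16 and forward it to the
 * wide-character implementation. */
static HANDLE MyCreateFileA(const char *name, DWORD access, DWORD share,
                            LPSECURITY_ATTRIBUTES sa, DWORD disp,
                            DWORD flags, HANDLE tmpl)
{
    WCHAR wname[MAX_PATH];

    /* CP_ACP means "the system ANSI code page"; this is where a UTF-8
     * string passed by the caller gets misinterpreted. */
    if (!MultiByteToWideChar(CP_ACP, 0, name, -1, wname, MAX_PATH))
        return INVALID_HANDLE_VALUE;

    return CreateFileW(wname, access, share, sa, disp, flags, tmpl);
}

int main(void)
{
    HANDLE h = MyCreateFileA("test.txt", GENERIC_READ, FILE_SHARE_READ,
                             NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h != INVALID_HANDLE_VALUE)
        CloseHandle(h);
    return 0;
}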

Mono (its native C parts) mostly uses char* containing UTF-8, which is a very
good and portable design. The only problem is that it sometimes calls C runtime
functions. The char* type is the same, but Mono passes UTF-8 that the C runtime
interprets as ANSI and converts to UTF-16.

As long as you use ASCII you will not notice this problem, since ANSI code
pages as well as UTF-8 are usually ASCII compatible, so the result is the same.

If you use non-ASCII characters, however, the conversion is guaranteed to
corrupt strings.
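A minimal sketch of the corruption (the file name is hypothetical): the same
char* bytes name different files depending on whether they are read as UTF-8
or as the ANSI code page:

#include <stdio.h>

int main(void)
{
    /* "á.txt" encoded as UTF-8: the bytes 0xC3 0xA1 followed by ".txt". */
    const char *utf8_name = "\xC3\xA1.txt";

    /* On Windows, fopen interprets these bytes in the ANSI code page
     * (e.g. Windows-1252), so 0xC3 0xA1 is read as the two characters
     * "Ã¡" instead of the single character "á". The call "succeeds"
     * but creates a file with the wrong name. */
    FILE *f = fopen(utf8_name, "w");
    if (f != NULL)
        fclose(f);
    return 0;
}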

This may even lead to security problems, although I am not aware of any
specific security issue.

The attached Utf8AnsiConflictTest.cs shows that the external resource file hash
is generated incorrectly by SRE in Mono on Windows because of this encoding
mismatch.
The same test works fine with .NET on Windows. Note that this particular bug is
in
mono_sha1_get_digest_from_file: fopen is called, which expects an ANSI path,
but UTF-8 is passed. Because of another bug, no exception is generated; the
error is simply ignored and an invalid hash is written to the module.

This is a general problem (although most likely not a critical one) that is not
specific to fopen or to SRE. The solution is not to call any Windows API or CRT
function that takes char*. Instead, UTF-8 should be converted to UTF-16, and
the Windows API and CRT functions that take wchar_t* should be called.
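A minimal sketch of that approach (fopen_utf8 is a hypothetical helper, not an
existing Mono function): convert the UTF-8 path to UTF-16 with
MultiByteToWideChar(CP_UTF8, ...) and open the file with _wfopen instead of
fopen:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: open a file whose name is given in UTF-8 by
 * converting it to UTF-16 and calling the wide-character CRT function. */
static FILE *fopen_utf8(const char *utf8_path, const wchar_t *mode)
{
    FILE *file = NULL;

    /* First call: ask for the required buffer size (in wide chars,
     * including the terminating NUL because cbMultiByte is -1). */
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, NULL, 0);
    if (len <= 0)
        return NULL;

    wchar_t *wpath = malloc(len * sizeof(wchar_t));
    if (wpath == NULL)
        return NULL;

    /* Second call: do the actual UTF-8 to UTF-16 conversion. */
    if (MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, wpath, len) > 0)
        file = _wfopen(wpath, mode);

    free(wpath);
    return file;
}

int main(void)
{
    /* The non-ASCII path encoded as UTF-8 now keeps its correct name. */
    FILE *f = fopen_utf8("\xC3\xA1.txt", L"wb");
    if (f != NULL)
        fclose(f);
    return 0;
}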

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.
You are the assignee for the bug.


More information about the mono-bugs mailing list