[Mono-dev] [PATCH] Use Unicode argv on Windows
Kornél Pál
kornelpal at gmail.com
Tue Apr 1 04:31:27 EDT 2008
Hi,
The main problem is that Windows and Linux took two different approaches to
implementing Unicode support.
Windows has two sets of APIs: Unicode (UTF-16; before Windows 2000 it was
UCS-2) and ANSI (using the system default code page, which can represent only
a subset of Unicode and is usually different from the OEM (DOS) code page).
Linux, on the other hand, tends to use UTF-8 instead of the old code pages but
places fewer restrictions on the code page being used.
If you use char* on Linux you get full Unicode support through UTF-8, and
if you are aware that other encodings may be used as well (as is the case for
Mono) you may try to convert them to UTF-8 like mono_utf8_from_external
does.
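
The idea of "accept UTF-8 as-is, otherwise convert from another encoding" can
be sketched in portable C. This is only an illustration and not Mono's actual
code: utf8_from_external_sketch is a hypothetical name, and ISO-8859-1 is
assumed as the single fallback encoding, whereas Mono consults the list in
MONO_EXTERNAL_ENCODINGS.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if s is well-formed UTF-8 (rejects overlong forms,
 * surrogates and code points above U+10FFFF), 0 otherwise. */
static int is_valid_utf8(const unsigned char *s)
{
    while (*s) {
        unsigned int cp;
        int n;

        if (*s < 0x80) { s++; continue; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; n = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; n = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; n = 3; }
        else return 0;
        s++;

        for (int i = 0; i < n; i++, s++) {
            /* A missing continuation byte (including NUL) is an error. */
            if ((*s & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (*s & 0x3F);
        }
        if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800) ||
            (n == 3 && cp < 0x10000))
            return 0;                          /* overlong encoding */
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return 0;                          /* not a scalar value */
    }
    return 1;
}

/* Hypothetical helper: returns a malloc'd UTF-8 copy of `in`.
 * Valid UTF-8 is copied unchanged; anything else is ASSUMED to be
 * ISO-8859-1 and transcoded. The caller frees the result. */
char *utf8_from_external_sketch(const char *in)
{
    const unsigned char *p = (const unsigned char *)in;
    /* Each Latin-1 byte expands to at most two UTF-8 bytes. */
    char *out = malloc(2 * strlen(in) + 1);
    unsigned char *q = (unsigned char *)out;

    if (out == NULL)
        return NULL;
    if (is_valid_utf8(p)) {
        strcpy(out, in);
        return out;
    }
    for (; *p; p++) {
        if (*p < 0x80) {
            *q++ = *p;
        } else {
            *q++ = (unsigned char)(0xC0 | (*p >> 6));
            *q++ = (unsigned char)(0x80 | (*p & 0x3F));
        }
    }
    *q = '\0';
    return out;
}
```

With this sketch, both the UTF-8 bytes "caf\xC3\xA9" and the Latin-1 bytes
"caf\xE9" come back as the same UTF-8 string. A real implementation would
need a configurable list of encodings, because a byte sequence that is not
UTF-8 cannot identify its own code page.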
The problem is that Windows (the NT kernel) uses UTF-16 internally and the ANSI
APIs are only wrappers around the Unicode APIs. If you use the ANSI APIs you
won't get the original data as char*; you get the Unicode data converted to
the system default ANSI code page. This means that many Unicode characters
will probably be lost. And because Mono later converts this back to UTF-16, I
see no reason to lose characters in an encoding round-trip.
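
To make the loss concrete, here is a small portable simulation of that
round-trip. It is a sketch under stated assumptions: ISO-8859-1 stands in for
the system default ANSI code page, unrepresentable characters are replaced
with '?', and the function names are hypothetical. Real Windows conversions
may also apply best-fit mappings, which this ignores.

```c
#include <stddef.h>
#include <stdint.h>

/* Narrow UTF-16 code units to a single-byte code page.
 * ASSUMPTION: ISO-8859-1 plays the role of the ANSI code page, so only
 * U+0000..U+00FF survive; everything else becomes '?'.
 * `out` must hold `len` bytes. Returns the number of replaced units. */
size_t utf16_to_codepage(const uint16_t *in, size_t len, unsigned char *out)
{
    size_t lost = 0;

    for (size_t i = 0; i < len; i++) {
        if (in[i] <= 0xFF) {
            out[i] = (unsigned char)in[i];
        } else {
            out[i] = '?';
            lost++;
        }
    }
    return lost;
}

/* Widen the single-byte data back to UTF-16. The '?' placeholders stay
 * '?': the original characters cannot be recovered.
 * `out` must hold `len` code units. */
void codepage_to_utf16(const unsigned char *in, size_t len, uint16_t *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = in[i];
}
```

For an argument such as U+00E9 U+65E5 U+672C followed by ".exe", the first
character and the extension survive the round-trip while the two CJK
characters come back as '?', so the resulting file name no longer matches the
file on disk.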
Ideally Mono should use UTF-16 on Windows, but that would need a lot of code
changes. With UTF-8 no characters are lost. The UTF-8 to UTF-16 round-trip
causes a little performance overhead, but I think Mono's main target is
Linux.
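
The lossless property can be checked with a minimal hand-written converter
pair. This is an illustrative sketch, not the glib routines the patch uses
(g_utf16_to_utf8 and friends); the function names and buffer contracts below
are assumptions made for the example, and the decoder trusts its input to be
well-formed UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode UTF-16 as UTF-8. `out` must hold at least 3 * len bytes.
 * Returns the number of bytes written, or (size_t)-1 if the input
 * contains an unpaired surrogate. */
size_t utf16_to_utf8(const uint16_t *in, size_t len, unsigned char *out)
{
    size_t o = 0;

    for (size_t i = 0; i < len; i++) {
        uint32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (i + 1 >= len || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            i++;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;          /* stray low surrogate */
        }

        if (cp < 0x80) {
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}

/* Decode UTF-8 back to UTF-16. ASSUMES well-formed input, such as the
 * output of utf16_to_utf8 above. `out` must hold at least `len` code
 * units. Returns the number of code units written. */
size_t utf8_to_utf16(const unsigned char *in, size_t len, uint16_t *out)
{
    size_t o = 0, i = 0;

    while (i < len) {
        unsigned char b = in[i++];
        uint32_t cp;
        int extra;

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else                         { cp = b & 0x07; extra = 3; }

        while (extra-- > 0 && i < len)
            cp = (cp << 6) | (in[i++] & 0x3F);

        if (cp >= 0x10000) {
            /* Split into a surrogate pair. */
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 + (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
        } else {
            out[o++] = (uint16_t)cp;
        }
    }
    return o;
}
```

Encoding the code units 0x00E9, 0x65E5, 0xD83D, 0xDE00 yields nine UTF-8
bytes, and decoding those bytes returns exactly the original four code units,
including the surrogate pair. The cost is one pass over the string in each
direction, which is the overhead mentioned above.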
glib takes UTF-8 on Windows and passes it as UTF-16 to the Windows API, which
makes Mono Unicode-aware on Windows as well.
Note, however, that libc (CRT) functions like fopen use the ANSI code page on
Windows, so Mono should not use them and should prefer glib.
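
For illustration, the kind of wrapper that avoids the ANSI code page looks
roughly like this. It is a sketch of the general technique, not glib's source;
fopen_utf8 is a hypothetical name. On Windows it widens the UTF-8 path with
MultiByteToWideChar and calls _wfopen; elsewhere it assumes the file system
accepts the bytes as given and calls fopen directly.

```c
#include <stdio.h>
#ifdef _WIN32
#include <stdlib.h>
#include <wchar.h>
#include <windows.h>
#endif

/* Hypothetical helper: open a file whose name is given in UTF-8. */
FILE *fopen_utf8(const char *path, const char *mode)
{
#ifdef _WIN32
    FILE *fp = NULL;
    /* With a source length of -1 the returned size includes the NUL. */
    int plen = MultiByteToWideChar(CP_UTF8, 0, path, -1, NULL, 0);
    int mlen = MultiByteToWideChar(CP_UTF8, 0, mode, -1, NULL, 0);
    wchar_t *wpath, *wmode;

    if (plen <= 0 || mlen <= 0)
        return NULL;
    wpath = malloc((size_t)plen * sizeof *wpath);
    wmode = malloc((size_t)mlen * sizeof *wmode);
    if (wpath != NULL && wmode != NULL &&
        MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath, plen) > 0 &&
        MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, mlen) > 0)
        fp = _wfopen(wpath, wmode);   /* UTF-16 API: no code page involved */
    free(wpath);
    free(wmode);
    return fp;
#else
    /* ASSUMPTION: non-Windows file systems take the bytes unchanged. */
    return fopen(path, mode);
#endif
}
```

Callers keep using char* everywhere and the conversion to UTF-16 happens only
at the API boundary, which is the same division of labour described for glib
above.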
Also note that on Linux the main assembly is loaded using the char*
specified in argv without any conversion, but arguments of icalls are
converted to UTF-8 and they load assemblies using UTF-8. Along with some
other encoding inconsistencies, this causes encoding conversion bugs in Mono
on both Linux and Windows.
I think using UTF-8 for char* would be the solution. Input data (argv)
should be converted to UTF-8 and output data (file names) should be
converted to MONO_EXTERNAL_ENCODINGS on Linux and UTF-16 on Windows.
Kornél
----- Original Message -----
From: "Robert Jordan" <robertj at gmx.net>
To: <mono-devel-list at lists.ximian.com>
Sent: Tuesday, April 01, 2008 12:55 AM
Subject: Re: [Mono-dev] [PATCH] Use Unicode argv on Windows
Hi Kornél,
I understand why you fixed it this way, but I think that "fixing"
strenc.c would produce less #ifdef clutter and it also has the
nice side effect of not breaking the embedding API :)
It's just a matter of setting MONO_EXTERNAL_ENCODINGS=default_locale
either with g_setenv or SetEnvironmentVariable in mini.c:mini_init()
for PLATFORM_WIN32.
Robert
Kornél Pál wrote:
> Hi,
>
> Currently mono.exe uses ANSI arguments that are encoded using system
> default code page (ACP). Mono however uses UTF-8 and tries to convert
> them using MONO_EXTERNAL_ENCODINGS.
>
> This patch takes the Unicode (UTF-16) command line arguments and
> converts them to UTF-8. This way there is no need to modify other code
> to use UTF-16 and the arguments still are in Unicode.
>
> I also made strenc.c non-Windows in this patch because
> MONO_EXTERNAL_ENCODINGS should not be used on Windows at all as it uses
> UTF-16 internally and if we really need UTF-8 then we should convert
> from UTF-16 rather than from ACP.
>
> I would prefer to move the argument conversion using mono_utf8_from_external
> to main.c as well, which would make the code cleaner, but that would require
> mono_runtime_run_main to be called with UTF-8 arguments. If that is
> acceptable, I'll include that modification in the patch as well.
>
> Kornél
>
> Index: mono/mono/metadata/object.c
> ===================================================================
> --- mono/mono/metadata/object.c (revision 99452)
> +++ mono/mono/metadata/object.c (working copy)
> @@ -2671,6 +2671,9 @@
> basename,
> NULL);
>
> +#ifdef PLATFORM_WIN32
> + utf8_fullpath = fullpath;
> +#else
> utf8_fullpath = mono_utf8_from_external (fullpath);
> if(utf8_fullpath == NULL) {
> /* Printing the arg text will cause glib to
> @@ -2684,19 +2687,27 @@
> }
>
> g_free (fullpath);
> +#endif
> g_free (basename);
> } else {
> +#ifdef PLATFORM_WIN32
> + utf8_fullpath = g_strdup (argv [0]);
> +#else
> utf8_fullpath = mono_utf8_from_external (argv[0]);
> if(utf8_fullpath == NULL) {
> g_print ("\nCannot determine the text encoding for the
> assembly location: %s\n", argv[0]);
> g_print ("Please add the correct encoding to
> MONO_EXTERNAL_ENCODINGS and try again.\n");
> exit (-1);
> }
> +#endif
> }
>
> main_args [0] = utf8_fullpath;
>
> for (i = 1; i < argc; ++i) {
> +#ifdef PLATFORM_WIN32
> + main_args [i] = g_strdup (argv [i]);
> +#else
> gchar *utf8_arg;
>
> utf8_arg=mono_utf8_from_external (argv[i]);
> @@ -2708,20 +2719,27 @@
> }
>
> main_args [i] = utf8_arg;
> +#endif
> }
> argc--;
> argv++;
> if (mono_method_signature (method)->param_count) {
> args = (MonoArray*)mono_array_new (domain,
> mono_defaults.string_class, argc);
> for (i = 0; i < argc; ++i) {
> +#ifdef PLATFORM_WIN32
> + gchar *str = argv [i];
> +#else
> /* The encodings should all work, given that
> * we've checked all these args for the
> * main_args array.
> */
> gchar *str = mono_utf8_from_external (argv [i]);
> +#endif
> MonoString *arg = mono_string_new (domain, str);
> mono_array_setref (args, i, arg);
> +#ifndef PLATFORM_WIN32
> g_free (str);
> +#endif
> }
> } else {
> args = (MonoArray*)mono_array_new (domain,
> mono_defaults.string_class, 0);
> Index: mono/mono/mini/main.c
> ===================================================================
> --- mono/mono/mini/main.c (revision 99452)
> +++ mono/mono/mini/main.c (working copy)
> @@ -1,8 +1,30 @@
> #include "mini.h"
>
> +#ifdef PLATFORM_WIN32
> +
> int
> +main ()
> +{
> + int argc;
> + wchar_t** wargv = CommandLineToArgvW (GetCommandLine (), &argc);
> + char** argv = g_new0 (char*, argc);
> + int i;
> +
> + for (i = 0; i < argc; ++i)
> + argv [i] = g_utf16_to_utf8 (wargv [i], -1, NULL, NULL, NULL);
> +
> + LocalFree (wargv);
> +
> + return mono_main (argc, argv);
> +}
> +
> +#else
> +
> +int
> main (int argc, char* argv[])
> {
> return mono_main (argc, argv);
> }
>
> +#endif
> +
> Index: mono/mono/utils/strenc.c
> ===================================================================
> --- mono/mono/utils/strenc.c (revision 99452)
> +++ mono/mono/utils/strenc.c (working copy)
> @@ -7,6 +7,9 @@
> * (C) 2003 Ximian, Inc.
> */
>
> +/* These methods should not be used on Windows as it uses UTF-16
> internally. */
> +#ifndef PLATFORM_WIN32
> +
> #include <config.h>
> #include <glib.h>
> #include <string.h>
> @@ -214,3 +217,5 @@
> return(utf8);
> }
>
> +#endif
> +
> Index: mono/msvc/mono.def
> ===================================================================
> --- mono/msvc/mono.def (revision 99452)
> +++ mono/msvc/mono.def (working copy)
> @@ -711,10 +711,7 @@
> mono_type_stack_size
> mono_type_to_unmanaged
> mono_unhandled_exception
> -mono_unicode_from_external
> -mono_unicode_to_external
> mono_upgrade_remote_class_wrapper
> -mono_utf8_from_external
> mono_valloc
> mono_value_box
> mono_value_copy
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Mono-devel-list mailing list
> Mono-devel-list at lists.ximian.com
> http://lists.ximian.com/mailman/listinfo/mono-devel-list