[Mono-devel-list] culture info patches

Tue Mar 16 11:34:48 EST 2004

On 03/16/04 Jackson Harper wrote:
> TODO: 
> - Still not getting all the data into tables, lots of String.Emptys
> right now
> - Only generate the CultureMap in CultureInfo.cs for supported cultures

No CultureMap should be create at all, especially not one for each
supported culture. It's just lots of initialization code that needs to
be jitted and run every time a program starts. The data should
be in static arrays in the C runtime and an icall needs to be added
to do the lookup.

> CultureInfo.cs: Construct culture infos using the new from_lcid method
> unless an env var is set.

When the patch is ready, icu support for cultureinfo should be
just removed, no need to switch back with and env var.

> #ifndef MONO_METADATA_CULTURE_INFO_TABLES
> #define MONO_METADATA_CULTURE_INFO_TABLES 1
> 
> static const DateTimeFormatEntry datetime_format_entries [] = {
> {{"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"}, {"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", ""}, "", 0, "", {"Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"}, 0, "MMMM d, yyyy h:mm:ss a z", "MMMM d, yyyy", "h:mm:ss a z", "", {}, "", "dd/MM/yy", "h:mm a", "", ""},
> {{"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"}, {"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", ""}, "", 0, "", {"Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"}, 0, "MMMM d, yyyy h:mm:ss a z", "MMMM d, yyyy", "h:mm:ss a z", "", {}, "", "M/d/yy", "h:mm a", "", ""},
> {{"dom", "lun", "mar", "mi??", "jue", "vie", "s??b"}, {"ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic", ""}, "", 0, "", {"domingo", "lunes", "martes", "mi??rcoles", "jueves", "viernes", "s??bado"}, 0, "d' de 'MMMM' de 'yyyy HH:mm:ss z", "d' de 'MMMM' de 'yyyy", "HH:mm:ss z", "", {}, "", "dd/MM/yy", "H:mm", "", ""},
> {{"???", "???", "???", "???", "???", "???", "???"}, {""}, "", 0, "", {"?????????", "?????????", "?????????", "?????????", "?????????", "?????????", "?????????"}, 0, "yyyy'???'M'???'d'???' H:mm:ss:z", "yyyy'???'M'???'d'???'", "H:mm:ss:z", "", {}, "", "yy/MM/dd", "H:mm", "", ""}

I assume the names are in utf-8? I don't know if we can depend on all
the C compilers to do the right thing here, so it's probably better if
the no-ascii bytes are emitted with \x + hexadecimal value.

> static const CultureInfoEntry culture_entries [] = {
> {0x0009, "English", "English", "English", "eng", "en", -1, -1},
> {0x1009, "English (Canada)", "English (Canada)", "English (Canada)", "eng", "en", 0, 0},
> {0x0409, "English (United States)", "English (United States)", "English (United States)", "eng", "en", 1, 1},
> {0x000A, "Spanish", "Spanish", "espa??ol", "esp", "es", -1, -1},
> {0x0C0A, "Spanish (Spain)", "Spanish (Spain)", "espa??ol (Espa??a)", "esp", "es", 2, 2},
> {0x0011, "Japanese", "Japanese", "?????????", "?????????", "ja", -1, -1},
> {0x0411, "Japanese (Japan)", "Japanese (Japan)", "????????? (??????)", "?????????", "ja", 3, 3}

This array should likely be sorted on the locale ID.

> --- locales.c	29 Feb 2004 14:48:43 -0000	1.12
> +++ locales.c	16 Mar 2004 07:55:43 -0000
> @@ -19,6 +19,8 @@
[...]
> +static const CultureInfoEntry*
> +culture_info_from_lcid (const gint lcid)
> +{
> +	const CultureInfoEntry *cur_info = &culture_entries [0];
> +	
> +	while (cur_info->lcid != -1) {
> +		if (lcid == cur_info->lcid)
> +			return cur_info;
> +		cur_info++;
> +	}
> +	return NULL;
> +}

Why not do a binary search?

> +static MonoDateTimeFormatInfo*
> +create_datetime_format_info (MonoCultureInfo *ci, int dfidx)
> +{
> +	MonoDateTimeFormatInfo *df;
> +	MonoClass *class;
> +	MonoDomain *domain;
> +	const DateTimeFormatEntry dfe = datetime_format_entries [dfidx];

This code inserts a copy of the data structure (about 200 bytes): just
use a pointer to it.

> +	domain = mono_domain_get ();
> +	class = mono_class_from_name (mono_defaults.corlib,
> +			"System.Globalization", "DateTimeFormatInfo");
> +	df = (MonoDateTimeFormatInfo *) mono_object_new (domain, class);
> +	mono_runtime_object_init ((MonoObject *) df);

Do you really need to call the argumentless ctor? We control all the
code, so if possible we should avoid going to managed code here
and initialize what we need ourselves.

> +static MonoNumberFormatInfo*
> +create_number_format_info (MonoCultureInfo *ci, int nfidx)
> +{
> +	MonoNumberFormatInfo *nf;
> +	MonoClass *class;
> +	MonoDomain *domain;
> +	const NumberFormatEntry nfe = number_format_entries [nfidx];

Same structure copy issue.

> +	domain = mono_domain_get ();
> +	class = mono_class_from_name (mono_defaults.corlib,
> +			"System.Globalization", "NumberFormatInfo");
> +	nf = (MonoNumberFormatInfo *) mono_object_new (domain, class);
> +	mono_runtime_object_init ((MonoObject *) nf);

Idem, about switching to managed code.

> +void ves_icall_System_Globalization_CultureInfo_construct_internal_locale_from_lcid (MonoCultureInfo *this, gint lcid)
> +{
> +	const CultureInfoEntry *ci;
> +	MonoDomain *domain;
> +	
> +	MONO_ARCH_SAVE_REGS;
> +
> +	domain = mono_domain_get ();
> +	
> +	/**
> +	 * It would be faster to search a tree, but the overhead of creating
> +	 * the tree and locking might not be worth it. How many times to
> +	 * people lookup a CultureInfo normally?
> +	 *
> +	 * NOTE: Not using bsearch because it doesn't break on the first match.
> +	 */

I don't understand the comment. bsearch returns when it finds the entry
(though if there are multiple entries with the same key you may need to
check you're on the first one: that is not the case here). Maybe the
issue was with the cultures array not being sorted?

> +	ci = culture_info_from_lcid (lcid);
> +
> +	if(ci == NULL) {
> +		/* Locale for lcid not found */
> +		mono_raise_exception((MonoException *)mono_exception_from_name(
> +					     mono_defaults.corlib, "System", "SystemException"));

Try to avoid raising exceptions from C code: you should return an error
code to the managed code (or an already build exception) and throw the
exception from managed code.

> +	if (ci->datetime_format_index > 0)
> +		this->datetime_format = create_datetime_format_info (this, ci->datetime_format_index);
> +	if (ci->number_format_index > 0)
> +		this->number_format = create_number_format_info (this, ci->number_format_index);

Shouldn't the compare here be >= 0?

> #ifndef _MONO_METADATA_CULTURE_INFO_H_
> #define _MONO_METADATA_CULTURE_INFO_H_ 1
> 
> #define NUM_OF_DAYS 7
> #define MAX_NUM_MONTHS 13
> #define MAX_GROUP_SIZE 5
> 
> typedef struct {
> 	const gchar *abbreviated_day_names [NUM_OF_DAYS]; 
> 	const gchar *abbreviated_month_names [MAX_NUM_MONTHS];
> 	const gchar *am_designator;
> 	gint calendar_week_rule;
> 	const gchar *date_separator;
> 	const gchar *day_names [NUM_OF_DAYS]; 
> 	gint first_day_of_week;
> 	const gchar *full_date_time_pattern;
> 	const gchar *long_date_pattern;
> 	const gchar *long_time_pattern;
> 	const gchar *month_day_pattern;
> 	const gchar *month_names [MAX_NUM_MONTHS];
> 	const gchar *pm_designator;
> 	const gchar *short_date_pattern;
> 	const gchar *short_time_pattern;
> 	const gchar *time_separator;
> 	const gchar *year_month_pattern;
> } DateTimeFormatEntry;

Please, group the patterns together, the separators together etc.
Are we sure the number of date/time patterns is fixed? And do we really
care which is which (as in: which is short-time, short-date etc.)?

> typedef struct {
> 	gint currency_decimal_digits;
> 	const gchar *currency_decimal_separator;
> 	const gchar *currency_group_separator;
> 	const gint currency_group_sizes [MAX_GROUP_SIZE];
> 	gint currency_negative_pattern;
> 	gint currency_positive_pattern;
> 	const gchar *currency_symbol;
> 	const gchar *nan_symbol;
> 	const gchar *negative_infinity_symbol;
> 	const gchar *negative_sign;
> 	gint number_decimal_digits;
> 	const gchar *number_decimal_separator;
> 	const gchar *number_group_separator;
> 	const gint number_group_sizes [MAX_GROUP_SIZE];
> 	gint number_negative_pattern;
> 	gint percent_decimal_digits;
> 	const gchar *percent_decimal_separator;
> 	const gchar *percent_group_separator;
> 	const gint percent_group_sizes [MAX_GROUP_SIZE];
> 	gint percent_negative_pattern;
> 	gint percent_positive_pattern;
> 	const gchar *percent_symbol;
> 	const gchar *per_mille_symbol;
> 	const gchar *positive_infinity_symbol;
> 	const gchar *positive_sign;
> } NumberFormatEntry;

Here, too the fields should be grouped logically. The only reason to
spread the fields could be to save memory in the struct layout, but this
is not the case:-) For example, alternating an int and a pointer
is guaranteed to waste 4 bytes on 64 bit architectures because of
alignment padding. Using two ints at a multiple of a pointer offset
is fine, though. Also, if the integer values fit in smaller integer
types, those should be used. I anticipate with great confidence the
number of days in a week won't overflow a byte in the next few years:-)

> typedef struct {
> 	gint lcid;
> 	const gchar *englishname;
> 	const gchar *displayname;
> 	const gchar *nativename;
> 	const gchar *iso3lang;
> 	const gchar *iso2lang;
> 	gint datetime_format_index;
> 	gint number_format_index;
> } CultureInfoEntry;

AFAIK cultures have a sort of parent-child relationship.
How do you take that into account? Maybe you can add a parent_lcid to
the structure.

> typedef struct {
> 	gint cur_lcid;
> 	gint seek_lcid;
> 	gint idx;
> } DisplayNameEntry;

What is this structure used for?

Also, please ignore the string interning issue, there is most likely
no need for it.

Thanks for working on this stuff.

lupus

-- 
-----------------------------------------------------------------
lupus at debian.org                                     debian/rules
lupus at ximian.com                             Monkeys do it better