[Mono-bugs] [Bug 551615] Korean text (cp949) cannot be decoded

Mon Nov 2 17:05:20 EST 2009

http://bugzilla.novell.com/show_bug.cgi?id=551615

User greg.smolyn at strangeloopnetworks.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=551615#c6

--- Comment #6 from Greg Smolyn <greg.smolyn at strangeloopnetworks.com>  2009-11-02 15:05:15 MST ---
Ok, I have found where the bug is.

Decoder.Convert() uses an interesting mechanism for determining how many
characters it has decoded.  The current method goes something like this:

- look at byteArray, startingIndex, and the count of bytes to scan
- call GetCharCount() for that entire block of bytes
- if there are more chars in that block than the number of chars we actually
want to convert, right-shift the # of bytes to scan by 1 (i.e. halve it)
- repeat until chars-for-our-current-blocksize <= chars-we-want
- given the new parameters, actually do a GetChars(), since we now have the
right byte block size.
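
The steps above can be sketched roughly as follows. This is only a Python
simulation of the loop as I understand it, not Mono's actual code: it uses
Python's own cp949 codec, and get_char_count() and halving_convert() are
made-up stand-ins for GetCharCount() and Convert(). The input here is all
double-byte chars, which is the case where the halving happens to work.

```python
import codecs


def get_char_count(data, index, count):
    """Stand-in for GetCharCount(): chars a fresh decoder yields from the slice.

    final=False means a dangling lead byte at the end yields no char.
    """
    decoder = codecs.getincrementaldecoder("cp949")()
    return len(decoder.decode(data[index:index + count], final=False))


def halving_convert(data, index, count, chars_wanted):
    """Halve the byte count until the block holds <= chars_wanted chars."""
    tried = [count]
    while get_char_count(data, index, count) > chars_wanted:
        count >>= 1            # the "bit-shift by 1" step
        tried.append(count)
    decoder = codecs.getincrementaldecoder("cp949")()
    chars = decoder.decode(data[index:index + count], final=False)
    return count, chars, tried


# Four double-byte chars (8 bytes), and we want only 1 char.
data = "한글한글".encode("cp949")
bytes_used, chars, tried = halving_convert(data, 0, len(data), 1)
print(bytes_used, repr(chars), tried)
```

Note that the block size only ever lands on count, count/2, count/4, and so
on, so it can only match a real character boundary by luck.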

This fails under the following scenario:
- chars-we-want is 1
- the byte array contains [ single-byte char, double-byte char, ... ]
(there might be an extra stipulation about odd numbers?)

What happens?

For example, say you have 1 ASCII char followed by 2 double-byte chars.
You get a startingIndex of 0 and a count of 5 bytes to scan.
That's 3 chars, but we only want 1, so we bit-shift and our new count of bytes
to scan is 2 (5 >> 1).
We repeat, and GetCharCount() says there is only 1 character in the first 2
bytes.
That is <= the # of chars we want, so we convert and exit.  However, we report
the # of bytes used as 2, since we think there was 1 char made up of 2 bytes.

However, it wasn't a double-byte char: it was the 1-byte ASCII char, and we
counted the first byte of the following double-byte char as part of it.
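
To make the consequence concrete, here is a small Python check. It is an
illustration only: it uses Python's cp949 codec rather than Mono's decoder,
the sample string is a made-up input, and it assumes that the caller resumes
at the reported offset with no decoder state carried over.

```python
# 1 ASCII byte followed by 2 double-byte chars, as in the example above.
data = "A한글".encode("cp949")
print(len(data))

reported_bytes_used = 2   # what the halving loop reports (assumed, per above)
actual_bytes_used = 1     # the ASCII char really occupies a single byte

# Resuming at the correct offset decodes the two Korean chars cleanly.
good = data[actual_bytes_used:].decode("cp949")

# Resuming at the reported offset starts on a trail byte, so the rest of the
# stream is misaligned; errors="replace" just keeps Python from raising.
bad = data[reported_bytes_used:].decode("cp949", errors="replace")

print(repr(good))
print(repr(bad))
```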

I'm not really sure what a good fix would be for this.  Ultimately it looks to
me like Convert() really should just try to convert one character at a time,
instead of doing the strange GetCharCount() calls and using a log(n) search to
guess how many bytes to convert.  As it stands, there is a workaround of only
feeding the decoder 1 byte at a time, which will probably also be more
performant when trying to get 1 character at a time out of the decoder.
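
A rough sketch of the one-byte-at-a-time idea, again in Python against
Python's cp949 codec and not a proposed Mono patch. convert_linear() is a
hypothetical name, and it creates a fresh decoder per call, which assumes
each call starts on a character boundary.

```python
import codecs


def convert_linear(data, index, count, chars_wanted):
    """Feed bytes one at a time and stop once enough chars have come out.

    Returns (bytes_used, chars). bytes_used always ends on a char boundary
    unless the input itself runs out in the middle of a char.
    """
    decoder = codecs.getincrementaldecoder("cp949")()
    out = []
    used = 0
    for offset in range(index, index + count):
        if len(out) >= chars_wanted:
            break
        # A lone lead byte yields "" and is buffered until its trail byte.
        out.extend(decoder.decode(data[offset:offset + 1], final=False))
        used += 1
    return used, "".join(out)


# Pull chars out one at a time from the failing input described above.
data = "A한글".encode("cp949")
index = 0
steps = []
while index < len(data):
    used, chars = convert_linear(data, index, len(data) - index, 1)
    steps.append((used, chars))
    index += used
print(steps)
```

This is linear in the number of bytes consumed and never has to guess a
block size, so the reported byte count cannot split a double-byte char.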

I'm happy to attempt a patch, but if I could get some input as to what the
preferred course of action would be, or if someone wants to discuss the design
of this with me so we can come up with the right fix, I'd be very grateful.
