[Mono-list] ASCII bytes to string?
Jonathan Pryor
jonpryor at vt.edu
Thu Jan 10 19:40:40 UTC 2013
On Jan 10, 2013, at 1:28 PM, mickeyf <mickey at thesweetoasis.com> wrote:
> The string itself displays as expected, but shows a length of twice the number of characters, as if String.Length is reporting the number of bytes (UTF16) rather than the number of Unicode characters in the string.
In all likelihood, the string contains non-printable characters. Consider this `csharp` snippet:
csharp> var b = new byte[]{(byte) 'a', (byte) 'b', 0, 0, 0, 0};
csharp> var s = System.Text.Encoding.UTF8.GetString(b);
csharp> s.Length;
6
csharp> s;
"ab"
So this is more or less exactly what you're describing; `s` _clearly_ has two characters, yet s.Length is 6!
Except `s` doesn't have two characters:
csharp> s[3];
'\x0'
There's some null data in there: the source byte array `b` contains null bytes, and System.String can hold ASCII NUL characters just like any other character. They just don't show up when the string is displayed.
You can confirm or deny this by checking what `buffFromDrv` actually contains, and seeing whether it has any non-printable data (e.g. ASCII NUL).
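One way to do that check, as a rough sketch: dump the buffer as hex and list the positions of anything outside the printable ASCII range. The `BufferInspector` class and `NonPrintableIndexes` helper are names made up for this example, and the 0x20..0x7E range assumes the driver is supposed to be sending plain ASCII text.

```csharp
using System;
using System.Linq;

static class BufferInspector
{
    // Hypothetical helper: returns the indexes, within the first `count` bytes,
    // of bytes that are not printable ASCII (i.e. outside 0x20..0x7E).
    public static int[] NonPrintableIndexes(byte[] buffer, int count)
    {
        return Enumerable.Range(0, count)
                         .Where(i => buffer[i] < 0x20 || buffer[i] > 0x7E)
                         .ToArray();
    }

    static void Main()
    {
        // Same bytes as in the csharp snippet above.
        var b = new byte[] { (byte) 'a', (byte) 'b', 0, 0, 0, 0 };

        // Hex dump of the whole buffer.
        Console.WriteLine(BitConverter.ToString(b));      // 61-62-00-00-00-00

        // Positions of the non-printable bytes.
        var bad = NonPrintableIndexes(b, b.Length);
        Console.WriteLine(string.Join(",", bad.Select(i => i.ToString()).ToArray()));  // 2,3,4,5
    }
}
```

If the list of indexes is non-empty, the "extra" length is coming from those bytes.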
Assuming that's the case, what you need to do is not convert "extra" data:
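byte[] buffFromDrv = new byte [BIG_ENOUGH];
// Read() returns how many bytes were actually stored, starting at readPosition.
int bytesRead = stream.Read(buffFromDrv, readPosition, bytesToRead);
// Convert only those bytes, not the unused remainder of the buffer.
string s = System.Text.Encoding.UTF8.GetString(buffFromDrv, readPosition, bytesRead);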
Or for the above `csharp` snippet:
csharp> var s = System.Text.Encoding.UTF8.GetString(b, 0, 2);
csharp> s;
"ab"
csharp> s.Length;
2
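If the byte count isn't available and the data is NUL-terminated or NUL-padded (an assumption about the driver here, not something I know), another option is to stop at the first NUL byte. A minimal sketch, where `NulTerminated` and `DecodeUpToNul` are made-up names for illustration:

```csharp
using System;
using System.Text;

static class NulTerminated
{
    // Hypothetical helper: decodes the bytes before the first NUL (0x00),
    // or the whole buffer if it contains no NUL at all.
    public static string DecodeUpToNul(byte[] buffer)
    {
        int end = Array.IndexOf(buffer, (byte) 0);
        if (end < 0)
            end = buffer.Length;
        return Encoding.UTF8.GetString(buffer, 0, end);
    }

    static void Main()
    {
        var b = new byte[] { (byte) 'a', (byte) 'b', 0, 0, 0, 0 };
        string s = DecodeUpToNul(b);
        Console.WriteLine(s);         // ab
        Console.WriteLine(s.Length);  // 2
    }
}
```

Using the count returned by Read() is still preferable when you have it, since it doesn't depend on what the padding bytes happen to be.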
> The documentation for string.length says "number of characters", not "number of bytes",
It's actually neither; String.Length is the number of UTF-16 "code units" stored in the string. This is _not_ the number of "characters" ("code points"), because a code point may require the use of a "surrogate pair", in which case it will take up two `char` values within the string:
http://en.wikipedia.org/wiki/UTF-16
(Normally you don't need to care about this, except when you do...)
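To make the code unit vs. code point difference concrete, here is a small sketch. U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane, so it needs a surrogate pair. `LengthDemo` and `CountCodePoints` are illustrative names, not framework APIs; the counting loop treats each valid surrogate pair as a single code point.

```csharp
using System;

static class LengthDemo
{
    // Hypothetical helper: counts code points by treating each valid
    // surrogate pair (two `char` values) as one.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
                i++;    // skip the low surrogate of the pair
            count++;
        }
        return count;
    }

    static void Main()
    {
        // "a" followed by U+1D11E, which is stored as two UTF-16 code units.
        string s = "a" + char.ConvertFromUtf32(0x1D11E);
        Console.WriteLine(s.Length);            // 3 (code units)
        Console.WriteLine(CountCodePoints(s));  // 2 (code points)
    }
}
```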
- Jon