[Mono-dev] UTF8 encoding/decoding

Gerardo García Peña killabytenow at gmail.com
Mon Apr 29 08:35:08 UTC 2013


Hello,

I am Gerardo García Peña and I'm new in this list.

Some months ago started working with mono in a project which demands a very
precise manipulation of UTF8 (and other encodings) streams. When I started
to write code I observed that the mono UTF8 implementation is very buggy,
while the MS.NET implementation is quite good. Then I started to isolate
the bugs and filled some bugs in the Ximian's bugzilla [1] [2]. They're
still there and unfixed, but I think they are important: an incompatibility
in the text codec subsystems virtually affects any application that need
portability between Microsoft and Mono platforms. Specially from the data
integrity point of view, and in some cases availability security issues
(indexes and counters reported by the conversion methods and throwed
exceptions could make apps running on the Mono environment to enter into
infinite loops, making apps running on the mono runtime vulnerable to DoS
attacks).

The bugs are still there (unresolved), and during this time I have found
some more, so I decided to start patching the UTF8 libraries (and in the
future, if this patch is accepted, I will continue working on other buggy
codec that appears).

The patch that I propose is an important modification of the file
/mono/mcs/class/corlib/System.Text/UTF8Encoding.cs and some minor changes
in other generic classes in System.Text. The targets of my patch are the
following:

  - give a complete and good quality UTF8 coder & decoder implementation,
  - at least it is as much efficient as the old implementation,
  - better error handling and quick resync when bad sequences are found,
  - fix the index field in the Fallback exceptions (it is a key feature if
one
    program want to handle strings with errors),
  - refactorize and make code more maintainable,
  - full compatibility with the .NET implementation (behaviour is exactly
the
    same in front of bad and good sequences),
  - complete some pending or incomplete features (MonoTODO) like
    Encoder::FallbackException::IsUnknownSurrogate() or use of BOM
    preambles.

Please note that in spite of presenting a full-compatible implementation of
this codec with the Microsoft implementation, my changes are not based on
Microsoft's work, and they are totally written from scratch. I have not
reversed any code and the behaviour of my patches has been tunned using an
extensive and exhaustive test case.

My test case uses several public UTF8 test cases and one specific and giant
UTF16 test case built automatically. The test case must be executed first
on the Mono runtime environment and once again on the Microsoft runtime.
The output of the test case are two directories (one for mono, another for
net) documenting the output of (and exceptions thrown) the Convert()
method. Once both executions are finished, it should not exist any
difference between the
two output directories.

The test case is focused only on the Convert() method because it allows to
test any variation of the input. My implementation (and probably
Microsoft's too) is based on two coder/decoder functions that are called by
all the other public methods. Because that reason the best way to test both
implementations is using the method that exposes more directly the internal
decoder/encoder.

I posted the changes and my test suite in a github branch, and I also have
attached them to this mail (if you want to test it quickly without doing
any git operation):

  - mono branch with my patches
    https://github.com/killabytenow/mono

  - test suite
    https://github.com/killabytenow/mono-System.Text.UTF8Encoding-test

To run the test suite, run the makefile and then execute the program
convert.exe in the two platforms. You'll get a 'cnvout-mono' and
'cnvout-other' directories which will contain the output of each test run.
Once they have finished run the 'mkdiff.sh' shell script. This script will
make a 'cnvout-diff' directory, which should be empty if all files are
equal.

I know that it is an important patch because it affects the corlib
libraries which are critical for the Mono runtime. If you have any question
or note about the code, or if I can do anything to improve this patch, I
will be glad to help.

Thanks in advance,
Gerardo García Peña

[1] 10692 https://bugzilla.xamarin.com/show_bug.cgi?id=10692
[2] 10697 https://bugzilla.xamarin.com/show_bug.cgi?id=10697
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ximian.com/pipermail/mono-devel-list/attachments/20130429/cf5f45ed/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mono-System.Text.UTF8Encoding-test.tar.gz
Type: application/x-gzip
Size: 52263 bytes
Desc: not available
URL: <http://lists.ximian.com/pipermail/mono-devel-list/attachments/20130429/cf5f45ed/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: utf8.patch
Type: application/octet-stream
Size: 63658 bytes
Desc: not available
URL: <http://lists.ximian.com/pipermail/mono-devel-list/attachments/20130429/cf5f45ed/attachment-0001.obj>


More information about the Mono-devel-list mailing list