[Mono-bugs] [Bug 55578][Wis] New - SeekableStreamReader in mcs/support.cs mixes byte and character offsets

bugzilla-daemon@bugzilla.ximian.com bugzilla-daemon@bugzilla.ximian.com
Sun, 14 Mar 2004 21:08:32 -0500 (EST)


Please do not reply to this email. If you want to comment on the bug, go to the
URL shown below and enter your comments there.

Changed by gustavo.giraldez@gmx.net.

http://bugzilla.ximian.com/show_bug.cgi?id=55578

--- shadow/55578	2004-03-14 21:08:32.000000000 -0500
+++ shadow/55578.tmp.18524	2004-03-14 21:08:32.000000000 -0500
@@ -0,0 +1,50 @@
+Bug#: 55578
+Product: Mono: Compilers
+Version: unspecified
+OS: 
+OS Details: 
+Status: NEW   
+Resolution: 
+Severity: 
+Priority: Wishlist
+Component: C#
+AssignedTo: mono-bugs@ximian.com                            
+ReportedBy: gustavo.giraldez@gmx.net               
+QAContact: mono-bugs@ximian.com
+TargetMilestone: ---
+URL: 
+Cc: 
+Summary: SeekableStreamReader in mcs/support.cs mixes byte and character offsets
+
+As stated in the summary, changing the Position property of the stream works
+incorrectly for UTF-8 files in general.  The code assumes a one-to-one
+correspondence between characters and bytes, which does not hold for UTF-8
+as soon as the file contains non-ASCII characters.
+
+Attached is a sample file (taken from MonoDevelop) which can't be parsed:
+
+gustavo@deimos:~/test/mono/stream$ mcs --parse -codepage:utf8
+CombineBrowserNode.cs
+syntax error, got token `CLOSE_PARENS', expecting BASE BOOL BYTE CHAR
+CHECKED DECIMAL DELEGATE DOUBLE FALSE FLOAT INT LONG NEW NULL OBJECT SBYTE
+SHORT SIZEOF STRING THIS TRUE TYPEOF UINT ULONG UNCHECKED USHORT VOID
+OPEN_PARENS TILDE BANG LITERAL_INTEGER LITERAL_FLOAT LITERAL_DOUBLE
+LITERAL_DECIMAL LITERAL_CHARACTER LITERAL_STRING IDENTIFIER
+Mono.CSharp.yyParser.yyException: irrecoverable syntax error
+in <0x007cb> Mono.CSharp.CSharpParser:yyparse (Mono.CSharp.yyParser.yyInput)
+in <0x00079> Mono.CSharp.CSharpParser:parse ()
+ 
+CombineBrowserNode.cs(38) error CS8025: Parsing error
+Compilation failed: 1 error(s), 0 warnings
+
+Without the codepage switch (i.e. with latin1 encoding) it parses perfectly.
+
+The problem occurs because the tokenizer gets/sets the stream Position when
+it needs to disambiguate a close parenthesis (cs-tokenizer.cs:408).  In the
+sample file, that operation happens exactly at a SeekableStreamReader
+cache boundary (i.e. 1024) and thus a file stream reposition is needed.
+
+The second attached file is a patch which fixes this problem.
+buffer_start, buffer_size and the Position property are still byte offsets,
+and char_count is introduced to keep the number of valid chars in buffer[].
+The stream encoding is used to map char-to-byte offsets.
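
The char-to-byte mapping described above can be illustrated with a minimal
sketch. It is written in Python rather than C# and is not the attached patch.
The helper name char_to_byte_offset and the sample text are hypothetical;
only buffer_start mirrors a name from the report. The idea shown is that the
byte position of a character is the byte offset of the buffer start plus the
encoded length of the characters that precede it.

```python
# Illustrative sketch only: not the code from mcs/support.cs or the patch.

def char_to_byte_offset(buffer_start, chars, char_pos, encoding="utf-8"):
    """Return the byte offset in the underlying stream of chars[char_pos].

    buffer_start is the byte offset at which chars[0] begins (hypothetical
    stand-in for the reader's buffer_start field).
    """
    # Encode only the characters before char_pos and count the bytes.
    return buffer_start + len(chars[:char_pos].encode(encoding))


# Sample source text with two non-ASCII characters (2 bytes each in UTF-8).
text = 'x = "áé";'
data = text.encode("utf-8")

char_pos = text.index(";")                       # character offset
byte_pos = char_to_byte_offset(0, text, char_pos)  # matching byte offset

# Using the character offset as a byte offset lands inside the bytes of 'é';
# the mapped offset lands exactly on ';'.
naive_tail = data[char_pos:]
mapped_tail = data[byte_pos:]
```

With latin1 every character is one byte, so the two offsets coincide, which
is consistent with the file parsing correctly without the codepage switch.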