[Mono-list] Regular expressions help

Loren Bandiera lorenb at mmgsecurity.com
Tue Oct 11 14:51:29 EDT 2005


I'm trying to write a parser for the Mozilla/Firefox history file. The
format of the file is very ugly.

The file starts off with a comment marking the version:

// <!-- <mdb:mork:z v="1.4"/> -->

Then you get the table that defines what the fields mean:

< <(a=c)> // (f=iso-8859-1)
  (8A=Typed)(8B=LastPageVisited)(80=ns:history:db:row:scope:history:all)
  (81=ns:history:db:table:kind:history)(82=URL)(83=Referrer)
  (84=LastVisitDate)(85=FirstVisitDate)(86=VisitCount)(87=Name)
  (88=Hostname)(89=Hidden)>

After that you start to get into the data:

<(80=http://www.google.com/)(9442=1128471102815097)
  (81=1124134918854512)(82=www.google.com)(83
    =L$00o$00r$00e$00n$00 $00B$00a$00n$00d$00i$00e$00r$00a$00$19 s$00
$00w$00e\
$00b$00l$00o$00g$00)(919E=171)(86=1)>

This is where I start to run into problems. I want to extract that block
of data which appears to be in the format:

<(key=value)(...repeating pattern...)>
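
To make that format concrete, here is a minimal sketch of splitting one such block into its (key=value) cells. It rests on an assumption about the format: that a backslash inside a value escapes the next character, so "\)" does not close the cell and a backslash at the end of a line (as in the sample above) is just a continuation. The class name MorkCells and the method Parse are hypothetical names used only for illustration, and the $00-style byte escapes are left undecoded.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class MorkCells
{
    // One (key=value) cell:
    //   \(\s*([^=)\s]+)\s*=   '(' then the key, with optional whitespace before '='
    //   ((?:\\.|[^\\)])*)     the value: escaped characters, or anything except '\' and ')'
    //   \)                    the unescaped ')' that closes the cell
    // Singleline lets '.' match a newline, so a backslash-newline pair is consumed
    // as part of the value (assumption: backslash escapes the next character).
    static readonly Regex Cell = new Regex(
        @"\(\s*([^=)\s]+)\s*=((?:\\.|[^\\)])*)\)",
        RegexOptions.Singleline);

    // Returns the cells of a block in the order they appear.
    // Values are returned raw; escapes are not decoded here.
    public static List<KeyValuePair<string, string>> Parse(string block)
    {
        List<KeyValuePair<string, string>> cells =
            new List<KeyValuePair<string, string>>();
        foreach (Match m in Cell.Matches(block))
        {
            cells.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value));
        }
        return cells;
    }
}
```

For example, MorkCells.Parse ("<(80=http://www.google.com/)(86=1)>") yields two cells: key "80" with value "http://www.google.com/", and key "86" with value "1". Note that this sketch would also pick up a cell that sits inside a // comment, such as (f=iso-8859-1) in the key table.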

I read the file into a string and then get rid of the first line comment.
Next I use the following Regex to get the key table:

Regex keyTable = new Regex (@"\s*<\(a=c\)>\s*(?:\/\/)?\s*(\(.+?\))\s*>",
   RegexOptions.Compiled | RegexOptions.Singleline);

m = keyTable.Match (morkData);

I can then use the Match and parse the table fine. The next thing I do is
create a substring starting from where the key table ends to the rest of
the data I read from the file.

I then use the following Regex to pull out the value table:

Regex valueTable = new Regex (@"<\s*(\(.+?\))\s*>",
   RegexOptions.Compiled | RegexOptions.Singleline);

sub = morkData.Substring (pos);
m = valueTable.Match (sub);

This doesn't work at all. I get a chunk of the data (around 3623 bytes)
but I'm expecting more like 800,000. The strange thing is that the last
part of the string I get back is:

"// <!-- <mdb:mork:z v="1.4"/> -->< <(a=c)>"

That shouldn't even be there. I'm not sure where that is coming from.

I get the same output on Mono as I do with MS.NET so it appears the
problem is something I'm doing.

I've tried looking at some of the other solutions to this problem to see
what they do:

http://www.jwz.org/hacks/mork.pl
http://off.net/~mhoye/moz/demork.py

But that didn't really help. Does anyone have any suggestions on how I can
extract that value data ("<(key=value)(...)>") from the string?
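
One direction that might be worth trying, given here as a sketch under assumptions rather than a confirmed fix: instead of a lazy ".+?" between "<" and ">", spell the group out as "<", one or more complete cells, then ">". A ")" or ">" inside a value then cannot end the match early, because the match can only close right after a whole cell. The sketch assumes that a backslash escapes the next character inside a cell. MorkGroups and Extract are hypothetical names; Regex.Match (string, int) is the standard overload that starts searching at a given offset, which avoids the Substring copy.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class MorkGroups
{
    // '<', optional whitespace, one or more complete "(...)" cells
    // (each optionally followed by whitespace), then '>'.
    // Inside a cell, "\\." consumes a backslash plus the character it escapes,
    // so an escaped ')' does not close the cell (assumption about the format).
    static readonly Regex Group = new Regex(
        @"<\s*((?:\((?:\\.|[^\\)])*\)\s*)+)>",
        RegexOptions.Singleline);

    // Returns the text between '<' and '>' for every group found
    // at or after startAt, in order of appearance.
    public static List<string> Extract(string data, int startAt)
    {
        List<string> groups = new List<string>();
        for (Match m = Group.Match(data, startAt); m.Success; m = m.NextMatch())
        {
            groups.Add(m.Groups[1].Value);
        }
        return groups;
    }
}
```

Calling MorkGroups.Extract (morkData, pos) with the offset where the key table ends would return each "<(key=value)(...)>" group as one string, and each of those could then be split into its individual cells.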

-- 
Loren Bandiera, CISSP <lorenb at mmgsecurity.com>
MMG Security, Inc.



