[Mono-list] Regular expressions help
Loren Bandiera
lorenb at mmgsecurity.com
Tue Oct 11 14:51:29 EDT 2005
I'm trying to write a parser for the Mozilla/Firefox history file. The
format of the file is very ugly.
The file starts off with a comment marking the version:
// <!-- <mdb:mork:z v="1.4"/> -->
Then you get the table that defines what the fields mean:
< <(a=c)> // (f=iso-8859-1)
(8A=Typed)(8B=LastPageVisited)(80=ns:history:db:row:scope:history:all)
(81=ns:history:db:table:kind:history)(82=URL)(83=Referrer)
(84=LastVisitDate)(85=FirstVisitDate)(86=VisitCount)(87=Name)
(88=Hostname)(89=Hidden)>
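As a rough illustration of how the cells in that table could be turned into a lookup, here is a minimal sketch. The names MorkCells and Parse are made up for this example, and it assumes every key is a hex id and that no value contains an unescaped ")". It is meant to be fed only the "(id=name)" cells, not the surrounding "< <(a=c)> //" wrapper.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any existing library.
public static class MorkCells
{
    // One "(key=value)" cell. Assumption: the key is a hex id and the
    // value runs up to the next ')' with no escaped parentheses inside.
    static readonly Regex Cell = new Regex(@"\(([0-9A-Fa-f]+)=([^)]*)\)");

    // Maps each id to its text, e.g. "82" -> "URL".
    public static Dictionary<string, string> Parse(string cells)
    {
        var map = new Dictionary<string, string>();
        foreach (Match m in Cell.Matches(cells))
            map[m.Groups[1].Value] = m.Groups[2].Value;
        return map;
    }
}
```

Calling MorkCells.Parse ("(82=URL)(83=Referrer)") would give a dictionary where "82" maps to "URL" and "83" maps to "Referrer".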
After that you start to get into the data:
<(80=http://www.google.com/)(9442=1128471102815097)
(81=1124134918854512)(82=www.google.com)(83
=L$00o$00r$00e$00n$00 $00B$00a$00n$00d$00i$00e$00r$00a$00$19 s$00
$00w$00e\
$00b$00l$00o$00g$00)(919E=171)(86=1)>
This is where I start to run into problems. I want to extract that block
of data which appears to be in the format:
<(key=value)(...repeating pattern...)>
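To make that shape concrete, here is a minimal sketch of a pattern that only accepts a "<", one or more complete "(...)" cells, and a closing ">". MorkBlocks and Extract are hypothetical names. The sketch assumes that a backslash inside a cell escapes the next character (so "\)" does not end the cell, and a backslash at the end of a line continues the value). That assumption has not been checked against a full history file.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any existing library.
public static class MorkBlocks
{
    // "<", optional whitespace, one or more "(...)" cells, then ">".
    // Inside a cell: any character except '\' and ')', or a backslash
    // followed by any character (Singleline lets '.' match a newline).
    static readonly Regex Block = new Regex(
        @"<\s*((?:\((?:[^\\)]|\\.)*\)\s*)+)>",
        RegexOptions.Singleline);

    // Returns the cell text of every block found, in file order.
    public static List<string> Extract(string morkData)
    {
        var blocks = new List<string>();
        foreach (Match m in Block.Matches(morkData))
            blocks.Add(m.Groups[1].Value);
        return blocks;
    }
}
```

Because each cell has to be closed before the ">" is accepted, a match cannot run past the end of its own block the way a bare ".+?" can. Note that this would also return the small "<(a=c)>" block from the key table header, so that one would need to be skipped by the caller.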
I read the file into a string and then get rid of the first line comment.
Next I use the following Regex to get the key table:
Regex keyTable = new Regex (@"\s*<\(a=c\)>\s*(?:\/\/)?\s*(\(.+?\))\s*>",
RegexOptions.Compiled | RegexOptions.Singleline);
m = keyTable.Match (morkData);
I can then use the Match and parse the table fine. The next thing I do is
create a substring that runs from where the key table ends to the end of
the data I read from the file.
I then use the following Regex to pull out the value table:
Regex valueTable = new Regex (@"<\s*(\(.+?\))\s*>",
RegexOptions.Compiled | RegexOptions.Singleline);
sub = morkData.Substring (pos);
m = valueTable.Match (sub);
This doesn't work at all. I get a chunk of the data (around 3623 bytes)
but I'm expecting more like 800,000. The strange thing is the last part of
the string I get back is:
"// <!-- <mdb:mork:z v="1.4"/> -->< <(a=c)>"
That shouldn't even be there. I'm not sure where that is coming from.
I get the same output on Mono as I do with MS.NET so it appears the
problem is something I'm doing.
I've tried looking at some of the other solutions to this problem to
see what they do:
http://www.jwz.org/hacks/mork.pl
http://off.net/~mhoye/moz/demork.py
But that didn't really help. Does anyone have any suggestions on how I can
extract that value data ("<(key=value)(...)>") from the string?
--
Loren Bandiera, CISSP <lorenb at mmgsecurity.com>
MMG Security, Inc.