[MonoDevelop] Spell checking in monodevelop
oliver.stieber at ukplc.net
Mon Mar 30 08:05:33 EDT 2009
I checked out all the open source spell checkers over the weekend, and hunspell seems to be the one everyone is using. The names of the files in the project looked promising, so I thought that improving it wouldn't be too difficult. As it turns out, after looking through the code and dictionary files, hunspell is actually quite a poor spell checker unless you've missed out a letter or jumbled your letters up, in which case it's quite good.
Anyhow, back to the point: improving the spell checker to a level that would put it on a par with or better than Google (via pattern matching, reinforced training based on real-world spelling mistakes, quite a bit of stats, etc.) isn't actually that hard code-wise (I only have one pattern matching algorithm to find and I know exactly what I'm doing). The problem is that it's going to take quite a bit of training to get anywhere near the level where you could call it a 'proper' spell checker, because there are no phonetic dictionaries to use as a base data set. Even if I could find the data, I don't feel like compiling it for all the languages hunspell supports, especially when turning a word into its correct phonetic form isn't that easy.
My approach, which will be a lot better in the long term for the spell checker's ability to suggest the correct spelling and put it at the top of the list, is to write a framework in which the spell checker learns 'spells like' pseudo-phonetics. That lets it come up with a very high-ranking word that should be the correct spelling, based on any similar spelling mistake made in the past, and even on mistakes that are similar to the errors directly referenced by the pseudo-phonetics.
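A minimal sketch of how learning 'spells like' patterns from past corrections might look. This is an illustration only, not the framework described in this email: the names `diff_pattern` and `PatternSpeller` are hypothetical, and the simple "strip the common prefix and suffix" rule stands in for whatever pattern matching algorithm is actually intended.

```python
from collections import Counter


def diff_pattern(wrong, right):
    """Return the (wrong_part, right_part) left over after stripping
    the prefix and suffix that both spellings share."""
    shortest = min(len(wrong), len(right))
    start = 0
    while start < shortest and wrong[start] == right[start]:
        start += 1
    end = 0
    while end < shortest - start and wrong[-1 - end] == right[-1 - end]:
        end += 1
    return wrong[start:len(wrong) - end], right[start:len(right) - end]


class PatternSpeller:
    """Hypothetical learner: counts how often one letter group was
    corrected to another, then reuses those rewrites on new words."""

    def __init__(self, dictionary):
        self.dictionary = set(dictionary)
        self.patterns = Counter()

    def learn(self, wrong, right):
        # Record one observed correction, e.g. "fonetic" -> "phonetic".
        self.patterns[diff_pattern(wrong, right)] += 1

    def suggest(self, word):
        # Apply every learned rewrite at every position; keep results that
        # are dictionary words, ranked by how often the rewrite was seen.
        scores = Counter()
        for (bad, good), count in self.patterns.items():
            if not bad:
                continue  # pure insertions are skipped to keep the sketch small
            pos = word.find(bad)
            while pos != -1:
                candidate = word[:pos] + good + word[pos + len(bad):]
                if candidate in self.dictionary:
                    scores[candidate] += count
                pos = word.find(bad, pos + 1)
        return [w for w, _ in scores.most_common()]


speller = PatternSpeller(["phase", "phonetic", "photograph", "night", "fight"])
speller.learn("fonetic", "phonetic")
speller.learn("fotograph", "photograph")
speller.learn("nite", "night")
print(speller.suggest("fase"))
print(speller.suggest("fite"))
```

Here the rewrite 'f' to 'ph', learned from two unrelated words, is enough to correct a third word the speller has never seen misspelled, which is the point of training on real mistakes rather than on a phonetic dictionary.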
The best bit is that I plan to have a centralized server as well as the client app. The server takes the data from everyone's spelling mistakes and turns it into a huge knowledge base of spelling mistake patterns, words not in the dictionary, and user profiles that can be pulled down to any machine with the spell checker on it, plus group dictionaries so that users can share words that shouldn't be in the main dictionary with everyone in their office. (This is provided they don't turn data collection off, in which case they're not going to be much better off than running hunspell, because they would need to pull a partial snapshot of the spelling database down from the server on first use.)
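As a rough sketch of the server side, assuming each client reports a simple count of (misspelling, correction) pairs: the functions `merge_reports` and `snapshot` are hypothetical names, and no wire format, storage, or user profile handling is implied.

```python
from collections import Counter


def merge_reports(reports):
    """Combine per-client counts of (misspelling, correction) pairs
    into one shared knowledge base."""
    total = Counter()
    for report in reports:
        total.update(report)  # adds counts rather than replacing them
    return total


def snapshot(knowledge, limit):
    """Return the `limit` most frequent pairs, i.e. a partial snapshot
    that a new client could pull down on first use."""
    return dict(knowledge.most_common(limit))


client_a = {("teh", "the"): 3, ("recieve", "receive"): 1}
client_b = {("teh", "the"): 2}
knowledge = merge_reports([client_a, client_b])
print(snapshot(knowledge, 1))
```

Keeping only counts of mistake and correction pairs means the snapshot can be truncated to any size, at the cost of dropping rarer patterns.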
Anyhow, I think I can make a revolutionary spell checker; all I need is volunteers to use the spell checker to train it up a bit. My plan is to create a modified version of the Firefox spell checker (currently based on hunspell) as an initial prototype that will work no worse than Firefox's spell checker (because I'm partly basing this spell checker on hunspell's process) but will teach my spell checker the pseudo-phonetics it needs to do the job really well. I expect to have the finished spell checker (short of tuning some weights and thresholds from their arbitrary values) in about 3 weeks' time, with the Firefox plugin not long after that (I don't know XPI, so the Firefox plugin may take me a week or two to sort out).
Once I've got people up and running with the Firefox plugin and ironed out the inevitable bugs in the spell checker, I can start integration with MonoDevelop.
It should also be fairly easy to integrate with OpenOffice, as it also uses hunspell at the moment, so it's a case of copying over the new library, writing some screens to allow the user to control syncing with the server, and getting OpenOffice to send spelling corrections back to the spell checker to do the reinforced training.
As you can tell from my spelling in this email, a really good spell checker is near the top of my Christmas list.