Typo Detection


There is nothing worse than when you spend hours, days (weeks?) writing something and then you revisit it and find it peppered with dumb, and seemingly inescapable, typos.

This is one (a rare one). Right?

It really bugs me:

  1. If I see those typos in somebody else’s work I automatically think “sloppy”; and
  2. No matter how hard I try my brain just doesn’t see them when I am checking my own words. My brain reads what I meant to write, not what I actually wrote. Perhaps I should be more forgiving when I read other people’s stuff?

Sure, there are tools out there to help. Spell checking and Grammarly do their best, but still, things slip through. I am sure there are more options, and at some stage I will have a look, but at the moment I am kind of swept up in doing-it-my-way 😉

Here is what I have done:

  1. I downloaded all the posts from this blog (there are 108 excluding this one, and including lots of half-baked drafts I never published). You can do that in WordPress as a WXR file.
  2. The WXR format (XML) is not so well documented. There is a bit on it here, and I found this Python Gist that was the bare bones of what I wanted: it extracts all (my) written content out of the WXR file and strips out the metadata. Unfortunately, it’s in Python (not my strongest language, but I did a bit at uni so I know the workings) so I had to dust off that cobwebby corridor. I considered redoing it in Java but the XML tools in Java were woeful compared to Python’s stock ElementTree.
  3. The Gist only went halfway to what I was after though. I just wanted prose: no headings, WordPress shortcodes etc. Also, for my processing, I needed the sentences arranged one per line and all in one file. This Gist here is the outcome. It takes the file and creates a ‘pure’ text file for the next stage. The output was a file with 1968 sentences/lines, all written by me (32,200 words): my writing style in a nutshell (I hypothesise!).
  4. A while ago I played with the Stanford Parser, a natural language processor. As it was the only NLP option I knew of (then), I went back to Java and wrote something that parses each sentence in my text file from (3), tokenizes it and works out each word’s “Part-of-Speech” (POS). The Stanford Parser does more than that: it forms text into a tree structure that describes the relationships between words, clauses, sentences etc. The problem I found was that the Parser spits the dummy with large text files of ‘prose’. This kind of makes sense, as you are asking the algorithm to essentially make a tree (and so make ‘sense’) of the whole body of text. The Stanford Parser also takes its time… I used multithreading to speed things up but it still takes 5 minutes to chomp through my 1968 sentences.
  5. So, I end up with a little tree (shrub?) for each sentence, but the only thing I use is the POS assigned to each word. I go through the trees/sentences and create a transition matrix. For example, every time a noun follows a verb, the verb-to-noun cell in the 36 x 36 matrix gets incremented by one. I then form a probability transition matrix from this by dividing each cell value by the total number of transitions. This is all in Java. I plonked the output into Google Sheets.
  6. Just to visualise the outcome I took to R and made the heat map below. The heat map in itself, admittedly, really isn’t that insightful, but I am pretty pleased with it, mainly because it meant picking up my programming again.
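
Steps 1 to 3 above boil down to something like the following. This is a minimal sketch rather than the Gist itself: the function name `wxr_to_sentences` is made up for illustration, the regular expressions are a crude stand-in for proper HTML and shortcode handling, and the sentence splitter is deliberately naive. The one thing it relies on from WordPress is that post bodies live in `content:encoded` elements inside each `item`.

```python
import re
import xml.etree.ElementTree as ET

# Namespaced tag that WXR exports use for the body of each post.
CONTENT_TAG = "{http://purl.org/rss/1.0/modules/content/}encoded"


def wxr_to_sentences(wxr_xml):
    """Return the prose of every post in a WXR document, one sentence per list item."""
    root = ET.fromstring(wxr_xml)
    sentences = []
    for item in root.iter("item"):
        body = item.findtext(CONTENT_TAG) or ""
        # Drop headings together with their text.
        body = re.sub(r"<h[1-6][^>]*>.*?</h[1-6]>", " ", body, flags=re.S | re.I)
        # Drop shortcode tags such as [gallery ids="1"]; text between paired tags stays.
        body = re.sub(r"\[[^\]]*\]", " ", body)
        # Drop any remaining HTML tags, keeping the text inside them.
        body = re.sub(r"<[^>]+>", " ", body)
        body = re.sub(r"\s+", " ", body).strip()
        # Naive split: break after ., ! or ? when whitespace follows.
        sentences.extend(s for s in re.split(r"(?<=[.!?])\s+", body) if s)
    return sentences
```

Writing `"\n".join(wxr_to_sentences(xml_text))` to a file would give the one-sentence-per-line layout described in step 3. Abbreviations such as “etc.” will fool the splitter, so treat the sentence count from a sketch like this as approximate.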


You will have to google what a ‘modal’ POS is (etc etc). Certainly no grammar I know from my Latin. To read the heat map: each row is the POS of one word and each column is the POS of the word that follows it, so the darker the cell, the more often I make that row-to-column transition when I write.
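
To make the row-to-column reading concrete, here is the counting from step 5 in miniature. It is a sketch in Python for brevity, not the Java I actually used, and it skips the parser entirely: the two toy sentences are hand-tagged with Penn Treebank tags, which is an assumption for illustration. The function name `transition_matrix` is likewise made up.

```python
from collections import Counter


def transition_matrix(tagged_sentences):
    """Count POS-to-POS transitions, then divide each count by the total number of transitions."""
    counts = Counter()
    for tags in tagged_sentences:
        # Pair each tag with the tag that follows it in the same sentence.
        counts.update(zip(tags, tags[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Hand-tagged toy input (hypothetical, not output from the Stanford Parser).
tagged = [
    ["PRP", "VBP", "NN"],        # "I write code"
    ["PRP", "VBP", "DT", "NN"],  # "I fix the typo"
]
probs = transition_matrix(tagged)
print(probs[("PRP", "VBP")])  # row PRP, column VBP: prints 0.4
```

There are five transitions in total and two of them go from PRP to VBP, hence 0.4. Because every cell is divided by the grand total, the whole matrix sums to 1. Dividing each row by its own total instead would give the probability of the next POS given the current one, which is a different (and arguably more usual) kind of transition matrix.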

Incidentally, it seems there are POS combos I never use, but there is one POS I NEVER use at all (hence the weird formatting glitch in its column on the far right): the possessive wh-pronoun. That is just an overly technical term for ‘whose’, I think (quite why ‘to’ just got ‘to’ and not dative-or-whatever, I don’t know). I never use the word ‘whose’ when I write. That is interesting, right? Except now, of course, I have used it twice.

The Next Steps (if/when I get round to them)

  1. Do some sort of cross-validation on the matrix, using my own (blog) dataset: is the matrix consistent with my writing style?
  2. Do the same thing to other writers’ texts. Does my matrix differ markedly from, say, Ernest Hemingway’s? (It will!) And are his matrices consistent across different texts?
  3. Can I use my matrix to check for typos? That rather depends on (1) and (2).
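
A rough idea of how steps 1 and 3 might look, sketched in Python. Everything here is an assumption rather than a worked-out method: the names `matrix_distance` and `rare_transitions` are hypothetical, total variation distance is just one plausible way to compare two matrices, and the 0.01 threshold is arbitrary. The input is sentences already reduced to lists of POS tags.

```python
from collections import Counter


def transition_probs(tagged_sentences):
    """Share of all transitions that each (tag, next_tag) pair accounts for."""
    counts = Counter()
    for tags in tagged_sentences:
        counts.update(zip(tags, tags[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def matrix_distance(p, q):
    """Total variation distance: 0.0 for identical matrices, 1.0 when no transition is shared."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def rare_transitions(tags, probs, threshold=0.01):
    """Tag pairs in one sentence that are rarer than the threshold in the reference matrix."""
    return [pair for pair in zip(tags, tags[1:]) if probs.get(pair, 0.0) < threshold]
```

For step 1, the sentences could be split into two halves and `matrix_distance` computed between the two resulting matrices: a small distance would suggest the matrix is stable across my own writing. The same function covers step 2 by comparing my matrix with another writer’s. For step 3, a sentence with several entries in `rare_transitions` is only a candidate for a second look, since an unusual transition is not necessarily a typo.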