Typo Detection Part V: Comparing Matrices


In the last couple of posts, I considered whether the Frobenius distance/norm or the Pearson correlation coefficient were appropriate ways to compare my 4 POS transition matrices. I have also produced similarity matrices and scatter plots as more visual ways of trying to garner anything useful; I think the scatter GIFs are probably the most effective. For the similarity matrices, if one of the authors’ matrices is \mathbf{A} = (a_{ij}) and the other is \mathbf{B} = (b_{ij}), I just form the similarity matrix as \mathbf{S}_{AB} = \mathbf{A} - \mathbf{B}, i.e. s_{ij} = a_{ij} - b_{ij}. No rocket science (or early 20th-century eugenics-inspired statistics, in the case of the Pearson coefficient). Then I produced heat maps of \mathbf{S}_{AB}. I further modified the code for the HeatChart Java library to make it a bit more visual, with blue as negative, white as zero difference, and red as positive.
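For anyone who wants to play along, here is a minimal sketch of the difference matrix and the blue/white/red colouring. It is standalone and does not touch HeatChart at all: the class name `DifferenceHeatMap`, the method names and the `maxAbs` scaling parameter are all made up for illustration, and the linear fade towards white is just one assumed way of doing the colour ramp.

```java
import java.awt.Color;

// Hypothetical helper, not part of HeatChart: builds S = A - B and picks a cell colour.
final class DifferenceHeatMap {

    /** Element-wise difference s_ij = a_ij - b_ij; both matrices must have the same shape. */
    static double[][] difference(double[][] a, double[][] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("row counts differ");
        }
        double[][] s = new double[a.length][];
        for (int i = 0; i < a.length; i++) {
            if (a[i].length != b[i].length) {
                throw new IllegalArgumentException("column counts differ in row " + i);
            }
            s[i] = new double[a[i].length];
            for (int j = 0; j < a[i].length; j++) {
                s[i][j] = a[i][j] - b[i][j];
            }
        }
        return s;
    }

    /**
     * Blue for negative, white for zero, red for positive.
     * maxAbs is the largest absolute difference expected; values beyond it are clamped.
     */
    static Color colourFor(double value, double maxAbs) {
        if (maxAbs <= 0) {
            return Color.WHITE;
        }
        double t = Math.max(-1.0, Math.min(1.0, value / maxAbs));
        // fade is 255 at zero difference (white) and 0 at full strength (pure blue or red).
        int fade = (int) Math.round(255 * (1.0 - Math.abs(t)));
        return t < 0 ? new Color(fade, fade, 255) : new Color(255, fade, fade);
    }
}
```

Scaling by one shared `maxAbs` across all the charts would keep the colours comparable from one author pair to the next, at the cost of washing out the pairs with small differences.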


The Results [Cue FANFARE…We got there!]

Now that I have gone through the different ways I am trying to compare my 4 matrices (phew!), here are the outputs:

Frobenius and Pearson

            Me              Hemingway       Baum            Darwin
Me          1 / 0
Hemingway   0.887 / 0.058   1 / 0
Baum        0.898 / 0.056   0.921 / 0.050   1 / 0
Darwin      0.871 / 0.066   0.757 / 0.091   0.822 / 0.077   1 / 0

The first value in each cell is the Pearson correlation coefficient, \rho_p: ‘1’ would be a perfect positive correlation. The second value is the Frobenius distance, derived using the Frobenius norm: ‘0’ would be no distance.
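A minimal sketch of the two measures, for reference. It assumes the Pearson coefficient is taken over corresponding entries (a_{ij}, b_{ij}) of the two matrices and that the Frobenius distance is the norm of the plain difference, with no extra normalisation; whether that matches the exact computation behind the table is an assumption. The class name `MatrixCompare` and the toy values in `main` are invented for illustration.

```java
// Hypothetical helper class; the names are illustrative only.
final class MatrixCompare {

    /** Throws if the two matrices do not have identical dimensions. */
    private static void requireSameShape(double[][] a, double[][] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("row counts differ");
        }
        for (int i = 0; i < a.length; i++) {
            if (a[i].length != b[i].length) {
                throw new IllegalArgumentException("column counts differ in row " + i);
            }
        }
    }

    /** Frobenius distance: square root of the sum of squared entry-wise differences. */
    static double frobeniusDistance(double[][] a, double[][] b) {
        requireSameShape(a, b);
        double sumSq = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                double d = a[i][j] - b[i][j];
                sumSq += d * d;
            }
        }
        return Math.sqrt(sumSq);
    }

    /** Pearson correlation over paired entries (a_ij, b_ij); NaN if either matrix is constant. */
    static double pearson(double[][] a, double[][] b) {
        requireSameShape(a, b);
        int n = 0;
        double sumA = 0.0, sumB = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                sumA += a[i][j];
                sumB += b[i][j];
                n++;
            }
        }
        double meanA = sumA / n, meanB = sumB / n;
        double cov = 0.0, varA = 0.0, varB = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                double da = a[i][j] - meanA;
                double db = b[i][j] - meanB;
                cov += da * db;
                varA += da * da;
                varB += db * db;
            }
        }
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        // Toy 2x2 matrices, made up for the example (not real POS transition data).
        double[][] a = {{0.1, 0.9}, {0.6, 0.4}};
        double[][] b = {{0.2, 0.8}, {0.5, 0.5}};
        // Prints in the same "Pearson / Frobenius" layout as the table.
        System.out.printf("%.3f / %.3f%n", pearson(a, b), frobeniusDistance(a, b));
    }
}
```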

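It is nice to know that, give or take, the two measures, \rho_p and the Frobenius distance, pretty much give the same conclusion: the six author pairs come out in the same order under both, with the highest correlations going with the smallest distances. I am sure that if you did the algebra on Pearson correlation vs Frobenius distance you could see where the differences come from more quantitatively.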
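A rough go at that algebra, treating each matrix as one long list of entries. The tilde notation for the centred, rescaled matrices is introduced here purely for the derivation and does not come from the earlier posts.

```latex
% Squared Frobenius distance, expanded entry by entry
\|\mathbf{A} - \mathbf{B}\|_F^2 = \sum_{i,j} (a_{ij} - b_{ij})^2
  = \|\mathbf{A}\|_F^2 + \|\mathbf{B}\|_F^2 - 2 \sum_{i,j} a_{ij} b_{ij}

% Pearson correlation over the same entries (\bar{a}, \bar{b} are the entry means)
\rho_p = \frac{\sum_{i,j} (a_{ij} - \bar{a})(b_{ij} - \bar{b})}
              {\sqrt{\sum_{i,j} (a_{ij} - \bar{a})^2}\,\sqrt{\sum_{i,j} (b_{ij} - \bar{b})^2}}

% Centre each matrix and rescale it to unit Frobenius norm
\tilde{a}_{ij} = \frac{a_{ij} - \bar{a}}{\sqrt{\sum_{k,l} (a_{kl} - \bar{a})^2}}, \qquad
\tilde{b}_{ij} = \frac{b_{ij} - \bar{b}}{\sqrt{\sum_{k,l} (b_{kl} - \bar{b})^2}}

% For the centred, rescaled matrices the two measures are tied together exactly
\|\tilde{\mathbf{A}} - \tilde{\mathbf{B}}\|_F^2 = 1 + 1 - 2\rho_p = 2\,(1 - \rho_p)
```

So on centred, unit-norm matrices the Frobenius distance is just a monotone function of \rho_p and carries no extra information. On the matrices as they stand, the distance also reacts to differences in the mean level and overall spread of the entries, which Pearson ignores by construction.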

Animated Scatter Plots

Not sure what you do with these other than look at them and try to decide which plot in each of the four animations is nearest to being a straight line… a perfect, upward-sloping line would give you a Pearson correlation of 1. Taking the animations from ‘left/top’ to ‘right/bottom’, they show:

  • My data vs the remaining others,
  • Hemingway vs the remaining others,
  • Darwin vs the remaining others, and
  • Baum vs the remaining others.

The only thing I can really draw from them is that they are all pretty different! Maybe Hemingway vs Baum is the “I’m nearest a line” winner? That would at least fit with that pair having the highest \rho_p in the table (0.921).

BTW, I drew the scatter plots using a great little one-class library I found called Simple Plot: it does what it says on the packet, I think you’d agree?

Heat Maps

In these, a blue cell means a negative difference between the first author in the chart title and the second. A red cell is a positive difference. White is zero difference.

Here I am using the WordPress ‘Slide Anything’ carousel plugin, BTW.

 
