SAS Programming for Data Mining Applications: Clustering Handwritten Digits (digitized optical images)

Saturday, December 12, 2009

Clustering Handwritten Digits (digitized optical images)

In this example, we show a clustering exercise on the Optical Recognition of Handwritten Digits Data Set, available at the UCI data set repository (Link).

This exercise is a standard application of HOSVD (higher-order singular value decomposition). The 8x8 digitized bitmaps of the digits are stacked into a third-order tensor, the standard tensor representation: mode 1 corresponds to the columns of each bitmap, mode 2 to its rows, and mode 3 to the individual digit objects. The factor matrix U_(3) from the HOSVD of this tensor carries the information used for clustering. In this exercise, we apply the k-means algorithm to the matrix of correlations between each row of U_(3), which corresponds to one subject, and each column of the right eigenvector matrix; working with correlations reduces the effect of outliers and stabilizes the matrix for the final clustering algorithm.
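As a rough illustration of the tensor layout described above, the following Python/NumPy sketch computes U_(3) from the mode-3 unfolding and clusters its rows with a plain k-means loop. It is not the SAS prototype used for this post. The function names `mode3_factor` and `kmeans`, the synthetic bar-shaped bitmaps, and the choice of rank 2 are all assumptions made for the example, and the correlation step against the right eigenvector matrix is omitted for brevity.

```python
import numpy as np

def mode3_factor(tensor, rank):
    """Leading left singular vectors of the mode-3 unfolding (U_(3) in HOSVD)."""
    n_rows, n_cols, n_obj = tensor.shape
    # Mode-3 unfolding: one row per object, one column per pixel (64 for 8x8).
    unfolded = tensor.transpose(2, 0, 1).reshape(n_obj, n_rows * n_cols)
    u, s, _ = np.linalg.svd(unfolded, full_matrices=False)
    return u[:, :rank], s[:rank]

def kmeans(x, k, n_iter=50, seed=0):
    """Basic Lloyd iterations; returns a cluster label (0..k-1) per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

# Synthetic stand-in for the digit bitmaps (hypothetical data, not the UCI set):
# ten vertical-bar images followed by ten horizontal-bar images, plus small noise.
rng = np.random.default_rng(1)
bar_v = np.zeros((8, 8))
bar_v[:, 3] = 16.0
bar_h = np.zeros((8, 8))
bar_h[3, :] = 16.0
tensor = np.stack([bar_v] * 10 + [bar_h] * 10, axis=2)
tensor = tensor + rng.normal(scale=0.01, size=tensor.shape)

u3, singular_values = mode3_factor(tensor, rank=2)
labels = kmeans(u3, k=2)
print(labels)
```

On the real data the tensor would have shape 8 x 8 x (number of digit images), and the rank and number of clusters would need to be chosen accordingly.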

The clustering result is shown below, together with the original unclustered correlation data and the correlation data sorted by true digit ID. To facilitate display, we show the confusion matrix of the correlation matrix, which makes the clustering result easier to identify.

Cross Tabulation of True Class (0-9) and Clustering Result (1-10):
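A cross tabulation of this kind could be computed along the following lines. This is only a sketch in Python/NumPy with made-up labels: the helper name `cross_tab` and the five example observations are assumptions for illustration, not results from this exercise.

```python
import numpy as np

def cross_tab(true_class, cluster_id, n_classes=10, n_clusters=10):
    """Count observations per (true class 0-9, cluster 1-10) pair."""
    table = np.zeros((n_classes, n_clusters), dtype=int)
    for t, c in zip(true_class, cluster_id):
        table[t, c - 1] += 1  # clusters are numbered from 1, columns from 0
    return table

# Hypothetical labels for five observations.
true_class = [0, 0, 1, 1, 9]
cluster_id = [3, 3, 7, 2, 10]
table = cross_tab(true_class, cluster_id)
print(table)
```

Rows index the true class and columns index the cluster, so a clean clustering would put most of each row's count in a single column.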

Further thoughts. The confusion matrix display shows that the digit clusters do show up, but the pattern is not as prominent as we expected. However, it may be possible to obtain a better result by applying Nonnegative Matrix Factorization (NMF) to the square of the correlation confusion matrix. The NMF algorithm will be implemented in SAS soon.
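To give a feel for the NMF idea, here is a minimal Python/NumPy sketch using the well-known multiplicative update rules for the Frobenius-norm objective. It is not the planned SAS implementation. The function name `nmf`, the toy 4x4 correlation matrix, and the reading of "square" as an element-wise square (which guarantees a nonnegative input) are assumptions made for this example.

```python
import numpy as np

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix v as w @ h using multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = v.shape
    w = rng.random((n, rank)) + eps
    h = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors nonnegative; eps avoids division by zero.
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    return w, h

# Toy correlation matrix with two groups of mutually correlated subjects
# (hypothetical numbers, not taken from the digit data).
corr = np.array([
    [1.0, 0.9, -0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [-0.1, 0.0, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
])
v = corr ** 2  # element-wise square, so every entry is nonnegative
w, h = nmf(v, rank=2)

rel_error = np.linalg.norm(v - w @ h) / np.linalg.norm(v)
groups = w.argmax(axis=1)  # assign each subject to its dominant factor
print(rel_error, groups)
```

Assigning each subject to the factor with the largest weight in `w` is one simple way to turn the factorization into cluster labels; other assignment rules are possible.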

Note: All numerical analyses were done using the prototype code demonstrated in this blog entry (Link). The figure was drawn in R.