orngCA: Orange Correspondence Analysis

Correspondence analysis is used to visualize contingency tables. Before a table can be plotted, some preprocessing and computation are required, and that is the purpose of this module. The correspondence analysis module computes the co-ordinates of the rows and columns of any contingency table, regardless of where the table comes from.

The module contains one class, CA, which wraps all the mathematical functions, and a function, input, for loading a contingency table from a file.

An instance of the class is constructed by passing a contingency table to the constructor. In Python, a contingency table can be represented by nested lists ("list-of-lists") or by the numpy types matrix and array. The function input(filename) reads a contingency table from the file filename, in which each row of the matrix is stored on a separate line and column values are separated by spaces. After either of the following sequences

>>> import orngCA
>>> data = [[72, 39, 26, 23, 4],
...         [95, 58, 66, 84, 41],
...         [80, 73, 83, 4, 96],
...         [79, 93, 35, 73, 63]]
>>> c = orngCA.CA(data)
>>>
>>> data = orngCA.input('contigencyTable')
>>> c = orngCA.CA(data)

the variable c contains a CA object.
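
The file layout described above (one table row per line, values separated by spaces) can be parsed with a few lines of plain Python. The sketch below only illustrates that layout: parse_contingency_text is a hypothetical helper, not the module's input function, and it assumes numeric values and skips blank lines.

```python
def parse_contingency_text(text):
    """Parse lines of whitespace-separated numbers into a list of lists.

    Hypothetical helper for illustration; not part of orngCA.
    """
    table = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:  # ignore blank lines
            continue
        table.append([float(value) for value in fields])
    return table


# Two rows in the layout described above.
sample = "72 39 26 23 4\n95 58 66 84 41\n"
table = parse_contingency_text(sample)
```

The result is a list-of-lists with the same shape as the nested-list table passed to the constructor above.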

Class CA

Attributes

All attributes are of the numpy matrix type and hold the matrices created in the process of computing the generalized SVD.

dataMatrix
Returns the contingency table that was passed to the constructor.
A
Returns matrix A, whose columns define the principal axes of the column clouds.
B
Returns matrix B, whose columns define the principal axes of the row clouds.
D
Returns matrix D, whose diagonal elements are singular values of the decomposition.
F
Returns matrix F, which contains coordinates of the row profiles with respect to principal axes B.
G
Returns matrix G, which contains coordinates of the column profiles with respect to principal axes A.
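
For orientation, the attribute names match the notation commonly used for correspondence analysis in the textbook literature. The relations below are a sketch under that assumption and are not taken from the module's source. The symbols P, r, c, D_r and D_c are introduced here for illustration: P is the contingency table divided by its grand total, r and c are the vectors of row and column sums of P, and D_r and D_c are the diagonal matrices built from r and c.

```latex
\begin{align}
P - r c^{\mathsf{T}} &= A \, D \, B^{\mathsf{T}},
  & A^{\mathsf{T}} D_r^{-1} A &= B^{\mathsf{T}} D_c^{-1} B = I, \\
F &= D_r^{-1} A \, D,
  & G &= D_c^{-1} B \, D .
\end{align}
```

Under this reading, the rows of F are the row points and the rows of G are the column points that a biplot displays.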

Methods

getA()
Getter for matrix A.
getB()
Getter for matrix B.
getD()
Getter for matrix D.
getF()
Getter for matrix F.
getG()
Getter for matrix G.
getPrincipalRowProfilesCoordinates(dim = (0,1))
Returns the co-ordinates of the row profiles with respect to the principal axes B. Only the co-ordinates for the dimensions listed in the tuple dim are returned. dim is optional; if it is omitted, the first two dimensions are returned.
getPrincipalColProfilesCoordinates(dim = (0,1))
Returns the co-ordinates of the column profiles with respect to the principal axes A. Only the co-ordinates for the dimensions listed in the tuple dim are returned. dim is optional; if it is omitted, the first two dimensions are returned.
DecompositionOfInertia(axis = 0)
Returns the decomposition of the inertia across the axes. The columns of this matrix represent the contributions of the rows or columns to the inertia of each axis. If axis equals 0, the inertia is decomposed across rows; if axis equals 1, it is decomposed across columns. The parameter is optional and defaults to 0.
InertiaOfAxis(percentage = 0)
Returns a numpy array whose elements are the inertias of the axes. If percentage = 1, the percentage of inertia of each axis is returned instead.
ContributionOfPointsToAxis(rowColumn = 0, axis = 0, percentage = 0)
Returns a numpy array whose elements are the contributions of points to the inertia of axis. The argument rowColumn defines whether the calculation is performed for row points (the default) or column points. The values are expressed as percentages if percentage = 1.
PointsWithMostInertia(rowColumn = 0, axis = (0, 1))
Returns the indices of row or column points sorted by decreasing contribution to the axes given in the tuple axis.
PlotScreeDiagram()
Creates a canvas and plots a scree diagram in it.
Biplot(dim = (0, 1))
Plots row points and column points on a 2D canvas. If dim is omitted, the first two dimensions are displayed; otherwise the tuple dim selects the principal axes.

Examples of use

The data table below shows the smoking habits of different groups of employees in a company.


                       Smoking category
Staff group            (1) None  (2) Light  (3) Medium  (4) Heavy  Row totals
(1) Senior managers           4          2           3          2          11
(2) Junior managers           4          3           7          4          18
(3) Senior employees         25         10          12          4          51
(4) Junior employees         18         24          33         13          88
(5) Secretaries              10          6           7          2          25
Column totals                61         45          62         25         193

The 4 column values in each row of the table can be viewed as coordinates in a 4-dimensional space, and the (Euclidean) distances between the 5 row points in that space can be computed. These distances summarize all information about the similarities between the rows of the table above. The correspondence analysis module can be used to find a lower-dimensional space in which the row points are positioned so that all, or almost all, of the information about the differences between the rows is retained. Nearly all information about the similarities between the rows (types of employees in this case) can then be presented in a simple 2-dimensional graph. While this may not appear particularly useful for small tables like the one shown above, the presentation and interpretation of very large tables (e.g., differential preference for 10 consumer items among 100 groups of respondents in a consumer survey) can benefit greatly from the simplification achieved via correspondence analysis (e.g., representing the 10 consumer items in a 2-dimensional space). The same analysis can be performed on the columns of the table.
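
The idea of treating each row as a point in 4-dimensional space can be made concrete with a short sketch in plain Python. It computes the pairwise Euclidean distances between the raw count rows, exactly as the paragraph describes; it does not use orngCA, and the names euclidean and distances are chosen here for illustration. The counts are entered so that they agree with the row totals 11, 18, 51, 88 and 25.

```python
import math

# Smoking counts: one row per staff group, one column per smoking category.
data = [[4, 2, 3, 2],
        [4, 3, 7, 4],
        [25, 10, 12, 4],
        [18, 24, 33, 13],
        [10, 6, 7, 2]]


def euclidean(u, v):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# distances[i][j] is the distance between row i and row j.
distances = [[euclidean(u, v) for v in data] for u in data]
```

Distances between raw counts are dominated by the group sizes, which is one reason the analysis works with profiles (rows divided by their totals) rather than with the counts themselves.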

The following lines load the modules and data needed for the analysis. The analysis itself is run in the last line.

 1 import orange
 2 from orngCA import CA
 3 from numpy import diag, square  # used when printing the results
 4 data = [[4, 2, 3, 2],
 5         [4, 3, 7, 4],
 6         [25, 10, 12, 4],
 7         [18, 24, 33, 13],
 8         [10, 6, 7, 2]]
 9
10 c = CA(data)  # runs the correspondence analysis

After analysis finishes, results can be inspected:

11 print "Column profiles:"
12 print c._CA__colProfiles
13 print
14 print "Row profiles:"
15 print c._CA__rowProfiles
16 print

Column profiles:
[[ 0.06557377  0.06557377  0.40983607  0.29508197  0.16393443]
 [ 0.04444444  0.06666667  0.22222222  0.53333333  0.13333333]
 [ 0.0483871   0.11290323  0.19354839  0.53225806  0.11290323]
 [ 0.08        0.16        0.16        0.52        0.08      ]]

Row profiles:
[[ 0.36363636  0.18181818  0.27272727  0.18181818]
 [ 0.22222222  0.16666667  0.38888889  0.22222222]
 [ 0.49019608  0.19607843  0.23529412  0.07843137]
 [ 0.20454545  0.27272727  0.375       0.14772727]
 [ 0.4         0.24        0.28        0.08      ]]
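
The profiles printed above are read from name-mangled private attributes of the CA object. As a cross-check that does not depend on orngCA, the following sketch recomputes them in plain Python: a row profile is a row divided by its total, and a column profile is a column divided by its total. The variable names are chosen here for illustration.

```python
# Smoking counts: one row per staff group, one column per smoking category.
data = [[4, 2, 3, 2],
        [4, 3, 7, 4],
        [25, 10, 12, 4],
        [18, 24, 33, 13],
        [10, 6, 7, 2]]

row_totals = [float(sum(row)) for row in data]
col_totals = [float(sum(col)) for col in zip(*data)]

# One profile per staff group: relative frequencies across smoking categories.
row_profiles = [[value / total for value in row]
                for row, total in zip(data, row_totals)]

# One profile per smoking category: relative frequencies across staff groups.
col_profiles = [[value / total for value in col]
                for col, total in zip(zip(*data), col_totals)]
```

Each profile sums to 1, and the values agree with the printed matrices, for example row_profiles[0][0] = 4/11 and col_profiles[0][2] = 25/61.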

Points in the two-dimensional display that are close to each other are similar with regard to the pattern of relative frequencies across the columns, i.e. they have similar row profiles. Once the plot is produced, it can be seen that along the most important first axis the Senior employees and Secretaries are relatively close together. The row profiles show the same thing: these two groups of employees have very similar patterns of relative frequencies across the categories of smoking intensity.
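
The similarity of two row profiles can also be checked numerically. Correspondence analysis conventionally measures it with the chi-square distance, a Euclidean distance in which each squared difference is divided by the corresponding column mass. The sketch below computes that distance in the full 4-dimensional space with plain Python; it does not use orngCA, it is not a statement about how the module computes its co-ordinates, and the helper name chi2_distance is chosen here for illustration.

```python
import math

labels = ["Senior managers", "Junior managers", "Senior employees",
          "Junior employees", "Secretaries"]
data = [[4, 2, 3, 2],
        [4, 3, 7, 4],
        [25, 10, 12, 4],
        [18, 24, 33, 13],
        [10, 6, 7, 2]]

total = float(sum(sum(row) for row in data))
col_masses = [sum(col) / total for col in zip(*data)]
profiles = [[value / float(sum(row)) for value in row] for row in data]


def chi2_distance(p, q):
    """Chi-square distance between two row profiles."""
    return math.sqrt(sum((a - b) ** 2 / mass
                         for a, b, mass in zip(p, q, col_masses)))


# All pairs of staff groups, sorted from most to least similar.
pairs = sorted((chi2_distance(profiles[i], profiles[j]), labels[i], labels[j])
               for i in range(len(data)) for j in range(i + 1, len(data)))
closest = pairs[0]
```

For this table the closest pair is Senior employees and Secretaries, at a distance of about 0.20, which agrees with the observation made from the row profiles.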

Lines 17-19 print the singular values, eigenvalues, and percentages of inertia explained. These values are important for deciding how many axes are needed to represent the data. The dimensions are "extracted" to maximize the distances between the row or column points, and successive dimensions "explain" less and less of the overall inertia.

17 print "Singular values: " + str(diag(c.D))
18 print "Eigen values: " + str(square(diag(c.D)))
19 print "Percentage of Inertia:" + str(c.PercentageOfInertia())
20 print

Singular values: [ 2.73421115e-01 1.00085866e-01 2.03365208e-02 1.20036007e-16]
Eigen values: [ 7.47591059e-02 1.00171805e-02 4.13574080e-04 1.44086430e-32]
Percentage of Inertia: [ 8.78492893e+01 1.16387938e+01 5.11916964e-01 1.78671526e-29]
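
The eigenvalues printed above can be sanity-checked against the table itself. In correspondence analysis the total inertia equals the Pearson chi-square statistic of the table divided by the grand total, and it should equal the sum of the eigenvalues. The sketch below performs that check in plain Python without orngCA; the eigenvalues are copied from the output above (the fourth is numerically zero and is left out), and the variable names are chosen here for illustration.

```python
data = [[4, 2, 3, 2],
        [4, 3, 7, 4],
        [25, 10, 12, 4],
        [18, 24, 33, 13],
        [10, 6, 7, 2]]

n = float(sum(sum(row) for row in data))
row_totals = [sum(row) for row in data]
col_totals = [sum(col) for col in zip(*data)]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = 0.0
for i, row in enumerate(data):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed - expected) ** 2 / expected

total_inertia = chi2 / n

# Eigenvalues as printed by the listing above.
eigenvalues = [7.47591059e-02, 1.00171805e-02, 4.13574080e-04]
difference = abs(total_inertia - sum(eigenvalues))
```

The two quantities agree to within rounding, at about 0.0852.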

Lines 21-22 print the principal row co-ordinates with respect to the first two axes, and lines 24-25 show the decomposition of inertia.

21 print "Principal row coordinates:"
22 print c.getPrincipalRowProfilesCoordinates()
23 print
24 print "Decomposition Of Inertia:"
25 print c.DecompositionOfInertia()

The last two statements plot a scree diagram and a biplot. A scree diagram is a plot of the amount of inertia accounted for by successive dimensions, i.e. a plot of the percentage of inertia against the components, ordered from largest to smallest. It is usually used to identify the components that contribute most of the inertia: one looks for a change in slope in the diagram, keeps the components before it, and discards the remaining ones, which seem simply to be debris at the bottom of the slope. A biplot is a plot of row and column points in two-dimensional space.

27 c.PlotScreeDiagram()

28 c.Biplot()
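
To show what a scree diagram conveys without opening a canvas, the sketch below renders a text version from the percentages of inertia printed by line 19. It is a stand-in written in plain Python, not the module's PlotScreeDiagram; the function name scree_lines and the bar width are choices made here for illustration.

```python
# Percentages of inertia as printed by the listing above.
percentages = [8.78492893e+01, 1.16387938e+01, 5.11916964e-01, 1.78671526e-29]


def scree_lines(values, width=40):
    """Return one text bar per axis, largest first, with running totals."""
    ordered = sorted(values, reverse=True)
    largest = ordered[0]
    lines = []
    cumulative = 0.0
    for index, value in enumerate(ordered):
        cumulative += value
        bar = "#" * int(round(width * value / largest))
        lines.append("axis %d  %s  %6.2f%%  (cumulative %6.2f%%)"
                     % (index, bar.ljust(width), value, cumulative))
    return lines


for line in scree_lines(percentages):
    print(line)
```

With these values the bar lengths drop sharply after the first axis, and the first two axes together account for about 99.49% of the inertia, which is why a two-dimensional biplot represents this table well.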