Updating formula for the sample covariance and correlation

A broad category of analyses, for example factor analysis and principal components analysis, can be computed from some form of a cross-product matrix. A cross-product matrix is a matrix of the form X'X, where X represents an arbitrary set of raw or standardized variables. Correlation matrices can also be used to compute ridge regression, a type of regression that can help deal with multicollinearity and is part of a broader class of models called penalized regression models.
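As a rough illustration of how a cross-product matrix leads to a covariance or correlation matrix, the following sketch uses only base R and the built-in iris data. The variable names (xtx, sums, covFromXtX, corFromXtX) are chosen here for illustration and do not come from the original text. Because the row count, the column sums, and X'X are all simple sums over rows, they can be accumulated one chunk of data at a time, which is what makes an updating approach possible.

```r
# Sketch: sample covariance and correlation recovered from cross-products.
X <- as.matrix(iris[, 1:4])      # built-in iris measurements as example data
n <- nrow(X)

xtx  <- crossprod(X)             # X'X, the raw cross-product matrix
sums <- colSums(X)               # column totals

# Sample covariance: (X'X - sums sums' / n) / (n - 1)
covFromXtX <- (xtx - tcrossprod(sums) / n) / (n - 1)

# Correlation: rescale the covariance by the standard deviations
sdev <- sqrt(diag(covFromXtX))
corFromXtX <- covFromXtX / tcrossprod(sdev)

# Compare with the built-in estimators
covMatches <- isTRUE(all.equal(covFromXtX, cov(X)))
corMatches <- isTRUE(all.equal(corFromXtX, cor(X)))
```

For data with very large means this one-pass formula can lose precision, so a production implementation would normally center the data or use a numerically stable update instead.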

= "Married, spouse present", langIsolated = lingisol == "Linguistically isolated", multFamilies = nfams

    Call:
    factanal(factors = 2, covmat = censusCor)

    Uniquenesses:
             poverty          noPhone        noEnglish onSocialSecurity
                0.53             0.96             0.93             0.23
           onWelfare          working          incearn     noHighSchool
                0.96             0.51             0.68             0.82
              inCity           renter         noSpouse     langIsolated
                0.97             0.79             0.90             0.90
        multFamilies       newArrival       recentMove            white
                0.96             0.93             0.97             0.88
                 sei            older
                0.57             0.24

    Loadings:
    Factor1 Factor2
    onSocialSecurity 0.83 -0.29
    working -0.68
    sei -0.59 -0.29
    older 0.82 -0.29
    poverty -0.28 -0.63
    noPhone
    noEnglish 0.26
    onWelfare
    incearn -0.46 -0.34
    noHighSchool 0.31 0.29
    inCity
    renter 0.45
    noSpouse 0.28
    langIsolated 0.31
    multFamilies
    newArrival 0.26
    recentMove
    white -0.34

                   Factor1 Factor2
    SS loadings       2.60    1.67
    Proportion Var    0.14    0.09
    Cumulative Var    0.14    0.24

    The degrees of freedom for the model is 118 and the fit was 0.6019

    The degrees of freedom for the model is 102 and the fit was 0.343

Principal components analysis, or PCA, is a technique closely related to factor analysis.

PCA seeks to find a set of orthogonal axes such that the first axis, or first principal component, accounts for as much variability as possible, and subsequent axes or components are chosen to maximize variance while maintaining orthogonality with previous axes.

    Width 0.567 -0.583 0.580

                   Comp.1 Comp.2 Comp.3 Comp.4
    SS loadings      1.00   1.00   1.00   1.00
    Proportion Var   0.25   0.25   0.25   0.25
    Cumulative Var   0.25   0.50   0.75   1.00

You may have noticed that we supplied the flag cor=TRUE in the call to princomp; this flag tells princomp to use the correlation matrix rather than the covariance matrix to compute the principal components.

A covariance matrix cannot have negative eigenvalues, since a negative eigenvalue means that some linear combination of the variables has negative variance.
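The argument can be checked numerically. The sketch below, which assumes the built-in iris data and an arbitrary weight vector a chosen purely for illustration, compares the variance of a linear combination computed directly with the quadratic form a'Sa, and then inspects the eigenvalues of S.

```r
# Sketch: the variance of any linear combination equals a' S a, so it cannot be negative.
X <- as.matrix(iris[, 1:4])
S <- cov(X)                                   # sample covariance matrix

a <- c(1, -2, 0.5, 3)                         # arbitrary weights (illustrative)
varDirect    <- var(as.vector(X %*% a))       # variance of the combined variable
varQuadratic <- as.numeric(t(a) %*% S %*% a)  # the same quantity as a quadratic form

# An eigenvalue is a' S a for a unit-length eigenvector a,
# so every eigenvalue is itself a variance.
eigenvalues <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
```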

PROC CALIS displays a warning if the predicted covariance matrix has negative eigenvalues but does not actually compute the eigenvalues.

Principal components are typically computed either by a singular value decomposition of the data matrix or an eigenvalue decomposition of a covariance or correlation matrix; the latter permits us to use a matrix computed by rxCov or rxCor as the input to princomp. We can obtain the same results by omitting the flag but submitting the correlation matrix as returned by rxCor instead.

Stock market data for open, high, low, close, and adjusted close from 1962 to 2010 is available at https://github.com/thebigjc/HackReduce/blob/master/datasets/nyse/daily_prices/NYSE_daily_prices_
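The equivalence between the cor=TRUE flag and a precomputed correlation matrix can be sketched in base R. This example assumes the log of the built-in iris measurements and uses the base function cor as a stand-in for rxCor, which is not called here; the object names are illustrative.

```r
# Sketch: princomp on raw data with cor = TRUE versus princomp on a correlation matrix.
X <- log(as.matrix(iris[, 1:4]))

# Route 1: pass the data and ask princomp to use the correlation matrix
pcaFromData <- princomp(X, cor = TRUE)

# Route 2: compute the correlation matrix first (base cor() standing in for rxCor)
corMat <- cor(X)
pcaFromCor <- princomp(covmat = corMat)

# The component standard deviations agree between the two routes
sameSdev <- isTRUE(all.equal(unname(pcaFromData$sdev), unname(pcaFromCor$sdev)))
```

The second route cannot return component scores, since princomp never sees the observations, but it needs only the small correlation matrix rather than the full data set.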

As an example, we use the rxCov function to calculate a covariance matrix for the log of the classic iris data, and pass the matrix to the princomp function (reproduced from Modern Applied Statistics with S):

    Importance of components:
                              Comp.1    Comp.2     Comp.3    Comp.4
    Standard deviation     1.7124583 0.9523797 0.36470294 0.1656840
    Proportion of Variance 0.7331284 0.2267568 0.03325206 0.0068628
    Cumulative Proportion  0.7331284 0.9598851 0.99313720 1.0000000

    loadings(irisPca)

    Loadings:
    Comp.1 Comp.2 Comp.3 Comp.4
    Sepal.

The full data set includes 9.2 million observations of daily open-high-low-close data for some 2800 stocks.

This matrix is similar to the ordinary least squares regression solution with a “ridge” added along the diagonal.
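A minimal sketch of this idea, assuming the usual ridge estimator (X'X + lambda I)^(-1) X'y, is shown below. It uses three columns of the built-in mtcars data and a penalty value lambda = 2, both picked only for illustration. With standardized predictors, X'X equals (n - 1) times the correlation matrix, which is how a correlation matrix enters the computation.

```r
# Sketch: ordinary least squares versus ridge, both computed from cross-products.
X <- scale(as.matrix(mtcars[, c("disp", "hp", "wt")]))  # standardized predictors
y <- mtcars$mpg - mean(mtcars$mpg)                      # centered response
lambda <- 2                                             # illustrative penalty value

xtx <- crossprod(X)        # X'X, proportional to the correlation matrix here
xty <- crossprod(X, y)     # X'y

# OLS solves (X'X) b = X'y; ridge adds lambda along the diagonal first
betaOls   <- solve(xtx, xty)
betaRidge <- solve(xtx + lambda * diag(ncol(X)), xty)
```

Adding a positive value to the diagonal makes the system better conditioned when predictors are highly correlated and shrinks the coefficient vector toward zero.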


As you might expect, these data are highly correlated, and principal components analysis can be used for data reduction.

We read the original data into a file, NYSE_daily_prices.xdf, using the same process we used in the Tutorial: Analyzing loan data with RevoScaleR to read our mortgage data.

    summary(stockPca)

    Importance of components:
                              Comp.1    Comp.2      Comp.3       Comp.4
    Standard deviation     2.0756631 0.8063270 0.197632281 0.0454173922
    Proportion of Variance 0.8616755 0.1300327 0.007811704 0.0004125479
    Cumulative Proportion  0.8616755 0.9917081 0.999519853 0.9999324005
                                 Comp.5
    Standard deviation     1.838470e-02
    Proportion of Variance 6.759946e-05
    Cumulative Proportion  1.000000e+00

    loadings(stockPca)

    Loadings:
                          Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
    stock_price_open      -0.470 -0.166  0.867
    stock_price_high      -0.477 -0.151 -0.276  0.410 -0.711
    stock_price_low       -0.477 -0.153 -0.282  0.417  0.704
    stock_price_close     -0.477 -0.149 -0.305 -0.811
    stock_price_adj_close -0.309  0.951

                   Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
    SS loadings       1.0    1.0    1.0    1.0    1.0
    Proportion Var    0.2    0.2    0.2    0.2    0.2
    Cumulative Var    0.2    0.4    0.6    0.8    1.0

The scree plot is shown as follows:

    [Figure: scree plot of stockPca]

Between them, the first two principal components explain 99% of the variance; we can therefore replace the five original variables by these two principal components with no appreciable loss of information.
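The proportions in the summary follow directly from the component standard deviations. The sketch below copies the five standard deviations from the output above and recomputes the proportion and cumulative proportion of variance; the 0.99 threshold and the variable names are illustrative choices. The variances sum to roughly 5 because the correlation matrix of five variables has trace 5.

```r
# Sketch: recompute the variance proportions from the printed standard deviations.
sdev <- c(2.0756631, 0.8063270, 0.197632281, 0.0454173922, 1.838470e-02)

variances <- sdev^2                      # eigenvalues of the correlation matrix
propVar   <- variances / sum(variances)  # proportion of variance per component
cumVar    <- cumsum(propVar)             # cumulative proportion

# Smallest number of components reaching an illustrative 99% threshold
nKeep <- which(cumVar >= 0.99)[1]
```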

Sometimes this warning can be triggered by 0 or very small positive eigenvalues that appear negative because of numerical error.
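One common way to separate genuinely negative eigenvalues from rounding noise is to compare them against a small relative tolerance. The following sketch is only an illustration of that idea: the helper cleanEigenvalues and its default tolerance of 1e-8 are hypothetical and are not taken from PROC CALIS or any other software mentioned here.

```r
# Sketch: a rank-deficient covariance matrix whose smallest eigenvalue is zero in
# exact arithmetic but may come out as a tiny positive or negative number.
set.seed(1)
Z <- matrix(rnorm(200), ncol = 2)
X <- cbind(Z, Z[, 1] + Z[, 2])           # third column is the sum of the first two
ev <- eigen(cov(X), symmetric = TRUE, only.values = TRUE)$values

# Hypothetical helper: set eigenvalues within a relative tolerance of zero to zero
cleanEigenvalues <- function(values, tol = 1e-8) {
  threshold <- tol * max(abs(values))
  values[abs(values) < threshold] <- 0
  values
}

evClean <- cleanEigenvalues(ev)
trulyNegative <- any(evClean < 0)        # TRUE only for eigenvalues clearly below zero
```

A clearly negative value, such as -0.5 alongside an eigenvalue of 2, survives the cleaning step and would still be flagged.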


