Title: Data Sets and Functions Used in Multivariate Statistics: Old School by John Marden
Description: Multivariate Analysis methods and data sets used in John Marden's book Multivariate Statistics: Old School (2015) <ISBN:978-1456538835>. This also serves as a companion package for the STAT 571: Multivariate Analysis course offered by the Department of Statistics at the University of Illinois at Urbana-Champaign ('UIUC').
Authors: John Marden [aut, cph], James Balamuta [cre, ctb, com]
Maintainer: James Balamuta <[email protected]>
License: MIT + file LICENSE
Version: 1.2.0
Built: 2024-11-08 02:40:23 UTC
Source: https://github.com/coatless-rpkg/msos
Data on the average number of births for each hour of the day at four hospitals.
births
A double matrix with 24 observations on the following 4 variables.
Hospital1
Average number of births for each hour of the day within Hospital 1
Hospital2
Average number of births for each hour of the day within Hospital 2
Hospital3
Average number of births for each hour of the day within Hospital 3
Hospital4
Average number of births for each hour of the day within Hospital 4
To be determined
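As a quick orientation, here is a small hypothetical sketch (not from the original documentation; it assumes the msos package is installed) that summarizes the hourly birth profiles:

```r
# A hypothetical sketch: summarize and plot the hourly birth
# profiles of the four hospitals.
library(msos)
data(births)
colMeans(births)  # average births per hour at each hospital
matplot(1:24, births, type = "l", lty = 1,
        xlab = "Hour of day", ylab = "Average number of births")
```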
This function fits the model using least squares. It takes an optional pattern matrix P as in (6.51), which specifies which βij's are zero.
bothsidesmodel(x, y, z = diag(qq), pattern = matrix(1, nrow = p, ncol = l))
x: An n × p design matrix.
y: The n × q matrix of observations.
z: A q × l design matrix.
pattern: An optional p × l matrix of 0's and 1's indicating which βij's are zero (0) and which are free to be nonzero (1).
A list with the following components:
The least-squares estimate of β.
The p × l matrix whose (i, j)th element is the standard error of the estimate of βij.
The p × l matrix whose (i, j)th element is the t-statistic based on the estimate of βij.
The estimated covariance matrix of the estimated βij's.
An l-dimensional vector of the degrees of freedom for the t-statistics, where the jth component contains the degrees of freedom for the jth column of β.
The q × q estimated covariance matrix Σz.
The residual sum of squares and crossproducts matrix.
See also: bothsidesmodel.chisquare, bothsidesmodel.df, bothsidesmodel.hotelling, bothsidesmodel.lrt, and bothsidesmodel.mle.
# Mouth Size Example from 6.4.1
data(mouths)
x <- cbind(1, mouths[, 5])
y <- mouths[, 1:4]
z <- cbind(c(1, 1, 1, 1), c(-3, -1, 1, 3), c(1, -1, -1, 1), c(-1, 3, -3, 1))
bothsidesmodel(x, y, z)
Tests the null hypothesis that an arbitrary subset of the βij's is zero, based on the least-squares estimates, using the χ² test as in Section 7.1. The null and alternative are specified by pattern matrices pattern0 and patternA, respectively. If patternA is omitted, then the alternative will be taken to be the unrestricted model.
bothsidesmodel.chisquare( x, y, z, pattern0, patternA = matrix(1, nrow = ncol(x), ncol = ncol(z)) )
x: An n × p design matrix.
y: The n × q matrix of observations.
z: A q × l design matrix.
pattern0: A p × l matrix of 0's and 1's specifying the null hypothesis.
patternA: An optional p × l matrix of 0's and 1's specifying the alternative hypothesis.
A list with the following components:
The vector of estimated parameters of interest.
The estimated covariance matrix of the estimated parameter vector.
The degrees of freedom in the test.
The χ² statistic in (7.4).
The p-value for the test.
See also: bothsidesmodel, bothsidesmodel.df, bothsidesmodel.hotelling, bothsidesmodel.lrt, and bothsidesmodel.mle.
# TBA - Submit a PR!
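Pending an official example, here is a hypothetical sketch mirroring the mouth-size setup used elsewhere in this package; the pattern matrix is an illustrative choice, not from the book. It tests whether the quadratic and cubic growth-curve coefficients are zero:

```r
library(msos)
data(mouths)
x <- cbind(1, mouths[, 5])
y <- mouths[, 1:4]
z <- cbind(c(1, 1, 1, 1), c(-3, -1, 1, 3), c(1, -1, -1, 1), c(-1, 3, -3, 1))
# Null pattern (2 x 4): zero out the quadratic and cubic columns of beta
pattern0 <- cbind(c(1, 1), c(1, 1), c(0, 0), c(0, 0))
bothsidesmodel.chisquare(x, y, z, pattern0)
```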
Determines the denominators needed to calculate an unbiased estimator of Σz.
bothsidesmodel.df(xx, n, pattern)
xx: The matrix x′x, where x is the design matrix (see the example below).
n: The number of rows in the observation matrix y.
pattern: A p × l matrix of 0's and 1's indicating which βij's are zero.
A numeric matrix containing the degrees of freedom for the test.
See also: bothsidesmodel, bothsidesmodel.chisquare, bothsidesmodel.hotelling, bothsidesmodel.lrt, and bothsidesmodel.mle.
# Find the DF for a likelihood ratio test statistic.
x <- cbind(
  1,
  c(-2, -1, 0, 1, 2),
  c(2, -1, -2, -1, 2),
  c(-1, 2, 0, -2, 1),
  c(1, -4, 6, -4, 1)
) # or x <- cbind(1, poly(1:5, 4))
data(skulls)
x <- kronecker(x, rep(1, 30))
y <- skulls[, 1:4]
z <- diag(4)
pattern <- rbind(c(1, 1, 1, 1), 1, 0, 0, 0)
xx <- t(x) %*% x
bothsidesmodel.df(xx, nrow(y), pattern)
Performs tests of the null hypothesis H0 : β* = 0, where β* is a block submatrix of β, as in Section 7.2.
bothsidesmodel.hotelling(x, y, z, rows, cols)
x: An n × p design matrix.
y: The n × q matrix of observations.
z: A q × l design matrix.
rows: The vector of rows to be tested.
cols: The vector of columns to be tested.
A list with the following components:
A list with the components of the Lawley-Hotelling test (7.22): the T² statistic (7.19); the F version (7.22) of the statistic; the degrees of freedom for the F; and the p-value of the F.
A list with the components of the Wilks test (7.37): the Λ statistic (7.35); the χ² version (7.37) of the statistic, using Bartlett's correction; the degrees of freedom for the χ²; and the p-value of the χ².
See also: bothsidesmodel, bothsidesmodel.chisquare, bothsidesmodel.df, bothsidesmodel.lrt, and bothsidesmodel.mle.
# Finds the Hotelling values for example 7.3.1
data(mouths)
x <- cbind(1, mouths[, 5])
y <- mouths[, 1:4]
z <- cbind(c(1, 1, 1, 1), c(-3, -1, 1, 3), c(1, -1, -1, 1), c(-1, 3, -3, 1))
bothsidesmodel.hotelling(x, y, z, 1:2, 3:4)
Tests the null hypothesis that an arbitrary subset of the βij's is zero, using the likelihood ratio test as in Section 9.4. The null and alternative are specified by pattern matrices pattern0 and patternA, respectively. If patternA is omitted, then the alternative will be taken to be the unrestricted model.
bothsidesmodel.lrt( x, y, z, pattern0, patternA = matrix(1, nrow = ncol(x), ncol = ncol(z)) )
x: An n × p design matrix.
y: The n × q matrix of observations.
z: A q × l design matrix.
pattern0: A p × l matrix of 0's and 1's specifying the null hypothesis.
patternA: An optional p × l matrix of 0's and 1's specifying the alternative hypothesis.
A list with the following components:
The likelihood ratio statistic in (9.44).
The degrees of freedom in the test.
The p-value for the test.
See also: bothsidesmodel.chisquare, bothsidesmodel.df, bothsidesmodel.hotelling, bothsidesmodel, and bothsidesmodel.mle.
# Load data
data(caffeine)
# Matrices
x <- cbind(
  rep(1, 28),
  c(rep(-1, 9), rep(0, 10), rep(1, 9)),
  c(rep(1, 9), rep(-1.8, 10), rep(1, 9))
)
y <- caffeine[, -1]
z <- cbind(c(1, 1), c(1, -1))
pattern <- cbind(c(rep(1, 3)), 1)
# Fit model
bsm <- bothsidesmodel.lrt(x, y, z, pattern)
This function fits the model using maximum likelihood. It takes an optional pattern matrix as in (6.51), which specifies which βij's are zero.
bothsidesmodel.mle(x, y, z = diag(qq), pattern = matrix(1, nrow = p, ncol = l))
x: An n × p design matrix.
y: The n × q matrix of observations.
z: A q × l design matrix.
pattern: An optional p × l matrix of 0's and 1's indicating which βij's are zero (0) and which are free to be nonzero (1).
A list with the following components:
The least-squares estimate of β.
The p × l matrix whose (i, j)th element is the standard error of the estimate of βij.
The p × l matrix whose (i, j)th element is the t-statistic based on the estimate of βij.
The estimated covariance matrix of the estimated βij's.
An l-dimensional vector of the degrees of freedom for the t-statistics, where the jth component contains the degrees of freedom for the jth column of β.
The q × q estimated covariance matrix Σz.
The residual sum of squares and crossproducts matrix.
The dimension of the model, counting the nonzero βij's and components of Σz.
Mallows' Cp statistic.
The corrected AIC criterion from (9.87).
The BIC criterion from (9.56).
See also: bothsidesmodel.chisquare, bothsidesmodel.df, bothsidesmodel.hotelling, bothsidesmodel.lrt, and bothsidesmodel.
data(mouths)
x <- cbind(1, mouths[, 5])
y <- mouths[, 1:4]
z <- cbind(1, c(-3, -1, 1, 3), c(-1, 1, 1, -1), c(-1, 3, -3, 1))
bothsidesmodel.mle(x, y, z, cbind(c(1, 1), 1, 0, 0))
Generates estimates for MLE regression using a conditioning approach, with support for patterning.
bsm.fit(x, y, z, pattern)
x: An n × p design matrix.
y: The n × q matrix of observations.
z: A q × l design matrix.
pattern: An optional p × l matrix of 0's and 1's indicating which βij's are zero.
A list with the following components:
The least-squares estimate of β.
The p × l matrix whose (i, j)th element is the standard error of the estimate of βij.
The p × l matrix whose (i, j)th element is the t-statistic based on the estimate of βij.
The estimated covariance matrix of the estimated βij's.
An l-dimensional vector of the degrees of freedom for the t-statistics, where the jth component contains the degrees of freedom for the jth column of β.
The q × q estimated covariance matrix Σz.
The residual sum of squares and crossproducts matrix.
See also: bothsidesmodel.mle and bsm.simple.
# NA
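No example is provided; the following hypothetical sketch simply reuses the inputs from the bothsidesmodel.mle example, since bsm.fit takes the same arguments:

```r
library(msos)
data(mouths)
x <- cbind(1, mouths[, 5])
y <- as.matrix(mouths[, 1:4])
z <- cbind(1, c(-3, -1, 1, 3), c(-1, 1, 1, -1), c(-1, 3, -3, 1))
# Pattern: only the first two rows of the first column of beta are free
pattern <- cbind(c(1, 1), 1, 0, 0)
bsm.fit(x, y, z, pattern)
```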
Generates estimates for MLE regression using a conditioning approach.
bsm.simple(x, y, z)
x: An n × p design matrix.
y: The n × q matrix of observations.
z: A q × l design matrix.
The technique used to calculate the estimates is described in section 9.3.3.
A list with the following components:
The least-squares estimate of β.
The p × l matrix whose (i, j)th element is the standard error of the estimate of βij.
The p × l matrix whose (i, j)th element is the t-statistic based on the estimate of βij.
The estimated covariance matrix of the estimated βij's.
An l-dimensional vector of the degrees of freedom for the t-statistics, where the jth component contains the degrees of freedom for the jth column of β.
The q × q estimated covariance matrix Σz.
The residual sum of squares and crossproducts matrix.
See also: bothsidesmodel.mle and bsm.fit.
# Taken from section 9.3.3 to show equivalence to methods.
data(mouths)
x <- cbind(1, mouths[, 5])
y <- mouths[, 1:4]
z <- cbind(1, c(-3, -1, 1, 3), c(-1, 1, 1, -1), c(-1, 3, -3, 1))
yz <- y %*% solve(t(z))
yza <- yz[, 1:2]
xyzb <- cbind(x, yz[, 3:4])
lm(yza ~ xyzb - 1)
bsm.simple(xyzb, yza, diag(2))
Henson et al. [1996] conducted an experiment to see whether caffeine has a negative effect on short-term visual memory. High school students were randomly chosen: 9 from eighth grade, 10 from tenth grade, and 9 from twelfth grade. Each person was tested once after having caffeinated Coke, and once after having decaffeinated Coke. After each drink, the person was given ten seconds to try to memorize twenty small, common objects, then allowed a minute to write down as many as could be remembered. The main question of interest is whether people remembered more objects after the Coke without caffeine than after the Coke with caffeine.
caffeine
A double matrix with 28 observations on the following 3 variables.
Grade of the student: 8th, 10th, or 12th.
Number of items remembered after drinking Coke without caffeine.
Number of items remembered after drinking Coke with caffeine.
Claire Henson, Claire Rogers, and Nadia Reynolds. Always Coca-Cola. Technical report, University Laboratory High School, Urbana, IL, 1996.
The data set cars [Consumers' Union, 1990] contains 111 models of automobile. The original data can be found in the S-Plus [TIBCO Software Inc., 2009] data frame cu.dimensions. In cars, the variables have been normalized to have medians of 0 and median absolute deviations (MAD) of 1.4826 (the MAD for a N(0, 1)).
cars
A double matrix with 111 observations on the following 11 variables.
Overall length, in inches, as supplied by manufacturer.
Length of wheelbase, in inches, as supplied by manufacturer.
Width of car, in inches, as supplied by manufacturer.
Height of car, in inches, as supplied by manufacturer.
Distance between the car's head-liner and the head of a 5 ft. 9 in. front seat passenger, in inches, as measured by CU.
Distance between the car's head-liner and the head of a 5 ft. 9 in. rear seat passenger, in inches, as measured by CU.
Maximum front leg room, in inches, as measured by CU.
Rear fore-and-aft seating room, in inches, as measured by CU.
Front shoulder room, in inches, as measured by CU.
Rear shoulder room, in inches, as measured by CU.
Luggage Area in Car.
Consumers' Union. Body dimensions. Consumer Reports, April 286 - 288, 1990.
Chakrapani and Ehrenberg [1981] analyzed people's attitudes towards a variety of breakfast cereals. The data matrix cereal is 8 × 11, with rows corresponding to eight cereals, and columns corresponding to potential attributes about cereals. The original data consisted of the percentage of subjects who thought the given cereal possessed the given attribute. The present matrix has been doubly centered, so that the row means and column means are all zero. (The original data can be found in the S-Plus [TIBCO Software Inc., 2009] data set cereal.attitude.)
cereal
A double matrix with 8 observations on the following 11 variables.
A cereal one would come back to
Tastes good
Popular with the entire family
Cereal is fulfilling
Cereal lacks flavor additives
Cereal is priced well for the content
Quantity for Price
Stays crispy in milk
Keeps one fit
Fun for children
T. K. Chakrapani and A. S. C. Ehrenberg. An alternative to factor analysis in marketing research part 2: Between group analysis. Professional Marketing Research Society Journal, 1:32-38, 1981.
The crabs data frame has 200 rows and 8 columns, describing 5 morphological measurements on 50 crabs each of two colour forms and both sexes, of the species Leptograpsus variegatus collected at Fremantle, W. Australia.
crabs
This data frame contains the following columns:
sp: species - "B" or "O" for blue or orange.
sex: "M" (male) or "F" (female).
index: index 1:50 within each of the four groups.
FL: frontal lobe size (mm).
RW: rear width (mm).
CL: carapace length (mm).
CW: carapace width (mm).
BD: body depth (mm).
Campbell, N.A. and Mahon, R.J. (1974) A multivariate study of variation in two species of rock crab of genus Leptograpsus. Australian Journal of Zoology 22, 417–425.
MASS, R-Package
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.
The decathlon data set has scores on the top 24 men in the decathlon (a set of ten events) at the 2008 Olympics. The scores are the numbers of points each participant received in each event, plus each person's total points.
decathlon08
A double matrix with 24 observations on the following 11 variables.
Individual point score for 100 Meter event.
Individual point score for Long Jump event.
Individual point score for Shot Put event.
Individual point score for High Jump event.
Individual point score for 400 Meter event.
Individual point score for 110 Hurdles event.
Individual point score for Discus event.
Individual point score for Pole Vault event.
Individual point score for Javelin event.
Individual point score for 1500 Meter event.
Individual total point score for events participated in.
NBC's Olympic site
The decathlon data set has scores on the top 26 men in the decathlon (a set of ten events) at the 2012 Olympics. The scores are the numbers of points each participant received in each event, plus each person's total points.
decathlon12
A double matrix with 26 observations on the following 11 variables.
Individual point score for 100 Meter event.
Individual point score for Long Jump event.
Individual point score for Shot Put event.
Individual point score for High Jump event.
Individual point score for 400 Meter event.
Individual point score for 110 Hurdles event.
Individual point score for Discus event.
Individual point score for Pole Vault event.
Individual point score for Javelin event.
Individual point score for 1500 Meter event.
Individual total point score for events participated in.
NBC's Olympic site
The data set election has the results of the first three US presidential races of the 2000s (2000, 2004, 2008). The observations are the 50 states plus the District of Columbia, and the values are the (D - R)/(D + R) for each state and each year, where D is the number of votes the Democrat received, and R is the number the Republican received.
election
A double matrix with 51 observations on the following 3 variables.
Results for 51 States in Year 2000
Results for 51 States in Year 2004
Results for 51 States in Year 2008
Calculated by Prof. John Marden, data source to be announced.
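A hypothetical sketch (not from the original documentation; it assumes the msos package is installed) of how one might explore these data:

```r
library(msos)
data(election)
# Correlations between the three races: presidential results are
# highly persistent across states
round(cor(election), 3)
# Eigenvalues of the covariance matrix: most of the state-to-state
# variation lies along a single axis
eigen(var(election))$values
```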
The exams matrix has data on 191 statistics students, giving their scores (out of 100) on the three midterm exams, and the final exam.
exams
A double matrix with 191 observations on the following 4 variables.
Student score on the first midterm out of 100.
Student score on the second midterm out of 100.
Student score on the third midterm out of 100.
Student score on the Final Exam out of 100.
Data from one of Prof. John Marden's earlier classes
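A hypothetical sketch (assuming the msos package is installed) of a first look at how the exam scores relate:

```r
library(msos)
data(exams)
# Correlations among the three midterms and the final
round(cor(exams), 2)
```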
The function fillout takes an n × p matrix z with p < n and fills it out by appending columns so that the result is a square n × n matrix.
fillout(z)
z: An n × p matrix with p < n.
A square n × n matrix.
# Create a 3 x 2 matrix
a <- cbind(c(1, 2, 3), c(4, 5, 6))
# Creates a 3 x 3 matrix from the 3 x 2 data
fillout(a)
The data set contains grades of 107 students.
grades
A double matrix with 107 observations on the following 7 variables.
Sex (0=Male, 1=Female)
Student Score on all Homework.
Student Score on all Labs.
Student Score on all In Class work.
Student Score on all Midterms.
Student Score on the Final.
Student's Total Score
Data from one of Prof. John Marden's earlier classes
Sixteen dogs were treated with drugs to see the effects on their blood histamine levels. The dogs were split into four groups: two groups received the drug morphine, and two received the drug trimethaphan, both given intravenously. For one group within each pair of drug groups, the dogs had their supply of histamine depleted before treatment, while the other group had histamine intact. (Measurements with the value "0.10" marked data that were missing and were filled in with that value arbitrarily.)
histamine
A double matrix with 16 observations on the following 4 variables.
Histamine levels (in micrograms per milliliter of blood) before the inoculation.
Histamine levels (in micrograms per milliliter of blood) one minute after inoculation.
Histamine levels (in micrograms per milliliter of blood) three minutes after inoculation.
Histamine levels (in micrograms per milliliter of blood) five minutes after inoculation.
Kenny J.Morris and Robert Zeppa. Histamine-induced hypotension due to morphine and arfonad in the dog. Journal of Surgical Research, 3(6):313-317, 1963.
Obtains the index of the largest value in a vector.
imax(z)
z: A vector of any length.
The index of the largest value in a vector.
# Iris example
x.iris <- as.matrix(iris[, 1:4])
# Gets group vector (1, ... , 1, 2, ... , 2, 3, ... , 3)
y.iris <- rep(1:3, c(50, 50, 50))
ld.iris <- lda(x.iris, y.iris)
disc <- x.iris %*% ld.iris$a
disc <- sweep(disc, 2, ld.iris$c, "+")
yhat <- apply(disc, 1, imax)
Finds the coefficients ak and constants ck for Fisher's linear discrimination function dk in (11.31) and (11.32).
lda(x, y)
x: The n × p data matrix.
y: The n-vector of group identities, assumed to be coded 1, 2, ..., K for K groups.
A list with the following components:
a: A p × K matrix, where column k contains the coefficients ak for (11.31). The final column is all zero.
c: The K-vector of constants ck for (11.31). The final value is zero.
# Iris example
x.iris <- as.matrix(iris[, 1:4])
# Gets group vector (1, ... , 1, 2, ... , 2, 3, ..., 3)
y.iris <- rep(1:3, c(50, 50, 50))
ld.iris <- lda(x.iris, y.iris)
Dataset with leprosy patients found in Snedecor and Cochran [1989]. There were 30 patients, randomly allocated to three groups of 10. The first group received drug A, the second drug D, and the third group received a placebo. Each person had their bacterial count taken before and after receiving the treatment.
leprosy
A double matrix with 30 observations on the following 3 variables.
Bacterial count taken before receiving the treatment.
Bacterial count taken after receiving the treatment.
Group coding: 0 = Drug A, 1 = Drug D, 2 = Placebo
George W. Snedecor and William G. Cochran. Statistical Methods. Iowa State University Press, Ames, Iowa, eighth edition, 1989.
Takes the log determinant of a square matrix, where the logarithm is the natural log (base e), sometimes written ln().
logdet(a)
a: A square matrix.
A single double value.
# 2 x 2 diagonal matrix with 2's on the diagonal; logdet returns log(4)
logdet(diag(c(2, 2)))
Measurements were made on the size of mouths of 27 children at four ages: 8, 10, 12, and 14. The measurement is the distance from the "center of the pituitary to the pteryomaxillary fissure" in millimeters. These data can be found in Potthoff and Roy [1964]. There are 11 girls (Sex=1) and 16 boys (Sex=0).
mouths
A data frame with 27 observations on the following 5 variables.
Measurement on child's mouth at age eight.
Measurement on child's mouth at age ten.
Measurement on child's mouth at age twelve.
Measurement on child's mouth at age fourteen.
Sex coding: girls = 1 and boys = 0
Richard F. Potthoff and S. N. Roy. A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika, 51:313-326, 1964.
Calculates the histogram-based estimate (A.2) of the negentropy for a vector of observations.
negent(x, K = ceiling(log2(length(x)) + 1))
x: The n-vector of observations.
K: The number of bins to use in the histogram.
The value of the estimated negentropy.
# TBA - Submit a PR!
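Pending an official example, a small hypothetical sketch comparing the estimate on Gaussian and non-Gaussian samples (negentropy is zero for the normal distribution and positive otherwise; the histogram estimate is only approximate):

```r
library(msos)
set.seed(1)
negent(rnorm(1000))  # near zero for normal data
negent(runif(1000))  # larger for non-normal data
```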
Searches for the rotation that maximizes the estimated negentropy of the first column of the rotated data, for two-dimensional data.
negent2D(y, m = 100)
y: The n × 2 data matrix.
m: The number of angles (between 0 and π) over which to search.
A list with the following components:
The orthogonal matrix G that optimizes the negentropy.
Estimated negentropies for the two rotated variables. The largest is first.
# Load iris data
data(iris)
# Centers and scales the variables.
y <- scale(as.matrix(iris[, 1:2]))
# Obtains Negent Vectors for 2 x 2 matrix
gstar <- negent2D(y, m = 10)$vectors
Searches for the rotation that maximizes the estimated negentropy of the first column of the rotated data, and of the second variable fixing the first, for three-dimensional data. The routine uses a random start for the function optim using the simulated annealing option SANN; hence one may wish to increase the number of attempts by setting nstart to an integer larger than 1.
negent3D(y, nstart = 1, m = 100, ...)
y: The n × 3 data matrix.
nstart: The number of times to randomly start the search routine.
m: The number of angles (between 0 and π) over which to search.
...: Further optional arguments to pass to the optim function.
A list with the following components:
The 3 × 3 orthogonal matrix G that optimizes the negentropy.
Estimated negentropies for the three rotated variables, from largest to smallest.
## Not run:
# Running this example will take approximately 30s.

# Centers and scales the variables.
y <- scale(as.matrix(iris[, 1:3]))
# Obtains Negent Vectors for 3x3 matrix
gstar <- negent3D(y, nstart = 100)$vectors
## End(Not run)
The subjective assessment, on a 0 to 20 integer scale, of 54 classical painters. The painters were assessed on four characteristics: composition, drawing, colour and expression. The data is due to the eighteenth-century art critic de Piles.
painters
The row names of the data frame are the painters. The components are:
Composition: Composition score.
Drawing: Drawing score.
Colour: Colour score.
Expression: Expression score.
School: The school to which a painter belongs, as indicated by a factor level code: "A" = Renaissance; "B" = Mannerist; "C" = Seicento; "D" = Venetian; "E" = Lombard; "F" = Sixteenth Century; "G" = Seventeenth Century; "H" = French.
A. J. Weekes (1986) A Genstat Primer. Edward Arnold.
M. Davenport and G. Studdert-Kennedy (1972) The statistical analysis of aesthetic judgement: an exploration. Applied Statistics 21, 324–333.
I. T. Jolliffe (1986) Principal Component Analysis. Springer.
MASS, R-Package
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.
Finds the BIC and MLE from a set of observed eigenvalues for a specific pattern.
pcbic(eigenvals, n, pattern)
eigenvals: The q-vector of observed eigenvalues, in decreasing order.
n: The degrees of freedom in the covariance matrix.
pattern: The pattern of equalities of the eigenvalues, given by a vector whose components are the numbers of equal eigenvalues in each group, summing to q.
A list with the following components:
A q-vector containing the MLE's for the eigenvalues.
The deviance of the model, as in (13.13).
The dimension of the model, as in (13.12).
The value of the BIC for the model, as in (13.14).
See also: pcbic.stepwise, pcbic.unite, and pcbic.subpatterns.
# Build cars1
require("mclust")
mcars <- Mclust(cars)
cars1 <- cars[mcars$classification == 1, ]
xcars <- scale(cars1)
eg <- eigen(var(xcars))
pcbic(eg$values, 95, c(1, 1, 3, 3, 2, 1))
Uses the stepwise procedure described in Section 13.1.4 to find a pattern for a set of observed eigenvalues with good BIC value.
pcbic.stepwise(eigenvals, n)
eigenvals: The q-vector of observed eigenvalues, in decreasing order.
n: The degrees of freedom in the covariance matrix.
A list with the following components:
A list of patterns, one for each possible length.
A vector of the BICs for the above patterns.
The best (smallest) value among the BICs.
The pattern with the best BIC.
A q-vector containing the MLE's for the eigenvalues for the pattern with the best BIC.
See also: pcbic, pcbic.unite, and pcbic.subpatterns.
# Build cars1
require("mclust")
mcars <- Mclust(cars)
cars1 <- cars[mcars$classification == 1, ]
xcars <- scale(cars1)
eg <- eigen(var(xcars))
pcbic.stepwise(eg$values, 95)
Obtains the best pattern and its BIC among the patterns obtainable by summing two consecutive terms in pattern0.
pcbic.subpatterns(eigenvals, n, pattern0)
eigenvals: The q-vector of observed eigenvalues, in decreasing order.
n: The degrees of freedom in the covariance matrix.
pattern0: The pattern of equalities of the eigenvalues, given by a vector whose components are the numbers of equal eigenvalues in each group.
A list containing:
A matrix containing the patterns evaluated.
A vector containing the BICs for the above patterns.
See also: pcbic, pcbic.stepwise, and pcbic.unite.
# NA
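No example is provided; a hypothetical sketch, reusing the setup from the pcbic example, evaluates all subpatterns of a starting pattern:

```r
library(msos)
library(mclust)
mcars <- Mclust(cars)  # msos's cars data
cars1 <- cars[mcars$classification == 1, ]
eg <- eigen(var(scale(cars1)))
# All patterns obtained by summing two consecutive terms of (1, 1, 3, 3, 2, 1)
pcbic.subpatterns(eg$values, 95, c(1, 1, 3, 3, 2, 1))
```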
Returns the pattern obtained by summing pattern[index1] and pattern[index1 + 1].
pcbic.unite(pattern, index1)
pattern: The pattern of equalities of the eigenvalues, given by a vector whose components are the numbers of equal eigenvalues in each group.
index1: The index of the first of the two consecutive components of the pattern to be summed.
A vector containing the new pattern.
See also: pcbic, pcbic.stepwise, and pcbic.subpatterns.
# NA
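No example is provided; based on the description above, a minimal hypothetical sketch would be:

```r
library(msos)
# Sum the second and third components of the pattern (1, 1, 3, 3, 2, 1),
# merging those two eigenvalue groups into one of size 4
pcbic.unite(c(1, 1, 3, 3, 2, 1), 2)
```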
Six astronomical variables are given on each of the historical nine planets (or eight planets, plus Pluto).
planets
A double matrix with 9 observations on the following 6 variables.
Average distance in millions of miles the planet is from the sun.
The length of the planet's day in Earth days.
The length of the planet's year in Earth days.
The planet's diameter in miles.
The planet's temperature in degrees Fahrenheit.
Number of moons.
John W. Wright, editor. The Universal Almanac. Andrews McMeel Publishing, Kansas City, MO, 1997.
The function uses the output from the function qda (Section A.3.2) and a p-vector x, and calculates the predicted group for this x.
predict_qda(qd, newx)
qd: The output from qda.
newx: A p-vector x whose group membership is to be predicted.
A K-vector of the discriminant values in (11.48) for the given x.
# Load Iris Data
data(iris)
# Build data
x.iris <- as.matrix(iris[, 1:4])
n <- nrow(x.iris)
# Gets group vector (1, ... , 1, 2, ..., 2, 3, ... , 3)
y.iris <- rep(1:3, c(50, 50, 50))
# Perform QDA
qd.iris <- qda(x.iris, y.iris)
yhat.qd <- NULL
for (i in seq_len(n)) {
  yhat.qd <- c(yhat.qd, imax(predict_qda(qd.iris, x.iris[i, ])))
}
table(yhat.qd, y.iris)
Data from Ware and Bowden [1977] taken at six four-hour intervals (labelled T1 to T6) over the course of a day for 10 individuals. The measurements are prostaglandin contents in their urine.
prostaglandin
A double matrix with 10 observations on the following 6 variables.
First four-hour interval measurement of prostaglandin.
Second four-hour interval measurement of prostaglandin.
Third four-hour interval measurement of prostaglandin.
Fourth four-hour interval measurement of prostaglandin.
Fifth four-hour interval measurement of prostaglandin.
Sixth four-hour interval measurement of prostaglandin.
J H Ware and R E Bowden. Circadian rhythm analysis when output is collected at intervals. Biometrics, 33(3):566-571, 1977.
The function returns the elements needed to calculate the quadratic discrimination in (11.48). Use the output from this function in predict_qda (Section A.3.2) to find the predicted groups.
qda(x, y)
x: The n × p data matrix.
y: The n-vector of group identities, assumed to be coded 1, 2, ..., K for K groups.
A list with the following components:
A matrix, where column k contains the coefficients for (11.31); the final column is all zero.
An array, where Sigma[k, , ] contains the sample covariance matrix for group k, k = 1, ..., K.
The K-vector of constants ck for (11.48).
See also: predict_qda and lda.
# Load Iris Data
data(iris)
# Iris example
x.iris <- as.matrix(iris[, 1:4])
# Gets group vector (1, ... , 1, 2, ... , 2, 3, ... , 3)
y.iris <- rep(1:3, c(50, 50, 50))
# Perform QDA
qd.iris <- qda(x.iris, y.iris)
This function takes a matrix that is the Kronecker product A ⊗ B (Definition 3.5), where A is p × qq and B is n × m, and outputs the matrix B ⊗ A.
reverse.kronecker(ab, p, qq)
ab: The (p·n) × (qq·m) matrix A ⊗ B.
p: The number of rows of A.
qq: The number of columns of A.
The (n·p) × (m·qq) matrix B ⊗ A.
# Create matrices
(A <- diag(1, 3))
(B <- matrix(1:6, ncol = 2))
# Perform kronecker
(kron <- kronecker(A, B))
# Perform reverse kronecker product
(reverse.kronecker(kron, 3, 3))
# Perform kronecker again
(kron2 <- kronecker(B, A))
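The reversal can also be obtained in base R by permuting the rows and columns of A ⊗ B. The function rev_kron below is a hypothetical stand-alone sketch, not the package's implementation:

```r
# Sketch: row (k-1)*p + i of B %x% A equals row (i-1)*m + k of A %x% B,
# and similarly for columns, so the reversal is a row/column permutation.
rev_kron <- function(ab, p, q) {
  m <- nrow(ab) / p  # rows of B
  n <- ncol(ab) / q  # columns of B
  rperm <- as.vector(matrix(seq_len(p * m), nrow = p, byrow = TRUE))
  cperm <- as.vector(matrix(seq_len(q * n), nrow = q, byrow = TRUE))
  ab[rperm, cperm, drop = FALSE]
}

A <- diag(3)
B <- matrix(1:6, ncol = 2)
all.equal(rev_kron(kronecker(A, B), 3, 3), kronecker(B, A))  # should be TRUE
```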
A retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa.
SAheart
A data frame with 462 observations on the following 10 variables.
sbp: systolic blood pressure
tobacco: cumulative tobacco (kg)
ldl: low density lipoprotein cholesterol
adiposity: a numeric vector
famhist: family history of heart disease, a factor with levels "Absent" and "Present"
typea: type-A behavior
obesity: a numeric vector
alcohol: current alcohol consumption
age: age at onset
chd: response, coronary heart disease
A retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. There are roughly two controls per case of CHD. Many of the CHD-positive men underwent blood pressure reduction treatment and other programs to reduce their risk factors after their CHD event, and in some cases the measurements were made after these treatments. These data are taken from a larger dataset, described in Rousseauw et al. (1983), South African Medical Journal.
Rousseauw, J., du Plessis, J., Benade, A., Jordaan, P., Kotze, J. and Ferreira, J. (1983). Coronary risk factor screening in three rural communities, South African Medical Journal 64: 430–436.
ElemStatLearn R package
Find the silhouettes (12.9) for K-means clustering from the data and the groups' centers.
silhouette.km(x, centers)
x | The data matrix, one row per observation. |
centers | The matrix of cluster centers, one row per group (e.g., the centers component returned by kmeans). |
This function is a bit different from the silhouette function in the cluster package, Maechler et al., 2005.
The n-vector of silhouettes, indexed by the observations' indices.
# Uses sports data.
data(sportsranks)
# Obtain the K-means clustering for sports ranks.
kms <- kmeans(sportsranks, centers = 5, nstart = 10)
# Silhouettes
sil <- silhouette.km(sportsranks, kms$centers)
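The centers-based silhouette idea (distance to the nearest center versus the second-nearest) can be sketched in base R. The helper sil_km below is hypothetical and may differ in detail from the package's silhouette.km:

```r
# Sketch: for each observation, a = squared distance to its nearest center,
# b = squared distance to the second-nearest; silhouette = (b - a) / max(a, b).
sil_km <- function(x, centers) {
  # n x K matrix of squared distances from observations to centers
  d2 <- apply(centers, 1, function(ctr) rowSums(sweep(x, 2, ctr)^2))
  srt <- t(apply(d2, 1, sort))  # sorted distances, smallest first
  (srt[, 2] - srt[, 1]) / pmax(srt[, 1], srt[, 2])
}

# Toy data: two well-separated clusters of 20 points each
set.seed(1)
x <- rbind(matrix(rnorm(40), ncol = 2), matrix(rnorm(40, mean = 4), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 10)
summary(sil_km(x, km$centers))
```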
The data concern the sizes of Egyptian skulls over time, from Thomson and Randall-MacIver [1905]. There are 30 skulls from each of five time periods, for n = 150 in all.
skulls
A double matrix with 150 observations on the following 5 variables.
Maximum length in millimeters
Basibregmatic Height in millimeters
Basialveolar Length in millimeters
Nasal Height in millimeters
Time groupings
A. Thomson and R. Randall-MacIver. Ancient Races of the Thebaid. Oxford University Press, 1905.
A data set containing 23 people's rankings of 8 soft drinks: Coke, Pepsi, Sprite, 7-up, and their diet equivalents.
softdrinks
A double matrix with 23 observations on the following 8 variables.
Ranking given to Coke
Ranking given to Pepsi
Ranking given to 7-up
Ranking given to Sprite
Ranking given to Diet Coke
Ranking given to Diet Pepsi
Ranking given to Diet 7-up
Ranking given to Diet Sprite
Data from one of Prof. John Marden's earlier classes.
Sorts the silhouettes, first by group, then by value, preparatory to plotting.
sort_silhouette(sil, cluster)
sil | The vector of silhouette values. |
cluster | The vector of cluster assignments for the observations (e.g., the cluster component returned by kmeans). |
The n-vector of sorted silhouettes.
# Uses sports data.
data(sportsranks)
# Obtain the K-means clustering for sports ranks.
kms <- kmeans(sportsranks, centers = 5, nstart = 10)
# Silhouettes
sil <- silhouette.km(sportsranks, kms$centers)
# Sort the silhouettes by cluster
ssil <- sort_silhouette(sil, kms$cluster)
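The sorting itself is simple to sketch in base R: group the silhouettes by cluster label, then sort within each group. The helper sort_sil below is hypothetical, sorts in decreasing order within groups, and may differ from the package function in ordering details:

```r
# Hypothetical sketch of sorting silhouettes first by cluster, then by value.
sort_sil <- function(sil, cluster) {
  unlist(lapply(sort(unique(cluster)),
                function(k) sort(sil[cluster == k], decreasing = TRUE)),
         use.names = FALSE)
}

sort_sil(c(0.2, 0.9, 0.5, 0.7), cluster = c(1, 2, 1, 2))
# c(0.5, 0.2, 0.9, 0.7): cluster 1's values first, each group sorted decreasingly
```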
In the Hewlett-Packard spam data, a set of n = 4601 emails were classified according to whether they were spam, where "0" means not spam, "1" means spam. Fifty-seven explanatory variables based on the content of the emails were recorded, including various word and symbol frequencies. The emails were sent to George Forman (not the boxer) at Hewlett-Packard labs, hence emails with the words "George" or "hp" would likely indicate non-spam, while "credit" or "!" would suggest spam. The data were collected by Hopkins et al. [1999], and are in the data matrix Spam. (They are also in the R data frame spam from the ElemStatLearn package [Halvorsen, 2009], as well as at the UCI Machine Learning Repository [Frank and Asuncion, 2010].)
Spam
A double matrix with 4601 observations on the following 58 variables.
Percentage of words in the e-mail that match make.
Percentage of words in the e-mail that match address.
Percentage of words in the e-mail that match all.
Percentage of words in the e-mail that match 3d.
Percentage of words in the e-mail that match our.
Percentage of words in the e-mail that match over.
Percentage of words in the e-mail that match remove.
Percentage of words in the e-mail that match internet.
Percentage of words in the e-mail that match order.
Percentage of words in the e-mail that match mail.
Percentage of words in the e-mail that match receive.
Percentage of words in the e-mail that match will.
Percentage of words in the e-mail that match people.
Percentage of words in the e-mail that match report.
Percentage of words in the e-mail that match addresses.
Percentage of words in the e-mail that match free.
Percentage of words in the e-mail that match business.
Percentage of words in the e-mail that match email.
Percentage of words in the e-mail that match you.
Percentage of words in the e-mail that match credit.
Percentage of words in the e-mail that match your.
Percentage of words in the e-mail that match font.
Percentage of words in the e-mail that match 000.
Percentage of words in the e-mail that match money.
Percentage of words in the e-mail that match hp.
Percentage of words in the e-mail that match george.
Percentage of words in the e-mail that match 650.
Percentage of words in the e-mail that match lab.
Percentage of words in the e-mail that match labs.
Percentage of words in the e-mail that match telnet.
Percentage of words in the e-mail that match 857.
Percentage of words in the e-mail that match data.
Percentage of words in the e-mail that match 415.
Percentage of words in the e-mail that match 85.
Percentage of words in the e-mail that match technology.
Percentage of words in the e-mail that match 1999.
Percentage of words in the e-mail that match parts.
Percentage of words in the e-mail that match pm.
Percentage of words in the e-mail that match direct.
Percentage of words in the e-mail that match cs.
Percentage of words in the e-mail that match meeting.
Percentage of words in the e-mail that match original.
Percentage of words in the e-mail that match project.
Percentage of words in the e-mail that match re.
Percentage of words in the e-mail that match edu.
Percentage of words in the e-mail that match table.
Percentage of words in the e-mail that match conference.
Percentage of characters in the e-mail that match SEMICOLON.
Percentage of characters in the e-mail that match PARENTHESES.
Percentage of characters in the e-mail that match BRACKET.
Percentage of characters in the e-mail that match EXCLAMATION.
Percentage of characters in the e-mail that match DOLLAR.
Percentage of characters in the e-mail that match POUND.
Average length of uninterrupted sequences of capital letters.
Length of longest uninterrupted sequence of capital letters.
Total number of capital letters in the e-mail
Denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. Spam data. Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304, 1999.
Louis Roussos asked n = 130 people to rank seven sports, assigning #1 to the sport they most wish to participate in, and #7 to the one they least wish to participate in. The sports are baseball, football, basketball, tennis, cycling, swimming and jogging.
sportsranks
A double matrix with 130 observations on the following 7 variables.
Baseball's ranking out of seven sports.
Football's ranking out of seven sports.
Basketball's ranking out of seven sports.
Tennis' ranking out of seven sports.
Cycling's ranking out of seven sports.
Swimming's ranking out of seven sports.
Jogging's ranking out of seven sports.
Data from one of Prof. John Marden's earlier classes.
A data set containing several demographic variables on the 50 United States, plus D.C.
states
A double matrix with 51 observations on the following 11 variables.
In thousands
The percentage of the population that lives in metropolitan areas.
Number per 100,000 people.
The percentage enrollment in primary and secondary schools.
The average salary of primary and secondary school teachers.
The percentage of full-time enrollment in college.
Violent crimes per 100,000 people
Number of people in prison per 10,000 people.
Percentage of people below the poverty line.
Percentage of people employed
Median household income
United States (1996) Statistical Abstract of the United States. Bureau of the Census.
http://www.census.gov/statab/www/ranks.html
Takes the trace of a matrix by extracting the diagonal entries and summing them.
tr(x)
x | A square matrix. |
Returns the trace as a single double.
# Identity matrix of size 4 gives a trace of 4.
tr(diag(4))
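In base R the same computation is a one-liner; the stand-alone sketch tr_sketch below simply mirrors what the description says tr does, and the name is hypothetical:

```r
# Trace as the sum of the diagonal entries (base-R sketch).
tr_sketch <- function(x) sum(diag(x))

tr_sketch(diag(4))                # 4
tr_sketch(matrix(1:4, nrow = 2))  # 1 + 4 = 5
```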