|
|
||
![]() |
Estimating EquationsVersion 1.0 Quantitative
Genetic Epidemiology Group
ContentsCURRENT VERSION |
|
The current version of EE is 1.0 dated Oct. 25, 1994. This is the first version of the program.
EE, standing for estimating equations, consists of several programs that are useful for regression analysis of univariate or multivariate responses with multiple covariates. Regression analysis is one of the most commonly used statistical techniques because a vast number of different applications can be structured and analyzed as a regression problem, such as simple comparisons of means between two groups by t-test or between multiple groups by analysis of variance, somewhat more complex problems arising from longitudinal studies, and clustered sampling survey studies or family studies. This generality is the essential motivation for developing EE as a general package of regression analysis.
Traditionally, regression techniques have also been categorized by the specific link functions that depict the relationship between outcomes and covariates. For example, linear regression assumes that a linear combination of covariates is linearly associated with or determines the average of the outcomes. Logistic regression assumes that the average of the outcomes is the logit of a linear combination of covariates, and is often used for analyzing binary outcomes.
While many regression techniques, and hence relevant computer packages, have been developed for regression analysis, EE has a number of unique features. First of all, EE is developed based on an estimating equation technique, which differs from the method of maximum likelihood, or quasi-likelihood. The method of estimating equations requires only assumptions on relevant moments rather than knowledge of the entire distribution required for constructing likelihood functions or unneccessary assumptions of moments in constructing quasilikelihood. Hence, the estimates obtained by this technique are valid under much more general condition than those by the other techniques, i.e., they are more robust. Secondly, EE permits simultaneous regression analysis of means, variances, and covariances on covariates, the generality of which sets it apart from two other programs, GEE and GEE2, currently available.
Historically, the idea of robust regression may be traced back
to Godambe (1960), Huber (1967) and White (1982). However, it was
not until 1986 when Liang and Zeger (1986) and Zeger and Liang
(1986) re-invented the estimating equation approach for
longitudinal studies that the concept of estimating equations
became widely accepted and the technique frequently used with
implemented GEE programs. Extending Liang and Zegers idea,
Prentice (1988) proposed the joint estimation of parameters in
both means and covariances. Soon after, Prentice and his
colleagues identified a partly exponential family and a quadratic
exponential family under which the proposed estimating equations
are simply score estimating equations, bridging the gap between
the likelihood principle and the method of estimating equations
(Zhao and Prentice 1990, Prentice and Zhao 1992, Zhao and
Prentice 1992, Zhao, Prentice and Self 1993). While bridging the
theoretical gap, Zhao and Prentice have extended the usual
regression of the marginal means to the joint regression of
means, variances and covariances on covariates. The
implementation of EE is largely based on Prentice and Zhaos
generalization
Back to Top.
EE consists of several programs, all of which can be used to
perform univariate and multivariate analysis. Specifically, when
a single response is identified, EE automatically chooses the
programs for univariate analysis. In performing the univariate
analysis, one has the option of regressing only the mean of the
response on covariates or both the mean and variance of the
response on covariates, with a choice of a number of link
functions, including linear, logistic and exponential. When a
vector of multivariate responses is supplied, EE automatically
selects programs for multivariate analysis. In the multivariate
analysis, one has the option of focusing on the marginal means of
the outcomes, or both the marginal means and covariances of the
outcomes. In either cases, one has a choice of link functions for
the marginal means, variances and covariances, as well as a
choice of covariates to be included in their respective mean link
functions. The flexible choice of covariates into these first two
moments permits us to explore problems ranging from longitudinal
studies, studies with repeated measurements, or studies with
multiple outcomes.
Back to Top
In contrast to the method of maximum likelihood, the method of estimating equations requires only assumptions of relevant moments such as means, variances, and correlations. Let y¢ =(y1,...,ym) denote a vector of m outcomes, which may be a single outcome in the univariate analysis, or may be a vector of any length in a multivariate analysis. Also, for i=1,...,m, let xi¢ =(xi1,...,xip), denote a vector of p covariates associated with the expected value of yi, zi¢ =(zi1,...,ziq) a vector of q covariates associated with the variance of yi, and zzk¢ =(zzk1..,zzkr), k=1,...,m(m-1)/2, a vector of r covariates associated with correlations between yi and other outcomes yj. The vectors xi, zi , and zzk may consist of some of the same covariates. The objective of the regression analysis is to study the dependence of the outcomes, y, on these covariates. Most applications focus on the dependence of the averaged outcomes, i.e., their expectation, on the covariates, and this relationship may be quantified via
![]()
where a ¢ =(a 1,...,a p) is a vector of p regression coefficients, ¢ is the transpose operator, and g(× ) is a specified link function.
Similarly, the variance and correlation of outcomes can be expressed as functions of covariates as follows:

where f(.) and h(.) are specified link functions for variances and corrlelations and b¢ =(b1,...,bq) and g¢ =(g1,...,gr) are vectors of q and r regression coefficients, respectively.
Using this general framework, one can model the dependencies of the outcomes on covariates in term of the means, variances and covariances. In statistical modeling, one needs to rationalize link functions on the basis of the subject matter. For example, in choosing a link function for means, a logistic regression model is often considered for binary outcomes or binomial counts. For Poisson counts, an exponential regression model can be considered, while a linear regression model can be considered for continuous outcomes. Selection of covariates follows the usual principle.
For a specified model, the goal is to estimate the regression coefficients. Let yk =(yk,1,...,yk,nk), k=1,...,K, be a sample of K independant outcomes with associated covariates (xki, zki, zzki), i=1,...mk, and let (a , b , g ) denote the estimated regression coefficients. The estimates satisfy the equations
![]()
where the summation is over all K independent observations, Wk is a weight matrix, m k=E(yk),
sk=(sk,12, sk,1sk,2 , ... , sk,1sk,nk , sk,22,.., sk,nk2), sk,i=(yki-m ki), s k = E(sk), and Dk is a derivative matrix of (m k,s k) with respect to (a , b , g ). Solving this set of estimating equations, one obtains an estimate of the regression coefficients.
In general, there is no explicit solution. The iterative Newton-Raphson algorithm may be used to solve this set of estimating equations. Starting from an initial value, (a 0, b 0, g 0), a new value, (a 1, b 1, g 1), is obtained via

where u is the function described above, I is its derivative with respect to the regression
coefficients, and u0 and I0 are values
of these functions evaluated at (a 0,
b 0, g
0). The above algorithm iterates until either |u0|
< e or |a1-a0|
+ |b1-b0| + |g1-g0|
< e , where e is some prespecified tolerance.
Back to Top
Inputs to the program are provided via a control file, which specifies the name of the file containing the data in the system file format. Other input options include specification of the mean, variance and correlation links to be used, tolerance values for convergence criteria for mean, variance and correlation parameters, and initial values for the parameters to be estimated. The output of the program is written to a file.
Specifications for the control file and input data are
detailed in the sections below. See the next chapter for examples
of how to perform specific regression analyses.
Back to Top
The input data suitable for analysis with EE includes a response (univariate or multivariate), a group identification variable if the response is multivariate, and covariates which possibly explain the variability of the response and the working variances. The response can be continuous or discrete. If any transformation of the response or covariates is required, it must be done beforehand and the resulting values included as part of the input data.
Only one outcome variable can be used per analysis. The input data file can include more variables than will be used in any one model. The same data file can thus be used to explore different models, with appropriate specification of variables to be used in different runs of the program. This program cannot process missing observations. If there are any missing values, they must be removed from the data set before converting it to the system file format.
The EE program can read data from a data file in the system file format, with the .eed suffix, as well as ASCII or SAS xport files. If desired, the practitioners can use asc2eed program to translate ASCII data to data in the system file format as described in above section. The ASCII data must be represented in the ASCII file with one observation per line and fields separated by spaces. An example of multivariate ASCII data file format is as follows:
Figure 1. Sample ASCII data format
groupid outcome(response) var1 var2 .... varp (covariates)
1 y11 x111 x112 .... x11p
1 y12 x121 x122 .... x12p
. . . . .
. . . .
. . . . . ......
1


....
2
2 y21 x211 x212 .... x21p
2 y22 x221 x222 ... x22p
. .
. .
. .
i yij xij1 xij2 ... xijp
where i = 1, 2, ..., K and j = 1, 2, ..., mi
Univariate ASCII data files are similar to multivariate ones,
but there is no groupid variable in the file. Nor is there a
group concept for response and covariate variables. Each row in
file is just considered as an independent observation.
Back to Top
4.2 MEAN, VARIANCE AND CORRELATION LINK FUNCTIONS
The mean, variance and correlation link functions together specify the model to be fit for the data. The mean link function specifies how the mean of the data depends on a linear combination of a set of predictor variables. The variance link function can specify the relationship between the working variances used in the estimating equations and a set of covariates. The correlation link function sets the functional form between the working correlation and a set of covariates.
To simplify the presentation, mean and variance link functions are described separately for the univariate and multivariate situations.
4.2.1.1 Univariate situation:
Let m i signify the mean for each observation i=1,...,K, and (xi1,...,xip) be the set of covariates on which ui depends. The functions available for the mean link functions are:
Linear link function: m i = a 0+a 1xi1 + ··· + a pxip
Exponential link function: m i= exp(a 0+a 1xi1 + ··· + a pxip)
Logistic link function: m i = 1/[1 + exp(a 0+a 1xi1 + ··· + a pxip)]
Note that the intercept term is included in all three link
functions. To fit models without an intercept term, there is an
option which can be included in the command file which indicates
that no intercept should be included in the linear predictor.
Back to Top
4.2.1.2 Multivariate situation:
Let m ij, i=1,...,K and j=1,...,mi, where K is the group number in the data file and mi is the observations in ith group, signify the jth mean element in the ith group, and (xij1, ..., xijp) be the set of covariates on which m ij depends. The functions available for the mean link functions are:
Linear link function: m ij = a j0+a j1xij1 + ··· + a jpxijp
Exponential link function: m ij= exp(a j0+a j1xij1 + ··· + a jpxijp)
Logistic link function: m ij = 1/[1 + exp(a j0+a j1xi1 + ··· + a jpxijp)]
As in the univariate case, the intercept is included in the
linear predictor. The option of fitting a model without an
intercept term is available through a command which can be
included in the control file. All or some of a
jk for j=1,...,mi may be set equal.
Back to Top
4.2.2.1 Univariate situation:
For each observation i=1,...,K, a variance s i is calculated and used in the estimation. The variances are calculated according to the specified variance function. Some variance functions allow for dependence on an additional specified set of covariates (zi1, 1/4 , ziq). The functions available for variance functions are:
Constant variance function: s i = s 0
Fixed variance function: s i = b 1zi1 + b 2zi2 + ··· + b qziq
| Binomial variance function: s | ![]() |
where mi is the binomial denominator
Poisson variance function: s i
= m i
Linear variance function: s i =l m i + b 0 + b 1zi1 + ··· +b qziq
Exponential variance function: s i = m il exp(b 0 + b 1zi1 + ··· + b qziq)
| Logistic variance functions: s i = | ![]() |
where
Linear
intra-correlation coefficient is g i
= b 0 + b
1zi1 + ··· + b
qziq
Hypergeometric intra-correlation coefficient is
![]()
When you specify fixed variance, you must supply the values to
be used as the b coefficients in
addition to the z covariates. In this special case, if you
wish to include an intercept you must explicitly include an
intercept variable of all ones in the list of covariates, along
with the desired value, as no intercept will be assumed in the
model unless specified. No parameters are estimated in this
variance link function. Instead, the specified coefficients
(indicated as the initial values in the control file) and
covariates are used to calculate the values used as variance in
the algorithm.
Back to Top
4.2.2.2 Multivariate situation:
For ith group of observations i=1,...,K, a set of variances, s ij, j=1,...,mi, where mi is the number of observations in ith group, are calculated and used in the estimation. The variances are calculated according to the specified variance function. Some variance functions allow for dependence on an additional specified set of covariates (zij1, 1/4 , zijq). The functions available for variance functions are:
Constant variance function: s ij = s 0
Fixed variance function: s ij = b j1zij1 + b j2zij2 + ··· + b jqzijq
| Binomial variance function:sij | ![]() |
where mij is the binomial denominator
Poisson variance function: s ij
= m ij
Linear variance function: s ij = l m ij + b j0 + b j1zij1 + ··· + b jqzijq
Exponential variance function: s ij = m il exp(b j0 + b j1zij1 + ··· + b jqzijq)
| Logistic variance function:sij = | ![]() |
where:
Linear intra-correlation coefficient is g
ij = b j0
+ b j1zij1 +
··· + b jqzijq
Hypergeometric intra-correlation coefficient is
![]()
Note that like the mean link functions, the variance link
functions that involve covariate terms all include an intercept
term, except for fixed variance link function, where a column of
ones must be included in the data if an intercept is desired.
Back to Top
4.2.3 Correlation link functions
If a correlation link function has been specified, it should be in a multivariate situation. Let i=1,...,K signify the ith group of observations. For each pair of observations in this group, j,k=1,...,mi, j¹ k, a correlation s ijk is calculated and used in the weight matrix of the EE algorithm. The correlation are calculated according to the specified correlation function, which can allow for dependence on additional specified sets of covariates(zzij1,...,zzijr) and (zzik1,...,zzikr). The functions available for the correlation functions are as follows:
Linear correlation function:
| s ijk = (d 0 + d 1|zzij1-zzik1| + ... + d r|zzijr-zzikr|) |
Exponential correlation function:
| s ijk = exp(d 0 + d 1|zzij1-zzik1| + ... + d r|zzijr-zzikr|) |
Hypergeometric correlation function:
| s ijk = |
4.3 SPECIFYING OPTIONS IN A CONTROL FILE
The EE program reads all input options from a control file. Inputs to the program include the name of the data file, variable names of covariates and response, and specifications of mean link function, variance link function, and correlation link function. Specify these and other input options in a control file using the format described below. Some arguments are required to run the program, while others need only be supplied if you wish to specify additional options. Figure 2 shows a sample control file. The control file must be a simple text file. If you use a word processor to create and edit your control file, be sure it is saved in ASCII format and not as a word processed document, as the EE program cannot interpret any additional formatting commands placed in a document by a word processing program.
Figure 2. Sample control file
Data filename : mvn2.eed
Output filename :
Identification variable of cluster : groupid
Outcome variable : y
Denominator variable :
Covariates in
Mean : x1 x2 x3 x4 x5
Variance : z1 z2 z3
Correlation :
Mean link : LIN
Variance link : EXP
Correlation link :
Initial values of parameters in
mean function :
variance function :
correlation function :
Fixed parameters about dependence of
variances on means :
Convergence criteria
mean parameters :
variance parameters :
correlation parameters :
Maximum number of iterations :
Joint :
Intercept
Back to Top
Data filename
Outcome variable
Identification variable of cluster
Covariates in Mean
Mean link
LIN
linear
link
EXP
exponential
link
LOGIT
logistic
link
Output filename
Identification variable of cluster
Denominator variable
Variance covariates
Correlation covariates
Variance function
CONST constant variance FIXED fixed variance BIN binomial variance POIS Poisson variance LIN linear variance function EXP exponential variance function LOGITLIN logistic variance function with linear intra-correlation coefficient LOGITHG logistic variance function with hypergeometric intra-correlation
If variance function is not specified, then the default variance function, constant variance, will be used.
Correlation function
Initial values
Starting values for the parameters to be estimated in the mean link, variance link, and correlation link functions. If not specified, the defaults are zeros.
In the univariate situation, this argument is also used to specify the coefficients for terms in fixed variance. This argument is required if the fixed variance link function is specified.
Dependence of variances on means
Value for parameter lambda in the linear and exponential variance functions, if a contribution of mean is desired for the working variances. Default is 0.
Convergence criteria
Tolerance value for convergence of the parameters in the mean link, variance link, and correlation link. Default values are 0.0001 for mean parameters and 0.001 for variance and correlation parameters.
Maximum number of iterations
Joint
0
plain univariate or multivariate working matrix
1
independence
working matrix
2 Gaussian
working matrix
3
extended
Gaussian working matrix
Default value is 0.
Intercept
The output of the program is written to a file. The output file includes information about the specification of the model, the parameter estimates with standard errors, and covariates for the estimates. Figure 3 shows a sample output file. In the tables for marginal means, marginal variances, and marginal correlations, both model-based and robust estimates of the standard errors of the parameter estimates are given. The SE is the model-based estimate, which is valid if the mean, variance and correlation are correctly specified. The robust SE is from the "sandwich" estimate for the asymptotic variance of the estimated parameters. The z-values are the ratios of the parameter estimates and their robust standard errors, and the p-values for these z-values are from the normal distribution. Under the null hypothesis, the EE parameter estimates follow an asymptotically normal distribution.
Figure 3. Sample output file
Data : mvn2.eed
Outcome variable : y
Covariates in Mean : intercept x1 x2 x3 x4 x5
Covariates in Var : z1 z2 z3
Mean link : Linear link function
Variance link : Exponential variance function
Correlation link : Hypergeometric correlation function
Dependence of Variances on Means : 0.000000
epsilon1 : 0.001000
epsilon2 : 0.010000
Number of observations : 500
Number of groups : 81
Maximum group size : 10
Minimum group size : 2
Number of Fisher Scoring Iterations : 6
=====================================================================
Marginal Means
=====================================================================
Estimate SE robust SE z-value p-value
intercept -2.06193 0.10062 0.06410 -32.16744 0.00000
x1 -0.92815 0.07406 0.04827 -19.22953 0.00000
x2 0.07829 0.07766 0.05674 1.37982 0.16764
x3 1.00097 0.07247 0.04886 20.48720 0.00000
x4 1.99449 0.06911 0.04271 46.69948 0.00000
x5 2.93977 0.07052 0.04744 61.97361 0.00000
Correlation of Coefficients
intercept x1 x2 x3 x4
x1 0.109781
x2 -0.078542 -0.013310
x3 0.096431 0.063342 0.073256
x4 -0.031132 -0.087433 -0.189933 0.031581
x5 0.064436 0.090774 0.193858 0.032531 -0.241995
Covariance of Coefficients
Intercept x1 x2 x3 x4 x5
Intercept 0.004109
x1 0.000340 0.002330
x2 -0.000286 -0.000036 0.003219
x3 0.000302 0.000149 0.000203 0.002387
x4 -0.000085 -0.000180 -0.000460 0.000066 0.001824
x5 0.000196 0.000208 0.000522 0.000075 -0.000490 0.002250
Marginal Variance
=====================================================================
Estimate SE robust SE z-value p-value
intercept 1.41014 0.01119 0.15355 9.18380 0.00000
z1 0.13608 0.00655 0.11333 1.20071 0.22986
z2 0.63862 0.00628 0.16495 3.87168 0.00011
z3 -0.46481 0.00574 0.13552 -3.42981 0.00060
Correlation of Coefficients
intercept z1 z2
z1 -0.098094
z2 -0.687131 0.267283
z3 0.175155 0.077508 0.310760
Covariance of Coefficients
intercept z1 z2 z3
intercept 0.023577
z1 -0.001707 0.012844
z2 -0.017403 0.004997 0.027208
z3 0.003645 0.001190 0.006947 0.018366
=====================================================================
Marginal Correlation ===================================================================== Estimate SE robust SE z-value p-value Intercept 0.38540 0.00993 0.19635 1.96281 0.04967 Back to Top
The most common statistical analysis performed on data with a binary outcome is logistic regression. It is also possible to perform a linear regression with binary data, for a different interpretation of the coefficients in the model. The EE approach using the logistic link function and binomial variance function is equivalent to standard logistic regression, while the linear link function together with the constant variance function gives standard linear regression.
First we use univariate estimating equations to illustrate logistic regression with the EE program. The data we used are from Meyer and White (1993) in a study on alcohol and nutrients in relation to colon cancer. The covariates include several dietary variables, and the outcome is a binary variable indicating whether or not a subject has colon cancer. A subset of the complete data is stored in an ASCII file called colon.dat whose five beginning and ending lines look like this:
status sex alcohol fiber kcal fat
status sex alcohol fiber kcal fat
1 0 10.93611 41.38175 3108.84 178.109
1 1 29.46 70.24778 4338.851 218.1827
1 0 4.131667 28.56525 2802.585 126.0111
1 0 2.451667 11.8891 1526.369 78.6589
1 0 0.130556 46.08092 2085.224 94.13641
. . . . . .
. . . . . .
. . . . . .
0 0 0.333333 21.88417 843.4617 33.04477
0 0 25.805 31.12233 2150.008 86.51898
0 1 10.60667 26.56195 2893.791 122.3937
0 1 35.33167 37.09152 2789.703 97.97728
0 1 79.62083 22.94439 2522.076 67.47237
The variables in this file include a variable for the status of colon cancer coded as 1 or 0, sex coded 1 for male and 0 for female, daily alcohol intake in grams (alcohol), daily total dietary fiber intake in grams (fiber), total energy in kilocalories (kcal), and dietary fat in grams (fat). We use the command
asc2eed colon.dat
to create a binary data file named colon.eed containing this data in the EE system file format. To perform a logistic regression on this data set, we create a control file named colon.eec that contains these specifications:
Data filename: colon.eed Outcome variable: status Mean covariates: intercept sex kcal fat fiber alcohol Mean link: LOGIT Variance link: BIN
Then the command
ee colon.eec
runs the EE program and produces an output file named colon.eeo shown below. The model-based and robust standard errors for the parameter estimates in the table for Marginal Means are very similar in this case because we use the binomial variance function which is specified by the mean. The model-based SEs are valid if the assumed model is correct, while the robust SEs do not require correct specification of the variance.
According to this analysis, total dietary fiber intake is the only significant contributor to the status of colon cancer from this set of variables, with a p-value of about .03. The negative
coefficient for this variable indicates that it has a
protective effect against colon cancer
Data : colon.eed
Outcome variable : status
Covariates in Mean : intercept sex kcal fat fiber alcohol
Mean link : Logistic mean link function
Variance link : Binomial variance link function
Dependence of Variances on Means : 0.000000
epsilon1 : 0.000100
epsilon2 : 0.001000
Number of observations : 838
Number of Fisher Scoring Iterations : 4
=====================================================================
Marginal Means
=====================================================================
Estimate SE robust SE z-value p-value
intercept -0.01795 0.20616 0.20459 -0.08773 0.93009
sex -0.06050 0.14816 0.15131 -0.39984 0.68927
kcal 0.00035 0.00033 0.00033 1.05891 0.28964
fat -0.00279 0.00581 0.00582 -0.47924 0.63177
fiber -0.01805 0.00814 0.00828 -2.17914 0.02932
alcohol 0.00067 0.00517 0.00546 0.12278 0.90228
Correlation of Coefficients
intercept sex kcal fat fiber
sex -0.186219
kcal -0.292919 -0.093585
fat 0.155901 0.027950 -0.929452
fiber -0.249781 0.080118 -0.464397 0.248175
alcohol 0.165314 -0.096348 -0.834207 0.754208 0.452118
Covariance of Coefficients
intercept sex kcal fat fiber alcohol
intercept 0.041856
sex -0.005764 0.022894
kcal -0.000020 -0.000005 0.000000
fat 0.000186 0.000025 -0.000002 0.000034
fiber -0.000423 0.000100 -0.000001 0.000012 0.000069
alcohol 0.000185 -0.000080 -0.000002 0.000024 0.000020 0.000030
Back to Top
In this section, we use multivariate estimating equations to perform linear regression with the EE program. Here the data are from a large case-control study conducted in the multiethnic population of Hawaii to compare the colorectal cancer risk of relatives (parents and siblings) of colorectal cancer cases with that of relatives of controls. The outcome variable is colorectal cancer (y) and the group identification (id). The covariates are cc (case 1, control 0), cjap (Japanese case relative 1, other case relative 0), ccau (Caucasians case relative 1, other case relative 0), rage (relative age = (actual age - 50)/10) and page (proband age = (actual age - 50)/10). The results show that relatives of cases have 1.5=exp(0.386) fold increased risk of colorectal cancer compared to relatives of controls. The control file colorect6.eec created to perform a logistic regression on the data contains following specifications:
Data filename : colorect6.eed Identification variable of cluster : id Outcome variable : y Covariates in Mean : intercept cc cjap ccau rage page Mean link : LOGIT Variance link : BIN Correlation link : LIN The result output file is: Data : colorect6.eed Outcome variable : y Covariates in Mean : intercept cc cjap ccau rage page Mean link : Logistic mean link function Variance link : Binomial variance link function Correlation link : Linear correlation link function Dependence of Variances on Means : 0.000000 epsilon1 : 0.000100 epsilon2 : 0.001000 Number of observations : 15523 Number of groups : 2377 Maximum groups size : 23 Minimum groups size : 1 Number of Fisher Scoring Iterations : 5 ===================================================================== Marginal Means ===================================================================== Estimate SE robust SE z-value p-value intercept -4.49777 0.16362 0.15160 -29.66927 0.00000 cc 0.38599 0.23601 0.23203 1.66354 0.09620 cjap 0.58089 0.22885 0.22399 2.59335 0.00950 ccau 0.70483 0.26262 0.26924 2.61783 0.00850 rage 0.23144 0.03896 0.03047 7.59598 0.00000 page -0.21644 0.06847 0.07100 -3.04846 0.00230 Correlation of Coefficients intercept cc cjap ccau rage cc -0.376105 cjap 0.035243 -0.754507 ccau -0.018145 -0.641089 0.669916 rage -0.173857 0.060252 -0.099542 -0.028132 page -0.492877 0.093073 -0.000321 -0.035539 -0.319607 Covariance of Coefficients intercept cc cjap ccau rage page intercept 0.022982 cc -0.013229 0.053837 cjap 0.001197 -0.039214 0.050172 ccau 0.000741 -0.040050 0.040401 0.072491 rage -0.000803 0.000426 -0.000679 -0.000231 0.000928 page -0.005305 0.001533 -0.000005 -0.000679 -0.000691 0.005041 ===================================================================== Marginal Correlation ===================================================================== Estimate SE robust SE z-value p-value intercept 0.04751 0.21422 0.01701 2.79275 0.00523 Back to Top
Count data that do not arise from a situation with a maximum count possible typically follow a Poisson distribution. To perform Poisson regression using EE, specify the exponential mean link function with Poisson variance link function. The exponential variance link function allows for a power family, with an additional parameter specifying the power of the contribution of the mean in this variance link function.
The data in the following example are from Laishes and Rolfe (1980). The number of GT-positive liver colonies detected in the ARL lobe of the livers of rats injected with diseased cells is used as the outcome variable. The covariates are the number of cells injected into the rat liver (cells), the number of the particular experiment (1, 2, or 3) (exper), and the section for each observation (the experiment was replicated within each specimen in two different sections) (section). The results of an analysis using EE for Poisson regression are shown below. The section does not appear to have a significant effect on the outcome, while the number of GT-positive colonies increases with increasing number of injected cells.
Data: liver.eed
Outcome variable: ARL
Covariates in Mean: cells exper section
Mean link: Exponential link function
Variance link: Poisson variance function
Dependence of Variances on Means: 1.000000
epsilon1 = 0.000100
epsilon2 = 0.001000
Number of observations: 52
Number of Fisher Scoring Iterations: 4
=====================================================================
Marginal Means
=====================================================================
Estimate SE robust SE z-value p-value
cells 0.25819 0.00587 0.01599 16.14432 0.00000
exper 0.43941 0.03110 0.06970 6.30395 0.00000
section 0.27012 0.07249 0.19752 1.36759 0.17144
Correlation of Coefficients
cells exper
exper -0.626994
section -0.132547 -0.492546
Covariance of Coefficients
cells exper section
cells 0.000256
exper -0.000699 0.004859
section -0.000419 -0.006781 0.039013
Back to Top
5.3 Continuous outcome
Linear regression is the most common model for data with a continuous outcome. The linear link function expresses the conditional expectation of the outcome on the covariates as a linear combination of the covariates. A linear mean link with constant variance corresponds to an ordinary least-squares regression.
The data in the following example are from a recent AIDS clinical trial (Volberding et al, 1990). The CD4 cell count, a measure of the immunological status of subjects, is used as the outcome. The covariates are race (1=black, 0=otherwise), hispanic origin (1=hispanic origin, 0=otherwise), sex (1=male, 0=female), age (which is transformed as (age-30)/10), homosexuality (1=yes, 0=no) and parental use of drug (1=yes, 0=no). The results show that CD4 cell counts are greater among males, non-Hispanics, and IV drug users than among females, Hispanics, and non-IV drug users, respectively.
______________________________________________________________________________
Data : aids2.eed Outcome variable : cd4 Covariates in Mean : intercept race sex hisp age homosex bisex pdrug Covariates in Var : race sex hisp age homosex bisex pdrug Covariates in Col : age Mean link : Linear mean link function Variance link : Linear variance link function Correlation link : Linear correlation link function Dependence of Variances on Means : 0.000000 epsilon1 : 0.000100 epsilon2 : 0.001000 Number of observations : 767 Number of groups : 296 Maximum group size : 4 Minimum group size : 1 Number of Fisher Scoring Iterations : 37 ===================================================================== Marginal Means ===================================================================== Estimate SE robust SE z-value p-value intercept 5.88564 0.12397 0.13462 43.72112 0.00000 race 0.17587 0.09054 0.09204 1.91086 0.05602 sex -0.04589 0.14639 0.15351 -0.29893 0.76499 hisp -0.01353 0.10511 0.10480 -0.12915 0.89724 age -0.04280 0.03205 0.02991 -1.43084 0.15248 homosex 0.00758 0.10245 0.10103 0.07501 0.94021 bisex -0.04302 0.10407 0.09659 -0.44543 0.65601 pdrug -0.08556 0.10587 0.13998 -0.61127 0.54102 Correlation of Coefficients intercept race sex hisp age homosex bisex race -0.220549 sex -0.455414 0.178294 hisp -0.509924 -0.126538 -0.272034 age -0.200106 0.181740 0.061813 -0.071307 homosex -0.100311 0.079924 -0.650060 0.148885 0.102126 bisex -0.040682 0.069350 -0.506308 0.047481 0.036053 0.708477 pdrug -0.396592 -0.050452 0.412659 -0.111312 0.148884 -0.031537 -0.013277 Covariance of Coefficients intercept race sex hisp age homosex bisex pdrug intercept 0.018122 race -0.002733 0.008471 sex -0.009411 0.002519 0.023566 hisp -0.007194 -0.001220 -0.004376 0.010982 age -0.000806 0.000500 0.000284 -0.000224 0.000895 homosex -0.001364 0.000743 -0.010082 0.001576 0.000309 0.010207 bisex -0.000529 0.000616 -0.007507 0.000481 0.000104 0.006913 0.009329 pdrug -0.007473 -0.000650 0.008867 -0.001633 0.000623 -0.000446 -0.000180 0.019593 ===================================================================== Marginal Variance ===================================================================== Estimate SE robust SE z-value p-value intercept 0.25322 0.05248 0.10852 2.33343 0.01963 race 0.00095 0.03626 0.06447 0.01470 0.98827 sex 0.08854 0.05933 0.10059 0.88018 0.37876 hisp -0.02680 0.04628 0.09169 -0.29225 0.77010 age -0.00194 0.01292 0.02342 -0.08297 0.93388 homosex -0.05531 0.03969 0.05493 -1.00700 0.31393 bisex -0.12025 0.03897 0.05281 -2.27689 0.02279 pdrug 0.01889 0.04730 0.10134 0.18640 0.85213 Correlation of Coefficients intercept race sex hisp age homosec bisex race -0.251102 sex -0.509835 0.383899 hisp -0.633539 -0.160563 -0.176536 age -0.098871 0.339738 0.146322 -0.167215 homosex -0.011417 -0.134563 -0.549546 0.108393 -0.111768 bisex 0.016829 0.101839 -0.366664 -0.075173 -0.058066 0.572637 pdrug -0.372748 0.077465 0.488143 -0.108004 0.168403 -0.059314 -0.017718 Covariance of Coefficients intercept race sex hisp age homosex bisex pdrug intercept 0.011776 race -0.001757 0.004157 sex -0.005565 0.002490 0.010119 hisp -0.006304 -0.000949 -0.001628 0.008407 age -0.000251 0.000513 0.000345 -0.000359 0.000548 homosex -0.000068 -0.000477 -0.003036 0.000546 -0.000144 0.003017 bisex 0.000096 0.000347 -0.001948 -0.000364 -0.000072 0.001661 0.002789 pdrug -0.004099 0.000506 0.004976 -0.001004 0.000400 -0.000330 -0.000095 0.010271 ===================================================================== Marginal Correlation ===================================================================== Estimate SE robust SE z-value p-value intercept 0.54310 0.04745 0.04625 11.74226 0.00000 age -1.10456 1.00455 0.09331 -11.83749 0.00000 Correlation of Coefficients intercept age -0.998388 Covariance of Coefficients intercept age intercept 0.002139 age -0.004309 0.008707 Back to Top
6.1 Introduction
The EE object code can be linked to Splus creating a new executable version capable of ee based statistical modeling. The necessary Splus functions which facilitates using ee under the created Splus executable are available. These functions are modifications of, and are used similarly as, those developed by V. Carey and A. Mcdermott of Chaining Laboratory, Harvard Medical School for their Splus implementation of GEE. The primary function, named ee, is used to pass appropriate parameters and invoke the EE module then creates an ee-object with a class attribute glm based on the parameters passed and the results returned by the estimation procedure. The ee function is defined as follows:
| ee <- function | (formula = formula(data), mean.function = NA, variance.function = NA, corr.function = NA,id = NA, deno = NA, data = sys.parent(), subset, na.action, mean.coef = NA, variance.coef = NA, corr.coef = NA, lambda= 0, tol = 0.001, maxiter = as.integer(25), mean.link = "LIN", variance.link = "CONST", corr.link = NA,joint = F, silent = T) {...} |
where
| formula | is an Splus expression of the form response~predictors. Response is the response variable and predictors include all the other variables to be used in the model. |
| mean.function | is an Splus expression of the form ~predictors. Predictors are the variables included in the left-hand side of the formula expression above for the mean link function. This parameter can be omitted if the predictors specified in the formula expression will all be used for the mean link function. |
| variance.function | is an Splus expression of the form ~predictors. Predictors are the variables included in the left-hand side of the formula expression above for the mean link function. This parameter can be omitted if the predictors specified in the formula expression will all be used for the mean link function. |
| corr.function | is an Splus expression of the form ~predictors. Here predictors include the variables for the correlation link function. |
| id | specifies the variable to identify clusters. |
| deno | is the denominator variable. |
| data | is the source data frame for the formula expression above and for the id and deno variables. |
| subset | expression specifying a subset of the rows of the data to be used in the fit. |
| na.action | is a function to filter missing data. |
| mean.coeff | is a vector of initial values for the mean link coefficients. |
| variance.coeff | is a vector of initial values for the variance link coefficients. |
| corr.coeff | is a vector of initial values for the correlation link coefficients. |
| tol | is the convergence criteria. |
| maxiter | is the maximum number of iterations allowed for the fitting algorithm. |
| mean.link | is the mean link function type. |
| variance.link | is the variance link function type. |
| corr.link | is the correlation link function type. |
| joint | is a flag set to True if joint estimation is desired |
| silent | is a flag set to False to get a display of the estimated coefficients at each iteration. |
The other functions are print.ee and print.summary.ee which
are used for displaying respectively all the details or a summary
including the model specified, the estimated coefficients and
variance/covariance matrix of an ee-object. A help file is also
available which can be installed as an on-line help in Splus
describing the parameters of the ee function.
Back to Top
6.2 Instalation and use of the ee function in Splus
After downloading all the files, the following commands will install the macro and the associated Splus help file.
Splus
QINSTALL .Data *.q
Splus
HINSTALL .Data/.Help *.d
If these commands are not issued from the directory containing the downloaded .q and .d files, the path to these files must be specified. Or if these commands are not issued from the directory containing your .Data directory, a path to that directory must be specified. It may be necessary to create the .Help directory within your .Data directory.
To invoke Splus, issue the command
Splus EXEC local.Sqpe
If this command is not issued from the directory containing the downloaded local.Sqpe file the path to this file must be specifed.
Once the above commands have been executed, the Splus command
help(ee)
will supply a help file on the ee macro, with a description of
the required and optional arguements for the ee macro.
Back to Top
This program was written by programmers and researchers in the Quantitative Genetic Epidemiology Group. We appreciate your feedback and will be happy to assist with any questions you may have regarding the program.
The EE program is designed using object-oriented methodology and is written in the C++ programming language. The EE program is originally designed for solving univariate estimating equation problems, then has been extended for solving both univariate and multivariate estimating equation problems. The basic classes are EEModel, Accumulator, Linkfunction and SysFile, where EEModel dealing with estimating equation computing models; Accumulator controlling regression loops; Linkfunction choosing different mean and variance link functions and SysFile providing system format data file as input data. There are some subclasses such as MultiEEModel, MultiAccumulator and MultiLinkfunction etc., derived from the base classes of EEModel, Accumulator and Linkfunction respectively. The EE program also has some utility components to process data file format translation; regression results output, and user select options. Because of its object-oriented style, the EE program is quite flexible and can be easily extended to include additional link, variance and correlation functions.
The EE program is developed under AIX (Unix) environment on
IBM RS6000 workstation. It can be run under DOS or MS window on
PC as well.
Back to Top
7.2 Methodological development of estimating equation
The development of estimating equation can be traced back to the early sixties, and later in the eighties, when Godambe and his coworkers [1960], Huber [1967, 1981] and White [1982] independently proposed the concept of an estimating function (or likelihood estimation of misspecified model) and studied its theoretical properties. These research activities remained largely theoretical, and the relevant literature was not well known to the majority of statisticians until the late eighties. In 1986, Liang and Zeger, without being aware of the earlier works in estimating functions, independently proposed estimating equation for analyzing longitudinal data (especially discrete data) with minimum statistical assumptions [Liang and Zeger, 1986; Zeger and Liang, 1986]. Their work may be thought as a multivariate extension of the generalized linear model. Following this research, other researchers including Prentice, Zhao, and Self [Prentice, 1988; Zhao and Prentice, 1990; Zhao and Prentice, 1991; Zhao, Prentice and Self, 1992; Prentice and Zhao, 1991] studied the properties of estimating equation, established the connection between the estimating equation and maximum likelihood frameworks, and extended the estimating equation framework from marginal means to include higher-order moments. They have used the estimating equation framework to establish a number of methodologies for biomedical problems arising from public health research, especially cancer research.
Let y denote an outcome, and let x=(x1,...,xp) denote a vector of p covariates. Regression of y on x assesses the dependence of the outcome y on the covariate vector x via
,
where a =(a
1,...,a p) is a
vector of p regression coefficients, ¢
is the transpose operator, and g(.) is a specified
function. The specification g(a) = a corresponds to linear
regression, g(a) = 1/[1+exp(-a)] to logistic regression, g(a) =
exp(a) to exponential regression, etc. The main objective of
regression analysis is to estimate the regression coefficients a , and to make relevant statistical
inference. For example, if a 1=0,
it implies that the covariate x1 is not associated
with the mean of the outcome y, since any change in x1
does not alter the mean of y.![]()
In classical statistics, a distribution assumption is often made regarding the random outcome y given the covariate vector x, i.e. f(yi|a , xi). Consider a sample of N random observations, (yi,xi), i=1,...,N. Under the assumption f(yi|a , xi), the likelihood function is obtainable, and is written as the product over all N random observations of the density functions f. Maximizing this likelihood with respect to a results in an estimate of the regression coefficients (if such an estimate exists) which is consistent to the true underlying coefficient and has an asymptotic normal distribution.
The asymptotic normal distribution can then be used for making
statistical inference about the regression coefficient. Also
according to the maximum likelihood theory, the estimate is fully
efficient under the assumed distribution.![]()
![]()
During the early seventies, Nelder and Wedderburn [Nelder and Wedderburn, 1972; Wedderburn, 1974] recognized that all of the commonly used regression methods rely on distributions that are members of the exponential family, and hence their likelihood functions are proportional to an exponential function
![]()
where the canonical parameter q i is a function of m i and c(yi) is a shape function. Consequently, the score estimating equations, which are the first derivative of the log-likelihood function with respect to the parameters of interest, have a unified form, namely,
![]()
where the summation is over all N independent samples, Di is a derivative matrix of m i with respect to a , and Vi is the variance of yi. The estimate of a is a solution to the equations above. Interestingly, the above estimating equations have the same form regardless of the type of outcome. This uniform expression for the score estimating equations forms the basis of the generalized linear model [McCullagh and Nelder, 1991].
A usual concern with any maximum likelihood estimation is its validity under a violation of the assumed distribution. A possible consequence of violating the distributional assumption is that the estimates of the parameters or their variances may be biased, or may not have an asymptotic normal distribution at all. The concern with misspecified distribution assumption motivates many statisticians to establish methods that are distribution-free, i.e., robust statistical methods. By careful observation of the score estimating equations, Wedderburn [1974] discovered that the distribution assumption may be relaxed, and may be replaced by a weaker assumption of the first two moments, namely, the marginal mean, m i, and the marginal variance, Vi, since they fully specify the score estimating equations above and their correct specification ensures a valid estimate of the regression coefficients and statistical inference about them. In fact, under the assumption of these two moments the estimate is fully efficient. This discovery leads to the development of the quasi-likelihood function, which is the result of integrating the assumed "score estimating equations" with respect to the regression coefficient. The estimate from quasi-likelihood is no longer a likelihood estimate, since the quasi-likelihood may not be a likelihood function at all. Nevertheless, the quasi-likelihood estimate possesses all of the properties of likelihood estimate.
In the quasi-likelihood regression method above, both the
marginal mean and the marginal variance are assumed, although the
marginal variances or their parameters are often not of interest
in most regression problems. Unfortunately, any misspecification
of the variance function, although it does not bias the estimate
of the regression coefficients, may yield an incorrect estimate
of the asymptotic variance and hence may invalidate the
statistical inference. Through further examination of the score
estimating equations above, Liang and Zeger [Liang and Zeger,
1986; Zeger and Liang, 1986], Prentice and Zhao [Prentice, 1988;
Zhao and Prentice, 1990; Zhao and Prentice, 1991; Prentice and
Zhao, 1991] and others [Lipsitz et al, 1991; Zeger et al, 1988;
Zhao et al, 1992; Liang et al, 1992] have found in the context of
multivariate problems that the relevant estimate, even with
incorrectly specified variances, still has an asymptotic normal
distribution with an easily estimable variance. In fact, the
"score estimating equations" above, with the variance
function replaced by an arbitrary weight, leads to a set of
estimating equations. Provided that the marginal mean m i is correctly specified, the
estimating equations yield a consistent estimate of the
regression coefficients, and the estimate has an asymptotic
normal distribution. As the ratios of the weights versus the
actual variances approach a constant, the estimate from
estimating equations is fully efficient [Zhao et al, 1992]. It
should be noted that estimating equations are also referred to as
GEE, which is an acronym for generalized estimating equations.
Back to Top
Please use the following reference when reporting results based on EE:
Q.G.E.: "EE: Estimating Equations", Fred Hutchinson Cancer Research Center,
Quantitative Genetic Epidemiology, Technical report 126.
Back to Top
Godambe V.P. (1960). An optimum property of regular maximum likelihood estimation. Annals of Mathematical Statistics 31: 1208-1212.
Huber P.J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceeding of the Fifth Berkeley Symposium in Mathematical Institute and Probability. Berkeley: UC Press.
Huber P.J. (1981). Robust Statistics. New York: John Wiley & Sons.
Karim M.R., Zeger S.L. (1988). GEE: A SAS Macro for Longitudinal Analysis. Technical Report #674. Dept. of Biostatistics, John Hopkins University.
Laishes B.A., Rolfe P.B. (1980). Quantitative Assessment of Liver Colony Formation and Hepatocellular Carcinoma Incidence in Rats Receiving Intravenous Injections of Isogeneic Liver Cells Isolated during Hepatocarcinogenesis. Cancer Research 40: 4133-4143.
Liang K.Y., Zeger S.L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73: 13-22.
Liang K.Y., Zeger S.L., Qaqish B. (1992). Multivariate regression analysis for categorical data. Journal of the Royal Statistical Society, Series B 54: 3-40.
Lipsitz S.R., Laird N.M., Harrington D.P. (1991). Generalized estimating equations for correlated binary data: Using the odds ratio as a measure of association. Biometrika 78: 153-160.
McCullagh P., Nelder J.A. (1989). Generalized Linear Models, second edition. London: Chapman and Hall.
Meyer F., White E. (1993). Alcohol and Nutrients in Relation to Colon Cancer in Middle-aged Adults. American Journal of Epidemiology 138: 225-236.
Prentice R.L. (1988). Correlated binary regression with covariates specific to each binary observation. Biometrics 44: 1033- 1048.
Prentice R.L., Zhao L.P. (1991). Estimating equations for parameters in means and covariances of multivariate discrete and continuous responses. Biometrics 47: 825-839.
Wedderburn R.W.M. (1974). Quasilikelihood functions, generalized linear models and the Gauss-Newton method. Biometrika 61:439-447.
White H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 50: 1-25.
Zeger S.L., Liang K.Y. (1986). Longitudinal data analysis for discrete and continuous outcomes. Biometrics 42: 121-130.
Zeger S.L., Liang K.Y., Albert P. (1988). Models for longitudinal data: A generalized estimating equation approach. Biometrics 44: 1049-1060.
Zhao L.P., Grove J., Quiaoit F. (1992). A method for assessing patterns of familial resemblance in complex human pedigrees with an application to the nevus count data in 28 Utah kindreds. American Journal of Human Genetics 51: 178-190.
Zhao L.P., Prentice R.L. (1990). Correlated binary regression using a quadratic exponential model. Biometrika 77: 642-648.
Zhao L.P., Prentice R.L. (1991). Use of a quadratic exponential model to generate estimating equations for means, variances, and covariances. In Estimating Functions, Godambe V.P. (ed). Oxford: Oxford Press.
Zhao L.P., Prentice R.L., Self S.G. (1992). Multivariate mean
parameter estimation using a partly exponential model. Journal of
the Royal Statistical Society, Series B 54: 805-811.
Back to Top