proc glm vs proc reg

October 21, 2018

1.0 Introduction to Regression Procedures in SAS

Statistical Procedure	Functions
REG	performs linear regression with many diagnostic capabilities, selects models using one of nine methods, produces scatter plots of raw data and statistics, highlights scatter plots to identify particular observations, and allows interactive changes in both the regression model and the data used to fit the model.
CATMOD	analyzes data that can be represented by a contingency table.
GENMOD	fits generalized linear models.
GLM	uses the method of least squares to fit general linear models.
LOGISTIC	fits logistic models for binomial and ordinal outcomes.
NLIN	builds nonlinear regression models.
PROBIT	performs probit regression as well as logistic regression and ordinal logistic regression.
LIFEREG	fits parametric models to failure-time data that may be right censored.
ROBUSTREG	performs robust regression using Huber M estimation and high breakdown value estimation.

In this chapter we introduce the GLM and REG procedures. The REG procedure provides the most general analysis capabilities; the other procedures give more specialized analyses.

2.0 General Linear Model

The GLM procedure (general linear model) uses the method of least squares to fit general linear models relating one or several continuous dependent variables to one or several independent variables.

Strengths:

direct specification of polynomial effects
ease of specifying categorical effects (PROC GLM automatically generates dummy variables for class variables)

Weaknesses:

No collinearity diagnostics
No influence diagnostics
No scatter plots
Only one model at a time

Most of the statistics based on predicted and residual values that are available in PROC REG are also available in PROC GLM. However, PROC GLM does not produce collinearity diagnostics, influence diagnostics, or scatter plots. In addition, PROC GLM allows only one model and fits the full model.

The general form of PROC GLM can be found in Introduction to ANOVA.

Demonstrations and explanations:

We use the SAS data set drugtest as an example. It has three variables, Drug, PreTreatment and PostTreatment, which hold the drug type and the pre- and post-treatment measures. Here are the first ten observations:

Obs    Drug    PreTreatment    PostTreatment
  1    A                 11                6
  2    A                  8                0
  3    A                  5                2
  4    A                 14                8
  5    A                 19               11
  6    A                  6                4
  7    A                 10               13
  8    A                  6                1
  9    A                 11                8
 10    A                  3                0

The following code fits a general linear model that predicts the post-treatment outcome from the drug type and the pre-treatment measure. It also writes the predicted values and residuals to a new SAS data set.

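ods html;
ods graphics on;

PROC GLM data=mylib.drugtest;
      class Drug;
      /* SOLUTION prints the parameter estimates */
      model PostTreatment = Drug PreTreatment / solution;
      /* save predicted values and residuals in data set drugest */
      output out=drugest p=drugpred r=resid;
RUN;
QUIT;

ods graphics off;
ods html close;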

The option SOLUTION produces parameter estimates.

Here is the main output.

---------------------------------------------------------------------------------------------------

The GLM Procedure

Dependent Variable: PostTreatment

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        871.497403     290.499134      18.10    <.0001
Error              26        417.202597      16.046254
Corrected Total    29       1288.700000

R-Square    Coeff Var    Root MSE    PostTreatment Mean
0.676261     50.70604    4.005778              7.900000

Source          DF      Type I SS    Mean Square    F Value    Pr > F
Drug             2    293.6000000    146.8000000       9.15    0.0010
PreTreatment     1    577.8974030    577.8974030      36.01    <.0001

Source          DF    Type III SS    Mean Square    F Value    Pr > F
Drug             2     68.5537106     34.2768553       2.14    0.1384
PreTreatment     1    577.8974030    577.8974030      36.01    <.0001

---------------------------------------------------------------------------------------------------

Let's review this output a bit more carefully.

First, the overall F test is statistically significant (F = 18.10, p < .0001), which means that the model as a whole is statistically significant. The R-square of 0.676 means that approximately 67.6% of the variance of PostTreatment is accounted for by the model.
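As a quick check of how these summary figures follow from the ANOVA table, the sketch below recomputes R-square, the overall F value and its p-value from the printed sums of squares and degrees of freedom. The variable names are chosen for this illustration only, and the results should agree with the listing up to rounding.

```sas
/* Sketch: recompute the summary statistics from the printed ANOVA table. */
data _null_;
   ss_model = 871.497403;                        /* Model sum of squares, 3 DF  */
   ss_error = 417.202597;                        /* Error sum of squares, 26 DF */
   ss_total = ss_model + ss_error;               /* Corrected Total             */
   rsq      = ss_model / ss_total;               /* R-square                    */
   f_value  = (ss_model / 3) / (ss_error / 26);  /* mean square ratio           */
   p_value  = 1 - probf(f_value, 3, 26);         /* upper tail of F(3, 26)      */
   put rsq= f_value= p_value=;                   /* written to the SAS log      */
run;
```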

Second, the Type I SS for Drug (293.6) gives the between-drug sum of squares obtained for the analysis-of-variance model PostTreatment=Drug. This measures the difference between the arithmetic means of the post-treatment scores for the different drugs, disregarding the covariate. The Type III SS for Drug (68.5537) gives the Drug sum of squares adjusted for the covariate. This measures the differences between the Drug LS-means, controlling for the covariate PreTreatment.
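The adjusted means behind the Type III test can be requested directly. The following is a minimal sketch, assuming the same mylib.drugtest data set; the LSMEANS statement and its STDERR and PDIFF options are standard PROC GLM syntax, but they are an addition to the original example.

```sas
proc glm data=mylib.drugtest;
   class Drug;
   model PostTreatment = Drug PreTreatment / solution;
   /* covariate-adjusted Drug means, their standard errors,
      and p-values for all pairwise differences */
   lsmeans Drug / stderr pdiff;
run;
quit;
```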

---------------------------------------------------------------------------------------------------

Parameter           Estimate        Standard Error    t Value    Pr > |t|
Intercept       -0.434671164 B          2.47135356      -0.18      0.8617
Drug A          -3.446138280 B          1.88678065      -1.83      0.0793
Drug D          -3.337166948 B          1.85386642      -1.80      0.0835
Drug F           0.000000000 B           .                .         .
PreTreatment     0.987183811            0.16449757       6.00      <.0001

NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to
      solve the normal equations. Terms whose estimates are followed by the letter 'B'
      are not uniquely estimable.

---------------------------------------------------------------------------------------------------

Next we look at each independent variable. The t value for PreTreatment is 6.00 and is statistically significant, meaning that the regression coefficient for PreTreatment is significantly different from zero. The coefficient for PreTreatment is 0.987, or approximately 1, meaning that for a one-unit increase in PreTreatment we would expect about a one-unit increase in PostTreatment within the same drug group. The intercept is -0.43467; with this coding it is the predicted PostTreatment value for the reference drug F when PreTreatment equals zero. In most cases, the constant is not very interesting.

For the class variable Drug, the effects are not significant at the 0.05 level. The estimate for Drug F is zero because F is the reference group in the model; in PROC GLM, the last level of a class variable is the reference group by default. The negative estimates for drugs A and D indicate that, at the same PreTreatment value, the predicted PostTreatment scores for drugs A and D are lower than for drug F.
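PROC GLM builds the Drug dummy variables internally. As a rough sketch of what that amounts to, the step below creates two 0/1 indicators by hand and fits the same model in PROC REG, which has no CLASS statement. The names drugdummy, DrugA and DrugD are made up for this illustration, and the code assumes that Drug is stored as the character values 'A', 'D' and 'F'.

```sas
/* Hand-coded indicators with drug F as the omitted (reference) level. */
data drugdummy;                 /* hypothetical data set name */
   set mylib.drugtest;
   DrugA = (Drug = 'A');        /* 1 for drug A, 0 otherwise */
   DrugD = (Drug = 'D');        /* 1 for drug D, 0 otherwise */
run;

proc reg data=drugdummy;
   model PostTreatment = DrugA DrugD PreTreatment;
run;
quit;
```

Because F is the omitted level in both codings, the DrugA, DrugD and PreTreatment estimates should match the Drug A, Drug D and PreTreatment rows of the PROC GLM solution.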

---------------------------------------------------------------------------------------------------

ODS Graphics can generate an analysis-of-covariance plot in PROC GLM. The plot makes it clear that the control (drug F) has higher post-treatment scores across the range of pre-treatment scores, while the fitted lines for the two antibiotics (drugs A and D) nearly coincide.

Because we saved the predicted values and residuals in the new SAS data set drugest, we can draw a residual diagnostic plot with PROC GPLOT.

PROC GPLOT data=drugest;
      /* PLOT takes vertical*horizontal: residuals against predicted values */
      plot resid*drugpred;
RUN;
QUIT;
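A residual plot of this kind can also be sketched with PROC SGPLOT, where the axis roles are named explicitly. This assumes the drugest data set created by the OUTPUT statement above and a SAS release that includes the SG procedures; the zero reference line is an addition, not part of the original example.

```sas
proc sgplot data=drugest;
   scatter x=drugpred y=resid;   /* residuals on the vertical axis */
   refline 0 / axis=y;           /* horizontal line at residual = 0 */
run;
```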

3.0 Regression with PROC REG

The REG procedure provides the most general analysis capabilities:

handles multiple regression models
provides nine model-selection methods
allows interactive changes both in the model and in the data used to fit the model
allows linear equality restrictions on parameters
tests linear hypotheses and multivariate hypotheses
produces collinearity diagnostics, influence diagnostics, and partial regression leverage plots
saves estimates, predicted values, residuals, confidence limits, and other diagnostic statistics in output SAS data sets
generates plots of data and of various statistics

The general form of a PROC REG step is:

PROC REG DATA=SAS-dataset;
      MODEL dependent-variable = predictors /
            selection=method R CLI CLM ;
       PLOT r.*p. ;
RUN ;

QUIT;

MODEL	specifies the dependent/independent variables in the model.
SELECTION	specifies the model-selection method: forward, backward, etc.
R	requests a residual analysis to be performed.
CLI	requests confidence limits for an individual predicted value.
CLM	displays confidence limits for the expected value of the dependent variable for each observation.
PLOT r.*p.	plots the residuals against the predicted values.

Demonstrations and explanations:

We use the SAS data set insurance as an example. Here is the DATA step that creates it:

DATA mylib.insurance;
      input time size type @@;
      sizetype=size*type;   /* interaction term */
      datalines;
   17 151 0   26 92 0   21 175 0   30 31 0   22 104 0
    0 277 0   12 210 0   19 120 0    4 290 0   16 238 0
   28 164 1   15 272 1   11 295 1   38 68 1   31 85 1
   21 224 1   20 166 1   13 305 1   30 124 1   14 246 1
;
RUN;

There are four variables: time, size, type and the interaction term sizetype. We are going to fit a linear model that describes the relationship between the outcome time and the independent variables size and type. We also take account of a possible interaction between size and type.

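PROC REG data=mylib.insurance;
      /* Model 1: main effects plus the interaction term */
      model time = size type sizetype / selection=none;
RUN;
      /* Model 1.1: drop the interaction term and print the refitted model */
      delete sizetype;
      print;
RUN;
      /* residuals and observed values against predicted values */
      plot r.*p. time*p.;
      output out=insurancepre p=fit r=resid;
RUN;
QUIT;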

The DELETE statement deletes the specified term from the current model. The PRINT statement prints the results for the updated model.

Here is the main output:

---------------------------------------------------------------------------------------------------

Model: MODEL1
Dependent Variable: time

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        1504.41904      501.47301      45.49    <.0001
Error              16         176.38096       11.02381
Corrected Total    19        1680.80000

Root MSE           3.32021    R-Square    0.8951
Dependent Mean    19.40000    Adj R-Sq    0.8754
Coeff Var         17.11450

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              33.83837           2.44065      13.86      <.0001
size          1              -0.10153           0.01305      -7.78      <.0001
type          1               8.13125           3.65405       2.23      0.0408
sizetype      1           -0.00041714           0.01833      -0.02      0.9821

---------------------------------------------------------------------------------------------------

1) The output above is for Model 1, which includes the interaction term. The model F test indicates that the model is significant. The R-Square value says that 89.51% of the variance of the outcome is explained by the model. In the parameter estimates, both main effects are significant, but the interaction term is not (p = 0.9821).

---------------------------------------------------------------------------------------------------

The REG Procedure
Model: MODEL1.1
Dependent Variable: time

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        1504.41333      752.20667      72.50    <.0001
Error              17         176.38667       10.37569
Corrected Total    19        1680.80000

Root MSE           3.22113    R-Square    0.8951
Dependent Mean    19.40000    Adj R-Sq    0.8827
Coeff Var         16.60377

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              33.87407           1.81386      18.68      <.0001
size          1              -0.10174           0.00889     -11.44      <.0001
type          1               8.05547           1.45911       5.52      <.0001

---------------------------------------------------------------------------------------------------

2) Model 1.1 excludes the interaction effect. The model and all of its parameters are significant.

---------------------------------------------------------------------------------------------------

3) The PLOT statement draws the residuals and the observed values of time against the predicted values (the plots are not reproduced here).

---------------------------------------------------------------------------------------------------

4) We can also use ODS Graphics to produce more diagnostic plots.

ods html;
ods graphics on;

PROC REG data=mylib.insurance;
      model time = size type / selection=none;
RUN;
QUIT;

ods graphics off;
ods html close;
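In the two listings above, the standard error of type falls from 3.65405 to 1.45911 once sizetype is deleted. One possible reason is collinearity between type and the product term, which PROC REG can report. The sketch below is an addition to the original example and uses the documented MODEL options VIF, COLLIN and INFLUENCE on the same data set.

```sas
proc reg data=mylib.insurance;
   /* VIF and COLLIN report collinearity among the regressors;
      INFLUENCE lists per-observation influence statistics */
   model time = size type sizetype / vif collin influence;
run;
quit;
```

PROC GLM offers no counterpart to these MODEL options, which is the main practical reason to prefer PROC REG for diagnostic work.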

4.0 Polynomial Regression Using PROC REG

Demonstrations and explanations:

The example SAS data set USPopulation has three variables: Population, Year and YearSq.

DATA mylib.USPopulation;
      input Population @@;
      retain Year 1780;                /* the first census year becomes 1790 */
      Year=Year+10;
      YearSq=Year*Year;
      Population=Population/1000;      /* thousands to millions */
      datalines;
3929 5308 7239 9638 12866 17069 23191 31443 39818 50155
62947 75994 91972 105710 122775 131669 151325 179323 203211
226542 248710 281422
;
RUN;

Here are the first nine observations:

Obs    Population    Year     YearSq
  1         3.929    1790    3204100
  2         5.308    1800    3240000
  3         7.239    1810    3276100
  4         9.638    1820    3312400
  5        12.866    1830    3348900
  6        17.069    1840    3385600
  7        23.191    1850    3422500
  8        31.443    1860    3459600
  9        39.818    1870    3496900

We first fit a simple linear model of Population on Year, then add the quadratic term YearSq.

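PROC REG data=mylib.USPopulation;
      var YearSq;
      /* Model 1: linear in Year */
      model Population=Year / selection=none;
      plot r.*p.;
RUN;
      /* Model 1.1: add the quadratic term and redraw the residual plot */
      add YearSq;
      print;
      plot / cframe=ligr;
RUN;
      /* observed values, predicted values and 95% limits against Year */
      plot (Population predicted. u95. l95.)*Year
            / overlay cframe=ligr;
RUN;
QUIT;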

Any variable that you might add to the model but that is not included in the first MODEL statement must appear in the VAR statement.

The PLOT statement with no variables recreates the most recent plot requested, here the plot of residuals against predicted values, for the updated model. The final PLOT statement draws the observed values, the predicted values, and the upper and lower 95% limits (U95. and L95.) against Year; OVERLAY puts them all on the same plot, and CFRAME= sets the color of the plot frame.

Here is the main output:

---------------------------------------------------------------------------------------------------

The REG Procedure
Model: MODEL1
Dependent Variable: Population

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1            146869         146869     228.92    <.0001
Error              20             12832      641.58160
Corrected Total    21            159700

Root MSE          25.32946    R-Square    0.9197
Dependent Mean    94.64800    Adj R-Sq    0.9156
Coeff Var         26.76175

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1           -2345.85498         161.39279     -14.54      <.0001
Year          1               1.28786           0.08512      15.13      <.0001

The REG Procedure
Model: MODEL1.1
Dependent Variable: Population

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2            159529          79765    8864.19    <.0001
Error              19         170.97193        8.99852
Corrected Total    21            159700

Root MSE           2.99975    R-Square    0.9989
Dependent Mean    94.64800    Adj R-Sq    0.9988
Coeff Var          3.16938

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1                 21631         639.50181      33.82      <.0001
Year          1             -24.04581           0.67547     -35.60      <.0001
YearSq        1               0.00668        0.00017820      37.51      <.0001

---------------------------------------------------------------------------------------------------
The results tell us that the linear term and the quadratic term are both significant, and that adding YearSq raises R-Square from 0.9197 to 0.9989.

SAS also produces three plots (not reproduced here):

---------------------------------------------------------------------------------------------------

The first plot comes from the first model (MODEL1), which is linear in Year. Its residuals follow a clear curved pattern against the predicted values. This semi-circle shape indicates an inadequate model; perhaps additional terms (such as the quadratic) are needed, or perhaps the data need to be transformed before analysis. If a model fits well, the plot of residuals against predicted values should exhibit no apparent trends.

---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
We can use ODS Graphics to produce more diagnostic plots:

ods html;
ods graphics on;

PROC REG data=mylib.USPopulation;
      /* two labeled models fitted in the same step */
      Linear:    model Population=Year;
      Quadratic: model Population=Year YearSq;
RUN;
QUIT;

ods graphics off;
ods html close;

We omit the results here.
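For comparison with the PROC GLM strengths listed in section 2.0, the quadratic fit can also be written without a precomputed YearSq column, because PROC GLM accepts the polynomial effect directly in the MODEL statement. This is a sketch under the same data assumptions, not part of the original example.

```sas
proc glm data=mylib.USPopulation;
   /* Year*Year is the quadratic effect; no YearSq variable is needed */
   model Population = Year Year*Year / solution;
run;
quit;
```

The fit is the same quadratic model as MODEL1.1 in PROC REG, so the estimates should agree up to numerical precision.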


Author: Jvjpopuc
