
Stratified Sampling: Selecting a Stratified Sample with PROC SURVEYSELECT

October 24, 2018

Note: This paper was written by Diana Suhr, University of Northern Colorado, and is excerpted from SAS Global Forum 2009.

 

Abstract

Stratified random sampling is simple and efficient using PROC FREQ and PROC SURVEYSELECT. A routine was

developed to select stratified samples determined by population parameters. SAS code and examples will be shown

to select samples stratified on 1, 2, and 3 variables.

 

Introduction

Selecting random samples representative of the population is essential for research studies. Definitions, a checklist

for conducting a survey, and examples of selecting stratified random samples are provided in this paper. Annotated

examples shown determine sample size for each stratum and stratify on 1, 2, and 3 variables. Before PROC

SURVEYSELECT was available, the ranuni function with several data steps was used to obtain stratified samples.

Appendix A illustrates a ranuni method to select stratified samples.
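
As a rough illustration of the general idea only (this is not the author's Appendix A code), one common pre-SURVEYSELECT approach assigns each record a RANUNI key, sorts by stratum and key, and keeps the first few records in each stratum. The data set STUDENTS, the seed, and the per-stratum size of 10 are hypothetical.

```sas
/* Hypothetical population: 300 students, 200 F and 100 M. */
data students;
   length gend $1;
   do id = 1 to 300;
      if mod(id, 3) = 0 then gend = 'M';
      else gend = 'F';
      output;
   end;
run;

/* Attach a uniform(0,1) random key to every record (seed is arbitrary). */
data keyed;
   set students;
   rand = ranuni(2009);
run;

/* Within each stratum, records are now in random order. */
proc sort data=keyed;
   by gend rand;
run;

/* Keep the first 10 records of each stratum. */
data ranuni_sample;
   set keyed;
   by gend;
   if first.gend then n_in_stratum = 0;
   n_in_stratum + 1;
   if n_in_stratum <= 10;
   drop rand n_in_stratum;
run;
```

The several steps above are what a single PROC SURVEYSELECT call with a STRATA statement replaces.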

 

Sampling

A sample is a group selected from a population. Inferences about a population can be made from information

obtained in a sample when the sample is representative of the population. Samples based on planned randomness

are called probability samples. Probability sampling has a certain amount of randomness built in so that bias or

unbiasedness can be established and probability statements can be made about the accuracy of the methods

(Scheaffer, Mendenhall, & Ott, 1996). Randomization inherent in probability sampling helps balance out variables that

cannot be controlled or measured directly.

 

Simple random sampling consists of selecting a group of n units such that each sample of n units has the same

chance of being selected.
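
A minimal sketch of simple random sampling with PROC SURVEYSELECT follows. The data set POPULATION, the sample size of 50, and the seed value are assumptions made for illustration.

```sas
/* Hypothetical population: 1,000 units identified by ID. */
data population;
   do id = 1 to 1000;
      output;
   end;
run;

/* Draw a simple random sample of 50 units without replacement. */
proc surveyselect data=population
                  out=srs_sample
                  method=srs
                  sampsize=50
                  seed=20090322;   /* arbitrary seed so the draw can be repeated */
run;
```

Fixing SEED= makes the selection reproducible; omitting it lets SAS choose a seed from the system clock.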

 

Stratified random sampling occurs when the population is divided into groups, or strata, according to selected

variables (e.g., gender, income) and a simple random sample is selected from each group.

 

Ratio estimators use responses from variables of interest incorporated with responses from an auxiliary variable

(e.g., ratio of entertainment expense to total household expense when estimating the average yearly amount spent

on entertainment).

 

Cluster sampling takes a simple random sample of groups and then samples items within the selected clusters.

 

Systematic sampling selects every nth observation in a list (e.g., every 10th or 15th name).
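
Systematic selection can also be sketched with PROC SURVEYSELECT by using METHOD=SYS. The data set ROSTER, the 10% sampling rate (an interval of 10), and the seed are hypothetical.

```sas
/* Hypothetical list of 200 names, represented here by ID only. */
data roster;
   do id = 1 to 200;
      output;
   end;
run;

/* Systematic sample: a random start, then every 10th record in list order. */
proc surveyselect data=roster
                  out=sys_sample
                  method=sys
                  samprate=0.10
                  seed=12345;      /* arbitrary seed */
run;
```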


Unlike simple random sampling, quota sampling selects subjects one at a time until desired percentages are reached. Polls of the 1948 U.S. presidential election illustrate an example of quota sampling. Respondents were chosen according to gender, age, income, education, and factors related to political views. However, the polls underestimated the popularity of Harry Truman and overestimated the popularity of Thomas E. Dewey because Republicans were overrepresented in the poll. It is impossible to control for all variables in quota sampling.


Convenience sampling results when a group of people are selected because they are available. This type of sampling could limit inferences, result in bias, and provide a sample unrepresentative of the population.

 

Planning a Survey

The following checklist could be followed when planning, administering, and analyzing a survey.

1) Statement of objectives. State objectives clearly and concisely. Refer to objectives regularly in the design,

implementation, and analysis of the survey.

2) Measurement instrument. Select an appropriate measurement instrument(s) to answer research questions

and meet objectives.

3) Data analysis. Outline the analyses to answer research questions/objectives.

4) Sample design. Define the target population and sampling variables. Choose a sample design so the

sample provides sufficient information to meet objectives of the survey.

5) Method of measurement. Determine methods of measurement (e.g., interview, mailed questionnaire, direct

observation, online survey).

6) Selection and training of survey administrators. Teach those collecting data/administering survey how to

properly and accurately collect data.

7) Data organization. A plan is necessary for small or large surveys. The organizational plan includes data

management and a codebook.

8) Pilot study. A pilot study provides an opportunity to field-test the measurement instrument, survey

administrators, and management of the survey, and to make modifications.

 

Sample selection can be accomplished easily with PROC SURVEYSELECT.

 

PROC SURVEYSELECT SYNTAX

PROC SURVEYSELECT <options>;

          STRATA variables;

          CONTROL variables;

          SIZE variable;

          ID variables;

Selected PROC SURVEYSELECT <options>

DATA=         specifies the input data set, the data set from which the sample is selected.

                     If this option is omitted, the most recently created SAS data set is used.

OUT=           specifies the output data set, the data set that contains the sample. If this option is omitted,

                     the data set is named DATAn, where n is the smallest integer that creates a unique name.

METHOD=    specifies the sample selection method. With no SIZE statement, the default method is simple

                     random sampling (METHOD=SRS). With a SIZE statement, the default method is probability

                     proportional to size without replacement (METHOD=PPS).

SAMPSIZE=  specifies the sample size in one of three ways:

                      a single number for the sample size,

                      a list of values, one for each stratum, or

                      the name of a data set containing the sample size for each stratum.

SEED=           specifies the initial seed for random number generation.

NOPRINT      suppresses displayed output (a keyword option that takes no value).

 

Statements:

STRATA            partitions input data set into nonoverlapping groups

                         selects independent samples from strata

                         STRATA variables behave somewhat like BY variables

                         input data set must be sorted by STRATA variables

CONTROL         names variables to sort the input data set

                         if STRATA is specified, input data is sorted by control variables within STRATA.

SIZE                 names one and only one size variable which contains size measures to use when sampling

                         with probability proportional to size; not the same as SAMPSIZE option

ID                     lists variables from the input data set to be included in the output data set.

                         With no ID statement, all variables from input data set are included in output data set
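
To show how these options and statements fit together, here is a small sketch that passes stratum sizes as a list to SAMPSIZE=. The data set STUDENTS, the sizes 20 and 10, and the seed are hypothetical; the list values are matched to strata in the order in which the strata appear in the sorted input data set.

```sas
/* Hypothetical population: 300 students, 200 F and 100 M. */
data students;
   length gend $1;
   do id = 1 to 300;
      if mod(id, 3) = 0 then gend = 'M';
      else gend = 'F';
      output;
   end;
run;

/* The input data set must be sorted by the STRATA variables. */
proc sort data=students;
   by gend;
run;

/* Select 20 from the first stratum (F) and 10 from the second (M). */
proc surveyselect data=students
                  out=strat_sample
                  method=srs
                  sampsize=(20 10)
                  seed=4321        /* arbitrary seed */
                  noprint;
   strata gend;
   id id gend;
run;
```

A list is convenient for a handful of strata; with many strata, a SAMPSIZE= data set containing the STRATA variables and _NSIZE_ is easier to maintain, which is the approach the examples in this paper take.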

 

 

Formatting Data

PROC FORMAT;
   VALUE LVLFMT          /* The PROC FORMAT step creates LVLFMT */
       1 = 'FRESHMAN'    /* to describe level (classification) as freshman, */
       2 = 'SOPHOMORE'   /* sophomore, junior, or senior */
       3 = 'JUNIOR'
       4 = 'SENIOR';
   VALUE COLGFMT         /* and COLGFMT to describe major college as arts & sciences, */
       1 = 'ARTS & SCI'  /* education, health & human sciences, business, */
       2 = 'EDUCATION'   /* performing & visual arts, graduate school, or undeclared. */
       3 = 'HHS'
       4 = 'BUSINESS'
       5 = 'PVA'
       6 = 'GRAD SCH'
       7 = 'UNDECLARED';
RUN;


Reading Data

DATA RAWSUB;
   INFILE RAWSUB;         /* Data is read from an external file (fileref RAWSUB). */
   INPUT ID 1-4           /* Formats are "attached" in the data step. */
         LEVEL 6          /* A FORMAT statement could be included */
         GEND $ 8         /* in a procedure rather than in the data step. */
         MAJCOLG 27;
   FORMAT LEVEL LVLFMT.
          MAJCOLG COLGFMT.;
RUN;


Example #1

PROC FREQ DATA = RAWSUB;                /* population counts and percentages by gender */
   TABLES GEND / OUT=NEWFREQ NOPRINT;
RUN;

DATA NEWFREQ2 ERROR;
   SET NEWFREQ;
   SAMPNUM = (PERCENT * 500) / 100;     /* proportional share of a sample of 500 */
   _NSIZE_ = ROUND(SAMPNUM, 1);         /* stratum sample size as an integer */
   SAMPNUM = ROUND(SAMPNUM, .01);
   IF _NSIZE_ = 0 THEN OUTPUT ERROR;    /* record strata whose size rounds to zero */
   IF _NSIZE_ = 0 THEN DELETE;
   OUTPUT NEWFREQ2;
RUN;

DATA NEWFREQ3;
   SET NEWFREQ2;
   KEEP GEND _NSIZE_;
RUN;

PROC SORT DATA = NEWFREQ3;
   BY GEND;
RUN;

PROC SORT DATA = RAWSUB;
   BY GEND;
RUN;

PROC SURVEYSELECT DATA=RAWSUB
   OUT=SAMPFL
   SAMPSIZE=NEWFREQ3;                   /* data set holding _NSIZE_ for each stratum */
   STRATA GEND;
   ID ID GEND;
RUN;

PROC FREQ DATA = SAMPFL;
   TABLES GEND / OUT=SAMPFREQ NOPRINT;
RUN;

PROC PRINT DATA = SAMPFREQ;
   TITLE 'SAMPLE FREQUENCIES';
RUN;

PROC PRINT DATA = ERROR;
   TITLE 'STRATA DELETED';
RUN;

PROC DELETE DATA = NEWFREQ NEWFREQ2
   NEWFREQ3 SAMPFL SAMPFREQ ERROR;
RUN;


Annotations

The PROC FORMAT step creates "lvlfmt" to describe level (classification) as freshman, sophomore, junior, or senior, and "colgfmt" to describe major college as arts & sciences, education, health & human sciences, business, performing & visual arts, graduate school, or undeclared.

Data is read from an external file. Formats are "attached" in the data step. A format statement could be included in a procedure rather than in the data step.

PROC FREQ calculates gender frequencies and percentages for the total population (data=rawsub); they are not printed (noprint).

Strata sizes are determined in a DATA step. Sample size is 500 in this example. The PROC SURVEYSELECT option SAMPSIZE= specifies the name of the data set containing sample sizes.

_NSIZE_ specifies the sample size for each stratum. It must be a positive integer and is rounded to an integer in the data step. If _NSIZE_ is not a positive integer, the stratum is deleted from the sample size data set and written to an "error" data set.

Gender and sample/strata sizes are kept to read into PROC SURVEYSELECT.

The sample/strata size data set and the population data set are sorted by gender.

PROC SURVEYSELECT stratifies on gender, creates an output data set named "SAMPFL", and keeps the identification variables "ID" and "GEND".

Sample frequencies are output with PROC FREQ but not printed. Values, counts, and percentages are printed with PROC PRINT.

If any stratum size is equal to zero, the deleted strata are printed as an error report. Data sets are deleted with PROC DELETE.
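
The stratum-size arithmetic can be checked by hand on a small case. The sketch below uses made-up percentages (61.26% and 38.74%), not values from the paper's data; the data set ALLOC is hypothetical and stands in for the PROC FREQ output.

```sas
/* Hypothetical percentages standing in for the PROC FREQ OUT= data set. */
data alloc;
   input gend $ percent;
   sampnum = (percent * 500) / 100;   /* F: 306.3   M: 193.7 */
   _nsize_ = round(sampnum, 1);       /* F: 306     M: 194   */
   datalines;
F 61.26
M 38.74
;
run;

proc print data=alloc;
run;
```

Here the rounded sizes happen to add up to 500. In general, rounding each stratum separately can leave the total a unit or two above or below the target, so it is worth summing _NSIZE_ before sampling.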

Example #2

PROC FREQ DATA = RAWSUB;                /* population counts and percentages by level and gender */
   TABLES LEVEL * GEND
          / OUT=NEWFREQ NOPRINT;
RUN;

DATA NEWFREQ2 ERROR;
   SET NEWFREQ;
   SAMPNUM = (PERCENT * 500) / 100;     /* proportional share of a sample of 500 */
   _NSIZE_ = ROUND(SAMPNUM, 1);         /* stratum sample size as an integer */
   SAMPNUM = ROUND(SAMPNUM, .01);
   IF _NSIZE_ = 0 THEN OUTPUT ERROR;    /* record strata whose size rounds to zero */
   IF _NSIZE_ = 0 THEN DELETE;
   OUTPUT NEWFREQ2;
RUN;

DATA NEWFREQ3;
   SET NEWFREQ2;
   KEEP LEVEL GEND _NSIZE_;
RUN;

PROC SORT DATA = NEWFREQ3;
   BY LEVEL GEND;
RUN;

PROC SORT DATA = RAWSUB;
   BY LEVEL GEND;
RUN;

PROC SURVEYSELECT DATA=RAWSUB
   OUT=SAMPFL
   SAMPSIZE=NEWFREQ3;                   /* data set holding _NSIZE_ for each stratum */
   STRATA LEVEL GEND;
   ID ID LEVEL GEND;
RUN;

PROC FREQ DATA = SAMPFL;
   TABLES LEVEL * GEND
          / OUT=SAMPFREQ NOPRINT;
RUN;

PROC PRINT DATA = SAMPFREQ;
   TITLE2 'SAMPLE FREQUENCIES';
RUN;

PROC PRINT DATA = ERROR;
   TITLE2 'STRATA DELETED';
RUN;

PROC DELETE DATA = NEWFREQ NEWFREQ2
   NEWFREQ3 SAMPFL SAMPFREQ ERROR;
RUN;



Annotations

A similar procedure is followed to determine strata sizes to stratify on two variables, level and gender.

Determine population frequencies and percentages.

Determine sample size (positive integers) for a sample of 500.

Create error data set.

Delete strata size if equal to 0.

Keep level, gender, and strata sizes.

Sort population data set and strata data set by level and gender.

PROC SURVEYSELECT selects a random sample stratified on level and gender, creates an output data set, and keeps id, level, and gender as identifiers.

Check sample frequencies and percentages.

Print an error report.

Delete data sets with PROC DELETE.

Example #3

PROC FREQ DATA = RAWSUB;                /* population counts and percentages by level, gender, college */
   TABLES LEVEL * GEND * MAJCOLG / OUT=NEWFREQ NOPRINT;
RUN;

DATA NEWFREQ2 ERROR;
   SET NEWFREQ;
   SAMPNUM = (PERCENT * 500) / 100;     /* proportional share of a sample of 500 */
   _NSIZE_ = ROUND(SAMPNUM, 1);         /* stratum sample size as an integer */
   SAMPNUM = ROUND(SAMPNUM, .01);
   IF _NSIZE_ = 0 THEN OUTPUT ERROR;    /* record strata whose size rounds to zero */
   IF _NSIZE_ = 0 THEN DELETE;
   OUTPUT NEWFREQ2;
RUN;

DATA NEWFREQ3;
   SET NEWFREQ2;
   KEEP LEVEL GEND MAJCOLG _NSIZE_;
RUN;

PROC SORT DATA = NEWFREQ3;
   BY LEVEL GEND MAJCOLG;
RUN;

PROC SORT DATA = RAWSUB;
   BY LEVEL GEND MAJCOLG;
RUN;

PROC SURVEYSELECT DATA=RAWSUB OUT=SAMPFL SAMPSIZE=NEWFREQ3;
   STRATA LEVEL GEND MAJCOLG;
   ID ID LEVEL GEND MAJCOLG;
RUN;

PROC FREQ DATA = SAMPFL;
   TABLES LEVEL * GEND * MAJCOLG / OUT=SAMPFREQ NOPRINT;
RUN;

PROC PRINT DATA = SAMPFREQ;
RUN;

PROC PRINT DATA = ERROR;
RUN;

PROC DELETE DATA = NEWFREQ NEWFREQ2 NEWFREQ3 SAMPFL SAMPFREQ ERROR;
RUN;

Annotations

A similar procedure is followed to determine strata sizes to stratify on three variables, level, gender, and college.

Determine population frequencies and percentages.

Determine strata sizes (positive integers) for a sample of 500.

Create error data set.

Delete strata size if equal to 0.

Keep level, gender, major college, and strata sizes.

Sort population data set and strata size data set by level, gender, and major college.

PROC SURVEYSELECT selects a random sample stratified on level, gender, and major college, creates an

   output data set, and keeps id, level, gender, and major college as identifiers.

Check sample frequencies and percentages.

Print an error report.

Delete data sets with PROC DELETE.

Conclusion

PROC FREQ and PROC SURVEYSELECT facilitate a quick, efficient, easy method for stratified random sampling. PROC

FREQ determines population percentages. Those percentages allow calculations in a data step to determine the number in each stratum. PROC SURVEYSELECT randomly samples from each stratum to provide a stratified sample. The procedure is as easy as changing the size of the sample in the data step and running the SAS code.

 
