Note: This paper was written by Diana Suhr, University of Northern Colorado, and is taken from SAS Global Forum 2009.
Abstract
Stratified random sampling is simple and efficient using PROC FREQ and PROC SURVEYSELECT. A routine was
developed to select stratified samples determined by population parameters. SAS code and examples will be shown
to select samples stratified on 1, 2, and 3 variables.
Introduction
Selecting random samples representative of the population is essential for research studies. Definitions, a checklist
for conducting a survey, and examples of selecting stratified random samples are provided in this paper. Annotated
examples determine the sample size for each stratum and stratify on 1, 2, and 3 variables. Before PROC
SURVEYSELECT was available, the RANUNI function with several DATA steps was used to obtain stratified samples.
Appendix A illustrates a RANUNI method to select stratified samples.
Sampling
A sample is a group selected from a population. Inferences about a population can be made from information
obtained in a sample when the sample is representative of the population. Samples based on planned randomness
are called probability samples. Probability sampling has a certain amount of randomness built in so that bias or
unbiasedness can be established and probability statements can be made about the accuracy of the methods
(Scheaffer, Mendenhall, & Ott, 1996). Randomization inherent in probability sampling helps balance out variables that
cannot be controlled or measured directly.
Simple random sampling consists of selecting a group of n units such that each sample of n units has the same
chance of being selected.
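As a minimal sketch of simple random sampling, the step below draws 50 units without replacement from a made-up population of 1,000. The data set names POP and SRS_SAMPLE, the variable UNIT_ID, and the seed value are assumptions for illustration only.

```sas
/* Hypothetical population of 1,000 units (illustrative names). */
DATA POP;
  DO UNIT_ID = 1 TO 1000;
    OUTPUT;
  END;
RUN;

/* Simple random sample of 50 units, selected without replacement. */
/* SEED= makes the selection reproducible from run to run.         */
PROC SURVEYSELECT DATA=POP OUT=SRS_SAMPLE METHOD=SRS
                  SAMPSIZE=50 SEED=12345 NOPRINT;
RUN;
```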
Stratified random sampling occurs when the population is divided into groups, or strata, according to selected
variables (e.g., gender, income) and a simple random sample is selected from each group.
Ratio estimators use responses from variables of interest incorporated with responses from an auxiliary variable
(e.g., ratio of entertainment expense to total household expense when estimating the average yearly amount spent
on entertainment).
Cluster sampling takes a simple random sample of groups and then samples items within the selected clusters.
Systematic sampling selects every nth observation in a list (e.g., every 10th or 15th name).
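Systematic selection can also be requested from PROC SURVEYSELECT with METHOD=SYS. The sketch below assumes a made-up list of 1,000 units and asks for 100 of them, which corresponds to a selection interval of 10 with a random start. The names POP, SYS_SAMPLE, and UNIT_ID are illustrative.

```sas
/* Hypothetical list of 1,000 units in list order (illustrative names). */
DATA POP;
  DO UNIT_ID = 1 TO 1000;
    OUTPUT;
  END;
RUN;

/* Systematic random sample: 100 of 1,000 units, interval of 10. */
PROC SURVEYSELECT DATA=POP OUT=SYS_SAMPLE METHOD=SYS
                  SAMPSIZE=100 SEED=12345 NOPRINT;
RUN;
```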
Unlike simple random sampling, quota sampling selects subjects one at a time until desired percentages are reached.
Polls of the 1948 U.S. presidential election illustrate quota sampling. Respondents were chosen according to gender,
age, income, education, and factors related to political views. However, the polls underestimated the popularity of
Harry Truman and overestimated the popularity of Thomas E. Dewey because Republicans were overrepresented in
the poll. It is impossible to control for all variables in quota sampling.
Convenience sampling results when a group of people is selected because they are available. This type of sampling
could limit inferences, result in bias, and provide a sample unrepresentative of the population.
Planning a Survey
The following checklist could be followed when planning, administering, and analyzing a survey.
1) Statement of objectives. State objectives clearly and concisely. Refer to objectives regularly in the design,
implementation, and analysis of the survey.
2) Measurement instrument. Select an appropriate measurement instrument(s) to answer research questions
and meet objectives.
3) Data analysis. Outline the analyses to answer research questions/objectives.
4) Sample design. Define the target population and sampling variables. Choose a sample design so the
sample provides sufficient information to meet objectives of the survey.
5) Method of measurement. Determine methods of measurement (e.g., interview, mailed questionnaire, direct
observation, online survey).
6) Selection and training of survey administrators. Teach those collecting data/administering survey how to
properly and accurately collect data.
7) Data organization. A plan is necessary for small or large surveys. The organizational plan includes data
management and a codebook.
8) Pilot study. A pilot study provides an opportunity to field-test the measurement instrument, survey administrators,
and management of the survey, and to make modifications.
Sample selection can be accomplished easily with PROC SURVEYSELECT.
PROC SURVEYSELECT SYNTAX
PROC SURVEYSELECT <options>;
   STRATA variables;
   CONTROL variables;
   SIZE variable;
   ID variables;
Selected PROC SURVEYSELECT <options>
DATA= specifies the input data set, the data set from which the sample is selected.
If this option is omitted, the most recently created SAS data set is used.
OUT= specifies the output data set, the data set that contains the sample. If this option is omitted,
the data set is named DATAn, where n is the smallest integer that makes the name unique.
METHOD= specifies the sample selection method. The default method is simple random sampling (METHOD=SRS)
with no SIZE statement. With a SIZE statement, the default method is probability proportional to size
without replacement (METHOD=PPS).
SAMPSIZE= specifies the sample size in one of three ways:
a single number for the sample size
a list of values, one for each stratum
the name of a data set containing the sample size for each stratum
SEED= specifies the initial seed for random number generation.
NOPRINT suppresses displayed output.
Statements:
STRATA partitions the input data set into nonoverlapping groups;
independent samples are selected from each stratum;
STRATA variables behave somewhat like BY variables;
the input data set must be sorted by the STRATA variables.
CONTROL names variables used to sort the input data set;
if STRATA is specified, the input data is sorted by the CONTROL variables within strata.
SIZE names one and only one size variable, which contains the size measures to use when sampling
with probability proportional to size; it is not the same as the SAMPSIZE= option.
ID lists variables from the input data set to be included in the output data set.
With no ID statement, all variables from the input data set are included in the output data set.
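The options and statements above can be combined as in the following sketch, which assumes a made-up population with two strata and requests a fixed sample size per stratum through a SAMPSIZE= list. The names POP, GROUP, UNIT_ID, and STRAT_SAMPLE and the sizes 30 and 20 are hypothetical.

```sas
/* Hypothetical population: 600 units in stratum A, 400 in stratum B. */
DATA POP;
  LENGTH GROUP $1;
  DO UNIT_ID = 1 TO 1000;
    IF UNIT_ID <= 600 THEN GROUP = 'A';
    ELSE GROUP = 'B';
    OUTPUT;
  END;
RUN;

/* The input data set must be sorted by the STRATA variables. */
PROC SORT DATA=POP;
  BY GROUP;
RUN;

/* Select 30 units from stratum A and 20 from stratum B.             */
/* Values in SAMPSIZE=( ) are matched to the strata in sorted order. */
PROC SURVEYSELECT DATA=POP OUT=STRAT_SAMPLE METHOD=SRS
                  SAMPSIZE=(30 20) SEED=2009 NOPRINT;
  STRATA GROUP;
  ID UNIT_ID GROUP;   /* keep only these input variables in the output */
RUN;
```

A list of values is convenient for a few strata. With many strata, a data set of sample sizes, as used in the examples that follow, is easier to maintain.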
Formatting Data
PROC FORMAT;
   /* "lvlfmt" describes level (classification) as freshman, */
   /* sophomore, junior, or senior.                          */
   VALUE LVLFMT
      1 = 'FRESHMAN'
      2 = 'SOPHOMORE'
      3 = 'JUNIOR'
      4 = 'SENIOR';
   /* "colgfmt" describes major college as arts & sciences,   */
   /* education, health & human sciences, business,           */
   /* performing & visual arts, graduate school or undeclared. */
   VALUE COLGFMT
      1 = 'ARTS & SCI'
      2 = 'EDUCATION'
      3 = 'HHS'
      4 = 'BUSINESS'
      5 = 'PVA'
      6 = 'GRAD SCH'
      7 = 'UNDECLARED';
RUN;
Reading Data
/* Data is read from an external file. Formats are "attached" in the  */
/* data step. A format statement could be included in a procedure     */
/* rather than in the data step.                                      */
DATA RAWSUB;
   INFILE RAWSUB;
   INPUT ID 1-4
         LEVEL 6
         GEND $ 8
         MAJCOLG 27;
   FORMAT LEVEL LVLFMT. MAJCOLG COLGFMT.;
RUN;
Example #1
PROC FREQ DATA = RAWSUB;
   TABLES GEND / OUT=NEWFREQ NOPRINT;
DATA NEWFREQ2 ERROR;
   SET NEWFREQ;
   SAMPNUM = (PERCENT * 500)/100;
   _NSIZE_ = ROUND(SAMPNUM,1);
   SAMPNUM = ROUND(SAMPNUM,.01);
   IF _NSIZE_=0 THEN OUTPUT ERROR;
   IF _NSIZE_=0 THEN DELETE;
   OUTPUT NEWFREQ2;
DATA NEWFREQ3;
   SET NEWFREQ2;
   KEEP GEND _NSIZE_;
PROC SORT DATA = NEWFREQ3; BY GEND;
PROC SORT DATA = RAWSUB; BY GEND;
PROC SURVEYSELECT DATA=RAWSUB OUT=SAMPFL SAMPSIZE=NEWFREQ3;
   STRATA GEND;
   ID ID GEND;
PROC FREQ DATA = SAMPFL;
   TABLES GEND / OUT=SAMPFREQ NOPRINT;
PROC PRINT DATA = SAMPFREQ;
   TITLE 'SAMPLE FREQUENCIES';
PROC PRINT DATA = ERROR;
   TITLE 'STRATA DELETED';
PROC DELETE DATA = NEWFREQ NEWFREQ2 NEWFREQ3 SAMPFL SAMPFREQ ERROR;
RUN;
Annotations
The PROC FORMAT statement creates "lvlfmt" to describe level (classification) as freshman, sophomore, junior, or senior and "colgfmt" to describe major college as arts & sciences, education, health & human sciences, business, performing & visual arts, graduate school, or undeclared.
Data is read from an external file. Formats are "attached" in the data step. A format statement could be included in a procedure rather than in the data step.
PROC FREQ calculates gender frequencies and percentages for the total population (data=rawsub) that are not printed (noprint).
Strata sizes are determined in a DATA step. Sample size is 500 in this example. The PROC SURVEYSELECT option SAMPSIZE= specifies the name of the data set containing sample sizes.
_NSIZE_ specifies the sample size for each stratum and must be a positive integer, so it is rounded to an integer in the data step. If _NSIZE_ rounds to zero, the stratum is deleted from the sample size data set and written to an "error" data set.
Gender and sample/strata sizes are kept to read into PROC SURVEYSELECT.
The sample/strata size data set and the population data set are sorted by gender.
PROC SURVEYSELECT stratifies on gender, creates an output data set named "SAMPFL", and keeps identification variables "ID" and "GEND".
Frequencies are output and not printed with PROC FREQ. Values, counts, and percentages are printed with PROC PRINT.
If any stratum sample size is equal to zero, the error data set is printed under the title STRATA DELETED. Data sets are deleted with PROC DELETE.
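The allocation arithmetic in Example #1 can be traced with a small worked case. The sketch below does not use the paper's data: the percentages, the GEND code U, and the data set names ALLOC and ZERO_STRATA are made up for illustration. With a sample of 500, a stratum holding 58.40% of the population receives 292 units, one holding 41.52% receives 207.6, rounded to 208, and one holding 0.08% receives 0.4, which rounds to 0 and is set aside.

```sas
/* Hypothetical PROC FREQ output; PERCENT values are made up. */
DATA NEWFREQ;
  LENGTH GEND $1;
  INPUT GEND $ PERCENT;
  DATALINES;
F 58.40
M 41.52
U 0.08
;
RUN;

/* Proportional allocation for a total sample of 500. */
DATA ALLOC ZERO_STRATA;
  SET NEWFREQ;
  SAMPNUM = (PERCENT * 500) / 100;   /* unrounded share of the sample */
  _NSIZE_ = ROUND(SAMPNUM, 1);       /* nearest integer               */
  IF _NSIZE_ = 0 THEN OUTPUT ZERO_STRATA;
  ELSE OUTPUT ALLOC;
RUN;

PROC PRINT DATA=ALLOC;
RUN;
PROC PRINT DATA=ZERO_STRATA;
RUN;
```

Because each stratum is rounded separately, the rounded sizes need not always add up to exactly 500, so it is worth checking the total before sampling.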
Example #2
PROC FREQ DATA = RAWSUB;
   TABLES LEVEL * GEND / OUT=NEWFREQ NOPRINT;
DATA NEWFREQ2 ERROR;
   SET NEWFREQ;
   SAMPNUM = (PERCENT * 500)/100;
   _NSIZE_ = ROUND(SAMPNUM,1);
   SAMPNUM = ROUND(SAMPNUM,.01);
   IF _NSIZE_=0 THEN OUTPUT ERROR;
   IF _NSIZE_=0 THEN DELETE;
   OUTPUT NEWFREQ2;
DATA NEWFREQ3;
   SET NEWFREQ2;
   KEEP LEVEL GEND _NSIZE_;
PROC SORT DATA = NEWFREQ3; BY LEVEL GEND;
PROC SORT DATA = RAWSUB; BY LEVEL GEND;
PROC SURVEYSELECT DATA=RAWSUB OUT=SAMPFL SAMPSIZE=NEWFREQ3;
   STRATA LEVEL GEND;
   ID ID LEVEL GEND;
PROC FREQ DATA = SAMPFL;
   TABLES LEVEL * GEND / OUT=SAMPFREQ NOPRINT;
PROC PRINT DATA = SAMPFREQ;
   TITLE2 'SAMPLE FREQUENCIES';
PROC PRINT DATA = ERROR;
   TITLE2 'STRATA DELETED';
PROC DELETE DATA = NEWFREQ NEWFREQ2 NEWFREQ3 SAMPFL SAMPFREQ ERROR;
RUN;
Annotations
A similar procedure is followed to determine strata sizes to stratify on two variables, level and gender.
Determine population frequencies and percentages.
Determine sample size (positive integers) for a sample of 500.
Create error data set.
Delete strata size if equal to 0.
Keep level, gender, and strata sizes.
Sort population data set and strata data set by level and gender.
PROC SURVEYSELECT selects a random sample stratified on level and gender, creates an output data set, and keeps id, level, and gender as identifiers.
Check sample frequencies and percentages.
Print an error report.
Delete data sets with PROC DELETE.
Example #3
PROC FREQ DATA = RAWSUB;
   TABLES LEVEL * GEND * MAJCOLG / OUT=NEWFREQ NOPRINT;
DATA NEWFREQ2 ERROR;
   SET NEWFREQ;
   SAMPNUM = (PERCENT * 500)/100;
   _NSIZE_ = ROUND(SAMPNUM,1);
   SAMPNUM = ROUND(SAMPNUM,.01);
   IF _NSIZE_=0 THEN OUTPUT ERROR;
   IF _NSIZE_=0 THEN DELETE;
   OUTPUT NEWFREQ2;
DATA NEWFREQ3;
   SET NEWFREQ2;
   KEEP LEVEL GEND MAJCOLG _NSIZE_;
PROC SORT DATA = NEWFREQ3; BY LEVEL GEND MAJCOLG;
PROC SORT DATA = RAWSUB; BY LEVEL GEND MAJCOLG;
PROC SURVEYSELECT DATA=RAWSUB OUT=SAMPFL SAMPSIZE=NEWFREQ3;
   STRATA LEVEL GEND MAJCOLG;
   ID ID LEVEL GEND MAJCOLG;
PROC FREQ DATA = SAMPFL;
   TABLES LEVEL * GEND * MAJCOLG / OUT=SAMPFREQ NOPRINT;
PROC PRINT DATA = SAMPFREQ;
PROC PRINT DATA = ERROR;
PROC DELETE DATA = NEWFREQ NEWFREQ2 NEWFREQ3 SAMPFL SAMPFREQ ERROR;
RUN;
Annotations
A similar procedure is followed to determine strata sizes to stratify on three variables, level, gender, and college.
Determine population frequencies and percentages.
Determine strata sizes (positive integers) for a sample of 500.
Create error data set.
Delete strata size if equal to 0.
Keep level, gender, major college, and strata sizes.
Sort population data set and strata size data set by level, gender, and major college.
PROC SURVEYSELECT selects a random sample stratified on level, gender, and major college, creates an
output data set, and keeps id, level, gender, and major college as identifiers.
Check sample frequencies and percentages.
Print an error report.
Delete data sets with PROC DELETE.
Conclusion
PROC FREQ and PROC SURVEYSELECT provide a quick, efficient, easy method for stratified random sampling. PROC
FREQ determines population percentages. Those percentages allow calculations in a data step to determine the number
in each stratum. PROC SURVEYSELECT randomly samples from each stratum to provide a stratified sample. The procedure
is as easy as changing the size of the sample in the data step and running the SAS code.
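One way to make the sample size easy to change is to hold it in a macro variable. The following is a minimal sketch of that idea and not part of the routine above. The macro variable SAMPN, the data sets POP, POPFREQ, and STRATSIZE, and the variables GROUP and UNIT_ID are all hypothetical, and the population is made up.

```sas
/* Hypothetical population: 600 units in stratum A, 400 in stratum B. */
DATA POP;
  LENGTH GROUP $1;
  DO UNIT_ID = 1 TO 1000;
    IF UNIT_ID <= 600 THEN GROUP = 'A';
    ELSE GROUP = 'B';
    OUTPUT;
  END;
RUN;

%LET SAMPN = 100;   /* total sample size: change this value and rerun */

/* Population percentages by stratum. */
PROC FREQ DATA=POP NOPRINT;
  TABLES GROUP / OUT=POPFREQ;
RUN;

/* Stratum sample sizes proportional to the population percentages. */
DATA STRATSIZE;
  SET POPFREQ;
  _NSIZE_ = ROUND(PERCENT * &SAMPN / 100, 1);
  KEEP GROUP _NSIZE_;
RUN;

PROC SORT DATA=POP;
  BY GROUP;
RUN;

/* Stratified sample using the computed stratum sizes. */
PROC SURVEYSELECT DATA=POP OUT=SAMPFL SAMPSIZE=STRATSIZE
                  SEED=2009 NOPRINT;
  STRATA GROUP;
RUN;
```

With these made-up counts the strata hold 60% and 40% of the population, so a total of 100 yields 60 and 40 selected units.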