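Using the earlier data set (clean.patients) as an example, suppose we want to check whether character variables such as Gender contain valid values, and to list the records that do not. There are several ways to do this.
1. DATA step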
title "Listing of invalid patient numbers and data values";
data _null_; /* _null_ means no output data set is written, which is more efficient */
set clean.patients;
if Gender not in ('F' 'M' ' ') then put Patno= Gender=; /* check Gender */
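if verify(trim(Dx),'0123456789') and not missing(Dx)
then put Patno= Dx=; /* VERIFY returns the position of the first character of trim(Dx) that is NOT in the second string, or 0 if there is none */
/* In SAS 9 this is equivalent to: if notdigit(trim(Dx)) and not missing(Dx) */
if AE not in ('0' '1' ' ') then put Patno= AE=; /* check AE */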
run;
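To make the return values concrete, here is a minimal sketch that runs VERIFY and NOTDIGIT on three made-up Dx values. The values are illustrative assumptions, not records from clean.patients.

```sas
/* Sketch: compare VERIFY and NOTDIGIT on hypothetical Dx values */
data _null_;
   length Dx $3;
   do Dx = '123', '1X3', 'X';
      /* Both return the position of the first non-digit character,
         or 0 when every character is a digit */
      pos_verify   = verify(trim(Dx), '0123456789');
      pos_notdigit = notdigit(trim(Dx));
      put Dx= pos_verify= pos_notdigit=;
   end;
run;
```

Both functions should report 0 for '123', 2 for '1X3' and 1 for 'X'. Because TRIM returns a single blank for a missing value, and a blank is not a digit, the `not missing(Dx)` test in the program above is what keeps missing values out of the listing.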
2. WHERE statement (with PROC PRINT)
title "Listing of invalid character values";
proc print data=clean.patients;
where Gender not in ('M' 'F' ' ') or
notdigit(trim(Dx)) and not missing(Dx) or
AE not in ('0' '1' ' ');
id Patno;
var Gender Dx AE;
run;
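If the invalid rows are needed for later processing rather than just a listing, the same condition can probably be reused as a WHERE= data set option. The sketch below assumes that use case; work.invalid_chars is a hypothetical data set name, and the extra parentheses only make explicit that AND is evaluated before OR.

```sas
/* Sketch: save the invalid rows instead of printing them.
   work.invalid_chars is a hypothetical name chosen for illustration. */
data work.invalid_chars;
   set clean.patients(keep=Patno Gender Dx AE
                      where=(Gender not in ('M' 'F' ' ')
                             or (notdigit(trim(Dx)) and not missing(Dx))
                             or AE not in ('0' '1' ' ')));
run;
```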
3. User-defined formats (PROC FORMAT)
proc format;
value $gender 'F','M' = 'Valid'
' ' = 'Missing'
other = 'Miscoded';
value $ae '0','1' = 'Valid'
' ' = 'Missing'
other = 'Miscoded';
run;
title "Listing of invalid patient numbers and data values";
data _null_;
set clean.patients(keep=Patno Gender AE);
if put(Gender,$gender.) = 'Miscoded' then put Patno= Gender=;
if put(AE,$ae.) = 'Miscoded' then put Patno= AE=;
run;
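The same formats can also give a quick summary before listing individual records. The following is a sketch, assuming the $gender and $ae formats from the PROC FORMAT step above are still defined in the current session.

```sas
/* Sketch: count Valid / Missing / Miscoded values per variable.
   Relies on the $gender and $ae formats created by PROC FORMAT above. */
title "Frequency counts of valid, missing and miscoded values";
proc freq data=clean.patients;
   format Gender $gender. AE $ae.;
   /* MISSING keeps the 'Missing' category in the table itself */
   tables Gender AE / nocum nopercent missing;
run;
```

PROC FREQ groups by the formatted value, so each table should have at most three rows: Valid, Missing and Miscoded.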
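We also need to handle these invalid values: we can either remove them or keep them.
1. Set invalid values to missing. A user-defined informat can convert invalid values to missing while the raw data are read.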
proc format;
invalue $gen 'F','M' = _same_ /* _same_ leaves the values listed to the left of the equals sign unchanged */
other = ' ';
invalue $ae '0','1' = _same_
other = ' ';
run;
data clean.patients_filtered;
infile "c:/books/clean/patients.txt" truncover; /* the raw data are stored as a text file */
input @1 Patno $3.
@4 Gender $gen1.
@27 AE $ae1.;
label Patno = "Patient Number"
Gender = "Gender"
AE = "adverse event?";
run;
title "Listing of data set PATIENTS_FILTERED";
proc print data=clean.patients_filtered;
var Patno Gender AE;
run;
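To try the informats without the raw text file, the sketch below reads three made-up in-stream records. The records, the column layout (Patno in columns 1-3, Gender in column 4, AE in column 5) and the data set name work.demo_filtered are all assumptions for illustration. It relies on the $gen and $ae informats defined in the PROC FORMAT step above.

```sas
/* Sketch: hypothetical in-stream records read with the $gen and $ae informats */
data work.demo_filtered;
   input @1 Patno  $3.
         @4 Gender $gen1.   /* anything other than F or M becomes blank */
         @5 AE     $ae1.;   /* anything other than 0 or 1 becomes blank */
datalines;
001M0
002X1
003FA
;
run;

proc print data=work.demo_filtered;
run;
```

Under these assumptions, Gender should be missing for patient 002 and AE should be missing for patient 003, while the valid values are read unchanged.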
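2. Keep the invalid values and flag them with the INPUT function (note: the function, not the INPUT statement). This approach works directly on an existing SAS data set.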
proc format;
invalue $gender 'F','M' = _same_
other = 'Error';
invalue $ae '0','1' = _same_
other = 'Error';
run;
title "Listing of invalid character values";
data _null_;
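file print; /* send PUT output to the output window instead of the log */
set clean.patients; /* the data set patients already exists in the clean library */
if input (Gender,$gender.) = 'Error' then /* the INPUT function reads Gender with the $gender. informat and returns the result; no new variable is created */
put @1 "Error for Gender for patient:" Patno " value is " Gender;
if input (AE,$ae.) = 'Error' then
put @1 "Error for AE for patient:" Patno " value is " AE;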
run;
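The same comparisons can be stored as flags instead of being written with PUT. This is a sketch under the assumption that the $gender and $ae informats from the PROC FORMAT step just above are in effect; work.char_errors, Gender_error and AE_error are hypothetical names.

```sas
/* Sketch: keep the original values and add 0/1 error flags.
   work.char_errors, Gender_error and AE_error are hypothetical names. */
data work.char_errors;
   set clean.patients(keep=Patno Gender AE);
   /* A comparison evaluates to 1 when true and 0 when false */
   Gender_error = (input(Gender, $gender.) = 'Error');
   AE_error     = (input(AE, $ae.) = 'Error');
   /* Keep only the rows with at least one flagged value */
   if Gender_error or AE_error;
run;
```

Note that with these informats a blank value also appears to fall under OTHER, so missing values would be flagged as errors too; add `not missing(...)` to the conditions if that is not wanted.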
Reference: Cody's Data Cleaning Techniques Using SAS