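How naive Bayes classification works:
1. Let D be a set of training tuples and their associated class labels.
2. Suppose there are m classes C1, C2, C3, ..., Cm. Given a tuple X, the classifier predicts that X belongs to the class with the highest posterior probability conditioned on X. That is, naive Bayes predicts that X belongs to class Ci if and only if P(Ci|X) > P(Cj|X) for every j ≠ i.
Bayes' theorem: P(Ci|X) = P(X|Ci)P(Ci)/P(X)
3. The problem therefore becomes comparing P(X|Ci)P(Ci)/P(X) across the classes. Because P(X) is the same for every class, it is enough to compare P(X|Ci)P(Ci). First estimate the prior probability P(Ci) of each class, i.e. the fraction of training tuples in D that belong to Ci.
4. Assume class-conditional independence, meaning that the attribute values x1, x2, ..., xn of X are independent of one another given the class. Then P(X|Ci) = P(x1|Ci) * P(x2|Ci) * ... * P(xn|Ci). Compute P(X|Ci)P(Ci) for every class and assign X to the class with the largest value.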
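Steps 2 to 4 can be collapsed into one decision rule. The symbol \hat{C}(X) for the predicted class is notation introduced here for convenience; everything else follows the definitions above. The second equality drops P(X) because it does not depend on i, and the last one uses the class-conditional independence assumption.

```latex
\begin{align*}
\hat{C}(X) &= \arg\max_{1 \le i \le m} P(C_i \mid X) \\
           &= \arg\max_{1 \le i \le m} \frac{P(X \mid C_i)\, P(C_i)}{P(X)} \\
           &= \arg\max_{1 \le i \le m} P(X \mid C_i)\, P(C_i) \\
           &= \arg\max_{1 \le i \le m} P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i)
\end{align*}
```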
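Training set (the contents of train.txt; one tuple per line, four attribute values followed by the class label):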
<30 high no fair no
<30 high no excellent no
30-40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
30-40 low yes excellent yes
<30 medium no fair no
<30 low yes fair yes
>40 medium yes fair yes
<30 medium yes excellent yes
30-40 medium yes excellent yes
30-40 high yes fair yes
>40 medium no excellent no
Test set (the contents of predict1.txt; attribute values only, no class label):
<30 medium yes fair
>40 high no excellent
30-40 low no excellent
>40 high no fair
<30 medium no fair
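To make the procedure concrete, here is the calculation worked by hand for the first test tuple, X = (<30, medium, yes, fair). It assumes that every probability is estimated as a plain relative frequency in the training set (no smoothing). Of the 14 training tuples, 9 are labeled yes and 5 are labeled no.

```latex
\begin{align*}
P(\text{yes}) &= \tfrac{9}{14}, \qquad P(\text{no}) = \tfrac{5}{14} \\
P(X \mid \text{yes}) &= \tfrac{2}{9} \cdot \tfrac{4}{9} \cdot \tfrac{7}{9} \cdot \tfrac{6}{9}
                      = \tfrac{336}{6561} \approx 0.0512 \\
P(X \mid \text{no})  &= \tfrac{3}{5} \cdot \tfrac{2}{5} \cdot \tfrac{1}{5} \cdot \tfrac{2}{5}
                      = \tfrac{12}{625} = 0.0192 \\
P(X \mid \text{yes})\, P(\text{yes}) &= \tfrac{336}{6561} \cdot \tfrac{9}{14} = \tfrac{8}{243} \approx 0.0329 \\
P(X \mid \text{no})\, P(\text{no})   &= \tfrac{12}{625} \cdot \tfrac{5}{14} = \tfrac{6}{875} \approx 0.0069
\end{align*}
```

Since 0.0329 > 0.0069, this tuple is assigned to class yes.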
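Source code (MATLAB; the helper functions attribute, probality and post_prob that it calls are not listed here):
%function out=my_bayes(X,Y)  % X is the original data set, Y is the data to predict, out is the returned prediction
function out=bayes()
%%%%%%%%%%%%%%%%%%%%%% Read train.txt
clc;
file = textread('train.txt','%s','delimiter','\n','whitespace','');  % read one line per element; whitespace is set to '' so spaces inside a line are kept
[m,n]=size(file);
for i=1:m
    words=strread(file{i},'%s','delimiter',' ');  % split line i on spaces and store the tokens in a cell array
    words=words';
    X{i}=words;
end  % X is now 1*14; each element is itself a cell array of strings, e.g. X{1} is '<30' 'high' 'no' 'fair' 'no'
X=X';  % transpose to 14*1
%%%%%%%%%%%%%%%%%%%%% Read predict1.txt
file = textread('predict1.txt','%s','delimiter','\n','whitespace','');
[m,n]=size(file);
for i=1:m
    words=strread(file{i},'%s','delimiter',' ');
    words=words';
    Y{i}=words;
end
Y=Y';  % transpose
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Training
[M,N]=size(X);
[m,n]=size(X{1});
decision=attribute(X,n);  % extract the decision attribute, i.e. the class column
[ProName,Pro]=probality(decision);  % class names and the prior probability of each class
for i=1:n-1
    [post_pro{i},post_name{i}]=post_prob(attribute(X,i),decision);  % class-conditional probabilities P(x|C) for condition attribute i
end
%%%%%%%%%%%%%%%%%%%%%%%% Prediction
uniq_decis=unique(decision);  % the distinct classes
P_X=ones(size(uniq_decis,1),1);  % initialize the per-class running product
[M,N]=size(Y);
k=1;
for i=1:M
    for j=1:n-1
        [temp,loc]=ismember(attribute({Y{i}},j),unique(attribute(X,j)));  % find which distinct value of attribute j this sample takes
        P_X=post_pro{j}(:,loc).*P_X;  % multiply in the conditional probability of attribute j (naive Bayes product)
        % post_pro{j}(:,loc): j is the attribute index and loc is the column of the observed value;
        % that column holds the conditional probability of the value under each class (no and yes)
    end
    % P_X has two rows: the product of the independent conditional probabilities for each class
    P_X=P_X.*Pro;  % multiply by the class priors
    [MAX,I]=max(P_X);  % find the largest score
    out{k}=uniq_decis{I};  % the sample is assigned to the class with the largest score
    k=k+1;
    P_X=ones(size(uniq_decis,1),1);  % reset P_X for the next sample
end
out=out';  % return the result as a column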
Result:
>> out
out =
    'yes'
    'no'
    'yes'
    'no'
    'no'
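As a cross-check of this output, the same computation can be sketched in Python. This is not a port of the MATLAB program above: the function names train and predict, the in-memory TRAIN and TEST strings (standing in for train.txt and predict1.txt), and the use of plain relative frequencies without smoothing are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: the data is embedded as strings instead of being read from files.
TRAIN = """\
<30 high no fair no
<30 high no excellent no
30-40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
30-40 low yes excellent yes
<30 medium no fair no
<30 low yes fair yes
>40 medium yes fair yes
<30 medium yes excellent yes
30-40 medium yes excellent yes
30-40 high yes fair yes
>40 medium no excellent no"""

TEST = """\
<30 medium yes fair
>40 high no excellent
30-40 low no excellent
>40 high no fair
<30 medium no fair"""


def train(rows):
    """Count class frequencies and (attribute index, value, class) frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    cond_counts = Counter()
    for row in rows:
        label = row[-1]
        for k, value in enumerate(row[:-1]):
            cond_counts[(k, value, label)] += 1
    return class_counts, cond_counts


def predict(x, class_counts, cond_counts):
    """Return the class maximizing P(Ci) * prod_k P(xk|Ci), plus all scores."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(Ci) as a relative frequency
        for k, value in enumerate(x):
            # P(xk|Ci) as a relative frequency; unseen combinations count as 0
            score *= cond_counts[(k, value, label)] / count
        scores[label] = score
    return max(scores, key=scores.get), scores


rows = [line.split() for line in TRAIN.splitlines()]
tests = [line.split() for line in TEST.splitlines()]
class_counts, cond_counts = train(rows)
predictions = [predict(x, class_counts, cond_counts)[0] for x in tests]
print(predictions)
```

For the data above this prints ['yes', 'no', 'yes', 'no', 'no'], the same labels as the MATLAB run. Because no smoothing is applied, an attribute value that never occurs with a class in the training set (here 30-40 with class no) gives that class a score of exactly 0, which is what decides the third test tuple.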