
Using SVMlin, a Semi-Supervised Learning Tool

October 6, 2013

Reposted from Koala++'s blog, with thanks to the original author.

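SVMlin implements both supervised and semi-supervised SVM algorithms. It can be downloaded from http://people.cs.uchicago.edu/~vikass/svmlin.html, though a Google search for "svmlin" finds it just as quickly.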

SVMlin is software package for linear SVMs. It is well-suited to classification problems involving a large number of examples and features. It is primarily written for sparse datasets (number of non-zero features in an example is typically small). It is written in C++ (mostly C).

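Roughly translated: the author says SVMlin can handle datasets with many examples and many features (I have my doubts: the data structure it uses is just a plain array, so can it really cope?). It is aimed mainly at sparse datasets, meaning most feature values are zero, and it is written in C++ (though most of the code is really C).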

SVMlin can also utilize unlabeled data, in addition to labeled examples. It currently implements two extensions of standard SVMs to incorporate unlabeled examples.

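SVMlin can use unlabeled examples for classification in addition to labeled ones, and it currently implements two extensions of the standard SVM. ("Currently"? More like permanently, I suspect: it has not been updated since 2006, so I did not bother reporting the bugs I found.)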

For a Reuters text categorization problem with around 804414 labeled examples and 47326 features, SVMlin takes less than two minutes to train a linear SVM on an Intel machine with 3GHz processor and 2GB RAM. Given just 1000 labels, it can utilize the remaining hundreds of thousands of unlabeled examples for training a semi-supervised linear SVM in about 20 minutes. Unlabeled data can be very useful in improving classification performance when labels are relatively few.

These are the author's figures showing how capable the algorithm is, and they do look impressive.

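The data format is similar to LibSVM's, except that the class label is not stored in the first column. Instead, the features and the labels are kept in two separate files (the author was perhaps being a little lazy, though generous in sharing the code, and this certainly makes the parser easier to write).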

For example, the following data matrix with 4 examples and 5 features
0 3 0 0 1
4 1 0 0 0
0 5 9 2 0
6 0 0 5 3

is described in the input file as

2:3 5:1
1:4 2:1
2:5 3:9 4:2
1:6 4:5 5:3

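This is the author's example of the data format: each line of the input file describes one example as index:value pairs, with feature indices starting at 1 and zero-valued features left out.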
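As a rough illustration of the format above, the following Python sketch converts a dense matrix into SVMlin-style input lines. The function names to_sparse_line and to_sparse_text are made up for this example and are not part of SVMlin. The sketch assumes 1-based feature indices and that zero entries are simply omitted, as the author's example suggests.

```python
def to_sparse_line(row):
    """Format one dense row as space-separated 'index:value' pairs.

    Indices are 1-based and zero-valued features are skipped
    (an assumption read off the example in the text).
    """
    return " ".join(f"{i}:{v}" for i, v in enumerate(row, start=1) if v != 0)


def to_sparse_text(matrix):
    """Format a whole dense matrix, one example per line."""
    return "".join(to_sparse_line(row) + "\n" for row in matrix)


# The 4 x 5 data matrix from the example in the text.
dense = [
    [0, 3, 0, 0, 1],
    [4, 1, 0, 0, 0],
    [0, 5, 9, 2, 0],
    [6, 0, 0, 5, 3],
]

print(to_sparse_text(dense), end="")
```

Writing the returned text to a file gives something that can be passed to svmlin as the examples file, alongside a separate labels file.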

The file containing labels is separate since it is routine to use the same inputs with different labels. Each line should contain a label for the corresponding line in the input file with one of the following values:
+1 (labeled positive example)
-1 (labeled negative example)
0 (unlabeled examples)

That is, +1 marks a positive example, -1 a negative example, and 0 an unlabeled example.
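Because unlabeled examples are marked with 0, a fully labeled dataset can be turned into a semi-supervised one by hiding most of its labels. Here is a minimal Python sketch of that idea, assuming the labels are already +1/-1 integers. The helper names mask_labels and format_labels are hypothetical and not part of SVMlin.

```python
import random


def mask_labels(labels, n_labeled, seed=0):
    """Keep n_labeled randomly chosen labels and replace the rest with 0."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    keep = set(rng.sample(range(len(labels)), n_labeled))
    return [y if i in keep else 0 for i, y in enumerate(labels)]


def format_labels(labels):
    """Render labels one per line as +1, -1, or 0."""
    return "".join(("0" if y == 0 else f"{y:+d}") + "\n" for y in labels)


full = [1, -1, 1, -1, 1, -1]
masked = mask_labels(full, n_labeled=2, seed=1)
print(format_labels(masked), end="")
```

Line k of the resulting label file must still correspond to line k of the examples file, so the examples themselves are left untouched.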

Download the file example.tar.gz or example.zip

Download either of these archives and try the example data first.

The commands for running the two semi-supervised algorithms are as follows (to be clear: enter one command at a time, not both together).

1: svmlin -A 2 -W 0.001 -U 1 -R 0.5 example/training_examples example/training_labels

2: svmlin -A 3 -W 0.001 -U 1 -R 0.5 example/training_examples example/training_labels
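The two commands differ only in the value passed to -A, so when scripting experiments it can help to build them from one template. The Python sketch below only assembles and prints the argument lists. It uses exactly the flags and values shown above; the function name svmlin_train_cmd and its parameter names are invented for this example.

```python
def svmlin_train_cmd(algorithm, examples, labels,
                     w="0.001", u="1", r="0.5", exe="svmlin"):
    """Build the training command as an argument list.

    Only the flags that appear in the text (-A, -W, -U, -R) are used;
    the keyword names w, u, r, and exe are illustrative.
    """
    return [exe, "-A", str(algorithm), "-W", w, "-U", u, "-R", r,
            examples, labels]


for algorithm in (2, 3):
    cmd = svmlin_train_cmd(algorithm,
                           "example/training_examples",
                           "example/training_labels")
    print(" ".join(cmd))
```

Once the executable is built, each list could be handed to subprocess.run(cmd, check=True) from the standard library to run the training step.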

Then use the following command to check the accuracy on the test set:

svmlin -f training_examples.weights example/test_examples example/test_labels
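To spell out what accuracy means for a linear classifier like this one, here is a small Python sketch. It assumes one real-valued decision output per test example and takes the sign of that output as the predicted class; the function name accuracy and the sample numbers are illustrative only and do not come from SVMlin.

```python
def accuracy(outputs, labels):
    """Fraction of labeled examples whose output sign matches the label.

    outputs: real-valued decision values, one per example (assumed).
    labels:  +1 / -1 ground-truth labels; entries equal to 0 are skipped.
    An output of exactly 0 is counted as a negative prediction here.
    """
    pairs = [(o, y) for o, y in zip(outputs, labels) if y != 0]
    if not pairs:
        raise ValueError("no labeled examples to score")
    correct = sum(1 for o, y in pairs if (1 if o > 0 else -1) == y)
    return correct / len(pairs)


# Made-up decision values and labels, for illustration only.
print(accuracy([0.7, -1.2, 0.1, -0.3], [1, -1, -1, -1]))
```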

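On Linux none of this is a problem once a compiler toolchain is installed (Ubuntu did not seem to ship with one, which left this Windows fan confused for a while; search online for how to install it).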

Type

make

This will create an executable

svmlin

As the author says, running make produces an executable named svmlin.

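On Windows the most convenient option is of course Visual Studio. However, Visual C++ 6.0 does not fully conform to standard C++: a variable declared in a for-loop header stays in scope after the loop ends (which is very annoying: once i, j, and k are used up, what do you name the next loop variable?). As a result, compiling SVMlin under VC 6.0 reports that i and j are redefined. The simplest fix is to delete the repeated declarations. One more note, on register int: it asks the compiler to keep i in a register, but with current compilers that is only a hint, and one that is usually ignored.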

After those changes, two errors remain: isnan and isinf are undefined.

Add the following macros to ssl.h:

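/* VC 6.0 has no C99 isnan/isinf, so provide stand-ins on Windows. */
#if defined(WIN32)
#ifndef FOUND_C99_ISXX
#undef isnan
#undef isinf
#endif
/* Stub: always reports "not NaN", so NaN checks are effectively disabled. */
#if !defined(isnan) && !defined(HAVE_ISNAN) && !defined(HAVE_C99_ISNAN)
#define isnan(val) (0)
#endif
/* Stub: always reports "not infinite". */
#if !defined(isinf) && !defined(HAVE_ISINF) && !defined(HAVE_C99_ISINF)
#define isinf(val) (0)
#endif
#endif /* WIN32 */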

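I will not cover how to add a file to the project; explaining that would feel like insulting the reader. At this point, barring any silly mistakes, the program should build and run.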


Posted by: dreadlock

This post was published by dreadlock in the General category; last updated October 6, 2013.