Weka in Practice: Attribute Selection

2012-11-03

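Using Weka: Attribute Selection
Reposted from: http://anna-zr.iteye.com/blog/578943
http://blog.sina.com.cn/s/blog_591e979d0100kds0.html
This section looks at attribute selection. In data mining research, the similarity between samples is usually measured by a distance, and that distance is computed from attribute values. Different attributes carry different weight in the sample space, that is, they differ in how strongly they are associated with the class. It is therefore often necessary to filter out some attributes or to assign a weight to each one. Attribute selection methods were developed to meet this need.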

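InfoGain and GainRatio are among the most common attribute selection methods, and also the easiest to understand. They rest on the same principle used to build a decision tree: the more information an attribute (node) provides about the class, the higher the weight it receives. Other methods select attributes by correlation, such as Correlation-based Feature Subset Selection for Machine Learning; see the published paper for the details of how it works.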
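To make the idea of "information" concrete, here is a minimal sketch in plain Java that computes information gain and gain ratio for a single nominal attribute. It is not Weka's implementation; the class name `InfoGainSketch`, the method names, and the toy data are all made up for illustration, and it assumes nominal values with no missing data.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: information gain and gain ratio for one nominal attribute.
// All names here are hypothetical; this is not how Weka is implemented.
class InfoGainSketch {

    // Shannon entropy (in bits) of a collection of counts.
    static double entropy(Iterable<Integer> counts) {
        double total = 0;
        for (int c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // Returns {infoGain, gainRatio} for parallel arrays of attribute values
    // and class labels.
    static double[] infoGainAndRatio(String[] attr, String[] cls) {
        Map<String, Integer> classCounts = new HashMap<>();
        Map<String, Integer> valueCounts = new HashMap<>();
        Map<String, Map<String, Integer>> classCountsPerValue = new HashMap<>();
        for (int i = 0; i < attr.length; i++) {
            classCounts.merge(cls[i], 1, Integer::sum);
            valueCounts.merge(attr[i], 1, Integer::sum);
            classCountsPerValue
                .computeIfAbsent(attr[i], k -> new HashMap<>())
                .merge(cls[i], 1, Integer::sum);
        }

        // H(class): entropy of the class before looking at the attribute.
        double classEntropy = entropy(classCounts.values());

        // H(class | attribute): weighted entropy within each attribute value.
        double conditionalEntropy = 0.0;
        for (Map.Entry<String, Map<String, Integer>> e : classCountsPerValue.entrySet()) {
            double weight = valueCounts.get(e.getKey()) / (double) attr.length;
            conditionalEntropy += weight * entropy(e.getValue().values());
        }

        double gain = classEntropy - conditionalEntropy;
        // Gain ratio divides by the entropy of the attribute itself, which
        // penalizes attributes that have many distinct values.
        double splitInfo = entropy(valueCounts.values());
        double ratio = splitInfo == 0.0 ? 0.0 : gain / splitInfo;
        return new double[] {gain, ratio};
    }

    public static void main(String[] args) {
        // Toy data, invented for this example.
        String[] outlook = {"sunny", "sunny", "overcast", "rainy", "rainy", "overcast"};
        String[] play = {"no", "no", "yes", "yes", "no", "yes"};
        double[] r = infoGainAndRatio(outlook, play);
        System.out.printf("InfoGain=%.4f GainRatio=%.4f%n", r[0], r[1]);
    }
}
```

An attribute that splits the classes perfectly gets an information gain equal to the class entropy, while an attribute that is independent of the class gets a gain of zero.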

Here is a simple attribute selection example:

package com.csdn;

import java.io.File;

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class SimpleAttributeSelection {

    public static void main(String[] args) {
        try {
            // Load the training data from an ARFF file.
            File file = new File("C:\\Program Files\\Weka-3-6\\data\\segment-challenge.arff");
            ArffLoader loader = new ArffLoader();
            loader.setFile(file);
            Instances trainIns = loader.getDataSet();

            // Always set the class index before using the instances;
            // otherwise using the Instances object throws an exception.
            trainIns.setClassIndex(trainIns.numAttributes() - 1);

            Ranker rank = new Ranker();
            InfoGainAttributeEval eval = new InfoGainAttributeEval();

            // Compute the information gain of every attribute.
            eval.buildEvaluator(trainIns);

            // Attribute indices in the order produced by the Ranker search.
            int[] attrIndex = rank.search(eval, trainIns);

            StringBuilder attrIndexInfo = new StringBuilder();
            StringBuilder attrInfoGainInfo = new StringBuilder();
            attrIndexInfo.append("Selected attributes:");
            attrInfoGainInfo.append("Ranked attributes:\n");
            for (int i = 0; i < attrIndex.length; i++) {
                attrIndexInfo.append(attrIndex[i]);
                attrIndexInfo.append(",");

                // One line per attribute: InfoGain value, tab, attribute name.
                attrInfoGainInfo.append(eval.evaluateAttribute(attrIndex[i]));
                attrInfoGainInfo.append("\t");
                attrInfoGainInfo.append(trainIns.attribute(attrIndex[i]).name());
                attrInfoGainInfo.append("\n");
            }
            System.out.println(attrIndexInfo.toString());
            System.out.println(attrInfoGainInfo.toString());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

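This example uses the InfoGain attribute evaluator for feature selection. InfoGainAttributeEval computes the information gain of each attribute. Weka also pairs each attribute evaluator with a search method; here we use the simplest one, the Ranker class, which simply sorts the attributes. The search method can be configured further, for example with the set of attributes to search or a threshold; set these in detail as your needs require.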

Finally, the program prints the results: the indices of the selected attributes and the InfoGain value of each attribute.
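The ranking step with a threshold can be illustrated without Weka. The sketch below is plain Java and only mimics the idea of sorting attributes by score and discarding low-scoring ones; it does not call or reproduce Weka's Ranker. The class name `RankingSketch`, the parameters `threshold` and `maxToSelect`, and the scores are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: rank attribute indices by score, with an optional score
// threshold and a cap on how many attributes to keep. All names are
// hypothetical; this is not Weka's Ranker.
class RankingSketch {

    // Returns attribute indices sorted by descending score. Attributes whose
    // score is below `threshold` are dropped, and at most `maxToSelect`
    // indices are returned (a negative value means no cap).
    static int[] rank(double[] scores, double threshold, int maxToSelect) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Stable sort: attributes with equal scores keep their original order.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));

        List<Integer> kept = new ArrayList<>();
        for (int idx : order) {
            if (scores[idx] < threshold) {
                break; // everything after this point scores even lower
            }
            if (maxToSelect >= 0 && kept.size() >= maxToSelect) {
                break;
            }
            kept.add(idx);
        }
        return kept.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        // Made-up scores, one per attribute index.
        double[] scores = {0.10, 0.90, 0.00, 0.50};
        System.out.println(Arrays.toString(rank(scores, 0.05, -1)));
        System.out.println(Arrays.toString(rank(scores, 0.05, 2)));
    }
}
```

A threshold removes attributes that carry almost no information about the class, while a cap on the count is useful when a later step needs a fixed number of attributes.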
