Using Weka from Java (3): Feature Selection

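Continuing the Weka programming series. Feature selection is an important step in data mining: its main job is to reduce dimensionality, which lowers computational cost and discards features that may be noise. In my paper and my master's thesis I used the CFS feature subset selection method together with best-first or greedy search, which reduces and simplifies a fairly high-dimensional training feature set. With CFS + Best first, the 145-dimensional features of my training samples come down to roughly 40-50.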
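To make the idea concrete, here is a minimal sketch of subset scoring plus greedy forward search, independent of Weka. It assumes the textbook CFS merit, k * avg(r_cf) / sqrt(k + k(k-1) * avg(r_ff)), where r_cf is a feature-class correlation and r_ff a feature-feature correlation. The class name CfsSketch, the method names, and all correlation values are made up for illustration; this is not Weka's code, and Weka's BestFirst additionally backtracks, which this greedy version does not.

```java
import java.util.ArrayList;
import java.util.List;

public class CfsSketch {

    /** Textbook CFS merit: k*avg(r_cf) / sqrt(k + k*(k-1)*avg(r_ff)). */
    public static double merit(List<Integer> subset, double[] classCorr, double[][] featCorr) {
        int k = subset.size();
        if (k == 0) {
            return 0.0;
        }
        double sumCf = 0.0;
        double sumFf = 0.0;
        for (int a = 0; a < k; a++) {
            sumCf += classCorr[subset.get(a)];
            for (int b = a + 1; b < k; b++) {
                sumFf += featCorr[subset.get(a)][subset.get(b)];
            }
        }
        double avgCf = sumCf / k;
        // Average over the k*(k-1)/2 feature pairs; zero when there is only one feature.
        double avgFf = k > 1 ? sumFf / (k * (k - 1) / 2.0) : 0.0;
        return k * avgCf / Math.sqrt(k + k * (k - 1) * avgFf);
    }

    /** Greedy forward search: keep adding the feature that raises the merit most. */
    public static List<Integer> greedyForward(double[] classCorr, double[][] featCorr) {
        List<Integer> selected = new ArrayList<>();
        double best = 0.0;
        boolean improved = true;
        while (improved) {
            improved = false;
            int bestFeature = -1;
            for (int f = 0; f < classCorr.length; f++) {
                if (selected.contains(f)) {
                    continue;
                }
                selected.add(f);                          // try the candidate
                double m = merit(selected, classCorr, featCorr);
                selected.remove(selected.size() - 1);     // undo the trial
                if (m > best) {
                    best = m;
                    bestFeature = f;
                    improved = true;
                }
            }
            if (improved) {
                selected.add(bestFeature);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up correlations for three features; not measured on any real data set.
        double[] classCorr = {0.8, 0.7, 0.6};
        double[][] featCorr = {
            {1.0, 0.9, 0.1},
            {0.9, 1.0, 0.1},
            {0.1, 0.1, 1.0}
        };
        System.out.println(greedyForward(classCorr, featCorr));
    }
}
```

With these made-up numbers the program prints [0, 2]: feature 1 is strongly correlated with the class but nearly redundant with feature 0, so the merit prefers the weaker but complementary feature 2.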
The test code below shows the concrete implementation with Weka (for demonstration only):

package edu.tju.ikse.mi.util;

import java.io.File;
import java.io.IOException;
import java.util.Random; // only needed by the commented-out cross-validation block

import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.ASSearch;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

/**
 * @author Jia Yu
 * @date 2010-11-23
 */
public class WekaSelector {

    private ArffLoader loader;
    private Instances dataSet;
    private File arffFile;
    private int sizeOfDataset;
    private int numOfOldAttributes;
    private int numOfNewAttributes;
    private int classIndex;
    private int[] selectedAttributes;

    public WekaSelector(File file) throws IOException {
        loader = new ArffLoader();
        arffFile = file;
        loader.setFile(arffFile);
        dataSet = loader.getDataSet();
        sizeOfDataset = dataSet.numInstances();
        numOfOldAttributes = dataSet.numAttributes();
        // The last attribute is treated as the class label.
        classIndex = numOfOldAttributes - 1;
        dataSet.setClassIndex(classIndex);
    }

    public void select() throws Exception {
        // CFS subset evaluator combined with best-first search.
        ASEvaluation evaluator = new CfsSubsetEval();
        ASSearch search = new BestFirst();
        AttributeSelection eval = new AttributeSelection();

        eval.setEvaluator(evaluator);
        eval.setSearch(search);

        // Run the selection on the whole data set.
        eval.SelectAttributes(dataSet);
        numOfNewAttributes = eval.numberAttributesSelected();
        selectedAttributes = eval.selectedAttributes();
        System.out.println("result is " + eval.toResultsString());
        /*
        // Cross-validation variant; seed and numFolds must be declared before enabling it.
        Random random = new Random(seed);
        dataSet.randomize(random);
        if (dataSet.attribute(classIndex).isNominal()) {
            dataSet.stratify(numFolds);
        }
        for (int fold = 0; fold < numFolds; fold++) {
            Instances train = dataSet.trainCV(numFolds, fold, random);
            eval.selectAttributesCVSplit(train);
        }
        System.out.println("result is " + eval.CVResultsString());
        */
        System.out.println("old number of Attributes is " + numOfOldAttributes);
        System.out.println("new number of Attributes is " + numOfNewAttributes);
        // Print the 0-based indices of the selected attributes.
        for (int i = 0; i < selectedAttributes.length; i++) {
            System.out.println(selectedAttributes[i]);
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        File file = new File("iris.arff");
        try {
            WekaSelector ws = new WekaSelector(file);
            ws.select();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}

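The commented-out part of the code is the cross-validation variant. The default is 10-fold cross-validation, which can of course be changed through the set methods. For the details, or for the reduce-dimensionality methods, see the source code; Weka being open source makes this convenient. The main class to read is weka.attributeSelection.AttributeSelection. To see how it is invoked and how the options are chosen, look at weka.gui.explorer.AttributeSelectionPanel.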
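Running the selection once per fold raises the question of how to read the per-fold results together. The sketch below shows one plain way to do that: count in how many folds each attribute index was chosen. It does not call Weka (Weka builds its own report in CVResultsString()); the class name FoldTally, the method name, and the per-fold selections are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

public class FoldTally {

    /** Counts in how many folds each attribute index was selected. */
    public static Map<Integer, Integer> countSelections(int[][] selectedPerFold) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int[] fold : selectedPerFold) {
            for (int attr : fold) {
                counts.merge(attr, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical selections from three folds; not the output of a real run.
        int[][] selectedPerFold = { {2, 3}, {2, 3}, {0, 3} };
        Map<Integer, Integer> counts = countSelections(selectedPerFold);
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            System.out.println("attribute " + e.getKey() + ": "
                    + e.getValue() + " of " + selectedPerFold.length + " folds");
        }
    }
}
```

Attributes that are picked in most folds are the ones whose selection is least sensitive to the particular training split.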

The code above produces the following output:

result is

=== Attribute Selection on all input data ===

Search Method:
    Best first.
    Start set: no attributes
    Search direction: forward
    Stale search after 5 node expansions
    Total number of subsets evaluated: 12
    Merit of best subset found: 0.887

Attribute Subset Evaluator (supervised, Class (nominal): 5 class):
    CFS Subset Evaluator
    Including locally predictive attributes

Selected attributes: 3,4 : 2
    petallength
    petalwidth

old number of Attributes is 5
new number of Attributes is 2
2
3
4

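The original iris data set has 4 features plus one class label, so 5 attributes in total. After feature selection only the 3rd and 4th features are kept, so the new feature subset has two dimensions (not counting the class label; this is a bit convoluted, sorry, I always do this).
The final 2, 3, 4 are 0-based indices into the attribute array: the retained set is the 3rd, 4th, and 5th attributes, where the 5th (index 4) is the class attribute set in the constructor.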
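Because the printed index array also carries the class index, it helps to filter that out before using the indices as a feature list. The following is a minimal sketch without any Weka dependency; the class name SelectedNames and the method featureNames are hypothetical, and the attribute names sepallength, sepalwidth, and class are assumed from the usual iris.arff (only petallength and petalwidth appear in the output above).

```java
import java.util.ArrayList;
import java.util.List;

public class SelectedNames {

    /** Returns the names of the selected attributes, skipping the class index. */
    public static List<String> featureNames(int[] selected, int classIndex, String[] names) {
        List<String> result = new ArrayList<>();
        for (int index : selected) {
            if (index != classIndex) {
                result.add(names[index]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed attribute order of iris.arff; the class attribute is last.
        String[] names = {"sepallength", "sepalwidth", "petallength", "petalwidth", "class"};
        int[] selected = {2, 3, 4}; // the indices printed by the run above
        System.out.println(featureNames(selected, 4, names));
    }
}
```

This prints [petallength, petalwidth], matching the two attributes listed under "Selected attributes" in the Weka report.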

Posted by changedi, 2010-11-23 10:06
