The previous post analyzed the main tree-building steps of Mahout's Partial Implementation; below we put the Mahout source code into practice ourselves.
(Note: the MapReduce project needs to import the jar from http://download.csdn.net/detail/fansy1990/5030740, otherwise you will not see the debug messages in the console.)
Create the following class:
package org.fansy.forest.test;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.df.Bagging;
import org.apache.mahout.classifier.df.builder.DecisionTreeBuilder;
import org.apache.mahout.classifier.df.builder.TreeBuilder;
import org.apache.mahout.classifier.df.data.Data;
import org.apache.mahout.classifier.df.data.DataConverter;
import org.apache.mahout.classifier.df.data.Dataset;
import org.apache.mahout.classifier.df.data.Instance;
import org.apache.mahout.classifier.df.node.Node;
import org.apache.mahout.common.RandomUtils;

import com.google.common.collect.Lists;

public class TestBuildTree {

    /**
     * Use the Mahout source code to build a single decision tree.
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        Path dsPath = new Path("/home/fansy/workspace/MahTestDemo/car_small.info");
        String dataPath = "/home/fansy/mahout/data/forest/car_test_small.txt";
        Random rng = RandomUtils.getRandom(555);
        // load the dataset descriptor
        Dataset ds = Dataset.load(new Configuration(), dsPath);
        // the converter turns one text line into an Instance
        DataConverter converter = new DataConverter(ds);
        // load the training data
        Data data = loadData(ds, converter, dataPath);
        // build one tree on a bootstrap sample of the data
        TreeBuilder treeBuilder = new DecisionTreeBuilder();
        Bagging bag = new Bagging(treeBuilder, data);
        Node tree = bag.build(rng);
        System.out.println("the tree is built: " + tree);
    }

    /**
     * Load data from the given data path.
     * @param ds the dataset descriptor
     * @param converter the DataConverter
     * @param dataPath the data file path
     * @return Data
     * @throws IOException
     */
    public static Data loadData(Dataset ds, DataConverter converter, String dataPath)
            throws IOException {
        List<Instance> instances = Lists.newArrayList();
        // close the reader even if readLine() or convert() throws
        BufferedReader br = new BufferedReader(new FileReader(dataPath));
        try {
            String line;
            while ((line = br.readLine()) != null) {
                instances.add(converter.convert(line));
            }
        } finally {
            br.close();
        }
        System.out.println("load file to Data done ...");
        return new Data(ds, instances);
    }
}
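`Bagging.build` first draws a bootstrap sample (sampling with replacement) from the training data before handing it to the tree builder, which is why the "bag data which is changed" section of the output below differs from the input file: some rows appear several times, others not at all. A minimal, self-contained sketch of just that sampling step (plain Java, not Mahout's implementation; class and method names are made up for illustration):

```java
import java.util.Random;

public class BootstrapSketch {

    // Draw a bootstrap sample: n items picked uniformly at random
    // from the original n items, with replacement.
    public static int[] bootstrap(int[] rows, Random rng) {
        int[] sample = new int[rows.length];
        for (int i = 0; i < rows.length; i++) {
            sample[i] = rows[rng.nextInt(rows.length)];
        }
        return sample;
    }

    public static void main(String[] args) {
        int[] rows = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        int[] bag = bootstrap(rows, new Random(555));
        StringBuilder sb = new StringBuilder();
        for (int r : bag) {
            sb.append(r).append(' ');
        }
        System.out.println("bagged row indices: " + sb.toString().trim());
    }
}
```

On average a bootstrap sample of size n leaves out about 37% of the distinct rows, which matches the repeated instances visible in the dump below.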
The car_small.info file can be obtained by following
http://blog.csdn.net/fansy1990/article/details/8443342; the raw data file is the same as car_test_small.txt, see the raw data file in the previous post
http://blog.csdn.net/fansy1990/article/details/8544344.
Run the class directly and you will see the following output:
load file to Data done ...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/fansy/hadoop-1.0.2/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/fansy/mahout-0.7-pure/examples/target/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/fansy/mahout-0.7-pure/core/target/mahout-core-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
__________________________ bag data which is changed
-----------------------
{1:2.0,2:1.0,3:2.0,4:1.0,5:1.0,6:1.0}
{0:1.0,1:2.0,2:3.0,4:1.0}
{0:1.0,1:1.0,3:2.0,4:1.0,5:2.0}
{0:2.0,2:3.0}
{0:1.0,2:3.0,3:1.0,4:2.0,5:1.0,6:1.0}
{1:2.0,2:1.0,3:1.0,5:2.0}
{0:3.0,1:2.0,2:2.0,3:1.0,4:2.0,5:1.0,6:1.0}
{0:1.0,1:1.0,2:2.0,4:1.0}
{0:3.0,1:1.0,2:1.0,3:2.0,4:1.0,5:1.0,6:1.0}
{0:3.0,1:1.0,2:1.0,3:2.0}
{0:2.0,2:3.0,5:2.0}
{1:2.0,2:1.0,3:2.0,4:1.0,5:1.0,6:1.0}
{1:2.0,2:3.0,4:2.0}
{0:1.0,1:3.0,2:1.0,4:2.0,5:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{0:2.0,1:1.0,4:2.0}
{0:1.0,1:1.0,2:2.0,4:1.0}
{0:2.0,1:2.0,2:3.0,3:1.0,4:1.0,5:1.0,6:1.0}
{0:2.0,2:3.0}
{0:1.0,1:1.0,3:2.0,4:1.0,5:2.0}
{0:3.0,2:2.0,4:2.0,5:1.0}
{0:3.0,1:3.0,3:1.0,5:2.0}
{2:3.0,4:2.0,5:1.0}
{0:1.0,1:1.0,2:2.0,4:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{0:1.0,1:2.0,2:3.0,4:1.0}
{1:1.0,2:3.0,3:1.0,4:1.0,5:1.0,6:1.0}
{1:1.0,2:3.0,4:2.0,5:1.0}
{0:2.0,1:1.0,2:1.0,3:1.0,4:2.0,5:2.0}
{0:2.0,1:2.0,2:3.0,3:1.0,4:1.0,5:1.0,6:1.0}
{0:3.0,2:2.0,4:2.0,5:1.0}
{0:1.0,1:2.0,3:2.0,4:2.0}
{0:3.0,2:2.0,4:2.0,5:1.0}
{0:3.0,1:2.0,2:1.0}
{1:2.0,2:1.0,4:1.0,5:1.0}
{0:2.0,1:1.0,2:1.0,4:1.0}
{0:1.0,1:3.0,2:1.0,3:2.0}
{0:1.0,1:2.0,2:2.0,3:2.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{0:1.0,1:1.0,2:2.0,3:2.0,5:1.0}
{0:2.0,2:1.0,4:2.0,5:1.0}
{0:2.0,1:2.0,2:2.0,3:1.0,4:2.0}
{1:2.0,2:1.0,4:1.0,5:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{0:3.0,1:3.0,3:1.0,5:2.0}
{0:3.0,2:3.0,5:2.0}
{1:1.0,2:3.0,4:2.0,5:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{0:3.0,1:1.0,2:1.0,3:2.0,4:1.0,5:1.0,6:1.0}
{0:1.0,1:1.0,3:2.0,4:1.0,5:2.0}
{1:2.0,2:1.0,3:1.0,5:2.0}
{0:3.0,1:2.0,2:2.0,3:1.0,4:2.0,5:1.0,6:1.0}
{0:1.0,1:2.0,2:3.0,4:1.0}
{0:2.0,1:2.0,2:2.0,3:1.0,4:2.0}
-----------------------
goes down*************+time:1359169561154
the igSplit is:null
%%%%%%%%%%% the attributes
2,5,3,
the best ig is:0.3791769206396574,attribute:5,split:NaN
the attributes%%%%%%%%%%%%%
****************** not return but goes down1359169561157
if(complemented) before................
0.0,2.0,1.0,
complemented:true
0.0,2.0,1.0,
if(complemented) after................
subset[0] size:18
subset[1] size:10
subset[2] size:26
******************* cnt:3,---------------->end subsets size
__________________________subsets[0]
-----------------------
{0:1.0,1:2.0,2:3.0,4:1.0}
{0:2.0,2:3.0}
{0:1.0,1:1.0,2:2.0,4:1.0}
{0:3.0,1:1.0,2:1.0,3:2.0}
{1:2.0,2:3.0,4:2.0}
{0:2.0,1:1.0,4:2.0}
{0:1.0,1:1.0,2:2.0,4:1.0}
{0:2.0,2:3.0}
{0:1.0,1:1.0,2:2.0,4:1.0}
{0:1.0,1:2.0,2:3.0,4:1.0}
{0:1.0,1:2.0,3:2.0,4:2.0}
{0:3.0,1:2.0,2:1.0}
{0:2.0,1:1.0,2:1.0,4:1.0}
{0:1.0,1:3.0,2:1.0,3:2.0}
{0:1.0,1:2.0,2:2.0,3:2.0}
{0:2.0,1:2.0,2:2.0,3:1.0,4:2.0}
{0:1.0,1:2.0,2:3.0,4:1.0}
{0:2.0,1:2.0,2:2.0,3:1.0,4:2.0}
-----------------------
XXXXXXXXXXXXXXXXXXXXXXXXXdata.isIdenticalLabel() in DecisionTreeBuilder it should not be here,time is :1359169561167,data.getDataset.getLabel():0.0
__________________________subsets[1]
-----------------------
{0:1.0,1:1.0,3:2.0,4:1.0,5:2.0}
{1:2.0,2:1.0,3:1.0,5:2.0}
{0:2.0,2:3.0,5:2.0}
{0:1.0,1:1.0,3:2.0,4:1.0,5:2.0}
{0:3.0,1:3.0,3:1.0,5:2.0}
{0:2.0,1:1.0,2:1.0,3:1.0,4:2.0,5:2.0}
{0:3.0,1:3.0,3:1.0,5:2.0}
{0:3.0,2:3.0,5:2.0}
{0:1.0,1:1.0,3:2.0,4:1.0,5:2.0}
{1:2.0,2:1.0,3:1.0,5:2.0}
-----------------------
XXXXXXXXXXXXXXXXXXXXXXXXXdata.isIdenticalLabel() in DecisionTreeBuilder it should not be here,time is :1359169561168,data.getDataset.getLabel():0.0
__________________________subsets[2]
-----------------------
{1:2.0,2:1.0,3:2.0,4:1.0,5:1.0,6:1.0}
{0:1.0,2:3.0,3:1.0,4:2.0,5:1.0,6:1.0}
{0:3.0,1:2.0,2:2.0,3:1.0,4:2.0,5:1.0,6:1.0}
{0:3.0,1:1.0,2:1.0,3:2.0,4:1.0,5:1.0,6:1.0}
{1:2.0,2:1.0,3:2.0,4:1.0,5:1.0,6:1.0}
{0:1.0,1:3.0,2:1.0,4:2.0,5:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{0:2.0,1:2.0,2:3.0,3:1.0,4:1.0,5:1.0,6:1.0}
{0:3.0,2:2.0,4:2.0,5:1.0}
{2:3.0,4:2.0,5:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{1:1.0,2:3.0,3:1.0,4:1.0,5:1.0,6:1.0}
{1:1.0,2:3.0,4:2.0,5:1.0}
{0:2.0,1:2.0,2:3.0,3:1.0,4:1.0,5:1.0,6:1.0}
{0:3.0,2:2.0,4:2.0,5:1.0}
{0:3.0,2:2.0,4:2.0,5:1.0}
{1:2.0,2:1.0,4:1.0,5:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{0:1.0,1:1.0,2:2.0,3:2.0,5:1.0}
{0:2.0,2:1.0,4:2.0,5:1.0}
{1:2.0,2:1.0,4:1.0,5:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{1:1.0,2:3.0,4:2.0,5:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{0:3.0,1:1.0,2:1.0,3:2.0,4:1.0,5:1.0,6:1.0}
{0:3.0,1:2.0,2:2.0,3:1.0,4:2.0,5:1.0,6:1.0}
-----------------------
goes down*************+time:1359169561170
%%%%%%%%%%% the attributes
3,0,2,
the best ig is:0.8024757691014436,attribute:3,split:NaN
the attributes%%%%%%%%%%%%%
****************** not return but goes down1359169561170
if(complemented) before................
0.0,2.0,1.0,
complemented:true
0.0,2.0,1.0,
if(complemented) after................
subset[0] size:10
subset[1] size:10
subset[2] size:6
******************* cnt:3,---------------->end subsets size
__________________________subsets[0]
-----------------------
{0:1.0,1:3.0,2:1.0,4:2.0,5:1.0}
{0:3.0,2:2.0,4:2.0,5:1.0}
{2:3.0,4:2.0,5:1.0}
{1:1.0,2:3.0,4:2.0,5:1.0}
{0:3.0,2:2.0,4:2.0,5:1.0}
{0:3.0,2:2.0,4:2.0,5:1.0}
{1:2.0,2:1.0,4:1.0,5:1.0}
{0:2.0,2:1.0,4:2.0,5:1.0}
{1:2.0,2:1.0,4:1.0,5:1.0}
{1:1.0,2:3.0,4:2.0,5:1.0}
-----------------------
XXXXXXXXXXXXXXXXXXXXXXXXXdata.isIdenticalLabel() in DecisionTreeBuilder it should not be here,time is :1359169561172,data.getDataset.getLabel():0.0
__________________________subsets[1]
-----------------------
{1:2.0,2:1.0,3:2.0,4:1.0,5:1.0,6:1.0}
{0:3.0,1:1.0,2:1.0,3:2.0,4:1.0,5:1.0,6:1.0}
{1:2.0,2:1.0,3:2.0,4:1.0,5:1.0,6:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{0:1.0,1:1.0,2:2.0,3:2.0,5:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{0:3.0,1:1.0,2:1.0,3:2.0,4:1.0,5:1.0,6:1.0}
-----------------------
goes down*************+time:1359169561173
%%%%%%%%%%% the attributes
4,0,2,
the best ig is:0.4689955935892812,attribute:4,split:NaN
the attributes%%%%%%%%%%%%%
****************** not return but goes down1359169561173
if(complemented) before................
0.0,2.0,1.0,
complemented:true
0.0,2.0,1.0,
if(complemented) after................
subset[0] size:1
subset[1] size:5
subset[2] size:4
******************* cnt:2,---------------->end subsets size
__________________________subsets[0]
-----------------------
{0:1.0,1:1.0,2:2.0,3:2.0,5:1.0}
-----------------------
isIdentical(data) in DecisionTreeBuilder,time is :1359169561174,data.majorityLabel:0
__________________________subsets[1]
-----------------------
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
{1:2.0,3:2.0,4:2.0,5:1.0,6:1.0}
-----------------------
isIdentical(data) in DecisionTreeBuilder,time is :1359169561175,data.majorityLabel:1
__________________________subsets[2]
-----------------------
{1:2.0,2:1.0,3:2.0,4:1.0,5:1.0,6:1.0}
{0:3.0,1:1.0,2:1.0,3:2.0,4:1.0,5:1.0,6:1.0}
{1:2.0,2:1.0,3:2.0,4:1.0,5:1.0,6:1.0}
{0:3.0,1:1.0,2:1.0,3:2.0,4:1.0,5:1.0,6:1.0}
-----------------------
XXXXXXXXXXXXXXXXXXXXXXXXXdata.isIdenticalLabel() in DecisionTreeBuilder it should not be here,time is :1359169561175,data.getDataset.getLabel():1.0
__________________________subsets[2]
-----------------------
{0:1.0,2:3.0,3:1.0,4:2.0,5:1.0,6:1.0}
{0:3.0,1:2.0,2:2.0,3:1.0,4:2.0,5:1.0,6:1.0}
{0:2.0,1:2.0,2:3.0,3:1.0,4:1.0,5:1.0,6:1.0}
{1:1.0,2:3.0,3:1.0,4:1.0,5:1.0,6:1.0}
{0:2.0,1:2.0,2:3.0,3:1.0,4:1.0,5:1.0,6:1.0}
{0:3.0,1:2.0,2:2.0,3:1.0,4:2.0,5:1.0,6:1.0}
-----------------------
XXXXXXXXXXXXXXXXXXXXXXXXXdata.isIdenticalLabel() in DecisionTreeBuilder it should not be here,time is :1359169561176,data.getDataset.getLabel():1.0
the tree is built: CATEGORICAL:LEAF:0.0;,LEAF:0.0;,CATEGORICAL:LEAF:0.0;,CATEGORICAL:LEAF:0.0;,LEAF:1.0;,LEAF:1.0;,;,LEAF:1.0;,;,;
Compare this output with steps <2.1>~<2.6> of the analysis in the previous post and its meaning becomes clear.
From the last line of output you can also read off the final tree, and it can be mapped back to a tree over the original attribute values.
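The "the best ig is:…" lines above come from the entropy-based information-gain computation that selects the split attribute at each node. A self-contained sketch of that computation for a categorical attribute (plain Java; this is the textbook formula, not Mahout's optimized split code):

```java
public class InfoGainSketch {

    // Shannon entropy of a label count distribution, in bits.
    public static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Information gain of splitting on a categorical attribute:
    // parent entropy minus the size-weighted entropy of each branch.
    // counts[v][l] = number of rows with attribute value v and label l.
    public static double infoGain(int[][] counts) {
        int numLabels = counts[0].length;
        int[] parent = new int[numLabels];
        int total = 0;
        for (int[] branch : counts) {
            for (int l = 0; l < numLabels; l++) {
                parent[l] += branch[l];
                total += branch[l];
            }
        }
        double gain = entropy(parent);
        for (int[] branch : counts) {
            int size = 0;
            for (int c : branch) size += c;
            gain -= ((double) size / total) * entropy(branch);
        }
        return gain;
    }

    public static void main(String[] args) {
        // toy split: branch 0 is pure label-0, branch 1 is a 50/50 mix
        int[][] counts = {{4, 0}, {2, 2}};
        System.out.println("ig = " + infoGain(counts));
    }
}
```

The builder evaluates this gain for each candidate attribute (the "2,5,3," lists in the log) and splits on the one with the highest value; `split:NaN` simply means the chosen attribute is categorical, so no numeric split point is needed.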
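The flattened string in the last output line encodes a tree of CATEGORICAL (internal) nodes and LEAF nodes: each categorical node tests one attribute and routes an instance to the child matching its attribute value. A minimal sketch of that node structure and its classify step (plain Java; these are simplified stand-ins for illustration, not Mahout's actual Node classes):

```java
public class TreeSketch {

    // A node is either a leaf holding a label, or a categorical
    // test on one attribute with one child per attribute value.
    interface Node {
        double classify(double[] instance);
    }

    static final class Leaf implements Node {
        final double label;
        Leaf(double label) { this.label = label; }
        public double classify(double[] instance) { return label; }
    }

    static final class Categorical implements Node {
        final int attr;        // attribute index tested at this node
        final Node[] children; // children[v] handles attribute value v
        Categorical(int attr, Node... children) {
            this.attr = attr;
            this.children = children;
        }
        public double classify(double[] instance) {
            // route the instance to the child for its attribute value
            return children[(int) instance[attr]].classify(instance);
        }
    }

    public static void main(String[] args) {
        // tiny tree: test attribute 5 first; on value 1, test attribute 3
        Node tree = new Categorical(5,
                new Leaf(0.0),
                new Categorical(3, new Leaf(0.0), new Leaf(1.0)),
                new Leaf(0.0));
        double[] x = {1, 2, 1, 1, 2, 1, 0};
        System.out.println("label = " + tree.classify(x));
    }
}
```

Reading the printed tree with this structure in mind, the leaves of the real tree correspond to the `isIdentical`/`isIdenticalLabel` stopping conditions seen earlier in the log.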
Share, be happy, grow.
For reprints please credit the source: http://blog.csdn.net/fansy1990