【Naive Bayes】Naive Bayes in Practice: Code Implementation, Data and Interfaces

2012-09-18

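Next, we turn to the code implementation.

【Sample Format】

First, define the format of the input samples that are used to train the naive Bayes model. The format is as follows: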

1:9242 13626 28005 41622 41623 34625 36848 5342 51265
0:16712 49100 2933 65827 6200
1:53396 3675 43979 25739
0:17347 61515 53679 59426
1:32712 39134 63265 65430

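Each line is one sample. The line starts with the index of the class the sample belongs to; class indices start at 0. When classes are given as labels, a mapping between class labels and class indices can be built; this is left to a separate program and is simple to do. A colon (":") separates the class index from the input features that follow, which are the indices of feature words. A mapping between feature-word strings and their indices can be built in the same way. Using indices has two advantages: 1. it makes the classifier algorithm more general; 2. integers are more efficient to process than strings.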
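The format above can be read with a few lines of C++. The following is a minimal sketch, not the article's own parsing code: the helper name ParseSampleLine is hypothetical, and it assumes that feature indices are separated by whitespace and that the separator is passed in as a string, as in the sample lines shown earlier.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the original article): splits one sample line
// such as "1:9242 13626 28005" into a class index and a list of feature indices.
// Returns false if the separator is missing or no valid class index is found.
bool ParseSampleLine(const std::string& sLine, const std::string& sSegmenter,
                     int& iClassId, std::vector<int>& FeaIdVec) {
    FeaIdVec.clear();
    if (sSegmenter.empty()) return false;
    const std::size_t pos = sLine.find(sSegmenter);
    if (pos == std::string::npos) return false;

    // The text before the separator must start with a non-negative integer.
    std::istringstream classStream(sLine.substr(0, pos));
    if (!(classStream >> iClassId) || iClassId < 0) return false;

    // The text after the separator is read as whitespace-separated integers;
    // reading stops at the first token that is not an integer.
    std::istringstream feaStream(sLine.substr(pos + sSegmenter.size()));
    int iFeaId = 0;
    while (feaStream >> iFeaId) FeaIdVec.push_back(iFeaId);
    return true;
}
```

Calling ParseSampleLine("0:16712 49100 2933 65827 6200", ":", iClassId, FeaIdVec) would leave iClassId at 0 and FeaIdVec holding the five feature indices. Range checks against the number of classes and the vocabulary size are deliberately left to the caller in this sketch.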

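【Data Structures】

Next, define the data structures. The article "Naive Bayes in Practice: Basic Principles" already analyzed the model's parameter space, so I give the code directly: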

// The format of input samples:
// ClassLabelIndex segmenter(not whitespace) ItemOneIndex whitespace ItemTwoIndex ......
// iClassNum: the number of class labels; valid indices are [0, iClassNum-1]
// iFeaTypeNum: the number of feature types; valid indices are [0, iFeaTypeNum-1]
// sSegmenter: the separator between the class label index and the item indices
// iFeaExtractNum: the number of features to select
// sFileModel: the text file the model is written to
// bCompactModel: if true, omit the extra debugging information from the model file
//
// The format of the compact model parameters:
// 1. the number of classes
// 2. the prior probability of each class
// 3. the conditional probability p(item|class)
bool Train (const char * sFileSample, int iClassNum, int iFeaTypeNum, string & sSegmenter, int iFeaExtractNum, const char * sFileModel, bool bCompactModel = true);

// Load the naive Bayes model
bool LoadNaiveBayesModel (const char * sFileModel);

// Predict the class from the input features
bool PredictByInputFeas (vector<int> & FeaIdVec, int & iClassId);

// Predict on a test corpus whose format is the same as that of the training corpus
bool PredictFrmTstCorpus (const char * sFileTestCorpus, string & sSegmenter, const char * sFileOutput);

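The Train function takes quite a few parameters. First comes the text file of input samples, in the format described above. Next are the number of sample classes (which determines the size of ClassFeaVec) and the number of feature types (usually the size of the vocabulary). Then comes the separator between the class index and the features, which is ":" in our example. After that is the number of features to keep in the final model. In this program, feature selection and parameter training are done together: the advantage is that the samples are scanned only once, which is efficient; the drawback is that the code is too tightly coupled and hard to extend. Last is the output model file. bCompactModel is a flag that controls whether extra information is written out for debugging the model; by default this extra output is turned off.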
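To make the role of the two size parameters concrete, here is a minimal sketch of count storage that a single pass over the samples could fill. It is an assumption for illustration only: the layout of the author's ClassFeaVec is not shown in this article, and the names CountTable, ClassCount, FeaCount, and AddSample are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for raw training counts. The article's ClassFeaVec is
// not shown, so this layout is an assumption, not the author's data structure.
struct CountTable {
    std::vector<int> ClassCount;             // number of samples seen per class
    std::vector<std::vector<int>> FeaCount;  // FeaCount[c][f]: occurrences of feature f in class c

    // Both tables are sized up front from the class count and the vocabulary size.
    CountTable(int iClassNum, int iFeaTypeNum)
        : ClassCount(iClassNum, 0),
          FeaCount(iClassNum, std::vector<int>(iFeaTypeNum, 0)) {}

    // Adds one parsed sample. Returns false and changes nothing if the class
    // index or any feature index is out of range.
    bool AddSample(int iClassId, const std::vector<int>& FeaIdVec) {
        if (iClassId < 0 || iClassId >= static_cast<int>(ClassCount.size())) return false;
        const int iFeaTypeNum = static_cast<int>(FeaCount[iClassId].size());
        for (int f : FeaIdVec) {
            if (f < 0 || f >= iFeaTypeNum) return false;
        }
        ++ClassCount[iClassId];
        for (int f : FeaIdVec) ++FeaCount[iClassId][f];
        return true;
    }
};
```

With this layout the memory cost is iClassNum × iFeaTypeNum integers, which is why both numbers have to be known before training starts. Turning the counts into prior and conditional probabilities, and selecting iFeaExtractNum features, are separate steps that this sketch does not attempt.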

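In fact, everything above lives inside a single class, NaiveBayes; it is split up here only for ease of explanation. The next article covers the training process.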
