Paoding Analysis: Chinese Word Segmentation for Lucene
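Paoding Analysis summary
Paoding's Knives is a Chinese word segmenter with very high efficiency and high extensibility. It introduces a metaphor, uses a fully object-oriented design, and is advanced in concept.
High efficiency: on a PIII personal machine with 1 GB of memory, it can accurately segment 1 million Chinese characters per second.
It segments text using dictionary files, with no limit on how many are used, which makes it possible to define vocabulary by category.
It can parse unknown words sensibly.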
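To make the idea of dictionary-driven segmentation more concrete, here is a minimal sketch. It uses forward maximum matching, a simple textbook technique, and is not Paoding's own algorithm. The class name `DictionarySegmenter`, the in-memory dictionary, and the rule of emitting an unmatched character as a single-character token are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: forward maximum matching over an in-memory dictionary.
// This is NOT how Paoding is implemented; all names here are hypothetical.
class DictionarySegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    DictionarySegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    List<String> segment(String text) {
        List<String> tokens = new ArrayList<String>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Try the longest candidate first and shrink until a dictionary word matches.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            // If nothing matched, end == pos + 1 and the single character is emitted as-is.
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(Arrays.asList("中文", "分词", "词典"));
        // Prints [中文, 分词, 用, 词典]
        System.out.println(new DictionarySegmenter(dict).segment("中文分词用词典"));
    }
}
```

Adding more dictionaries in this sketch just means adding more words to the set. A real segmenter such as Paoding handles ambiguity and unknown words with considerably more care.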
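2010-01-20 Paoding upgrade notes for Lucene 3.0
(The code has been committed to svn; the download package will follow a little later.)
The main purpose of this upgrade is to support Lucene 3.0. The specific changes are:
(1) Lucene 3.0 is supported. For Lucene versions below 3.0, build from the code at http://paoding.googlecode.com/svn/branches/paoding-for-lucene-2.4/ instead.
(2) The code is compiled with Java 5.0 and no longer supports Java 1.4. New features will be developed on Java 5 from now on.
(3) The calling interface of PaodingAnalyzer is unchanged, but its use has to follow the Lucene 3.0 API. A segmentation example: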
// Create the analyzer instance
Analyzer analyzer = new PaodingAnalyzer(properties);
// Obtain the token stream
TokenStream stream = analyzer.tokenStream("", reader);
// Reset to the start of the stream
stream.reset();
// Add the attribute helpers
TermAttribute termAtt = (TermAttribute) stream.addAttribute(TermAttribute.class);
OffsetAttribute offAtt = (OffsetAttribute) stream.addAttribute(OffsetAttribute.class);
// Loop over all tokens, printing each term with its start and end offsets
while (stream.incrementToken()) {
    System.out.println(termAtt.term() + " " + offAtt.startOffset() + " " + offAtt.endOffset());
}
For detailed usage, see the sample code in the net.paoding.analysis.analyzer.estimate and net.paoding.analysis.examples packages.
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

/**
 * Segments the input text and returns its keywords, most frequent first.
 *
 * @param param the text to segment
 */
public List<String> getname(String param) throws IOException {
    // Segment the text with Paoding (the "Paoding carves the ox" segmenter)
    Analyzer analyzer = new PaodingAnalyzer();
    List<String> keys = new ArrayList<String>();
    TokenStream ts = null;

    try {
        Reader r = new StringReader(param);
        ts = analyzer.tokenStream("TestField", r);
        // Reset to the start of the stream, as in the upgrade notes
        ts.reset();
        // addAttribute returns the existing attribute or registers a new one;
        // getAttribute would throw if the stream does not carry the attribute
        TermAttribute termAtt = (TermAttribute) ts.addAttribute(TermAttribute.class);
        TypeAttribute typeAtt = (TypeAttribute) ts.addAttribute(TypeAttribute.class);
        String key = null;
        while (ts.incrementToken()) {
            if ("word".equals(typeAtt.type())) {
                key = termAtt.term();
                // Keep only terms of at least two characters
                if (key.length() >= 2) {
                    keys.add(key);
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (ts != null) {
            ts.close();
        }
    }

    Map<String, Integer> keyMap = new HashMap<String, Integer>();
    Integer count = null;
    // Count how many times each term occurs
    for (String key : keys) {
        keyMap.put(key, (count = keyMap.get(key)) == null ? 1 : count + 1);
    }
    List<Map.Entry<String, Integer>> keyList = new ArrayList<Map.Entry<String, Integer>>(keyMap.entrySet());
    // Sort by occurrence count, highest first
    Collections.sort(keyList, new Comparator<Map.Entry<String, Integer>>() {
        public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
            // compareTo avoids the overflow risk of subtracting the two counts
            return o2.getValue().compareTo(o1.getValue());
        }
    });
    // Extract the keywords in ranked order
    List<String> list = new ArrayList<String>();
    for (Map.Entry<String, Integer> entry : keyList) {
        // Read the term with getKey() rather than splitting toString() on "=",
        // which would truncate any term that itself contains "="
        list.add(entry.getKey());
        // Prints the entry as term=count
        System.out.println("id:" + entry);
    }
    return list;
}
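
The counting and ranking half of the method above does not depend on Lucene, so it can be tried out on its own. The following is a minimal sketch under that assumption. The class name `KeywordRanker` and the `minLength` parameter are hypothetical, and a `LinkedHashMap` is used here (instead of the `HashMap` in the original) so that terms with equal counts keep their first-seen order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative helper, not part of Paoding or Lucene.
class KeywordRanker {

    // Returns the distinct tokens of at least minLength characters, most frequent first.
    static List<String> rank(List<String> tokens, int minLength) {
        // LinkedHashMap preserves insertion order, so ties are resolved by first appearance
        // once the stable sort below has run.
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String token : tokens) {
            if (token.length() < minLength) {
                continue;
            }
            Integer count = counts.get(token);
            counts.put(token, count == null ? 1 : count + 1);
        }

        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        // Collections.sort is stable, so equal counts keep their relative order.
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });

        List<String> ranked = new ArrayList<String>();
        for (Map.Entry<String, Integer> entry : entries) {
            ranked.add(entry.getKey());
        }
        return ranked;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("分词", "词典", "分词", "的", "词典", "分词", "中文");
        // Prints [分词, 词典, 中文]
        System.out.println(rank(tokens, 2));
    }
}
```

In the original method the token list would come from the Paoding token stream. Here it is passed in directly so the ranking logic can be checked without a dictionary installation.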