Some Notes on Using Lucene
(I am using Lucene 2.1.0; some of these observations may not be entirely correct.)
1. Multi-threaded indexing, sharing a single IndexWriter object
This approach turned out to be quite slow, mainly because of the following code inside IndexWriter:

Java code:
public void addDocument(Document doc, Analyzer analyzer) throws IOException {
    SegmentInfo newSegmentInfo = buildSingleDocSegment(doc, analyzer);
    synchronized (this) {
        ramSegmentInfos.addElement(newSegmentInfo); // this line is the bottleneck
        maybeFlushRamSegments();
    }
}
ramSegmentInfos is a SegmentInfos object, which extends Vector, and Vector.addElement is synchronized. That is probably the main reason for the slowdown.
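For context, here is a minimal sketch (my own illustration, not the original benchmark code) of what approach 1 looks like: several threads feeding documents into one shared IndexWriter. The index path, thread count, and field contents are made-up values.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SharedWriterIndexer {
    public static void main(String[] args) throws Exception {
        // one IndexWriter shared by every indexing thread
        final IndexWriter writer = new IndexWriter("/tmp/shared-index", new StandardAnalyzer(), true);

        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(new Runnable() {
                public void run() {
                    try {
                        for (int i = 0; i < 1000; i++) {
                            Document doc = new Document();
                            doc.add(new Field("content", "some text " + i,
                                    Field.Store.YES, Field.Index.TOKENIZED));
                            // every call funnels through the synchronized block quoted above
                            writer.addDocument(doc);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            threads[t].start();
        }
        for (int t = 0; t < threads.length; t++) {
            threads[t].join();
        }
        writer.close();
    }
}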
2. Multi-threaded indexing: write to a RAMDirectory first, then flush to an FSDirectory in one batch
Intended behavior: documents are first added to a RAMDirectory; once 1000 Documents have accumulated, they are written out to the FSDirectory.
When run with multiple threads, this throws java.lang.NullPointerException in large numbers.
My multi-threaded indexing class (IndexWriterServer; this object is initialized only once, at server startup):

Java code:
public class IndexWriterServer {

    private static IndexWriter indexWriter = null;

    //private String indexDir; // index directory

    private static CJKAnalyzer analyzer = null;

    private static RAMDirectory ramDir = new RAMDirectory();

    private static IndexWriter ramWriter = null;

    private static int diskFactor = 0; // how many Documents are currently held in memory

    private static long ramToDistTime = 0; // how long flushing from RAM to disk takes

    private int initValue = 1000; // how many Documents to buffer in memory before writing to disk

    private static IndexItem[] indexItems = null;

    public IndexWriterServer(String indexDir) {
        initIndexWriter(indexDir);
    }

    public void initIndexWriter(String indexDir) {
        boolean create = false; // whether to create a new index

        analyzer = new CJKAnalyzer();

        Directory directory = this.getDirectory(indexDir);
        // check whether the directory already holds an index
        if (!IndexReader.indexExists(indexDir)) {
            create = true;
        }

        indexWriter = getIndexWriter(directory, create);

        try {
            ramWriter = new IndexWriter(ramDir, analyzer, true);
        } catch (Exception e) {
            logger.info(e);
        }

        indexItems = new IndexItem[initValue + 2];
    }

    /**
     * Index a single item.
     */
    public boolean generatorItemIndex(IndexItem item, Current __current) throws DatabaseError, RuntimeError {
        boolean isSuccess = true; // whether indexing succeeded

        try {
            Document doc = getItemDocument(item);

            ramWriter.addDocument(doc); // key line: this is where the exception is thrown

            indexItems[diskFactor] = item; // kept for data mining
            diskFactor++;
            if ((diskFactor % initValue) == 0) {
                ramToDisk(ramDir, ramWriter, indexWriter);
                //ramWriter = new IndexWriter(ramDir, analyzer, true);
                diskFactor = 0;

                // data mining
                isSuccess = MiningData();
            }

            doc = null;

            logger.info("generator index item link:" + item.itemLink + " success");
        } catch (Exception e) {
            logger.info(e);
            e.printStackTrace();

            logger.info("generator index item link:" + item.itemLink + " failure");
            isSuccess = false;
        } finally {
            item = null;
        }

        return isSuccess;
    }

    public void ramToDisk(RAMDirectory ramDir, IndexWriter ramWriter, IndexWriter writer) {
        try {
            ramWriter.close(); // key line: this ends up setting fileMap to null
            ramWriter = new IndexWriter(ramDir, analyzer, true); // rebuild the ramWriter because its fileMap is null, but this doesn't seem to help much
            Directory ramDirArray[] = new Directory[1];
            ramDirArray[0] = ramDir;
            mergeDirs(writer, ramDirArray);
        } catch (Exception e) {
            logger.info(e);
        }
    }

    /**
     * Merge the in-memory index into the on-disk index.
     * @param writer
     * @param ramDirArray
     */
    public void mergeDirs(IndexWriter writer, Directory[] ramDirArray) {
        try {
            writer.addIndexes(ramDirArray);
            //optimize();
        } catch (IOException e) {
            logger.info(e);
        }
    }

}
The main cause appears to be the following: when ramWriter.close() is called, RAMDirectory's close() method in Lucene 2.1

Java code:
public final void close() {
    fileMap = null;
}
sets fileMap to null. When multiple threads then keep executing ramWriter.addDocument(doc), they eventually reach this RAMDirectory method:

Java code:
public IndexOutput createOutput(String name) {
    RAMFile file = new RAMFile(this);
    synchronized (this) {
        RAMFile existing = (RAMFile) fileMap.get(name); // fileMap is null here, hence the NullPointerException
        if (existing != null) {
            sizeInBytes -= existing.sizeInBytes;
            existing.directory = null;
        }
        fileMap.put(name, file);
    }
    return new RAMOutputStream(file);
}
Note: a web search suggests this is a known Lucene bug (http://www.opensubscriber.com/message/java-user@lucene.apache.org/6227647.html), but no solution seems to have been posted.
3. Multi-threaded indexing with one IndexWriter per thread, each IndexWriter bound to its own FSDirectory, and each FSDirectory bound to a unique local disk directory. A separate monitor thread watches these indexing threads: when an indexing thread finishes, it posts its directory to the monitor thread's global queue with queue.add(directory). Once queue.size() > 20, the monitor thread merges those 20 index directories into the real index directory with indexWriter.addIndexes(dirs), and then deletes the directories that have already been merged. (A sketch of this pattern follows the error message below.)
But this approach also has a few bugs:
a. The merge thread is slower than the indexing threads, so the number of pending directories keeps growing.
b. Errors like the following are frequently reported:
2007-06-08 10:49:18 INFO [Thread-2] (IndexWriter.java:1070) - java.io.FileNotFoundException: /home/spider/luceneserver/merge/item_d28686afe01f365c5669e1f19a2492c8/_1.cfs (No such file or directory)
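Here is a minimal sketch of that monitor-thread arrangement, under the assumption that the pieces look roughly as described; the class name MergeMonitor, the submit() helper, and the threshold handling are my own, since the original classes were not posted.

import java.util.LinkedList;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class MergeMonitor extends Thread {
    // global queue of finished per-thread index directories
    private static final LinkedList queue = new LinkedList();

    // called by an indexing thread once its own directory is complete
    public static void submit(Directory dir) {
        synchronized (queue) {
            queue.add(dir);
            queue.notifyAll();
        }
    }

    private final IndexWriter mainWriter; // writer on the real index directory

    public MergeMonitor(IndexWriter mainWriter) {
        this.mainWriter = mainWriter;
    }

    public void run() {
        while (true) {
            Directory[] dirs;
            synchronized (queue) {
                // wait until more than 20 finished directories are queued
                while (queue.size() <= 20) {
                    try {
                        queue.wait();
                    } catch (InterruptedException e) {
                        return;
                    }
                }
                dirs = (Directory[]) queue.toArray(new Directory[queue.size()]);
                queue.clear();
            }
            try {
                mainWriter.addIndexes(dirs); // merge the small indexes into the main index
                // the merged directories could now be deleted from disk
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

If the indexing threads produce directories faster than addIndexes can merge them, the queue keeps growing, which is exactly problem (a) above.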
4. Single-threaded indexing: after tuning a few parameters, it is also quite fast (roughly 6-30 ms per document). For ordinary needs a single thread seems sufficient. The parameters are as follows (see the sketch after this list for how they are applied):

    private int mergeFactor = 100;       // how many segments accumulate on disk before an automatic merge
    private int maxMergeDocs = 1000;     // how many documents in memory before flushing to disk
    private int minMergeDocs = 1000;     // already removed in Lucene 2.0
    private int maxFieldLength = 2000;   // maximum document length to index
    private int maxBufferedDocs = 10000; // do not use this setting, or automatic merging stops happening
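For reference, a minimal sketch of how these knobs are applied to a Lucene 2.1 IndexWriter; the values simply mirror the list above, and the index path and CJKAnalyzer are assumptions carried over from the earlier example:

import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TunedWriterFactory {
    public static IndexWriter open(String indexDir, boolean create) throws Exception {
        IndexWriter writer = new IndexWriter(indexDir, new CJKAnalyzer(), create);
        writer.setMergeFactor(100);      // how many segments accumulate before an automatic merge
        writer.setMaxMergeDocs(1000);
        writer.setMaxFieldLength(2000);  // cap on tokens indexed per document
        writer.setMaxBufferedDocs(1000); // the post warns that 10000 stops auto-merging; ~1000 is the usual advice
        return writer;
    }
}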
The conclusion: Lucene's multi-threaded indexing has some issues; unless there is a special requirement, single-threaded indexing is usually fast enough.
If a single thread cannot keep up with your needs, you can run several application instances, each bound to its own FSDirectory, and have searches query those index directories over RMI.
Key code on the RMI server side:

Java code:
private void initRMI() {
    // security configuration first
    if (System.getSecurityManager() == null) {
        System.setSecurityManager(new RMISecurityManager());
    }
    // registration
    startRMIRegistry(serverUrl);

    SearcherWork searcherWork = new SearcherWork("//" + serverUrl + "/" + bindName, directory);

    searcherWork.run();
}

public class SearcherWork {
    // Logger
    private static Logger logger = Logger.getLogger(SearcherWork.class);
    private String serverUrl = null;
    private Directory directory = null;

    public SearcherWork() {
    }

    public SearcherWork(String serverUrl, Directory directory) {
        this.serverUrl = serverUrl;
        this.directory = directory;
    }

    public void run() {
        try {
            Searchable searcher = new IndexSearcher(directory);
            SearchService service = new SearchService(searcher);
            Naming.rebind(serverUrl, service);
            logger.info("RMI Server bind " + serverUrl + " success");
        } catch (Exception e) {
            logger.info(e);
            System.out.println(e);
        }
    }
}

public class SearchService extends RemoteSearchable implements Searchable {

    public SearchService(Searchable local) throws RemoteException {
        super(local);
    }
}
Key code on the client side:

Java code:
RemoteLuceneConnector rlc = new RemoteLuceneConnector();
RemoteSearchable[] rs = rlc.getRemoteSearchers();
MultiSearcher multi = new MultiSearcher(rs);
Hits hits = multi.search(new TermQuery(new Term("content", "中国")));
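For what it's worth, RemoteSearchable is Lucene's own RMI wrapper around a local Searchable, and MultiSearcher simply merges the hits returned by all the Searchables it is given, so each server instance only has to expose its own FSDirectory. RemoteLuceneConnector is presumably the author's own helper that looks up the registered remote searchers; its implementation is not shown.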
Comment 1, sg552, 2007-06-11: What is the OP trying to say? Is Lucene good or bad? Or is this just spreading Lucene knowledge?
Comment 2, rtdb, 2007-06-11: Support for the OP; this kind of hands-on experience is valuable.
Comment 3, restart, 2007-06-11: OP, just your first point, "multi-threaded indexing, sharing a single IndexWriter object", is already seriously wrong!
Comment 4, ttitfly, 2007-06-11: These are only my own test results; please point out anything that is wrong. restart, where exactly is the mistake?
Comment 5, yfmine, 2007-06-12: Quote:
private int maxMergeDocs = 1000; // how many documents in memory before flushing to disk
Determines the largest number of documents ever merged by addDocument().
Quote:
private int minMergeDocs = 1000; // already removed in Lucene 2.0
private int maxBufferedDocs = 10000; // do not use this setting, or automatic merging stops happening
minMergeDocs has not been removed; it was just renamed to setMaxBufferedDocs.
Quote:
ramWriter = new IndexWriter(ramDir, analyzer, true); // rebuild the ramWriter because its fileMap is null, but this doesn't seem to help much
OP, there is a bug in your code here: this ramWriter is only a method parameter...
In practice I use multiple threads, each with its own IndexWriter indexing into a different directory. IndexWriter is designed to be thread-safe and has its own synchronization, so having many threads share one IndexWriter shouldn't bring much of a speedup. That is fairly similar to the OP's point 3, except that I usually don't merge the indexes (the index directories are split by database table). Indexing speed depends heavily on I/O: either writing to disk is slow or reading from the data source is slow. With maxBufferedDocs tuned, writing the index files shouldn't be the bottleneck. But raising it to 10000 probably needs more memory than the server has; around 1000 is typical.
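To make that last remark concrete: in the ramToDisk method shown earlier, the freshly created IndexWriter is assigned only to the method parameter, so the static ramWriter field that generatorItemIndex keeps calling still points at the closed writer. A minimal sketch of one way to address just that issue (my illustration, not code from the thread; concurrent flushes would still need their own synchronization):

// Replacement for IndexWriterServer.ramToDisk: return the fresh writer so the
// caller can swap its reference. Error handling is trimmed for brevity.
public IndexWriter ramToDisk(RAMDirectory ramDir, IndexWriter oldRamWriter,
                             IndexWriter diskWriter) throws IOException {
    oldRamWriter.close();                               // flush and close the RAM-based writer
    diskWriter.addIndexes(new Directory[] { ramDir });  // merge the RAM index into the disk index
    return new IndexWriter(ramDir, analyzer, true);     // fresh writer for the next batch
}

// Caller side, inside generatorItemIndex:
//     ramWriter = ramToDisk(ramDir, ramWriter, indexWriter);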