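Lucene Basics
A Lucene index rests on four fundamental concepts: index, document, field, and term.
An index is a sequence of documents.
     A document is a sequence of fields.
     A field is a sequence of terms.
     A term is simply a string.
The same string appearing in two different fields is treated as two different terms. A term is therefore represented by a pair of strings: the first is the name of the field, and the second is the text within that field. Since terms are this important, let us get to know Term first.
Getting to Know Term
The best way is to look at how it is represented in the source code.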
public final class Term implements Comparable, java.io.Serializable {
  String field;
  String text;

  public Term(String fld, String txt) { this(fld, txt, true); }

  public final String field() { return field; }

  public final String text() { return text; }

  // overrides equals()
  public final boolean equals(Object o) { /* body omitted */ }

  // overrides hashCode()
  public final int hashCode() { return field.hashCode() + text.hashCode(); }

  public int compareTo(Object other) { return compareTo((Term) other); }

  public final int compareTo(Term other) { /* body omitted */ }

  final void set(String fld, String txt) { /* body omitted */ }

  public final String toString() { return field + ":" + text; }

  private void readObject(java.io.ObjectInputStream in) { /* body omitted */ }
}
From the code we can see that a Term is essentially a pair <fieldName, text>.
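To make the pair idea concrete, here is a minimal standalone sketch. SimpleTerm is a hypothetical class written for illustration, not Lucene's Term; it only assumes that terms compare by field name first and by text second, and that the same text in two fields yields two distinct terms.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a term: a (field, text) pair. Not Lucene's class. */
final class SimpleTerm implements Comparable<SimpleTerm> {
    final String field;
    final String text;

    SimpleTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    // Two terms are equal only if both the field name and the text match.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SimpleTerm)) return false;
        SimpleTerm other = (SimpleTerm) o;
        return field.equals(other.field) && text.equals(other.text);
    }

    @Override
    public int hashCode() {
        return field.hashCode() + text.hashCode();
    }

    // Order by field name first, then by text within the same field.
    @Override
    public int compareTo(SimpleTerm other) {
        int byField = field.compareTo(other.field);
        return byField != 0 ? byField : text.compareTo(other.text);
    }

    @Override
    public String toString() {
        return field + ":" + text;
    }

    public static void main(String[] args) {
        SimpleTerm[] terms = {
            new SimpleTerm("title", "lucene"),
            new SimpleTerm("body", "lucene"),
            new SimpleTerm("body", "index")
        };
        Arrays.sort(terms);
        // prints [body:index, body:lucene, title:lucene]
        System.out.println(Arrays.toString(terms));
    }
}
```

Note that "lucene" in the title field and "lucene" in the body field sort apart and never compare equal.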
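Inverted Index
To make term-based search more efficient, the index stores statistics about terms. Lucene's index belongs to the family of inverted indexes, because for a given term it can list the documents that contain it. This is the inverse of the natural relationship, in which a document lists its terms.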
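The inversion can be sketched in a few lines. TinyInvertedIndex below is a hypothetical toy, not how Lucene stores anything on disk: it assumes whitespace tokenization and keeps, for each term, the ascending list of document numbers that contain it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index: maps each term to the documents that contain it. */
final class TinyInvertedIndex {
    // term text -> ascending list of document numbers containing that term
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDoc = 0;

    /** Adds a document, splitting on whitespace, and returns its document number. */
    int addDocument(String text) {
        int doc = nextDoc++;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) continue;
            List<Integer> docs = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Record the document once even if the term occurs several times in it.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != doc) {
                docs.add(doc);
            }
        }
        return doc;
    }

    /** Lists the documents containing the term (empty if the term is unknown). */
    List<Integer> documentsContaining(String term) {
        return postings.getOrDefault(term, new ArrayList<>());
    }
}
```

A forward structure would answer "which terms are in document 0?"; this one answers "which documents contain index?", which is the question a search has to answer.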
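Types of Fields
In Lucene, the text of a field may be stored in the index literally, in a non-inverted manner. Fields that are inverted are called indexed. A field may be both stored and indexed. The text of a field may be tokenized into many terms to be indexed, or it may be indexed as a single term. Most fields are tokenized, but sometimes it is useful to index certain identifier fields as a single term.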
The Classes of the index Package
CompoundFileReader
       Provides methods for reading .cfs files.
CompoundFileWriter
       Builds .cfs files. Starting with Lucene 1.4, the various files described below, such as .tii and .tis, are merged into a single .cfs file.
       Its structure is as follows (X^N denotes N repetitions of X):
Compound (.cfs) --> FileCount, <DataOffset, FileName>^FileCount, FileData^FileCount
FileCount --> VInt
DataOffset --> Long
FileName --> String
FileData --> raw file data
DocumentWriter
     Builds the .frq, .prx, and .f files.
1. FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount
TermFreqs --> <TermFreq>^DocFreq
TermFreq --> DocDelta, Freq?
SkipData --> <SkipDatum>^(DocFreq/SkipInterval)
SkipDatum --> DocSkip, FreqSkip, ProxSkip
DocDelta, Freq, DocSkip, FreqSkip, ProxSkip --> VInt
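Almost every number in these files is a VInt. As a hedged illustration, the sketch below assumes the usual Lucene definition of a VInt: seven data bits per byte, low-order group first, with the high bit of a byte set when more bytes follow. VIntCodec is a hypothetical helper, not a Lucene class.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical VInt encoder/decoder for non-negative ints. */
final class VIntCodec {
    /** Encodes a non-negative int, low-order 7-bit group first. */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value); // last byte, high bit clear
        return out.toByteArray();
    }

    /** Decodes the single VInt that starts at the beginning of the array. */
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break; // high bit clear: this was the last byte
            shift += 7;
        }
        return value;
    }
}
```

Small values, which dominate when deltas are stored instead of absolute numbers, take a single byte: 127 fits in one byte, while 128 needs two.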
2. The .prx file contains the lists of positions that each term occurs at within documents.
ProxFile (.prx) --> <TermPositions>^TermCount
TermPositions --> <Positions>^DocFreq
Positions --> <PositionDelta>^Freq
PositionDelta --> VInt
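The grammar above does not spell out what a PositionDelta is. The sketch below assumes the common reading: each delta is the difference between the current position and the previous one in the same document, with the first delta measured from zero. PositionDeltas is a hypothetical helper used only for illustration.

```java
/** Hypothetical helper that converts between absolute positions and deltas. */
final class PositionDeltas {
    /** Turns ascending positions into deltas; the first delta is taken from zero. */
    static int[] toDeltas(int[] positions) {
        int[] deltas = new int[positions.length];
        int previous = 0;
        for (int i = 0; i < positions.length; i++) {
            deltas[i] = positions[i] - previous;
            previous = positions[i];
        }
        return deltas;
    }

    /** Rebuilds absolute positions by keeping a running sum of the deltas. */
    static int[] toPositions(int[] deltas) {
        int[] positions = new int[deltas.length];
        int current = 0;
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i];
            positions[i] = current;
        }
        return positions;
    }
}
```

Under this assumption, positions 4, 17, 20 are stored as 4, 13, 3, which keeps the numbers small enough to fit in one VInt byte each.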
3. There is a norm file for each indexed field, with one byte for each document. The .f[0-9]* file contains, for each document, a byte that encodes a value that is multiplied into the score for hits on that field:
Norms (.f[0-9]*) --> <Byte>^SegSize
Each byte encodes a floating point value. Bits 0-2 contain the 3-bit mantissa, and bits 3-7 contain the 5-bit exponent.
These are converted to an IEEE single float value as follows:
1.    If the byte is zero, use a zero float.
2.    Otherwise, set the sign bit of the float to zero;
3.    add 48 to the exponent and use this as the float's exponent;
4.    map the mantissa to the high-order 3 bits of the float's mantissa; and
5.    set the low-order 21 bits of the float's mantissa to zero.
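The steps above can be sketched as follows. NormDecoder is a hypothetical class; the shift amounts follow my recollection of how Lucene 1.4's Similarity decodes a norm byte (mantissa shifted so that the low-order 21 bits stay zero, exponent placed directly above it) and should be treated as an assumption to check against the actual source.

```java
/** Hypothetical decoder for the one-byte norm encoding described above. */
final class NormDecoder {
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;                   // step 1: zero byte means zero float
        }
        int mantissa = b & 7;              // bits 0-2
        int exponent = (b >> 3) & 31;      // bits 3-7
        // Steps 2-5: sign bit stays zero, 48 is added to the exponent,
        // and the low-order 21 bits of the result stay zero.
        // The exact shifts (24 and 21) are an assumption, see the text above.
        int bits = ((exponent + 48) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(decode((byte) 124));
    }
}
```

With these shifts the byte 124 (exponent 15, mantissa 4) decodes to 1.0, and larger byte values decode to larger floats, so the encoding preserves order.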
FieldInfo
      Holds part of the information about a field, as a 4-tuple <name, isIndexed, num, storeTermVector>.
FieldInfos
     This class describes whether the fields of a Document are indexed. Each segment has its own FieldInfos file. Objects of this class are said to be thread-safe for multiple readers, but at any given moment only one thread may add documents, and no other reader or writer may enter. The class maintains two containers, an ArrayList and a HashMap, neither of which is synchronized, so it is unclear to me in what sense it is thread-safe.
The write function shows that the .fnm file is laid out as follows:
     FieldInfos (.fnm) --> FieldsCount, <FieldName, FieldBits>^FieldsCount
                        FieldsCount --> VInt
                        FieldName --> String
                        FieldBits --> Byte
FieldsReader
    Reads the .fdx and .fdt files.
FieldsWriter
     This class creates two files, .fdx and .fdt.
     FieldIndex (.fdx): for each document, this contains a pointer (actually an integer) to its field data.
<FieldValuesPosition>^SegSize
FieldValuesPosition --> UInt64
             The field pointer of the n-th document is therefore found at offset n*8.
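Because every entry has the same width, finding a document's pointer is plain arithmetic. The sketch below works on an in-memory byte array standing in for the .fdx contents; FieldIndexLookup is a hypothetical helper, and big-endian byte order is an assumption.

```java
import java.nio.ByteBuffer;

/** Hypothetical lookup into an in-memory copy of a fixed-width pointer file. */
final class FieldIndexLookup {
    /** Returns the 8-byte pointer stored for document n (documents count from 0). */
    static long fieldPointer(byte[] fdx, int n) {
        // Each entry is 8 bytes wide, so entry n starts at byte n * 8.
        // ByteBuffer reads big-endian by default, which is assumed here.
        return ByteBuffer.wrap(fdx).getLong(n * 8);
    }
}
```

No search is needed: one multiplication and one read give the position of the document's stored fields.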
    FieldData (.fdt) contains the stored field information of each document. Its content is as follows:
<DocFieldData>^SegSize
DocFieldData --> FieldCount, <FieldNum, Bits, Value>^FieldCount
FieldCount --> VInt
FieldNum --> VInt
Lucene <= 1.4:
Bits --> Byte
Value --> String
Only the low-order bit of Bits is used. It is one for tokenized fields, and zero for non-tokenized fields.
FilterIndexReader
     Extends IndexReader and provides concrete methods.
IndexReader
     An abstract class. It reads a Directory whose index has already been built and can return various kinds of information, such as Term, TermPositions, and so on.
IndexWriter
    IndexWriter creates and maintains an index.
   The third argument of the IndexWriter constructor determines whether a new index is created or an existing index is opened so that new documents can be added.
   New documents are added with addDocument(). Once all documents have been added, call close().
   If no more documents will be added to an index and query performance should be optimized, call optimize() before close().
    Structure of the deletable file:
    A file named "deletable" contains the names of files that are no longer used by the index, but which could not be deleted. This is only used on Win32, where a file may not be deleted while it is still open. On other platforms the file contains only null bytes.
Deletable --> DeletableCount, <DeletableName>^DeletableCount
DeletableCount --> UInt32
DeletableName --> String
MultipleTermPositions
Used specifically by PhrasePrefixQuery in the search package.
MultiReader
Extends IndexReader. It reads multiple indexes and appends their contents.
SegmentInfo
     Information about a segment, as a triple <segmentName, docCount, dir>.
SegmentInfos
     Extends Vector. It is simply a vector whose every element is a SegmentInfo. It is used to build the segments file, of which each index has exactly one. The class provides read and write methods.
     Its content is as follows:
     Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize>^SegCount
Format, NameCounter, SegCount, SegSize --> UInt32
Version --> UInt64
SegName --> String
Format is -1 in Lucene 1.4.
Version counts how often the index has been changed by adding or deleting documents.
NameCounter is used to generate names for new segment files.
SegName is the name of the segment, and is used as the file name prefix for all of the files that compose the segment's index.
SegSize is the number of documents contained in the segment index.
SegmentMergeInfo
    Records information used when merging segments.
SegmentMergeQueue
    Extends PriorityQueue (sorted in ascending order).
SegmentMerger
This class merges several segments into one. Instances are created by IndexWriter.addIndexes().
If compoundFile is true, a .cfs file is created after the merge, and nearly all of the other files are combined into it.
SegmentReader
Extends IndexReader and provides many methods for reading an index.
SegmentTermDocs
Implements TermDocs.
SegmentTermEnum
  Extends TermEnum.
SegmentTermPositions
   Implements TermPositions.
SegmentTermVector
 Implements TermFreqVector.
Term
     A Term is a <fieldName, text> pair. Fields come in several kinds, but each contains at least <fieldName, fieldValue>, which is how the two are related. A Term is the unit of search. The text of a Term need not be a word; it can also be something like a date, an email address, or a URL.
TermDocs
     TermDocs is an interface for enumerating the <document, frequency> pairs of a term.
     In each <document, frequency> pair, the document part names a document that contains the term; documents are identified by their document number. The frequency part gives the number of times the term occurs in that document. The pairs are ordered by document number.
TermEnum
     An abstract class for enumerating terms. Term enumerations are ordered by Term.compareTo(): each term in the enumeration is greater than all terms that precede it.
TermFreqVector
     This interface gives access to the term vector of a field of a document.
TermInfo
     This class mainly stores information about a term. It can be described as a 5-tuple <Term, docFreq, freqPointer, proxPointer, skipOffset>.
TermInfosReader
     Not yet read in detail; to be revisited after SegmentTermEnum.
TermInfosWriter
     This class builds the .tis and .tii files, which together make up the term dictionary.
1. The term infos, or .tis, file.
TermInfoFile (.tis) --> TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos
TIVersion --> UInt32
TermCount --> UInt64
IndexInterval --> UInt32
SkipInterval --> UInt32
TermInfos --> <TermInfo>^TermCount
TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
Term --> <PrefixLength, Suffix, FieldNum>
Suffix --> String
PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt
This file is sorted by Term. Terms are ordered first lexicographically by the term's field name, and within that lexicographically by the term's text.
TIVersion names the version of the format of this file and is -2 in Lucene 1.4.
Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
FieldNum determines the term's field, whose name is stored in the .fnm file.
DocFreq is the count of documents which contain the term.
FreqDelta determines the position of this term's TermFreqs within the .frq file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).
ProxDelta determines the position of this term's TermPositions within the .prx file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).
SkipDelta determines the position of this term's SkipData within the .frq file. In particular, it is the number of bytes after TermFreqs that the SkipData starts. In other words, it is the length of the TermFreq data.
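The prefix sharing described for the .tis file can be sketched as follows. PrefixSharing is a hypothetical helper that works on plain strings; it illustrates only the PrefixLength and Suffix idea, not the on-disk byte layout.

```java
/** Hypothetical helper showing how a term's text is split against the previous term. */
final class PrefixSharing {
    /** Number of leading characters the current term shares with the previous one. */
    static int sharedPrefixLength(String previous, String current) {
        int limit = Math.min(previous.length(), current.length());
        int i = 0;
        while (i < limit && previous.charAt(i) == current.charAt(i)) {
            i++;
        }
        return i;
    }

    /** The part of the current term that still has to be written out. */
    static String suffix(String previous, String current) {
        return current.substring(sharedPrefixLength(previous, current));
    }

    /** Rebuilds a term's text from the previous term, a prefix length, and a suffix. */
    static String rebuild(String previous, int prefixLength, String suffix) {
        return previous.substring(0, prefixLength) + suffix;
    }
}
```

For the example in the text, the previous term "bone" and the current term "boy" share two characters, so only the suffix "y" is stored. Because the file is sorted, neighboring terms tend to share long prefixes.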
2. The term info index, or .tii, file.
This contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. This is designed to be read entirely into memory and used to provide random access to the .tis file.
The structure of this file is very similar to the .tis file, with the addition of one item per record, the IndexDelta.
TermInfoIndex (.tii) --> TIVersion, IndexTermCount, IndexInterval, SkipInterval, TermIndices
TIVersion --> UInt32
IndexTermCount --> UInt64
IndexInterval --> UInt32
SkipInterval --> UInt32
TermIndices --> <TermInfo, IndexDelta>^IndexTermCount
IndexDelta --> VLong
IndexDelta determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.
TODO: document skipInterval information
             IndexDelta is what the .tii file has in addition to the .tis file.
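The way a small in-memory index gives random access to a large sorted file can be sketched like this. SparseTermIndex is a hypothetical toy: a plain sorted list stands in for the .tis entries, list positions stand in for file offsets, and it assumes that sampling starts with the first entry.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy sparse index: keeps every interval-th term of a sorted term list. */
final class SparseTermIndex {
    private final List<String> allTerms;                        // stands in for the .tis entries
    private final List<String> indexTerms = new ArrayList<>();  // stands in for the .tii entries
    private final int interval;

    SparseTermIndex(List<String> sortedTerms, int interval) {
        this.allTerms = sortedTerms;
        this.interval = interval;
        for (int i = 0; i < sortedTerms.size(); i += interval) {
            indexTerms.add(sortedTerms.get(i));
        }
    }

    /** Returns the position of the term in the full list, or -1 if it is absent. */
    int find(String term) {
        // Binary search the small index for the last sampled term <= target.
        int slot = Collections.binarySearch(indexTerms, term);
        if (slot < 0) {
            slot = -slot - 2;
        }
        if (slot < 0) {
            return -1; // the target sorts before the first term
        }
        // Scan at most `interval` entries of the full list from that point.
        int start = slot * interval;
        int end = Math.min(start + interval, allTerms.size());
        for (int i = start; i < end; i++) {
            if (allTerms.get(i).equals(term)) {
                return i;
            }
        }
        return -1;
    }
}
```

A larger interval makes the in-memory index smaller but the final scan longer, which is the trade-off the IndexInterval value controls.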
TermPositions
      This interface extends TermDocs and enumerates the <document, frequency, <position>*> triples of a term. The document and frequency parts are the same as in TermDocs. The positions part lists the ordinal position of each occurrence of the term within a document. This triple is the posting-list representation of an inverted file.
TermPositionVector
      Extends TermFreqVector and adds the ability to report the positions at which terms occur.
TermVectorsReader
      Reads the three files .tvd, .tvf, and .tvx.
TermVectorsWriter
      Builds the .tvd, .tvf, and .tvx files, which together make up the term vectors.
1.    The Document Index or .tvx file.
This contains, for each document, a pointer to the document data in the Document (.tvd) file.
DocumentIndex (.tvx) --> TVXVersion<DocumentPosition>^NumDocs
TVXVersion --> Int
DocumentPosition --> UInt64
This is used to find the position of the Document in the .tvd file.
2.    The Document or .tvd file.
This contains, for each document, the number of fields, a list of the fields with term vector info and finally a list of pointers to the field information in the .tvf (Term Vector Fields) file.
Document (.tvd) --> TVDVersion<NumFields, FieldNums, FieldPositions>^NumDocs
TVDVersion --> Int
NumFields --> VInt
FieldNums --> <FieldNumDelta>^NumFields
FieldNumDelta --> VInt
FieldPositions --> <FieldPosition>^NumFields
FieldPosition --> VLong
The .tvd file is used to map out the fields that have term vectors stored and where the field information is in the .tvf file.
3.    The Field or .tvf file.
This file contains, for each field that has a term vector stored, a list of the terms and their frequencies.
Field (.tvf) --> TVFVersion<NumTerms, NumDistinct, TermFreqs>^NumFields
TVFVersion --> Int
NumTerms --> VInt
NumDistinct --> VInt -- Future Use
TermFreqs --> <TermText, TermFreq>^NumTerms
TermText --> <PrefixLength, Suffix>
PrefixLength --> VInt
Suffix --> String
TermFreq --> VInt
Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
That covers every class in the index package. Now let us revisit them by writing some code.
We close this chapter's discussion with a program.
package org.apache.lucene.index;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.store.*;
import org.apache.lucene.document.*;
import org.apache.lucene.demo.*;
import org.apache.lucene.search.*;
import java.io.*;
/** This program tries to use as many classes of the Lucene index package as possible
 * and to show them to the reader.
 * Classes used from the index package:
 * DocumentWriter (users normally work with IndexWriter)
 * FieldInfo (and FieldInfos)
 * SegmentTermDocs (implements TermDocs)
 * SegmentReader (extends IndexReader; users normally work with IndexReader)
 * SegmentMerger
 * SegmentTermEnum (extends TermEnum)
 * SegmentTermPositions (implements TermPositions)
 * SegmentTermVector (implements TermFreqVector)
 */
public class TestIndexPackage
{
 // Adds a Document to the index
 public static void indexDocument(String segment,String fileName) throws Exception
 {
    // The second argument controls whether the directory is created if it cannot be obtained
    Directory directory = FSDirectory.getDirectory("testIndexPackage",false);
    Analyzer analyzer = new SimpleAnalyzer();
    // The fourth argument is the maximum number of tokens per Field
    DocumentWriter writer = new DocumentWriter(directory,analyzer,Similarity.getDefault(),1000);
    File file = new File(fileName);
    // FileDocument wraps the file as a Document, creating three fields in it (path, modified, contents)
    Document doc = FileDocument.Document(file);
    writer.addDocument(segment,doc);
    directory.close();
 }
 // Merges several segments into one
 public static void merge(String segment1,String segment2,String segmentMerged)throws Exception
 {
    Directory directory = FSDirectory.getDirectory("testIndexPackage",false);
    SegmentReader segmentReader1=new SegmentReader(new SegmentInfo(segment1,1,directory));
    SegmentReader segmentReader2=new SegmentReader(new SegmentInfo(segment2,1,directory));
    // The third argument says whether a .cfs file is created
    SegmentMerger segmentMerger =new SegmentMerger(directory,segmentMerged,false);
    segmentMerger.add(segmentReader1);
    segmentMerger.add(segmentReader2);
    segmentMerger.merge();
    segmentMerger.closeReaders();
    directory.close();
 }
 // Prints the full contents of a segment, that is, of a sub-index of the index
 public static void printSegment(String segment) throws Exception
 {
    Directory directory =FSDirectory.getDirectory("testIndexPackage",false);
    SegmentReader segmentReader = new SegmentReader(new SegmentInfo(segment,1,directory));
    //display documents
    for(int i=0;i<segmentReader.numDocs();i++)
      System.out.println(segmentReader.document(i));
    TermEnum termEnum = segmentReader.terms(); // this is actually a SegmentTermEnum
    //display term and term positions,termDocs
    while(termEnum.next())
    {
      System.out.print(termEnum.term().toString());
      System.out.println(" DocumentFrequency=" + termEnum.docFreq());
      TermPositions termPositions= segmentReader.termPositions(termEnum.term());
      int i=0;
      while(termPositions.next())
      {