An In-Depth Look at Lucene 3.0 Index Segments

2012-10-25

A Lucene index consists of several segments. Each segment consists of several documents, each document of several fields, and each field of several terms.
Code that generates the index:

// Create two Document objects
File f1 = new File("d:/lucene/demo1.txt");
File f2 = new File("d:/lucene/demo2.txt");
Document doc1 = new Document();
doc1.add(new Field("path", f1.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc1.add(new Field("content", new FileReader(f1)));
Document doc2 = new Document();
doc2.add(new Field("path", f2.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc2.add(new Field("content", new FileReader(f2)));
// Create the index writer (indexPath is a java.io.File for the index directory, defined elsewhere)
IndexWriter writer = new IndexWriter(FSDirectory.open(indexPath),
        new StandardAnalyzer(Version.LUCENE_30), true,
        IndexWriter.MaxFieldLength.LIMITED);
// Whether to use the compound file format
writer.setUseCompoundFile(false);
writer.addDocument(doc1);
writer.addDocument(doc2);
writer.optimize();
writer.close();

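Index files generated by the test: _0.fdt, _0.fdx, _0.fnm, _0.frq, _0.nrm, _0.prx, _0.tii, _0.tis, segments.gen, segments_2
Index files generated by the test in compound format: _0.cfs, _0.cfx, segments.gen, segments_2
Whether or not the compound format is used, the two files whose names begin with segments have the same format. They record the segment-level information and are the main subject of the rest of this post.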

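1、The segments.gen file
The format of this file is very simple:
version: the format version, 4 bytes. The current value is -2
gen0: generation number 0, 8 bytes
gen1: generation number 1, 8 bytes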

Code that defines the version:
See line 61 of class org.apache.lucene.index.SegmentInfos
public static final int FORMAT_LOCKLESS = -2; // declared final, so it cannot be modified

Code that reads this file:
See lines 594 - 604 of class org.apache.lucene.index.SegmentInfos

       int version = genInput.readInt();
       if (version == FORMAT_LOCKLESS) {
         long gen0 = genInput.readLong();
         long gen1 = genInput.readLong();
         message("fallback check: " + gen0 + "; " + gen1);
         if (gen0 == gen1) {
           // The file is consistent.
           genB = gen0;
           break;
         }
       }

Code that writes this file:
See lines 849 - 856 of class org.apache.lucene.index.SegmentInfos

      IndexOutput genOutput = dir.createOutput(IndexFileNames.SEGMENTS_GEN);
      try {
        genOutput.writeInt(FORMAT_LOCKLESS);
        genOutput.writeLong(generation);
        genOutput.writeLong(generation);
      } finally {
        genOutput.close();
      }

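The hexadecimal content of the segments.gen file generated by the test falls into three parts:
1、FFFFFFFE: the version, 4 bytes; -2 in decimal
2、0000000000000002: gen0, 8 bytes; 2 in decimal
3、0000000000000002: gen1, 8 bytes; identical to gen0
So the file is 20 bytes in total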
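
To double-check the 4 + 8 + 8 byte layout, here is a minimal sketch that decodes those 20 bytes with plain java.nio. The class SegmentsGenSketch is a hypothetical helper for illustration, not a Lucene class, and it assumes the big-endian byte order visible in the hex dump above.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper (not part of Lucene): decodes the 20-byte segments.gen layout. */
final class SegmentsGenSketch {

    /** Returns {version, gen0, gen1}, read as a big-endian int followed by two longs. */
    static long[] parse(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data); // ByteBuffer is big-endian by default
        int version = in.getInt();             // 4 bytes
        long gen0 = in.getLong();              // 8 bytes
        long gen1 = in.getLong();              // 8 bytes
        return new long[] { version, gen0, gen1 };
    }

    /** The bytes from the hex dump above. */
    static byte[] sampleBytes() {
        return new byte[] {
            -1, -1, -1, -2,          // FFFFFFFE -> version
            0, 0, 0, 0, 0, 0, 0, 2,  // gen0
            0, 0, 0, 0, 0, 0, 0, 2   // gen1
        };
    }

    public static void main(String[] args) {
        long[] f = parse(sampleBytes());
        System.out.println("version=" + f[0] + " gen0=" + f[1] + " gen1=" + f[2]);
        // As in the Lucene snippet above, the file is consistent only if gen0 == gen1.
        System.out.println("consistent=" + (f[1] == f[2]));
    }
}
```

To try it on a real index, the bytes could be read with java.nio.file.Files.readAllBytes and passed to parse.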

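2、The segments_N file
The format of this file is more complex:
FORMAT: the version of the index file format. An int, 4 bytes.
version: the index version, incremented every time IndexWriter commits changes to the index files. Its initial value is the current time in milliseconds. A long, 8 bytes.
counter: used to generate the name of the next new segment. An int, 4 bytes.
infos: the number of segments. An int, 4 bytes.
info: the information for each segment:
    name: the segment name. The first byte is the number of bytes that follow (a VInt length), so the size depends on the length of the name.
    docCount: the number of documents in the segment. An int, 4 bytes.
    delGen: the generation of the .del file. A long, 8 bytes.
    docStoreOffset: if the segment shares stored fields and term vectors with other segments, this is its offset in the shared store; otherwise -1. An int, 4 bytes.
    docStoreSegment: the name of the segment whose stored fields and term vectors are shared. Written only when docStoreOffset is not -1; the size depends on the length of the name.
    docStoreIsCompoundFile: whether the shared data is stored in a *.cfx file. Written only when docStoreOffset is not -1. 1 byte.
    hasSingleNormFile: whether the norms are kept in a single file. 1 byte.
    normGen: if each field has its own norms file, this array holds the generation of each file. A 4-byte count (-1 when the array is absent) followed by 8 bytes per file.
    IsCompoundFile: whether the segment is stored as a compound file. 1 byte.
    delCount: the number of deleted documents in this segment. An int, 4 bytes.
    hasProx: 1 if at least one field in the segment has omitTf set to false, that is, term frequencies must be stored; otherwise 0. 1 byte.
    diagnostics: diagnostic information, stored as a map. The size depends on the number of entries; an empty map is written as a count of 0 and takes 4 bytes.
userData: user-supplied data, stored as a map. The size depends on the number of entries; usually the count is 0 and takes 4 bytes.
checksum: the checksum. A long, 8 bytes.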

Code that writes the segment information:
1、See lines 338 - 347 of class org.apache.lucene.index.SegmentInfos

      segnOutput.writeInt(CURRENT_FORMAT); // write FORMAT
      segnOutput.writeLong(++version); // every write changes
                                   // the index
      segnOutput.writeInt(counter); // write counter
      segnOutput.writeInt(size()); // write infos
      for (int i = 0; i < size(); i++) {
        info(i).write(segnOutput); // see item 2
      }
      segnOutput.writeStringStringMap(userData); // see item 4
      segnOutput.prepareCommit(); // writes the checksum as a long

2、See lines 540 - 564 of class org.apache.lucene.index.SegmentInfo

  void write(IndexOutput output)
    throws IOException {
    output.writeString(name); // see item 3
    output.writeInt(docCount);
    output.writeLong(delGen);
    output.writeInt(docStoreOffset);
    if (docStoreOffset != -1) {
      output.writeString(docStoreSegment);
      output.writeByte((byte) (docStoreIsCompoundFile ? 1:0));
    }
    output.writeByte((byte) (hasSingleNormFile ? 1:0));
    if (normGen == null) {
      output.writeInt(NO);
    } else {
      output.writeInt(normGen.length);
      for(int j = 0; j < normGen.length; j++) {
        output.writeLong(normGen[j]);
      }
    }
    output.writeByte(isCompoundFile);
    output.writeInt(delCount);
    output.writeByte((byte) (hasProx ? 1:0));
    output.writeStringStringMap(diagnostics); // see items 4 and 5
  }

3、See lines 103 - 107 of class org.apache.lucene.store.IndexOutput

  public void writeString(String s) throws IOException {
    UnicodeUtil.UTF16toUTF8(s, 0, s.length(), utf8Result);
    writeVInt(utf8Result.length); // write the length of the string in bytes
    writeBytes(utf8Result.result, 0, utf8Result.length); // write the UTF-8 bytes, utf8Result.length of them
  }

  public void writeVInt(int i) throws IOException {
    while ((i & ~0x7F) != 0) { // is there any data above the low 7 bits?
      writeByte((byte)((i & 0x7f) | 0x80)); // set the 8th bit to 1 to signal that more bytes follow
      i >>>= 7; // unsigned (logical) right shift by 7 bits
    }
    writeByte((byte)i);
  }
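
To see what writeVInt produces for concrete values, here is a small standalone sketch of the same 7-bits-per-byte scheme together with a matching decoder. VIntSketch and its method names are made up for illustration; only the encoding loop is taken from the Lucene snippet above.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical illustration of the VInt scheme used by writeVInt (not a Lucene class). */
final class VIntSketch {

    /** Encodes value low-order 7 bits first; the high bit of each byte means "more follows". */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = value;
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80); // low 7 bits plus continuation flag
            i >>>= 7;                     // unsigned shift, so negative values terminate too
        }
        out.write(i);                     // last byte, continuation flag clear
        return out.toByteArray();
    }

    /** Decodes a single VInt starting at the beginning of bytes. */
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // continuation flag clear: this was the last byte
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        for (int v : new int[] { 2, 127, 128, 300 }) {
            StringBuilder hex = new StringBuilder();
            for (byte b : encode(v)) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(v + " -> " + hex.toString().trim());
        }
    }
}
```

This is why a short segment name such as _0 is preceded by a single length byte: any length below 128 fits in one VInt byte.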

4、See lines 214 - 223 of class org.apache.lucene.store.IndexOutput

    // body of writeStringStringMap; map is its parameter
    if (map == null) {
      writeInt(0);
    } else {
      writeInt(map.size());
      for(final Map.Entry<String, String> entry: map.entrySet()) {
        writeString(entry.getKey()); // see item 3
        writeString(entry.getValue());
      }
    }
  }
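
The reverse direction follows directly from the write side. Below is a rough sketch of a reader for this map layout (a 4-byte count, then a VInt-prefixed UTF-8 key and value per entry). StringMapSketch and its methods are hypothetical names of mine, and the sample entry ("os" mapped to "Linux") is invented for the demo.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical reader for the string map layout described above (not a Lucene class). */
final class StringMapSketch {

    /** Reads one VInt: 7 payload bits per byte, high bit set while more bytes follow. */
    static int readVInt(ByteBuffer in) {
        byte b = in.get();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Reads a VInt byte length followed by that many UTF-8 bytes. */
    static String readString(ByteBuffer in) {
        byte[] utf8 = new byte[readVInt(in)];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** Reads a 4-byte entry count followed by count key/value string pairs. */
    static Map<String, String> readStringStringMap(ByteBuffer in) {
        int count = in.getInt();
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            String key = readString(in);
            String value = readString(in);
            map.put(key, value);
        }
        return map;
    }

    public static void main(String[] args) {
        byte[] sample = {
            0, 0, 0, 1,                   // one entry
            2, 'o', 's',                  // key "os"
            5, 'L', 'i', 'n', 'u', 'x'    // value "Linux" (invented sample value)
        };
        System.out.println(readStringStringMap(ByteBuffer.wrap(sample)));
    }
}
```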

5、See lines 4159 - 4170 of class org.apache.lucene.index.IndexWriter

    Map<String,String> diagnostics = new HashMap<String,String>();
    diagnostics.put("source", source);
    diagnostics.put("lucene.version", Constants.LUCENE_VERSION); // take a look at the Constants class: most of its values come from Java system properties
    diagnostics.put("os", Constants.OS_NAME+"");
    diagnostics.put("os.arch", Constants.OS_ARCH+"");
    diagnostics.put("os.version", Constants.OS_VERSION+"");
    diagnostics.put("java.version", Constants.JAVA_VERSION+"");
    diagnostics.put("java.vendor", Constants.JAVA_VENDOR+"");
    if (details != null) {
      diagnostics.putAll(details);
    }
    info.setDiagnostics(diagnostics);

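The hexadecimal content of the segments_2 file generated by the test is as follows.
First comes the information common to all segments:
1、FFFFFFF7: the index file format version; -9 in decimal
2、00000130A66E4ECA: the index version; the conversion below shows that it is the time at which the index was created
// drop the leading zeros and declare the value as a long
long i = 0x130A66E4ECAL;
// convert to a date; prints 2011-6-19 13:45:04 in the author's locale and time zone (UTC+8)
System.out.println(new Date(i).toLocaleString());
3、00000001: the counter used to name the next new segment; so far there is only one segment, named _0
4、00000001: the number of segments in the index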
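
As a cross-check of these four fields, the following sketch decodes the first 20 bytes of the dump and prints the version as a UTC timestamp. SegmentsHeaderSketch is a hypothetical class for illustration; java.time.Instant is used instead of the deprecated Date.toLocaleString so that the result does not depend on the local time zone.

```java
import java.nio.ByteBuffer;
import java.time.Instant;

/** Hypothetical decoder for the common header of segments_N (not a Lucene class). */
final class SegmentsHeaderSketch {
    final int format;
    final long version;
    final int counter;
    final int segmentCount;

    private SegmentsHeaderSketch(int format, long version, int counter, int segmentCount) {
        this.format = format;
        this.version = version;
        this.counter = counter;
        this.segmentCount = segmentCount;
    }

    /** Reads FORMAT (int), version (long), counter (int) and infos (int), all big-endian. */
    static SegmentsHeaderSketch parse(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        return new SegmentsHeaderSketch(in.getInt(), in.getLong(), in.getInt(), in.getInt());
    }

    /** The first 20 bytes of the segments_2 dump above. */
    static byte[] sampleBytes() {
        return new byte[] {
            -1, -1, -1, (byte) 0xF7,                                      // FORMAT
            0x00, 0x00, 0x01, 0x30, (byte) 0xA6, 0x6E, 0x4E, (byte) 0xCA, // version
            0, 0, 0, 1,                                                   // counter
            0, 0, 0, 1                                                    // number of segments
        };
    }

    public static void main(String[] args) {
        SegmentsHeaderSketch h = parse(sampleBytes());
        System.out.println("format=" + h.format);
        System.out.println("version=" + h.version + " (" + Instant.ofEpochMilli(h.version) + ")");
        System.out.println("counter=" + h.counter + " segments=" + h.segmentCount);
    }
}
```

The UTC time 05:45:04 corresponds to 13:45:04 at UTC+8.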

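Next comes the detailed information for each segment:
5、025F30: the segment name; 02 is the number of bytes that follow and 5F30 is the UTF-8 encoding of _0
6、00000002: the number of documents in the segment; the test uses 2 documents
7、FFFFFFFFFFFFFFFF: the generation of the .del file (a long, 8 bytes); -1 by default when no documents have been deleted
8、00000000: the offset into the shared stored fields and term vectors. It is 0 rather than -1, so the next two items are present
9、025F30: the name of the segment holding the shared store; 02 is the number of bytes that follow and 5F30 is the UTF-8 encoding of _0
10、00: whether the shared store above is a compound file
11、01: whether the norms are kept in a single file
12、FFFFFFFF: the number of per-field norms files. normGen is null in this test, so -1 is written
13、FF: whether this segment is stored as a compound file; -1 means no
14、00000000: the number of deleted documents; nothing was deleted in the test, so 0
15、01: term frequencies must be stored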
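
The per-segment fields can be decoded in the same order in which SegmentInfo.write emits them. The sketch below does that for the bytes listed above, stopping before the diagnostics map. SegmentInfoSketch is a hypothetical class of mine; it takes delGen as 8 bytes, as the format description states, and assumes every string is shorter than 128 bytes so that its VInt length is a single byte.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical reader mirroring the field order of SegmentInfo.write (not a Lucene class). */
final class SegmentInfoSketch {
    String name;
    int docCount;
    long delGen;
    int docStoreOffset;
    String docStoreSegment;          // only present when docStoreOffset != -1
    boolean docStoreIsCompoundFile;  // only present when docStoreOffset != -1
    boolean hasSingleNormFile;
    long[] normGen;                  // stays null when the stored count is -1
    byte isCompoundFile;
    int delCount;
    boolean hasProx;

    /** Simplification: assumes the VInt length fits in one byte (string under 128 bytes). */
    static String readShortString(ByteBuffer in) {
        byte[] utf8 = new byte[in.get() & 0x7F];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    static SegmentInfoSketch parse(ByteBuffer in) {
        SegmentInfoSketch si = new SegmentInfoSketch();
        si.name = readShortString(in);
        si.docCount = in.getInt();
        si.delGen = in.getLong();
        si.docStoreOffset = in.getInt();
        if (si.docStoreOffset != -1) {
            si.docStoreSegment = readShortString(in);
            si.docStoreIsCompoundFile = in.get() == 1;
        }
        si.hasSingleNormFile = in.get() == 1;
        int numNormGen = in.getInt();
        if (numNormGen != -1) {
            si.normGen = new long[numNormGen];
            for (int j = 0; j < numNormGen; j++) {
                si.normGen[j] = in.getLong();
            }
        }
        si.isCompoundFile = in.get();
        si.delCount = in.getInt();
        si.hasProx = in.get() == 1;
        return si;
    }

    /** Items 5 to 15 of the walkthrough, in order. */
    static byte[] sampleBytes() {
        return new byte[] {
            0x02, 0x5F, 0x30,                // name "_0"
            0, 0, 0, 2,                      // docCount
            -1, -1, -1, -1, -1, -1, -1, -1,  // delGen = -1
            0, 0, 0, 0,                      // docStoreOffset
            0x02, 0x5F, 0x30,                // docStoreSegment "_0"
            0,                               // docStoreIsCompoundFile
            1,                               // hasSingleNormFile
            -1, -1, -1, -1,                  // normGen count = -1
            -1,                              // isCompoundFile = -1
            0, 0, 0, 0,                      // delCount
            1                                // hasProx
        };
    }

    public static void main(String[] args) {
        SegmentInfoSketch si = parse(ByteBuffer.wrap(sampleBytes()));
        System.out.println(si.name + " docCount=" + si.docCount + " delGen=" + si.delGen
                + " docStoreSegment=" + si.docStoreSegment + " hasProx=" + si.hasProx);
    }
}
```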

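Next come the diagnostics and the user data:
16、00000007: the number of diagnostic entries, which are stored in a Map. Here it is 7, so 7 key/value pairs follow
Viewed in UltraEdit, the last byte of this item is 2E, the trailing period of Sun Microsystems Inc.
17、00000000: the number of user data entries, which have the same structure as the diagnostics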

Finally, the checksum:
18、00000000B171E8F7: the checksum of the whole file
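
The upper four bytes of the checksum are zero, which is what a 32-bit CRC stored in a long looks like. The sketch below shows how such a trailer could be verified, under the assumption that the checksum is a standard CRC32 (java.util.zip.CRC32) over every byte that precedes the final 8 bytes. ChecksumSketch is a hypothetical helper, and the file contents in the demo are invented.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical checksum helper, assuming a CRC32 trailer stored as a big-endian long. */
final class ChecksumSketch {

    /** CRC32 of the first length bytes of data. */
    static long crc32(byte[] data, int length) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, length);
        return crc.getValue();
    }

    /** Returns body followed by its CRC32 written as an 8-byte long. */
    static byte[] appendChecksum(byte[] body) {
        ByteBuffer out = ByteBuffer.allocate(body.length + 8);
        out.put(body);
        out.putLong(crc32(body, body.length));
        return out.array();
    }

    /** Checks that the last 8 bytes of file equal the CRC32 of everything before them. */
    static boolean verify(byte[] file) {
        if (file.length < 8) {
            return false;
        }
        int bodyLength = file.length - 8;
        long stored = ByteBuffer.wrap(file, bodyLength, 8).getLong();
        return stored == crc32(file, bodyLength);
    }

    public static void main(String[] args) {
        byte[] body = "demo body".getBytes(StandardCharsets.US_ASCII); // invented content
        byte[] file = appendChecksum(body);
        System.out.println("valid=" + verify(file));
        file[0] ^= 1; // flip one bit to simulate corruption
        System.out.println("valid after corruption=" + verify(file));
    }
}
```

To test the assumption against a real index, read the whole segments_N file into a byte array and pass it to verify.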
