Bobo Source Code Notes 1

2012-09-28

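Bobo caches field data in memory, playing a role similar to Lucene's FieldCache class.

The abstract class FacetHandler<D> is the most basic class. For a BoboIndexReader it acts as a tool that fetches the field data and then stores it in memory.

We take the simplest implementation, SimpleFacetHandler, as the example to analyze. SimpleFacetHandler only supports simple categorization. For instance, in an e-commerce application a Document represents a product, and one of its fields holds the product's color: blue, red or green. SimpleFacetHandler can then be used to filter search results, for example to keep only the red products.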
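To make the filtering idea concrete, here is a minimal, library-free sketch. It does not use Bobo's API: `FacetFilterSketch`, `filterByValue` and the sample data are hypothetical names chosen only for illustration. Each document carries exactly one color, and the search hits are narrowed to those whose color is red.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Bobo's API.
class FacetFilterSketch {
    // Keep only the hits whose single facet value equals `wanted`.
    // colorByDoc[docid] holds the color of that document.
    static List<Integer> filterByValue(int[] hits, String[] colorByDoc, String wanted) {
        List<Integer> kept = new ArrayList<>();
        for (int docid : hits) {
            if (wanted.equals(colorByDoc[docid])) {
                kept.add(docid);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] colorByDoc = {"blue", "red", "green", "red", "blue"};
        int[] hits = {0, 1, 3, 4}; // docids matched by some query
        System.out.println(filterByValue(hits, colorByDoc, "red")); // prints [1, 3]
    }
}
```

Bobo avoids comparing strings per document by storing a small integer per docid instead, which is what the load() code further down builds.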

class SimpleFacetHandler extends FacetHandler<FacetDataCache> implements FacetScoreable

Its load() function reads the field data from the IndexReader and stores it in memory; this is the main work Bobo does at initialization.

public FacetDataCache load(BoboIndexReader reader) throws IOException {
  FacetDataCache dataCache = new FacetDataCache();
  dataCache.load(_indexFieldName, reader, _termListFactory);
  return dataCache;
}

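FacetDataCache and MultiValueFacetDataCache are the classes that actually fetch and store the data; load() is their main function.

1. FacetDataCache is for a field that can hold only one term per document, i.e. the attribute is single-valued. In the e-commerce example above, a product's color can only be blue or red, not several colors at once.

2. MultiValueFacetDataCache is for a field that can hold multiple terms, i.e. multiple attribute values, per document.

The load() function of the FacetDataCache class: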

public void load(String fieldName, IndexReader reader, TermListFactory<T> listFactory) throws IOException
{
  String field = fieldName.intern();
  int maxDoc = reader.maxDoc();

  // order is effectively one big array indexed by docid; each value is the index of
  // that document's (single) term, and that index can be used to look up the term's
  // data in freqList, valArray, etc.
  BigSegmentedArray order = this.orderArray;
  if (order == null) // we want to reuse the memory
  {
    order = newInstance(_termCountSize, maxDoc);
  }
  else
  {
    order.ensureCapacity(maxDoc); // no need to fill to 0, we are resetting the data anyway
  }
  this.orderArray = order;

  IntArrayList minIDList = new IntArrayList();
  IntArrayList maxIDList = new IntArrayList();
  IntArrayList freqList = new IntArrayList();

  int length = maxDoc + 1;
  TermValueList<T> list = listFactory == null ? (TermValueList<T>) new TermStringList()
      : listFactory.createTermList();
  TermDocs termDocs = reader.termDocs();
  TermEnum termEnum = reader.terms(new Term(field, ""));
  int t = 0; // current term number: the index stored in order above; each term has its own index

  // index 0 is reserved for "no value"
  list.add(null);
  minIDList.add(-1);
  maxIDList.add(-1);
  freqList.add(0);
  // int df = 0;
  t++;
  try
  {
    do
    {
      Term term = termEnum.term();
      if (term == null || term.field() != field)
        break;

      if (t > order.maxValue())
      {
        throw new IOException("maximum number of value cannot exceed: "
            + order.maxValue());
      }
      // store term text
      // we expect that there is at most one term per document
      if (t >= length)
        throw new RuntimeException("there are more terms than "
            + "documents in field \"" + field
            + "\", but it's impossible to sort on " + "tokenized fields");
      list.add(term.text()); // store the term's text in valArray
      termDocs.seek(termEnum);
      // freqList.add(termEnum.docFreq()); // doesn't take into account deldocs
      int minID = -1;
      int maxID = -1;
      int df = 0;
      if (termDocs.next())
      {
        df++;
        int docid = termDocs.doc();
        order.add(docid, t);
        minID = docid;
        while (termDocs.next())
        {
          df++;
          docid = termDocs.doc();
          order.add(docid, t); // record which term this doc maps to
        }
        maxID = docid;
      }
      freqList.add(df);     // the term's document frequency
      minIDList.add(minID); // smallest docid in the term's posting list
      maxIDList.add(maxID); // largest docid in the term's posting list
      t++;
    } while (termEnum.next());
  }
  finally
  {
    termDocs.close();
    termEnum.close();
  }
  list.seal();

  this.valArray = list;
  this.freqs = freqList.toIntArray();
  this.minIDs = minIDList.toIntArray();
  this.maxIDs = maxIDList.toIntArray();
}
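The data layout produced by load() can be sketched without Lucene. The example below is a simplified model, not Bobo's code: `OrderArraySketch` is a hypothetical class, the posting lists are supplied as a `TreeMap` (standing in for the sorted TermEnum/TermDocs iteration), and plain Java arrays replace BigSegmentedArray and the term list. As in the original, index 0 is reserved for "no value".

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Simplified model of the single-valued cache; not Bobo's real classes.
class OrderArraySketch {
    int[] order;     // docid -> term index (0 means the doc has no value)
    String[] values; // term index -> term text
    int[] freqs;     // term index -> number of docs containing the term
    int[] minIDs;    // term index -> smallest docid in the posting list
    int[] maxIDs;    // term index -> largest docid in the posting list

    // postings maps term text to its docids in ascending order.
    void load(TreeMap<String, int[]> postings, int maxDoc) {
        int n = postings.size() + 1; // +1 for the reserved slot 0
        order = new int[maxDoc];
        values = new String[n];
        freqs = new int[n];
        minIDs = new int[n];
        maxIDs = new int[n];
        minIDs[0] = -1;
        maxIDs[0] = -1;
        int t = 1; // current term number
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            int[] docs = e.getValue();
            values[t] = e.getKey();
            freqs[t] = docs.length;
            minIDs[t] = docs.length > 0 ? docs[0] : -1;
            maxIDs[t] = docs.length > 0 ? docs[docs.length - 1] : -1;
            for (int docid : docs) {
                order[docid] = t; // each doc points at its single term
            }
            t++;
        }
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("red", new int[]{1, 3});
        postings.put("blue", new int[]{0, 4});
        postings.put("green", new int[]{2});
        OrderArraySketch cache = new OrderArraySketch();
        cache.load(postings, 6); // doc 5 has no color
        System.out.println(Arrays.toString(cache.order)); // prints [1, 3, 2, 3, 1, 0]
        System.out.println(cache.values[cache.order[3]]); // prints red
    }
}
```

Once the cache is built, looking up a document's value costs one array read, and filtering by a value reduces to comparing small integers.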

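MultiValueFacetDataCache's load() function is essentially the same as FacetDataCache's. The only difference is where the docid-to-index mapping is stored: instead of one large array it uses a BufferedLoader.

Next, let's look at BufferedLoader. Its member _info is a BigIntArray, which is simply a large array, and its member _buffer is a BigIntBuffer, a dynamically allocated large array.

BigIntBuffer also behaves like one large array, except that it is allocated on demand. The array is split into pages of 1024 ints, and a pointer is mapped to a page number plus an offset within that page.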

public class BigIntBuffer
{
  private static final int PAGESIZE = 1024;
  private static final int MASK = 0x3FF;
  private static final int SHIFT = 10;

  private ArrayList<int[]> _buffer;
  private int _allocSize;
  private int _mark;

  public BigIntBuffer()
  {
    _buffer = new ArrayList<int[]>();
    _allocSize = 0;
    _mark = 0; // pointer to the next free position in the logical array
  }

  public int alloc(int size)
  {
    if(size > PAGESIZE) throw new IllegalArgumentException("size too big");

    if((_mark + size) > _allocSize)
    {
      int[] page = new int[PAGESIZE]; // allocate one page of 1024 ints at a time
      _buffer.add(page);
      _allocSize += PAGESIZE;
    }
    int ptr = _mark;
    _mark += size; // advance _mark by size positions
    return ptr;
  }

  public void reset()
  {
    _mark = 0;
  }

  public void set(int ptr, int val)
  {
    int[] page = _buffer.get(ptr >> SHIFT);
    page[ptr & MASK] = val;
  }

  public int get(int ptr)
  {
    int[] page = _buffer.get(ptr >> SHIFT);
    return page[ptr & MASK];
  }
}
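The pointer arithmetic in set() and get() is worth spelling out. The following sketch uses the same SHIFT and MASK constants as above; `PageMathSketch`, `pageOf` and `offsetOf` are hypothetical names introduced only for this illustration. For non-negative pointers, `ptr >> 10` equals `ptr / 1024` and `ptr & 0x3FF` equals `ptr % 1024`.

```java
// Illustration of how a logical pointer splits into (page, offset).
class PageMathSketch {
    static final int SHIFT = 10;   // 2^10 = 1024 ints per page
    static final int MASK = 0x3FF; // low 10 bits = offset inside the page

    static int pageOf(int ptr) {
        return ptr >> SHIFT;
    }

    static int offsetOf(int ptr) {
        return ptr & MASK;
    }

    public static void main(String[] args) {
        int[] ptrs = {0, 1023, 1024, 2053};
        for (int ptr : ptrs) {
            System.out.println(ptr + " -> page " + pageOf(ptr) + ", offset " + offsetOf(ptr));
        }
        // 2053 = 2 * 1024 + 5, so the last line printed is "2053 -> page 2, offset 5"
    }
}
```

Because the page size is a power of two, the shift and mask replace a division and a modulo, and growing the buffer only appends a page instead of copying the existing data.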

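The add(int docid, int val) function of the BufferedLoader class:

_info is a BigIntArray, i.e. a large array, allocated with twice maxdoc entries, so each document owns two slots. When a document has at most 2 values (terms) in the field, _buffer is not needed and the values are stored directly in _info. With more than 2 values, _buffer has to be used.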

// _info is a BigIntArray, i.e. one large array
// _buffer is a BigIntBuffer, a dynamically allocated large array
public final boolean add(int id, int val)
{
  int ptr = _info.get(id << 1);
  if(ptr == EOD)
  {
    // first insert: the first term of document id
    _info.add(id << 1, val);
    return true;
  }

  int cnt = _info.get((id << 1) + 1);
  if(cnt == EOD)
  {
    // second insert: the second term of document id
    _info.add((id << 1) + 1, val);
    return true;
  }

  if(ptr >= 0)
  {
    // this document already holds 2 terms inline, so switch to _buffer
    int firstVal = ptr;
    int secondVal = cnt;

    ptr = _buffer.alloc(SEGSIZE); // SEGSIZE is 8 here
    _buffer.set(ptr++, EOD);      // first slot of the segment holds EOD; a non-EOD value
                                  // there means an earlier segment for this id exists
    _buffer.set(ptr++, firstVal);
    _buffer.set(ptr++, secondVal);
    _buffer.set(ptr++, val);
    cnt = 3;
  }
  else
  {
    ptr = (- ptr);
    if (cnt >= _maxItems) return false; // exceeded the limit

    if((ptr % SEGSIZE) == 0) // the previously allocated segment is full
    {
      int oldPtr = ptr;                // remember the pointer into the previous segment
      ptr = _buffer.alloc(SEGSIZE);    // allocate another segment of SEGSIZE slots
      _buffer.set(ptr++, (- oldPtr));  // write the negated old pointer into the new segment's first slot
    }
    _buffer.set(ptr++, val); // store the new value
    cnt++;
  }

  _info.add(id << 1, (- ptr));
  _info.add((id << 1) + 1, cnt);

  return true;
}

// read the terms (attribute values) of document id into buf
private final int readToBuf(int id, int[] buf)
{
  int ptr = _info.get(id << 1);
  int cnt = _info.get((id << 1) + 1);
  int i;

  if(ptr >= 0)
  {
    // ptr >= 0: the document has at most 2 values, stored inline in _info
    // read in-line data
    i = 0;
    buf[i++] = ptr;
    if(cnt >= 0) buf[i++] = cnt;
    return i;
  }

  // read from segments
  // ptr < 0: the terms have to be read from _buffer
  i = cnt;
  while(ptr != EOD)
  {
    ptr = (- ptr) - 1; // position of the last term stored in this segment
    int val;
    while((val = _buffer.get(ptr--)) >= 0) // stop at the segment's first slot, which holds a negative value
    {
      buf[--i] = val;
    }
    ptr = val; // if this is not EOD, it leads back to the previous segment for this id
  }
  if(i > 0)
  {
    throw new RuntimeException("error reading buffered data back");
  }

  return cnt;
}
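To see the two-slot and segment-chain scheme work end to end, here is a compact, self-contained model. It is a sketch under stated assumptions rather than Bobo's implementation: `TwoSlotLoaderSketch` is a hypothetical class, plain `int[]` arrays replace BigIntArray and BigIntBuffer, `EOD` is assumed to be `Integer.MIN_VALUE`, the `_maxItems` limit is left out, and values must be non-negative (as term indexes are).

```java
import java.util.Arrays;

// Simplified model of the inline-then-segments storage; not Bobo's real classes.
class TwoSlotLoaderSketch {
    static final int EOD = Integer.MIN_VALUE; // assumed sentinel for "empty"
    static final int SEGSIZE = 8;             // 1 link slot + 7 value slots

    final int[] info;          // two slots per document
    int[] buffer = new int[64];
    int mark = 0;              // next free position in buffer

    TwoSlotLoaderSketch(int maxDoc) {
        info = new int[maxDoc * 2];
        Arrays.fill(info, EOD);
    }

    private int alloc() {
        if (mark + SEGSIZE > buffer.length) {
            buffer = Arrays.copyOf(buffer, buffer.length * 2);
        }
        int ptr = mark;
        mark += SEGSIZE;
        return ptr;
    }

    void add(int id, int val) {
        int ptr = info[id << 1];
        int cnt = info[(id << 1) + 1];
        if (ptr == EOD) { info[id << 1] = val; return; }       // first value, inline
        if (cnt == EOD) { info[(id << 1) + 1] = val; return; } // second value, inline
        if (ptr >= 0) {
            // third value: move the two inline values into a fresh segment
            int first = ptr, second = cnt;
            ptr = alloc();
            buffer[ptr++] = EOD; // no earlier segment
            buffer[ptr++] = first;
            buffer[ptr++] = second;
            buffer[ptr++] = val;
            cnt = 3;
        } else {
            ptr = -ptr;
            if (ptr % SEGSIZE == 0) { // current segment is full
                int oldPtr = ptr;
                ptr = alloc();
                buffer[ptr++] = -oldPtr; // link back to the full segment
            }
            buffer[ptr++] = val;
            cnt++;
        }
        info[id << 1] = -ptr;      // negated next-write position
        info[(id << 1) + 1] = cnt; // number of values stored
    }

    int[] read(int id) {
        int ptr = info[id << 1];
        int cnt = info[(id << 1) + 1];
        if (ptr == EOD) return new int[0];
        if (ptr >= 0) return cnt == EOD ? new int[]{ptr} : new int[]{ptr, cnt};
        int[] out = new int[cnt];
        int i = cnt;
        while (ptr != EOD) {
            ptr = -ptr - 1; // last written slot of this segment
            int val;
            while ((val = buffer[ptr--]) >= 0) {
                out[--i] = val; // fill from the back, newest first
            }
            ptr = val; // EOD, or the negated link to the previous segment
        }
        return out;
    }

    public static void main(String[] args) {
        TwoSlotLoaderSketch loader = new TwoSlotLoaderSketch(2);
        for (int v = 1; v <= 10; v++) loader.add(0, v);
        loader.add(1, 100);
        loader.add(1, 101);
        System.out.println(Arrays.toString(loader.read(0))); // prints [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        System.out.println(Arrays.toString(loader.read(1))); // prints [100, 101]
    }
}
```

In this run document 0 spills into two chained segments (7 values in the first, 3 in the second), while document 1 never touches the buffer. Documents with one or two values therefore cost no extra memory beyond their two _info slots.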
