paoding分词工具的字典如何构建
分词工具不管如何变,其肯定会包含字典管理模块(当然,这是针对按字符串匹配分词),就算是基于语义分词也得有语义字典,基于统计需要词频字典等等。
在调研了mmseg4j,ictclas4j(imdict和ictclas4j属于一类,只不过其为了效率去掉了ictclas4j的命名实体识别部分),IKAnalyzer,paoding 等分词器后,发现他们的字典管理基本大同小异。一下以paoding为例,解释下分词工具的字典管理模块。
先说下paoding的字典数据结构。下面代码是字典接口,BinaryDictionary 和 HashBinaryDictionary 都实现该接口。其采用面向接口编程思想,其好处就是主要逻辑不用修改,易扩展。
public interface Dictionary {public int size();public Word get(int index);public Hit search(CharSequence input, int offset, int count);}public class HashBinaryDictionary implements Dictionary {/** * 字典中所有词语,用于方便{@link #get(int)}方法 */private Word[] ascWords;/** * 首字符到分词典的映射 */private Map/* <Object, SubDictionaryWrap> */subs;private final int hashIndex;private final int start;private final int end;private final int count;} public class BinaryDictionary implements Dictionary {// -------------------------------------------------private Word[] ascWords;private final int start;private final int end;private final int count;}Arrays.sort(array);这个方法对数组中的字典按升序排序,这样方便后续的二叉查找。
public synchronized Dictionary getVocabularyDictionary()这个方法,加载字典文件中的词条到数组中,然后通过HashBinaryDictionary的构造方法开始构建字典的Hash数据结构,以key=词汇的首字母为分词典的索引键值(这个是BinaryDictionary的最终方式,HashBinaryDictionary如果词条的个数大于一定值,就按照词条的第二个字建立BinaryDictionary结构,这个过程是一个递归的过程)。看一下paoding中的这段代码:
public HashBinaryDictionary(Word[] ascWords, int hashIndex, int start,int end, int initialCapacity, float loadFactor) {this.ascWords = ascWords;this.start = start;this.end = end;this.count = end - start;this.hashIndex = hashIndex;subs = new HashMap(initialCapacity , loadFactor);createSubDictionaries();}protected void createSubDictionaries() {if (this.start >= ascWords.length) {return;}// 定位相同头字符词语的开头和结束位置以确认分字典int beginIndex = this.start;int endIndex = this.start + 1;char beginHashChar = getChar(ascWords[start], hashIndex);char endHashChar;for (; endIndex < this.end; endIndex++) {endHashChar = getChar(ascWords[endIndex], hashIndex);if (endHashChar != beginHashChar) {addSubDictionary(beginHashChar, beginIndex, endIndex);beginIndex = endIndex;beginHashChar = endHashChar;}}addSubDictionary(beginHashChar, beginIndex, this.end);}protected void addSubDictionary(char hashChar, int beginIndex, int endIndex) {Dictionary subDic = createSubDictionary(ascWords, beginIndex, endIndex);SubDictionaryWrap subDicWrap = new SubDictionaryWrap(hashChar,subDic, beginIndex);subs.put(keyOf(hashChar), subDicWrap);}protected Dictionary createSubDictionary(Word[] ascWords, int beginIndex,int endIndex) {int count = endIndex - beginIndex;if (count < 16) {return new BinaryDictionary(ascWords, beginIndex, endIndex);} else {return new HashBinaryDictionary(ascWords, hashIndex + 1,beginIndex, endIndex, getCapacity(count), 0.75f);}}