A Method for Counting Commonly Used Characters

2013-11-09


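Recently I needed a set of commonly used characters for work. I looked at several lists online but was not happy with any of them, so I decided to build my own. The hardest part of compiling such a list is collecting the data. Our backend is already crawling the web all the time, so I figured I could reuse that, which takes care of the hardest problem.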

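Without further ado, the overall process is:

1. Use a crawler to fetch a large volume of web pages and save the text to a file in UTF-16.

2. Count how many times each character appears in the file and sort by count. This produces two files: a log file listing each character with its number of occurrences, and the common-character library itself.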
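The two steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the program used for the real run (that one is the C++ listing further down); the names `count_characters` and `ranked` and the sample text are made up for the example.

```python
import io
from collections import Counter


def count_characters(stream, chunk_size=1 << 16):
    """Count every printable character (including the plain space) in a text stream."""
    counts = Counter()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # str.isprintable() is False for control characters such as "\n" and "\t".
        counts.update(c for c in chunk if c.isprintable())
    return counts


def ranked(counts):
    """Return (character, count) pairs, most frequent first."""
    return counts.most_common()


if __name__ == "__main__":
    # Stand-in for the crawler output: a small text saved as UTF-16 (with BOM).
    data = "常用字统计：常用字 常用\n".encode("utf-16")
    with io.TextIOWrapper(io.BytesIO(data), encoding="utf-16") as f:
        for ch, n in ranked(count_characters(f)):
            print(repr(ch), n)
```

Reading in chunks keeps memory use flat, which matters once the input grows to several gigabytes.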

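I. Crawling web pages

I used Scrapy to crawl the pages. It is an open-source Python library that is easy to use and highly extensible; although I am not very familiar with HTML and the like, I still managed to get the job done. There are introductions to it around the web, but I find the official documentation the most helpful. I ran into three main problems while crawling: how to choose the initial URLs, how to determine a page's encoding, and how to remove duplicate URLs.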

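1. Choosing the initial URLs: this depends on which area's common characters you care about. My approach was simple: search Baidu for keywords and use the result links as the initial URLs.

2. Determining the page encoding: three approaches are commonly suggested online. The first is to take the encoding from the response object, the second is to read the charset from the page's meta tag, and the third is to let the codedet library detect it automatically. I went with the third because it seemed the simplest and the most reliable.

3. Removing duplicate URLs: crawled pages often link to each other, so the same URL can turn up more than once. That skews the character counts, so duplicates have to go. Scrapy has built-in duplicate filtering, but it did not appear to be in effect for my crawl by default (as far as I can tell, requests created from the start URLs bypass the filter), and I am not sure what the reasoning is. Overriding one method of the spider was enough to enable it.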
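To make points 2 and 3 more concrete, here is a minimal standard-library sketch. It is not the Scrapy code from my crawler: `decode_page` only illustrates the first two encoding approaches (a charset passed in from the response headers, then the meta tag) with a fallback where a detection library would go, and `SeenUrls` is a plain set-based filter rather than Scrapy's own mechanism. All names here are hypothetical.

```python
import re
from urllib.parse import urldefrag

# Matches both <meta charset="gbk"> and
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def decode_page(body, header_charset=None, fallback="utf-8"):
    """Decode raw page bytes: try the header charset, then <meta>, then a fallback."""
    candidates = []
    if header_charset:
        candidates.append(header_charset)
    match = META_CHARSET.search(body[:4096])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append(fallback)
    for name in candidates:
        try:
            return body.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or wrong guess, try the next one
    return body.decode(fallback, errors="replace")


class SeenUrls:
    """Remember visited URLs; the #fragment is ignored when comparing."""

    def __init__(self):
        self._seen = set()

    def add(self, url):
        """Return True if the URL is new, False if it was already recorded."""
        key = urldefrag(url).url
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

A crawler loop would call `seen.add(url)` before scheduling a request and skip the URL when it returns False.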

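Below is the key code. The listing is the counting program for step 2 (libgencommonword.cpp), a Windows console application; the spider code is not included.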

// libgencommonword.cpp : Defines the entry point for the console application.
// Counts every UTF-16 code unit in the input file, then writes
// charset.log (code, character, count) and charsetlib.dat (code units only).
#include "stdafx.h"
#include <windows.h>  // CreateFile, ReadFile, HANDLE, DWORD
#include <tchar.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wctype.h>
#include <locale.h>
#include <algorithm>
#include <utility>
#include <vector>
using std::vector;
typedef std::pair<unsigned short, __int64> WordPair;  // (code unit, count)

// Write one "hex code,character,count" line per character.
void GeneratorLog(const vector<WordPair>& data) {
    FILE* fp = NULL;
    if (fopen_s(&fp, "charset.log", "w")) {
        printf("create log file error\n");
        return;
    }
    std::for_each(data.begin(), data.end(), [&](const WordPair& item) {
        wchar_t buffer[2] = { static_cast<wchar_t>(item.first), 0 };
        char buffer_mbc[5];
        // Characters the current locale cannot represent are logged as "null".
        if (static_cast<size_t>(-1) == wcstombs(buffer_mbc, buffer, 5))
            strcpy_s(buffer_mbc, "null");
        // Text mode already turns "\n" into "\r\n".
        fprintf(fp, "%04X,%s,%I64d\n", item.first, buffer_mbc, item.second);
    });
    fclose(fp);
}

// Write the code units back to back, most frequent first.
void GenerateLib(const vector<WordPair>& data) {
    FILE* fp = NULL;
    if (fopen_s(&fp, "charsetlib.dat", "wb")) {
        printf("create lib file error.\n");
        return;
    }
    std::for_each(data.begin(), data.end(), [&](const WordPair& item) {
        fwrite(&item.first, sizeof(item.first), 1, fp);
    });
    fclose(fp);
}

int _tmain(int argc, _TCHAR* argv[]) {
    setlocale(LC_ALL, "");
    if (argc < 2) {
        printf("Usage:this.exe filename\n");
        return -1;
    }
    HANDLE hfile = CreateFile(argv[1], GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);
    if (INVALID_HANDLE_VALUE == hfile) {
        _tprintf(_T("open file error:%s\n"), argv[1]);
        return -1;
    }
    // One counter per UTF-16 code unit, kept on the heap (512 KB).
    vector<__int64> word_count(65536, 0);
    unsigned short ch[2048];
    DWORD readed = 0;
    while (ReadFile(hfile, ch, sizeof(ch), &readed, NULL) && readed) {
        readed /= sizeof(ch[0]);
        for (DWORD i = 0; i < readed; i++) {
            // Count printable characters and the space; skip everything else.
            if (!iswgraph(ch[i]) && ch[i] != 0x20)
                continue;
            word_count[ch[i]] += 1;
        }
    }
    CloseHandle(hfile);
    vector<WordPair> count_vec;
    for (size_t i = 0; i < word_count.size(); i++) {
        if (word_count[i])
            count_vec.push_back(WordPair(static_cast<unsigned short>(i), word_count[i]));
    }
    // Sort by count, most frequent first.
    std::sort(count_vec.begin(), count_vec.end(), [](const WordPair& item1, const WordPair& item2) {
        return item1.second > item2.second;
    });
    GenerateLib(count_vec);
    GeneratorLog(count_vec);
    return 0;
}

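Running the program produces two files: the common-character library (charsetlib.dat) and the statistics (charset.log). A 6 GB input file takes a while to process, probably a few minutes.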
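For anyone who wants to consume charsetlib.dat from another tool, here is a small Python sketch of reading it back. It assumes the layout the program writes on a typical x86 Windows machine: consecutive 16-bit little-endian code units with no header, ordered from most to least frequent. The function name `load_charset_lib` is made up for the example.

```python
import struct


def load_charset_lib(data, top_n=None):
    """Turn the raw bytes of charsetlib.dat into a list of characters.

    Assumes 2-byte little-endian code units, most frequent first.
    """
    usable = len(data) - len(data) % 2  # ignore a trailing odd byte, if any
    units = struct.unpack("<%dH" % (usable // 2), data[:usable])
    chars = [chr(u) for u in units]
    return chars if top_n is None else chars[:top_n]


if __name__ == "__main__":
    # Stand-in for open("charsetlib.dat", "rb").read()
    sample = struct.pack("<3H", 0x7684, 0x4E00, 0x662F)
    print(load_charset_lib(sample, top_n=2))
```

Taking the first N entries gives the N most common characters. Characters outside the Basic Multilingual Plane would show up as two separate surrogate code units, because the counting program works per code unit.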

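I consulted quite a few articles by other people and am grateful to their authors. The following links may be useful:

Character encoding conversion site: http://bianma.51240.com/

Scrapy duplicate filtering, a very good analysis: http://blog.pluskid.org/?p=381

Scrapy official documentation: http://doc.scrapy.org/en/0.18/

UTF-16 encoding: http://www.fileformat.info/info/charset/UTF-16/list.htm, http://www.fileformat.info/info/charset/index.htm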
