Text deduplication algorithm: how should this be solved?

2012-04-16


Text deduplication algorithm
The text file url.txt is fairly large, about 5 MB:


http://images.sohu.com/
http://egou.focus.cn/
http://images.sohu.com/
http://egou.focus.cn/
http://images.sohu.com/
http://egou.focus.cn/
http://images.sohu.com/
http://egou.focus.cn/
http://images.sohu.com/
http://images.sohu.com/
http://sy.brand.sogou.com/
http://txt.go.sohu.com/
http://house.focus.cn/
http://images.sohu.com/
http://house.focus.cn/
http://images.sohu.com/
http://house.focus.cn/
http://images.sohu.com/
http://house.focus.cn/
http://images.sohu.com/
http://house.focus.cn/
http://images.sohu.com/

What is a good way to remove the duplicates?

[Solution]
Read the whole file line by line straight into a set, then write the set back out.
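[code=C++]
#include <fstream>
#include <iostream>
#include <set>
#include <string>
using namespace std;

int main()
{
    set<string> s;   // holds each distinct line once, in sorted order
    string str;
    ifstream in("url.txt");
    ofstream out("unique.txt");
    // getline returns the stream, which tests false at end of file or on error,
    // so the last line is kept even without a trailing newline
    while (getline(in, str))
    {
        s.insert(str);   // inserting a duplicate is a no-op
    }
    for (set<string>::iterator it = s.begin(); it != s.end(); ++it)
        out << *it << "\n";
    in.close();
    out.close();
}
[/code]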
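[Solution]
Using unordered_set would be a bit faster: it hashes the lines instead of keeping them sorted, so an insert is O(1) on average rather than O(log n). The trade-off is that iterating over it yields the lines in no particular order.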
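The unordered_set suggestion could look roughly like the sketch below. It assumes a C++11 compiler and one URL per line. The helper name dedupe_lines is made up for illustration and does not come from the thread. Because iterating an unordered_set gives no useful order, this version writes each line as soon as it is first seen, so the output keeps first-occurrence order rather than the sorted order of the set version.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_set>

// Hypothetical helper (not from the thread): copies each distinct line of `in`
// to `out` once, in the order the line first appears, and returns the number
// of distinct lines written. Needs C++11 for std::unordered_set.
std::size_t dedupe_lines(std::istream& in, std::ostream& out)
{
    std::unordered_set<std::string> seen;
    std::string line;
    std::size_t written = 0;
    while (std::getline(in, line))
    {
        // insert().second is true only when the line was not already in the set
        if (seen.insert(line).second)
        {
            out << line << '\n';
            ++written;
        }
    }
    return written;
}
```

To use it on the file from the question, open an ifstream on url.txt and an ofstream on unique.txt (both need the fstream header) and pass them to dedupe_lines. For a 5 MB file either container fits in memory easily, so the choice is mostly about whether you want sorted output or the original order.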
