How can I read and analyze very large log files (over 100 MB) quickly and accurately?

2011-12-23

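How can I read and analyze very large log files (over 100 MB) quickly and accurately? Waiting online for replies.
I need to build a log analysis tool. I don't have the log file format yet, but the files are said to be larger than 100 MB.
Suppose the format looks like this: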

2007-11-01 18:20:42,983 [4520] INFO GetXXX() SERVICE START
2007-11-01 18:21:42,983 [4520] WARNING some error is about to occur
2007-11-01 18:22:42,983 [4520] ERROR some error occurred
2007-11-01 18:23:59,968 [4520] INFO program finished

Suppose I need to count how many errors, how many warnings, and so on occurred on a given day.

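My current idea is to use two threads, one reading the file and one analyzing it.
They share a 10-line buffer: each round reads 10 lines and analyzes 10 lines,
and a producer-consumer pattern synchronizes the reading and the analysis.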

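The problem is that this still does not feel very efficient.
1. Can I start several threads that read the log file in blocks at the same time, for example five threads that split the file into five blocks and each read and analyze one?
If so, how should they be synchronized, and how is multithreaded block reading of a file implemented?
2. I don't know much about file I/O. Is BinaryReader or BufferedReader faster for reading the file,
and which one makes it easier to build my data structures?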

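These are the only ways I can think of for now to speed up the analysis. Does anyone have better ideas or suggestions? A ready-made example would be even better :)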

[Solution]
Use a regular expression, for example Regex.Matches with the pattern "2007-11-01".

That tells you how many errors occurred on 2007-11-01 and what they were.
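
A minimal sketch of the regex idea, assuming the line layout from the sample in the question (date, time, thread id in brackets, then an upper-case level). The class name LogLevelCounter and the method CountByLevel are made up for this illustration, and the pattern would need adjusting once the real format is known. It checks the date prefix first so the regex only runs on lines from the requested day.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LogLevelCounter
{
    // Assumed layout: 2007-11-01 18:22:42,983 [4520] ERROR message...
    private static readonly Regex LinePattern = new Regex(
        @"^(?<date>\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2},\d{3} \[\d+\] (?<level>[A-Z]+)\b",
        RegexOptions.Compiled);

    // Counts lines per level for one day, e.g. day = "2007-11-01".
    public static Dictionary<string, int> CountByLevel(IEnumerable<string> lines, string day)
    {
        var counts = new Dictionary<string, int>();
        foreach (string line in lines)
        {
            // Cheap prefix test first; skip lines from other days without running the regex.
            if (!line.StartsWith(day, StringComparison.Ordinal)) continue;

            Match m = LinePattern.Match(line);
            if (!m.Success) continue; // continuation line or unexpected format

            string level = m.Groups["level"].Value;
            int n;
            counts.TryGetValue(level, out n); // n stays 0 when the level is new
            counts[level] = n + 1;
        }
        return counts;
    }
}
```

For a file on disk, the lines could come from File.ReadLines(path), which yields one line at a time instead of loading the whole file.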
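[Solution]
When searching a large file, avoid reading from the file too often, because disk access is the most time-consuming part.
So load the file into memory in blocks and search each block there.
Also, since the timestamps are in ascending order, you can use binary search to speed things up further.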
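
One way to get the block loading described above is to let a StreamReader do the buffering. The sketch below is only an illustration: the class name BufferedLogReader, the 1 MB buffer size, and the UTF-8 encoding are assumptions, not measured or confirmed values for the real log.

```csharp
using System;
using System.IO;
using System.Text;

public static class BufferedLogReader
{
    // Opens the file for a front-to-back scan with a large buffer, so the disk is
    // read in big blocks rather than once per line. 1 MB is an arbitrary starting point.
    public static StreamReader Open(string path, int bufferSize = 1 << 20)
    {
        var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
            FileShare.ReadWrite, bufferSize, FileOptions.SequentialScan);
        // Assumption: the log is UTF-8; pass a different Encoding if it is not.
        return new StreamReader(stream, Encoding.UTF8, true, bufferSize);
    }

    // Counts the lines that contain the given token, reading one line at a time.
    public static int CountLinesContaining(TextReader reader, string token)
    {
        int count = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.IndexOf(token, StringComparison.Ordinal) >= 0) count++;
        }
        return count;
    }
}
```

Only the buffer-sized block is held in memory at any time, so this also works when the file is much larger than 100 MB.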
[Solution]
Here is some code I wrote a while ago, for reference.

C# code

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Search a very large log file
// Sample records:
//<166>Mar 31 2007 23:38:50: %PIX-6-302013: Built outbound TCP connection 731528465 for outside:62.241.53.2/443 (62.241.53.2/443) to inside:10.65.160.105/2918 (61.167.117.238/35049)
//
//<167>Mar 31 2007 23:38:50: %PIX-7-710005: UDP request discarded from 10.65.156.20/137 to inside:10.65.255.255/netbios-ns
//
string vFileName = @"C:\temp\sunday2007-04-01.log"; // file to search
DateTime vDateTime = DateTime.Parse("Apr 01 2007 01:09:25"); // timestamp to search for
byte[] vBuffer = new byte[0x1000]; // read buffer (4 KB)
int vReadLength; // number of bytes read
long vCurrPostion; // current search position
long vBeginPostion; // start of the search range
long vEndPostion; // end of the search range
FileStream vFileStream = new FileStream(vFileName, FileMode.Open, FileAccess.Read);
vBeginPostion = 0;
vEndPostion = vFileStream.Length;

// Binary search on byte offsets to find the approximate position
while (true)
{
    vCurrPostion = vBeginPostion + (vEndPostion - vBeginPostion) / 2; // recompute the search position
    vFileStream.Seek(vCurrPostion, SeekOrigin.Begin);
    vReadLength = vFileStream.Read(vBuffer, 0, vBuffer.Length);
    string vText = Encoding.ASCII.GetString(vBuffer, 0, vReadLength);
    Match vMatch = Regex.Match(vText,
        @"(\r\n)?<\d+>(?<datetime>\w+ \d+ \d+ \d+:\d+:\d+):");
    if (!vMatch.Success) break; // no timestamp found
    DateTime vTempTime = DateTime.Parse(vMatch.Result("${datetime}"));
    if (vTempTime == vDateTime)
    {
        vBeginPostion = vCurrPostion;
        vEndPostion = vCurrPostion;
    }
    else if (vDateTime > vTempTime)
    {
        vBeginPostion = vCurrPostion; // timestamp here is earlier, so search further on
    }
    else
    {
        vEndPostion = vCurrPostion; // timestamp here is later, so search further back
    }
    if (vEndPostion - vBeginPostion < 0x1000) break;
}
vCurrPostion = Math.Min(vBeginPostion, vEndPostion); // approximate position found

// Search backward (toward the start of the file)
string vTemp = string.Empty; // text carried across a chunk boundary
vBeginPostion = Math.Max(vCurrPostion - 0x1000, 0);
vEndPostion = vBeginPostion + 0x1000;
while (true)
{
    bool vLoop = false; // whether to keep looping
    vFileStream.Seek(vBeginPostion, SeekOrigin.Begin);
    vReadLength = vFileStream.Read(vBuffer, 0, vBuffer.Length);
    string vText = Encoding.ASCII.GetString(vBuffer, 0, vReadLength) + vTemp;
    MatchCollection vMatches = Regex.Matches(vText,
        @"(\r\n)?<\d+>(?<datetime>\w+ \d+ \d+ \d+:\d+:\d+):[^\r\n]+\r\n");
    if (vMatches.Count <= 0) break;
    for (int i = 0; i < vMatches.Count; i++)
    {
        DateTime vTempTime = DateTime.Parse(vMatches[i].Result("${datetime}"));
        if (vTempTime == vDateTime)
        {
            if (i == 0 && vBeginPostion > 0)
            {
                // the first record already matches, so keep searching backward
                if (vBeginPostion - 0x1000 >= 0)
                {
                    // carry the head of this text; guard against texts shorter than 180 chars
                    vTemp = vText.Substring(0, Math.Min(180, vText.Length));
                    vBeginPostion = vBeginPostion - 0x1000;
                    vLoop = true;
                }
                else
                {
                    vTemp = string.Empty;
                    vBeginPostion = 0;
                }
            }
            Console.WriteLine(vMatches[i]);
        }
    }
    if (!vLoop)
    {
        vTemp = vText.Substring(Math.Max(vText.Length - 180, 0));
        break;
    }
}

// Search forward (toward the end of the file)
while (true)
{
    bool vLoop = false; // whether to keep looping
    vFileStream.Seek(vEndPostion, SeekOrigin.Begin);
    vReadLength = vFileStream.Read(vBuffer, 0, vBuffer.Length);
    string vText = vTemp + Encoding.ASCII.GetString(vBuffer, 0, vReadLength);
    MatchCollection vMatches = Regex.Matches(vText,
        @"(\r\n)?<\d+>(?<datetime>\w+ \d+ \d+ \d+:\d+:\d+):[^\r\n]+\r\n");
    if (vMatches.Count <= 0) break;
    for (int i = 0; i < vMatches.Count; i++)
    {
        DateTime vTempTime = DateTime.Parse(vMatches[i].Result("${datetime}"));
        if (vTempTime == vDateTime)
        {
            if (i == 0 && vEndPostion < vFileStream.Length)
            {
                // the first record matches, so keep searching forward
                // carry the tail of this text so a record split across the chunk boundary can still match
                vTemp = vText.Substring(Math.Max(vText.Length - 180, 0));
                if (vEndPostion + 0x1000 <= vFileStream.Length - 1)
                {
                    vEndPostion = vEndPostion + 0x1000;
                    vLoop = true;
                }
                else
                {
                    vEndPostion = vFileStream.Length - vEndPostion;
                }
            }
            Console.WriteLine(vMatches[i]);
        }
    }
    if (!vLoop) break;
}
vFileStream.Close();
 
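[Solution]
I think you could first create an in-memory table and read all of the file's data with a TextReader. Then use string members such as Substring and Length together to split each log line into its fields and store them in memory. After that, run whatever queries you need against the parsed data in memory.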
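
The in-memory table could look roughly like the sketch below. It assumes the line layout from the sample in the question; LogEntry, LogTable, ParseLine, and Load are names invented for this illustration. Lines that do not fit the layout are skipped rather than guessed at.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public sealed class LogEntry
{
    public DateTime Timestamp;
    public int ThreadId;
    public string Level;
    public string Message;
}

public static class LogTable
{
    // Parses one line of the assumed layout:
    //   2007-11-01 18:22:42,983 [4520] ERROR message
    // Returns null when the line does not fit.
    public static LogEntry ParseLine(string line)
    {
        const int stampLength = 23; // length of "yyyy-MM-dd HH:mm:ss,fff"
        if (line == null || line.Length < stampLength + 2) return null;

        DateTime stamp;
        if (!DateTime.TryParseExact(line.Substring(0, stampLength),
                "yyyy-MM-dd HH:mm:ss,fff", CultureInfo.InvariantCulture,
                DateTimeStyles.None, out stamp))
            return null;

        // Thread id sits between '[' and ']'.
        int open = line.IndexOf('[', stampLength);
        int close = open < 0 ? -1 : line.IndexOf(']', open);
        if (close < 0) return null;

        int threadId;
        if (!int.TryParse(line.Substring(open + 1, close - open - 1), out threadId))
            return null;

        // First word after the bracket is the level; the rest is the message.
        string rest = line.Substring(close + 1).TrimStart();
        int space = rest.IndexOf(' ');
        return new LogEntry
        {
            Timestamp = stamp,
            ThreadId = threadId,
            Level = space < 0 ? rest : rest.Substring(0, space),
            Message = space < 0 ? string.Empty : rest.Substring(space + 1)
        };
    }

    // Reads every line from the reader and keeps the ones that parse.
    public static List<LogEntry> Load(TextReader reader)
    {
        var table = new List<LogEntry>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            LogEntry entry = ParseLine(line);
            if (entry != null) table.Add(entry);
        }
        return table;
    }
}
```

Once the list is built, counting errors for a day is a loop or a LINQ query over table that compares Timestamp.Date and Level. Keeping every message string in memory costs noticeably more than the file size, so this fits the 100 MB case better than much larger files.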

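[Solution]
A 100 MB file can probably be read into memory in one go, though that assumes the machine is dedicated to this program.
Also, do not try to use multiple threads for disk access. The OP can test this: unless the disk is a RAID array, reading a file with multiple threads will not be faster than with a single thread.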
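
If the analysis itself turns out to be the slow part, one arrangement consistent with this advice is a single thread that reads the file sequentially and hands batches of lines to worker tasks. The sketch below uses BlockingCollection from .NET 4; the class name PipelinedCounter, the batch size, the queue capacity, and the worker count are all assumptions for illustration, and whether it beats a plain single-threaded loop would have to be measured.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class PipelinedCounter
{
    // One producer (the calling thread) reads sequentially;
    // `workers` consumer tasks scan batches of lines for the token.
    public static int CountLinesContaining(TextReader reader, string token,
        int workers = 2, int batchSize = 4096)
    {
        int total = 0;
        // Bounded queue: the reader blocks when the workers fall 16 batches behind.
        using (var queue = new BlockingCollection<List<string>>(16))
        {
            var consumers = new Task[workers];
            for (int w = 0; w < workers; w++)
            {
                consumers[w] = Task.Factory.StartNew(() =>
                {
                    int local = 0;
                    foreach (List<string> batch in queue.GetConsumingEnumerable())
                        foreach (string line in batch)
                            if (line.IndexOf(token, StringComparison.Ordinal) >= 0) local++;
                    Interlocked.Add(ref total, local); // merge once per worker
                }, TaskCreationOptions.LongRunning);
            }

            try
            {
                var current = new List<string>(batchSize);
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    current.Add(line);
                    if (current.Count == batchSize)
                    {
                        queue.Add(current);
                        current = new List<string>(batchSize);
                    }
                }
                if (current.Count > 0) queue.Add(current); // last partial batch
            }
            finally
            {
                queue.CompleteAdding(); // lets the consumers' loops end
            }

            Task.WaitAll(consumers);
        }
        return total;
    }
}
```

Handing over thousands of lines per batch, rather than the 10 lines mentioned in the question, keeps the synchronization cost small relative to the work done on each batch.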
