Extracting HTML page content with a regex: a solution approach

2013-04-20

Extracting HTML page content with a regex
<div id='box_9_0'

                                                href="http://baike.baidu.com/view/6590.htm?fromId=756347#3" target="_blank" title="">设计目标
                                            </a> | <a onclick="reportUrl(this,'1','1');st_get(this,'w.2.10.2',1,2,2);"
                                                href="http://baike.baidu.com/view/6590.htm?fromId=756347#4" target="_blank" title="">语言结构
                                            </a></span>
                                    </p>
                                </div>
                                <div class="result_summary">
                                    <div class="url">
                                        <cite>内容3 2012-1-12</cite></div>
                                    <div class="sp">
                                        <span class="line">-</span><span class="summaryshare" id="sws_9_0"><span class="yl1"
                                            onfocus="blur();">sss</span></span><span class="line2">-</span><span class="preview"
                                                id="pws_9_0"><span class="iPre" onfocus="blur();"><span class="iPreBox"><em class="iPreArr"></em></span></span></span></div>

                                </div>
                            </div>
[Solution]


// Requires: using System; using System.IO; using System.Text; using System.Text.RegularExpressions;
// On .NET Core / .NET 5+, GB2312 is only available after registering CodePagesEncodingProvider.
string pattern = @"(?is)<div\s*id='box_9_0'[^>]*?class=""selected boxGoogleList""[^>]*?>.*?<a\s*href=""(?<href>[^""]*?)""\s*class=""tt tu""[^>]*?>(?<txt1>.*?)</a>.*?<p\s*class=""ds"">(?<txt2>.*?)</p>.*?<div\s*class=""url"">\s*<cite>(?<txt3>.*?)</cite>";
string htmlsource = File.ReadAllText(@"C:\1.txt", Encoding.GetEncoding("GB2312"));

// Match once and reuse the result; if nothing matches, every group value is an empty string.
Match m = Regex.Match(htmlsource, pattern);
Console.WriteLine(m.Groups["href"].Value);
Console.WriteLine(m.Groups["txt1"].Value);
Console.WriteLine(m.Groups["txt2"].Value);
Console.WriteLine(m.Groups["txt3"].Value);
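The fragment in the question contains several links in a row, so collecting every href/text pair with Regex.Matches may also be useful. This is only a rough sketch: LinkExtractor, AnchorPattern and Extract are made-up names, and the sample string is a trimmed version of the links visible in the question, not the full page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Matches <a ... href="..." ...>text</a>. Attributes before and after
    // href are skipped with [^>]*, which also spans line breaks.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\b[^>]*?\bhref=""(?<href>[^""]*)""[^>]*>(?<text>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns one (href, text) pair per anchor found in the input.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in AnchorPattern.Matches(html))
        {
            links.Add(new KeyValuePair<string, string>(
                m.Groups["href"].Value,
                m.Groups["text"].Value.Trim()));
        }
        return links;
    }

    public static void Main()
    {
        // Trimmed sample based on the links shown in the question (assumption:
        // the real page has the same attribute layout on every link).
        string html = @"<a href=""http://baike.baidu.com/view/6590.htm?fromId=756347#3"" target=""_blank"" title="""">设计目标
            </a> | <a onclick=""reportUrl(this,'1','1');st_get(this,'w.2.10.2',1,2,2);""
                href=""http://baike.baidu.com/view/6590.htm?fromId=756347#4"" target=""_blank"" title="""">语言结构
            </a>";

        foreach (KeyValuePair<string, string> link in Extract(html))
        {
            Console.WriteLine(link.Key + " -> " + link.Value);
        }
    }
}
```

The pattern assumes that attribute values never contain a `>` character; if the real markup is less regular than the sample, an HTML parser is likely a safer choice than a regex.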

[Solution]
No consistent pattern in the markup, no regex.
[Solution]

Quote:

From the "内容3" entry, I only need the date inside it. Thanks.


// Requires: using System; using System.IO; using System.Text; using System.Text.RegularExpressions;
// On .NET Core / .NET 5+, GB2312 is only available after registering CodePagesEncodingProvider.
// The literal 内容3\s* before the txt3 group is skipped, so txt3 captures only the date that follows it.
string pattern = @"(?is)<div\s*id='box_9_0'[^>]*?class=""selected boxGoogleList""[^>]*?>.*?<a\s*href=""(?<href>[^""]*?)""\s*class=""tt tu""[^>]*?>(?<txt1>.*?)</a>.*?<p\s*class=""ds"">(?<txt2>.*?)</p>.*?<div\s*class=""url"">\s*<cite>内容3\s*(?<txt3>.*?)</cite>";
string htmlsource = File.ReadAllText(@"C:\1.txt", Encoding.GetEncoding("GB2312"));

// Match once and reuse the result; if nothing matches, every group value is an empty string.
Match m = Regex.Match(htmlsource, pattern);
Console.WriteLine(m.Groups["href"].Value);
Console.WriteLine(m.Groups["txt1"].Value);
Console.WriteLine(m.Groups["txt2"].Value);
Console.WriteLine(m.Groups["txt3"].Value);
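If only the date is needed, a much narrower pattern aimed at the cite element alone might be enough. The sketch below assumes the date always looks like 2012-1-12 (four-digit year, then one- or two-digit month and day) and sits at the end of the cite text; CiteDateExtractor, CiteDatePattern and ExtractDate are made-up names, and the sample markup is copied from the fragment in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CiteDateExtractor
{
    // Skips any leading label text inside <cite> (for example "内容3 ")
    // and captures a trailing date written as yyyy-M-d.
    private static readonly Regex CiteDatePattern = new Regex(
        @"<cite>[^<]*?(?<date>\d{4}-\d{1,2}-\d{1,2})\s*</cite>",
        RegexOptions.IgnoreCase);

    // Returns the date text, or null when no <cite> element contains one.
    public static string ExtractDate(string html)
    {
        Match m = CiteDatePattern.Match(html);
        return m.Success ? m.Groups["date"].Value : null;
    }

    public static void Main()
    {
        // Sample taken from the result_summary block in the question.
        string html = @"<div class=""url"">
            <cite>内容3 2012-1-12</cite></div>";

        Console.WriteLine(ExtractDate(html) ?? "(no date found)");
    }
}
```

Because it does not depend on the surrounding div, class and link markup, this version keeps working if those parts of the page change, but it will also pick up the first dated cite element anywhere in the input, not just the one inside box_9_0.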
