
C#: fetched web page content comes back garbled, how do I fix it?

2012-05-30 
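I wrote the code below to read web page content, but when it parses pages that use different encodings, the text comes out garbled.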
// Fetch the HTML source
private string getHtmlInfo(string urlSelet)
{
    string strResult = "";
    if (string.IsNullOrEmpty(urlSelet))
        return strResult;
    Console.WriteLine("**********************=" + urlSelet);
    try
    {
        // Create the HttpWebRequest
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(urlSelet);
        webRequest.Method = "GET";
        webRequest.UserAgent = "Opera/9.25 (Windows NT 6.0; U; en)";
        // Dispose the response so the connection is released
        using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
        {
            //Encoding encoding = Encoding.GetEncoding("GB2312");
            // Determine the encoding of the page being fetched
            Encoding encoding = GetEncoding(webResponse);
            using (System.IO.Stream stream = webResponse.GetResponseStream())
            using (System.IO.StreamReader reader = new System.IO.StreamReader(stream, encoding))
            {
                strResult = reader.ReadToEnd();
            }
        }
    }
    catch (Exception exp)
    {
        MessageBox.Show("Error: " + exp.Message);
    }
    return strResult;
}
// Determine the encoding of the page being fetched
public Encoding GetEncoding(HttpWebResponse response)
{
    Encoding code = Encoding.Default;
    string charset = null;
    // Header names are matched case-insensitively, so one lookup is enough.
    // The indexer returns null when the header is absent.
    string ctype = response.Headers["Content-Type"];
    Console.WriteLine("ctype:" + ctype);
    if (ctype != null)
    {
        int ind = ctype.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
        if (ind != -1)
        {
            charset = ctype.Substring(ind + 8).Trim().Trim('"');
        }
    }
    Console.WriteLine("charset: " + charset);
    if (!string.IsNullOrEmpty(charset))
    {
        try
        {
            code = Encoding.GetEncoding(charset);
        }
        catch (ArgumentException)
        {
            // Unknown charset name: keep Encoding.Default
        }
    }
    return code;
}

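I found that the page encoding can sometimes be detected and sometimes cannot, so the garbled text still shows up.
Could someone please help me out?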

[Solution]
I took some time this afternoon to look into this page:
http://www.dunsh.org/forums/thread-984-1-1.html
The HTTP response the page actually returns is:
HTTP/1.1 200 OK
Proxy-Connection: close
Connection: close
Content-Length: 33906
Via: 1.1 MSSZISA02
Date: Thu, 03 Dec 2009 08:02:37 GMT
Content-type: text/html
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-Powered-By: PHP/5.2.9-1
Set-Cookie: cdb_sid=plpmwC; expires=Thu, 10-Dec-2009 08:02:37 GMT; path=/
Set-Cookie: cdb_oldtopics=D984D; expires=Thu, 03-Dec-2009 09:02:37 GMT; path=/
Set-Cookie: cdb_visitedfid=9; expires=Sat, 02-Jan-2010 08:02:37 GMT; path=/

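The response carries Content-type: text/html with no charset parameter. When no charset is specified, Microsoft's HttpWebResponse.CharacterSet returns "ISO-8859-1" by default (source: http://channel9.msdn.com/ShowPost.aspx?PostID=166867), while the page is actually encoded in UTF-8. That mismatch is what produces the garbled text.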
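To tell "the server declared ISO-8859-1" apart from "the server declared nothing", one option is to read the charset parameter from the raw Content-Type value yourself. The helper below is only a sketch: the class name ContentTypeCharset and its Parse method are made up for illustration and are not part of .NET.

```csharp
using System;

static class ContentTypeCharset
{
    // Hypothetical helper (not a framework API): returns the charset
    // parameter of a Content-Type header value, or null when the header
    // does not declare one.
    public static string Parse(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
            return null;

        // Parameters follow the media type, separated by ';'.
        foreach (string part in contentType.Split(';'))
        {
            string p = part.Trim();
            if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
            {
                // Drop the "charset=" prefix and any surrounding quotes.
                string value = p.Substring("charset=".Length).Trim().Trim('"', '\'');
                return value.Length == 0 ? null : value;
            }
        }
        return null;
    }
}
```

For the response above, ContentTypeCharset.Parse("text/html") returns null, while ContentTypeCharset.Parse("text/html; charset=GB2312") returns "GB2312". A null result means the encoding has to be found some other way, for example from the page's own meta tag.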



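Fix: check the CharacterSet property. If it reports ISO-8859-1, decode as UTF-8 by default, then use a regular expression to find the declaration in the HTML:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Once the real charset is detected, decode the page again with the correct encoding.
Roughly like this: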
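// Requires: using System; using System.IO; using System.Net;
//           using System.Text; using System.Text.RegularExpressions;
string sUrl = "http://www.dunsh.org/forums/thread-984-1-1.html";
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(sUrl);
// UserAgent takes the header value only, without the "User-Agent:" prefix.
req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)";
req.Accept = "*/*";
req.Headers.Add("Accept-Language", "zh-cn,en-us;q=0.5");

byte[] raw;
Encoding enc = Encoding.UTF8;
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
{
    try
    {
        // "ISO-8859-1" here usually means the header named no charset,
        // so treat it as unknown and start with UTF-8.
        if (!string.IsNullOrEmpty(resp.CharacterSet) && resp.CharacterSet != "ISO-8859-1")
            enc = Encoding.GetEncoding(resp.CharacterSet);
    }
    catch (ArgumentException)
    {
        // Invalid encoding name in the header: keep UTF-8.
    }
    // The response stream can be read only once, so buffer the raw bytes.
    using (Stream body = resp.GetResponseStream())
    using (MemoryStream buffer = new MemoryStream())
    {
        body.CopyTo(buffer);
        raw = buffer.ToArray();
    }
}
string sHTML;
using (StreamReader read = new StreamReader(new MemoryStream(raw), enc))
{
    sHTML = read.ReadToEnd();
}
Match charSetMatch = Regex.Match(sHTML, "charset=[\"']?(?<code>[a-zA-Z0-9_\\-]+)", RegexOptions.IgnoreCase);
string sCharSet = charSetMatch.Groups["code"].Value;
// If the page declares a different charset, decode the buffered bytes again.
if (!string.IsNullOrEmpty(sCharSet) && !sCharSet.Equals(enc.WebName, StringComparison.OrdinalIgnoreCase))
{
    try
    {
        using (StreamReader read1 = new StreamReader(new MemoryStream(raw), Encoding.GetEncoding(sCharSet)))
        {
            sHTML = read1.ReadToEnd();
        }
    }
    catch (ArgumentException)
    {
        // Unknown charset name in the page: keep the first decoding.
    }
}
Console.WriteLine(sHTML);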
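
The "detect the charset, then decode again" step can also be pulled out into a function that works on bytes already downloaded, which makes it easy to try without a network call. This is a sketch under assumptions, not the method used above: HtmlDecoder and Decode are made-up names, and instead of decoding as UTF-8 first it probes the bytes as ISO-8859-1, which only works for ASCII-compatible encodings (UTF-8, GB2312, and similar, but not UTF-16).

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class HtmlDecoder
{
    // Hypothetical helper: decodes raw page bytes, preferring the charset
    // declared inside the HTML over the supplied fallback encoding.
    public static string Decode(byte[] raw, Encoding fallback)
    {
        // ISO-8859-1 maps every byte to exactly one char, so ASCII markup
        // such as the <meta> tag survives this probe pass regardless of
        // the page's real (ASCII-compatible) encoding.
        string probe = Encoding.GetEncoding("ISO-8859-1").GetString(raw);
        Match m = Regex.Match(probe,
            "charset=[\"']?(?<code>[a-zA-Z0-9_\\-]+)",
            RegexOptions.IgnoreCase);

        Encoding enc = fallback;
        if (m.Success)
        {
            try
            {
                enc = Encoding.GetEncoding(m.Groups["code"].Value);
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the fallback.
            }
        }
        return enc.GetString(raw);
    }
}
```

A call would look like HtmlDecoder.Decode(raw, Encoding.UTF8), where raw is the buffered response body. On .NET Core and .NET 5+, legacy code pages such as GB2312 are only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); without that, GetEncoding throws and the sketch falls back to the supplied encoding.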
