
Using regex, or some other approach, to extract content from a web page

2012-08-27 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
  <style type="text/css" media="all">
  @import url("/ReviewToolWS/css/maven-base.css");
  @import url("/ReviewToolWS/css/maven-theme.css");
  @import url("/ReviewToolWS/css/site.css");
  @import url("/ReviewToolWS/css/screen.css");
  </style>
  <script type="text/javascript">
  function WrapScript(){
  var tds=document.getElementsByTagName("td");
  for(var i=0;i<tds.length;i++)
  {
  tds[i].wrap="yes";
  }
  }
  </script>
  <link rel="stylesheet" href="/ReviewToolWS/css/print.css" type="text/css" media="print" />
  </head>
  <body topmargin="0" marginheight="0" marginwidth="0" bottommargin="0" rightmargin="0">
  <form id="frm1" >
 <table cellspacing="0" cellpadding="0" style="width:100%;valign:top" border="1">
<tr>
<td colspan=4>
 
<table class="simple" style="width:1820;align:center" id="displaytag1">
<thead>
<tr>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=0&amp;projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&amp;d-7272271-o=2">Turn No<br/>轮次</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=1&amp;projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&amp;d-7272271-o=2">Reviewer<br/>评审人员</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=2&amp;projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&amp;d-7272271-o=2">Description<br/>描述</a></th>
<th class="sortable sorted order1">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=3&amp;projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&amp;d-7272271-o=1">Confirmation<br/>问题确认</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=4&amp;projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&amp;d-7272271-o=2">Author<br/>回复者</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=5&amp;projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&amp;d-7272271-o=2">Remark<br/>回复内容</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=6&amp;projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&amp;d-7272271-o=2">Rectified<br/>是否已改正</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=7&amp;projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&amp;d-7272271-o=2">Location(Page/Sec/All)<br/>位置(页/段)</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=8&amp;projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&amp;d-7272271-o=2">Defect/Query<br/>缺陷/疑问</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=9&amp;projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&amp;d-7272271-o=2">Defect Severity<br/>严重程度</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=10&amp;projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&amp;d-7272271-o=2">Question Root Object<br/>问题根源对象</a></th>


<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=11&amp;projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&amp;d-7272271-o=2">Defect Type<br/>缺陷类型</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=12&amp;projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&amp;d-7272271-o=2">Defect Trigger Factor<br/>缺陷触发因素</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=13&amp;projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&amp;d-7272271-o=2">Defect Qualifier<br/>缺陷界定</a></th></tr></thead>
<tbody>
<tr class="odd">
<td style="width:20">1</td>
<td style="width:50">s65404</td>
<td style="width:400">uwDdrDataBeginAddr 在pusch中是按照逻辑顺序放置的,不是按照commtbl里的小区指示放的</td>
<td style="width:60">接受</td>
<td style="width:50">l00140929</td>
<td style="width:300">已修改</td>
<td style="width:70">是</td>
<td style="width:80">\LBB\UL\CTRL\src\LBB_UL_CTRL_ADM_PucchMainProc.c</td>
<td style="width:70">Defect缺陷</td>
<td style="width:70">Minor一般</td>
<td style="width:100"></td>
<td style="width:80"></td>
<td style="width:80"></td>
<td style="width:55"></td></tr>
<tr class="even">
<td style="width:20">1</td>
<td style="width:50">w00170701</td>
<td style="width:400">不用的注释去掉</td>
<td style="width:60">接受</td>
<td style="width:50">l00146658</td>
<td style="width:300">o</td>
<td style="width:70">是</td>
<td style="width:80">\wuyl_work\unimath\ClearCaseView\w00170701_view4\WL_BESA_LTE_3X_CODE\LBB\UL\PUCCH\src\LBB_UL_PUCCH_SpreadSeqCalc.c</td>
<td style="width:70">Defect缺陷</td>
<td style="width:70">Minor一般</td>
<td style="width:100"></td>
<td style="width:80"></td>
<td style="width:80"></td>
<td style="width:55"></td></tr>
<tr class="odd">
<td style="width:20">1</td>
<td style="width:50">l00140929</td>
<td style="width:400">最后一个Core可能非LogicCoreId=0的Core,这里就会有问题</td>
<td style="width:60">接受</td>
<td style="width:50">l00140929</td>
<td style="width:300">已修改</td>
<td style="width:70">是</td>
<td style="width:80">\LBB\UL\CTRL\src\LBB_UL_CTRL_ADM_PucchMainProc.c</td>
<td style="width:70">Defect缺陷</td>
<td style="width:70">Major严重</td>
<td style="width:100"></td>
<td style="width:80"></td>
<td style="width:80"></td>
<td style="width:55"></td></tr>
<tr class="odd">
<td style="width:20">1</td>
<td style="width:50">w00170701</td>
<td style="width:400">考虑扩展CP是否有问题</td>
<td style="width:60">接受</td>
<td style="width:50">l00146658</td>
<td style="width:300">本轮迭代先不考虑扩展CP</td>
<td style="width:70">否</td>
<td style="width:80">\LBB\UL\PUCCH\src\LBB_UL_PUCCH_Report.c</td>


<td style="width:70">Defect缺陷</td>
<td style="width:70">Minor一般</td>
<td style="width:100"></td>
<td style="width:80"></td>
<td style="width:80"></td>
<td style="width:55"></td></tr>
<tr class="even">
<td style="width:20">1</td>
<td style="width:50">f00174951</td>
<td style="width:400">可以在前面++</td>
<td style="width:60">接受</td>
<td style="width:50">l00140929</td>
<td style="width:300">已修改</td>
<td style="width:70">是</td>
<td style="width:80">\BESE300CODE\WL_BESA_LTE_3X_CODE\LBB\UL\PUSCH\src\LBB_UL_PUSCH_Uci.c</td>
<td style="width:70">Defect缺陷</td>
<td style="width:70">Minor一般</td>
<td style="width:100"></td>
<td style="width:80"></td>
<td style="width:80"></td>
<td style="width:55"></td></tr>
<tr class="even">
<td style="width:20">1</td>
<td style="width:50">w00170701</td>
<td style="width:400">不用的注释去掉</td>
<td style="width:60">重复</td>
<td style="width:50">l00146658</td>
<td style="width:300">o</td>
<td style="width:70">否</td>
<td style="width:80">\wuyl_work\unimath\ClearCaseView\w00170701_view4\WL_BESA_LTE_3X_CODE\LBB\UL\PUCCH\src\LBB_UL_PUCCH_ParamCalc.c</td>
<td style="width:70">Defect缺陷</td>
<td style="width:70">Minor一般</td>
<td style="width:100"></td>
<td style="width:80"></td>
<td style="width:80"></td>
<td style="width:55"></td></tr>
<tr class="odd">
<td style="width:20">1</td>
<td style="width:50">w00170701</td>
<td style="width:400">同上</td>
<td style="width:60">重复</td>
<td style="width:50">l00146658</td>
<td style="width:300">o</td>
<td style="width:70">否</td>
<td style="width:80">\wuyl_work\unimath\ClearCaseView\w00170701_view4\WL_BESA_LTE_3X_CODE\LBB\UL\PUCCH\src\LBB_UL_PUCCH_ParamCalc.c</td>
<td style="width:70">Defect缺陷</td>
<td style="width:70">Minor一般</td>
<td style="width:100"></td>
<td style="width:80"></td>
<td style="width:80"></td>
<td style="width:55"></td></tr></tbody></table>
</td></tr>
<tr><td colspan=4><font size=2>Total 总共:<font color=red size=3>37</font> Records 条记录!</font>
</td></tr>
</table>
</form>
 </body>
</html>





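I fetched the HTML above from a web page. How can I turn it into plain text, one line per table row, like the example below? (Empty cells must be kept as blanks, because the values have to be stored in the database in order, each one under its matching table header.)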

1 s65404 uwDdrDataBeginAddr 在pusch中是按照逻辑顺序放置的,不是按照commtbl里的小区指示放的 接受 l00140929 已修改 是 \LBB\UL\CTRL\src\LBB_UL_CTRL_ADM_PucchMainProc.c Defect缺陷 Minor一般

[Solution]
This HTML document is well structured and fairly standard.

Just use the HTMLParser component. It lets you access the HTML much like a DOM tree, so locating and searching for nodes is easy.

All you need to do is find the <TABLE> node with id="displaytag1" first, then loop over every <tr> node beneath it.
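
As a rough illustration of that table-then-rows walk, here is a minimal sketch. It does not use HTMLParser; it swaps in plain java.util.regex (the other approach named in the thread title) so that it runs on the JDK alone. The class and method names are hypothetical. It assumes the target table contains no nested tables and that cells hold no HTML entities that need decoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regex-based stand-in for the HTMLParser approach.
class TableRowExtractor {
    // Body of the table whose id is "displaytag1" (assumes no nested table inside it).
    private static final Pattern TABLE = Pattern.compile(
            "<table[^>]*\\bid=\"displaytag1\"[^>]*>(.*?)</table>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern ROW = Pattern.compile(
            "<tr[^>]*>(.*?)</tr>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern CELL = Pattern.compile(
            "<td[^>]*>(.*?)</td>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one list of cell texts per data row; empty cells are kept as "".
    static List<List<String>> extractRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher table = TABLE.matcher(html);
        if (!table.find()) {
            return rows; // table not present
        }
        Matcher row = ROW.matcher(table.group(1));
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                // Drop any inline tags inside the cell, then trim whitespace.
                cells.add(cell.group(1).replaceAll("<[^>]+>", "").trim());
            }
            if (!cells.isEmpty()) {
                rows.add(cells); // the header row has only <th> cells and is skipped
            }
        }
        return rows;
    }
}
```

Because empty <td> elements come back as empty strings, every row keeps the same number of cells as the header, which is what lets the values line up with their columns later.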

Take the <td> cells out of each <tr> node, assemble them into an INSERT statement, and you can write the row to the database.
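
One way to assemble that INSERT statement is sketched below, assuming a parameterized statement is acceptable in place of concatenating cell text into the SQL string. The table name `review_comment` and the column names are hypothetical, since the thread never gives the real schema.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: builds and fills a parameterized INSERT for one table row.
class InsertSqlBuilder {
    // Produces e.g. "INSERT INTO t (a, b) VALUES (?, ?)" with one placeholder per column.
    static String buildInsert(String table, List<String> columns) {
        String placeholders = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table
                + " (" + String.join(", ", columns) + ") VALUES (" + placeholders + ")";
    }

    // Binds the cell texts in order; JDBC parameter indexes start at 1.
    static void bind(PreparedStatement ps, List<String> cells) throws SQLException {
        for (int i = 0; i < cells.size(); i++) {
            ps.setString(i + 1, cells.get(i));
        }
    }
}
```

For example, `buildInsert("review_comment", Arrays.asList("turn_no", "reviewer", "description"))` yields `INSERT INTO review_comment (turn_no, reviewer, description) VALUES (?, ?, ?)`. Binding the values through placeholders avoids quoting problems with cell text such as the backslash-heavy file paths in the sample rows.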


[Solution]

Discussion
Quote:
Use a loop to walk through nodeList, rather than just a single if check:

for (int i = 0; i < nodeList.size(); i++) {
    // Visit every node in the list and print its text with the markup stripped
    Node nameNode = nodeList.elementAt(i);
    String name = nameNode.toPlainTextString().trim();
    System.out.println(name);
}
……
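
Once the text of each cell has been collected as in the loop above, the cells of one row still have to be joined into a single line without losing the empty ones. The sketch below is one possible way to do that, under stated assumptions: it uses a tab as the separator instead of the single space shown in the question's sample line, because descriptions in the sample data contain spaces themselves, and it assumes no cell contains a tab. The class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: one plain-text line per table row, empty cells preserved.
class RowLines {
    // Joins the cells with tabs; an empty cell shows up as two adjacent tabs.
    static String toLine(List<String> cells) {
        return String.join("\t", cells);
    }

    // Splits a line back into cells. The negative limit keeps trailing empty
    // cells, which String.split would otherwise drop.
    static List<String> fromLine(String line) {
        return Arrays.asList(line.split("\t", -1));
    }
}
```

Splitting with a negative limit matters here because the last four columns of every sample row are empty, and they must survive if each value is to be stored under its matching header.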
