Extracting web page content with regular expressions or some other approach
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<style type="text/css" media="all">
@import url("/ReviewToolWS/css/maven-base.css");
@import url("/ReviewToolWS/css/maven-theme.css");
@import url("/ReviewToolWS/css/site.css");
@import url("/ReviewToolWS/css/screen.css");
</style>
<script type="text/javascript">
function WrapScript(){
var tds=document.getElementsByTagName("td");
for(var i=0;i<tds.length;i++)
{
tds[i].wrap="yes";
}
}
</script>
<link rel="stylesheet" href="/ReviewToolWS/css/print.css" type="text/css" media="print" />
</head>
<body topmargin="0" marginheight="0" marginwidth="0" bottommargin="0" rightmargin="0">
<form id="frm1" >
<table cellspacing="0" cellpadding="0" style="width:100%;valign:top" border="1">
<tr>
<td colspan=4>
<table class="simple" style="width:1820;align:center" id="displaytag1">
<thead>
<tr>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=0&projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&d-7272271-o=2">Turn No<br/>轮次</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=1&projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&d-7272271-o=2">Reviewer<br/>评审人员</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=2&projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&d-7272271-o=2">Description<br/>描述</a></th>
<th class="sortable sorted order1">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=3&projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&d-7272271-o=1">Confirmation<br/>问题确认</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=4&projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&d-7272271-o=2">Author<br/>回复者</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=5&projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&d-7272271-o=2">Remark<br/>回复内容</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=6&projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&d-7272271-o=2">Rectified<br/>是否已改正</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=7&projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&d-7272271-o=2">Location(Page/Sec/All)<br/>位置(页/段)</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=8&projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&d-7272271-o=2">Defect/Query<br/>缺陷/疑问</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=9&projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&d-7272271-o=2">Defect Severity<br/>严重程度</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=10&projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&d-7272271-o=2">Question Root Object<br/>问题根源对象</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=11&projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&d-7272271-o=2">Defect Type<br/>缺陷类型</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=12&projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&d-7272271-o=2">Defect Trigger Factor<br/>缺陷触发因素</a></th>
<th class="sortable">
<a href="/ReviewToolWS//center.jsp?d-7272271-s=13&projectId=0F865F13-21D6-437E-BAC2-021FC1E9BD9A&d-7272271-o=2">Defect Qualifier<br/>缺陷界定</a></th></tr></thead>
<tbody>
<tr class="odd">
<td style="width:20">1</td>
<td style="width:50">s65404</td>
<td style="width:400">uwDdrDataBeginAddr 在pusch中是按照逻辑顺序放置的,不是按照commtbl里的小区指示放的</td>
<td style="width:60">接受</td>
<td style="width:50">l00140929</td>
<td style="width:300">已修改</td>
<td style="width:70">是</td>
<td style="width:80">\LBB\UL\CTRL\src\LBB_UL_CTRL_ADM_PucchMainProc.c</td>
<td style="width:70">Defect缺陷</td>
<td style="width:70">Minor一般</td>
<td style="width:100"></td>
<td style="width:80"></td>
<td style="width:80"></td>
<td style="width:55"></td></tr>
<tr class="even">
<td style="width:20">1</td>
<td style="width:50">w00170701</td>
<td style="width:400">不用的注释去掉</td>
<td style="width:60">接受</td>
<td style="width:50">l00146658</td>
<td style="width:300">o</td>
<td style="width:70">是</td>
<td style="width:80">\wuyl_work\unimath\ClearCaseView\w00170701_view4\WL_BESA_LTE_3X_CODE\LBB\UL\PUCCH\src\LBB_UL_PUCCH_SpreadSeqCalc.c</td>
<td style="width:70">Defect缺陷</td>
<td style="width:70">Minor一般</td>
<td style="width:100"></td>
<td style="width:80"></td>
<td style="width:80"></td>
<td style="width:55"></td></tr>
<tr class="odd">
<td style="width:20">1</td>
<td style="width:50">l00140929</td>
<td style="width:400">最后一个Core可能非LogicCoreId=0的Core,这里就会有问题</td>
<td style="width:60">接受</td>
<td style="width:50">l00140929</td>
<td style="width:300">已修改</td>
<td style="width:70">是</td>
<td style="width:80">\LBB\UL\CTRL\src\LBB_UL_CTRL_ADM_PucchMainProc.c</td>
<td style="width:70">Defect缺陷</td>
<td style="width:70">Major严重</td>
<td style="width:100"></td>
<td style="width:80"></td>
<td style="width:80"></td>
<td style="width:55"></td></tr>
<tr class="odd">
<td style="width:20">1</td>
<td style="width:50">w00170701</td>
<td style="width:400">考虑扩展CP是否有问题</td>
<td style="width:60">接受</td>
<td style="width:50">l00146658</td>
<td style="width:300">本轮迭代先不考虑扩展CP</td>
<td style="width:70">否</td>
<td style="width:80">\LBB\UL\PUCCH\src\LBB_UL_PUCCH_Report.c</td>
<td style="width:70">Defect缺陷</td>
<td style="width:70">Minor一般</td>
<td style="width:100"></td>
<td style="width:80"></td>
<td style="width:80"></td>
<td style="width:55"></td></tr>
<tr class="even">
<td style="width:20">1</td>
<td style="width:50">f00174951</td>
<td style="width:400">可以在前面++</td>
<td style="width:60">接受</td>
<td style="width:50">l00140929</td>
<td style="width:300">已修改</td>
<td style="width:70">是</td>
<td style="width:80">\BESE300CODE\WL_BESA_LTE_3X_CODE\LBB\UL\PUSCH\src\LBB_UL_PUSCH_Uci.c</td>
<td style="width:70">Defect缺陷</td>
<td style="width:70">Minor一般</td>
<td style="width:100"></td>
<td style="width:80"></td>
<td style="width:80"></td>
<td style="width:55"></td></tr>
<tr class="even">
<td style="width:20">1</td>
<td style="width:50">w00170701</td>
<td style="width:400">不用的注释去掉</td>
<td style="width:60">重复</td>
<td style="width:50">l00146658</td>
<td style="width:300">o</td>
<td style="width:70">否</td>
<td style="width:80">\wuyl_work\unimath\ClearCaseView\w00170701_view4\WL_BESA_LTE_3X_CODE\LBB\UL\PUCCH\src\LBB_UL_PUCCH_ParamCalc.c</td>
<td style="width:70">Defect缺陷</td>
<td style="width:70">Minor一般</td>
<td style="width:100"></td>
<td style="width:80"></td>
<td style="width:80"></td>
<td style="width:55"></td></tr>
<tr class="odd">
<td style="width:20">1</td>
<td style="width:50">w00170701</td>
<td style="width:400">同上</td>
<td style="width:60">重复</td>
<td style="width:50">l00146658</td>
<td style="width:300">o</td>
<td style="width:70">否</td>
<td style="width:80">\wuyl_work\unimath\ClearCaseView\w00170701_view4\WL_BESA_LTE_3X_CODE\LBB\UL\PUCCH\src\LBB_UL_PUCCH_ParamCalc.c</td>
<td style="width:70">Defect缺陷</td>
<td style="width:70">Minor一般</td>
<td style="width:100"></td>
<td style="width:80"></td>
<td style="width:80"></td>
<td style="width:55"></td></tr></tbody></table>
</td></tr>
<tr><td colspan=4><font size=2>Total 总共:<font color=red size=3>37</font> Records 条记录!</font>
</td></tr>
</table>
</form>
</body>
</html>
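I fetched the HTML above from a web page. How can I turn it into plain text, one line per table row, like the result below? (Empty cells must be kept as blanks, because the values have to be stored in the database in order, column by column, according to the table header.)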
1 s65404 uwDdrDataBeginAddr 在pusch中是按照逻辑顺序放置的,不是按照commtbl里的小区指示放的 接受 l00140929 已修改 是 \LBB\UL\CTRL\src\LBB_UL_CTRL_ADM_PucchMainProc.c Defect缺陷 Minor一般
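[Solution]
The structure of this HTML document is quite good and fairly well-formed.
Just use the HTMLParser component. It lets you access the HTML much like a DOM structure, so locating and searching for nodes is easy.
All you need to do is first find the <TABLE> node with id="displaytag1", then loop over all the <tr> nodes under it.
Take the <td> cells out of each <tr> row, assemble them into an Insert statement, and you can write them to the database.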
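The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: it uses Python's standard-library `html.parser` as a stand-in for the HTMLParser component named in the answer (which is presumably a different library), the `TableRowParser` class, the shortened `SAMPLE` markup, and the `review` table with its column names are all hypothetical, and an in-memory SQLite database stands in for the real one. Empty cells are kept as empty strings so every value stays aligned with its header column.

```python
from html.parser import HTMLParser
import sqlite3


class TableRowParser(HTMLParser):
    """Collect the <td> text of every <tbody> row of the table with a given id."""

    def __init__(self, table_id):
        super().__init__(convert_charrefs=True)
        self.table_id = table_id
        self.depth = 0        # <table> nesting depth, counted from the target table
        self.in_body = False  # True while inside the target table's <tbody>
        self.row = None       # cells of the <tr> being read, else None
        self.cell = None      # text pieces of the <td> being read, else None
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Start counting at the target table; count nested tables too.
            if self.depth or dict(attrs).get("id") == self.table_id:
                self.depth += 1
        elif self.depth == 1:
            if tag == "tbody":
                self.in_body = True
            elif tag == "tr" and self.in_body:
                self.row = []
            elif tag == "td" and self.row is not None:
                self.cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif self.depth == 1:
            if tag == "td" and self.cell is not None:
                # An empty <td></td> becomes "", so columns stay aligned.
                self.row.append("".join(self.cell))
                self.cell = None
            elif tag == "tr" and self.row is not None:
                self.rows.append(self.row)
                self.row = None
            elif tag == "tbody":
                self.in_body = False

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


# Hypothetical, shortened version of the page: four columns instead of fourteen.
SAMPLE = """
<table border="1"><tr><td colspan=4>
<table class="simple" id="displaytag1">
<thead><tr><th>Turn No</th><th>Reviewer</th><th>Author</th><th>Defect Type</th></tr></thead>
<tbody>
<tr class="odd">
<td style="width:20">1</td>
<td style="width:50">s65404</td>
<td style="width:50">l00140929</td>
<td style="width:80"></td></tr>
<tr class="even">
<td style="width:20">1</td>
<td style="width:50">w00170701</td>
<td style="width:50">l00146658</td>
<td style="width:80"></td></tr></tbody></table>
</td></tr></table>
"""

parser = TableRowParser("displaytag1")
parser.feed(SAMPLE)
rows = parser.rows

# One plain-text line per row; tabs keep empty cells visible as columns.
lines = ["\t".join(row) for row in rows]

# Parameterized inserts instead of hand-built SQL strings.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE review (turn_no TEXT, reviewer TEXT, author TEXT, defect_type TEXT)"
)
conn.executemany("INSERT INTO review VALUES (?, ?, ?, ?)", rows)
stored = conn.execute("SELECT * FROM review ORDER BY rowid").fetchall()
```

The sketch binds the cell values as query parameters rather than concatenating them into the Insert statement, because the cell text (for example the Windows file paths with backslashes) could otherwise break the SQL or open it to injection.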
[Solution]