
Reading PDF, HTML, Word, RTF, TXT, PowerPoint, and Excel documents during development

2012-10-25 

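These seven document types are probably the ones developers run into most often.

Several of the examples below use POI, so here is a quick introduction first.

POI works well for Word and Excel: http://jakarta.apache.org/poi/

POI needs at least a few JAR files on the classpath; the set used here is listed in the comment at the top of the Word/Excel 2007 example.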


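PDFBox works well for PDF: http://pdfbox.apache.org/download.html

Each format is covered in turn below.

The first and second examples only support the 2003 versions of Word and Excel (.doc and .xls).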
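Because the first two readers only handle the 2003 binary formats, it can help to check which container a file actually uses before picking a reader. The sketch below is an illustration and not part of the original code. It relies on two facts: 2003-era .doc/.xls/.ppt files are OLE2 compound documents that start with the bytes D0 CF 11 E0 A1 B1 1A E1, and 2007-era .docx/.xlsx/.pptx files are ZIP archives that start with 50 4B 03 04. The class name `OfficeFormatSniffer` and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper: guesses an Office file's container format from its leading bytes. */
final class OfficeFormatSniffer {
    // OLE2 compound document signature (binary .doc/.xls/.ppt).
    private static final int[] OLE2 = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header signature (.docx/.xlsx/.pptx are ZIP archives).
    private static final int[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    /** Returns "OLE2", "ZIP", or "UNKNOWN" for the given leading bytes. */
    static String detect(byte[] head) {
        if (startsWith(head, OLE2)) return "OLE2";
        if (startsWith(head, ZIP)) return "ZIP";
        return "UNKNOWN";
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    static String detect(String path) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            byte[] head = new byte[8];
            int total = 0;
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) break; // file shorter than 8 bytes
                total += n;
            }
            return detect(Arrays.copyOf(head, total));
        }
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

A result of "OLE2" suggests the HWPF/HSSF readers and "ZIP" suggests the XWPF/XSSF readers. A ZIP signature alone does not prove the file is an Office document, so treat it as a hint only.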

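First, Word documents:
I use POI here. Download the relevant JARs yourself and add them to the project (the same goes for the JARs needed in the later examples, so I won't repeat this).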

Java code

    import java.io.FileInputStream;
    import org.apache.poi.hwpf.extractor.WordExtractor;

    // Extracts plain text from a Word 2003 (.doc) file; returns null if extraction fails.
    public static String readWord(String path) throws Exception {
        String bodyText = null;
        // try-with-resources closes the stream, which the original left open
        try (FileInputStream is = new FileInputStream(path)) {
            bodyText = new WordExtractor(is).getText();
        } catch (Exception e) {
            // Report the cause instead of printing a bare "=======" separator
            System.out.println("readWord failed: " + e);
        }
        return bodyText;
    }




Second, Excel documents

Java code

    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.poi.hssf.extractor.ExcelExtractor;
    import org.apache.poi.hssf.usermodel.HSSFWorkbook;

    // Extracts plain text from an Excel 2003 (.xls) file; returns null if the file is missing.
    public static String ReadExcel(String path) throws IOException {
        String content = null;
        // try-with-resources closes the stream, which the original left open
        try (InputStream inputStream = new FileInputStream(path)) {
            HSSFWorkbook wb = new HSSFWorkbook(inputStream);
            ExcelExtractor extractor = new ExcelExtractor(wb);
            extractor.setFormulasNotResults(true);  // output formulas rather than their results
            extractor.setIncludeSheetNames(false);  // leave sheet names out of the text
            content = extractor.getText();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        return content;
    }





Working with the 2007 versions of Word and Excel

Java code

    package com.test;

    /*
     * Required JARs:
     * poi-3.0.2-FINAL-20080204.jar
     * poi-contrib-3.0.2-FINAL-20080204.jar
     * poi-scratchpad-3.0.2-FINAL-20080204.jar
     * poi-3.5-beta6-20090622.jar
     * geronimo-stax-api_1.0_spec-1.0.jar
     * ooxml-schemas-1.0.jar
     * openxml4j-bin-beta.jar
     * poi-ooxml-3.5-beta6-20090622.jar
     * xmlbeans-2.3.0.jar
     * dom4j-1.6.1.jar
     */

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.poi.POIXMLDocument;
    import org.apache.poi.POIXMLTextExtractor;
    import org.apache.poi.hssf.usermodel.HSSFCell;
    import org.apache.poi.hssf.usermodel.HSSFRow;
    import org.apache.poi.hssf.usermodel.HSSFSheet;
    import org.apache.poi.hssf.usermodel.HSSFWorkbook;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
    import org.apache.poi.openxml4j.opc.OPCPackage;
    import org.apache.poi.xssf.usermodel.XSSFCell;
    import org.apache.poi.xssf.usermodel.XSSFRow;
    import org.apache.poi.xssf.usermodel.XSSFSheet;
    import org.apache.poi.xssf.usermodel.XSSFWorkbook;
    import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
    import org.apache.xmlbeans.XmlException;

    public class WordAndExcelExtractor {

        public static void main(String[] args) {
            try {
                String wordFile = "D:/松山血战.docx";
                String wordText2007 = WordAndExcelExtractor.extractTextFromDOC2007(wordFile);
                System.out.println("wordText2007=======" + wordText2007);

                // Close the .xls stream once the text has been extracted
                try (InputStream is = new FileInputStream("D:/XXX研发中心技术岗位职位需求.xls")) {
                    String excelText = WordAndExcelExtractor.extractTextFromXLS(is);
                    System.out.println("text2003==========" + excelText);
                }

                String excelFile = "D:/Hello2007.xlsx";
                String excelText2007 = WordAndExcelExtractor.extractTextFromXLS2007(excelFile);
                System.out.println("excelText2007==========" + excelText2007);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        /**
         * Extracts plain text from a Word 2003 (.doc) document.
         *
         * @param is InputStream of the Word file
         * @return the extracted text
         */
        public static String extractTextFromDOC(InputStream is) throws IOException {
            WordExtractor ex = new WordExtractor(is);
            return ex.getText();
        }

        /**
         * Extracts plain text from a Word 2007 (.docx) document.
         *
         * @param fileName path of the .docx file
         * @return the extracted text
         */
        public static String extractTextFromDOC2007(String fileName)
                throws IOException, OpenXML4JException, XmlException {
            OPCPackage opcPackage = POIXMLDocument.openPackage(fileName);
            POIXMLTextExtractor ex = new XWPFWordExtractor(opcPackage);
            return ex.getText();
        }

        /**
         * Extracts plain text from an Excel 2003 (.xls) document.
         *
         * @param is InputStream of the Excel file
         * @return the cell values, concatenated without separators
         */
        private static String extractTextFromXLS(InputStream is) throws IOException {
            StringBuilder content = new StringBuilder();
            HSSFWorkbook workbook = new HSSFWorkbook(is); // reference to the Excel workbook

            for (int numSheets = 0; numSheets < workbook.getNumberOfSheets(); numSheets++) {
                HSSFSheet aSheet = workbook.getSheetAt(numSheets); // one sheet
                if (aSheet == null) {
                    continue;
                }
                for (int rowNumOfSheet = 0; rowNumOfSheet <= aSheet.getLastRowNum(); rowNumOfSheet++) {
                    HSSFRow aRow = aSheet.getRow(rowNumOfSheet); // one row
                    if (aRow == null) {
                        continue;
                    }
                    // getLastCellNum() is the last cell index PLUS ONE, so "<" is enough
                    for (int cellNumOfRow = 0; cellNumOfRow < aRow.getLastCellNum(); cellNumOfRow++) {
                        HSSFCell aCell = aRow.getCell(cellNumOfRow); // one cell
                        if (aCell == null) {
                            continue;
                        }
                        if (aCell.getCellType() == HSSFCell.CELL_TYPE_NUMERIC) {
                            content.append(aCell.getNumericCellValue());
                        } else if (aCell.getCellType() == HSSFCell.CELL_TYPE_BOOLEAN) {
                            content.append(aCell.getBooleanCellValue());
                        } else {
                            content.append(aCell.getStringCellValue());
                        }
                    }
                }
            }
            return content.toString();
        }

        /**
         * Extracts plain text from an Excel 2007 (.xlsx) document.
         *
         * @param fileName path of the .xlsx file
         * @return the cell values, concatenated without separators
         */
        private static String extractTextFromXLS2007(String fileName) throws Exception {
            StringBuilder content = new StringBuilder();

            // Build the XSSFWorkbook from the file path
            XSSFWorkbook xwb = new XSSFWorkbook(fileName);

            // Loop over the sheets
            for (int numSheet = 0; numSheet < xwb.getNumberOfSheets(); numSheet++) {
                XSSFSheet xSheet = xwb.getSheetAt(numSheet);
                if (xSheet == null) {
                    continue;
                }
                // Loop over the rows
                for (int rowNum = 0; rowNum <= xSheet.getLastRowNum(); rowNum++) {
                    XSSFRow xRow = xSheet.getRow(rowNum);
                    if (xRow == null) {
                        continue;
                    }
                    // Loop over the cells
                    for (int cellNum = 0; cellNum < xRow.getLastCellNum(); cellNum++) {
                        XSSFCell xCell = xRow.getCell(cellNum);
                        if (xCell == null) {
                            continue;
                        }
                        if (xCell.getCellType() == XSSFCell.CELL_TYPE_BOOLEAN) {
                            content.append(xCell.getBooleanCellValue());
                        } else if (xCell.getCellType() == XSSFCell.CELL_TYPE_NUMERIC) {
                            content.append(xCell.getNumericCellValue());
                        } else {
                            content.append(xCell.getStringCellValue());
                        }
                    }
                }
            }
            return content.toString();
        }
    }
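One detail of the cell loops above is easy to miss: `getNumericCellValue()` returns a `double`, so appending it directly renders a cell that holds 42 as `42.0`. If that matters for the extracted text, a small formatting step can be applied before appending. The sketch below uses only the JDK; `CellNumberFormat` and `format` are hypothetical names, not part of POI or of the original code.

```java
import java.math.BigDecimal;

/** Hypothetical helper: renders a numeric cell value without a trailing ".0". */
final class CellNumberFormat {
    static String format(double value) {
        // BigDecimal cannot represent NaN or infinities, so pass those through unchanged.
        if (Double.isNaN(value) || Double.isInfinite(value)) {
            return String.valueOf(value);
        }
        // valueOf uses the double's shortest decimal form; stripTrailingZeros drops ".0";
        // toPlainString avoids scientific notation such as 1E+2.
        return BigDecimal.valueOf(value).stripTrailingZeros().toPlainString();
    }
}
```

In the loops this would replace `content.append(aCell.getNumericCellValue())` with `content.append(CellNumberFormat.format(aCell.getNumericCellValue()))`. Date cells are also stored as numbers in Excel, and this sketch does not attempt to handle them.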

Third, PowerPoint documents

Java code

    import java.io.FileInputStream;
    import org.apache.poi.hslf.HSLFSlideShow;
    import org.apache.poi.hslf.model.Slide;
    import org.apache.poi.hslf.model.TextRun;
    import org.apache.poi.hslf.usermodel.SlideShow;

    // Extracts the text of every slide in a PowerPoint 2003 (.ppt) file.
    public static String readPowerPoint(String path) {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the stream, which the original left open
        try (FileInputStream is = new FileInputStream(path)) {
            // Build a SlideShow from the file's InputStream
            SlideShow ss = new SlideShow(new HSLFSlideShow(is));
            Slide[] slides = ss.getSlides(); // every slide in the presentation
            for (int i = 0; i < slides.length; i++) {
                // The TextRuns carry the text content of a slide
                TextRun[] t = slides[i].getTextRuns();
                for (int j = 0; j < t.length; j++) {
                    content.append(t[j].getText()); // add the text to content
                }
            }
        } catch (Exception ex) {
            System.out.println(ex.toString());
        }
        return content.toString();
    }







Fourth, PDF documents

Java code

    import java.io.FileInputStream;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    // Extracts the text of a PDF file (PDFBox 1.x API).
    public static String readPdf(String path) throws Exception {
        try (FileInputStream fis = new FileInputStream(path)) {
            PDFParser p = new PDFParser(fis);
            p.parse();
            PDDocument doc = p.getPDDocument();
            try {
                return new PDFTextStripper().getText(doc).trim();
            } finally {
                doc.close(); // the original never closed the parsed document
            }
        }
    }






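Fifth, HTML documents. Note that to get the TITLE and BODY content of an HTML document, you first have to read the source file and then filter the tags out of it, which is tedious.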

Java code

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.util.regex.Pattern;

    // Reads an HTML file as UTF-8 and returns its text with script, style, and tags removed.
    public static String readHtml(String urlString) {
        StringBuilder content = new StringBuilder();
        File file = new File(urlString);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "utf-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line).append("\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        String htmlStr = content.toString(); // string that still contains HTML tags
        String textStr = "";
        try {
            // The tail of the next two patterns was cut off in the source; the closing
            // "[\\s]*?>" is reconstructed here by symmetry with the opening tag.
            String regEx_script = "<[\\s]*?script[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?script[\\s]*?>";
            String regEx_style = "<[\\s]*?style[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?style[\\s]*?>";
            String regEx_html = "<[^>]+>"; // regular expression for any HTML tag
            htmlStr = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE)
                    .matcher(htmlStr).replaceAll(""); // filter out script blocks
            htmlStr = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE)
                    .matcher(htmlStr).replaceAll(""); // filter out style blocks
            htmlStr = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE)
                    .matcher(htmlStr).replaceAll(""); // filter out remaining tags
            textStr = htmlStr;
        } catch (Exception e) {
            System.err.println("Html2Text: " + e.getMessage());
        }
        return textStr; // the plain text
    }
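The HTML section mentions wanting the TITLE as well as the BODY text, while the method above only strips tags. A minimal sketch of pulling the title out with the same regex approach follows. `HtmlTitle` and `extractTitle` are hypothetical names, and regex-based HTML handling like this is fragile on malformed markup, so treat it as an illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the contents of the first <title> element. */
final class HtmlTitle {
    // Case-insensitive match of <title ...>text</title>, tolerating stray whitespace.
    private static final Pattern TITLE = Pattern.compile(
            "<\\s*title[^>]*>([\\s\\S]*?)<\\s*/\\s*title\\s*>", Pattern.CASE_INSENSITIVE);

    /** Returns the trimmed title text, or an empty string if there is no title. */
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }
}
```

The title has to be extracted from the raw source before the tag-stripping step runs, since stripping removes the `<title>` markers the pattern depends on.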

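Sixth, TXT documents. Pay attention to the encoding when building an index over TXT files.
This project implements combined queries.
If the reader in this step is not set to GBK, the content of a GBK-encoded TXT file comes out entirely garbled: BufferedReader reader = new BufferedReader(new InputStreamReader(is, "GBK")); The full code is as follows.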

Java code

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;

    // Reads a GBK-encoded text file into a single string.
    public static String readTxt(String path) throws IOException {
        StringBuilder sb = new StringBuilder();
        // Decode as GBK; a GBK file read with another charset comes out garbled
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), "GBK"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\n"); // the original appended a bare "\r"
            }
        }
        return sb.toString().trim();
    }
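Hard-coding GBK works only when every input file really is GBK-encoded. One possible alternative, sketched below under the assumption that inputs are either UTF-8 or GBK, is to attempt a strict UTF-8 decode first and fall back to GBK when the bytes are not valid UTF-8. `TextDecoding` and `decodeUtf8OrGbk` are hypothetical names and not part of the original project.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: decodes bytes as UTF-8 when valid, otherwise as GBK. */
final class TextDecoding {
    static String decodeUtf8OrGbk(byte[] bytes) {
        try {
            // REPORT makes the decoder throw on invalid input instead of
            // silently substituting replacement characters.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume GBK, as the original code does.
            return new String(bytes, Charset.forName("GBK"));
        }
    }
}
```

This is a heuristic rather than real charset detection: pure ASCII decodes identically either way, and a short GBK byte sequence can occasionally happen to be valid UTF-8.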


Seventh, RTF documents. RTF conversion is already available in javax (javax.swing.text.rtf).

Java code

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import javax.swing.text.BadLocationException;
    import javax.swing.text.DefaultStyledDocument;
    import javax.swing.text.rtf.RTFEditorKit;

    // Extracts the text of an RTF file; returns null if reading fails.
    public static String readRtf(String path) {
        String result = null;
        File file = new File(path);
        // try-with-resources closes the stream, which the original left open
        try (InputStream is = new FileInputStream(file)) {
            DefaultStyledDocument styledDoc = new DefaultStyledDocument();
            new RTFEditorKit().read(is, styledDoc, 0);
            // Extract the text; Chinese content must be re-decoded as GBK or it is garbled
            result = new String(styledDoc.getText(0, styledDoc.getLength())
                    .getBytes("iso8859-1"), "gbk");
        } catch (IOException e) {
            e.printStackTrace();
        } catch (BadLocationException e) {
            e.printStackTrace();
        }
        return result;
    }
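The `getBytes("iso8859-1")` followed by `"gbk"` step above presumably works because ISO-8859-1 maps every byte value from 0 to 255 to exactly one character. Text that was wrongly decoded as ISO-8859-1 can therefore be turned back into the original bytes without loss and decoded again as GBK. The sketch below demonstrates that round trip on plain strings, with no RTF involved; `Latin1RoundTrip` and its method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of repairing GBK text that was mis-decoded as ISO-8859-1. */
final class Latin1RoundTrip {
    private static final Charset GBK = Charset.forName("GBK");

    /** Simulates a reader that wrongly decodes GBK bytes as ISO-8859-1. */
    static String misdecode(String original) {
        return new String(original.getBytes(GBK), StandardCharsets.ISO_8859_1);
    }

    /** Recovers the text: back to the raw bytes, then decode them as GBK. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), GBK);
    }
}
```

The trick only applies when the intermediate decoding was a lossless single-byte charset such as ISO-8859-1. If the text had been mis-decoded as UTF-8 instead, invalid bytes would already have been replaced and the original could not be recovered.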
