
Apache Tika: Parsing Multiple File Types (Word, PDF, TXT, and More)

2012-09-29 

Apache is a great organization.

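While Lucene search was in full swing, Apache kept pushing ahead and recently released a solution for parsing files in many formats: Tika, an Apache project. It has not reached version 1.0 yet, but it is already quite usable: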

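    import java.io.File;
    import org.apache.tika.config.TikaConfig;
    import org.apache.tika.utils.ParseUtils;

    /**
     * Parses files of various types.
     * @param path path of the file to parse
     * @return the file content as a string ("" if parsing fails)
     */
    public static String parse(String path) {
        String result = "";
        TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
        try {
            result = ParseUtils.getStringContent(new File(path), tikaConfig);
        } catch (Exception e) {
            // log is the enclosing class's logger (its declaration is not shown here)
            log.debug("[by ninja.hzw]" + e);
        }
        return result;
    }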


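It is simple: the method parses many kinds of files and returns the document content as a string. Word 2003/2007, PDF, and TXT files were all tested, and all of them parsed without garbled characters.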
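Because the parse method above swallows exceptions and returns an empty string on failure, calling code has to treat an empty result as "nothing extracted". The following is a minimal sketch of that idea for a batch of files, not part of the original post: the class name BatchParseSketch, the parseAll helper, and the stub parser are all hypothetical, and the parser is passed in as a function so the example runs without Tika on the classpath.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Collects extracted text for a batch of files, skipping files that yielded nothing. */
final class BatchParseSketch {

    /**
     * @param paths  file paths to parse
     * @param parser stand-in for the post's parse(String) method (hypothetical wiring)
     * @return path to extracted text, in input order, without empty results
     */
    static Map<String, String> parseAll(List<String> paths, Function<String, String> parser) {
        Map<String, String> contents = new LinkedHashMap<>();
        for (String path : paths) {
            String text = parser.apply(path);
            // parse() returns "" when an exception was caught, so an empty
            // (or whitespace-only) result is treated as "nothing to index".
            if (text != null && !text.trim().isEmpty()) {
                contents.put(path, text);
            }
        }
        return contents;
    }

    public static void main(String[] args) {
        // Stub parser for demonstration only: pretends every .pdf fails to parse.
        Function<String, String> stub = p -> p.endsWith(".pdf") ? "" : "content of " + p;
        System.out.println(parseAll(Arrays.asList("a.doc", "b.pdf", "c.txt"), stub));
    }
}
```

In a real project the stub would be replaced by a reference to the parse method, and the returned map could be handed to whatever indexing step follows.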


Oh, great Apache!


Downloading and building Tika:

Downloading needs little explanation: search Google for "apache tika" and download it from the official site.

To build Tika from sources you first need to either download a source release or checkout the latest sources from version control. Once you have the sources, you can build them using the Maven 2 build system. Executing the following command in the base directory will build the sources and install the resulting artifacts in your local Maven repository.

    mvn install


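Apache states it clearly: go into the downloaded tika directory and run mvn install. (This assumes you know how to use Maven 2; if you do not, feel free to contact me. Also note that JDK 1.5 or later is required for the build to succeed.)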

The build produces the following jars:

tika-core/target/tika-core-0.7.jar
    Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 5.

tika-parsers/target/tika-parsers-0.7.jar
    Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.

tika-app/target/tika-app-0.7.jar
    Tika application. Combines the above libraries and all the external parser libraries into a single runnable jar with a GUI and a command line interface.

tika-bundle/target/tika-bundle-0.7.jar
    Tika bundle. An OSGi bundle that includes everything you need to use all Tika functionality in an OSGi environment.


For document parsing, we only need to include tika-core and tika-parsers.


If your project is built with Maven, it is even easier. Add these dependencies to the pom:

  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>0.7</version>
  </dependency>


and

  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>0.7</version>
  </dependency>


Maven will download them automatically. (Thanks to the Maven project for the support.)
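Whether the two jars were copied in by hand or pulled in by Maven, it can be useful to confirm that both are actually visible at runtime before debugging parse failures. The sketch below is one way to do that with the standard library only. The class name TikaClasspathCheck and its helper are hypothetical, and it assumes that org.apache.tika.parser.Parser is provided by tika-core and org.apache.tika.parser.pdf.PDFParser by tika-parsers.

```java
/** Checks whether classes expected from tika-core and tika-parsers can be loaded. */
final class TikaClasspathCheck {

    /** Returns true if the named class is loadable from the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, TikaClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class itself, or something it links against, is missing
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed locations: Parser in tika-core, PDFParser in tika-parsers.
        System.out.println("tika-core:    " + isOnClasspath("org.apache.tika.parser.Parser"));
        System.out.println("tika-parsers: " + isOnClasspath("org.apache.tika.parser.pdf.PDFParser"));
    }
}
```

If either line prints false, the corresponding jar (or one of its transitive dependencies) is probably missing from the classpath.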

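#1 eidolonprince 2011-04-13   Hello, I followed your method and used Tika with Lucene, and now I have a problem: every time a PDF is parsed, no error is reported, but a large amount of output like the following is printed. This does not happen with other formats. Could you give me some advice?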
2011-04-13 19:50:22,984 DEBUG [http-9080-2] org.apache.pdfbox.pdfparser.PDFObjectStreamParser: parsed=COSObject{16, 0}
2011-04-13 19:50:22,984 DEBUG [http-9080-2] org.apache.pdfbox.pdfparser.PDFObjectStreamParser: parsed=COSObject{15, 0}
2011-04-13 19:50:22,984 DEBUG [http-9080-2] org.apache.pdfbox.pdfparser.PDFObjectStreamParser: parsed=COSObject{13, 0}
2011-04-13 19:50:22,984 DEBUG [http-9080-2] org.apache.pdfbox.pdfparser.PDFObjectStreamParser: parsed=COSObject{14, 0}
2011-04-13 19:50:22,984 DEBUG [http-9080-2] org.apache.pdfbox.pdfparser.PDFObjectStreamParser: parsed=COSObject{17, 0}
2011-04-13 19:50:23,275 DEBUG [http-9080-2] org.apache.pdfbox.pdmodel.font.PDSimpleFont: Debug: Could not find encoding for COSName{Identity-H}
2011-04-13 19:50:23,318 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSName{P}
2011-04-13 19:50:23,318 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSDictionary{(COSName{MCID}:COSInt{0}) }
2011-04-13 19:50:23,318 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{BDC}
2011-04-13 19:50:23,318 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{BT}
2011-04-13 19:50:23,318 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSName{F1}
2011-04-13 19:50:23,318 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSFloat{10.56}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{Tf}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{1}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{0}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{0}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{1}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSFloat{90.024}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSFloat{758.28}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{Tm}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{0}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{g}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{0}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{G}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSArray{[COSString{08?}, COSInt{11}, COSString{-;-;}, COSInt{11}, COSString{??}, COSInt{11}, COSString{7-
V}, COSInt{11}, COSString{>?L}, COSInt{11}, COSString{*?}]}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{TJ}
2011-04-13 19:50:23,323 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{ET}
2011-04-13 19:50:23,323 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{EMC}
2011-04-13 19:50:23,323 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSName{P}
2011-04-13 19:50:23,323 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSDictionary{(COSName{MCID}:COSInt{1}) }
2011-04-13 19:50:23,323 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{BDC}
2011-04-13 19:50:23,323 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{BT}
2011-04-13 19:50:23,323 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSName{F2}
2011-04-13 19:50:23,323 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSFloat{10.56}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{Tf}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{1}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{0}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{0}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{1}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSFloat{226.61}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSFloat{758.28}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{Tm}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSArray{[COSString{ }]}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{TJ}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{ET}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{EMC}
