Apache POI - HWPF and XWPF - Java API to Handle Microsoft Word Files

Parsing .doc files

Requires poi-scratchpad/3.7.jar

POI-HWPF - A Quick Guide

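Basic text extraction

For basic text extraction, use org.apache.poi.hwpf.extractor.WordExtractor. Its constructor accepts either an InputStream or an HWPFDocument.

getText() returns all the text of the document.

getParagraphText() returns the text of each paragraph, as an array of strings.

getTextFromPieces() reads the text directly from the document's text pieces, not page by page. The result may contain some debris, but the method still works when the mapping from text pieces to paragraphs is broken.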
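Text extracted from a .doc file usually still carries Word's control characters: as far as I know, a paragraph's text ends with a carriage return and a table cell ends with the cell mark U+0007. The helper below is a minimal sketch for trimming those before printing or comparing text. WordTextCleaner and stripTrailingMarks are hypothetical names, not part of POI.

```java
/** Hypothetical helper, not part of POI: trims Word's trailing control characters. */
final class WordTextCleaner {

    private WordTextCleaner() {
    }

    /** Removes trailing carriage returns, line feeds and cell marks (U+0007) from extracted text. */
    static String stripTrailingMarks(String text) {
        int end = text.length();
        while (end > 0) {
            char c = text.charAt(end - 1);
            if (c == '\r' || c == '\n' || c == '\u0007') {
                end--;
            } else {
                break;
            }
        }
        // Control characters in the middle of the text are left untouched.
        return text.substring(0, end);
    }
}
```

Each string returned by getParagraphText() could be passed through stripTrailingMarks before further processing.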

Extracting specific text properties

To get specific bits of text, first create a org.apache.poi.hwpf.HWPFDocument. Fetch the range with getRange(), then get paragraphs from that. You can then get text and other properties.

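Step 1: create an HWPFDocument

Step 2: get the Range

getRange(): Returns the range which covers the whole of the document, but excludes any headers and footers.

int numParagraphs(): Used to get the number of paragraphs in a range.

int numSections(): Used to get the number of sections in a range (a "section" here is what Word creates through Insert > Break > Section break).

Step 3: get the paragraphs

getParagraph(int index): returns the paragraph at the given index in the range.

text(): returns the text of the paragraph or character run. The HWPF range classes expose text(), not getText().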

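import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

// Class name chosen for illustration.
public class ParseWordDocTest {

    public static void main(String[] args) throws Exception {
        // HWPF reads only the binary .doc format; a .docx file must be opened with XWPF.
        try (InputStream istream = new FileInputStream(
                "e:\\Users\\ywf\\Desktop\\文本校对\\1.doc")) {
            HWPFDocument doc = new HWPFDocument(istream);
            // The range covers the whole document, excluding headers and footers.
            Range range = doc.getRange();
            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph poiPara = range.getParagraph(i);
                // Bound the loop with numCharacterRuns() instead of comparing end offsets.
                for (int j = 0; j < poiPara.numCharacterRuns(); j++) {
                    CharacterRun run = poiPara.getCharacterRun(j);
                    System.out.println("Color " + run.getColor()); // color index
                    System.out.println("Font size " + run.getFontSize()); // in half-points
                    System.out.println("Font Name " + run.getFontName()); // font name
                    System.out.println(run.isBold() + " " + run.isItalic() + " "
                            + run.getUnderlineCode()); // bold, italic, underline code
                    System.out.println("Text is " + run.text()); // text of the run
                }
            }
        }
    }
}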
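HWPFDocument only understands the binary OLE2 .doc container, while a .docx file is a ZIP package that has to go through XWPFDocument, and a file extension can be wrong. The sketch below guesses the container from the leading bytes using only the JDK. WordFormatSniffer and its Kind values are hypothetical names for illustration, not POI classes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper, not part of POI: guesses the Word container from its signature. */
final class WordFormatSniffer {

    enum Kind { DOC_OLE2, DOCX_ZIP, UNKNOWN }

    // OLE2 compound files (.doc) start with D0 CF 11 E0 A1 B1 1A E1.
    private static final int[] OLE2 = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP archives (.docx) start with "PK" followed by 03 04.
    private static final int[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    private WordFormatSniffer() {
    }

    /** Classifies a buffer holding the first bytes of a file. */
    static Kind sniff(byte[] head) {
        if (startsWith(head, OLE2)) {
            return Kind.DOC_OLE2;
        }
        if (startsWith(head, ZIP)) {
            return Kind.DOCX_ZIP;
        }
        return Kind.UNKNOWN;
    }

    /** Reads up to eight bytes from the file and classifies them. */
    static Kind sniff(Path file) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int r;
            while (n < head.length && (r = in.read(head, n, head.length - n)) > 0) {
                n += r;
            }
        }
        return sniff(Arrays.copyOf(head, n));
    }

    private static boolean startsWith(byte[] data, int[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller could open the file with HWPFDocument for DOC_OLE2 and with XWPFDocument for DOCX_ZIP. Other Office formats share the same two containers, so the check only rules out the wrong parser; it does not prove the file is a Word document.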

Parsing .docx files

Requires poi-ooxml/3.7.jar

http://poi.apache.org/document/quick-guide-xwpf.html

package test;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class ParseWordDocxTest {

    /**
     * Prints the formatting and text of every run in a .docx file.
     */
    public static void main(String[] args) throws Exception {
        try (InputStream istream = new FileInputStream(
                "e:\\Users\\ywf\\Desktop\\文本校对\\1.docx")) {
            XWPFDocument docx = new XWPFDocument(istream);
            List<XWPFParagraph> paragraphs = docx.getParagraphs();
            for (XWPFParagraph para : paragraphs) {
                List<XWPFRun> runs = para.getRuns();
                for (XWPFRun r : runs) {
                    System.out.println("Font color: " + r.getColor());
                    System.out.println("Font name: " + r.getFontFamily());
                    // getFontSize() is in points and returns -1 when the run sets no size.
                    System.out.println("Font size: " + r.getFontSize());
                    // Position 0 is the first text element of the run.
                    System.out.println("Text: " + r.getText(0));
                    System.out.println("Bold: " + r.isBold());
                    System.out.println("Italic: " + r.isItalic());
                }
            }
        }
    }
}
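For context on what XWPFDocument reads: a .docx file is a ZIP archive whose main body is stored in word/document.xml, with paragraphs as w:p elements, runs as w:r and text as w:t. The following is a minimal JDK-only sketch of that structure. It is not how POI is implemented; DocxBodySketch, paragraphs and sampleDocx are hypothetical names, and the sample archive is far too small for Word to open.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical JDK-only sketch of the .docx layout; not POI code. */
final class DocxBodySketch {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the concatenated w:t text of every w:p element in word/document.xml. */
    static List<String> paragraphs(byte[] docxBytes) {
        List<String> result = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DOCTYPE declarations to avoid XML external entity attacks.
                factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
                Document dom = factory.newDocumentBuilder().parse(zip);
                NodeList paras = dom.getElementsByTagNameNS(W_NS, "p");
                for (int i = 0; i < paras.getLength(); i++) {
                    NodeList texts = ((Element) paras.item(i)).getElementsByTagNameNS(W_NS, "t");
                    StringBuilder sb = new StringBuilder();
                    for (int j = 0; j < texts.getLength(); j++) {
                        sb.append(texts.item(j).getTextContent());
                    }
                    result.add(sb.toString());
                }
                break; // the main body has been read
            }
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("could not read word/document.xml", e);
        }
        return result;
    }

    /** Builds a tiny in-memory archive with one paragraph per argument (text is not XML-escaped). */
    static byte[] sampleDocx(String... texts) {
        StringBuilder xml = new StringBuilder("<w:document xmlns:w=\"" + W_NS + "\"><w:body>");
        for (String t : texts) {
            xml.append("<w:p><w:r><w:t>").append(t).append("</w:t></w:r></w:p>");
        }
        xml.append("</w:body></w:document>");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(paragraphs(sampleDocx("Hello", "World"))); // prints [Hello, World]
    }
}
```

Unlike this sketch, XWPFDocument also resolves styles, tables, headers and other parts of the package, which is why the POI API is the better choice for real documents. The sketch also collects every w:p it finds, including paragraphs nested in tables.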

