Parsing PDF in C#
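There are many ways to parse PDF files in C#; two of the more convenient libraries are iTextSharp and PDFBox.
If the PDF pages are images, such as scanned documents, OCR (optical character recognition) is required.
For text-based PDFs, the only approach I have found so far reads the content as plain strings; tables cannot be read as tables. Reportedly the PDF document structure has no concept of a table (a page is just positioned text and drawing instructions), so there is nothing table-like to read. If that is the case, table content has to be recovered by applying your own logic to the extracted strings.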
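As a rough illustration of that kind of custom parsing, the sketch below splits extracted text into rows and cells. It assumes that columns are separated by a tab or by two or more whitespace characters; that is an assumption for the example, not something iTextSharp or PDFBox guarantees, and real documents may need position-based logic instead. `TableTextParser` and `ParseRows` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TableTextParser
{
    // Assumption: a column gap is a tab or a run of two or more whitespace characters.
    // A single space is treated as part of the cell text.
    private static readonly Regex ColumnGap = new Regex(@"\s{2,}|\t");

    /// <summary>Splits extracted PDF text into rows of cells.</summary>
    public static List<string[]> ParseRows(string text)
    {
        var rows = new List<string[]>();
        if (string.IsNullOrEmpty(text)) return rows;

        foreach (string rawLine in text.Split('\n'))
        {
            string line = rawLine.Trim();   // also removes a trailing '\r'
            if (line.Length == 0) continue; // skip blank lines
            rows.Add(ColumnGap.Split(line));
        }
        return rows;
    }
}
```

Calling `TableTextParser.ParseRows(text)` on the string returned by a text extractor gives one `string[]` per non-empty line, which can then be mapped to whatever columns the document is known to contain.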
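iTextSharp is an open-source C# project. PDFBox is an open-source Java project that can run on the .NET platform with the help of IKVM.
To convert PDF pages to images I use Ghostscript, which can be called either through its API or from the Windows command line.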
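The command-line route can be sketched as follows, as a complement to the API-based helper class later in this post. The switches used (`-dNOPAUSE`, `-dBATCH`, `-dSAFER`, `-sDEVICE=jpeg`, `-r`, `-sOutputFile`) are standard Ghostscript options; the class name `GhostscriptCommandLine`, its method names, and the idea of passing the executable path as a parameter are assumptions made for this example.

```csharp
using System;
using System.Diagnostics;

public static class GhostscriptCommandLine
{
    /// <summary>Builds the argument string for a PDF-to-JPEG conversion.</summary>
    public static string BuildArguments(string inputPdf, string outputPattern, int dpi)
    {
        // -dNOPAUSE / -dBATCH: run without prompts and exit when finished.
        // -sDEVICE=jpeg: Ghostscript's JPEG output device.
        // -r: resolution in dpi.
        // A %d pattern in the output name (e.g. page-%03d.jpg) yields one file per page.
        return string.Format(
            "-dNOPAUSE -dBATCH -dSAFER -sDEVICE=jpeg -r{0} \"-sOutputFile={1}\" \"{2}\"",
            dpi, outputPattern, inputPdf);
    }

    /// <summary>Runs Ghostscript and returns its exit code.</summary>
    public static int Run(string gsExePath, string inputPdf, string outputPattern, int dpi)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = gsExePath, // e.g. the console executable gswin32c.exe
            Arguments = BuildArguments(inputPdf, outputPattern, dpi),
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

The location of the Ghostscript executable depends on the installation, so it is left to the caller. Starting a separate process avoids the P/Invoke marshalling that the API route needs, at the cost of process start-up time for each conversion.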
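For OCR I use Asprise, a commercial product with good recognition results. Other options are Microsoft Office Document Imaging (Office 2007) or OneNote (2010), both of which depend on Office components, and Tesseract (originally developed at HP, later sponsored by Google), which gave very poor results.
The parsing code for iTextSharp and PDFBox follows.
iTextSharp helper class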
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace eyuan
{
    public static class ITextSharpHandler
    {
        /// <summary>
        /// Reads the text content of a PDF file.
        /// </summary>
        /// <param name="fileName">Path of the PDF file.</param>
        /// <returns>The extracted text, or an empty string if the file cannot be loaded.</returns>
        public static string ReadPdf(string fileName)
        {
            if (!File.Exists(fileName))
            {
                LogHandler.LogWrite("The specified PDF file does not exist: " + fileName);
                return string.Empty;
            }

            StringBuilder sbFileContent = new StringBuilder();
            // Open the file.
            PdfReader reader = null;
            try
            {
                reader = new PdfReader(fileName);
            }
            catch (Exception ex)
            {
                LogHandler.LogWrite(string.Format("Failed to load PDF file {0}, error: {1}", fileName, ex.ToString()));
                return string.Empty;
            }

            try
            {
                // Loop over the pages (page numbers start at 1).
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    sbFileContent.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i));
                }
            }
            catch (Exception ex)
            {
                LogHandler.LogWrite(string.Format("Failed to parse PDF file {0}, error: {1}", fileName, ex.ToString()));
            }
            finally
            {
                reader.Close();
            }

            return sbFileContent.ToString();
        }

        /// <summary>
        /// Gets the number of pages in a PDF file.
        /// </summary>
        /// <param name="fileName">Path of the PDF file.</param>
        /// <returns>The page count, or -1 if the file is missing or cannot be loaded.</returns>
        public static int GetPdfPageCount(string fileName)
        {
            if (!File.Exists(fileName))
            {
                LogHandler.LogWrite("The specified PDF file does not exist: " + fileName);
                return -1;
            }

            // Open the file.
            PdfReader reader = null;
            try
            {
                reader = new PdfReader(fileName);
                return reader.NumberOfPages;
            }
            catch (Exception ex)
            {
                LogHandler.LogWrite(string.Format("Failed to load PDF file {0}, error: {1}", fileName, ex.ToString()));
                return -1;
            }
            finally
            {
                // Close the reader on every path (the original version never closed it here).
                if (reader != null)
                {
                    reader.Close();
                }
            }
        }
    }
}
PDFBox helper class
using org.pdfbox.pdmodel;
using org.pdfbox.util;
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace eyuan
{
    public static class PdfBoxHandler
    {
        /// <summary>
        /// Parses a PDF file with the PDFBox component.
        /// </summary>
        /// <param name="input">Path of the PDF file.</param>
        /// <returns>The text content of the PDF, or null on failure.</returns>
        public static string ReadPdf(string input)
        {
            if (!File.Exists(input))
            {
                LogHandler.LogWrite("The specified PDF file does not exist: " + input);
                return null;
            }

            PDDocument pdfdoc = null;
            string strPDFText = null;
            try
            {
                // Load the PDF file.
                pdfdoc = PDDocument.load(input);
            }
            catch (Exception ex)
            {
                LogHandler.LogWrite(string.Format("Failed to load PDF file {0}, error: {1}", input, ex.ToString()));
                return null;
            }

            try
            {
                // Extract the text of the whole document.
                PDFTextStripper stripper = new PDFTextStripper();
                strPDFText = stripper.getText(pdfdoc);
            }
            catch (Exception ex)
            {
                LogHandler.LogWrite(string.Format("Failed to parse PDF file {0}, error: {1}", input, ex.ToString()));
            }
            finally
            {
                pdfdoc.close();
            }

            return strPDFText;
        }
    }
}
In addition, here is the code that converts a PDF to images and then runs OCR on the images.
Converting PDF to JPEG images (Ghostscript helper class)
using System;
using System.Collections;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Text;

namespace eyuan
{
    public class GhostscriptHandler
    {
        #region GhostScript Import
        /// <summary>
        /// Creates a new instance of Ghostscript.
        /// This instance is passed to most other gsapi functions.
        /// The caller_handle will be provided to callback functions.
        /// At this stage, Ghostscript supports only one instance.
        /// </summary>
        [DllImport("gsdll32.dll", EntryPoint = "gsapi_new_instance")]
        private static extern int gsapi_new_instance(out IntPtr pinstance, IntPtr caller_handle);

        /// <summary>
        /// This is the important function that performs the conversion.
        /// </summary>
        [DllImport("gsdll32.dll", EntryPoint = "gsapi_init_with_args")]
        private static extern int gsapi_init_with_args(IntPtr instance, int argc, IntPtr argv);

        /// <summary>
        /// Exits the interpreter.
        /// This must be called on shutdown if gsapi_init_with_args() has been called,
        /// and just before gsapi_delete_instance().
        /// </summary>
        [DllImport("gsdll32.dll", EntryPoint = "gsapi_exit")]
        private static extern int gsapi_exit(IntPtr instance);

        /// <summary>
        /// Destroys an instance of Ghostscript.
        /// Before you call this, Ghostscript must have finished.
        /// If Ghostscript has been initialised, you must call gsapi_exit before gsapi_delete_instance.
        /// </summary>
        [DllImport("gsdll32.dll", EntryPoint = "gsapi_delete_instance")]
        private static extern void gsapi_delete_instance(IntPtr instance);
        #endregion

        #region Fields
        private string _sDeviceFormat;
        private int _iWidth;
        private int _iHeight;
        private int _iResolutionX;
        private int _iResolutionY;
        private int _iJPEGQuality;
        private Boolean _bFitPage;
        private IntPtr _objHandle;
        #endregion

        #region Properties
        /// <summary>Output format (Ghostscript device name).</summary>
        public string OutputFormat
        {
            get { return _sDeviceFormat; }
            set { _sDeviceFormat = value; }
        }

        /// <summary>Output width in pixels (used for the -g switch).</summary>
        public int Width
        {
            get { return _iWidth; }
            set { _iWidth = value; }
        }

        /// <summary>Output height in pixels (used for the -g switch).</summary>
        public int Height
        {
            get { return _iHeight; }
            set { _iHeight = value; }
        }

        /// <summary>Horizontal resolution (used for the -r switch).</summary>
        public int ResolutionX
        {
            get { return _iResolutionX; }
            set { _iResolutionX = value; }
        }

        /// <summary>Vertical resolution (used for the -r switch).</summary>
        public int ResolutionY
        {
            get { return _iResolutionY; }
            set { _iResolutionY = value; }
        }

        /// <summary>Whether to pass -dPDFFitPage.</summary>
        public Boolean FitPage
        {
            get { return _bFitPage; }
            set { _bFitPage = value; }
        }

        /// <summary>JPEG compression quality.</summary>
        public int JPEGQuality
        {
            get { return _iJPEGQuality; }
            set { _iJPEGQuality = value; }
        }
        #endregion

        #region Constructors
        /// <summary>
        /// Creates a handler that passes the given caller handle to Ghostscript.
        /// </summary>
        /// <param name="objHandle">Caller handle provided to Ghostscript callback functions.</param>
        public GhostscriptHandler(IntPtr objHandle)
        {
            _objHandle = objHandle;
        }

        public GhostscriptHandler()
        {
            _objHandle = IntPtr.Zero;
        }
        #endregion

        #region String handling
        /// <summary>
        /// Converts a Unicode string to a null-terminated ANSI string.
        /// </summary>
        /// <param name="str">Unicode string.</param>
        /// <returns>ANSI string as a byte array.</returns>
        private byte[] StringToAnsiZ(string str)
        {
            // Convert a Unicode string to a null-terminated ANSI string for Ghostscript.
            // The result is stored in a byte array. Later you will need to convert
            // this byte array to a pointer with GCHandle.Alloc(XXXX, GCHandleType.Pinned)
            // and GCHandle.AddrOfPinnedObject().
            // Note: each char is truncated to one byte, so non-ASCII paths are not preserved.
            int intElementCount = str.Length;
            byte[] aAnsi = new byte[intElementCount + 1];
            for (int intCounter = 0; intCounter < intElementCount; intCounter++)
            {
                aAnsi[intCounter] = (byte)str[intCounter];
            }
            aAnsi[intElementCount] = 0; // terminating null
            return aAnsi;
        }
        #endregion

        #region Convert file
        /// <summary>
        /// Converts a range of PDF pages to an image file.
        /// </summary>
        /// <param name="inputFile">Path of the input PDF file.</param>
        /// <param name="outputFile">Path of the output JPEG image.</param>
        /// <param name="firstPage">First page to convert.</param>
        /// <param name="lastPage">Last page to convert.</param>
        /// <param name="deviceFormat">Output format (Ghostscript device name).</param>
        /// <param name="width">Width; GetGeneratedArgs passes it on as the horizontal resolution.</param>
        /// <param name="height">Height; GetGeneratedArgs passes it on as the vertical resolution.</param>
        public void Convert(string inputFile, string outputFile,
            int firstPage, int lastPage, string deviceFormat, int width, int height)
        {
            // Check that the input file exists.
            if (!System.IO.File.Exists(inputFile))
            {
                LogHandler.LogWrite(string.Format("File {0} does not exist", inputFile));
                return;
            }

            string[] sArgs = GetGeneratedArgs(inputFile, outputFile,
                firstPage, lastPage, deviceFormat, width, height);
            // Convert the Unicode strings to null-terminated ANSI byte arrays,
            // then get pointers to the byte arrays.
            int intElementCount = sArgs.Length;
            object[] aAnsiArgs = new object[intElementCount];
            IntPtr[] aPtrArgs = new IntPtr[intElementCount];
            GCHandle[] aGCHandle = new GCHandle[intElementCount];
            // Pin each converted argument and store a pointer to it.
            for (int intCounter = 0; intCounter < intElementCount; intCounter++)
            {
                aAnsiArgs[intCounter] = StringToAnsiZ(sArgs[intCounter]);
                aGCHandle[intCounter] = GCHandle.Alloc(aAnsiArgs[intCounter], GCHandleType.Pinned);
                aPtrArgs[intCounter] = aGCHandle[intCounter].AddrOfPinnedObject();
            }
            // Pin the array of argument pointers.
            GCHandle gchandleArgs = GCHandle.Alloc(aPtrArgs, GCHandleType.Pinned);
            IntPtr intptrArgs = gchandleArgs.AddrOfPinnedObject();

            IntPtr intGSInstanceHandle = IntPtr.Zero;
            bool instanceCreated = false;
            try
            {
                int intReturn = gsapi_new_instance(out intGSInstanceHandle, _objHandle);
                instanceCreated = intReturn >= 0;
                if (!instanceCreated)
                {
                    LogHandler.LogWrite(string.Format("Could not create a Ghostscript instance, code: {0}", intReturn));
                    return;
                }
                intReturn = gsapi_init_with_args(intGSInstanceHandle, intElementCount, intptrArgs);
                if (intReturn < 0)
                {
                    LogHandler.LogWrite(string.Format("Ghostscript returned code {1} for PDF file {0}", inputFile, intReturn));
                }
            }
            catch (Exception ex)
            {
                LogHandler.LogWrite(string.Format("Failed to convert PDF file {0}.\nError: {1}", inputFile, ex.ToString()));
            }
            finally
            {
                // Free every pinned argument. The original loop ran up to intReturn,
                // which is 0 on success, so the handles were never released.
                for (int intCounter = 0; intCounter < intElementCount; intCounter++)
                {
                    aGCHandle[intCounter].Free();
                }
                gchandleArgs.Free();
                if (instanceCreated)
                {
                    gsapi_exit(intGSInstanceHandle);
                    gsapi_delete_instance(intGSInstanceHandle);
                }
            }
        }
        #endregion

        #region Argument generation
        /// <summary>
        /// Builds the Ghostscript argument list for a conversion.
        /// </summary>
        /// <param name="inputFile">Path of the input PDF file.</param>
        /// <param name="outputFile">Path of the output image.</param>
        /// <param name="firstPage">First page to convert.</param>
        /// <param name="lastPage">Last page to convert.</param>
        /// <param name="deviceFormat">Ghostscript device name.</param>
        /// <param name="width">Stored as the horizontal resolution.</param>
        /// <param name="height">Stored as the vertical resolution.</param>
        /// <returns>The arguments in the order Ghostscript expects.</returns>
        private string[] GetGeneratedArgs(string inputFile, string outputFile,
            int firstPage, int lastPage, string deviceFormat, int width, int height)
        {
            this._sDeviceFormat = deviceFormat;
            this._iResolutionX = width;
            this._iResolutionY = height;
            // Count how many extra args are needed - HRangel - 11/29/2006, 3:13:43 PM
            ArrayList lstExtraArgs = new ArrayList();
            // Ghostscript's JPEG device is named "jpeg"; -dJPEGQ accepts 0 to 100.
            if (_sDeviceFormat == "jpeg" && _iJPEGQuality > 0 && _iJPEGQuality < 101)
                lstExtraArgs.Add("-dJPEGQ=" + _iJPEGQuality);
            if (_iWidth > 0 && _iHeight > 0)
                lstExtraArgs.Add("-g" + _iWidth + "x" + _iHeight);
            if (_bFitPage)
                lstExtraArgs.Add("-dPDFFitPage");
            if (_iResolutionX > 0)
            {
                if (_iResolutionY > 0)
                    lstExtraArgs.Add("-r" + _iResolutionX + "x" + _iResolutionY);
                else
                    lstExtraArgs.Add("-r" + _iResolutionX);
            }
            // Load fixed args - HRangel - 11/29/2006, 3:34:02 PM
            // 15 fixed switches plus the output file and the input file.
            int iFixedCount = 17;
            int iExtraArgsCount = lstExtraArgs.Count;
            string[] args = new string[iFixedCount + lstExtraArgs.Count];
            args[0] = "pdf2img";                 // program name; has little real use
            args[1] = "-dNOPAUSE";               // do not prompt and pause for each page
            args[2] = "-dBATCH";                 // keep gs from going into interactive mode
            //args[3] = "-dSAFER";
            args[3] = "-dPARANOIDSAFER";         // run this command in safe mode
            args[4] = "-sDEVICE=" + _sDeviceFormat; // export format, e.g. jpeg
            args[5] = "-q";                      // keep gs from writing to standard output
            args[6] = "-dQUIET";
            args[7] = "-dNOPROMPT";              // disable prompts for user interaction
            args[8] = "-dMaxBitmap=500000000";   // set high for better performance
            // Set the starting and ending pages.
            args[9] = String.Format("-dFirstPage={0}", firstPage);
            args[10] = String.Format("-dLastPage={0}", lastPage);
            // Configure the output anti-aliasing.
            args[11] = "-dAlignToPixels=0";
            args[12] = "-dGridFitTT=0";
            args[13] = "-dTextAlphaBits=4";
            args[14] = "-dGraphicsAlphaBits=4";
            // For a complete list see:
            // http://pages.cs.wisc.edu/~ghost/doc/cvs/Devices.htm
            // Fill in the extra parameters.
            for (int i = 0; i < iExtraArgsCount; i++)
            {
                args[15 + i] = (string)lstExtraArgs[i];
            }
            // Fill in the output file and the input file.
            args[15 + iExtraArgsCount] = string.Format("-sOutputFile={0}", outputFile);
            args[16 + iExtraArgsCount] = string.Format("{0}", inputFile);
            return args;
        }
        #endregion
    }
}
OCR: recognizing text in an image (Asprise helper class)
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Text;

namespace PDFCaptureService
{
    public static class AspriseOCRHandler
    {
        #region External imports
        [DllImport("AspriseOCR.dll", EntryPoint = "OCR", CallingConvention = CallingConvention.Cdecl)]
        public static extern IntPtr OCR(string file, int type);

        [DllImport("AspriseOCR.dll", EntryPoint = "OCRpart", CallingConvention = CallingConvention.Cdecl)]
        static extern IntPtr OCRpart(string file, int type, int startX, int startY, int width, int height);

        [DllImport("AspriseOCR.dll", EntryPoint = "OCRBarCodes", CallingConvention = CallingConvention.Cdecl)]
        static extern IntPtr OCRBarCodes(string file, int type);

        [DllImport("AspriseOCR.dll", EntryPoint = "OCRpartBarCodes", CallingConvention = CallingConvention.Cdecl)]
        static extern IntPtr OCRpartBarCodes(string file, int type, int startX, int startY, int width, int height);
        #endregion

        /// <summary>
        /// Runs OCR on an image file.
        /// </summary>
        /// <param name="fileName">Path of the image file.</param>
        /// <returns>The recognized text.</returns>
        public static string ReadImage(string fileName)
        {
            // A type of -1 lets the library detect the image type.
            IntPtr ptrFileContent = OCR(fileName, -1);
            return Marshal.PtrToStringAnsi(ptrFileContent);
        }
    }
}
Usage example
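GhostscriptHandler ghostscriptHandler = new GhostscriptHandler();
string tempJpgFileName = string.Format(GhostScriptImageName, Guid.NewGuid().ToString());
int pdfPageCount = ITextSharpHandler.GetPdfPageCount(fileName);
// Assumed values: start at page 1 and render at 300 dpi (the last two arguments
// are passed through as the x/y resolution); adjust them to your documents.
ghostscriptHandler.Convert(fileName, tempJpgFileName, 1, pdfPageCount, "jpeg", 300, 300);
// Run OCR on the generated image rather than on the PDF itself.
fileContent = AspriseOCRHandler.ReadImage(tempJpgFileName);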
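The pieces above can be combined so that OCR runs only when a PDF has no usable embedded text. The following is a minimal sketch under stated assumptions: `PdfTextPipeline`, `Extract`, and the `minTextLength` threshold are hypothetical, and treating "very little non-whitespace text" as "scanned document" is a heuristic, not a rule from any of the libraries. The two steps are passed in as delegates so that the sketch does not depend on a particular PDF or OCR library.

```csharp
using System;

public static class PdfTextPipeline
{
    /// <summary>
    /// Returns the embedded text when there is enough of it; otherwise falls back to OCR.
    /// </summary>
    /// <param name="pdfPath">Path of the PDF file.</param>
    /// <param name="readText">Text extractor, e.g. a wrapper around a PDF library.</param>
    /// <param name="runOcr">Fallback that renders the PDF to an image and runs OCR.</param>
    /// <param name="minTextLength">Assumed threshold of non-whitespace characters.</param>
    public static string Extract(string pdfPath, Func<string, string> readText,
        Func<string, string> runOcr, int minTextLength)
    {
        string text = readText(pdfPath) ?? string.Empty;

        // Count the non-whitespace characters in the extracted text.
        int visible = 0;
        foreach (char c in text)
        {
            if (!char.IsWhiteSpace(c)) visible++;
        }

        // Heuristic: too little text suggests image-only (scanned) pages.
        return visible >= minTextLength ? text : (runOcr(pdfPath) ?? string.Empty);
    }
}
```

With the helper classes from this post, a call could look like `PdfTextPipeline.Extract(fileName, ITextSharpHandler.ReadPdf, OcrPdf, 20)`, where `OcrPdf` is a hypothetical method wrapping the Ghostscript conversion and the Asprise call from the usage example.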