Coordinate Systems for Content in a PDF
PDF Page Coordinates (page size, field placement, etc.)
AcroForm, Basics, Automation
Page coordinates are used to add fields and annotations to a page, move fields and annotations, resize page boundaries, locate words on a page, and for any other operation that involves page geometry. So understanding page coordinates is critical to many common automation activities, and it is also very useful for form scripting.
Contents
- Coordinate Systems (Page Geometry)
- Page Rotations
- Field and Annotation Rotations
- Getting and Setting Page Size
- Placing and Moving Form Fields
- Converting Coordinates
- Finding Words, and Handling Word Locations
Related Resources
- Sample Files that demonstrate page geometry operations
- Bouncing Button
- 2D Matrix Multiplier (Discussed in Converting Coordinates)
- Swat the Fly Game (Variation on Bouncing Button)
- Automation tools that demonstrate page geometry operations
Coordinate Systems
The coordinate system on a PDF page is called User Space. This is a flat 2-dimensional space, just like a piece of paper, and in fact that's a good way to think about it. The units of User Space are called "points" and there are 72 points/inch. The origin, or 0,0 point, is located in the bottom left hand corner of the page. Horizontal, or X, coordinates increase to the right and vertical, or Y, coordinates increase towards the top (Figure 1).
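Since there are 72 points per inch, converting between the two units is a single multiplication or division. The following is a minimal sketch; the helper names inchesToPoints and pointsToInches are made up for illustration and are not part of the Acrobat JavaScript API.

```javascript
// Hypothetical helpers (not Acrobat built-ins) for converting units.
var POINTS_PER_INCH = 72;

function inchesToPoints(nInches) {
    return nInches * POINTS_PER_INCH;
}

function pointsToInches(nPoints) {
    return nPoints / POINTS_PER_INCH;
}

// A US Letter page is 8.5 x 11 inches
var nLetterWidth = inchesToPoints(8.5);   // 612 points
var nLetterHeight = inchesToPoints(11);   // 792 points
```

These are the same 612 and 792 values that appear in the page box rectangles for a letter size page.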
However, PDF pages are a bit more complex than they might seem from the user's perspective. The edges of a page are bound by several different page boxes (Figure 1). Ideally these boxes are concentric (each one fully inside the next) as shown in Figure 1, but this arrangement is not absolutely required. Each of these boxes has a different meaning and all of them except for the BBox can be modified with a script.
The outermost box is the Media Box. This box represents the full page size. Originally this meant the paper size the page was to be printed on, and all the other bounding boxes are inside this one. The Media Box doesn't have quite the same importance for an interactive document displayed on the screen. But it is still very important to page geometry, as will be explained below.
The next 3 boxes, Art, Bleed, and Trim, have special meaning to printers. They represent important stages in the printing of a document, but are invisible to the average user and unimportant for our purposes here, so they won't be discussed.
The BBox, or the Bounding Box, is the smallest rectangle that can enclose all the content on the page. This box is calculated by Acrobat and so it cannot be modified by a script. Ideally, a line drawn around the BBox would touch the edges of the visible content on all four sides of the page. Unfortunately this is rarely true. PDF content often includes invisible boundaries that extend beyond the visible elements. These empty areas are usually the result of an inefficient, or in some cases a downright poor, PDF creation process.
The two most important boxes for scripting, and handling page geometry in general, are the Media Box and the Crop Box. As explained earlier, the Media Box is meant to represent what the user would see if they printed out the PDF page. The Crop Box is what the user sees on the computer screen. These two boxes are often exactly the same, but they can of course be different sizes, and they can also be different rotations. The only restriction is that the Crop Box is always inside the Media Box. If a script attempts to make the Media Box smaller than the Crop Box, then Acrobat will automatically adjust the size of the Crop Box to be smaller. And vice versa, if a script tries to make the Crop Box larger than the Media Box, Acrobat will automatically grow the Media Box.
To illustrate this idea imagine a standard letter size page designed to be viewed in landscape mode. For example it might display a very wide table of data. On the computer you'd want to see the page rotated with the long side down so you could see the content properly. But the page would still need to be printed in the regular letter size page orientation. Thus, we have two situations, the rotated cropped view that the user sees on the screen, and the unrotated full size page that gets printed.
To handle these two situations Acrobat JavaScript uses two different coordinate systems: Default User Space, which represents the printed view, and Rotated User Space, which represents the on-screen view. Default User Space is measured with the Media Box. In Default User Space the origin (0,0 point) is always the bottom left hand corner of the printed page. Rotated User Space is measured with the Crop Box. In Rotated User Space the origin is always the bottom left hand corner of the page shown on the screen. This difference is illustrated in Figure 3 using the example discussed above.
You'll find in most documents that both these User Spaces are exactly the same, which means that the Crop and Media Boxes are also exactly the same. However, unless you know for sure that there isn't a difference, all code must be written to take rotations and cropping into consideration.
Page Rotations
Because rotation is part of the difference between the User Spaces, it's useful to understand a bit about how page rotation works in Acrobat and PDF. Pages can only be rotated in 90° increments. For example, a page cannot be skewed 45°. Each page in a PDF document has its own unique geometry. The size and rotation of pages in a PDF are unrelated to one another. Of course, in real documents it's standard practice to make all the pages the same. But keep in mind that there is no guarantee. Code that works on every page in a PDF has to treat each page individually, unless it's known up front that document page geometry is homogeneous.
Unfortunately, the Acrobat user interface does not provide information on page rotation. Even though pages can be rotated with the "Document > Rotate Pages..." menu item, there is no feedback to indicate the current page rotation. Fortunately we can get this information from JavaScript with the doc.getPageRotation() function. The following line of code returns the page rotation for the current page in the current document. Try running it in the Console Window.
var nRot = this.getPageRotation(this.pageNum);
The doc.getPageRotation() function takes a single input, the page number, and returns one of 0, 90, 180, or 270. To set the page rotation use doc.setPageRotations(). This function is a little different because it operates on a range of pages. It takes 3 input arguments: a start page, an end page, and the rotation. The rotation must be one of 0, 90, 180, or 270. The function will throw an exception for invalid rotations. For example, it will not take negative rotations or 360+ rotations, even though these are equivalent to the valid input values. And of course it will not accept a value such as 45. The following code rotates the first three pages in the PDF by 90°.
// nStart = 0; first page in PDF
// nEnd = 2; page 3 in PDF
// nRotation = 90°
this.setPageRotations(0,2,90);
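Because doc.setPageRotations() rejects negative and 360+ values, a script that computes rotations may want to normalize them first. The following is a minimal sketch of one way to do that; normalizeRotation is a hypothetical helper name, not an Acrobat function.

```javascript
// Hypothetical helper (not an Acrobat built-in): maps any multiple of 90
// onto one of the four values that doc.setPageRotations() accepts.
function normalizeRotation(nDegrees) {
    if (nDegrees % 90 !== 0) {
        throw new Error("Rotation must be a multiple of 90: " + nDegrees);
    }
    // The double modulo turns negative angles into positive ones
    return ((nDegrees % 360) + 360) % 360;
}

var nRotA = normalizeRotation(-90);  // 270
var nRotB = normalizeRotation(450);  // 90
// In Acrobat: this.setPageRotations(0, 2, normalizeRotation(-90));
```

Values such as 45 are rejected by the helper itself, so the error surfaces before the Acrobat call is made.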
Field and Annotation Rotation
When a page is rotated, the form fields and markup annotations on the page are rotated with it, so that the presentation of all the elements on the page stays in sync. The rotation value of markup annotations cannot be seen or controlled from the Acrobat UI. For all practical purposes it is whatever Acrobat decides to make it. But fields have a rotation property that can be set in the Properties Dialog (Figure 4). Setting the rotation of a field sets the orientation of the content (the text); it does not rotate the field box.
In JavaScript, the rotation of fields can be analyzed and set through the "field.rotation" property.
// Get Field Rotation.
var oFld = this.getField("MyText");
var nCurRot = oFld.rotation;
// Rotate 90° more
var nNewRot = nCurRot + 90;
// Fix up for large rotations
if(nNewRot >= 360)
    nNewRot -= 360;
// new Rotation
oFld.rotation = nNewRot;
The rotation value shown on the Properties Dialog is different from the value acquired in the JavaScript model. The Properties Dialog value shows the rotation of the field relative to the view of the page on screen. The rotation acquired from the Field's JavaScript object is the real rotation of the element that you'll need for doing geometry calculations. It's the rotation of the field in Default User Space, or to put it qualitatively, it's the field's rotation relative to the orientation of the printed page. And just like the rotation set in the Properties Dialog, it only rotates text and graphics shown in the field. It does not rotate the field boundaries.
Getting and Setting Page Size
As we've already seen, a page really has many different sizes depending on how we look at it, or to be more specific, which bounding box and coordinate space we're looking at. The doc.getPageBox() function is used to acquire a bounding box for a specific page. This function will get the bounding box for any of the boxes shown in Figure 2. But the coordinates returned are always in Rotated User Space, which represents the view of the page the user sees on the screen. Because JavaScript operates in the viewer, Rotated User Space tends to be the standard coordinate system for most operations. The following code acquires the Crop Box for the first page in the current document.
var aCropRect = this.getPageBox("Crop",0);
This function has only two inputs, the name of the box we want to acquire, and the zero-based page number. The return value is an array of four values representing the coordinates of the four edges of the page rectangle.
var aPageBox = [nLeft, nTop, nRight, nBottom];
Remember that the origin of the Crop Box is always the lower left corner in Rotated User Space (Figure 3). So the Left and Bottom values in the Crop Box rectangle will always be 0. The following example shows the return values for the Page Boxes shown in Figure 1 and Figure 3.
//** For Figure 1
// Original Page is 8.5x11 inches, no rotation or cropping
var aCropRect = this.getPageBox("Crop",0);
// Returns: [0,792,612,0]
var aMediaRect = this.getPageBox("Media",0);
// Returns: [0,792,612,0]
//** For Figure 3
// Original Page is 8.5x11 inches, rotated 90°,
// and cropped 1/2 inch on all sides
var aCropRect = this.getPageBox("Crop",0);
// Returns: [0,540,720,0]
var aMediaRect = this.getPageBox("Media",0);
// Returns: [-36,576,756,-36]
Notice that in the example for Figure 1 the Crop and Media boxes are the same. This means that Default and Rotated User Space are also the same.
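Since every page box is a [Left, Top, Right, Bottom] array, the width and height of a box fall out of two subtractions. The sketch below uses a hypothetical helper, getRectSize, applied to the Figure 3 values listed above; it is not part of the Acrobat API.

```javascript
// Hypothetical helper: size of a [left, top, right, bottom] rectangle
function getRectSize(aRect) {
    return {
        width: aRect[2] - aRect[0],   // right - left
        height: aRect[1] - aRect[3]   // top - bottom
    };
}

// Crop Box from the Figure 3 example (rotated, cropped letter page)
var oCropSize = getRectSize([0, 540, 720, 0]);
// oCropSize.width is 720 (10 inches), oCropSize.height is 540 (7.5 inches)

// Media Box from the same example
var oMediaSize = getRectSize([-36, 576, 756, -36]);
// oMediaSize.width is 792 (11 inches), oMediaSize.height is 612 (8.5 inches)
```

The Media Box still measures a full letter page, just turned on its side, even though its coordinates are offset by the crop.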
All of the bounding boxes, except for BBox, can be viewed and modified from the Cropping tool (Figure 5), the "Document > Crop Pages..." menu item in Acrobat.
Notice that changing the page size changes the Media Box. This reinforces the idea that the Media Box is the base page size (or paper size), and all the other boundaries are variations on the Media Box. This is why Default User Space is based on the Media Box. What the user sees on the screen, Rotated User Space, is a variation. This gives us some perspective on Adobe's thinking at the time they created PDF, i.e. that it was primarily a print, rather than screen, based model.
In JavaScript, the doc.setPageBoxes() function is used to set a bounding box for a range of pages. The following code uses this function to crop the current page by 1/2 inch on all sides.
// Original Page is 8.5x11 inches
// First, acquire the current Media Box
var aMediaRect = this.getPageBox("Media",this.pageNum);
// Returns: [0,792,612,0]
// Now remove 1/2 inch, 36 points from all sides
var aNewCrop = aMediaRect;
aNewCrop[0] += 36; // Move Left edge to the right
aNewCrop[1] -= 36; // Move Top edge down
aNewCrop[2] -= 36; // Move Right edge to the Left
aNewCrop[3] += 36; // Move Bottom edge up
// aNewCrop is now: [36,756,576,36]
// Set Crop Box of the current page to the new rectangle
this.setPageBoxes("Crop",this.pageNum,this.pageNum, aNewCrop);
// Get the new Crop Box
var aCropRect = this.getPageBox("Crop",this.pageNum);
// Returns: [0,720,540,0]
// Get the new Media Box
var aMediaRect = this.getPageBox("Media",this.pageNum);
// Returns: [-36,756,576,-36]
The script gets the current Media Box, shrinks it by 1/2 inch on all sides, and then applies that new rectangle to the Crop Box. Notice that the setPageBoxes function takes four inputs, the name of the Page Box that will be changed, the start and end page numbers (zero based), and the new rectangle. The rectangle is always in Rotated User Space.
The last two lines of the script are purely for analysis. They acquire the Crop and Media Boxes after the change has been made. In the code, the Crop Box was set to [36,756,576,36], but after the change the new Crop Box is [0,720,540,0]. The setPageBoxes function works exactly the way we expect. By comparing the Media and Crop Boxes we can see that the boundaries of the Crop Box are 1/2 inch inside the Media Box. But remember that the getPageBox function returns Rotated User Space coordinates, and in Rotated User Space the bottom left hand corner of the Crop Box is always (0,0). No matter what values the Crop Box is set to, Acrobat always readjusts all the page boxes so that the bottom left corner of the Crop Box is 0,0 in Rotated User Space.
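The arithmetic behind these numbers can be reproduced outside Acrobat. The sketch below is only a model of the behavior described here for an unrotated page, not Acrobat's implementation; insetRect and shiftToCropOrigin are hypothetical helper names. Note that insetRect returns a new array instead of modifying the rectangle it was given.

```javascript
// Hypothetical helper: returns a NEW rectangle shrunk by nInset on all sides.
// Rectangles are [left, top, right, bottom].
function insetRect(aRect, nInset) {
    return [aRect[0] + nInset, aRect[1] - nInset,
            aRect[2] - nInset, aRect[3] + nInset];
}

// Hypothetical model of the re-adjustment described above: shift a rectangle
// so that the bottom left corner of the Crop Box lands on (0,0).
function shiftToCropOrigin(aRect, aCrop) {
    var nDx = aCrop[0];  // Crop Box left
    var nDy = aCrop[3];  // Crop Box bottom
    return [aRect[0] - nDx, aRect[1] - nDy, aRect[2] - nDx, aRect[3] - nDy];
}

var aMedia = [0, 792, 612, 0];
var aCrop = insetRect(aMedia, 36);                 // [36,756,576,36]
var aCropSeen = shiftToCropOrigin(aCrop, aCrop);   // [0,720,540,0]
var aMediaSeen = shiftToCropOrigin(aMedia, aCrop); // [-36,756,576,-36]
```

The two "seen" rectangles match the values that getPageBox returns after the crop in the script above.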
Placing Form Fields
One common automation activity is placing form fields on a page, for example putting a date field in the document header or footer, or navigation buttons along one of the edges of the page. For proper placement, the script needs to first locate one or more of the page edges and then calculate a field placement rectangle. Interactive elements like fields are meant to be used when the PDF is displayed on the screen, as opposed to printed, so all field geometry operations use Rotated User Space. Also, fields are usually placed relative to the Crop Box so that they are visible to the user. Given this information, the following script places a text field along the bottom edge of the Crop Box on the first page of the document.
// First, acquire the Crop Box for Page #1
var aCropRect = this.getPageBox("Crop",0);
// Calculate the placement rectangle for the text box.
// Field is centered so find the mid point of the page
var nMiddle = aCropRect[2]/2;
// Field is along bottom edge of crop, 150 points wide, 20 points tall
var rectFld = [];
rectFld[0] = nMiddle-75; // Left side is 75pts to the left of the middle
rectFld[1] = aCropRect[3]+20; // Top side is 20pts above the bottom
rectFld[2] = nMiddle+75; // Right side is 75pts to the right of the middle
rectFld[3] = aCropRect[3]; // Bottom is the same as the Crop Box
// Add Field
var oFld = this.addField("MyDateFld", "text", 0, rectFld);
oFld.value = util.printd("ddd mmm d, yyyy",new Date());
oFld.alignment = "center";
The doc.addField() function takes 4 inputs: the name of the field, the field type, the zero-based page number, and the field placement rectangle in Rotated User Space coordinates. This rectangle is set up in exactly the same way as the Page Box rectangles. It's an array of four numbers where each number is a coordinate for one edge of the field, in the order Left, Top, Right, Bottom.
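The placement calculation can be pulled into a small reusable function. The following is a sketch under the same assumptions as the script above; getBottomCenterRect is a hypothetical helper name, and it uses the midpoint of the left and right edges rather than assuming the left edge is 0.

```javascript
// Hypothetical helper: rectangle of the given size, centered horizontally
// and resting on the bottom edge of the Crop Box.
function getBottomCenterRect(aCropRect, nWidth, nHeight) {
    var nMiddle = (aCropRect[0] + aCropRect[2]) / 2;
    return [nMiddle - nWidth / 2,     // Left
            aCropRect[3] + nHeight,   // Top
            nMiddle + nWidth / 2,     // Right
            aCropRect[3]];            // Bottom
}

// For an uncropped letter page this matches the script above
var rectFld = getBottomCenterRect([0, 792, 612, 0], 150, 20);
// rectFld is [231,20,381,0]
// In Acrobat: this.addField("MyDateFld", "text", 0, rectFld);
```

Changing the width and height arguments resizes the field without touching the centering logic.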
Adjusting the location and size of the field is a simple matter of adjusting the parameters used to calculate the field placement rectangle. Fields can even be placed outside, or partially outside, the Crop Box. If there is room in the viewing area Acrobat will draw fields that are outside the Crop Box, so this placement can be used to create visual effects. Be careful though, because this may be a bug in older versions and Adobe could easily make it so that nothing is drawn outside the crop area. But one legitimate use for placing fields off the page is for hidden fields you don't want the user interacting with.
The bounding rectangle of a form field is accessed with the field.rect property. In addition to being able to get the field location, the field can also be moved at any time by changing this rectangle, even when the PDF is displayed in Reader. Take a look at the Bouncing Button sample file. To try out a simple example of moving a field, place the following code into the Mouse Up event of a button field.
var rectFld = event.target.rect;
var newRect = [rectFld[0]+10, rectFld[1]+10, rectFld[2]+10, rectFld[3]+10];
event.target.rect = newRect;
The script adds 10 points to all edges of the button's rectangle, which moves it towards the top right of the page. Since movement is done in Rotated User Space coordinates, it doesn't matter what rotation is applied to the page. It will always move the button towards the top right of the user's view. Try rotating the page and then clicking on the button.
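The same move can be written as a general helper that shifts a rectangle by separate horizontal and vertical amounts. This is a minimal sketch; offsetRect is a hypothetical name, not an Acrobat function.

```javascript
// Hypothetical helper: returns a copy of a [left, top, right, bottom]
// rectangle moved by nDx points horizontally and nDy points vertically.
function offsetRect(aRect, nDx, nDy) {
    return [aRect[0] + nDx, aRect[1] + nDy, aRect[2] + nDx, aRect[3] + nDy];
}

var rectMoved = offsetRect([100, 120, 200, 100], 10, 10);
// rectMoved is [110,130,210,110]
// In a button's Mouse Up event:
// event.target.rect = offsetRect(event.target.rect, 10, 10);
```

Negative values move the field left or down, and the size of the field is unchanged because both edges in each direction move by the same amount.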
Converting Coordinates
All of the code shown so far, sizing pages and placing fields, has been done using Rotated User Space. But there are many functions and properties in the Acrobat JavaScript DOM that use Default User Space. Two very useful operations in particular are placing annotations and finding word locations. Using these operations in a script requires being able to convert back and forth between the two coordinate spaces. Fortunately Adobe has provided us with an easy, partially documented way to do coordinate conversions, the Matrix2D object.
It's partially documented because, while you won't find it listed in the Acrobat JavaScript Reference, it is used in a couple of official examples within the reference. Matrix2D is a generic JavaScript object defined in the "JSByteCodeWin.bin" file, which can be found in the Acrobat JavaScript folder. This is a binary file, so you can't open it up to find the code that defines the Matrix2D object. But if you are really interested in seeing the code, run the following line in the Console Window.
Matrix2D.toString().replace(/([\;\{])/g,"$1\n")
The replace function is used to neaten up the code a bit. But you'll still need to do some hand editing to be able to make sense of it.
The main purpose of this object is to perform 3x3 matrix multiplications. This is the standard methodology for transforming and moving objects about in 2-D space. If you're interested in mathematics and 2D graphics transformations then see the Matrix Multiplier example file.
Fortunately for the rest of us, the Matrix2D object contains some easy to use built-in functions for converting coordinates between our two User Spaces. The following code shows how to setup the Matrix2D object for these transformations. This code, or variations on it, will be used in all situations where coordinates need to be converted from one User Space to the other.
// Create matrix that converts from Rotated User Space
// to Default User Space
var mxToDefaultCoords = (new Matrix2D()).fromRotated(this,this.pageNum);
// To transform a point or rect
var rectDefault = mxToDefaultCoords.transform(rectRotated);
// Create Matrix that converts from Default User Space
// to Rotated User Space
var mxToRotatedCoords = mxToDefaultCoords.invert();
// To transform a point or rect
var rectRotated = mxToRotatedCoords.transform(rectDefault);
The matrix for converting to Default User Space is created with the Matrix2D.fromRotated() function. This function has two inputs, the document object and the zero-based page within the document where the transform is needed. To convert in the other direction, from Default to Rotated User Space, the first matrix is inverted using the Matrix2D.invert() function.
Coordinate conversions are performed with the Matrix2D.transform() function. The input to this function is a flat list of points such as the rectangle arrays used in the previous example. A single point is an (X,Y) coordinate pair, so the rectangle can be thought of as two points. The first point is the Top Left corner of the rectangle and the second point is the Bottom Right corner. In Acrobat JavaScript there are many different kinds of coordinate structures and not all of them are flat like the rectangle array. Some are arrays within arrays, so these items will need to be reformatted to be used with the transform function, which we'll see in the next section.
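To make the idea of transforming a flat list of points concrete, here is a plain JavaScript sketch of a 2D affine transform and its inverse. It is not the Matrix2D code; the six-element [a, b, c, d, tx, ty] layout and the function names are assumptions chosen for illustration. The example uses the unrotated page cropped by 1/2 inch from the earlier section, where, following this article's description, Rotated User Space is Default User Space shifted by (-36,-36).

```javascript
// Plain JavaScript sketch of a 2D affine transform. This is NOT Acrobat's
// Matrix2D; the [a, b, c, d, tx, ty] layout is an assumption for illustration.
//   x' = a*x + c*y + tx
//   y' = b*x + d*y + ty
function transformPoints(m, aPts) {
    var aOut = [];
    for (var i = 0; i < aPts.length; i += 2) {
        var x = Number(aPts[i]), y = Number(aPts[i + 1]);
        aOut.push(m[0] * x + m[2] * y + m[4]);
        aOut.push(m[1] * x + m[3] * y + m[5]);
    }
    return aOut;
}

// Inverse of the transform, so points can be converted back again
function invertMatrix(m) {
    var det = m[0] * m[3] - m[1] * m[2];
    return [m[3] / det, -m[1] / det, -m[2] / det, m[0] / det,
            (m[2] * m[5] - m[3] * m[4]) / det,
            (m[1] * m[4] - m[0] * m[5]) / det];
}

// Unrotated page cropped by 36 points: shift by (-36,-36)
var mxToRotated = [1, 0, 0, 1, -36, -36];
var rectRotated = transformPoints(mxToRotated, [36, 756, 576, 36]);
// rectRotated is [0,720,540,0]
var rectDefault = transformPoints(invertMatrix(mxToRotated), rectRotated);
// rectDefault is [36,756,576,36]
```

In a real script the matrix would come from Matrix2D.fromRotated() rather than being written by hand, since it also has to account for the page rotation.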
Finding Word Locations
The locations of individual words on a particular page are acquired with the doc.getPageNthWordQuads() function. This function returns an array of Quad structures. A Quad is an array of 8 numbers, which represents the (x,y) coordinate pairs for the 4 vertices of the rectangle bounding the word (Figure 6). This seems a bit overly complex for getting a rectangle, but the words on a page can be rotated, so the quad structure is a necessary complication.
The coordinates used by a Quad are in Default User Space. If we wanted to use a word location to place a form field or link on the page it would be necessary to convert the quad into a Rotated User Space rectangle, as shown in the following code:
// Create Matrix that converts from Default User Space
// to Rotated User Space
var mxToDefaultCoords = (new Matrix2D()).fromRotated(this,this.pageNum);
var mxToRotatedCoords = mxToDefaultCoords.invert();
// Get the quad for the 2nd word on the page
var quads = this.getPageNthWordQuads(this.pageNum,1);
// Flatten Quads into simple array of numbers
var pts = quads.toString().split(",");
// Convert pts to Rotated User Space
var ptsNew = mxToRotatedCoords.transform(pts);
// Find points of rectangle from min and max of points
// start off with rectangle set to first point
var rect = [ ptsNew[0], ptsNew[1], ptsNew[0], ptsNew[1]];
for(var i=0;i<ptsNew.length;i+=2)
{
// Test Left
if(ptsNew[i] < rect[0])
rect[0] = ptsNew[i];
// Test Right
if(ptsNew[i] > rect[2])
rect[2] = ptsNew[i];
// Test Top
if(ptsNew[i+1] > rect[1])
rect[1] = ptsNew[i+1];
// Test bottom
if(ptsNew[i+1] < rect[3])
rect[3] = ptsNew[i+1];
}
// Add Link to page using rectangle in Rotated User Space
this.addLink(this.pageNum,rect);
The ultimate goal of the script is to place a link over a specific word on the page. The problem is that the coordinates for a single word can be in a series of Default User Space Quads, and the link is placed using a single rectangle in Rotated User Space. So it's not just a conversion between coordinate spaces, it's also a conversion between point structures, from multiple quads to a single rectangle. Because there are so many different ways to describe the geometric structures used in a PDF, this type of conversion is actually pretty common.
The first thing the script does is to create the Matrix2D object we'll need for the conversion from Default to Rotated User Space. Next it acquires the Quads for the word of interest. We don't know how many quads are used for the word, so the conversion technique has to be generalized for any number. This is easily done by first converting the Quads structure into a string, and then splitting the string by the "," delimiter. The fact that we can do this is an artifact of the "toString()" function. Try this line by itself in the Console Window to see how it works. The result is a large list of points, which are then converted to Rotated User Space using the matrix created at the top of the script.
Now we need to sort through all these points to find the edges of a rectangle. This is done by setting up a loop to find the largest and smallest x and y values, i.e., the code finds the boundaries of the set of points, and this is the placement rectangle that's used to create the link. For a real automation script it would be more efficient to place the code for converting a quad into a rectangle into a function, since we might need this ability in several places. Placing the code in a function also makes it easier to use in the future.
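One possible shape for such a function is sketched below. quadsToRect is a hypothetical name, and the quad values in the example are made up; the function only finds the bounding rectangle, so the points it receives should already have been converted to Rotated User Space with the Matrix2D object.

```javascript
// Hypothetical helper: bounding rectangle [left, top, right, bottom] for an
// array of quads, where each quad is an array of 8 numbers (4 x,y pairs).
function quadsToRect(aQuads) {
    var rect = null;
    for (var q = 0; q < aQuads.length; q++) {
        for (var i = 0; i < aQuads[q].length; i += 2) {
            var x = Number(aQuads[q][i]), y = Number(aQuads[q][i + 1]);
            if (rect === null) {
                rect = [x, y, x, y];   // start with the first point
                continue;
            }
            if (x < rect[0]) rect[0] = x;   // Left
            if (y > rect[1]) rect[1] = y;   // Top
            if (x > rect[2]) rect[2] = x;   // Right
            if (y < rect[3]) rect[3] = y;   // Bottom
        }
    }
    return rect;   // null when there are no points
}

// Made-up quads for a word split across two lines
var rectWord = quadsToRect([[100, 700, 150, 700, 100, 688, 150, 688],
                            [ 72, 686, 110, 686,  72, 674, 110, 674]]);
// rectWord is [72,700,150,674]
```

Working on the nested quad arrays directly avoids the string conversion, at the cost of an extra loop.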
There are any number of tasks where it will be necessary to find and set page boundaries, word locations, field placement, and many other operations involving page geometry. It's often the case that the coordinate spaces will be the same. But you can't make this assumption. If a script will use functions and properties in different User Spaces it will be necessary to use the Matrix2D object to convert between them.