PDF Page Coordinates (page size, field placement, etc.)

AcroForm, Basics, Automation

Page coordinates are used to add fields and annotations to a page, move fields and annotations, resize page boundaries, locate words on a page, and perform any other operation that involves page geometry. Understanding page coordinates is therefore critical to many common automation activities, and it is also very useful for form scripting.

Related Resources

Sample Files that demonstrate page geometry operations

Automation tools that demonstrate page geometry operations

Coordinate Systems

The coordinate system on a PDF page is called User Space. This is a flat, 2-dimensional space, just like a piece of paper, and in fact that's a good way to think about it. The units of User Space are called "points" and there are 72 points per inch. The origin, or (0,0) point, is located in the bottom left hand corner of the page. Horizontal, or X, coordinates increase to the right and vertical, or Y, coordinates increase towards the top (Figure 1).

Figure 1 - User Space coordinates on a PDF page.
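Since every measurement in User Space is in points, a pair of small conversion helpers is handy. The following is only a minimal sketch; the function names are made up for this example and are not part of the Acrobat JavaScript API.

```javascript
// Hypothetical helpers (not Acrobat API): convert between inches and points.
var POINTS_PER_INCH = 72;

function inchesToPoints(nInches) {
  return nInches * POINTS_PER_INCH;
}

function pointsToInches(nPoints) {
  return nPoints / POINTS_PER_INCH;
}

// A US Letter page is 8.5 x 11 inches.
var nLetterWidth = inchesToPoints(8.5);   // 612
var nLetterHeight = inchesToPoints(11);   // 792
```

These are the same 612 and 792 values that show up in the page box examples later in this article.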

However, PDF pages are a bit more complex than they might seem from the user's perspective. The edges of a page are bound by several different page boxes (Figure 2). Ideally these boxes are concentric (each one fully inside the next) as shown in Figure 2, but this arrangement is not absolutely required. Each of these boxes has a different meaning and all of them except for the BBox can be modified with a script.

Figure 2 - The different Page Boxes that define a page's boundaries.

The outermost box is the Media Box. This box represents the full page size. Originally this meant the paper size the page was to be printed on, and all the other bounding boxes are inside this one. The Media Box doesn't have quite the same importance for an interactive document displayed on the screen, but it is still very important to page geometry, as will be explained below.

The next 3 boxes, Art, Bleed, and Trim, have special meaning to printers. They represent important stages in the printing of a document, but are invisible to the average user and unimportant for our purposes here, so they won't be discussed.

The BBox, or Bounding Box, is the smallest rectangle that can enclose all the content on the page. This box is calculated by Acrobat, so it cannot be modified by a script. Ideally, a line drawn around the BBox would touch the edges of the visible content on all four sides of the page. Unfortunately this is rarely true. PDF content often includes invisible boundaries that extend beyond the visible elements. These empty areas are usually the result of an inefficient, or in some cases downright poor, PDF creation process.

The two most important boxes for scripting, and for handling page geometry in general, are the Media Box and the Crop Box. As explained earlier, the Media Box is meant to represent what the user would see if they printed out the PDF page. The Crop Box is what the user sees on the computer screen. These two boxes are often exactly the same, but they can of course be different sizes, and they can also be different rotations. The only restriction is that the Crop Box is always inside the Media Box. If a script attempts to make the Media Box smaller than the Crop Box, then Acrobat will automatically shrink the Crop Box to fit. And vice versa, if a script tries to make the Crop Box larger than the Media Box, Acrobat will automatically grow the Media Box.

To illustrate this idea imagine a standard letter size page designed to be viewed in landscape mode. For example it might display a very wide table of data. On the computer you'd want to see the page rotated with the long side down so you could see the content properly. But the page would still need to be printed in the regular letter size page orientation. Thus, we have two situations, the rotated cropped view that the user sees on the screen, and the unrotated full size page that gets printed.

To handle these two situations Acrobat JavaScript uses two different coordinate systems, Default User Space which represents the printed view, and Rotated User Space which represents the on-screen view. Default User Space is measured with the Media Box. In Default User Space the origin (0,0 point) is always the bottom left hand corner of the printed page. Rotated user space is measured with the Crop Box. In Rotated User Space the origin is always the bottom left hand corner of the page shown on the screen. This difference is illustrated in Figure 3 using the example discussed above.

Figure 3 - Difference in Rotated vs Default User Space

You'll find in most documents that both these User Spaces are exactly the same, which means that the Crop and Media Boxes are also exactly the same. However, unless you know for sure that there isn't a difference, all code must be written to take rotations and cropping into consideration.

Page Rotations

Because rotation is part of the difference between the User Spaces, it's useful to understand a bit about how page rotation works in Acrobat and PDF. Pages can only be rotated in 90° increments. For example, a page cannot be skewed 45°. Each page in a PDF document has its own unique geometry. The size and rotation of pages in a PDF are unrelated to one another. Of course, in real documents it's standard practice to make all the pages the same, but keep in mind that there is no guarantee. Code that works on every page in a PDF has to treat each page individually, unless it's known up front that the document's page geometry is homogeneous.

Unfortunately, the Acrobat user interface does not provide information on page rotation. Even though pages can be rotated with the "Document > Rotate Pages..." menu item, there is no feedback to indicate the current page rotation. Fortunately we can get this information from JavaScript with the doc.getPageRotation() function. The following line of code returns the page rotation for the current page in the current document. Try running it in the Console Window.

var nRot = this.getPageRotation(this.pageNum);

The doc.getPageRotation() function takes a single input, the page number, and returns one of 0, 90, 180, or 270. To set the page rotation use doc.setPageRotations(). This function is a little different because it operates on a range of pages. It takes 3 input arguments: a start page, an end page, and the rotation. The rotation must be one of 0, 90, 180, or 270. The function will throw an exception for invalid rotations. For example, it will not take negative rotations or rotations of 360 and above, even though these are equivalent to the valid input values. And of course it will not accept a value such as 45. The following code rotates the first three pages in the PDF by 90°.

// nStart = 0;  first page in PDF

// nEnd = 2;  page 3 in PDF

// nRotation = 90°

this.setPageRotations(0,2,90);
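Because doc.setPageRotations() rejects negative values and values of 360 or more, it can help to clean up a computed rotation before passing it in. Here is one possible sketch; normalizeRotation is a made-up helper name, not an Acrobat function.

```javascript
// Hypothetical helper (not Acrobat API): reduce any multiple of 90
// to one of the four values that doc.setPageRotations() accepts.
function normalizeRotation(nRot) {
  if (nRot % 90 !== 0) {
    throw new Error("Rotation must be a multiple of 90: " + nRot);
  }
  // The double modulo keeps negative inputs in the 0..270 range.
  return ((nRot % 360) + 360) % 360;
}

normalizeRotation(-90);  // 270
normalizeRotation(450);  // 90
normalizeRotation(360);  // 0
```

In a script the result would then be handed to the real function, for example this.setPageRotations(0, 2, normalizeRotation(nRot)).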

Field and Annotation Rotation

When a page is rotated, the form fields and markup annotations on the page are rotated with it, so that the presentation of all the elements on the page stays in sync. The rotation value of markup annotations cannot be seen or controlled from the Acrobat UI. For all practical purposes it is whatever Acrobat decides to make it. But fields have a rotation property that can be set in the Properties Dialog (Figure 4). Setting the rotation of a field sets the orientation of the content (the text); it does not rotate the field box.

Figure 4 - Setting Field Rotation only changes the orientation of the text on the field

In JavaScript, the rotation of fields can be analyzed and set through the "field.rotation" property.

var oFld = this.getField("MyText");

// Get Field Rotation.

var nCurRot = oFld.rotation;

// Rotate 90° more

var nNewRot = nCurRot + 90;

// Fix up for large rotations

if(nNewRot >= 360)

  nNewRot -= 360;

// new Rotation

oFld.rotation = nNewRot;

The rotation value shown on the Properties Dialog is different from the value acquired in the JavaScript model. The Properties Dialog value shows the rotation of the field relative to the view of the page on screen. The rotation acquired from the Field's JavaScript object is the real rotation of the element that you'll need for doing geometry calculations. It's the rotation of the field in Default User Space, or to put it qualitatively, it's the field's rotation relative to the orientation of the printed page. And just like the rotation set in the Properties Dialog, it only rotates text and graphics shown in the field. It does not rotate the field boundaries.

Getting and Setting Page Size

As we've already seen, a page really has many different sizes depending on how we look at it, or to be more specific, on which bounding box and coordinate space we're looking at. The doc.getPageBox() function is used to acquire a bounding box for a specific page. This function will get the bounding box for any of the boxes shown in Figure 2. But the coordinates returned are always in Rotated User Space, which represents the view of the page the user sees on the screen. Because JavaScript operates in the viewer, Rotated User Space tends to be the standard coordinate system for most operations. The following code acquires the Crop Box for the first page in the current document.

var aCropRect = this.getPageBox("Crop",0);

This function has only two inputs, the name of the box we want to acquire, and the zero-based page number. The return value is an array of four values representing the coordinates of the four edges of the page rectangle.

var aPageBox = [nLeft, nTop, nRight, nBottom];
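With the edges in this order, the width and height of any page box fall out of two subtractions. The sketch below assumes the [left, top, right, bottom] order described above; rectSize is a made-up helper name, not an Acrobat function.

```javascript
// Hypothetical helper (not Acrobat API): size of a rectangle given
// in the [left, top, right, bottom] order described above.
function rectSize(aRect) {
  return {
    width: aRect[2] - aRect[0],   // right minus left
    height: aRect[1] - aRect[3]   // top minus bottom
  };
}

var oSize = rectSize([0, 792, 612, 0]);
// oSize.width is 612 points (8.5 inches)
// oSize.height is 792 points (11 inches)
```

In Acrobat the input would normally come from a call such as this.getPageBox("Crop", 0).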

Remember that the origin of the Crop Box is always the lower left corner in Rotated user space (Figure 3). So the Left and Bottom values in the Crop Box rectangle will always be 0. The following example shows the return values for Page Boxes shown in Figure 3 and Figure 1.

//** For Figure 1

// Original Page is 8.5x11 inches, no rotation or cropping

var aCropRect = this.getPageBox("Crop",0);

// Returns:  [0,792,612,0]

var aMediaRect = this.getPageBox("Media",0);

// Returns:  [0,792,612,0]

//** For Figure 3

// Original Page is 8.5x11 inches, rotated 90°,

// and cropped 1/2 inch on all sides

var aCropRect = this.getPageBox("Crop",0);

// Returns:  [0,540,720,0]

var aMediaRect = this.getPageBox("Media",0);

// Returns:  [-36,576,756,-36]

Notice that in the example for Figure 1 the Crop and Media boxes are the same. This means that Default and Rotated User Space are also the same.
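This observation can be turned into a quick test that a script runs before deciding whether any coordinate conversion is needed. The following is a rough sketch, not an official technique. It assumes, following the reasoning in this article, that a page with no rotation and identical Crop and Media Boxes has identical User Spaces. The name userSpacesMatch is made up for this example.

```javascript
// Hypothetical helper (not Acrobat API). oDoc is expected to provide
// getPageRotation() and getPageBox() as described in this article.
function userSpacesMatch(oDoc, nPage) {
  if (oDoc.getPageRotation(nPage) !== 0) {
    return false;
  }
  var aCrop = oDoc.getPageBox("Crop", nPage);
  var aMedia = oDoc.getPageBox("Media", nPage);
  // Compare the four edges one by one
  for (var i = 0; i < 4; i++) {
    if (aCrop[i] !== aMedia[i]) {
      return false;
    }
  }
  return true;
}
```

In the Console Window it would be called as userSpacesMatch(this, this.pageNum). Because the document is passed in as a parameter, the function can also be tried out with a simple stand-in object.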

All of the bounding boxes, except for BBox, can be viewed and modified from the Cropping tool (Figure 5), the "Document > Crop Pages..." menu item in Acrobat.

Figure 5 - The Cropping Dialog can be used to modify any of the Page Boxes

Notice that changing the page size changes the Media Box. This reinforces the idea that the Media Box is the base page size (or paper size), and all the other boundaries are variations on the Media Box. This is why Default User Space is based on the Media Box. What the user sees on the screen, Rotated User Space, is a variation. This gives us some perspective on Adobe's thinking at the time they created PDF, i.e. that it was primarily a print, rather than screen, based model.

In JavaScript, the doc.setPageBoxes() function is used to set a bounding box for a range of pages. The following code uses this function to crop the current page by 1/2 inch on all sides.

// Original Page is 8.5x11 inches

// First, acquire the current Media Box

var aMediaRect = this.getPageBox("Media",this.pageNum);

// Returns:  [0,792,612,0]

// Now remove 1/2 inch, 36 points from all sides

var aNewCrop = aMediaRect;

aNewCrop[0] += 36;  // Move Left edge to the right

aNewCrop[1] -= 36;  // Move Top edge down

aNewCrop[2] -= 36;  // Move Right edge to the Left

aNewCrop[3] += 36;  // Move Bottom edge up

// aNewCrop is now:  [36,756,576,36]

// Set Crop Box of the current page to the new rectangle

this.setPageBoxes("Crop",this.pageNum,this.pageNum, aNewCrop);

//  Get the new Crop Box

var aCropRect = this.getPageBox("Crop",this.pageNum);

// Returns:  [0,720,540,0]

//  Get the new Media Box

var aMediaRect = this.getPageBox("Media",this.pageNum);

// Returns:  [-36,756,576,-36]

The script gets the current Media Box, shrinks it by 1/2 inch on all sides, and then applies that new rectangle to the Crop Box. Notice that the setPageBoxes function takes four inputs, the name of the Page Box that will be changed, the start and end page numbers (zero based), and the new rectangle. The rectangle is always in Rotated User Space.

The last two lines of the script are purely for analysis. They acquire the Crop and Media Boxes after the change has been made. In the code, the Crop Box was set to [36,756,576,36], but after the change the new Crop Box is [0,720,540,0]. The setPageBoxes function works exactly the way we expect. By comparing the Media and Crop Boxes we can see that the boundaries of the Crop Box are 1/2 inch inside the Media Box. But remember that the getPageBox function returns Rotated User Space coordinates, and in Rotated User Space the bottom left hand corner of the Crop Box is always (0,0). No matter what values the Crop Box is set to, Acrobat always readjusts all the page boxes so that the bottom left corner of the Crop Box is (0,0) in Rotated User Space.
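The arithmetic behind these numbers can be checked outside of Acrobat. The sketch below only models the readjustment described above; it is not how Acrobat implements it, and both function names are made up for this example.

```javascript
// Hypothetical helpers (not Acrobat API). Rectangles use the
// [left, top, right, bottom] order returned by doc.getPageBox().

// Return a copy of aRect shrunk by nInset points on every side.
function insetRect(aRect, nInset) {
  return [aRect[0] + nInset, aRect[1] - nInset,
          aRect[2] - nInset, aRect[3] + nInset];
}

// Re-express aRect relative to the bottom left corner of aCrop,
// modeling the readjustment described above.
function rebaseToCrop(aRect, aCrop) {
  var nDx = aCrop[0];  // left edge of the Crop Box
  var nDy = aCrop[3];  // bottom edge of the Crop Box
  return [aRect[0] - nDx, aRect[1] - nDy,
          aRect[2] - nDx, aRect[3] - nDy];
}

var aMedia = [0, 792, 612, 0];
var aCrop = insetRect(aMedia, 36);              // [36, 756, 576, 36]
var aCropAfter = rebaseToCrop(aCrop, aCrop);    // [0, 720, 540, 0]
var aMediaAfter = rebaseToCrop(aMedia, aCrop);  // [-36, 756, 576, -36]
```

Note that insetRect returns a new array, so the original Media Box rectangle is left untouched.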

Placing Form Fields

One common automation activity is placing form fields on a page, for example putting a date field in the document header or footer, or navigation buttons along one of the edges of the page. For proper placement, the script needs to first locate one or more of the page edges and then calculate a field placement rectangle. Interactive elements like fields are meant to be used when the PDF is displayed on the screen, as opposed to printed, so all field geometry operations use Rotated User Space. Also, fields are usually placed relative to the Crop Box so that they are visible to the user. Given this information, the following script places a text field along the bottom edge of the Crop Box on the first page of the document.

// First, acquire the Crop Box for Page #1

var aCropRect = this.getPageBox("Crop",0);

// Calculate the placement rectangle for the text box.

// Field is centered so find the mid point of the page

var nMiddle = aCropRect[2]/2;  

// Field is along bottom edge of crop, 150 points wide, 20 points tall

var rectFld = [];

rectFld[0] = nMiddle-75; // Left side is 75pts to the left of the middle

rectFld[1] = aCropRect[3]+20;  // Top side is 20pts above the bottom

rectFld[2] = nMiddle+75; // Right side is 75pts to the right of the middle

rectFld[3] = aCropRect[3];  // Bottom is the same as the Crop Box

// Add Field

var oFld = this.addField("MyDateFld", "text", 0, rectFld);

oFld.value = util.printd("ddd mmm d, yyyy",new Date());

oFld.alignment = "center";

The doc.addField() function takes 4 inputs: the name of the field, the field type, the zero based page number, and the field placement rectangle in Rotated User Space coordinates. This rectangle is set up in exactly the same way as the Page Box rectangles. It's an array of four numbers where each number is a coordinate for one of the edges of the field, in the order Left, Top, Right, Bottom.
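The rectangle calculation from the script can be pulled out into a small function so the field width and height become parameters. This is only one possible way to organize it; bottomCenterRect is a made-up name, not an Acrobat function.

```javascript
// Hypothetical helper (not Acrobat API): rectangle for a field that is
// centered horizontally and sits on the bottom edge of the Crop Box.
// All values are in points, in [left, top, right, bottom] order.
function bottomCenterRect(aCropRect, nWidth, nHeight) {
  var nMiddle = (aCropRect[0] + aCropRect[2]) / 2;
  return [nMiddle - nWidth / 2,
          aCropRect[3] + nHeight,
          nMiddle + nWidth / 2,
          aCropRect[3]];
}

// Letter size Crop Box, field 150 points wide and 20 points tall
var rectFld = bottomCenterRect([0, 792, 612, 0], 150, 20);
// rectFld is [231, 20, 381, 0]
```

The result can be passed straight to this.addField() as the fourth argument, in place of the hand-built rectFld array.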

Adjusting the location and size of the field is a simple matter of adjusting the parameters used to calculate the field placement rectangle. Fields can even be placed outside, or partially outside, the Crop Box. If there is room in the viewing area Acrobat will draw fields that are outside the Crop Box, so this placement can be used to create visual effects. Be careful though, because this may be a bug in older versions and Adobe could easily make it so that nothing is drawn outside the crop area. But one legitimate use for placing fields off the page is for hidden fields you don't want the user interacting with.

The bounding rectangle of a form field is accessed with the field.rect property. In addition to being able to get the field location, the field can also be moved at any time by changing this rectangle, even when the PDF is displayed in Reader. Take a look at the Bouncing Button sample file. To try out a simple example of moving a field, place the following code into the Mouse Up event of a button field.

   var rectFld = event.target.rect;

   var newRect = [rectFld[0]+10, rectFld[1]+10, rectFld[2]+10, rectFld[3]+10];

   event.target.rect = newRect;

The script adds 10 points to all edges of the button's rectangle, which moves it towards the top right of the page. Since movement is done in Rotated User Space coordinates it doesn't matter what rotation is applied to the page. It will always move the button towards the top right of the user's view. Try rotating the page and then clicking on the button.
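The same move can be written as a reusable function with separate horizontal and vertical offsets. This is a small sketch only; offsetRect is a made-up name, not an Acrobat function.

```javascript
// Hypothetical helper (not Acrobat API): shift a rectangle in
// [left, top, right, bottom] order by nDx and nDy points.
function offsetRect(aRect, nDx, nDy) {
  return [aRect[0] + nDx, aRect[1] + nDy,
          aRect[2] + nDx, aRect[3] + nDy];
}

var rectMoved = offsetRect([100, 120, 200, 100], 10, 10);
// rectMoved is [110, 130, 210, 110]
```

In a Mouse Up script this would be used as event.target.rect = offsetRect(event.target.rect, 10, 10). Negative offsets move the field left or down.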

Converting Coordinates

All of the code shown so far, sizing pages and placing fields, has been done using Rotated User Space. But there are many functions and properties in the Acrobat JavaScript DOM that use Default User Space. Two very useful operations in particular are placing annotations and finding word locations. Using these operations in a script requires being able to convert back and forth between the two coordinate spaces. Fortunately Adobe has provided us with an easy, partially documented way to do coordinate conversions, the Matrix2D object.

It's partially documented because while you won't find it listed in the Acrobat JavaScript Reference, it is used in a couple of official examples within the reference. Matrix2D is a generic JavaScript object defined in the "JSByteCodeWin.bin" file, which can be found in the Acrobat JavaScript folder. This is a binary file, so you can't open it up to find the code that defines the Matrix2D object. But if you are really interested in seeing the code, run the following line in the Console Window.

Matrix2D.toString().replace(/([\;\{])/g,"$1\n")

The replace function is used to neaten up the code a bit. But you'll still need to do some hand editing to be able to make sense of it.

The main purpose of this object is to perform 3x3 matrix multiplications. This is the standard methodology for transforming and moving objects about in 2-D space. If you're interested in mathematics and 2D graphics transformations then see the Matrix Multiplier example file.
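For the curious, here is roughly what such a transformation boils down to. In PDF a 2-D matrix is written as six numbers [a b c d e f], and a point (x, y) is mapped to (a*x + c*y + e, b*x + d*y + f). The sketch below is an illustration of that arithmetic only. It is not the Matrix2D source code, transformPoints is a made-up name, and the example matrix is an assumed one (a quarter turn plus a shift), not a value read from Acrobat.

```javascript
// Illustration only; this is not the Matrix2D source code.
// m holds the six values [a, b, c, d, e, f] of a 2-D affine matrix.
// aPts is a flat list of points: [x0, y0, x1, y1, ...].
function transformPoints(m, aPts) {
  var aOut = [];
  for (var i = 0; i < aPts.length; i += 2) {
    var x = aPts[i], y = aPts[i + 1];
    aOut.push(m[0] * x + m[2] * y + m[4]);  // new x
    aOut.push(m[1] * x + m[3] * y + m[5]);  // new y
  }
  return aOut;
}

// Assumed example: a quarter turn plus a shift of 612 points in x
var mQuarterTurn = [0, 1, -1, 0, 612, 0];
var aResult = transformPoints(mQuarterTurn, [0, 0, 792, 0]);
// aResult is [612, 0, 612, 792]
```

A rotation combined with a shift like this is the kind of relationship that exists between two coordinate spaces on a rotated page, which is why a matrix is a natural tool for the job.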

Fortunately for the rest of us, the Matrix2D object contains some easy to use built-in functions for converting coordinates between our two User Spaces. The following code shows how to set up the Matrix2D object for these transformations. This code, or variations on it, will be used in all situations where coordinates need to be converted from one User Space to the other.

// Create matrix that converts from Rotated User Space

// to Default User Space

var mxToDefaultCoords = (new Matrix2D()).fromRotated(this,this.pageNum);

// To transform a point or rect

var rectDefault = mxToDefaultCoords.transform(rectRotated);

// Create Matrix that converts from Default User Space

// to Rotated User Space

var mxToRotatedCoords = mxToDefaultCoords.invert();

// To transform a point or rect

var rectRotated = mxToRotatedCoords.transform(rectDefault);

The matrix for converting to Default User Space is created with the Matrix2D.fromRotated() function. This function has two inputs, the document object and the zero based page within the document where the transform is needed. To convert in the other direction, from Default to Rotated User Space, the first matrix is inverted using the Matrix2D.invert() function.
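To see why inverting works, it helps to look at what an inverse does to a simple matrix. The sketch below is an illustration only and is not the Matrix2D source code. It assumes a matrix stored as six numbers [a, b, c, d, e, f], where a point (x, y) maps to (a*x + c*y + e, b*x + d*y + f). The name invertAffine and the example matrix are made up for this example.

```javascript
// Illustration only; this is not the Matrix2D source code.
// Invert a 2-D affine matrix stored as [a, b, c, d, e, f], where
// x' = a*x + c*y + e and y' = b*x + d*y + f.
function invertAffine(m) {
  var nDet = m[0] * m[3] - m[1] * m[2];
  if (nDet === 0) {
    throw new Error("Matrix cannot be inverted");
  }
  return [
    m[3] / nDet,
    -m[1] / nDet,
    -m[2] / nDet,
    m[0] / nDet,
    (m[2] * m[5] - m[3] * m[4]) / nDet,
    (m[1] * m[4] - m[0] * m[5]) / nDet
  ];
}

// Assumed example matrix: a quarter turn plus a shift of 612 points in x
var mForward = [0, 1, -1, 0, 612, 0];
var mBack = invertAffine(mForward);
// mBack holds the values 0, -1, 1, 0, 0, 612
```

Applying mForward to a point and then mBack to the result gives back the original point, which is exactly the round trip between the two User Spaces.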

Coordinate conversions are performed with the Matrix2D.transform() function. The input to this function is a flat list of points, such as the rectangle arrays used in the previous example. A single point is an (X,Y) coordinate pair, so the rectangle can be thought of as two points. The first point is the Top Left corner of the rectangle and the second point is the Bottom Right corner. In Acrobat JavaScript there are many different kinds of coordinate structures and not all of them are flat like the rectangle array. Some are arrays within arrays, so these items will need to be reformatted to be used with the transform function, which we'll see in the next section.
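One way to do that reformatting is a small recursive function that walks the nested arrays and collects the numbers. This is just a sketch of one approach; flattenCoords is a made-up name, not an Acrobat function.

```javascript
// Hypothetical helper (not Acrobat API): flatten nested coordinate
// arrays, such as an array of quads, into one flat list of numbers.
function flattenCoords(aNested) {
  var aFlat = [];
  for (var i = 0; i < aNested.length; i++) {
    var item = aNested[i];
    if (typeof item === "object" && item !== null) {
      // Nested array: flatten it and append its numbers
      aFlat = aFlat.concat(flattenCoords(item));
    } else {
      aFlat.push(Number(item));
    }
  }
  return aFlat;
}

var aPts = flattenCoords([[10, 50, 60, 50, 10, 30, 60, 30]]);
// aPts is [10, 50, 60, 50, 10, 30, 60, 30]
```

Unlike a string based approach, the values in the result are real numbers rather than strings, so they can be compared and added without surprises.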

Finding Word Locations

The locations of individual words on a particular page are acquired with the doc.getPageNthWordQuads() function. This function returns an array of Quad structures. A Quad is an array of 8 numbers, which represents the (x,y) coordinate pairs for the 4 vertices of the rectangle bounding the word (Figure 6). This seems a bit overly complex for getting a rectangle, but the words on a page can be rotated, so the quad structure is a necessary complication.

Figure 6 - Quad Structures, arrays of 8 numbers, are used to describe the bounding box of a word on a PDF page

The coordinates used by a Quad are in Default User Space. If we wanted to use a word location to place a form field or link on the page, it would be necessary to convert the quad into a Rotated User Space rectangle, as shown in the following code:

// Create Matrix that converts from Default User Space

// to Rotated User Space

var mxToDefaultCoords = (new Matrix2D()).fromRotated(this,this.pageNum);

var mxToRotatedCoords = mxToDefaultCoords.invert();

// Get the quad for the 2nd word on the page

var quads = this.getPageNthWordQuads(this.pageNum,1);

// Flatten Quads into simple array of numbers

var pts = quads.toString().split(",");

// Convert pts to Rotated User Space

var ptsNew = mxToRotatedCoords.transform(pts);

// Find points of rectangle from min and max of points

// start off with rectangle set to first point

var rect = [ ptsNew[0],  ptsNew[1],  ptsNew[0],  ptsNew[1]];

for(var i=0;i<ptsNew.length;i+=2)

{

   // Test Left

   if(ptsNew[i] < rect[0])

     rect[0] = ptsNew[i];

   // Test Right

   if(ptsNew[i] > rect[2])

     rect[2] = ptsNew[i];

   // Test Top

   if(ptsNew[i+1] > rect[1])

     rect[1] = ptsNew[i+1];

   // Test bottom

   if(ptsNew[i+1] < rect[3])

     rect[3] = ptsNew[i+1];

}

// Add Link to page using rectangle in Rotated User Space

this.addLink(this.pageNum,rect);

The ultimate goal of the script is to place a link over a specific word on the page. The problem is that the coordinates for a single word can be in a series of Default User Space Quads, and the link is placed using a single rectangle in Rotated User Space. So it's not just a conversion between coordinate spaces, it's also a conversion between point structures, from multiple quads to a single rectangle. Because there are so many different ways to describe the geometric structures used on a PDF, this type of conversion is actually pretty common.

The first thing the script does is create the Matrix2D object we'll need for the conversion from Default to Rotated User Space. Next it acquires the Quads for the word of interest. We don't know how many quads are used for the word, so the conversion technique has to be generalized for any number. This is easily done by first converting the Quads structure into a string, and then splitting the string by the "," delimiter. The fact that we can do this is an artifact of the "toString()" function. Try this line by itself in the Console Window to see how it works. The result is a large list of points, which are then converted to Rotated User Space using the matrix created at the top of the script.

Now we need to sort through all these points to find the edges of a rectangle. This is done by setting up a loop to find the largest and smallest x and y values, i.e., the code finds the boundaries of the set of points, and this is the placement rectangle that's used to create the link. For a real automation script it would be better to place the code for converting a quad into a rectangle in a function, since we might need this ability in several places. Placing the code in a function also makes it easier to reuse in the future.
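Such a function might look something like the following. It is a sketch of the same min/max search used in the script above, written with Math.min and Math.max. The name boundingRect is made up for this example, and the sample coordinates are invented numbers.

```javascript
// Hypothetical helper (not Acrobat API): smallest rectangle, in
// [left, top, right, bottom] order, that encloses a flat list of
// points [x0, y0, x1, y1, ...].
function boundingRect(aPts) {
  // Start off with the rectangle set to the first point
  var aRect = [aPts[0], aPts[1], aPts[0], aPts[1]];
  for (var i = 2; i < aPts.length; i += 2) {
    aRect[0] = Math.min(aRect[0], aPts[i]);      // left
    aRect[1] = Math.max(aRect[1], aPts[i + 1]);  // top
    aRect[2] = Math.max(aRect[2], aPts[i]);      // right
    aRect[3] = Math.min(aRect[3], aPts[i + 1]);  // bottom
  }
  return aRect;
}

// Two quads, as for a word split across two lines (made-up numbers)
var aWordRect = boundingRect([10, 50, 60, 50, 10, 30, 60, 30,
                              5, 28, 40, 28, 5, 10, 40, 10]);
// aWordRect is [5, 50, 60, 10]
```

In the script above, the whole loop could then be replaced with a single call on the converted points, boundingRect(ptsNew), and the result passed to this.addLink().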

There are any number of tasks where it will be necessary to find and set page boundaries, word locations, field placement, and many other operations involving page geometry. It's often the case that the coordinate spaces will be the same. But you can't make this assumption. If a script will use functions and properties in different User Spaces it will be necessary to use the Matrix2D object to convert between them.
