Examining a page
Pages are dictionaries
In PDFs, the main data structure is the dictionary, a key-value data
structure much like a Python dict or attrdict. The major difference is
that the keys can only be names, while values can be any type, including
other dictionaries.
PDF dictionaries are represented as pikepdf.Dictionary, and names
are of type pikepdf.Name. A page is just another dictionary, with a
few required fields that give it special status as a page.
A pikepdf.Name is, usually, an ASCII-encoded string beginning with
“/” followed by a capital letter.
In [1]: from pikepdf import Pdf
In [2]: example = Pdf.open('../tests/resources/congress.pdf')
In [3]: page1 = example.pages[0]
In [4]: page1
Out[4]:
<pikepdf.Dictionary(type_="/Page")({
  "/Contents": pikepdf.Stream(stream_dict={
    "/Length": 50
  }, data=<...>),
  "/MediaBox": [ 0, 0, 200, 304 ],
  "/Parent": <reference to /Pages>,
  "/Resources": {
    "/XObject": {
      "/Im0": pikepdf.Stream(stream_dict={
        "/BitsPerComponent": 8,
        "/ColorSpace": "/DeviceRGB",
        "/Filter": [ "/DCTDecode" ],
        "/Height": 1520,
        "/Length": 192956,
        "/Subtype": "/Image",
        "/Type": "/XObject",
        "/Width": 1000
      }, data=<...>)
    }
  },
  "/Type": "/Page"
})>
Item and attribute notation
Dictionary values may be looked up using item notation (page1['/MediaBox'])
or attribute notation (page1.MediaBox). The two conventions are equivalent.
In [5]: page1.MediaBox
Out[5]: pikepdf.Array([ 0, 0, 200, 304 ])
In [6]: page1['/MediaBox']