PDFMiner is a PDF parsing library written in Python by Yusuke Shinyama. In addition to the pdf2txt.py and dumppdf.py command-line tools, there is a way of analyzing the content tree of each page. Since that's exactly the kind of programmatic parsing I wanted to use PDFMiner for, this is a more complete example, which continues ...
PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF tools, it focuses exclusively on obtaining and analyzing text data. Using PDFMiner, you can get the exact position of text on the page, as well as other information such as fonts or lines.
How does pdfminer work?
PDFMiner focuses entirely on getting and analyzing text data. It allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML).
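As a minimal sketch of reading text positions, the following assumes the pdfminer.six fork, whose `pdfminer.high_level.extract_pages` yields one layout tree per page. The helper names `iter_text_boxes` and `text_positions` are made up for illustration, and the walk relies on duck typing rather than on specific layout classes.

```python
from typing import Any, Iterable, Iterator, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def iter_text_boxes(objs: Iterable[Any]) -> Iterator[Tuple[BBox, str]]:
    """Walk layout objects depth-first and yield (bbox, text) pairs.

    Anything exposing both `bbox` and `get_text()` is treated as text;
    other iterable objects (pages, figures) are descended into, and
    non-text leaves such as rectangles or curves are skipped.
    """
    for obj in objs:
        if hasattr(obj, "get_text") and hasattr(obj, "bbox"):
            yield tuple(obj.bbox), obj.get_text()
        elif hasattr(obj, "__iter__"):
            yield from iter_text_boxes(obj)


def text_positions(pdf_path: str) -> Iterator[Tuple[int, List[Tuple[BBox, str]]]]:
    """Yield (page id, [(bbox, text), ...]) for each page of a PDF."""
    # Assumption: pdfminer.six is installed. It is imported here so the
    # tree-walking helper above stays usable without it.
    from pdfminer.high_level import extract_pages

    for page in extract_pages(pdf_path):
        yield page.pageid, list(iter_text_boxes(page))
```

Calling `text_positions("some.pdf")` (a placeholder path) would give, per page, the bounding box and text of each text block. Coordinates are in PDF points with the origin at the bottom-left corner of the page.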
What is lazy parsing in pdfminer?
Because a PDF file has such a large and complex structure, parsing it as a whole is time- and memory-consuming. However, most PDF processing tasks do not need every part of the file. PDFMiner therefore uses a strategy of lazy parsing: it parses each part only when it is needed.
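The idea can be illustrated with a toy example. This is not PDFMiner's actual implementation, only a sketch of the general pattern: objects stay as raw bytes until something asks for them, and each one is parsed at most once. The class name `LazyObjects` and its `parse_calls` counter are invented for this illustration.

```python
from typing import Callable, Dict


class LazyObjects:
    """Toy object table: each entry is parsed on first access, then cached."""

    def __init__(self, raw: Dict[int, bytes], parse: Callable[[bytes], object]):
        self._raw = raw              # object id -> unparsed bytes
        self._parse = parse          # the (potentially expensive) parser
        self._cache: Dict[int, object] = {}
        self.parse_calls = 0         # only here to make the laziness visible

    def __getitem__(self, objid: int) -> object:
        if objid not in self._cache:
            self.parse_calls += 1
            self._cache[objid] = self._parse(self._raw[objid])
        return self._cache[objid]


# Three raw objects, but only the one that is requested gets parsed.
table = LazyObjects({1: b"42", 2: b"7", 3: b"1000"}, parse=lambda b: int(b.decode()))
value = table[2]        # first access triggers parsing
value_again = table[2]  # second access is served from the cache
```

After these two lookups, `table.parse_calls` is 1 and objects 1 and 3 have never been parsed, which is the saving lazy parsing aims for on large documents.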
What is LTCurve in programming with pdfminer?
From "Programming with PDFMiner" (pdfminer, Release 0.0.1): LTRect represents a rectangle; it could be used for framing other pictures or figures. LTCurve represents a generic Bezier curve. Also, check out a more complete example by Denis Papathanasiou. Obtaining the table of contents: PDFMiner provides functions to access the document's table of contents ("Outlines").
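A hedged sketch of reading the outlines follows. It assumes the pdfminer.six fork, where `PDFDocument` takes the parser in its constructor and `get_outlines()` yields `(level, title, dest, a, se)` tuples; older PDFMiner releases set up the document differently. The helper names `format_outline` and `read_outline` are hypothetical.

```python
from typing import Iterable, List, Tuple


def format_outline(entries: Iterable[Tuple]) -> List[str]:
    """Indent each outline title by its nesting level (level 1 = no indent)."""
    return ["  " * (level - 1) + str(title) for level, title, *_ in entries]


def read_outline(pdf_path: str) -> List[str]:
    """Return the indented table of contents, or [] if the PDF has none."""
    # Assumption: pdfminer.six is installed.
    from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
    from pdfminer.pdfparser import PDFParser

    with open(pdf_path, "rb") as fp:
        document = PDFDocument(PDFParser(fp))
        try:
            # get_outlines() is a generator, so it must be consumed
            # inside the try block and while the file is still open.
            return format_outline(document.get_outlines())
        except PDFNoOutlines:
            return []
```

Many PDFs carry no outlines at all, which is why the sketch treats `PDFNoOutlines` as an ordinary, empty result rather than an error.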