We have already seen that the default assumption in Linux and UNIX is that everything is a file, ideally one that consists of human- and machine-readable text. As a result, we have a very wide variety of powerful tools for manipulating and analyzing text files. So it makes sense to try to convert our sources into text files whenever possible. In the previous post we used optical character recognition (OCR) to convert pictures of text into text files. Here we will use command line tools to extract text, images, page images and full pages from Adobe Acrobat PDF files.

Since we will be working with pictures of text as well as raw text files, we need to use a window manager or desktop environment. Now we need to install tools for working with Adobe Acrobat PDF documents. I assume that you already have Tesseract OCR and ImageMagick installed from the previous lesson. Start your windowing system and open a terminal. Check whether each tool has a man page. If you don’t get a man page for xpdf, then install it. If you don’t get a man page for pdftk, then install it. If you don’t get a man page for pdftotext, then install the Poppler Utilities. This package includes a number of useful tools. The apropos command shows all of the tools that we now have at our disposal for manipulating PDF files.

Adobe’s portable document format (PDF) is an open standard file format for representing documents. Although PDFs can (and often do) contain text, they are not easily read using Linux commands like cat, less or vi. Instead you need to use a dedicated reader program to view PDFs, or command-line tools to extract information from them.

Let’s start by downloading a PDF to work with. We will be using a 1923 book about the wildflowers of Kashmir from the Internet Archive. If you are using HistoryCrawler, you can view the PDF with Okular.
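The check for the three tools named in the setup steps (xpdf, pdftk and pdftotext) can be sketched as a short shell script. This is only a minimal sketch: the helper name check_tool is made up for illustration, and the package names in the hints are assumptions that fit Debian or Ubuntu style systems, so they may differ on your distribution.

```shell
#!/bin/sh
# Report which of the PDF tools named in this post are on the PATH.
# The package names in the hints are assumptions for Debian/Ubuntu-style
# systems; check your own distribution's package list.

# check_tool NAME PACKAGE (hypothetical helper):
# prints "NAME: ok" if NAME is found, otherwise a hint naming PACKAGE.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: ok"
    else
        echo "$1: missing (assumed package: $2)"
    fi
}

check_tool xpdf xpdf
check_tool pdftk pdftk
check_tool pdftotext poppler-utils
```

Any tool reported as missing can then be installed with your package manager before you continue. Testing for the command itself rather than its man page is a design choice here, since a tool can be installed without its documentation.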