Writing Text and Images to PDF with Apache PDFBox

This post has also been inspired by my weekly family newsletters. In my first post, I detailed how to extract text and attachments from Gmail using the Gmail API and save them into a directory structure. That is the first step in the program I wrote for saving my newsletters. In addition to saving the raw data to my hard drive, I generate PDF files that compile the newsletter text and images by quarter. In this post, I’ll go through creating a PDF file and writing text and images to it using Apache PDFBox.

Prerequisites

Java 1.8 or greater
Gradle 3.5 or greater
Internet access
A directory with image files in it – I created a work directory and downloaded a few stock photos from https://stocksnap.io

Gradle

compile 'org.apache.pdfbox:pdfbox:2.0.8'

The PDFWriter Class

The PDFWriter class is intended to represent one document per instance. The output directory and the intended file name are passed to the constructor. I do some very simple checks to avoid obvious mistakes: making sure the path ends in a slash and that the file name ends in “.pdf”. The path and file name are saved in instance variables. There are two additional instance variables: one of type org.apache.pdfbox.pdmodel.PDDocument and one of type org.apache.pdfbox.pdmodel.font.PDFont.

public PDFWriter(String pdfOutputDirectory, String pdfFileName) {
   this.pdfOutputDirectory = pdfOutputDirectory;     
   if (!this.pdfOutputDirectory.endsWith("/")) {
      this.pdfOutputDirectory += "/";     
   }
   if (!pdfFileName.endsWith(".pdf")) {
      pdfFileName = pdfFileName + ".pdf";     
   }     
   this.pdfFileName = pdfFileName; 
}
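
As an aside, the same two checks can be expressed with java.nio.file, which takes care of the separator for you. This is only a sketch and not the code used in the example project; the class and method names (OutputPathSketch, ensurePdfExtension, resolveOutput) and the sample paths are made up for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original PDFWriter class.
class OutputPathSketch {

    // Appends ".pdf" when the name does not already end with it,
    // mirroring the check in the constructor.
    static String ensurePdfExtension(String fileName) {
        return fileName.endsWith(".pdf") ? fileName : fileName + ".pdf";
    }

    // Joins directory and file name; resolve() supplies the separator,
    // so a trailing slash on the directory is optional.
    static Path resolveOutput(String directory, String fileName) {
        return Paths.get(directory).resolve(ensurePdfExtension(fileName));
    }

    public static void main(String[] args) {
        // Both calls produce the same path.
        System.out.println(resolveOutput("work/output", "newsletter-q1"));
        System.out.println(resolveOutput("work/output/", "newsletter-q1.pdf"));
    }
}
```

The resulting Path can be turned back into a String with toString() if the rest of the class keeps working with strings.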

Once the class is instantiated, there is a series of methods to call to do the work. The first one creates the PDDocument and PDFont objects, which are stored in instance variables and used throughout the rest of the class. This example shows how to load a font from your PC. This isn’t strictly necessary for the example; I needed it in my newsletter application because I needed a font that would handle certain special characters. If the font file can’t be loaded, the method falls back to the built-in Helvetica font. You may have to make changes depending on where your fonts are stored and what fonts you have.

public void createPdfFile() {
   doc = new PDDocument();
   try {
      font = PDType0Font.load(doc, new File("/Windows/Fonts" +"/ARIALUNI.TTF"));
   } catch (IOException e) {
      font = PDType1Font.HELVETICA;
      e.printStackTrace();
   }
 }
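
If your fonts live somewhere else, one option is to check a short list of candidate locations before handing a file to PDType0Font.load. The sketch below uses only the standard library and is not part of the example project; the class name, the method name and the second candidate path are assumptions, so substitute whatever exists on your machine.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for locating a font file; not part of the original class.
class FontLocatorSketch {

    // Returns the first candidate that exists and is a regular file.
    static Optional<File> firstExisting(List<String> candidates) {
        return candidates.stream()
                .map(File::new)
                .filter(File::isFile)
                .findFirst();
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
                "/Windows/Fonts/ARIALUNI.TTF",                     // path used in this post
                "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"  // assumed Linux location
        );
        Optional<File> fontFile = firstExisting(candidates);
        // When nothing is found, the caller would fall back to a built-in font.
        System.out.println(fontFile.map(File::getPath).orElse("no font file found"));
    }
}
```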

The bulk of the work is done in the addPage method. The addPage method takes a header for the page, a StringBuffer containing the text, the image directory path and a list of image file names. The first thing the method does is very straightforward; it creates a page and adds it to the document. The org.apache.pdfbox.pdmodel.PDPageContentStream object actually writes the text and images to the page.

public boolean addPage(String pageHeader, StringBuffer pageText, String imageDirectory, List<String> imageFileNames) {
   boolean ok = false;
   //Create and add the page to the document
   PDPage page = new PDPage();
   doc.addPage(page);
   //The contents stream actually adds text and images to the page
   PDPageContentStream contents = null;

Since we don’t know how much room our text and images are going to take up, we need to keep track of where we are on the page and how much space is left, and add pages as necessary. Our next step is to set up the variables that make this tracking possible.

fontSize – a default font size used for calculating the leading value.
leading – the height of a line. Used to move the Y axis down the page for text lines.
mediabox – the org.apache.pdfbox.pdmodel.common.PDRectangle object that describes the boundaries of the page. Used to get the page width and the coordinates of its edges.
margin – size for the margin. Used to calculate page width and start position.
width – width from the mediabox minus the two margins
startX – the starting position for the X axis accounting for the margin.
startY – the starting position for the Y axis accounting for the margin.
yOffset – keeps track of our place going down the page.

float fontSize = 12;
float leading = 1.5f*fontSize;
PDRectangle mediabox = page.getMediaBox();
float margin = 75;
float width = mediabox.getWidth() - 2*margin;
float startX = mediabox.getLowerLeftX() + margin;
float startY = mediabox.getUpperRightY() - margin;
float yOffset = startY;
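
To make the numbers concrete, here is the same arithmetic worked through with plain floats. It assumes a US Letter page of 612 x 792 points with its lower left corner at (0, 0), which I believe is what a default PDPage gives you in PDFBox 2.0; check page.getMediaBox() if your pages differ. The class and method names are made up for this sketch.

```java
// Worked example of the layout arithmetic, using plain floats instead of a PDRectangle.
class LayoutSketch {

    static float leading(float fontSize) {
        return 1.5f * fontSize;
    }

    static float usableWidth(float pageWidth, float margin) {
        return pageWidth - 2 * margin;
    }

    static float startY(float pageHeight, float margin) {
        return pageHeight - margin;
    }

    public static void main(String[] args) {
        // Assumption: US Letter page, 612 x 792 points, origin at the lower left corner.
        float pageWidth = 612f;
        float pageHeight = 792f;
        float margin = 75f;

        System.out.println("leading = " + leading(12f));                  // 18.0
        System.out.println("width   = " + usableWidth(pageWidth, margin)); // 462.0
        System.out.println("startX  = " + (0f + margin));                  // 75.0
        System.out.println("startY  = " + startY(pageHeight, margin));     // 717.0
    }
}
```

Keep in mind that PDF coordinates start at the bottom of the page, so moving down the page means subtracting from the Y value.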

The work of the method is done in a try block as there is the potential for an IOException to be thrown. The first step is to acquire an org.apache.pdfbox.pdmodel.PDPageContentStream to write the content to. The beginText() call tells PDFBox that we’re writing text out to the page. It must be matched with a call to endText(). To have the header stand out a bit, we set the font size to 14. The next two lines position the start of the line (the call to newLineAtOffset) and then adjust the yOffset downward. The showText call writes the pageHeader value to the page.

try {
  contents = new PDPageContentStream(doc, page);
  contents.beginText();
  contents.setFont(font, 14);
  contents.newLineAtOffset(startX, startY);
  yOffset-=leading;
  contents.showText(pageHeader);
  contents.newLineAtOffset(0, -leading);
  yOffset-=leading;

After outputting the header, it’s time to move on to the text provided. Since we have no idea how much text we’ve been supplied with, it’s necessary to split the StringBuffer into paragraphs and then split each paragraph into lines that fit within the page width without breaking in the middle of a word. This logic is split out into its own method.

List<String> lines = new ArrayList<>();
parseIndividualLines(pageText, lines, fontSize, font, width);

The parseIndividualLines method takes the supplied StringBuffer containing all the text, a list of strings that the method will populate with individual lines, the fontSize to be used, the org.apache.pdfbox.pdmodel.font.PDFont being used and the usable width of the page. The method’s first step is to split the text into paragraphs by splitting on the System line separator. After that, the method loops through each paragraph and pieces together lines by substringing between spaces. When the size of a line exceeds the width, it creates a line by substringing to the last space found and adds the line to the collection. It then sets the paragraph it’s working with to contain only the text after the last space noted. The method moves on to the next paragraph when the length of the paragraph reaches zero.

private void parseIndividualLines(StringBuffer wholeLetter, List<String> lines, float fontSize, PDFont pdfFont, float width) throws IOException {
   String[] paragraphs = wholeLetter.toString().split(System.getProperty("line.separator"));
   for (int i = 0; i < paragraphs.length; i++) {
      int lastSpace = -1;
      lines.add(" ");
      while (paragraphs[i].length() > 0) {
         int spaceIndex = paragraphs[i].indexOf(' ', lastSpace + 1);
         if (spaceIndex < 0) {
            spaceIndex = paragraphs[i].length();
         }
         String subString = paragraphs[i].substring(0, spaceIndex);
         float size = fontSize * pdfFont.getStringWidth(subString) / 1000;
         if (size > width) {
            if (lastSpace < 0) {
               lastSpace = spaceIndex;
            }
            subString = paragraphs[i].substring(0, lastSpace);
            lines.add(subString);
            paragraphs[i] = paragraphs[i].substring(lastSpace).trim();
            lastSpace = -1;
         } else if (spaceIndex == paragraphs[i].length()) {
            lines.add(paragraphs[i]);
            paragraphs[i] = "";
         } else {
            lastSpace = spaceIndex;
         }
      }
   }
}
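
If the substring bookkeeping is hard to follow, the same idea can be written as a greedy word-by-word wrap. This is a different formulation from the method above, not a copy of it, and it takes the width measurement as a function so it runs without PDFBox. With PDFBox the measurement would be something like fontSize * font.getStringWidth(s) / 1000, which declares IOException and so would need wrapping inside the lambda. WrapSketch, wrap and the fixed-width metric are names and values invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

class WrapSketch {

    // Greedy word wrap: keep adding words to the current line while the
    // measured width stays within maxWidth. A single word wider than
    // maxWidth is placed on a line of its own.
    static List<String> wrap(String paragraph, double maxWidth, ToDoubleFunction<String> measure) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : paragraph.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for an empty paragraph
            }
            String candidate = current.length() == 0 ? word : current + " " + word;
            if (current.length() > 0 && measure.applyAsDouble(candidate) > maxWidth) {
                lines.add(current.toString());
                current = new StringBuilder(word);
            } else {
                current = new StringBuilder(candidate);
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in metric: every character is 6 points wide.
        ToDoubleFunction<String> fixedWidth = s -> s.length() * 6.0;
        for (String line : wrap("the quick brown fox jumps over the lazy dog", 60, fixedWidth)) {
            System.out.println(line);
        }
    }
}
```

With a maximum width of 60 points and 6 points per character, each line holds at most 10 characters, so the sample sentence comes out as five lines: "the quick", "brown fox", "jumps over", "the lazy" and "dog".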

Once the text is separated into lines that will fit the page width, all we have to do is write them out to the content stream and add pages as necessary. As each line is written, the yOffset variable is decreased. When the yOffset drops to zero or below, the endText method is called on the content stream, the content stream is closed, and then a new page and content stream are created and the yOffset is reset.

contents.setFont(font, fontSize);
for (String line:lines) { 
   contents.showText(line);
   contents.newLineAtOffset(0, -leading);
   yOffset-=leading;

   if (yOffset <= 0) {
      contents.endText();
      try {
         if (contents != null) contents.close();
      } catch (IOException e) {
         ok = false;
         e.printStackTrace();
      }
      page = new PDPage();
      doc.addPage(page);
      contents = new PDPageContentStream(doc, page);
      contents.beginText();
      contents.setFont(font, fontSize);
      yOffset = startY;
      contents.newLineAtOffset(startX, startY);
   }
}
contents.endText();
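
To see how the yOffset bookkeeping plays out, the loop can be simulated without any PDF objects at all. The sketch below follows the same steps (header first, then one leading per line, new page when yOffset drops to zero or below) and simply counts pages. It assumes the startY of 717 and leading of 18 that a US Letter page with a 75 point margin would give; PageBreakSketch and pagesNeeded are made-up names.

```java
class PageBreakSketch {

    // Counts how many pages the text loop would use for a given number of lines,
    // using the same yOffset bookkeeping as addPage.
    static int pagesNeeded(int lineCount, float startY, float leading) {
        int pages = 1;
        float yOffset = startY - 2 * leading; // header line plus the line advance after it
        for (int i = 0; i < lineCount; i++) {
            yOffset -= leading;
            if (yOffset <= 0) {               // same threshold as the loop in addPage
                pages++;
                yOffset = startY;
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        float startY = 717f;  // assumed: 792 point page height minus a 75 point margin
        float leading = 18f;  // 1.5 * 12 point font
        System.out.println(pagesNeeded(37, startY, leading)); // 1
        System.out.println(pagesNeeded(38, startY, leading)); // 2
        System.out.println(pagesNeeded(78, startY, leading)); // 3
    }
}
```

Because the threshold is zero rather than the margin, text runs all the way to the bottom edge of the page before a new page starts. Comparing yOffset against the margin instead would keep a bottom margin that matches the top one.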

The concept behind writing the images is similar to the text in that we track the yOffset and create new pages as necessary. To manage and write images in PDFBox, we use the org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject class. We can create a PDImageXObject by providing it a path to an image file and the PDF document we’re using. Once we have the image object, we can calculate the scale of the image by dividing the page width by the image width. The scale is used to maintain the aspect ratio when setting the height and to calculate the yOffset for the image. As with the text writing, a new page is created when the yOffset drops to zero or below, which includes closing the content stream and resetting the yOffset. To actually write the image, we call drawImage on the content stream and provide the image object along with the positioning and sizing information.

float scale = 1f;
for (String attachmentName : imageFileNames) {
   PDImageXObject pdImage = PDImageXObject.createFromFile(imageDirectory + attachmentName, doc);
   scale = width/pdImage.getWidth();
   yOffset-=(pdImage.getHeight()*scale);
   if (yOffset <= 0) {
     System.out.println("Starting a new page");
     try {
        if (contents != null) contents.close();
     } catch (IOException e) {
        ok = false;
        e.printStackTrace();
     }
     page = new PDPage();
     doc.addPage(page);
     contents = new PDPageContentStream(doc, page);
     yOffset = startY-(pdImage.getHeight()*scale);
   }
   System.out.println("yOffset: " + yOffset);
   System.out.println("page width: " + width + " imageWidth: " + pdImage.getWidth() + " imageHeight: " + (pdImage.getHeight()*scale) + " scale: " + scale);
   contents.drawImage(pdImage, startX, yOffset, width, pdImage.getHeight()*scale);
 }
 ok = true;
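
The scaling arithmetic is easier to check with concrete numbers. The sketch below uses plain numbers rather than a PDImageXObject; the usable width of 462 points and the 924 x 600 pixel image are assumed values, and the class and method names are invented for illustration.

```java
class ImageScaleSketch {

    // Scale factor that makes the image exactly as wide as the usable page width.
    static float scaleToWidth(float targetWidth, int imageWidth) {
        return targetWidth / imageWidth;
    }

    // Height after applying that same factor, which preserves the aspect ratio.
    static float scaledHeight(float targetWidth, int imageWidth, int imageHeight) {
        return imageHeight * scaleToWidth(targetWidth, imageWidth);
    }

    public static void main(String[] args) {
        float width = 462f;     // assumed usable page width in points
        int imageWidth = 924;   // assumed image size in pixels
        int imageHeight = 600;

        float scale = scaleToWidth(width, imageWidth);               // 0.5
        float height = scaledHeight(width, imageWidth, imageHeight); // 300.0
        float yOffset = 717f - height;                               // 417.0, the lower edge of the image

        System.out.println("scale=" + scale + " height=" + height + " yOffset=" + yOffset);
    }
}
```

Note that an image narrower than the page gets a scale above 1 and is stretched to the full width. If you would rather not enlarge small images, capping the scale with Math.min(1f, scale) is one option, in which case the draw width should be the image width times the scale instead of the page width.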

On the way out of the method, the content stream is closed in a finally block. For this example, I’ve used a simple boolean to indicate method success or failure.

} catch (IOException e) {
   e.printStackTrace();
   ok = false;
} finally {
   try {
      if (contents != null) contents.close();
   } catch (IOException e) {
      ok = false;
      e.printStackTrace();
   }
}

return ok;
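
Since PDPageContentStream is Closeable, try-with-resources is an alternative to the explicit finally block. The sketch below shows the pattern with a stand-in stream so it runs without PDFBox; FakeStream, CloseSketch and writeAll are invented for this illustration. Applying it to addPage would take some restructuring, because that method reassigns the contents variable whenever it starts a new page, and a try-with-resources variable can’t be reassigned. One try block per page would be one way to do it.

```java
import java.io.Closeable;
import java.io.IOException;

class CloseSketch {

    // Stand-in for a content stream; records whether close() was called.
    static class FakeStream implements Closeable {
        boolean closed = false;

        void write(String text) throws IOException {
            if (text == null) {
                throw new IOException("nothing to write");
            }
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Returns true on success and false on IOException, like addPage.
    // The stream is closed in both cases without a finally block.
    static boolean writeAll(FakeStream stream, String text) {
        try (FakeStream s = stream) {
            s.write(text);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        FakeStream good = new FakeStream();
        System.out.println(writeAll(good, "hello") + " closed=" + good.closed);
        FakeStream bad = new FakeStream();
        System.out.println(writeAll(bad, null) + " closed=" + bad.closed);
    }
}
```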

Once the application has added as many pages as it needs, it calls the saveAndClose method. It saves the PDF Document to the directory and file name provided and closes the document.

public void saveAndClose() {
   try {
      doc.save(pdfOutputDirectory + pdfFileName);
   } catch (IOException e) {
      e.printStackTrace();
   } finally {
      try {
         doc.close();
      } catch (IOException e) {
         e.printStackTrace();
      }
   }
}

Conclusion

This post shows how to perform basic PDF file creation using Apache PDFBox and how to track and manipulate the page positioning to fit text and images. This example produces a basic PDF file as seen in the screenshot below.

[Screenshot: PDF output]

The example code is available on GitHub at https://github.com/amdegregorio/PDFBox-Example

References

Apache PDFBox

Published by Amy DeGregorio
