Downloading Emails from Gmail

Every Saturday, I write a newsletter detailing our week from the perspective of our two young children. I attach a couple of pictures from our week and send it to distant family. I’ve gotten repeated feedback about needing a way to preserve these emails for our kids to look back on when they’re older, and having them saved only in Gmail was making me feel a bit squirrelly as well. Saving fifty-two emails and their attachments each year was too cumbersome to do by hand, so I started looking for a programmatic solution. Google does things right, so I figured there had to be an API for accessing email accounts.

For my inaugural blog post, I’m going to share a stripped-down example of what I did to preserve my weekly newsletters. The example reads all the emails matching a given search string (Google allows for elaborate email searches) and writes the raw data to a directory. Each email’s text is placed in a simple unformatted text file in a subfolder, and the attachments are stored alongside it.

Setting up OAuth access to Gmail is outside the scope of this entry, but I found Google’s own documentation to be more than adequate for setting up access.

Prerequisites

Other setup

For the sake of this example, I added a label to my Gmail account called “BlogExample.” Then I emailed myself a few messages with different types of attachments: images, a PDF, a .doc file, and no attachment at all.

Gradle Dependencies

compile 'com.google.api-client:google-api-client:1.23.0'
compile 'com.google.oauth-client:google-oauth-client-jetty:1.23.0'
compile 'com.google.apis:google-api-services-gmail:v1-rev72-1.23.0'

GmailExtractor class

The work of processing and extracting the emails is done in the GmailExtractor class. You create a GmailExtractor by passing its constructor the com.google.api.services.gmail.Gmail object that you received after logging in.

public class GmailExtractor {
   private Gmail gmailService = null;

   public GmailExtractor(Gmail gmailService) {
       this.gmailService = gmailService;
   }

We’ll do the bulk of the message processing work in a method called processMessages(). The first thing we’ll do in that method is define two variables that we’ll pass to the Gmail object. One is a user ID, which we’ll just set to “me.” This is a special value that tells Google to use the logged-in user. The second variable defines the query string that controls which emails we’re extracting. For this example, I’ve set it to “label:BlogExample.” I used this nice forum entry to learn more about the Gmail Search Operators. Once we’ve defined those variables, we can go out and get the messages from Gmail. You’ll see a chain of method calls ending with execute(), which is the call that actually goes out and gets the messages. We pass our userId variable to list(userId) to indicate whose messages we’re getting, and we pass our query variable to setQ(query) to filter the results rather than retrieving every single email in the account.

public void processMessages() throws IOException {
    String userId = "me";
    String query = "label:BlogExample";

    //Retrieves all the messages using the query string
    ListMessagesResponse response = gmailService.users().messages().list(userId).setQ(query).execute();

The messages are returned a page at a time through the com.google.api.services.gmail.model.ListMessagesResponse object, so we’re going to page through and put all the messages in a collection.

List<Message> messages = new ArrayList<>();
while (response.getMessages() != null) {
       messages.addAll(response.getMessages());
       if (response.getNextPageToken() != null) {
          String pageToken = response.getNextPageToken();
          response = gmailService.users().messages().list(userId).setQ(query).setPageToken(pageToken).execute();
       } else {
          break;
       } 
}
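The same paging idea can also be written as a do/while loop keyed on the page token. The sketch below is a minimal illustration that runs without Gmail credentials: Page, PageSource, fetchAll and fakeFetch are names invented for this example and are not part of the Gmail client library. In real code, the fetch step would be the list(userId).setQ(query).setPageToken(token).execute() call shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PagingSketch {
    /** Hypothetical stand-in for ListMessagesResponse: one page of IDs plus an optional token. */
    static final class Page {
        final List<String> ids;
        final String nextPageToken; // null when there are no more pages

        Page(List<String> ids, String nextPageToken) {
            this.ids = ids;
            this.nextPageToken = nextPageToken;
        }
    }

    /** Hypothetical stand-in for the call that fetches one page from the service. */
    interface PageSource {
        Page fetch(String pageToken);
    }

    /** Collects the IDs from every page, following page tokens until none is returned. */
    static List<String> fetchAll(PageSource source) {
        List<String> all = new ArrayList<>();
        String token = null; // null requests the first page
        do {
            Page page = source.fetch(token);
            if (page.ids != null) { // a page without results carries no list
                all.addAll(page.ids);
            }
            token = page.nextPageToken;
        } while (token != null);
        return all;
    }

    /** Two fake pages, for demonstration only. */
    static Page fakeFetch(String token) {
        if (token == null) {
            return new Page(Arrays.asList("m1", "m2"), "page2");
        }
        return new Page(Arrays.asList("m3"), null);
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(PagingSketch::fakeFetch)); // prints [m1, m2, m3]
    }
}
```

Unlike the loop above, this version keeps following the token even when a page comes back without a message list. Whether that case can occur in practice is an assumption, not something I have verified.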

So now we have our messages from Gmail. Except that we don’t really. The list call returns only stub Message objects that carry an ID and a thread ID. To get the entire text of an email and its attachments, we have to take the com.google.api.services.gmail.model.Message objects we have and go back and ask the Google API for the rest using the ID on each message object. Additionally, I’m going to extract the Subject from the headers and use it as both a sub-directory name and the name of the text file into which we’re going to place the email text.

To do these things, we iterate over the messages collection we created above. Items of note in the first part of the loop are the seemingly repeated call to the gmailService and the call to Message.getPayload(). The first line in the loop asks for a single Message object by its ID and sets the format to “FULL,” which tells Gmail that we want the whole message. Next, we ask for the message payload so that we can examine the headers and save off the subject, looping through the headers until we find the one named “Subject.” The remaining lines make some attempt to scrub the subject of characters that can’t appear in a path and then create the sub-directory if it doesn’t exist. OUTPUT_DIR is just a static string in the class that contains the path to the directory where I’m storing my output.

for (Message message : messages) { 
   message = gmailService.users().messages().get(userId,message.getId()).setFormat("FULL").execute(); 
   MessagePart messagePart = message.getPayload();
   String messageContent = "";
   String subject = "";
   if (messagePart != null) {
      List<MessagePartHeader> headers = messagePart.getHeaders();
      for (MessagePartHeader header : headers) {
         //find the subject header.
         if (header.getName().equals("Subject")) {
            subject = header.getValue().trim();
            break;
         }
      }
   }

   //Create a sub-directory and file name from the subject
   //Parse the header to remove characters that can't be in a file path name
   String subdirName = subject.replaceAll("/", "-").replaceAll("<","-").replaceAll(">","-").replaceAll(":","-").replaceAll("\\\\","-").replaceAll("\\|","-").replaceAll("\\?","").replaceAll("\\*","-").replaceAll("\"","-").trim();
   String emailFileName = subdirName +".txt";

   File subDir = new File(OUTPUT_DIR + subdirName);
   if (!subDir.exists()) {
     subDir.mkdirs();
   }
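The chain of replaceAll calls above can be collapsed into a single character class. The following is a minimal sketch rather than the code from my project: SubjectSanitizer, toDirName and the "no-subject" fallback are names made up for this example. It also differs slightly from the original in that it replaces a question mark with a dash instead of removing it.

```java
import java.util.regex.Pattern;

class SubjectSanitizer {
    // The same characters the replaceAll chain targets: / < > : \ | ? * "
    private static final Pattern ILLEGAL = Pattern.compile("[/<>:\\\\|?*\"]");

    /** Turns an email subject into a name that is safe to use as a directory or file name. */
    static String toDirName(String subject) {
        String cleaned = ILLEGAL.matcher(subject).replaceAll("-").trim();
        // Hypothetical fallback so a blank subject does not resolve to the output directory itself.
        return cleaned.isEmpty() ? "no-subject" : cleaned;
    }

    public static void main(String[] args) {
        // prints: Week 12- Zoo trip - swimming-
        System.out.println(toDirName("Week 12: Zoo trip / swimming?"));
    }
}
```

Compiling the pattern once also avoids recompiling nine separate regular expressions for every message.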

Once we’ve done some setup, we need to actually build up the content. The next line in the loop is messageContent = getContent(message). Multipart messages have the email text in the payload parts, and sometimes the parts have nested parts. The getContent method starts by calling the recursive method getPlainTextFromMessageParts, passing it the MessageParts from the payload and a StringBuilder to collect the text. If it’s a plain text email, the StringBuilder comes back empty and we get the text from the payload body instead. The text is encoded in Base64, so we decode the data into bytes and then create a UTF-8 string from them.

private String getContent(Message message) {
   StringBuilder stringBuilder = new StringBuilder(); 
   try {       
      getPlainTextFromMessageParts(message.getPayload().getParts(), stringBuilder);
      if (stringBuilder.length() == 0) {
         stringBuilder.append(message.getPayload().getBody().getData());
      }
      byte[] bodyBytes = Base64.decodeBase64(stringBuilder.toString());
      String text = new String(bodyBytes, "UTF-8");
      return text;
   } catch (UnsupportedEncodingException e) { 
      System.out.println("UnsupportedEncoding: " + e.toString());
      return message.getSnippet(); 
   }
}

The getPlainTextFromMessageParts method is a fairly straightforward recursive method. We iterate the message parts provided. If a part has a plain text MIME type, we get the body data. If the part has its own parts, we call the method again with those parts. The whole time, we’re building up the stringBuilder.

private void getPlainTextFromMessageParts(List<MessagePart> messageParts, StringBuilder stringBuilder) {
   if (messageParts != null) {
      for (MessagePart messagePart : messageParts) {
         if (messagePart.getMimeType().equals("text/plain")) {
            stringBuilder.append(messagePart.getBody().getData());
         }
         if (messagePart.getParts() != null) {
            getPlainTextFromMessageParts(messagePart.getParts(), stringBuilder); 
         }
      }
   }
}
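One thing to watch with the approach above is that the encoded strings of several parts are concatenated first and decoded afterwards, so the result depends on how the decoder treats padding characters in the middle of the input. A possible alternative is to decode each text/plain part on its own as it is found. The sketch below illustrates that idea with the JDK’s java.util.Base64 URL-safe decoder, which matches the base64url encoding Gmail uses for body data. Part, collectPlainText and encode are stand-ins invented for this example so that it runs without the Gmail library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import java.util.Collections;
import java.util.List;

class PlainTextSketch {
    /** Hypothetical stand-in for MessagePart: a MIME type, encoded body data, and child parts. */
    static final class Part {
        final String mimeType;
        final String data;      // base64url text, or null for container parts
        final List<Part> parts; // never null in this sketch

        Part(String mimeType, String data, List<Part> parts) {
            this.mimeType = mimeType;
            this.data = data;
            this.parts = parts;
        }
    }

    /** Walks the part tree depth-first, decoding every text/plain body as it is found. */
    static void collectPlainText(List<Part> parts, StringBuilder out) {
        for (Part part : parts) {
            if ("text/plain".equals(part.mimeType) && part.data != null) {
                byte[] bytes = Base64.getUrlDecoder().decode(part.data);
                out.append(new String(bytes, StandardCharsets.UTF_8));
            }
            collectPlainText(part.parts, out);
        }
    }

    /** Test helper: encodes text the way this sketch expects body data to arrive. */
    static String encode(String text) {
        return Base64.getUrlEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        List<Part> none = Collections.emptyList();
        Part text = new Part("text/plain", encode("Hello, "), none);
        Part html = new Part("text/html", encode("<p>ignored</p>"), none);
        Part nestedText = new Part("text/plain", encode("family!"), none);
        Part alternative = new Part("multipart/alternative", null, Arrays.asList(html, nestedText));

        StringBuilder out = new StringBuilder();
        collectPlainText(Arrays.asList(text, alternative), out);
        System.out.println(out); // prints Hello, family!
    }
}
```

Putting the literal on the left of equals() and checking the data for null also means a part with a missing MIME type or an empty body is skipped rather than causing an exception.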

Back in the messages loop of our processMessages method, we’ve now got a string of email text.  Our next step is to create a text file and write the content to it.  This is straightforward file IO.  I place the subject at the top and put a couple of newlines before the content.

messageContent = getContent(message);
try {
   //Create a text file for the raw data
   File emailTextFile = new File(subDir, emailFileName);
   if (emailTextFile.exists() || emailTextFile.createNewFile()) {
      BufferedWriter bw = new BufferedWriter(new FileWriter(emailTextFile));
      bw.write(subject);      
      bw.newLine();
      bw.newLine();      
      bw.write(messageContent);
      bw.flush();      
      bw.close();
   }
} catch (IOException ioe) {
   ioe.printStackTrace();
}
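If you are on Java 7 or later, the same write can be expressed with java.nio.file and try-with-resources, which closes the writer even when a write fails. This is a sketch under my own naming (EmailTextWriter and writeEmail are not part of the project); it also pins the file encoding to UTF-8, whereas a FileWriter created without a charset uses the JVM’s default charset.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class EmailTextWriter {
    /** Writes the subject, a blank line, and the content to dir/fileName, creating dir if needed. */
    static Path writeEmail(Path dir, String fileName, String subject, String content) throws IOException {
        Files.createDirectories(dir); // does nothing if the directory already exists
        Path file = dir.resolve(fileName);
        // The writer is closed automatically at the end of the try block.
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            writer.write(subject);
            writer.newLine();
            writer.newLine();
            writer.write(content);
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("gmail-extract");
        Path file = writeEmail(dir.resolve("Week 12"), "Week 12.txt", "Week 12", "We went to the zoo.");
        System.out.println(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }
}
```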

Now that we’ve got our email text saved to a text file, we can download the attachments. The attachments are handled in the getAttachments method.  The method is provided with the message parts collection from the message payload, an empty collection for storing file names, the path to the sub-directory that we’re storing the message text and attachments to and that userId variable we defined at the beginning.

As we loop through the message parts, we look for a file name which indicates that the part is an attachment.  If there’s an attachment, we save off the file name and get the Attachment ID from the part’s body.  Using the userId provided, the part ID and the saved attachment ID, we then get the com.google.api.services.gmail.model.MessagePartBody from the message’s attachments object.  This MessagePartBody represents the attachment. Like the email text, the attachment is encoded in Base64.  We get the data and decode it into an array of bytes. Once we have that array of bytes it’s a simple File IO operation to write the attachment to our directory.

For message parts that have a MIME type of “multipart/related,” we call the getAttachments method recursively, passing the message part’s own collection of message parts.

private void getAttachments(List<MessagePart> messageParts, List<String> fileNames, String dir, String userId) {
   if (!dir.endsWith("/")) {
      dir += "/";
   }
 
   if (messageParts != null) {
      for (MessagePart part : messageParts) {
         //For each part, see if it has a file name, if it does it's an attachment
         if ((part.getFilename() != null && part.getFilename().length() > 0)) {
            String filename = part.getFilename();
            String attId = part.getBody().getAttachmentId();
            MessagePartBody attachPart;
            FileOutputStream fileOutFile = null;
            try {
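               //Go get the attachment part and get the bytes.
               //Note: the Gmail API documents the second argument of attachments().get() as the
               //ID of the message that contains the attachment; this code passes the part ID.
               attachPart = gmailService.users().messages().attachments().get(userId, part.getPartId(), attId).execute();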
               byte[] fileByteArray = Base64.decodeBase64(attachPart.getData());
 
               //Write the attachment to the output dir
               fileOutFile = new FileOutputStream(dir + filename);
               fileOutFile.write(fileByteArray);
               fileOutFile.close();
               fileNames.add(filename);
            } catch (IOException e) {
               System.out.println("IO Exception processing attachment: " + filename);
            } finally {
               if (fileOutFile != null) {
                  try {
                     fileOutFile.close();
                  } catch (IOException e) {
                     // probably doesn't matter
                  }
               }
            }
         } else if (part.getMimeType().equals("multipart/related")) {
            if (part.getParts() != null) {
               getAttachments(part.getParts(), fileNames, dir, userId);
            }
         }
    }
   }
 }
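Attachment file names come from whoever sent the email, so it can be worth checking that a name does not point outside the output directory before writing to it. The sketch below shows one way to do that with java.nio.file, together with the JDK’s URL-safe Base64 decoder. AttachmentWriter and writeAttachment are hypothetical names for this example, and the encoded data is passed in as a plain string so that the block runs without the Gmail library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

class AttachmentWriter {
    /** Decodes base64url attachment data and writes it to dir/fileName. */
    static Path writeAttachment(Path dir, String fileName, String base64UrlData) throws IOException {
        Path base = dir.toAbsolutePath().normalize();
        Path target = base.resolve(fileName).normalize();
        // Refuse names such as "../x" or "/etc/x" that would land outside the directory.
        if (!target.startsWith(base) || target.equals(base)) {
            throw new IOException("Unsafe attachment name: " + fileName);
        }
        byte[] bytes = Base64.getUrlDecoder().decode(base64UrlData);
        Files.createDirectories(target.getParent());
        return Files.write(target, bytes); // creates or truncates the file, then closes it
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("gmail-attachments");
        String data = Base64.getUrlEncoder().encodeToString("hello".getBytes(StandardCharsets.UTF_8));
        Path written = writeAttachment(dir, "note.txt", data);
        System.out.println(written.getFileName() + " has " + Files.size(written) + " bytes");
    }
}
```

Files.write opens, writes, and closes the file in one call, so no finally block is needed to close a stream.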

Conclusion

This post covers the basics of creating a utility that extracts email text and attachments from your Gmail account and writes them into individual sub-directories.

The full example is available on GitHub: https://github.com/amdegregorio/Examples-ExampleOne
