Analysing/Detecting Malicious PDFs: A Primer

August 14, 2017

Over recent weeks I have been analysing threats that simply come my way, from my personal email to friends who alert me when they see something suspicious. The techniques used are rudimentary and have been around for many years; the peculiar thing is that they still seem to evade some antivirus solutions and even email protection providers. Google, the largest email provider, has a sophisticated system for identifying malware, yet some samples I saw were not picked up and were instead deemed clean.

I’m going to discuss how you can analyse malicious PDFs and Office documents without a sandbox. Sandboxes are great, but sometimes they give you little of the information you actually need; new patterns, new techniques and new C2s are sometimes not picked up by sandboxes. Sometimes it’s best to go deep and, luckily for us, the PDF and Microsoft Office formats are remarkably simple to understand. There are many other properties we can use to learn about malicious items, and one common choice is entropy. This is useful for PDFs because the format has something called ‘streams’, and these streams are usually compressed.

Many samples simply rely on the user accepting the PDF attachment, so there is not much content other than the ‘Embedded Object’, some text and whatever the format specification requires. Comparing the entropy of two files that are both PDFs shows that the malicious one contains a large amount of compressed data; in fact, the embedded object is a Word document, which is itself a compressed zip file. The amount of red (red indicating the highest entropy) in the malicious PDF compared to the benign PDF is clear. Although this cannot be a strong indicator on its own, it can help analysts quickly identify possibly malicious documents. Large portions of highly compressed data are not necessarily images; in this sample the data is primarily the embedded object.
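The entropy comparison above can be approximated in a few lines. This is a minimal sketch, not the tool used for the pictures: the function names, the 256-byte block size and the 7.0 threshold are my own assumptions, chosen only for illustration.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def block_entropy(data: bytes, block_size: int = 256) -> List[float]:
    """Entropy of each consecutive block; an entropy plot colours these values."""
    return [
        shannon_entropy(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def high_entropy_ratio(data: bytes, block_size: int = 256,
                       threshold: float = 7.0) -> float:
    """Fraction of blocks at or above threshold (threshold is an assumption)."""
    blocks = block_entropy(data, block_size)
    if not blocks:
        return 0.0
    return sum(1 for e in blocks if e >= threshold) / len(blocks)
```

Reading a file with `open(path, "rb").read()` and passing the bytes to `high_entropy_ratio` gives a rough figure for how much of the file looks compressed or encrypted. As with the coloured plots, treat it as a hint for triage and not as a verdict.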

If you open a PDF in a hex editor or a text editor, you may be pleasantly surprised by how many ASCII strings there are to help you understand the format; in fact, binary markers are not used to define the different parts of the format. The most important parts of a PDF are the streams. Streams can hold many things, but they are where payloads primarily reside. I don’t want to say that is always the case, because exploits exist and you can never say never with file formats. Adobe’s PDF file format manual is over 1,000 pages and you’ll be surprised to learn I haven’t read all of it. Streams are very important and can be identified through ASCII strings; in this sample lines are terminated with the hex byte 0x0A (line feed), although the format also accepts a carriage return or a carriage return followed by a line feed. The ASCII keywords are fairly easy to distinguish: stream, endstream, obj and endobj. We can also clearly see that one of the first objects in this PDF is an image, but this is unimportant to us; it is completely dwarfed by the embedded object.
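Because the keywords are plain ASCII, a rough object and stream listing can be pulled out with regular expressions. The sketch below is a naive illustration and not a real PDF parser: `list_objects` is a hypothetical helper of mine, and it will be confused by object streams, incremental updates or binary data that happens to contain the keywords.

```python
import re

# "N G obj ... endobj" wraps every indirect object.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# "stream" is followed by LF or CRLF; "endstream" closes the data.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)


def list_objects(pdf_bytes):
    """Return a list of (number, generation, dictionary, stream_or_None)."""
    found = []
    for m in OBJ_RE.finditer(pdf_bytes):
        body = m.group(3)
        s = STREAM_RE.search(body)
        if s:
            # Everything before the "stream" keyword is the object's dictionary.
            dictionary, stream = body[:s.start()].strip(), s.group(1)
        else:
            dictionary, stream = body.strip(), None
        found.append((int(m.group(1)), int(m.group(2)), dictionary, stream))
    return found
```

Sorting the result by stream length is a quick way to see which object dwarfs the rest, which in a sample like this one would be the embedded object and not the image.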

The embedded object is clear to see, much like the image. In stream objects we should also be sure to look at the Filter ASCII string, as its parameter is important in decoding or inflating the content. There are multiple filters the format can use; I’m not sure if it’s a complete list, but you can find the variants listed in this resource. At the end of this stream we also have some clear indications of an embedded object: this is our payload, a Word document which has macros. Simply by looking at the first stage of infection, the PDF, we already have an indicator of compromise. The file name is set, as we can see; look for “.docm”.
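Both checks described above, reading the Filter entry and looking for an embedded file name, can be sketched with simple byte searches. Everything named here is an assumption for illustration: the helper names, the extension list and the regular expressions are mine, and they only handle literal strings and unobfuscated names (PDF also allows hex strings and `#xx` escapes in names, which this would miss).

```python
import re

FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")
NAME_RE = re.compile(rb"/([A-Za-z0-9]+)")
# Literal file names such as (something.docm) after /F or /UF.
FILENAME_RE = re.compile(rb"/U?F\s*\(([^)]*)\)")
# Illustrative list only, not exhaustive.
SUSPICIOUS_EXT = (b".docm", b".xlsm", b".pptm", b".exe", b".js", b".vbs")


def filters_of(dictionary: bytes):
    """Return the filter names from a stream dictionary, in decoding order."""
    m = FILTER_RE.search(dictionary)
    if not m:
        return []
    return [n.decode("ascii") for n in NAME_RE.findall(m.group(1))]


def embedded_file_indicators(pdf_bytes: bytes):
    """Collect cheap indicators that the PDF carries an embedded file."""
    names = [m.group(1) for m in FILENAME_RE.finditer(pdf_bytes)]
    return {
        "has_embedded_file": b"/EmbeddedFile" in pdf_bytes,
        "filenames": names,
        "suspicious": [n for n in names if n.lower().endswith(SUSPICIOUS_EXT)],
    }
```

A non-empty `suspicious` list alongside `has_embedded_file` is the kind of indicator of compromise the paragraph above describes, available without ever opening the document in a reader.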

Embedded objects are pretty big indicators that something fishy is going on, but they are not the only streams we can rely on. A PDF can perform many different types of functionality, and one of them is Adobe’s JavaScript support. With this JavaScript the PDF can produce the prompt shown in the image above. Although prompting the user with this message is a risky approach, it must work at least some of the time for them to keep using the technique, and it can be convincing enough if done correctly. The problem is that this PDF was attempting to pass as an image, using the name IMG_[random numbers here], which doesn’t really add up when you have to open a PDF and then accept that you want to open a Word document of some description.

The ‘FlateDecode’ filter is zlib, here with default compression set. The hex value 0x78 is an indicator of zlib (zlib, deflate and gzip are closely related but not identical: raw deflate has no header and gzip starts with 0x1F 0x8B; if you knew that, go you!). The second byte completes the zlib header: default compression is 0x9C, low or none is 0x01 and best is 0xDA. If you want to know more about zlib, give the format specification a look over. There are some variant headers for zlib, I must add, but these are the main ones to look out for if you’re analysing manually. There are multiple instances of 0x78 0x9C in the file, and a small tool I used to analyse the data confirms that it is compressed. In this example I’m looking at the JavaScript of the PDF, highlighted in the picture; it’s small, but we can see the magic number for zlib default compression. The plaintext JavaScript essentially attempts to launch the dialogue from the picture above, and it also gives indicators of compromise by referencing document or embedded object strings.

The actual embedded object is confirmed to be a Word document: opening it in a hex editor shows ‘PK’ at the start, which is associated with zip files, and modern Office files keep their contents in a zip container.
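The header bytes mentioned above are easy to check in code, and Python’s standard `zlib` module can inflate a FlateDecode stream once you have its raw bytes. This is a minimal sketch under my own naming; the helper functions and the returned labels are hypothetical, and the stream bytes are assumed to have been carved out of the PDF already.

```python
import zlib

# Second header byte for the common 0x78 (32 KB window) zlib streams.
ZLIB_LEVELS = {
    0x01: "low or no compression",
    0x5E: "fast compression",
    0x9C: "default compression",
    0xDA: "best compression",
}


def looks_like_zlib(data: bytes) -> bool:
    """Validate the two-byte zlib header: deflate method plus checksum bits."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    return (cmf & 0x0F) == 8 and (cmf >> 4) <= 7 and ((cmf << 8) | flg) % 31 == 0


def zlib_level(data: bytes) -> str:
    """Describe the compression level hinted at by a 0x78 header."""
    if not looks_like_zlib(data) or data[0] != 0x78:
        return "not a 0x78 zlib header"
    return ZLIB_LEVELS.get(data[1], "other zlib variant")


def inflate(data: bytes) -> bytes:
    """Inflate a zlib stream, ignoring any bytes that trail the stream."""
    return zlib.decompressobj().decompress(data)


def identify(data: bytes) -> str:
    """Rough magic-number check on carved or inflated content."""
    if data[:4] == b"PK\x03\x04":
        return "zip"  # includes Office files such as .docx and .docm
    if looks_like_zlib(data):
        return "zlib"
    return "unknown"
```

Running `identify(inflate(stream))` on the embedded file’s stream should come back as `zip` for a sample like the one described here, while the JavaScript stream should inflate to readable text.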

Hopefully you learnt something from the blog post. 🙂