Over recent weeks I have been analysing threats that simply come my way, from my personal email to friends who alert me when they see something suspicious. The techniques used are rudimentary and have been around for many years. The peculiar thing is that they still seem to evade some antivirus solutions and even email protection providers. Google, the largest email provider, has a sophisticated system for identifying malware, yet some of the samples I saw were not picked up and were instead deemed clean. I’m going to discuss how you can analyse malicious PDFs and Office documents without a sandbox. Sandboxes are great, but they sometimes give you little of the information you actually need: new patterns, new techniques and new C2s are not always picked up. Sometimes it’s best to go deep and, luckily for us, the PDF and Microsoft Office formats are remarkably simple to understand.
There are many properties we can use to learn about malicious items, and one common choice is entropy: compressed or encrypted data has high entropy, so it stands out against plain text. This applies to the PDF format, which has something called ‘streams’, and these streams are usually compressed.
Many samples simply rely on the user opening the PDF attachment, so there is not much content other than the ‘Embedded Object’, some text and whatever the format specification requires. Comparing the entropy of two files that are both PDFs shows that the malicious one contains a large amount of compressed data. In fact, the embedded object is a Word document, which is itself a compressed file (a zip). The amount of red (red indicating the highest entropy) in a malicious PDF compared to a benign PDF is clear. Although this cannot be a strong indicator on its own, it can help analysts quickly identify possibly malicious documents. Large portions of highly compressed data can also be images, but in this sample it is primarily the embedded object.
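If you want to reproduce this kind of entropy view yourself, the idea can be sketched in a few lines of Python. This is only a minimal sketch and not necessarily how the plots discussed here were produced: the function names, the 256-byte block size and the synthetic input are all my own assumptions. It computes Shannon entropy (in bits per byte) for each consecutive block, so compressed regions show up as values close to 8.

```python
import math
import os
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def block_entropies(data: bytes, block_size: int = 256) -> list[float]:
    """Entropy of each consecutive block (block_size is an arbitrary choice)."""
    return [
        shannon_entropy(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


if __name__ == "__main__":
    # Synthetic stand-in for a real sample (assumption): repetitive ASCII
    # text followed by compressed bytes, which should score much higher.
    text = b"%PDF-1.4 plain dictionary text " * 64
    packed = zlib.compress(os.urandom(4096))
    for index, value in enumerate(block_entropies(text + packed)):
        print(f"block {index:3d}  entropy {value:.2f}")
```

To try it on a real file, read the file in binary mode and pass the bytes to `block_entropies`. A long run of high values is only a hint that something is compressed or encrypted, not proof that it is malicious.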
If you open a PDF in a hex editor or a text editor you may be pleasantly surprised by the number of ASCII strings that let you understand the format. In fact, the different parts of the format are marked with ASCII keywords rather than hex byte combinations. The most important parts of a PDF are the streams. Streams can be many things, but they are where payloads primarily reside. I don’t want to claim this is the complete picture, because exploits exist and you can never say never with file formats. Adobe’s PDF File Format manual is over 1,000 pages and you’ll be surprised to learn I haven’t read all of it. Streams are very important and can be identified through ASCII strings. In this sample, lines are terminated with the byte 0x0A (the format also accepts 0x0D and 0x0D 0x0A). The keywords are fairly easy to distinguish: stream, endstream, obj and endobj. We can also clearly see that one of the first objects in this PDF is an image, but this is unimportant to us; it is completely dwarfed by the embedded object.
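Because the keywords are plain ASCII, a rough first pass over a PDF can be done with regular expressions. The sketch below is an assumption-laden shortcut rather than a real parser: `iter_objects`, `PdfObject` and the tiny `SAMPLE` document are names and data I made up for illustration. It also trusts the keywords alone, whereas a robust parser would follow the cross-reference table and the `/Length` entry, since binary stream data can contain the bytes `endstream` or `endobj` by coincidence.

```python
import re
from typing import Iterator, NamedTuple, Optional

# "N G obj ... endobj" wraps every indirect object.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
# The stream keyword is followed by CRLF or LF, then the raw data.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)


class PdfObject(NamedTuple):
    number: int
    generation: int
    dictionary: bytes          # everything before the stream keyword
    stream: Optional[bytes]    # raw (still encoded) stream data, if any


def iter_objects(pdf: bytes) -> Iterator[PdfObject]:
    """Yield each object found by keyword matching (a heuristic, see above)."""
    for match in OBJ_RE.finditer(pdf):
        body = match.group(3)
        stream_match = STREAM_RE.search(body)
        if stream_match:
            header = body[:stream_match.start()]
            stream = stream_match.group(1)
        else:
            header, stream = body, None
        yield PdfObject(int(match.group(1)), int(match.group(2)),
                        header.strip(), stream)


# Hypothetical two-object document, just enough to exercise the regexes.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Length 5 /Filter /FlateDecode >>\n"
    b"stream\nHELLO\nendstream\nendobj\n"
)

if __name__ == "__main__":
    for obj in iter_objects(SAMPLE):
        size = "no stream" if obj.stream is None else f"{len(obj.stream)} stream bytes"
        print(obj.number, obj.generation, obj.dictionary, size)
```

Sorting the results by stream length is a quick way to spot the one object that dwarfs everything else.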
The embedded object is clear to see, much like the image. In stream objects we should also be sure to look at the Filter ASCII string, as its parameter tells us how to decode or inflate the content. There are multiple filters the format supports; I’m not sure if it’s a complete list, but you can find them listed in this resource. At the end of this stream we also have some clear indications of an embedded object: this is our payload, a Word document which has macros. Simply by looking at the first stage of infection, the PDF, we already have an indicator of compromise. The file name is set, as we can see; look for “.docm”.
Hopefully you learnt something from the blog post. 🙂