The Case of the PDF Malware

I’ve been working with PDFs for the past few weeks, and since I have a large collection of them, going through them all has taken a lot of time.

I’ve seen a few different exploits, but there is one that especially caught my attention because I thought it was pretty wicked and nicely conceived.

It all started after I ran a couple of samples through a script a while ago.

The Strange PDF File

The PDF in question can be found on VirusTotal:

MD5: 805538ff200ec714a735ef3bc1fff1f0

SHA256: e108432dd9dad6ff57c8de6e907fd6dd25b62673bd4799fa1a47b200db5acf7c

Figure 1. PDF parsing

In Figure 1, we can see that the PDF contains data that will be decoded using the /CCITTFaxDecode filter (see http://en.wikipedia.org/wiki/Portable_Document_Format). Since this is not very common, I decided that it was worth a little more investigation. We know that opening that file will cause Acrobat Reader to attempt to fetch something from a given web server (which is bad for obvious reasons).
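When going through a large collection, one way to spot this kind of oddity is to count which stream filters each file names. The following is only a minimal sketch, not the script mentioned above: it assumes the filter names are visible in the raw bytes of the file (it does not look inside compressed object streams), and `decode_name` and `count_filters` are hypothetical helper names.

```python
import re
from collections import Counter

# Standard stream filter names defined by the PDF specification.
KNOWN_FILTERS = {
    "ASCIIHexDecode", "ASCII85Decode", "LZWDecode", "FlateDecode",
    "RunLengthDecode", "CCITTFaxDecode", "JBIG2Decode", "DCTDecode",
    "JPXDecode", "Crypt",
}

# A PDF name starts with '/'; '#xx' hex escapes may be used to obscure it.
NAME_RE = re.compile(rb"/([A-Za-z0-9#]+)")
HEX_ESCAPE_RE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_name(raw: bytes) -> str:
    """Undo '#xx' escapes so that '/Fl#61teDecode' reads as 'FlateDecode'."""
    plain = HEX_ESCAPE_RE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return plain.decode("latin-1")


def count_filters(data: bytes) -> Counter:
    """Count how often each known filter name appears in the raw PDF bytes."""
    counts = Counter()
    for match in NAME_RE.finditer(data):
        name = decode_name(match.group(1))
        if name in KNOWN_FILTERS:
            counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = b"<< /Filter /CCITTFaxDecode /Length 10 >>"
    print(count_filters(sample))
```

A file whose counts include a rarely used filter such as CCITTFaxDecode could then be put aside for a closer look.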

Once we launch Acrobat Reader and open that file, nothing seems to happen. From a user’s standpoint, it may not be much of a concern, but from ours, it is worrisome. The PDF file gets loaded by AcroRd32.exe and, while the window doesn’t appear, the application is obviously doing something.

No script inside the PDF attempts a heap spray, yet something is running that should not be.

Catching Network Activity

Since we know that the malware is trying to retrieve something from the internet, I put a breakpoint inside urlmon.dll. A lot of malware uses urlmon!URLDownloadToFileA() because it’s very simple to call: all the caller really has to supply is a URL and a file name.

With my breakpoint set up, I let the malware run and lo and behold, we get a hit on the aforementioned API:

Figure 2. URLDownloadToFileA call

We can see here that the malware has made the call to the API, and by looking at the stack we can trace back to where the exploit is running from.

Let’s take a look at the stack:

Figure 3. URLDownloadToFileA stack frame

The call seems to emanate from icucnv34.dll, which is a valid DLL loaded by Acrobat Reader. So far, nothing seems to be particularly wrong, but if we look at the caller address a little more closely, we can see that there’s something fishy about it.

As a reminder, before we go ahead, most shell code is located somewhere on the stack or on the heap. For instance, after a successful heap spray, the shell code will be running from some part of the heap allocated by the malicious script.

This is where it becomes interesting…

Where is the shell code?

As we saw in Figure 3, URLDownloadToFileA is called from somewhere within icucnv34.dll (or so it seems).

Let’s see what the PEB can tell us. After entering the !peb command, here’s what WinDbg gives us:

Figure 4. PEB

We’ve spotted our DLL. The module start address is 0x4a800000.

Figure 3 showed that the return address for URLDownloadToFileA is icucnv34+0x456b1, which is 0x4a800000 + 0x456b1 = 0x4a8456b1.

This is rather interesting because the DLL itself is only 300K and obviously not all of it is code. Here are the details:

Figure 5. Module Information

What did we get here?

  1. Base address = 0x4a800000
  2. Size of code = 0x21000

Since the return address of the API is icucnv34+0x456b1, and that offset lies well beyond the 0x21000 bytes of code, the code is not running inside the DLL code segment!
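The "size of code" value reported by the debugger is stored in the module's PE optional header, next to the offset at which the code begins (BaseOfCode). As a rough sketch of how the same check could be scripted, the snippet below reads both fields and tests an address against them. It assumes the module's header bytes are available as a `bytes` object (for example, read from the DLL on disk); `read_code_range` and `in_code_range` are hypothetical helper names, not part of any debugger API.

```python
import struct


def read_code_range(image: bytes) -> tuple[int, int]:
    """Return (BaseOfCode, SizeOfCode) from a PE image's optional header."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at offset 0x3C points to the 'PE\0\0' signature.
    (e_lfanew,) = struct.unpack_from("<I", image, 0x3C)
    if image[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The optional header follows the 4-byte signature and 20-byte COFF header.
    opt = e_lfanew + 4 + 20
    (size_of_code,) = struct.unpack_from("<I", image, opt + 4)
    (base_of_code,) = struct.unpack_from("<I", image, opt + 20)
    return base_of_code, size_of_code


def in_code_range(addr: int, image_base: int,
                  base_of_code: int, size_of_code: int) -> bool:
    """True if addr lies inside [image_base + BaseOfCode, ... + SizeOfCode)."""
    start = image_base + base_of_code
    return start <= addr < start + size_of_code
```

With the values from this sample (image base 0x4a800000, code size 0x21000) and a code offset of 0x1000, `in_code_range(0x4a8456b1, 0x4a800000, 0x1000, 0x21000)` evaluates to `False`.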

So where is it exactly?

Figure 6. Shell code memory dump

It appears that the code we are looking at is located right before the module’s resources. Going to 0x4a8456b1 and back a little, we can determine the entry point of the shell code. As shown in Figure 6, the URL accessed by the API is located at the end of the code, which starts with a NOP (0x90) followed by a short jump (0xeb 0x11). Initially the shell code is XOR-encoded: the code between the entry point and the first jump address is a small stub that decodes the rest by XORing it with 0xA6.
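To illustrate the encoding, here is a minimal sketch of a single-byte XOR decoder that then looks for a URL in the decoded bytes. Only the key 0xA6 comes from the analysis above; `xor_bytes`, `find_urls`, and the URL in the demo are placeholders of my own, not data from the sample.

```python
import re

XOR_KEY = 0xA6  # single-byte key observed in the decoding stub

# Printable, non-space ASCII after the scheme; stops at NUL or whitespace.
URL_RE = re.compile(rb"https?://[\x21-\x7e]+")


def xor_bytes(data: bytes, key: int = XOR_KEY) -> bytes:
    """XOR every byte with the key; applying it twice restores the input."""
    return bytes(b ^ key for b in data)


def find_urls(blob: bytes) -> list[bytes]:
    """Return every http(s) URL found in the given bytes."""
    return URL_RE.findall(blob)


if __name__ == "__main__":
    # Placeholder payload: some filler bytes followed by a made-up URL.
    plain = b"\x00" * 4 + b"http://example.invalid/payload.exe\x00"
    encoded = xor_bytes(plain)
    print(find_urls(encoded))             # the URL is hidden while encoded
    print(find_urls(xor_bytes(encoded)))  # and visible once decoded
```

Because XOR is its own inverse, the same routine serves for encoding and decoding, which is part of why such stubs can be so short.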

As we know, a Windows executable is structured something like this (rough description):

Figure 7. Executable structure

If some shell code is copied somewhere in between the data and the resources, it will appear to be part of the DLL.

Hiding within a module space

As we’ve just observed, the shell code is not copied to the heap but into the address space of a chosen DLL.

This is pretty wicked because regular shell code detection may be fooled by it: such detection expects the return address of an API call made by shell code to be on the heap or the stack, not inside a DLL’s address space. If your detection is based on checking whether a return address is within the boundaries of a module, this sort of shell code will evade it. Instead, the detection engine must narrow the valid address range to the code segment of the module.

In this particular case, icucnv34.dll is loaded at 0x4a800000 and the code starts at image base + 0x1000, hence 0x4a801000.

As the !dh output in Figure 5 showed us, the code size is 0x21000, which means that the runnable code of the DLL spans 0x4a801000 to 0x4a822000. Looking at Figure 7, it becomes obvious that the shell code we’re looking at is running outside the .text segment. Since the shell code executes way outside of this range, we can deduce that it is indeed malicious code, and we can report the network activity as well as the use of shell code for that sort of PDF exploit.
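The tightened check can be sketched as a small classifier that distinguishes "inside a module" from "inside a module's code". This is an illustration under stated assumptions rather than a real detection engine: the `Module` class and `classify_return_address` are hypothetical, and the image size of 0x4b000 is only an approximation of the "300K" mentioned earlier. The base address, code offset, and code size are the values from this sample.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    name: str
    base: int        # load address
    image_size: int  # size of the whole mapped image
    code_rva: int    # offset of the code from the base
    code_size: int   # "size of code" from the header

    def contains(self, addr: int) -> bool:
        """Naive check: is addr anywhere inside the module image?"""
        return self.base <= addr < self.base + self.image_size

    def code_contains(self, addr: int) -> bool:
        """Tightened check: is addr inside the module's code range?"""
        start = self.base + self.code_rva
        return start <= addr < start + self.code_size


def classify_return_address(addr: int, modules: list[Module]) -> str:
    """Label a return address by where it falls relative to loaded modules."""
    for module in modules:
        if module.contains(addr):
            if module.code_contains(addr):
                return "module code"
            return "module, outside code"  # suspicious: hidden shell code?
    return "outside any module"            # classic heap/stack shell code


# image_size is an assumed, approximate value (about 300K).
ICUCNV34 = Module("icucnv34", 0x4A800000, 0x4B000, 0x1000, 0x21000)

if __name__ == "__main__":
    print(classify_return_address(0x4A8456B1, [ICUCNV34]))
```

A detector that only calls `contains` would accept 0x4a8456b1 as legitimate, whereas the code-range check flags it.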