Pillars of Analyzing Malicious MS Office Documents — Part 1–3: Unveiling Document Format Structures

Aug 3, 2023

The Microsoft Office suite has stood tall as a pillar of productivity and convenience for decades. From crafting presentations in PowerPoint to crunching numbers in Excel and composing memos in Word, millions of users rely on these tools daily. I can’t imagine a business running without MS Office documents, can you? Yet beneath this veneer of normalcy, a sinister threat lurks, quietly exploiting the very software we trust.

Cybercriminals have ingeniously turned MS Office documents into a covert battlefield, not only by weaponizing them with sophisticated payloads and exploits, but also by employing techniques that slow down analysis and detection! In this series of stories, we’ll try to understand the structure of MS Office documents to reveal the techniques cybercriminals leverage to abuse those files.

This series of stories was inspired by the feedback we received at dPhish on a webinar session we produced to provide such knowledge to the Arabic community. Many people requested an English version of this session, so we decided to write this story down.
dPhish is a comprehensive Phishing Suite that assesses, enhances, and complements a company’s pre-, during-, and post-phishing countermeasures by proactively evaluating and strengthening its human and technological defenses. In the event of an attack, our solutions respond by detecting and quarantining malicious emails, and by searching for credentials being sold on the dark web in case a phishing attack succeeds undetected.

[Webinar] Analyzing Malicious MS Office Documents (Arabic)

Disclaimer: This story is not intended to explain any tool; there are already a lot of tutorials that cover them.

The Motivation

A reasonable question that I’m asked every time I mention the manual analysis of MS Office documents is: why do we need to do this?! We already have secure email gateways, powerful sandboxes, and robust anti-virus solutions, not to mention that our employees are cybersecurity aware, fingers crossed 🤞 Actually, there is a plethora of reasons; however, I’m going to mention only a few of them.

As a blue teamer, your ultimate goal is to protect the organization and respond to cyber attacks, so you must have a thorough grasp of the network. Understanding the network involves more than just the topology; it also means baselining it to determine what is normal and what isn’t, and that includes Microsoft Office documents. For example, you need to be aware of which documents are expected to contain macros and what those macros do, and you need to double-check the sandbox’s verdict on a document, especially when you get a release request for an email that your email gateway quarantined because it suspected the attachment. This brings up an important fact: the email gateway and the sandbox can produce false positives and false negatives that affect operations. Also, malicious MS Office documents are among the easier pieces of malware to analyze, which is good news for those who want to begin their journey into the malware analysis world.

Another reason for SOC teams to consider is that they might have a repository of playbooks. Those playbooks should be helpful to SOC newcomers; however, we often see something like this in phishing playbooks.

This flowchart contains so many places where we can introduce some improvements, but the step where it says to “Investigate the email header, body, and attachments” is the one that always confuses less experienced analysts. This is what I imagine the analysts to look like when they see this step.

What should I do? — An Egyptian meme xDD

In this story, we’ll try to fill this gap by discussing the pillars of analyzing malicious MS Office documents.

Building a Knowledge Base

When we talk about analyzing malicious Microsoft Office documents, we should mention that it is as easy as building a knowledge base of:
1. Understanding the MS Office format structures.
2. Getting familiar with the common attack techniques.
3. Knowing how to bypass anti-analysis techniques.
Those are the three pillars of analyzing Microsoft Office documents, and they hold even when you are analyzing an attack technique that you’re seeing for the first time.

Unveiling the MS Office Documents Format Structures

Understanding what is being evaluated is a key differentiator between analysts. Imagine an analyst who follows a scheme or cheat sheet to run some tools in sequence, and compare him to another analyst who truly understands the nature of what is being investigated. The first analyst’s success depends on how well the scheme covers the scenarios he encounters. Besides, he might not notice a bug in the tool or script he is running, so he blindly trusts the output. The second analyst, however, understands the nitty-gritty details of the files being investigated, which makes him capable of analyzing attacks he has never seen before; he knows what is normal and what is suspicious. A huge difference, isn’t it?

Microsoft Office documents come in three formats: OLE2 (Object Linking and Embedding), OOXML (Office Open XML), and RTF (Rich Text Format). In this section of the story, we’ll discuss the important information that analysts need to know to perform a successful analysis.

OLE2 — Object Linking and Embedding

You can think of OLE2 as a file system packed into a single container file, similar in spirit to a ZIP archive (although it is not an actual ZIP file). It is also known as Structured Storage (SS) because it is made up of files and folders; in the context of OLE2, however, files are referred to as streams, while folders are referred to as storages. The following screenshot shows an OLE2 file, a doc file in this case, rendered in SSView on the right and extracted like an archive on the left.

OLE2 files can be recognized by either the file extension (doc, xls, ppt, …) or the magic number of the file (D0 CF 11 E0 A1 B1 1A E1). A fun fact: the first 7 nibbles (D0CF11E) read as “DOCFILE”. Pay attention to this information, as we’ll use it later. A major concern about OLE2 files is that they support macros. OLE2 is the legacy format; however, it’s still in use.
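Since the file extension can be changed by anyone, checking the magic number is the more reliable of the two options. The following is a minimal sketch of that check in Python. The function names are hypothetical, and the RTF signature is an assumption added for completeness (it is not stated in this story); the OLE2 and ZIP signatures are the ones quoted in the text.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # signature quoted in this story
ZIP_MAGIC = bytes.fromhex("504B0304")           # OOXML packages are ZIP archives
RTF_MAGIC = b"{\\rtf"                           # assumption: usual start of an RTF file


def identify_format(header: bytes) -> str:
    """Classify a document by its leading bytes, ignoring the extension."""
    if header.startswith(OLE2_MAGIC):
        return "OLE2"
    if header.startswith(ZIP_MAGIC):
        return "ZIP (possibly OOXML)"
    if header.startswith(RTF_MAGIC):
        return "RTF"
    return "unknown"


def identify_file(path: str) -> str:
    """Read only the first bytes of the file; the document is never opened in Office."""
    with open(path, "rb") as handle:
        return identify_format(handle.read(16))
```

A ZIP signature alone does not prove that the file is an Office document, so a "ZIP (possibly OOXML)" result still needs a look at the package contents.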

Important OLE2 Streams and Storages

WordDocument — This stream exists in the root, as you can see in the following screenshot. In the case of a Word doc file, it contains the text of the document.

SummaryInformation — This stream contains metadata about the document, such as the title, the author, the tags, etc. This stream is very useful for threat actor attribution.

It can also be parsed easily with SSView.
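If you want to see what a viewer like SSView is reading, the storages and streams of an OLE2 file can be enumerated straight from its directory. The sketch below is a deliberately simplified, standard-library-only reading of the compound file layout as I understand it: it only uses the FAT sector locations stored in the header, returns a flat list rather than a tree, and the function name is hypothetical. Treat it as a learning aid, not as a replacement for a real parser.

```python
import struct

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
MAX_REGULAR_SECTOR = 0xFFFFFFFA  # larger values are special markers (free, end of chain, ...)
END_OF_CHAIN = 0xFFFFFFFE
ENTRY_TYPES = {1: "storage", 2: "stream", 5: "root"}


def list_ole2_entries(data: bytes) -> list:
    """Return (type, name) for every used directory entry of an OLE2 file."""
    if data[:8] != OLE2_MAGIC:
        raise ValueError("not an OLE2 file")
    sector_size = 1 << struct.unpack_from("<H", data, 30)[0]  # usually 512
    first_dir_sector = struct.unpack_from("<I", data, 48)[0]

    def sector(index: int) -> bytes:
        start = (index + 1) * sector_size  # sector 0 starts right after the header
        return data[start:start + sector_size]

    # Build the FAT from the 109 sector locations kept in the header.
    fat = []
    for location in struct.unpack_from("<109I", data, 76):
        if location <= MAX_REGULAR_SECTOR:
            fat.extend(struct.unpack(f"<{sector_size // 4}I", sector(location)))

    # Walk the chain of directory sectors; each entry is 128 bytes long.
    entries, seen, current = [], set(), first_dir_sector
    while current <= MAX_REGULAR_SECTOR and current not in seen:
        seen.add(current)
        chunk = sector(current)
        for offset in range(0, len(chunk) - 127, 128):
            name_len = struct.unpack_from("<H", chunk, offset + 64)[0]
            kind = chunk[offset + 66]
            if kind in ENTRY_TYPES and 2 <= name_len <= 64:
                raw_name = chunk[offset:offset + name_len - 2]  # drop the trailing NUL
                entries.append((ENTRY_TYPES[kind], raw_name.decode("utf-16-le", "replace")))
        current = fat[current] if current < len(fat) else END_OF_CHAIN
    return entries
```

Calling `list_ole2_entries(open(path, "rb").read())` on a doc file would typically show the root entry, the WordDocument stream, and the summary information stream, whose name starts with a non-printable \x05 character.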

ThisDocument — This stream is located in the VBA storage inside the Macros storage (Macros/VBA). Its name follows the name of the VBA module, so it changes if the module is renamed; in this case, it was ThisDocument. This stream contains the VBA code that has been added to the ThisDocument module. Many tools and scripts look directly into such streams to check for the existence of macro code and, hopefully, to spot some keywords.

_SRP — This stream is also located in the VBA storage inside the Macros storage. When the attacker creates his malicious document by overwriting an existing one rather than creating a new one, the _SRP stream might contain older versions of the macro. It is particularly useful for attribution, for comparing sample similarities, and for gathering additional Threat Intelligence information that does not exist in the current VBA code. The screenshot below displays an example of a document that has been modified, where we were able to find some earlier code containing IoCs that no longer exist in the new code.

OOXML — Office Open XML

Since MS Office 2007, OOXML has been the new format for MS Office documents. It is made up of XML files that are compressed into a ZIP file; it even has the ZIP file’s magic number: 50 4B 03 04. It is identified by the extensions doc[x|m], xls[x|m], ppt[x|m], and so on. The files whose extension ends with x do not support macros, but the files whose extension ends with m do. This is important to know so that you can narrow down the investigation possibilities. The following screenshot shows a docx and a docm file that have been unzipped like any other ZIP file.

The concern with this format is twofold: it supports OLE object embedding, which could carry a macro, and its .rels files, which manage the document’s relationships, allow remote relationships as well.
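Remote relationships are easy to surface without opening the document, because the .rels files are plain XML inside the ZIP. Here is a minimal sketch, assuming that a remote relationship is marked with TargetMode="External" in the package relationships namespace; the function name is hypothetical. As with any untrusted sample, run it in an isolated analysis environment.

```python
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def external_relationships(path_or_file) -> list:
    """Return (rels part, relationship type, target) for every external relationship."""
    found = []
    with zipfile.ZipFile(path_or_file) as package:
        for name in package.namelist():
            if not name.endswith(".rels"):
                continue
            root = ET.fromstring(package.read(name))
            for rel in root.iter(REL_NS + "Relationship"):
                if rel.get("TargetMode") == "External":
                    found.append((name, rel.get("Type"), rel.get("Target")))
    return found
```

An external hyperlink is usually harmless, so the relationship type matters as much as the target: a template or an OLE object loaded from a remote URL deserves far more attention than a link in the text.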

Important Files

[Content_Types].xml — This XML file exists in the root and acts as the index of the package and lists all content types of the parts that the package contains.
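Because [Content_Types].xml is the index of the package, reading it is a quick way to see what the package claims to contain. The sketch below collects the declared content types and applies a rough heuristic for macro-related parts. The function names are hypothetical, and the substring check on "vbaProject" and "macroEnabled" is my assumption about how such content types are usually named, not an exhaustive rule.

```python
import zipfile
import xml.etree.ElementTree as ET

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"


def read_content_types(path_or_file) -> dict:
    """Map each part name (or '*.ext' default) to its declared content type."""
    with zipfile.ZipFile(path_or_file) as package:
        root = ET.fromstring(package.read("[Content_Types].xml"))
    types = {}
    for default in root.iter(CT_NS + "Default"):
        types["*." + (default.get("Extension") or "")] = default.get("ContentType")
    for override in root.iter(CT_NS + "Override"):
        types[override.get("PartName")] = override.get("ContentType")
    return types


def declares_macros(types: dict) -> bool:
    """Heuristic (assumption): macro-related content types mention VBA or macroEnabled."""
    lowered = [(content_type or "").lower() for content_type in types.values()]
    return any("vbaproject" in ct or "macroenabled" in ct for ct in lowered)
```

The index only tells you what the package declares, so it is worth comparing the result with the files that are actually present in the ZIP.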

word and xl folders — These folders exist in docm and xlsm files respectively. They contain a very important file called vbaProject.bin, although the file name might be changed.

vbaProject.bin — This is an OLE file that exists in OOXML files that support macros (documents whose extension ends with m and that contain a macro). The name could be changed by changing the VBA project name. This file is crucial, as it can be parsed to get the macro code without opening the document that is being analyzed.
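Since the file name cannot be trusted, one way to locate the macro container is to look for the OLE2 magic number inside the package instead of looking for "vbaProject.bin". A minimal sketch, with a hypothetical function name:

```python
import zipfile

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def find_ole_parts(path_or_file) -> list:
    """Return the names of all parts in the ZIP that start with the OLE2 signature."""
    matches = []
    with zipfile.ZipFile(path_or_file) as package:
        for info in package.infolist():
            if info.is_dir():
                continue
            with package.open(info) as part:
                if part.read(8) == OLE2_MAGIC:
                    matches.append(info.filename)
    return matches
```

Not every hit is a VBA project, because embedded OLE objects are stored as OLE2 files too. Each match is simply a candidate to extract and inspect with your OLE tooling of choice.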

RTF — Rich Text Format

Many users rely on word processing software for their daily tasks, but there are many different programs to choose from, which might affect the user experience if a document created in MS Word, for example, is opened in another program. For this reason, the idea of a uniform way to describe word-processing documents was raised, so that a single document can be processed in different programs without any problems. To accomplish this, RTF encodes text, graphics, and objects.

Although macros are not supported by RTF files, OLE object embedding is possible. When an RTF file is opened, embedded objects can be extracted automatically to the user’s %TEMP% directory, which threat actors may abuse to drop malicious scripts. Additionally, RTF parsers have repeatedly had serious parsing flaws that led to high and critical vulnerabilities such as DoS and RCE.
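Embedded objects in an RTF file are stored as hexadecimal text, so a first triage step is to decode that text and look for familiar signatures. The sketch below assumes the object data follows an \objdata control word and is written as plain hex digits separated only by whitespace. That assumption holds for clean files but is easily broken by obfuscated samples, so treat the result as a hint rather than a verdict. The function names are hypothetical.

```python
import binascii
import re

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Naive pattern: the control word, then hex digits and whitespace only.
OBJDATA = re.compile(rb"\\objdata\s*([0-9A-Fa-f\s]+)")


def embedded_objects(rtf: bytes) -> list:
    """Decode the hex payload that follows each \\objdata control word."""
    blobs = []
    for match in OBJDATA.finditer(rtf):
        hex_text = re.sub(rb"\s+", b"", match.group(1))
        hex_text = hex_text[: len(hex_text) // 2 * 2]  # drop a dangling nibble
        blobs.append(binascii.unhexlify(hex_text))
    return blobs


def contains_ole2(blob: bytes) -> bool:
    """True if the decoded object data carries an OLE2 signature somewhere."""
    return OLE2_MAGIC in blob
```

For example, `[contains_ole2(blob) for blob in embedded_objects(open(path, "rb").read())]` gives one flag per embedded object found by the naive pattern.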

Control Words and Groups — Control words define the way the document is presented to the user. A control word is identified by a backslash followed by the word itself. A group is a control word, possibly followed by some other text, enclosed in curly braces. In the following screenshot, control words are highlighted in red, the braces of the groups are highlighted in yellow, and the values are highlighted in green.
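To make control words and groups more concrete, here is a small sketch that pulls the control words (with their optional numeric values) out of an RTF snippet and measures how deeply the groups are nested. It is a simplification under stated assumptions: it ignores escaped backslashes and braces as well as binary data, and the function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# A backslash, then letters, then an optional (possibly negative) number.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)(-?\d+)?")


def control_words(rtf: str) -> List[Tuple[str, Optional[int]]]:
    """Return (word, numeric value or None) for each control word found."""
    return [
        (match.group(1), int(match.group(2)) if match.group(2) else None)
        for match in CONTROL_WORD.finditer(rtf)
    ]


def max_group_depth(rtf: str) -> int:
    """Count how deeply the curly-brace groups are nested."""
    depth = deepest = 0
    for char in rtf:
        if char == "{":
            depth += 1
            deepest = max(deepest, depth)
        elif char == "}":
            depth -= 1
    return deepest
```

On a snippet such as `{\rtf1\ansi{\fonttbl{\f0 Arial;}}\f0\fs24 Hello}`, the first function lists rtf, ansi, fonttbl, f, f, and fs together with their values, while the second reports three levels of nesting.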

In the next story, we’ll touch on some anti-analysis techniques that threat actors leverage in order to slow down junior analysts and hopefully bypass some detections.