Friday, February 21, 2014

Do we need the Kernel Debugging Block?

I have written a blog article in the past describing the Kernel Debugging Block (KDBG) in detail http://scudette.blogspot.ch/2012/11/finding-kernel-debugger-block.html as it is used by Volatility in order to "bootstrap" the analysis process. Many plugins require a list of processes, and Volatility uses the KDBG in order to locate the PsActiveProcessHead symbol (which is the head of the doubly linked list holding the _EPROCESS objects together).
Recently, the Volatility blog reminded us that the KDBG is critical for memory analysis. In that post, the author recognizes that the KDBG block is encoded on Windows 8 and cannot readily be found using the usual kdbgscan plugin. In particular that blog post states:
An encoded KDBG can have a hugely negative effect on your ability to perform memory forensics. This structure contains a lot of critical details about the system, including the pointers to the start of the lists of active processes and loaded kernel modules, the address of the PspCid handle table, the ranges for the paged and non-paged pools, etc. If all of these fields are encoded, your day becomes that much more difficult.
We have previously demonstrated in our OSDFC training workshop that the KDBG block can be trivially overwritten without affecting system stability. Since the kdbgscan plugin simply scans for the plain text "KDBG" signature, overwriting this signature makes it impossible to locate the KDBG, and therefore to bootstrap memory analysis. Indeed with Volatility you are going to have a really bad day. It is still possible to work around this limitation, and our workshop describes all the workarounds available, but it is definitely not ideal.
This problem was also discussed in the Black Hat talk One-byte Modification for Breaking Memory Forensic Analysis.

Does Rekall use the KDBG?

Volatility windows profiles are typically generated using the pdbparse project, using the  pdb_tpi_vtypes.py script. They normally only contain the vtype definitions (embedded into python files, for example vista_sp0_x64_vtypes.py).
While developing the Rekall profile system (which is described in detail in previous blog posts), new profiles were generated for windows kernels. Rather than rely on the pdbparse project to parse the pdb files, we have implemented a complete Microsoft PDB parser within the Rekall framework (This will be described in a future blog post).
Microsoft PDB files contain a number of streams. One of the streams describes struct definitions and can be used to generate the vtypes. However, interestingly, there are a few more streams which contain the global symbols of the binary. (The pdbparse project does provide an additional script to extract the constants from the pdb file, but that script is not currently used by Volatility).
In other words, the PDB file contains the addresses in memory of many symbols. This is akin to the System.map file we use when analyzing a Linux memory image. Let's examine a typical Rekall windows profile:
{
 "$CONSTANTS": {
.....
  "PromoteNode": 611168,
  "PropertyEval": 451884,
  "PsAcquireProcessExitSynchronization": 1157620,
  "PsActiveProcessHead": 96160,
  "PsAssignImpersonationToken": 1479504,
  "PsBoostThreadIo": 219912,
....
  "KdD3Transition": 805316,
  "KdDebuggerDataBlock": 2003056,
  "KdDebuggerEnabled": 2562992,
  "KdDebuggerInitialize0": 805256,
  "KdDebuggerInitialize1": 805244,
...
We can see that the typical Microsoft kernel PDB file contains a huge number of symbols which are not exported in the PE export table. In particular we see the symbol PsActiveProcessHead which is required to list processes. We also see the exact location of the Kernel Debugger block in KdDebuggerDataBlock symbol (Just in case we need it). The symbol offset is specified relative to the Kernel Base address (i.e. the MZ header where the kernel is mapped into memory).
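As a rough sketch of what "relative to the Kernel Base address" means in practice, the snippet below turns a profile offset into a virtual address. The resolve_symbol helper is hypothetical (it is not a Rekall API), the two offsets are copied from the profile excerpt above, and the kernel base is only an example value for a 64 bit image.

```python
# Offsets copied from the "$CONSTANTS" excerpt above (relative to the kernel base).
constants = {
    "PsActiveProcessHead": 96160,
    "KdDebuggerDataBlock": 2003056,
}


def resolve_symbol(kernel_base, name):
    """Hypothetical helper: absolute virtual address of a profile constant."""
    return kernel_base + constants[name]


# Example kernel base only; the real value is detected from the image.
KERNEL_BASE = 0xF8000261F000

print(hex(resolve_symbol(KERNEL_BASE, "PsActiveProcessHead")))
```

The point of the sketch is that no scanning is involved: one addition per symbol is all that is required once the kernel base is known.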
Let us examine in detail the steps that Rekall goes through in the pslist module by enabling verbose logging:
$ rekall --verbose -f  ~/images/win7.elf pslist
.....
INFO:root:Autodetected physical address space Elf64CoreDump                     1
DEBUG:root:Opened url http://profiles.rekall.googlecode.com/git//pe.gz
INFO:root:Loaded profile pe from URL:http://profiles.rekall.googlecode.com/git/ 2
DEBUG:root:Verifying profile GUID/F8E2A8B5C9B74BF4A6E4A48F180099942             3
DEBUG:root:Opened url http://profiles.rekall.googlecode.com/git//GUID/F8E2A8B5C9B74BF4A6E4A48F180099942.gz
DEBUG:root:Opened url http://profiles.rekall.googlecode.com/git//ntoskrnl.exe/AMD64/6.1.7600.16385/F8E2A8B5C9B74BF4A6E4A48F180099942.gz
INFO:root:Loaded profile ntoskrnl.exe/AMD64/6.1.7600.16385/F8E2A8B5C9B74BF4A6E4A48F180099942 from URL:http://profiles.rekall.googlecode.com/git/
INFO:root:Loaded profile GUID/F8E2A8B5C9B74BF4A6E4A48F180099942 from URL:http://profiles.rekall.googlecode.com/git/
DEBUG:root:Found _EPROCESS @ 0x2818140 (DTB: 0x187000)                          4
INFO:root:Detected ntkrnlmp.pdb with GUID F8E2A8B5C9B74BF4A6E4A48F180099942
  Offset (V)   Name                    PID   PPID   Thds     Hnds   Sess  Wow64 Start                    Exit
-------------- -------------------- ------ ------ ------ -------- ------ ------ ------------------------ ------------------------
INFO:root:Detected kernel base at 0xF8000261F000                                5
0xfa80008959e0 System                    4      0     84      511 ------  False 2012-10-01 21:39:51+0000 -
0xfa8001994310 smss.exe                272      4      2       29 ------  False 2012-10-01 21:39:51+0000 -
0xfa8002259060 csrss.exe               348    340      9      436      0  False 2012-10-01 21:39:57+0000 -
1. Rekall auto-detects this image as an ELF core dump (the Elf64CoreDump address space).
2. Rekall now contacts the profile repository to retrieve the parser for the PE file format.
3. The PE profile is used to scan for RSDS signatures. These are verified so we can be pretty confident that we loaded the exact profile for this image.
4. The Kernel DTB is located by scanning for the Idle process.
5. We now find the kernel’s base address. Once that is known, the addresses of all symbols in the kernel’s virtual address space are known directly from the profile, i.e. we do not need to scan for anything, we already know where everything is.
Rekall generally does not need to use the KDBG at all. This is much faster since it does not need to scan for it but, more importantly, it is much more robust because malware cannot overwrite the PsActiveProcessHead symbol without crashing the system.
Since Rekall uses a profile repository we are able to locate the exact profile for the kernel we are analyzing. Therefore we do not need to scan for anything - we always prefer to just read the exact addresses from the profile without guessing. This makes analysis far more robust and simple.
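Once the address of the list head is known, listing processes boils down to following forward links until we arrive back at the head. The snippet below is a minimal sketch of that walk over a toy byte buffer. The read_pointer and walk_list helpers are hypothetical and this is not how Rekall implements pslist; in a real image each entry would sit inside an _EPROCESS, so the offset of the list field would still have to be subtracted to reach the start of the object.

```python
import struct


def read_pointer(memory, address):
    """Read a little-endian 64 bit pointer from a flat buffer."""
    return struct.unpack_from("<Q", memory, address)[0]


def walk_list(memory, head):
    """Yield the address of every entry linked from `head` via Flink.

    The first 8 bytes of each entry hold the forward link (Flink). The walk
    stops when it returns to the head or revisits an entry.
    """
    seen = {head}
    entry = read_pointer(memory, head)
    while entry not in seen:
        seen.add(entry)
        yield entry
        entry = read_pointer(memory, entry)


# Toy "memory": a list head at 0x00 and two entries at 0x10 and 0x20.
memory = bytearray(0x30)
struct.pack_into("<Q", memory, 0x00, 0x10)  # head.Flink -> first entry
struct.pack_into("<Q", memory, 0x10, 0x20)  # first.Flink -> second entry
struct.pack_into("<Q", memory, 0x20, 0x00)  # second.Flink -> back to head

entries = list(walk_list(memory, 0x00))
print([hex(entry) for entry in entries])
```

Tracking the visited entries keeps the walk from looping forever if the list in the image is corrupted.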

Another example, the callbacks plugin.

Another example of this technique is the callbacks plugin. Here, Volatility resorts to disassembling various exported functions to try to locate the offset of a number of non-exported callback pointer tables (e.g. PsSetLoadImageNotifyRoutine is disassembled to get to PspLoadImageNotifyRoutine). This algorithm is pretty fragile and complex. It also only works on 32 bit systems at the moment, since signatures need to be developed for different architectures.
However, this algorithm is entirely unnecessary if one uses the correct profile for the exact kernel version. You can simply look up the exact addresses of the (non-exported) symbols you need. Here is the Rekall code:
        routines = ["_PspLoadImageNotifyRoutine",             # (1)
                    "_PspCreateThreadNotifyRoutine",
                    "_PspCreateProcessNotifyRoutine"]

        for symbol in routines:
            # The list is an array of 8 _EX_FAST_REF objects
            addrs = self.profile.get_constant_object(         # (2)
                symbol,
                target="Array",
                target_args=dict(
                    count=8,
                    target='_EX_FAST_REF')
                )

            for addr in addrs:                                # (3)
                callback = addr.dereference_as("_GENERIC_CALLBACK")
                if callback:
                    yield "GenericKernelCallback", callback.Callback, None
1. We look up each one of these symbols by name.
2. We use the profile directly to instantiate an array of 8 _EX_FAST_REF.
3. We dereference each of the addresses to find the callbacks.
There is no need to scan or disassemble anything to retrieve the symbol addresses, since we know exactly where they are already.
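The dereferencing step deserves one note. An _EX_FAST_REF packs a small reference count into the low bits of a pointer, so the raw value must be masked before it can be used as an address. The sketch below illustrates that masking with a hypothetical helper (Rekall does this for us inside dereference_as). It assumes the usual layout of 4 count bits on 64 bit systems and 3 on 32 bit systems, and the sample value is for illustration only.

```python
def fast_ref_target(value, is_64bit=True):
    """Strip the reference count bits from a raw _EX_FAST_REF value."""
    count_bits = 4 if is_64bit else 3
    mask = (1 << count_bits) - 1
    return value & ~mask


raw = 0xFFFFF8A0001310BF  # illustrative raw value with the low 4 bits set
print(hex(fast_ref_target(raw)))
```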

What else can we do with profile constants?

The amount of information provided in the kernel PDB files is truly extensive. Not only does Microsoft provide non-exported function names, but also global names, string names, import table entries and much more.
This is extremely useful when disassembling code in Rekall. Since Rekall disassembles the code which is resident in memory, all relocations, imports, exports etc have already been done by the kernel. In other words if we see a memory reference, we can resolve it to know where it is or what it is without considering imports.
Here is an example of disassembling the PsSetLoadImageNotifyRoutine routine on a 64 bit image (This is what Volatility is doing in the callbacks plugin).
$ rekall -f  ~/images/win7.elf dis 'ntoskrnl.exe!PsSetLoadImageNotifyRoutine'
   Address      Rel Op Codes             Instruction                    Comment
-------------- ---- -------------------- ------------------------------ -------
------ ntoskrnl.exe!PsSetLoadImageNotifyRoutine ------
0xf80002aa1050    0 48895c2408           MOV [RSP+0x8], RBX
0xf80002aa1055    5 57                   PUSH RDI
0xf80002aa1056    6 4883ec20             SUB RSP, 0x20
0xf80002aa105a    A 33d2                 XOR EDX, EDX
0xf80002aa105c    C e8bfb1feff           CALL 0xf80002a8c220            ntoskrnl.exe!ExAllocateCallBack
0xf80002aa1061   11 488bf8               MOV RDI, RAX
0xf80002aa1064   14 4885c0               TEST RAX, RAX
0xf80002aa1067   17 7507                 JNZ 0xf80002aa1070             ntoskrnl.exe!PsSetLoadImageNotifyRoutine + 0x20
0xf80002aa1069   19 b89a0000c0           MOV EAX, 0xffffffffc000009a
0xf80002aa106e   1E eb4a                 JMP 0xf80002aa10ba             ntoskrnl.exe!PsSetLoadImageNotifyRoutine + 0x6A
0xf80002aa1070   20 33db                 XOR EBX, EBX
0xf80002aa1072   22 488d0d27d4d9ff       LEA RCX, [RIP-0x262bd9]        0xFFFFF8A0001310BF ntoskrnl.exe!PspLoadImageNotifyRoutine
0xf80002aa1079   29 4533c0               XOR R8D, R8D
0xf80002aa107c   2C 488bd7               MOV RDX, RDI
0xf80002aa107f   2F 488d0cd9             LEA RCX, [RCX+RBX*8]
0xf80002aa1083   33 e8c817f8ff           CALL 0xf80002a22850            ntoskrnl.exe!ExCompareExchangeCallBack
0xf80002aa1088   38 84c0                 TEST AL, AL
0xf80002aa108a   3A 7511                 JNZ 0xf80002aa109d             ntoskrnl.exe!PsSetLoadImageNotifyRoutine + 0x4D
0xf80002aa108c   3C ffc3                 INC EBX
0xf80002aa108e   3E 83fb08               CMP EBX, 0x8
0xf80002aa1091   41 72df                 JB 0xf80002aa1072              ntoskrnl.exe!PsSetLoadImageNotifyRoutine + 0x22
0xf80002aa1093   43 488bcf               MOV RCX, RDI
0xf80002aa1096   46 e805e9f5ff           CALL 0xf800029ff9a0            ntoskrnl.exe!IopDeallocateApc
0xf80002aa109b   4B ebcc                 JMP 0xf80002aa1069             ntoskrnl.exe!PsSetLoadImageNotifyRoutine + 0x19
0xf80002aa109d   4D f083053bd4d9ff01     LOCK ADD DWORD [RIP-0x262bc5], 0x1 0x1 ntoskrnl.exe!PspLoadImageNotifyRoutineCount
0xf80002aa10a5   55 8b05d5d3d9ff         MOV EAX, [RIP-0x262c2b]        0x7 ntoskrnl.exe!PspNotifyEnableMask
0xf80002aa10ab   5B a801                 TEST AL, 0x1
0xf80002aa10ad   5D 7509                 JNZ 0xf80002aa10b8             ntoskrnl.exe!PsSetLoadImageNotifyRoutine + 0x68
0xf80002aa10af   5F f00fba2dc8d3d9ff00   LOCK BTS DWORD [RIP-0x262c38], 0x0 0x7 ntoskrnl.exe!PspNotifyEnableMask
0xf80002aa10b8   68 33c0                 XOR EAX, EAX
0xf80002aa10ba   6A 488b5c2430           MOV RBX, [RSP+0x30]
0xf80002aa10bf   6F 4883c420             ADD RSP, 0x20
0xf80002aa10c3   73 5f                   POP RDI
We can see that addresses are resolved according to the known symbols at that address (In the Volatility code we are actually after the PspLoadImageNotifyRoutine address).
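To make the resolution step concrete, here is a small sketch of how the target of a RIP relative operand can be computed by hand. The rip_relative_target helper is hypothetical and only handles instructions whose last 4 bytes are the displacement (no trailing immediate), such as the LEA at offset 0x22 in the listing above; it is not Rekall's disassembler code.

```python
import struct


def rip_relative_target(address, op_codes):
    """Resolve the target of an instruction whose last 4 bytes are a
    signed 32 bit displacement relative to the next instruction."""
    (displacement,) = struct.unpack("<i", op_codes[-4:])
    next_instruction = address + len(op_codes)
    return next_instruction + displacement


# LEA RCX, [RIP-0x262bd9] from the listing above.
target = rip_relative_target(0xF80002AA1072, bytes.fromhex("488d0d27d4d9ff"))
print(hex(target))
```

Subtracting the kernel base from the resulting address gives an offset that can be looked up among the profile constants, which is how a name can be attached to the operand.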

Thursday, February 20, 2014

The Rekall Profile Repository and Profile Auto-selection

The previous blog post discussed how Rekall redesigned the profile format into a simple JSON data structure. The profile JSON file now becomes the complete information source about a specific kernel version - including global constants, struct definitions (via vtype definitions) and metadata (such as architecture, version etc).
One important difference from the Volatility profiles, is that in Rekall, the profile is the actual json file itself, while in Volatility a profile name represents a specific class defined within the code base. So for example, with Rekall one can specify the profile as a path to a file (which may be compressed):
$ rekall -f OSX_image.dd --profile ./OSX_10.6.6_AMD.json.gz
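Because the profile is just (optionally compressed) JSON, it can be inspected with a few lines of Python. The following is a minimal sketch and not Rekall's loader: load_profile is a hypothetical helper, and the tiny stand-in profile, including the "arch" metadata key, is invented so that the example runs without a real profile file.

```python
import gzip
import json
import os
import tempfile


def load_profile(path):
    """Load a profile from a JSON file, transparently handling gzip."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fd:
        return json.load(fd)


# Write a tiny stand-in profile so the example runs anywhere.
toy_profile = {
    "$METADATA": {"arch": "AMD64"},  # invented metadata key
    "$CONSTANTS": {"PsActiveProcessHead": 96160},
}
path = os.path.join(tempfile.mkdtemp(), "toy_profile.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as fd:
    json.dump(toy_profile, fd)

profile = load_profile(path)
print(profile["$CONSTANTS"]["PsActiveProcessHead"])
```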

The Rekall public repository.

Since the profile file is just data, it can be hosted in a public repository. Rekall can simply download the required profile from the repository when required. This makes it much easier to distribute the code since we do not need to include vast quantities of unnecessary information embedded inside the program.
Rekall provides a public repository located at http://profiles.rekall.googlecode.com/git/ . The Rekall team collects profiles for the most common operating system versions, and we try to increase our coverage as much as possible.
For example, we can see what Rekall is doing when loading an OSX profile:
$ rekall -f memory_vm_10_7.dd.E01 --profile OSX/10.7.4_AMD -v pslist
DEBUG:root:Opened url http://profiles.rekall.googlecode.com/git//OSX/10.7.4_AMD.gz
INFO:root:Loaded profile OSX/10.7.4_AMD from URL:http://profiles.rekall.googlecode.com/git/
DEBUG:root:Voting round
DEBUG:root:Trying <class 'rekall.plugins.addrspaces.macho.MACHOCoreDump'>
DEBUG:root:Failed instantiating MACHOCoreDump: Must stack on another address space
....
Rekall contacts the default public profile repository to load the specified profile and continues running.

Alternate Repositories

Although it is very convenient to use the public repository, sometimes we cannot or do not want to. For example, if we do not have adequate Internet access on the analysis system we might not be able to use the public repository.
Since the profile repository is just a git repository, it's easy to mirror it locally. The following will create a directory called rekall.profiles containing the most up to date version of the public repository:
$ git clone https://code.google.com/p/rekall.profiles

# It is possible to update the local mirror with the latest public profiles.
$ cd rekall.profiles
rekall.profiles$ git pull

# Now we can tell Rekall to use the local repository
$ rekall -f memory_vm_10_7.dd.E01 --profile_path full_path_to_rekall.profiles \
   --profile OSX/10.7.4_AMD -v pslist
To save typing, it is possible to change the local rekall configuration to point at the profile repository by default. Simply edit the ~/.rekallrc file (a configuration file in YAML format):
profile_path:
  - /home/scudette/projects/rekall.profiles
  - http://profiles.rekall.googlecode.com/git/
The profile_path parameter specifies a list of paths to search for the specified profile in order. If we place the public repository in the second position, rekall will only attempt to contact the public repository if the required profile does not exist in the local mirror.
This is useful if you are doing a lot of analysis for unusual Linux systems (i.e. ones with uncommon or custom compiled kernels). In that case you can put your private profiles in a local directory, but still fall back to the public repository for common profiles.
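The search order can be sketched in a few lines. This is only an illustration of the "first repository that has the profile wins" idea, using local directories for both entries so that it runs offline; find_profile is a hypothetical helper, and handling URL entries would require an HTTP fetch that is left out here.

```python
import os
import tempfile


def find_profile(name, profile_path):
    """Return the first existing file for `name`, searching in order."""
    for repository in profile_path:
        candidate = os.path.join(repository, name + ".gz")
        if os.path.exists(candidate):
            return candidate
    raise ValueError("Unable to load profile %s from any repository." % name)


# Two stand-in repositories; only the second one holds the profile.
private_repo = tempfile.mkdtemp()
mirror_repo = tempfile.mkdtemp()
os.makedirs(os.path.join(mirror_repo, "OSX"))
open(os.path.join(mirror_repo, "OSX", "10.7.4_AMD.gz"), "wb").close()

found = find_profile("OSX/10.7.4_AMD", [private_repo, mirror_repo])
print(found)
```

Putting the private directory first means a custom profile shadows a public one with the same name, which is usually what we want for custom compiled kernels.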

Windows profiles.

Putting the profile data in the public repository helps to reduce code footprint. While removing the embedded Volatility profiles from the code base it became obvious that Volatility does not actually contain enough windows kernel vtypes to cover all the different windows releases out there.
As described in the previous blog post, the profile contains information specific to a single build of the windows kernel. Each time the windows kernel source code is modified and rebuilt, a new profile should be generated. In reality, Microsoft rebuilds and redistributes the windows kernel multiple times during a single marketed release, and even multiple times for different release markets. We know this because each time an executable is built, it contains a new GUID embedded in it.
Note
How does the Microsoft compiler generate debugging symbols?
When an executable is built, the compiler places the debugging symbols in a separate file (with a .pdb extension). The final executable contains a special structure called an RSDS signature (This is not the official name since this is not exactly documented, but the string "RSDS" actually appears in the executable).
The RSDS structure contains three critical pieces of information:
  • The GUID - a random number unique to each built binary.
  • The filename of the pdb file which goes with the binary.
  • An age - a small number, usually a single digit such as 2 or 3.
Microsoft typically does not ship the debugging information in order to save space on distribution media. Instead, they provide a public symbols server. One can access the debugging symbols for each built binary (if they are released of course), by simply providing the GUID, age and the filename of the pdb file.
Of course the infrastructure that Microsoft provides is there to serve the windows kernel debugger, but we can leverage this same infrastructure in Rekall. In a sense Rekall is emulating the windows debugger to some extent when analyzing a windows memory dump.
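For the curious, here is a minimal sketch of parsing such a structure. It assumes the commonly described layout (the 4 byte "RSDS" marker, a 16 byte GUID in little-endian form, a 4 byte age and a NUL terminated pdb filename); this layout and the parse_rsds helper are assumptions for illustration and are not taken from the Rekall source. Note that the 33 character identifiers shown in our examples appear to be the 32 hex digits of the GUID followed by the age.

```python
import struct
import uuid


def parse_rsds(record):
    """Parse an RSDS record into (guid_age, pdb_filename)."""
    if record[:4] != b"RSDS":
        raise ValueError("not an RSDS record")
    guid = uuid.UUID(bytes_le=bytes(record[4:20]))
    (age,) = struct.unpack_from("<I", record, 20)
    end = record.index(b"\x00", 24)
    pdb_filename = record[24:end].decode("utf-8")
    # The identifier is the GUID in upper case hex followed by the age.
    return "%s%X" % (guid.hex.upper(), age), pdb_filename


# A hand-built sample record; real ones are found inside the PE image.
sample = (b"RSDS"
          + uuid.UUID("F8E2A8B5C9B74BF4A6E4A48F18009994").bytes_le
          + struct.pack("<I", 2)
          + b"ntkrnlmp.pdb\x00")
guid_age, pdb_filename = parse_rsds(sample)
print(guid_age, pdb_filename)
```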
You can check the exact kernel version running in a memory image using Rekall’s RSDS scanner:
$ rekall -f win7.elf version_scan --name_regex krnl
  Offset (P)   GUID/Version                     PDB
-------------- -------------------------------- ------------------------------
0x0000027bb5fc F8E2A8B5C9B74BF4A6E4A48F180099942 ntkrnlmp.pdb
Here we see that this image contains a specific version with the GUID F8E2A8B5C9B74BF4A6E4A48F180099942. We actually can check the GUIDs from the binary on disk for the windows kernel.
I was curious as to how many different kernel binaries exist in the wild? I began to collect GUIDs for various versions of Windows, generate profiles for these and put them in the profile repository. I have found approximately 200 profiles of the windows kernel (ntoskrnl.exe and its variants) with different architectures (AMD64 and I386), versions and build numbers. For example Windows XP Service Pack 2 has a build number of 2600 but we found over 30 different versions in the wild.
The profile repository contains a special type of profile definition which is a Symlink. For example, we define a profile called Win7SP1x64 which contains:
{
  "$METADATA": {
    "Type": "Symlink",
    "Target": "ntoskrnl.exe/AMD64/6.1.7601.17514/3844DBB920174967BE7AA4A2C20430FA2"
  }
}
This just selects a representative profile from the many Windows 7 Service Pack 1 profiles we have. This allows Rekall to be used in backwards compatibility mode:
$ rekall -f ~/images/win7.elf -v --profile Win7SP1x64 pslist
DEBUG:root:Opened url http://profiles.rekall.googlecode.com/git//Win7SP1x64.gz
DEBUG:root:Opened url http://profiles.rekall.googlecode.com/git//ntoskrnl.exe/AMD64/6.1.7601.17514/3844DBB920174967BE7AA4A2C20430FA2.gz
INFO:root:Loaded profile ntoskrnl.exe/AMD64/6.1.7601.17514/3844DBB920174967BE7AA4A2C20430FA2 from URL:http://profiles.rekall.googlecode.com/git/
INFO:root:Loaded profile Win7SP1x64 from URL:http://profiles.rekall.googlecode.com/git/
....
We can see that first the Symlink profile is opened, followed by the real profile.
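The symlink lookup can be modelled with a plain dictionary standing in for the repository. This is a sketch of the idea rather than Rekall's implementation: resolve_profile and the max_depth guard are hypothetical, and the target profile is reduced to a stub.

```python
def resolve_profile(repository, name, max_depth=5):
    """Follow Symlink profiles until a real profile is reached."""
    for _ in range(max_depth):
        profile = repository[name]
        metadata = profile.get("$METADATA", {})
        if metadata.get("Type") != "Symlink":
            return name, profile
        name = metadata["Target"]
    raise ValueError("Too many levels of symlinks")


TARGET = "ntoskrnl.exe/AMD64/6.1.7601.17514/3844DBB920174967BE7AA4A2C20430FA2"

# A dictionary standing in for the repository; the real profile is a stub.
repository = {
    "Win7SP1x64": {"$METADATA": {"Type": "Symlink", "Target": TARGET}},
    TARGET: {"$METADATA": {"Type": "Profile"}, "$CONSTANTS": {}},
}

resolved_name, resolved_profile = resolve_profile(repository, "Win7SP1x64")
print(resolved_name)
```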

What profile do I need?

Have you ever been given an image of a windows version, but you don’t know exactly which one it is supposed to be? Is it a 64 bit system or a 32 bit system? Is it Windows 7 or Windows XP? Is it Service Pack 1 or 2?
Volatility has the imageinfo plugin which loads all the windows profiles it knows about (about 20 different ones) and tries to fit them to the image. It's very slow and often does not work.
The easier way is simply to check the RSDS signature of the windows kernel:
$ rekall -f win7.elf version_scan --name_regex krnl
  Offset (P)   GUID/Version                     PDB
-------------- -------------------------------- ------------------------------
0x0000027bb5fc F8E2A8B5C9B74BF4A6E4A48F180099942 ntkrnlmp.pdb
The Rekall public repository organizes windows profiles using two hierarchies, the first is by binary name, architecture and build version, for example:
ntoskrnl.exe/I386/5.1.2600.1151/04FB9A156FF44ECCA6EBCAE9617D8DB73.gz
However a more useful organization is by GUID (since the GUID is universally unique). If we know the GUID we can automatically access the correct profile without needing to know if it is Windows 7, WinXP or whatever:
$ rekall -f ~/images/win7.elf -v --profile GUID/F8E2A8B5C9B74BF4A6E4A48F180099942 pslist
DEBUG:root:Opened url http://profiles.rekall.googlecode.com/git//GUID/F8E2A8B5C9B74BF4A6E4A48F180099942.gz
DEBUG:root:Opened url http://profiles.rekall.googlecode.com/git//ntoskrnl.exe/AMD64/6.1.7600.16385/F8E2A8B5C9B74BF4A6E4A48F180099942.gz
INFO:root:Loaded profile ntoskrnl.exe/AMD64/6.1.7600.16385/F8E2A8B5C9B74BF4A6E4A48F180099942 from URL:http://profiles.rekall.googlecode.com/git/
INFO:root:Loaded profile GUID/F8E2A8B5C9B74BF4A6E4A48F180099942 from URL:http://profiles.rekall.googlecode.com/git/
....
This method is actually extremely reliable since it will retrieve exactly the correct profile according to the RSDS header we find. Rekall uses this method by default to guess the required profile to use. Therefore normally users do not really need to provide the profile explicitly to Rekall:
$ rekall -f ~/images/win7.elf -v pslist
DEBUG:root:Voting round
DEBUG:root:Trying <class 'rekall.plugins.addrspaces.macho.MACHOCoreDump'>
.....
INFO:root:Autodetected physical address space Elf64CoreDump
DEBUG:root:Opened url http://profiles.rekall.googlecode.com/git//pe.gz
INFO:root:Loaded profile pe from URL:http://profiles.rekall.googlecode.com/git/
DEBUG:root:Verifying profile GUID/F8E2A8B5C9B74BF4A6E4A48F180099942
DEBUG:root:Opened url http://profiles.rekall.googlecode.com/git//GUID/F8E2A8B5C9B74BF4A6E4A48F180099942.gz
DEBUG:root:Opened url http://profiles.rekall.googlecode.com/git//ntoskrnl.exe/AMD64/6.1.7600.16385/F8E2A8B5C9B74BF4A6E4A48F180099942.gz
INFO:root:Loaded profile ntoskrnl.exe/AMD64/6.1.7600.16385/F8E2A8B5C9B74BF4A6E4A48F180099942 from URL:http://profiles.rekall.googlecode.com/git/
INFO:root:Loaded profile GUID/F8E2A8B5C9B74BF4A6E4A48F180099942 from URL:http://profiles.rekall.googlecode.com/git/
DEBUG:root:Found _EPROCESS @ 0x2818140 (DTB: 0x187000)
We can see that Rekall initially fetches the pe profile (so it can parse the RSDS header). When a hit is found, the profile repository is searched by the GUID. The GUID entry is a symlink to an actual profile for a Windows 7 kernel.

What if the profile repository does not have my exact version?

As mentioned above we are still building the repository up as a public service, and it may be that we do not have the profile for the exact version in your memory image. You will typically see something like this:
$ rekall -f ~/images/win7.elf -v pslist
DEBUG:root:Opened url http://profiles.rekall.googlecode.com/git//GUID/F8E1A8B5C9B74BF4A6E4A48F180099942
DEBUG:root:Could not find profile GUID/F8E1A8B5C9B74BF4A6E4A48F180099942 in http://profiles.rekall.googlecode.com/git/
DEBUG:root:Could not find profile GUID/F8E1A8B5C9B74BF4A6E4A48F180099942 in None
Traceback (most recent call last):
  File "/home/scudette/VirtualEnvs/Dev/bin/rekall", line 9, in <module>
    load_entry_point('rekall==1.0rc3', 'console_scripts', 'rekall')()
  File "/home/scudette/rekall/rekall/rekal.py", line 145, in main
    flags = args.parse_args(argv=argv, user_session=user_session)
  File "/home/scudette/rekall/rekall/args.py", line 218, in parse_args
    LoadProfileIntoSession(parser, argv, user_session)
  File "/home/scudette/rekall/rekall/args.py", line 194, in LoadProfileIntoSession
    state.Set(arg, value)
  File "/home/scudette/rekall/rekall/session.py", line 169, in __exit__
    self.session.UpdateFromConfigObject()
  File "/home/scudette/rekall/rekall/session.py", line 210, in UpdateFromConfigObject
    self.profile = self.LoadProfile(profile_parameter)
  File "/home/scudette/rekall/rekall/session.py", line 464, in LoadProfile
    filename)
ValueError: Unable to load profile GUID/F8E1A8B5C9B74BF4A6E4A48F180099942 from any repository.
Although you could maybe substitute a generic profile (like Win7SP1x64 as described above), this is really not recommended and will probably stop working at some point in the future (as Rekall uses more advanced analysis methods which depend on accurate profiles).
The correct solution is to generate your own profile like this:
# First find the GUID of the kernel in your image
$ rekall -f win7.elf version_scan --name_regex krnl
  Offset (P)   GUID/Version                     PDB
-------------- -------------------------------- ------------------------------
0x0000027bb5fc F8E2A8B5C9B74BF4A6E4A48F180099942 ntkrnlmp.pdb

# Then fetch the pdb file for that GUID from Microsoft's symbol server.
$ rekall fetch_pdb -D. --guid F8E2A8B5C9B74BF4A6E4A48F180099942 --filename ntkrnlmp.pdb
Trying to fetch http://msdl.microsoft.com/download/symbols/ntkrnlmp.pdb/F8E2A8B5C9B74BF4A6E4A48F180099942/ntkrnlmp.pd_
Received 2675077 bytes
Extracting cabinet: ntkrnlmp.pd_
  extracting ntkrnlmp.pdb

All done, no errors.

# Now Generate the profile from the pdb file. You will need to provide the
# approximate windows version.
$ rekall parse_pdb -f ntkrnlmp.pdb --output F8E2A8B5C9B74BF4A6E4A48F180099942.json --win 6.1
  Extracting OMAP Information 62%
Please send us that GUID so we can add it to our repository. If you have a local repository you can just add it to your own repository (under the GUID/ directory).
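The URL that fetch_pdb requests follows a simple pattern which can be read off the output above: the pdb filename, then the GUID and age, then the filename again with its last character replaced by an underscore (the compressed cabinet file). The helper below is a hypothetical sketch that merely reproduces that pattern; it does not download anything.

```python
def symbol_server_url(guid_age, pdb_filename,
                      server="http://msdl.microsoft.com/download/symbols"):
    """Build the download URL for a compressed pdb file."""
    compressed_name = pdb_filename[:-1] + "_"
    return "/".join([server, pdb_filename, guid_age, compressed_name])


url = symbol_server_url("F8E2A8B5C9B74BF4A6E4A48F180099942", "ntkrnlmp.pdb")
print(url)
```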

Summary

  • Rekall has moved the profiles out of the code base
  • Profiles are now stored in their own unique repository.
  • Profiles are now much more accurate since they are exactly tailored to the specific version of the kernel in the memory image, rather than guessing approximate representative profiles by commercial release names (e.g. Win7).
  • Rekall also implements a robust profile auto-detection method. The user rarely needs to explicitly provide the profile on the command line, and detection is extremely fast and reliable.

Wednesday, February 19, 2014

Rekall - We can remember it for you wholesale.

Rekall started life as a special branch in the Volatility project to explore new approaches to memory analysis. Over time this branch became known as the "scudette" branch after the Volatility core developer who performed this work (scudette@gmail.com). For various reasons (which you can read more about in the History section of the README.txt file) this branch evolved into a new project called "Rekall".
Although the Volatility project implements some excellent algorithms, we wanted to improve on the Volatility code by focusing on some areas which we felt were very important. The goals and priorities of the Rekall project are slightly different from the Volatility project’s:
  • Focus on coding style, readability and maintainability.
  • Create modular code which can be used as a library within other tools.
  • Focus on performance. Because we wanted to deploy Rekall to perform Live analysis with GRR we need it to be efficient and robust. It turns out that making it faster also makes the tool more accurate (We will discuss it in another post).
  • Develop and research more accurate, advanced memory analysis techniques.
  • Document all algorithms heavily. The volatility code base lacks much documentation on how algorithms are implemented. There are many "magic" numbers derived by reversing some unknown functions. These are hard to replicate and explain.
We will use this blog as a medium to discuss some of the improvements and research we did in the Rekall project, and the improvements over the Volatility code base. If you have suggestions or contributions, please either add a comment to the page below or send us a mail at rekall-dev@googlegroups.com.
Although we often compare the Rekall implementation to the one in the Volatility project, we do not mean to suggest that the Volatility approach is inferior, simply that the two projects focus on different aspects of memory analysis. For example, Volatility hasn't focused much on performance, but it is one of our main current focuses, so we put more effort into optimizing the code for speed.

Currently we are focusing our efforts on the above areas and the tool is not yet officially released as a stable tool. Although it is generally stable, we reserve the right to modify APIs heavily before the final release.
We encourage people to try out the Rekall trunk and send bug reports or open issues with the google code site:
Or the mailing list:

Quick start

Rekall is available as a python package installable via the pip package manager. Simply type (for example on Linux):
sudo pip install rekall
You might need to specifically allow pre-release software to be included (until Rekall makes a major stable release):
sudo pip install --pre rekall
This installs Rekall together with all of its dependencies. You still need to have python and pip installed first.
To be able to run the ipython notebook, the following are also required:
pip install Jinja2 MarkupSafe Pygments astroid pyzmq tornado wsgiref
For windows, Rekall is also available as a self contained installer package. Please check the download page for the most appropriate installer to use (http://downloads.rekall.googlecode.com/git/index.html)

Development version

For development it is easier to install rekall inside a virtual env. Virtual Env is a way of containing and running multiple versions of python packages at the same time, without interfering with the host system.
# You might need to install virtualenv:
$ sudo apt-get install python-virtualenv

# This will build a new empty python environment.
$ virtualenv /tmp/Test

# Now we switch to the environment - all python code runs from here.
$ source /tmp/Test/bin/activate

# This will install all dependencies into the virtual environment.
$ pip install --pre rekall

# For development run the devel version - this will symlink your virtual
# environment with the source tree so when you make changes to the source they
# appear immediately in the code (without needing to install them to the
# environment first).

$ git clone https://code.google.com/p/rekall/
$ cd rekall
$ python setup.py develop

Rekall Profiles


Rekall started life as a fork from the Volatility project. Volatility uses profiles to control the parsing of memory. For example, in Volatility one must specify the profile before analysis begins:
$ vol.py -f myimage.dd --profile Win7SP1x86 pslist

What is a profile?

So what is this profile? A profile provides the application with specialized support for precisely the operating system version which is running inside the image. Why do we need specialized support? In order to make sense of the memory image.
Let's take a step back and examine how memory is used by a running computer. The physical memory itself is simply a series of zeros and ones, without any semantic context at all. The processor is free to read/write from arbitrary locations (sans alignment restrictions). However, computer programs need to organize this memory so they can store meaningful data. For example, in the C programming language one can define a struct which specifies how variables are laid out in memory (For all the details see this workshop):
typedef unsigned char uchar;

/* Named enum, so that "enum options" below refers to a declared type. */
enum options {
  OPT1,
  OPT2
};

struct foobar {
    enum options flags;
    short int bar;
    uchar *foo;
};
Using this information, the compiler can devise a layout of how to store each variable in memory. Since Rekall only receives the memory as a contiguous block of ones and zeros, we need to know where each parameter is laid out in memory.
This problem is actually shared with debuggers. A debugger also needs to retrieve the struct members so it can display them to the user. It turns out that to make debugging easier, compilers generate exact layout information for every data type they have. This way the debugger can see where in memory (relative to the start of the struct) each member is located.
Rekall (and Volatility) use this debugging information to know how to extract each struct member from memory. We construct a Python data structure which specifies exactly how to extract each field by parsing the debugging symbols.
For example, the above struct foobar might be described by:
vtypes = {
  'foobar': [12, {
     'flags': [0, ['Enumeration', dict(
        target="unsigned int",
        choices={
          0: "OPT1",
          1: "OPT2",
          },
        )]],
     'bar': [4, ['short int']],
     'foo': [8, ['Pointer', dict(target="unsigned char")]],
  }],
}
Note that:
  • The description is purely data. It consists of field names, offsets and type names.
  • The precise offset of each field is provided explicitly. This is different from many other parsing libraries (e.g. Construct) which require all fields to be specified (or padding fields to be inserted). This special feature allows us to:
      • Write sparse struct definitions - i.e. definitions where not all the fields are known.
      • Alias fields (e.g. implement a union) where different types are all located at the same memory address.
A profile is actually a collection of such vtype definitions (among other things) which provides the rest of the code with the specific memory layout of the struct members. You can think of it as a template which is overlaid on top of the memory to select the individual field members.
Typically to analyze an operating system, the profile is generated from debugging symbols for the kernel binary itself.
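To make the "template" idea concrete, here is a minimal sketch of how a vtype description could be used to pull fields out of raw bytes. This is not Rekall's actual object model: the read_field helper, the PRIMITIVES table and the fake memory buffer are all hypothetical, and the sketch assumes a 32-bit little-endian target (which is what the 12 byte struct size implies). The enumeration values follow C's default numbering from zero.

```python
import struct

# Hypothetical mapping from vtype primitive names to struct format codes.
# Assumption: 32-bit little-endian target.
PRIMITIVES = {
    "unsigned int": "<I",
    "short int": "<h",
    "unsigned char": "<B",
}
POINTER_FORMAT = "<I"  # Assumption: 4 byte pointers.

vtypes = {
    "foobar": [12, {
        "flags": [0, ["Enumeration", dict(
            target="unsigned int",
            choices={0: "OPT1", 1: "OPT2"},
        )]],
        "bar": [4, ["short int"]],
        "foo": [8, ["Pointer", dict(target="unsigned char")]],
    }],
}


def read_field(memory, base, struct_name, field_name):
    """Decode one field of the struct that starts at `base` in `memory`."""
    _size, fields = vtypes[struct_name]
    offset, type_desc = fields[field_name]
    type_name = type_desc[0]
    args = type_desc[1] if len(type_desc) > 1 else {}
    address = base + offset

    if type_name == "Enumeration":
        (value,) = struct.unpack_from(PRIMITIVES[args["target"]], memory, address)
        # Fall back to the raw number if the value has no symbolic name.
        return args["choices"].get(value, value)
    if type_name == "Pointer":
        # Only the pointer value is returned; it is not dereferenced here.
        (value,) = struct.unpack_from(POINTER_FORMAT, memory, address)
        return value
    (value,) = struct.unpack_from(PRIMITIVES[type_name], memory, address)
    return value


# A fake "memory image": 16 bytes of padding followed by one foobar instance
# with flags=1, bar=-5 and foo=0x1000 (two padding bytes follow bar).
memory = b"\x00" * 16 + struct.pack("<IhxxI", 1, -5, 0x1000)

print(read_field(memory, 16, "foobar", "flags"))
print(read_field(memory, 16, "foobar", "bar"))
print(hex(read_field(memory, 16, "foobar", "foo")))
```

Because every field carries its own offset, the helper never needs to know about the fields it is not asked for, which is what makes sparse definitions possible.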

How do we deal with versions?

As operating systems evolve over time, the source code changes in very subtle ways. For example, assume the above struct definition is altered to add an additional field:
struct foobar {
    unsigned int new_field;
    enum options flags;
    short int bar;
    uchar *foo;
}
Now to make space for the new field, all subsequent fields are pushed up by 4 bytes. This means the vtype definition we have above is wrong, since the offsets for all the fields have changed. If we tried to use the old template on a memory image from the new operating system, we would mistake new_field for flags, flags for bar, and so on.
So generally a profile must match the exact version of the operating system kernel we are analyzing. Slight version mismatches might still work but not reliably (Struct definitions which have not changed between versions will continue to work, but if some of the types were slightly modified our analysis will break).
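The effect of a mismatched profile can be sketched in a few lines of Python. The offset tables and the decode helper below are hypothetical illustrations (again assuming a 32-bit little-endian layout); they are not taken from either tool.

```python
import struct

# Field offsets before and after new_field was added.
# Assumption: 32-bit little-endian layout, 4 byte pointers.
OLD_OFFSETS = {"flags": 0, "bar": 4, "foo": 8}
NEW_OFFSETS = {"new_field": 0, "flags": 4, "bar": 8, "foo": 12}

FORMATS = {"new_field": "<I", "flags": "<I", "bar": "<h", "foo": "<I"}


def decode(memory, offsets):
    """Read every field listed in `offsets` from the start of `memory`."""
    return {
        name: struct.unpack_from(FORMATS[name], memory, offset)[0]
        for name, offset in offsets.items()
    }


# One struct as laid out by the *new* kernel:
# new_field=0xAABBCCDD, flags=1, bar=7, foo=0x2000.
memory = struct.pack("<IIhxxI", 0xAABBCCDD, 1, 7, 0x2000)

correct = decode(memory, NEW_OFFSETS)
wrong = decode(memory, OLD_OFFSETS)

print("matching profile:  ", correct)
print("mismatched profile:", wrong)
```

With the old offsets, the value of new_field is reported as flags, the value of flags shows up as bar, and foo no longer looks like a pointer at all. Nothing raises an error, which is why mismatches are easy to miss.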
So how does Volatility solve this problem?
  • Volatility has many Windows profiles embedded into its source code. For example there is a profile for Windows XP Service Pack 3, one for Windows Vista Service Pack 2 etc. Also included are profiles for the different architectures (x86 and x64).
  • For OSX, one has to download the profile pack from the Volatility site. These are Zip files containing the textual output of dwarfdump (the dump of debugging symbols). When running on an OSX image, Volatility opens the zip file and parses the output of dwarfdump into an in-memory Python data structure before proceeding with the analysis. Each OSX profile is approximately 1 MB, making the entire profile pack around 50 MB.
  • For Linux there are so many versions that users must build their own profile by compiling a kernel module in debug mode and dumping the output of dwarfdump. Again the profile is a zip file containing the output of the Linux dwarfdump (which is actually slightly different from the OSX one). Again this must be parsed by the program before any analysis can begin.
There are a number of problems with this approach:
  1. Windows profiles are included in the code base, which means that every Windows profile is loaded into memory on every run (even when analyzing a different version of Windows).
  2. There are about 20-30 different Windows profiles. In practice there are hundreds of released builds of the Windows kernel. So each profile included in Volatility is exact only for the precise build it was generated from. As discussed above, one needs to have the exact profile version for reliable memory analysis. Hence there is bound to be some variability between the profile version provided by Volatility and the one needed for the actual image.
  3. This is simply not scalable - there is a limit to how many profiles one can include with the code. For OSX the profiles must be downloaded separately, and for Linux they must be built. You can't really use it as a library embedded in a third-party application with such a huge memory footprint.
  4. It is also very slow. Due to the plugin model in Volatility, profiles are placed inside one of the plugin directories. When Volatility starts up it tries to load all files inside its plugin directories. This means you can't just point Volatility at your profiles directory because it will always try to open every single profile you have in there.
  5. The profile format is not consistent between operating systems. The OSX profiles are parsed using OSX-specific parsers, Linux profiles are parsed using a text-based DWARF parser, while Windows profiles must be inserted into the code manually.
  6. The profiles are very slow to parse. The DWARF parsers used for Linux and OSX profiles are actually parsing the textual output of the dwarfdump program - this is quite slow and not really needed.
Since it is important to the Rekall project to minimize memory footprint (so it can be used as a library) and also to improve performance, we had to redesign how profiles work:
  • We observed that the profile contains the vtype definitions for the specific operating system involved. The vtype definitions are just a static data structure consisting of lists, dicts, strings and numbers. This means we can store the profile in a data file, instead of embedding it as Python code.
  • In Python, textual parsing is pretty expensive, and parsing the output of dwarfdump is especially slow. We observed that profiles are written only once (when dumping the output of dwarfdump) but are read every single time the tool runs. It therefore makes sense to write the profile in a format which is optimized for loading very fast with minimal parsing. Since the vtype definition is just a data structure, JSON is one of the fastest serializations available in Python for it. (Maybe cPickle is faster, but we wanted to stay away from pickles to enable the safe interchange of profiles.)
  • Finally we observed that for Linux and OSX (and actually for Windows too, as explained in a future blog post), the zip file contains a number of different types of data. The zip file contains the vtype description of all the structs used in the kernel, but it also contains the offsets of global symbols (e.g. the kernel system map). To analyze an image we need both types of data for the specific kernel version.
In Rekall, the profile is a simple data structure (using strings, dicts, lists and numbers) which represents a specific version of the kernel. Rather than separate the different types of information (e.g. vtypes and constants) into different members of a zip file, we combine them all into a single dict. Here is an excerpt from the profile for an Ubuntu Linux 3.8.0-27 kernel:
{
 "$CONSTANTS": {
  ".brk.dmi_alloc": 18446744071598981120,
  ".brk.m2p_overrides": 18446744071598964736,
  ".brk.p2m_identity": 18446744071594827776,
  ".brk.p2m_mid": 18446744071594831872,
  ".brk.p2m_mid_identity": 18446744071598927872,
  ".brk.p2m_mid_mfn": 18446744071596879872,
  ".brk.p2m_mid_missing": 18446744071594807296,
  ".brk.p2m_mid_missing_mfn": 18446744071594811392,
  ".brk.p2m_missing": 18446744071594803200,
  ".brk.p2m_populated": 18446744071598952448,
....

 "$METADATA": {
  "ProfileClass": "Linux64",
  "Type": "Profile"
  },
 "$STRUCTS": {
  "__raw_tickets": [4, {
   "head": [0, ["short unsigned int"]],
   "tail": [2, ["short unsigned int"]]
   }],
....
We can see that the top level object is a dict, with keys like "$CONSTANTS", "$METADATA", "$STRUCTS". These are called profile sections. For example, the most common sections are:
$CONSTANTS: A dict of constants and their offsets in memory.
$STRUCTS: The vtype description of all structs in this kernel version.
$METADATA: This describes the kernel. It contains the name of the Python class that implements this profile, the kernel's build version, architecture etc.
The whole data structure is serialized using JSON into a file and is loaded at once using Python's json.load() function (this function is actually implemented in C and is extremely fast).
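As a rough sketch of what loading looks like, the snippet below parses a tiny profile with json.load() and looks up values by section. The profile text is a cut-down version of the excerpt above, wrapped in an in-memory file object for illustration; the lookups are plain dict accesses and are not Rekall's profile API.

```python
import io
import json

# A cut-down profile using the same sections as the excerpt above.
profile_text = """
{
 "$CONSTANTS": {
  ".brk.dmi_alloc": 18446744071598981120
 },
 "$METADATA": {
  "ProfileClass": "Linux64",
  "Type": "Profile"
 },
 "$STRUCTS": {
  "__raw_tickets": [4, {
   "head": [0, ["short unsigned int"]],
   "tail": [2, ["short unsigned int"]]
  }]
 }
}
"""

# json.load() reads from any file-like object; a real profile would be
# opened from disk instead of from this in-memory buffer.
profile = json.load(io.StringIO(profile_text))

# After loading, each section is an ordinary dict.
profile_class = profile["$METADATA"]["ProfileClass"]
symbol_address = profile["$CONSTANTS"][".brk.dmi_alloc"]
struct_size, fields = profile["$STRUCTS"]["__raw_tickets"]
tail_offset = fields["tail"][0]

print(profile_class, hex(symbol_address), struct_size, tail_offset)
```

Note that the large 64 bit kernel addresses survive the round trip because Python integers have arbitrary precision.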
An interesting optimization follows from the observation that if dictionaries are sorted in the JSON file, then gzip will work much more effectively (since the data will naturally contain a lot of repeated common prefixes - especially with the very large system map). This makes the JSON files much smaller on disk than the Volatility profiles. For example, the Volatility profile for OSX Lion_10.7_AMD.zip is about 1.2 MB while the Rekall profile for the same version is 336 KB. Both profiles contain the same information and are both compressed.
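The sorting trick is easy to try out. The sketch below writes the same synthetic symbol table twice, once with keys in insertion order and once sorted, and compares the gzip sizes. The symbol names, addresses and the dump_profile/load_profile helpers are made up for illustration; the actual savings depend on the data, so no particular ratio is claimed here.

```python
import gzip
import json
import random

# Synthetic "system map": names with shared prefixes, in shuffled order.
rng = random.Random(0)
names = ["subsystem_%d_symbol_%d" % (i, j) for i in range(20) for j in range(50)]
rng.shuffle(names)
constants = {name: 0xFFFFFFFF80000000 + rng.randrange(1 << 24) for name in names}
profile = {"$CONSTANTS": constants, "$METADATA": {"Type": "Profile"}}


def dump_profile(data, sort_keys=True):
    """Serialize to JSON (optionally with sorted keys) and gzip the result."""
    text = json.dumps(data, sort_keys=sort_keys)
    return gzip.compress(text.encode("utf-8"))


def load_profile(blob):
    """Inverse of dump_profile()."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


unsorted_size = len(dump_profile(profile, sort_keys=False))
sorted_size = len(dump_profile(profile, sort_keys=True))

print("unsorted keys: %d bytes compressed" % unsorted_size)
print("sorted keys:   %d bytes compressed" % sorted_size)
```

Sorting places keys with common prefixes next to each other, which gives the compressor longer nearby matches to work with. The loaded data is identical either way, since dict lookups do not depend on key order.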
The Rekall profile format is standard across all supported operating systems. Even though generating the profiles uses different mechanisms for different operating systems (i.e. parsing PDB files for Windows, parsing DWARF files for Linux, parsing debug kernels for OSX), the final output format is exactly the same. This makes the profile loading code in Rekall much simpler.
It is possible to convert existing Volatility profiles into the Rekall format by using the convert_profile plugin (This might be useful when migrating old profiles from Volatility to Rekall):
$ rekall convert_profile ./profiles/Volatility/SnowLeopard_10.6.6_AMD.zip ./OSX_10.6.6_AMD.json
$ rekall -f OSX_image.dd --profile ./OSX_10.6.6_AMD.json
In a future post we will discuss how Rekall profiles are organized into a public profile repository.