Analysing Unknown Binaries


The "right" way to analyze an unknown binary is to reverse engineer (RE) its code. RE is, however, time-consuming and convoluted, so it may not be "cost-effective" to RE every binary that needs to be analyzed. We performed RE on selected code (e.g. login from bigwar.tgz) and code fragments (libproc.so.2.0.6 from bigwar.tgz). We analyzed most of the other binaries using other methods, which include
  1. ELF file format analysis,
  2. strings analysis, and
  3. runtime analysis.
We explain each of these techniques in the following sections.

ELF File Format Analysis
The main purpose of ELF file format analysis is to detect the presence of parasite code, which is often a virus. To parse the ELF file, we used the utility readelf (part of the GNU binutils package). readelf displays information about an ELF object file, including the file header, program headers, section headers, symbol tables, dynamic information, relocation information, notes, and version information. One method that parasite code uses to reside in an ELF executable is the technique known as segment padding: the code segment is extended by another page, and the parasite code resides in this padded region. To have the parasite code executed, the entry point address (part of the file header information) is modified to point into the padded region. With most compilers, however, the entry point of an executable is the start of the .text section (which is usually near the start of the code segment). Thus, by inspecting the program entry point address with respect to the code segment and the .text section, parasite code injected using the segment padding technique can be discovered. In the binaries provided, we detected two different types of parasite code, the Linux.OSF.8759 virus and the Linux/Rst-A virus. These viruses infected chattr (and a list of other binaries) and tools/sniffer/kde from hax.tgz respectively. Anti-virus software is also helpful in identifying virus signatures.
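
The entry point check described above can be sketched in Python as follows. This is only an illustration of the idea, not the procedure used in this analysis (which relied on readelf). It assumes a 32-bit little-endian ELF image whose section header table is intact, and the function names are hypothetical.

```python
import struct

def elf32_entry_and_text(data):
    """Return (e_entry, address of .text or None) for a 32-bit little-endian ELF image."""
    if data[:4] != b"\x7fELF" or data[4] != 1 or data[5] != 1:
        raise ValueError("this sketch handles 32-bit little-endian ELF only")
    # ELF32 file header: e_entry at offset 24, e_phoff at 28, e_shoff at 32.
    e_entry, _e_phoff, e_shoff = struct.unpack_from("<III", data, 24)
    # e_shentsize, e_shnum and e_shstrndx are the last three halfwords (offset 46).
    e_shentsize, e_shnum, e_shstrndx = struct.unpack_from("<HHH", data, 46)
    if e_shoff == 0 or e_shnum == 0:
        return e_entry, None  # section headers stripped; nothing to compare against

    def section_header(index):
        # An ELF32 section header is ten 32-bit words:
        # sh_name, sh_type, sh_flags, sh_addr, sh_offset, sh_size, ...
        return struct.unpack_from("<10I", data, e_shoff + index * e_shentsize)

    names_offset = section_header(e_shstrndx)[4]  # sh_offset of the section name table
    for index in range(e_shnum):
        header = section_header(index)
        start = names_offset + header[0]
        end = data.index(b"\x00", start)
        if data[start:end] == b".text":
            return e_entry, header[3]  # sh_addr of .text
    return e_entry, None

def entry_point_is_unusual(data):
    """True when the entry point is not the start of .text (or .text is missing)."""
    entry, text_addr = elf32_entry_and_text(data)
    return text_addr is None or entry != text_addr
```

A file can be checked with entry_point_is_unusual(open(path, "rb").read()). A True result is only a prompt for closer inspection of where the entry point falls within the code segment; it is not proof of infection.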

Another use of ELF file format analysis is to study the dependencies between an executable and its shared libraries. This information is captured in the dynamic section of the executable. The trojanised function of an executable does not necessarily reside in the executable itself; it can reside in a shared library that the executable is linked against. The hack_procinit function of libproc.so.2.0.6 is an example. Both ps and top link against the libproc.so.2.0.6 shared library.
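
As a rough sketch of how the shared library dependencies can be collected, the Python below picks the NEEDED entries out of the text printed by readelf -d. It assumes readelf from GNU binutils is on the PATH and prints NEEDED entries in its usual "Shared library: [name]" form; the function names are hypothetical.

```python
import re
import subprocess

# Matches lines such as:
#  0x00000001 (NEEDED)                     Shared library: [libc.so.6]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library:\s+\[([^\]]+)\]")

def needed_libraries(readelf_output):
    """Return the library names listed as NEEDED in `readelf -d` output."""
    return NEEDED_RE.findall(readelf_output)

def needed_libraries_of(path):
    """Run `readelf -d` on a file and return its NEEDED libraries."""
    result = subprocess.run(
        ["readelf", "-d", path], capture_output=True, text=True, check=True
    )
    return needed_libraries(result.stdout)
```

Any library in the resulting list that was itself delivered by the attacker deserves the same scrutiny as the executable.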

Strings Analysis
Strings analysis refers to using the strings command to extract sequences of printable characters from a file, and then picking out the "interesting" strings from the command output. This method, though primitive, is often sufficient to detect a large class of trojans in the wild. Knowing how some of the trojans operate, we often look for configuration filenames. For example, Linux Rootkit 5 (lrk5) and its derivatives use configuration files to store lists of the filenames, directory names, process names, IP addresses and port numbers to hide from command output. The filename /usr/lib/locale/ro_RO/uboot/etc/procrc found in curatare/.Clean/pstree (in bigwar.tgz) certainly aroused our suspicion. The filename /dev/mounnt and the string cocacola found in the trojanised login are equally suspicious.

However, strings analysis can be easily circumvented. One method is to use filenames that do not arouse the investigator's suspicion. The filename /usr/include/file.h in the trojanised ls is an example. In this case, the series of filenames /usr/include/{hosts.h, proc.h, file.h} in the various trojanised binaries only came to our attention when we looked into the shell script remove. In that script, the files .c, .d and .p, which resemble typical trojan configuration files, are moved (mv) to /usr/include/{hosts.h, proc.h, file.h} respectively. Another method is to encode the filenames used. The suspicious strings are then stored as a different sequence of byte values and may not appear in the strings command output at all. Even if they do, they are not likely to arouse suspicion. Binaries such as libproc.so.2.0.6 and dir, for example, employ this technique. In any case, the trojanised nature of these unknown binaries will still surface once runtime analysis is performed.
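
The extraction step can be approximated in a few lines of Python. This is a minimal sketch and not a replacement for the strings command: it treats only ASCII bytes 0x20 to 0x7e as printable and uses a minimum length of 4, and the helper names are hypothetical.

```python
import re

def extract_strings(data, min_len=4):
    """Return runs of at least min_len printable ASCII characters found in data."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.group().decode("ascii") for m in re.finditer(pattern, data)]

def path_like(strings):
    """Keep only strings that look like absolute paths, e.g. configuration filenames."""
    return [s for s in strings if s.startswith("/")]
```

For example, path_like(extract_strings(open(path, "rb").read())) lists candidate filenames for manual review. As noted above, an encoded filename will not survive this filter, and an innocuous-looking one will pass unnoticed.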

When runtime analysis is not available (perhaps due to the lack of a "controlled" environment), the output of the strings command can also provide pointers to the main function of an unknown binary, through the help, usage and error messages embedded in it. The binary nscd, for example, is a trojanised sshd, unlike what its name suggests. However, the author could easily circumvent such analysis with false or encoded help/usage/error messages.

Yet another use of strings analysis is to learn more about the system on which the binary was compiled. This information shows up in the form of compiler strings, such as "GCC: (GNU) egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)", which was found in numerous binaries. Some binaries, such as lsof, embed more than compiler information: the user name, system name, system time, and compiler flags were also included. When a binary is not stripped, the names of variables, source files or header files also show up in the binary. The strings "/xL/lrk5/fileutils-3.13/src/" and "../../rootkit.h", for example, showed up in du and ls. However, no respectable attacker should leave unstripped binaries behind.
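
Compiler strings of this kind are easy to collect across many files. The sketch below assumes the banner has the "GCC: (vendor) version" shape of the example quoted above; the function name is hypothetical.

```python
import re

# "GCC: (" vendor ") " followed by the rest of the printable run.
GCC_BANNER = re.compile(rb"GCC: \([^)]*\) [\x20-\x7e]+")

def compiler_banners(data):
    """Return the distinct GCC banner strings found in data, sorted."""
    return sorted({m.group().decode("ascii") for m in GCC_BANNER.finditer(data)})
```

Grouping binaries by the banners they contain can hint at which of them were built on the same system.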

Runtime Analysis
Runtime analysis should only be performed in a controlled environment, i.e. one where damage done by malicious code or a spreading worm can be contained. We used Redhat 6.2 running within VMware for this purpose. VMware was configured to use an undoable disk so that any damage done by the malicious code could be undone easily. In addition, the networking mode was set to host-only networking, and we took the precaution of turning off the other network interfaces of the host machine to prevent the spread of a worm, if any. We chose Redhat 6.2 as the guest OS out of convenience. Ideally, the guest OS should resemble the honeypot, probably a Redhat 7.2 system, as closely as possible.

For executables that did not yield any useful results with strings analysis, we used strace to study their behaviour. strace intercepts and records the system calls made by a process and the signals it receives. The name of each system call, its arguments and its return value are printed on standard error or to the file specified with the -o option. By comparing the system calls made by a known executable and a trojanised executable of the same kind, e.g. a legitimate login program and a supposedly trojanised login, anomalies can be easily detected. This, of course, was done on the guest OS. With strace, we were able to discover configuration files referenced by trojanised programs that did not show up with static strings analysis. The programs md5sum and netstat are examples. However, an attacker can circumvent the investigator's attempt to strace a program by using anti-debugging techniques. In fact, this method is employed by the parasite code in various binaries found in hax.tgz and hax-small.tgz.
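
The comparison of two strace logs can be partly automated. The sketch below assumes each log was saved with strace -o and that file accesses appear as open("path", ...) or openat(..., "path", ...) lines. The function names and the path /hypothetical/hidden.conf in the example are made up for illustration and do not come from the analysed binaries.

```python
import re

# Captures the quoted path argument of open() and openat() calls.
OPEN_CALL = re.compile(r'\bopen(?:at)?\((?:[^,"]*,\s*)?"((?:[^"\\]|\\.)*)"')

def opened_paths(strace_log):
    """Return the set of paths passed to open/openat in an strace log."""
    return set(OPEN_CALL.findall(strace_log))

def paths_only_in_suspect(reference_log, suspect_log):
    """Return paths the suspect binary tried to open that the reference did not."""
    return sorted(opened_paths(suspect_log) - opened_paths(reference_log))

reference = 'open("/etc/ld.so.cache", O_RDONLY) = 3\n'
suspect = reference + 'open("/hypothetical/hidden.conf", O_RDONLY) = -1 ENOENT\n'
print(paths_only_in_suspect(reference, suspect))
```

Failed opens are deliberately kept, since a trojan's configuration file is usually absent from the guest OS and shows up in the log only as an attempt.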

Whenever we came up with a hypothesis of how a trojan operates, we tested it by actually running the trojanised executable within the guest OS. The required configuration files were set up in their respective locations, and the command output was observed. We are aware that doing so does not expose all the functionality of the trojans (only proper RE can do so). The main purpose, instead, was to verify our hypothesis. Thus, there may well be other backdoors or trojanised behaviour that remain undiscovered by our analysis.