Saturday, April 29, 2017

ShadowBrokers Leak: A Machine Learning Approach

During the past few weeks I read a lot of great papers, blog posts and full magazine articles on the ShadowBrokers Leak (free public repositories: here and here) released by WikiLeaks Vault7.  Many of them described the amazing power of such a tools (by the way they are currently used by hackers to exploit systems without MS17-010 patch) other made a great reverse engineering adventures on some of the most used payloads and other again described what is going to happen in the next feature.

So you probably are wandering why am I writing on this discussed topic again? Well, I did not find anyone who decided to extract features on such tools in order to correlate them with notorious payloads and malware. According to my previous blog post  Malware Training Sets: A machine learning dataset for everyone I decided to "refill" my public gitHub repository with more analyses on that topic.

If you are not familiar with this leak you probably want to know that Equation Group's (attributed to NSA) built FuzzBunch software, an exploitation framework similar to Metasploit. The framework uses several remote exploits for Windows such as: EternalBlue, EternalRomance, Eternal Synergy, etc.. which calls external payloads as well, one of the most famous - as today- is DoublePulsar mainly used in SMB and RDP exploits. The system works straight forward by performing the following steps: 

  • STEP1: Eternalblue launching platform with configuration file (xml in the image) and target ip.

Eternalblue working

  • STEP2: DoublePulsar and additional payloads. Once the Eternablue successfully exploited Windows (in my case it was a Windows 7 SP1) it installs DoublePulsar which could be used as a professional PenTester would use Meterpreter/Empire/Beacon backdoors.  

DoublePulsar usage
  • STEP3: DanderSpritz. A Command and Control Manager to manage multiple implants. It could acts as a C&C Listener or it might be used to directly connect to targets as well.

DanderSpritz


Following the same process described here (and described in the following image) I generated features file for each of the aforementioned Equation Group tools. The process involved files detonation into multiple sandboxes performing both: dynamic analysis and static analysis as well. The analyses results get translated into MIST format and later saved into json files for convenience.


In order to compare previous generated results (aka notorious Malware available here) to the last leak by figuring out if Equation Group is also imputable to have built known Malware (included into the repository), you might decide to use one of the several Machine Learning frameworks available out there. WEKA (developed by University of Waikato) is a romantic Data Mining tool which implements several algorithms and compare them together in order to find the best fit to the data set. Since I am looking for the "best" algorithm to apply production Machine Learning to such dataset I decided to go with  WEKA. It implements several algorithms "ready to go" and it performs auto performance analyses in oder to figure out what algorithm is "best in my case". However WEKA needs a specific format which happens to be called ARFF (described here). I do have a JSON representation of MIST file. I've tried several time to import my MIST JSON file into WEKA but with no luck. So I decided to write a quick and dirty conversion tool really *not performant* and really *not usable in production environment* which converts MIST (JSONized) format into ARFF format. The following script does this job assuming the MIST JSONized content loaded into a mongodb server. NOTE: the script is ugly and written just to make it works, no input controls, no variable controls, a very quick naive and trivial o(m*n^2) loop is implemented. 

From MIST to ARFF

The resulting file MK.arff is a 1.2GB of pure text ready to be analyzed through WEKA or any other Machine Learning tools using the standard ARFF file format. The script is going available here. I am not going to comment nor to describe the results sets, since I wont to reach "governative dangerous conclusions" in my public blog. If you read that blog post to here you should have all the processes, the right data and the desired tools to be able to perform analyses by your own. Following some short inconclusive results with no associated comments.

TEST 1:
Algorithm: Simple K-Mins
Number of clusters: 95 (We know it, since the data is classified)
Seed: 18 (just random choice)
Distance Function: EuclideanDistance, Normalized and Not inverted.

RESULTS (square errors: 5.00):

K-Mins Results

TEST 2 :
Algorithm: Expectation Maximisation
Number of clusters: to be discovered
Seed: 0

RESULTS (few significant clusters detected):

Extracted Classes
TEST 3 :
Algorithm: CobWeb
Number of clusters: to be discovered
Seed: 42

RESULTS: (again few significative cluster were found)

Few descriptive clusters (click to enlarge)

As today many analysts did a great job in study ShadowBrokers leak, but few of them (actually none so far, at least in my knowledge ) tried to cluster result sets derived by dynamic execution of ShadowBrokers leaks. In this post I tried to follow my previous path enlarging my public dataset by offering to security researcher data, procedures and tools to make their own analyses.