Positive Research Center


Recovering Huffman tables in Intel ME 11.x

December 6, 2017 - 13:38
Today in London, at the Black Hat Europe conference, Positive Technologies expert Dmitry Sklyarov will explain how Intel ME 11.x stores its state on flash and which other file systems ME 11.x supports. Here is his article on recovering Huffman tables in Intel ME 11.x.

Many Intel ME 11.x modules are stored in Flash memory in compressed form [1]. Two different compression methods are used: LZMA and Huffman encoding [2]. LZMA can be decompressed using publicly available tools [3], but reversing Huffman encoding is a much more difficult challenge. Unpacking of Huffman encoding in ME is implemented in hardware, so writing the equivalent code in software is a far from trivial task—assuming it can be accomplished at all.

Previous ME versions
By reviewing sources from the unhuffme project [4], it is easy to see that previous versions of ME had two sets of Huffman tables, with each set containing two tables. The existence of two different tables (one for code and the other for data) is probably due to the fact that code and data have very different statistical properties.

Other observed properties include:

  • The range of codeword lengths is different among the sets of tables (7–19 versus 7–15 bits inclusive).
  • Every sequence encodes a whole number of bytes (from 1 to 15 inclusive).
  • Both sets use Canonical Huffman Coding [5] (which allows quickly determining the length of a codeword during unpacking).
  • Within a given set, the lengths of encoded values for any codeword are the same in both tables (code table and data table).
Our task
We can assume that Huffman tables for ME 11.x have retained the latter three properties. So to fully recover the tables, what we need to find is:

  • Range of lengths of codewords
  • Shape (boundaries of values of codewords of the same length)
  • Lengths of encoded sequences for each codeword
  • Encoded values for each codeword in both tables

Splitting compressed data into pages
To learn about individual modules, we can start with what we already know about the internal structures of the firmware [6].

By looking at the Lookup Table, which is a part of the Code Partition Directory, we can determine which modules are affected by Huffman encoding, where their unpacked data starts, and what size the module will be once unpacked.

We can parse the Module Attributes Extension for a specific module to extract the size of the compressed/unpacked data and SHA256 value for the unpacked data.

A cursory look at several ME 11.x firmwares shows that the size of data after Huffman decoding is always a multiple of the page size (4096 == 0x1000 bytes). The start of the packed data contains an array of four-byte integer values. The number of array elements corresponds to the number of pages in the unpacked data.

For example, for a module measuring 81920 == 0x14000 bytes, the array will occupy 80 == 0x50 bytes and will consist of 20 == 0x14 elements.

The two most significant bits of each of the Little Endian values contain the table number (0b01 for code, 0b11 for data). The remaining 30 bits store the offset of the start of the compressed page relative to the end of the offset array. In the screenshot above, we see information describing 20 pages:

Offsets for the packed data for each page are arranged in ascending order; the size of the packed data for each page does not come into play directly. In the example above, the packed data for each particular page starts at a boundary that is a multiple of 64 = 0x40 bytes, with unused regions filled with zeroes. But judging by the other modules, we see that such alignment is not mandatory. This gives reason to suspect that the unpacker stops when the amount of data in the unpacked page reaches 4096 bytes.

Since we know the total size of the packed module (from the Module Attributes Extension), we can split the packed data into separate pages and work with each page individually. The start of the packed page is defined in the array of offsets; the page size is defined by the offset of the start of the next page or total size of the module. The packed data may be padded with an arbitrary number of bits (theoretically these bits could have any value, but in practice are usually zeros).
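The splitting procedure just described can be sketched in a few lines of Python. This is an illustration rather than the actual PT tooling; the function name and structure are ours, but the layout (little-endian DWORDs, two table-selector bits, 30-bit offsets relative to the end of the array) follows the description above.

```python
import struct

PAGE_SIZE = 0x1000  # each page unpacks to exactly 4096 bytes

def split_pages(packed: bytes, unpacked_size: int):
    """Split a Huffman-compressed ME module into per-page chunks.

    The module starts with one little-endian DWORD per page; the two
    most significant bits select the Huffman table (0b01 = code,
    0b11 = data) and the low 30 bits give the offset of the packed
    page, counted from the end of the offset array.
    """
    n_pages = unpacked_size // PAGE_SIZE
    entries = struct.unpack_from('<%dI' % n_pages, packed, 0)
    base = 4 * n_pages  # packed pages start right after the offset array
    pages = []
    for i, e in enumerate(entries):
        table = e >> 30                 # 0b01 -> code table, 0b11 -> data table
        start = base + (e & 0x3FFFFFFF)
        # page size = distance to the next page (or to the end of the module)
        end = base + (entries[i + 1] & 0x3FFFFFFF) if i + 1 < n_pages else len(packed)
        pages.append((table, packed[start:end]))
    return pages
```

Each returned chunk can then be decoded independently, with the table-selector bits telling the decoder which of the two Huffman tables to use.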

As seen in the screenshot, the last compressed page (starting at offset 0xFA40) begins with the byte 0xB2 == 0b10110010, which is followed by 273 bytes with the value 0xEC == 0b11101100, and then zeroes. Since the bit sequence 11101100 (or 01110110) is repeated 273 times, we can assume that it encodes 4096/273 == 15 identical bytes (most likely with the value 0x00 or 0xFF). In this case, the bit sequence 10110010 (or 1011001) encodes 4096-273*15 == 1 byte.

This is consistent with the idea that each code sequence encodes a whole number of bytes (from 1 to 15 inclusive). But such an approach will not suffice for completely recovering the Huffman tables.
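The arithmetic behind this assumption takes only a couple of lines of Python to check:

```python
PAGE_SIZE = 4096
repeats = 273                         # occurrences of the 0xEC byte
per_codeword = PAGE_SIZE // repeats   # bytes encoded by each repetition
assert per_codeword == 15
# exactly one byte is left over for the leading 0xB2 codeword
assert PAGE_SIZE - repeats * per_codeword == 1
```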

Finding the Rosetta Stone
As shown previously [6], identically named modules in different versions of ME 11 firmware can be compressed with different algorithms. If we take the Module Attributes Extension for identically named modules that have been compressed with LZMA and Huffman encoding, and then extract the SHA256 value for each module version, we find that there is no LZMA/Huffman module pair that has the same hash values.

But one should remember that for modules compressed with LZMA, SHA256 is usually computed from compressed data. If we calculate SHA256 for modules after LZMA decompression, a large number of pairs appears. Each of these module pairs yields several pairs of matching pages, both with Huffman encoding and in unpacked form.

Shape, Length, Value
Having a large set of pages in both compressed and uncompressed form (separately for code and for data) allows recovering all of the code sequences used in those pages. The methods needed for this task combine linear algebra and search optimization. While it would likely be feasible to create a rigorous mathematical model taking all the relevant constraints into account, because this is a one-time task it was simpler to do part of the job manually and part with automated methods.

The main thing is to at least approximate the shape (points of change of code sequence lengths). For example, 7-bit sequences have values from 1111111 to 1110111, 8-bit sequences have values from 11101101 to 10100011, and so on. Since Canonical Huffman Coding is used, if we know the shape, we then can predict the length of the next code sequence (the shortest sequence consists of only ones, the longest one consists of only zeroes—the lower the value, the longer the sequence).
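The length-prediction property of a canonical code can be sketched as follows. The boundary table here is a hypothetical two-entry fragment built from the example values above, not the recovered ME 11.x shape (which extends to 15-bit codewords):

```python
# Lowest codeword value for each length, per the illustrative shape
# above: 7-bit codewords run from 1111111 down to 1110111, 8-bit ones
# from 11101101 down to 10100011. Hypothetical fragment for demo only.
LOWEST = {
    7: 0b1110111,
    8: 0b10100011,
}

def codeword_length(bits: str) -> int:
    """Peek at the bit stream and return the length of the next codeword.

    Because the code is canonical, a prefix of the stream is a valid
    codeword of length n iff its value is >= the lowest n-bit codeword.
    """
    for n in sorted(LOWEST):
        if int(bits[:n], 2) >= LOWEST[n]:
            return n
    raise ValueError('codeword longer than the known shape')
```

For instance, a stream beginning 1110110… cannot be a 7-bit codeword (0b1110110 is below the 7-bit boundary), so the decoder extends it to 8 bits.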

Not knowing the exact size of the compressed data, we can discard all trailing sequences consisting only of null bits. These sequences are the longest, and it is unlikely that the rarest code sequence would occur in the final position.

When each compressed page can be represented as a set of code sequences, we can start determining the lengths of the values that are encoded by them. The lengths of encoded values for each page should add up to 4096 bytes. With this knowledge, we can set up a system of linear equations: the unknowns are the lengths of encoded values, the coefficients are the number of times a particular code sequence is found in the compressed page, and the constant equals 4096. Code and data pages can both be "plugged in" at the same time, since for identical code sequences, the lengths of encoded values should be the same.

Once we have enough pages (and equations), Gaussian elimination provides the one valid solution. And once we have the uncompressed plaintext, length of each value, and their order, we easily derive which sequence codes for which value.
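A minimal sketch of this setup, using exact rational arithmetic so the integer solution survives elimination (a hypothetical helper, not the original tooling):

```python
from fractions import Fraction

def solve_lengths(counts, page_size=4096):
    """Solve for encoded-value lengths by Gaussian elimination.

    counts[i][j] = how often codeword j occurs in compressed page i;
    each page must decode to exactly page_size bytes, which yields one
    linear equation per page.
    """
    n = len(counts[0])
    # augmented matrix over exact rationals
    m = [[Fraction(c) for c in row] + [Fraction(page_size)] for row in counts]
    for col in range(n):
        # find a pivot row, normalize it, eliminate the column elsewhere
        piv = next(r for r in range(col, len(m)) if m[r][col] != 0)
        m[col], m[piv] = m[piv], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(len(m)):
            if r != col and m[r][col] != 0:
                m[r] = [a - m[r][col] * b for a, b in zip(m[r], m[col])]
    return [int(m[i][n]) for i in range(n)]
```

Using the example from the last page above: a page with 273 occurrences of one codeword and 1 of another, plus a second independent page, pins both lengths down (15 and 1 bytes).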

Unknown sequences
After running these methods on all pages for which we had both encoded and plaintext equivalents, we could recover up to 70% of sequences from the code table and 68% from the data table. Lengths were now known for approximately 92% of sequences. Meanwhile, the shape remained a bit of a mystery: in some places, either one value of shorter length or two values of a longer length could have been present, making it impossible to determine the boundary until one of the values is encountered in the unpacked data.

With this knowledge in hand, we can proceed to recover values for code sequences for which we do not have the matching plaintext.

If we have a sequence of unknown length, another row is added to our system of equations and we can quickly determine its length. But if we don't have the plaintext, how can we determine the value?

Verification and brute force to the rescue
Fortunately for us, the metadata contains the SHA256 value for the unpacked module. So if we correctly guess all unknown code sequences on all the pages that make up a module, the SHA256 value should match the value from the Module Attributes Extension.

When the total length of unknown sequences is 1 or 2 bytes, simple brute-forcing is enough to "crack the code." This method can also work with 3 or even 4 unknown bytes (especially if they are located close to the end of the module), but brute-forcing can take several hours or days on a single CPU core (that said, computation can easily be split among multiple cores or machines). No attempts were made to brute-force 5 or more bytes.
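The brute-force step can be sketched like this (a simplified illustration; in the real task the unknown positions follow from the recovered codeword layout):

```python
import hashlib
from itertools import product

def brute_force_unknown(module, unknown_positions, expected_sha256):
    """Guess bytes at still-unknown positions until SHA256 matches.

    module: bytearray of the mostly-recovered unpacked module;
    unknown_positions: offsets whose byte value is still unknown;
    expected_sha256: hex digest from the Module Attributes Extension.
    Feasible for 1-2 unknown bytes; slow but possible for 3-4.
    """
    buf = bytearray(module)
    for guess in product(range(256), repeat=len(unknown_positions)):
        for pos, val in zip(unknown_positions, guess):
            buf[pos] = val
        if hashlib.sha256(buf).hexdigest() == expected_sha256:
            return bytes(buf)
    return None  # no combination matched: some "known" byte must be wrong
```

Since each candidate is checked independently, the guess space splits trivially across cores or machines, as noted above.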

This method was able to recover several more code sequences (and several modules). This left only the modules in which the total number of unknown bytes exceeded the capabilities of brute force.

Heuristics
Since many modules are only slightly different from each other, we can apply several heuristics in order to decode more sequences by analyzing the differences.

Use of second Huffman table
Since the unpacker has two Huffman tables, the packer tries to compress data with both tables, discarding the version that takes up more space. So the division of code versus data is not clear-cut. And if a part of a page changes, the other table may be more efficient (yield a smaller result). So by looking at other versions of the same module, we can find identical fragments that were compressed by the other table, thus recovering unknown bytes.

Reuse
When a single code sequence is found many times in a single module or in multiple modules (such as in code and in data), it is often easy to figure out the constraints governing the unknown values.

Constants and tables with offsets
The module data often contains constants and offsets (such as for text strings and functions). Constants may be module-unique or shared among modules (for example, hashing and encryption functions). Meanwhile, offsets are unique to each module version but must refer to the same (or very similar) fragments of data or code. As a result, the values can easily be recovered.

String constants from open-source libraries
Some fragments of ME code were obviously borrowed from open-source projects (such as wpa_supplicant), which makes it easy to reconstruct fragments of text strings based on context and available source code.

Code from open-source libraries
If we look at the source code and find the text of a function whose compiled code contains several unknown bytes, we can simulate the compiler to guess which values fit.

Similar functions in other module versions
Since versions of the same module may be only slightly different, sometimes we can find the equivalent function in another version of the module and, based on the code, figure out what the unknown bytes should mean.

Similar functions in prior ME versions
When code has not been taken from public sources, if we have a fragment that is unknown in all available versions of a module and is not found in any other module (which was the case with the amt module), we can find the same place in previous versions (such as in ME 10), pick apart the logic of the function, and then see how it works in the unknown spot in ME 11.x.

Finishing the job
Starting with the modules containing the highest percentage of known sequences, and combining the described heuristics, we were gradually able to improve our coverage of the Huffman tables (each time testing our work against the SHA256 hash). Modules that originally contained dozens of unknown sequences now had only a few remaining. The process seemed nearly complete—except for that pesky amt.

As the largest module (around 2 megabytes, or 30% of the total), amt contains many codewords that are not found in any other module but occur in all versions of amt. We were highly confident of several sequences, but the only way to be sure would be to guess all of them correctly (so as to match the SHA256 hash). Thanks to the invaluable assistance of Maxim Goryachy, we were able to bring down this barrier as well. We could now unpack any module in the firmwares available to us that had been compressed with Huffman encoding.

New firmware versions appeared over time, containing all-new codewords. But in each case, one of the above heuristics succeeded in solving the module's mysteries, thereby further improving our coverage of the tables.

Closing notes
As of mid-June 2017, we had successfully recovered approximately 89.4% of sequences for the code table and 86.4% for the data table. But the chances of successfully obtaining 100% table coverage in a reasonable period of time based on analysis of new modules were slim at best.
On June 19, user IllegalArgument published Huffman table fragments on GitHub [7], covering 80.8% of the code table and 78.1% of the data table. Most likely, the author (or authors) used a similar approach based on analysis of available firmware versions. Thus the published tables did not provide any new information.

On July 17, Mark Ermolov and Maxim Goryachy made a breakthrough and could now find plaintext values for any compressed data. We prepared four compressed pages (two pages for each of the Huffman tables), recovering all 1,519 sequences for both tables [8].

In the process, one quirk came to light. In the Huffman table for data, the value 00410E088502420D05 corresponds to both 10100111 (8 bits long) and 000100101001 (12 bits long). This is a clear case of redundancy, most likely caused by an oversight.

The final shape of the data is as follows:

  1. https://recon.cx/2014/slides/Recon%202014%20Skochinsky.pdf
  2. https://en.wikipedia.org/wiki/Huffman_coding
  3. http://www.7-zip.org/sdk.html
  4. http://io.netgarage.org/me/
  5. http://www.cs.uofs.edu/~mccloske/courses/cmps340/huff_canonical_dec2015.html
  6. https://www.troopers.de/troopers17/talks/772-intel-me-the-way-of-the-static-analysis/
  7. https://github.com/IllegalArgument/Huffman11
  8. https://github.com/ptresearch/unME11

Author: Dmitry Sklyarov, Head of Application Research Unit, Positive Technologies

Positive Technologies on GitHub

December 4, 2017 - 14:01

Currently, an increasing number of companies, such as Google, Microsoft, Facebook, and JetBrains, are releasing the code of both small and large projects as open source. Positive Technologies is known not only for its skilled IT security professionals but also for its many professional developers. This enables us to contribute to the further development of open-source projects.

PT has the following GitHub groups that support our open projects:

Below is a detailed description of the first group and its projects, along with brief descriptions of the others.

Positive Technologies
This is the main group, where we develop projects designed for open access from the start as well as projects that used to be internal only. Educational and demo projects are also run here.

Open DevOps community
The Community aims to build ready-made open solutions for managing the full cycle of development, testing, and related processes, as well as product delivery, deployment, and licensing.

Currently, the Community is at an early stage of development but it already provides some useful Python-based tools. Yes, we do love Python!

Active projects:

  1. Crosspm is a universal package manager for downloading packages to assemble multi-component products, based on rules set out in a manifest.
  2. Vspheretools is a tool for controlling vSphere machines straight from the console. It can also be used as an API library in your Python scripts.
  3. YouTrack Python 3 Client Library is a Python client for the YouTrack API.
  4. TFS API Python client is a Python client for the Microsoft Team Foundation Server API.
  5. A Python client for Artifactory is a Python client for the API of the Artifactory binary data storage.
  6. FuzzyClassificator is a universal neuro-fuzzy classifier of arbitrary objects whose properties can be measured on a fuzzy scale.

Each tool has an automatic Travis CI build that is uploaded to the PyPI repository, where it can be found and installed with the standard pip install.

Some other tools are getting ready to be published:

  1. CrossBuilder is a system for cross-platform builds (build as code). It is similar to Travis CI but independent of the CI system used (TeamCity, Jenkins, GitLab CI, etc.).
  2. ChangelogBuilder generates release notes describing changes to product features. It receives and aggregates data from various trackers (TFS, YouTrack, GitLab, etc.).
  3. polyglot.sketchplugin is a plugin for Sketch, the tool used by designers, that simplifies working with multilingual text.

Everybody is welcome to contribute a new tool. If you wish to create your own project, see our ExampleProject for the project structure and detailed guidelines on how to create a project: simply copy it and build your own project on top of it. If you have any ideas or tools for automation, you are welcome to share them with the Community under an MIT license. This is fashionable, honorable, and prestigious :)

Positive research
This is a repository for publishing research articles, presentations, utilities (including those for detecting vulnerabilities), signatures, and methods of attack detection.

  • Pentest-Detections is a utility for quick network scanning (with IPv4 and IPv6 support) and detection of vulnerabilities that can be exploited by WannaCry or NotPetya.
  • UnME11 is a set of tools for unpacking the latest versions of Intel ME 11.x.
  • Bad_Tuesday_Cryptor_SIEM is a MaxPatrol SIEM package for combating NotPetya.
  • Me-disablement contains methods for disabling Intel ME. The repository contains only the old method; for a new method based on the High Assurance Platform (HAP) mode, see our article Disabling Intel ME 11 via undocumented mode.

AttackDetection
The attack detection team contributes rules to this repository for detecting the exploitation of vulnerabilities with the Snort and Suricata intrusion detection systems. The project's main goal is to create rules for vulnerabilities that are widespread and of high severity. The repository contains files for integration with oinkmaster, a script for updating and deploying rules to a designated IDS, as well as traffic files for testing the rules. Notably, the repository has been added to favorites about 100 times, and rules for about 40 new vulnerabilities have been added over the year, including BadTunnel, ETERNALBLUE, ImageTragick, EPICBANANA, and SambaCry. Announcements about new vulnerabilities are published on Twitter.

Positive JS
This is a community for developing tools (mainly web tools) used in PT products.

LibProtection
This organization unites members of the Positive Development User Group who are adapting the LibProtection library for various languages and platforms. The library provides developers with safe ways of working with strings, sanitizing input data and automatically protecting applications from injection attacks.

PT.PM
PT Pattern Matching Engine is a universal signature-based code analysis engine that accepts user patterns written in a domain-specific language (DSL). The engine is used to check web applications for vulnerabilities in Approof, as well as in the PT Application Inspector source code analyzer.

The analysis includes several stages:

  1. Parsing of the source code into the parse tree.
  2. Converting the tree into a unified format.
  3. Comparing the tree with user patterns.

The approach used in this project makes it possible to unify pattern development across different languages.
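To make the three stages concrete, here is a deliberately tiny Python sketch of the last step: matching a pattern against a tree already converted to a unified node format. All type and function names here are hypothetical illustrations, not PT.PM's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class UNode:
    """A node in the unified (language-independent) tree."""
    kind: str                     # e.g. 'call', 'literal'
    value: str = ''
    children: list = field(default_factory=list)

def matches(node: UNode, pattern: UNode) -> bool:
    """A pattern matches a node if kinds agree, the pattern's value is
    empty (wildcard) or equal, and every pattern child matches in order."""
    if node.kind != pattern.kind:
        return False
    if pattern.value and pattern.value != node.value:
        return False
    if len(node.children) < len(pattern.children):
        return False
    return all(matches(n, p) for n, p in zip(node.children, pattern.children))

def find_matches(root: UNode, pattern: UNode):
    """Collect every subtree of root that the pattern matches."""
    hits = [root] if matches(root, pattern) else []
    for child in root.children:
        hits += find_matches(child, pattern)
    return hits
```

Because the pattern only ever sees unified nodes, one pattern (say, "a call to md5") applies to every language for which a converter exists.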

PT.PM uses continuous integration, with modules built and tested both on Windows and on Linux (Mono). Development is driven by labelled issues and pull requests. Alongside development, we also document the project and publish the results of all major builds, both as NuGet packages and as raw artifacts. The way PT.PM is organized can serve as an example for future projects.

For the first stage, source code parsing, we use parsers based on ANTLR. The tool generates them for different runtimes from the formal grammars contained in the grammars repository, which our company actively develops. Currently, the Java, C#, Python 2 and 3, JavaScript, C++, Go, and Swift runtimes are supported; support for the latter three was added only recently.

Notably, ANTLR is used not only in PT application security projects but also in MaxPatrol SIEM, where it processes the domain-specific language used to describe dynamic asset groups. This knowledge exchange has prevented wasting time on tasks that had already been solved.

ANTLR grammars
Positive Technologies has helped to develop and improve grammars for PL/SQL, T-SQL, MySQL, PHP, Java 8, JavaScript, and C#.

PL/SQL
SQL has a vast syntax with many keywords. Fortunately, a PL/SQL grammar already existed for ANTLR 3, and it was not that difficult to port it to ANTLR 4.

T-SQL
No reliable parsers were found for T-SQL, let alone open-source ones, so it took us considerable time and effort to reconstruct the grammar from MSDN documentation. In the end we achieved a good result: the grammar covers many common syntactic constructions, looks neat, stays independent of the runtime, and has been tested (against the example SQL queries on MSDN). Since 2015, over 15 external users have contributed to the grammar. It is also now used in DBFW, a prototype database firewall and a subproject of PT Application Firewall.

MySQL
This grammar was developed by the same team on the basis of the T-SQL grammar. It is also used in DBFW.

PHP
This grammar was translated from Bison to ANTLR. It is interesting for its simultaneous support of PHP, JavaScript, and HTML. More precisely, embedded JavaScript and HTML sections are parsed as text, which is then processed by the parsers for those languages.

Java
The grammar supporting Java 8 was developed recently, based on the grammar for Java 7. The new grammar adds substantially expanded and improved test examples covering various syntax (AllInOne7.java, AllInOne8.java) and performance test results for popular Java projects (including jdk8, Spring Framework, and Elasticsearch).

JavaScript
This grammar was developed on the basis of the old ECMAScript grammar, without strict compliance with the standard. When developing grammars, we focus primarily on practicality and simplicity rather than mere formal compliance. A distinctive feature is almost full support of ECMAScript 6, as well as of outdated constructions (HTML comments, CDATA sections).

Not all syntax constructions can be described with grammar rules alone. In some cases, it is convenient and important to use code in the target runtime language. For instance, in JavaScript the token get is just an identifier in some cases, while in others it is a keyword introducing a property getter. So it is possible to parse this token as a common identifier and check its value in the parser when processing the property:

    : Identifier{p("get")}? propertyName

This grammar is interesting because these code fragments are universal at least for C# and Java runtimes thanks to the superClass option.

This means that, in the C# code, the function p("get") is described in the parent class JavaScriptBaseParser.cs:

protected bool p(string str) => _input.Lt(-1).Text.Equals(str);

As for Java, this function looks as follows (JavaScriptBaseLexer.java):

protected boolean p(String str) {
    return _input.LT(-1).getText().equals(str);
}

C#
Being mostly experimental, this grammar was created to compare the speed of the ANTLR and Roslyn parsers.

Developments and prospects
For more details on grammar development, please see our last year's article Theory and Practice of Source Code Parsing with ANTLR and Roslyn.

As can be seen from the change history and the numerous merged pull requests (tsql, plsql, mysql), these grammars are constantly being improved, not only by Positive Technologies but also by a number of third-party developers. Over the course of this collaboration, the repository has grown not only in quantity but also in quality.

PT.SourceStats
This tool collects statistics on projects in different programming languages and is used in the free product Approof.

AspxParser
This project develops a parser for ASPX pages that is used not only in the open PT.PM engine but also in our internal analyzer of .NET applications (AI.Net), which is based on abstract interpretation of code.

FP Community rules
In this repository, we develop rule sets in the YARA format. They are used in FingerPrint, the signature analysis module of Approof.

The FingerPrint engine runs over a set of source code (backend and frontend). Following the rules described, YARA searches for known versions of external components (for example, version 3 of a given library). The rules are written to contain signatures of vulnerable library versions, with the problems described in text form.

Each rule includes several conditions for checking a file, for instance, certain strings that the file must contain. If a file meets the conditions, Approof includes in the final report information about the vulnerabilities found in the component, indicating its version and describing all related CVEs.

Education and demo projects
At PHDays VII, the Appsec Outback master class was held in the PDUG section. For the master class, we developed educational demo versions of the Mantaray static code analyzer and the Schockfish network firewall. These projects include all the main mechanisms used in mature security tools. Unlike the latter, their main goal is to demonstrate algorithms and security methods, help explain the process of application analysis and protection, and show the fundamental possibilities and limitations of the technologies.

The repository also contains examples of security tools implementation:

  • DOMSanitizer — a module for detecting XSS attacks against web browsers
  • DOMParanoid — a module (security linter) for assessing HTML security.

Our projects use both permissive licenses (MIT, Apache) and our own license, which allows free non-commercial use.

Our move to GitHub has proved quite useful and has given us experience in various areas: setting up DevOps for Windows and Linux, writing documentation, and development itself.

Positive Technologies plans to expand Open Source projects even further.

Author: Ivan Kochurkin, Positive Technologies

Intel fixes vulnerability found by Positive Technologies researchers in Management Engine

November 23, 2017 - 16:36

Intel has issued a security advisory and released a patch for a vulnerability discovered in Intel ME by Positive Technologies researchers Mark Ermolov and Maxim Goryachy. Intel has also published a downloadable detection tool so that administrators of Windows and Linux systems can determine whether their hardware is at risk.

Intel Management Engine is a proprietary dedicated microcontroller integrated into the Platform Controller Hub (PCH) with a set of built-in peripherals. Since the PCH is the conduit for almost all communication between the CPU and external devices, Intel ME has access to practically all data on the computer. The researchers found a flaw that allows running unsigned code on the PCH on any chipset for Skylake processors and later.

For example, attackers could target computers with a vulnerable version of Intel ME in order to implant malware (such as spyware) in the Intel ME code. This would be invisible to most traditional protection methods, since the malware is running on its own microprocessor—one separate from the CPU, which is used to run most operating systems and security software. Because the malware would not visibly impact performance, victims may not even suspect the presence of spyware that is resistant both to OS reinstalls and BIOS updates.

The Intel security advisory gives a full list of vulnerable processors:

  • 6th, 7th & 8th generation Intel Core processor family
  • Intel Xeon Processor E3-1200 v5 & v6 product family
  • Intel Xeon Processor Scalable family
  • Intel Xeon Processor W family
  • Intel Atom C3000 processor family
  • Apollo Lake Intel Atom Processor E3900 series
  • Apollo Lake Intel Pentium
  • Celeron N and J series processors

Maxim Goryachy, one of the Positive Technologies researchers, described the situation as follows: “Intel ME is at the core of an enormous number of devices worldwide. This is why it was so important to test its security. This subsystem sits below the OS and has access to a huge range of data. An attacker could use this privileged access to evade detection by traditional protection tools, such as antivirus software. Our close partnership with Intel was aimed at responsible disclosure, as part of which Intel has taken preventive measures, such as creating a detection tool to identify affected systems. We recommend reading the Intel security advisory for full details. We plan to present this vulnerability in greater detail in an upcoming talk at Black Hat Europe.”