3, 2, 1 -- Kernel Down

About a year and a half ago, I was taking a real-time operating systems course. I don't want to bore you with details, so I'll give a high-level overview of what happened up to the point where this story begins. We started the class with basically nothing and, from scratch, built a small operating system with kernel and user-level spaces, system calls, interrupts, and IO.

While working on the IO assignment, we spotted a peculiar bug. Whenever we started the kernel, it would crash after around 20 seconds. What's weird is that another team had the exact same bug. So began our journey of hunting down the root cause. To us amateurs, debugging kernel bugs felt like torture.

We went over the code multiple times without finding any obvious cause. I even objdumped the kernel to see whether some stray instruction was causing the error. That proved futile. At some point, one of us noticed that the kernel didn't crash in the version we had submitted before implementing interrupts. Now at least we knew where to look: the culprit must have been introduced while we were adding interrupts. This didn't make debugging any easier. Frustrated and naive, we chose the easy but time-consuming approach of commenting out bits and pieces of the interrupt code until the bug disappeared. Every attempt took a long time, because we had to identify the piece we wanted to comment out, comment it out and make sure the code still worked without it, build the kernel, deploy it onto the Raspberry Pi, and then wait another 20 seconds to see whether the kernel crashed.

We started out by commenting this piece of code out:

// array of interrupt ids, C1 and C2 are clock interrupts
// static const uint32_t iids[] = {C1_ID, C2_ID, UART_ID};
static const uint32_t iids[] = {UART_ID};
enable_interrupts(iids, sizeof(iids)/sizeof(uint32_t));

This confirmed our theory that the interrupt code was causing the crash. Excited, we enabled interrupts again and proceeded to comment out bits of the code, but hours went by and the crash kept happening. Eventually we had commented out effectively all of the interrupt code, and the bug was still there. Half a day had passed and it was already late at night, so we left the lab and agreed to meet again in the morning. Deflated, I headed home but kept thinking about the issue on my walk.

This is when I realized something. The bug only disappeared when we disabled the interrupts completely. What if the bug was in the context switch code and not in the interrupt handler? I looked at the code and saw this:

// sets the current task and restores the context of said task
void activate(struct task* t) {
    cur_task = t;

    // restore the value of the registers
    asm volatile("mov x1, %0" :: "r"(t->tf.x1));
    asm volatile("mov x2, %0" :: "r"(t->tf.x2));
    // ...
    // More bad code
    // ...
    asm volatile("mov x30, %0" :: "r"(t->tf.x30));
    asm volatile("msr ELR_EL1, %0" :: "r"(t->tf.ELR_EL1));
    asm volatile("msr SPSR_EL1, %0" :: "r"(t->tf.SPSR_EL1));
    asm volatile("msr SP_EL0, %0" :: "r"(t->tf.SP_EL0));
    // have to set x0 last because of how gcc compiles
    asm volatile("mov x0, %0" :: "r"(t->tf.x0));

    asm("eret");
}

It took me a few minutes to spot the bug myself. It might not be obvious unless you have worked with this type of code before, and if you have, you can tell right away that my code is terrible. The bug is the asm("eret") at the end. eret is the instruction that returns from an exception, which in our case means dropping from kernel mode back to user mode. The issue is that once eret ran, the activate function never reached its true end, so the code that releases its stack frame never executed. That leaked around 16 bytes of stack on each call, which might not sound like a big deal. But remember, a context switch was happening once every 10 milliseconds. Over 20 seconds, that's about 2,000 switches and roughly 32 KB of leaked stack. As our primitive kernel lacked any sort of memory protection, the stack probably grew until it ran into some other kernel memory and crashed the whole thing.
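The back-of-the-envelope numbers can be checked with a small sketch. The helper names below are made up for illustration and are not part of our kernel, and the 16-byte frame and 10 ms period are just the rough figures quoted above:

```c
#include <stdint.h>

/* Hypothetical helper: bytes of kernel stack leaked if every context
 * switch leaves one frame of frame_bytes behind.
 * Returns 0 if switch_period_ms is 0 to avoid dividing by zero. */
uint64_t leaked_bytes(uint64_t elapsed_ms, uint64_t switch_period_ms,
                      uint64_t frame_bytes)
{
    if (switch_period_ms == 0) {
        return 0;
    }
    uint64_t switches = elapsed_ms / switch_period_ms; /* number of activate() calls */
    return switches * frame_bytes;
}

/* Hypothetical helper: milliseconds until the leak has eaten through
 * headroom_bytes of free memory next to the kernel stack.
 * Returns 0 if frame_bytes is 0, since nothing leaks in that case. */
uint64_t ms_until_overrun(uint64_t headroom_bytes, uint64_t switch_period_ms,
                          uint64_t frame_bytes)
{
    if (frame_bytes == 0) {
        return 0;
    }
    /* round up: a partially overrunning frame still counts */
    uint64_t switches = (headroom_bytes + frame_bytes - 1) / frame_bytes;
    return switches * switch_period_ms;
}
```

With these assumptions, leaked_bytes(20000, 10, 16) comes out to 32000 bytes. Going the other way, a crash after about 20 seconds would suggest the stack had something like 32 KB of room before it hit other kernel data, though we never measured that.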

I messaged the other team and sure enough, they also had an asm("eret") in their context switch code. The solution was easy: get rid of all the (dirty and evil) inline assembly, move the entire context switch into an assembly file, and link it. A hand-written assembly routine has no compiler-generated stack frame, so there is nothing left behind when eret jumps away.

And this was the story of how one line of code in one objectively terribly written function caused a small obscure problem that compounded and crashed a whole kernel.

Feel free to contact me at aryan.abed[at]pm.me