About a year and a half ago, I was taking a real-time operating systems course. I don't want to bore you with details, so I'll give a high-level overview of what happened up to the point where this story begins. We started the class with basically nothing and, from scratch, built a small operating system with kernel and user spaces, system calls, interrupts, and IO.
It was while working on the IO assignment that we spotted a peculiar bug. Whenever we started the kernel, it would crash after around 20 seconds. What's weird is that another team had the exact same bug. So began our journey of hunting down the root cause. To us amateurs, debugging kernel bugs felt like torture.
We went over the code multiple times without finding an obvious cause. I even ran objdump on the kernel to see whether some stray instruction was causing the error. That proved futile. At some point, one of us noticed that the kernel didn't crash in the version we had submitted before implementing interrupts. Now at least we knew where to look: the culprit must have been introduced along with the interrupts. That didn't make debugging any easier. Frustrated and naive, we took the easy but time-consuming route of commenting out bits and pieces of the interrupt code until the bug disappeared. Each attempt took a long time because we had to identify the piece we wanted to comment out, comment it out, make sure the code still worked without it, build the kernel, deploy it onto the Raspberry Pi, and then wait another 20 seconds to see whether the kernel crashed.
We started out by commenting this piece of code out:
// array of interrupt ids, C1 and C2 are clock interrupts
// static const uint32_t iids[] = {C1_ID, C2_ID, UART_ID};
static const uint32_t iids[] = {UART_ID};
enable_interrupts(iids, sizeof(iids)/sizeof(uint32_t));
This confirmed our theory that the interrupt code was causing the crash. Excited, we enabled the interrupts again and proceeded to comment out bits of the code, but time went on and the crash kept happening. We had effectively commented out all of the interrupt code and the bug was still there. Half a day had passed and it was already late at night, so we left the lab to regroup later. Deflated, I headed home but kept thinking about the issue on my walk.
That is when I realized something. The bug only disappeared when we disabled the interrupts completely. What if the bug was in the context switch code and not the interrupt handler? I looked at the code and saw this:
// sets the current tasks and restores the context of said task
void activate(struct task* t) {
    cur_task = t;
    // restore the value of the registers
    asm volatile("mov x1, %0" :: "r"(t->tf.x1));
    asm volatile("mov x2, %0" :: "r"(t->tf.x2));
    // ...
    // More bad code
    // ...
    asm volatile("mov x30, %0" :: "r"(t->tf.x30));
    asm volatile("msr ELR_EL1, %0" :: "r"(t->tf.ELR_EL1));
    asm volatile("msr SPSR_EL1, %0" :: "r"(t->tf.SPSR_EL1));
    asm volatile("msr SP_EL0, %0" :: "r"(t->tf.SP_EL0));
    // have to set x0 last because how gcc compiles
    asm volatile("mov x0, %0" :: "r"(t->tf.x0));
    asm("eret");
}
It took me a few minutes to spot the bug myself, and it might not be obvious unless you have worked with this type of code before. If you have, though, you can tell that my code is terrible. The bug is the asm("eret") call. eret is an assembly instruction that returns from kernel mode to user mode. The issue is that once eret ran, the activate function never reached its end, so it never got to deallocate its stack space. That is around 16 bytes of leaked memory on each call, which might not sound like a big deal. But remember, a context switch was happening once every 10 milliseconds. Over 20 seconds, that's 2,000 switches, or about 32 KB of leaked memory. As our primitive kernel lacked any sort of memory protection, the stack probably grew so much that it ran into some other kernel memory and crashed the whole thing.
I messaged the other team and, sure enough, they also had an asm("eret") in their context switch code. The solution was easy: get rid of all the inline assembly, move the entire context switch code into an assembly file, and link it.
And that is the story of how one line of code in one objectively terribly written function caused a small, obscure problem that compounded and crashed a whole kernel.
Feel free to contact me at aryan.abed[at]pm.me