How to Solve Android Kernel Panic

While studying a library compiled for Android’s ARM platform, I came across the following message when booting the Android ARM OS inside the emulator:

panic01

The error "Kernel panic - not syncing" happens at boot time, and the stack trace is as follows:

panic02

The stack trace above is rather useless: it doesn’t show what caused the panic, only the order in which the signal was processed.

The only useful piece of information is the signal code, which appears in the error message as exitcode=0x0000000b. This code is actually SIGSEGV (signal 11), so some module or function must be triggering this signal.
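As a minimal sketch of how that exit code can be read, assuming it follows the usual Linux wait-status layout (signal number in the low 7 bits, core-dump flag in bit 7, exit status in the next byte). The function name and the small signal table are my own, for illustration only:

```python
# Partial table of Linux signal numbers (illustrative, not exhaustive).
SIGNAL_NAMES = {
    4: "SIGILL",
    6: "SIGABRT",
    7: "SIGBUS",
    8: "SIGFPE",
    9: "SIGKILL",
    11: "SIGSEGV",
}


def decode_exitcode(exitcode):
    """Split a wait-status style exit code into its parts (hypothetical helper)."""
    signo = exitcode & 0x7F              # low 7 bits: terminating signal
    core_dumped = bool(exitcode & 0x80)  # bit 7: core-dump flag
    exit_status = (exitcode >> 8) & 0xFF  # next byte: normal exit status
    return {
        "signal": signo,
        "name": SIGNAL_NAMES.get(signo, "unknown"),
        "core_dumped": core_dumped,
        "exit_status": exit_status,
    }


info = decode_exitcode(0x0000000B)
print(info["signal"], info["name"])
```

For the value in the panic message, the low 7 bits are 11, which is why the code reads as SIGSEGV.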

To find which module causes the signal, it is necessary to examine the Android kernel; in my case, that is the Android goldfish kernel source. There are several candidates that can trigger the SIGSEGV signal, such as force_sig, force_sigsev, arm_syscall, etc. These have to be eliminated one by one until only the one verified to actually execute remains.

So how do you verify each candidate function? Again, kallsyms and the modified Android emulator come to the rescue. You can consult the previous article on how to set a physical breakpoint on the kernel function of interest.
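The lookup step can be sketched roughly as follows, assuming the symbol table is available as text in the usual kallsyms format (address, type, name per line). The addresses in SAMPLE_KALLSYMS are hypothetical, not taken from my kernel, and the helper names are my own:

```python
def parse_kallsyms(text):
    """Map symbol name -> address from kallsyms-style text."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        address, _symbol_type, name = parts[0], parts[1], parts[2]
        symbols.setdefault(name, int(address, 16))
    return symbols


def find_candidates(symbols, names):
    """Return the address of each candidate, or None if it is not listed."""
    return {name: symbols.get(name) for name in names}


# Hypothetical excerpt; real addresses depend on the kernel build.
SAMPLE_KALLSYMS = """\
c0031a00 t __do_user_fault
c0031b40 T do_page_fault
c0052f10 T force_sig
"""

symbols = parse_kallsyms(SAMPLE_KALLSYMS)
found = find_candidates(symbols, ["force_sig", "do_page_fault", "__do_page_fault"])
for name, address in found.items():
    print(name, hex(address) if address is not None else "not listed")
```

A candidate that comes back as not listed cannot be given a breakpoint by address this way, so it has to be checked by other means.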

The elimination process leaves the do_page_fault kernel function as the cause of the SIGSEGV. But to confirm that the function flow really reaches the point of triggering SIGSEGV, it is necessary to examine whether it actually calls the __do_page_fault function.

Unfortunately, __do_page_fault is not listed in kallsyms, so I had to find another way to trace the function flow. Inside do_page_fault, setting aside __do_page_fault, there are two more calls to examine: __do_kernel_fault and __do_user_fault.

But examining __do_kernel_fault at the source level convinced me it is unlikely to be called, because it leads to a very different kernel error message from the one shown at the start of this article.

This leaves only __do_user_fault, and it does indeed get called. This is the register state at the start of the function:

panic03

This is the prototype of the function:

panic04

There are many false faults in the Android system, because it uses the fault mechanism as a means of performing other routines, such as memory allocation.

Notice that R3 contains 11 (0x0B), which is the kernel code for SIGSEGV.

To differentiate between real and false faults, it is necessary to examine the registers at the start of __do_user_fault. Here is the real one:

panic05

At the start of the __do_user_fault function, the R11 register contains the fault code, in this case 65536 = 0x10000, which is VM_FAULT_BADMAP. The requested address is in the R4 register: 196608 = 0x30000. The system then stops by raising a kernel panic and halting.
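The decimal-to-hex reading of those two registers can be checked with a short sketch. The two constants are the values I believe arch/arm/mm/fault.c uses in kernels of this era; treat them as an assumption and verify them against your own tree. The helper name is my own:

```python
# Assumed values from arch/arm/mm/fault.c; check your kernel source.
VM_FAULT_BADMAP = 0x010000
VM_FAULT_BADACCESS = 0x020000


def describe_fault(code):
    """Return the names of the ARM-specific fault flags set in code."""
    flags = []
    if code & VM_FAULT_BADMAP:
        flags.append("VM_FAULT_BADMAP")
    if code & VM_FAULT_BADACCESS:
        flags.append("VM_FAULT_BADACCESS")
    return flags


r11_fault_code = 65536     # value seen in R11
r4_fault_address = 196608  # value seen in R4

print(hex(r11_fault_code), describe_fault(r11_fault_code))  # 0x10000 ['VM_FAULT_BADMAP']
print(hex(r4_fault_address))                                # 0x30000
```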

So why does this specific memory address cause a fault? To answer this question, I needed a complete emulator execution log, so that I could locate the exact position of the error.

By examining the generated qemu-log, I found a strange execution pattern (moments before the system crashes):

panic06

It seems the emulator is caught in a runaway execution that keeps going until it hits some memory boundary and causes the unrecoverable fault. This is evident from the next pc, which is located at 0xC0031C1C. This address is in the vicinity of the do_page_fault function, which calls __do_user_fault when it gives up after trying the recovery routines.

By tracing the runaway execution log above back to its starting point, I noticed the following anomaly:
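Placing a pc value "in the vicinity" of a function amounts to finding the nearest symbol at or below that address. A minimal sketch, again assuming kallsyms-style text; the symbol addresses in SAMPLE_SYMBOLS are hypothetical and chosen only to illustrate the lookup:

```python
import bisect


def build_symbol_table(text):
    """Return a list of (address, name) pairs sorted by address."""
    table = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            table.append((int(parts[0], 16), parts[2]))
    table.sort()
    return table


def resolve_pc(table, pc):
    """Return (symbol, offset) for the nearest symbol at or below pc."""
    addresses = [address for address, _name in table]
    index = bisect.bisect_right(addresses, pc) - 1
    if index < 0:
        return None  # pc lies below every known symbol
    address, name = table[index]
    return name, pc - address


# Hypothetical addresses; real ones come from your kernel's kallsyms.
SAMPLE_SYMBOLS = """\
c0031a00 t __do_user_fault
c0031b40 T do_page_fault
c0031e00 t do_translation_fault
"""

table = build_symbol_table(SAMPLE_SYMBOLS)
resolved = resolve_pc(table, 0xC0031C1C)
print(resolved)
```

With these made-up addresses the pc resolves to do_page_fault plus an offset; on a real system the same lookup should be run against the actual symbol table.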

panic07

Instead of performing an unconditional jump to the requested address, the emulator just keeps executing the instructions below it, as if blind to the fact that the jump instruction was there.

In the end, after examining the emulator’s source code, I noticed that while modifying the routine inside disas_thumb_insn (translate.c), I had accidentally removed the cpu_lduw_code statement, so the instruction was not processed when the Android OS switched to Thumb mode.

The cause is so simple, but the effect is enormous and almost impossible to track down, like finding a needle in a haystack the size of a football field.

And when it happens in the middle of a deadline, it will surely cause a panic attack, as suggested by Android’s Linux kernel message 🙂
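To illustrate why the missing fetch matters: a translator can only recognise a branch if it has actually read the 16-bit halfword at the current pc. The sketch below is not QEMU's code; it is a toy decoder for the 16-bit Thumb unconditional branch encoding (top five bits 11100, followed by an 11-bit signed offset in halfwords), with function names of my own choosing:

```python
import struct


def fetch_halfword(memory, offset):
    """Read one little-endian 16-bit Thumb instruction from a bytes object."""
    return struct.unpack_from("<H", memory, offset)[0]


def decode_thumb_branch(insn, pc):
    """Return the branch target for a 16-bit unconditional B, else None."""
    if (insn >> 11) != 0b11100:
        return None  # not an unconditional branch
    offset = (insn & 0x7FF) << 1  # imm11 counts halfwords
    if offset & 0x800:            # sign-extend the 12-bit byte offset
        offset -= 0x1000
    # In Thumb state the pc reads as the instruction address plus 4.
    return (pc + 4 + offset) & 0xFFFFFFFF


code = bytes([0xFE, 0xE7])  # 0xE7FE: branch-to-self
insn = fetch_halfword(code, 0)
target = decode_thumb_branch(insn, 0x1000)
print(hex(insn), hex(target))
```

If the fetch is skipped, the decoder works on a stale or uninitialised value, the branch pattern is never matched, and execution falls through to the next instruction, which matches the behaviour seen in the log.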
