Case of Intermittent Crashes

I’ve just finished a server application that writes OPC data to a historical database. During testing, I noticed that the program crashes every now and then. Sometimes it runs smoothly for 3 to 4 hours, or even a whole day, before it crashes.

With Dr. Watson’s default configuration, the crash dump can be retrieved from the temporary directory shown in the technical information of the crash dialog box:

cr01

cr02

Opening the crash dump file in WinDbg and running the .ecxr and kv commands reveals the location of the access violation:

cr03

So the crash happens because the ecx register does not hold a valid address at the time of the call. What value should ecx hold when the application runs normally?

Since a crash dump is postmortem data, it is next to impossible to find the object that ecx was supposed to reference. Instead, I can use the live program to observe the ecx value at the same instruction.

I set a breakpoint at that location while the program is behaving normally, and because I compiled it with private symbols, I can view the local variables as follows:

cr04

So, under normal conditions, ecx points to a vector of CLPHIntervalValue. In the crash dump, by contrast, it holds a null pointer plus offset 0x2C. To find out which variable lives at offset 0x2C, I move up the call stack to the frame just before the call that crashes:

cr05

From the picture above, offset 0x2C is clearly the arIntVal variable, and its address is calculated from 0x223b300, which is an instance of the CLPHIntervalMgr class.

From this I can deduce that ecx = 0x2C was derived from a non-existent CLPHIntervalMgr. But how could this happen?
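To make the arithmetic concrete, here is a minimal sketch of how a member address is formed from an object base plus a fixed offset. The layout below is purely hypothetical (the real CLPHIntervalMgr declaration is not shown here); the only things taken from the dump are the offset 0x2C and the base address 0x223b300. The names IntervalMgrLayout and memberAddress are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout, NOT the real CLPHIntervalMgr: it only reproduces
// the fact that the member of interest sits 0x2C bytes into the object.
struct IntervalMgrLayout {
    unsigned char otherMembers[0x2C]; // whatever precedes the member
    int arIntVal;                     // placeholder for the real vector member
};

static_assert(offsetof(IntervalMgrLayout, arIntVal) == 0x2C,
              "the sketch assumes the member lives at offset 0x2C");

// The compiler computes a member's address as "object base + offset".
// Plain integer arithmetic is used here so that a null base can be
// shown without dereferencing anything.
std::uintptr_t memberAddress(std::uintptr_t objectBase) {
    return objectBase + offsetof(IntervalMgrLayout, arIntVal);
}
```

With a valid base such as 0x223b300, memberAddress yields 0x223b32c. With a base of 0 it yields exactly 0x2C, which matches the ecx value seen in the crash dump.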

The “this” pointer referring to CLPHIntervalMgr is proof that the call into the object succeeded, but somehow, by the time the code accesses the private arIntVal member of this class, the CLPHIntervalMgr object has vanished into thin air, leaving only the offset 0x2C.

As it turned out, this peculiarity is a classic multithreading problem, which appears when a routine is not protected by a synchronization object.

To be precise, I have several threads that write to the memory cache and one thread that removes entries from the cache and writes them to the database. After a cache-writing thread has successfully retrieved the CLPHIntervalMgr object, the thread that writes to the database removes it, so the object is gone by the time the first thread tries to access arIntVal.

The problem is solved by moving the routine that adds to the memory cache inside the synchronization block.
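The idea behind the fix can be sketched roughly as follows. This is not the application's actual code: IntervalCache, add, drain, countValues and runDemo are hypothetical names, and std::mutex stands in for whatever synchronization object the real program uses. The point is only that the lookup and the append happen under the same lock that the removing thread takes, so an entry cannot disappear between the two steps.

```cpp
#include <atomic>
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the in-memory cache.
class IntervalCache {
public:
    // Called by the cache-writing threads. Lookup and append share one
    // lock, so the entry cannot be removed in between.
    void add(int tagId, double value) {
        std::lock_guard<std::mutex> lock(mutex_);
        entries_[tagId].push_back(value);
    }

    // Called by the database thread. The whole cache is detached under
    // the lock and returned, so it can be written out afterwards.
    std::map<int, std::vector<double>> drain() {
        std::map<int, std::vector<double>> drained;
        std::lock_guard<std::mutex> lock(mutex_);
        drained.swap(entries_);
        return drained;
    }

private:
    std::mutex mutex_;
    std::map<int, std::vector<double>> entries_;
};

// Counts the values in a drained batch (stands in for the database write).
std::size_t countValues(const std::map<int, std::vector<double>>& batch) {
    std::size_t n = 0;
    for (const auto& kv : batch) n += kv.second.size();
    return n;
}

// Runs several cache-writing threads against one draining thread and
// returns how many values were seen in total.
std::size_t runDemo(int writers, int perWriter) {
    IntervalCache cache;
    std::size_t flushed = 0;
    std::atomic<bool> done(false);

    std::thread flusher([&] {
        while (!done.load()) {
            flushed += countValues(cache.drain());
            std::this_thread::yield();
        }
    });

    std::vector<std::thread> pool;
    for (int w = 0; w < writers; ++w) {
        pool.emplace_back([&cache, w, perWriter] {
            for (int i = 0; i < perWriter; ++i) cache.add(i % 8, w + i * 0.5);
        });
    }
    for (auto& t : pool) t.join();

    done.store(true);
    flusher.join();
    return flushed + countValues(cache.drain()); // pick up any leftovers
}
```

Calling runDemo(4, 1000) should account for all 4000 values no matter how the threads interleave. Without the lock in add, the same interleaving is what would let an entry be removed while another thread is still using it.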
