Was it just a “software glitch”?
   
Yet, amid all the exuberance of its successful automated landing at Mars’ Ares Vallis on July 4, 1997, the engineers came across a disturbing problem. Pathfinder was repeatedly undergoing system resets, each of which wiped out the data it had collected before that data could be transmitted back to Earth.
Interestingly, the cause boils down to a simple concept one studies in an undergraduate Computer Science course. Before we dive into it, and into how engineers fixed it on a lonely machine out on another planet, let’s get to know a bit more about Pathfinder’s software.
Pathfinder’s software
   
Being small, embedded operating systems like the one on Pathfinder have to make the most of the limited resources at hand to run multiple tasks, each of which is a separate thread in a real-time OS. Pathfinder’s software also included an “information bus”: a shared memory area used for communication between different parts of the spacecraft.
Now that we know there is just one information bus but multiple tasks to execute, how do we decide which task runs first? The decision is based on relative urgency: every task is assigned a “priority”, and a more urgent task can take over from a less urgent one.
So “switching” between tasks is possible. The next question that pops up is: how do we ensure that data traveling through the information bus is accessed by only one task at a time? If we cannot, critical data may be corrupted by unintentional overwriting.
Hence, we need something analogous to a lock on the information bus, so that if two tasks share the same resource, only one of them can use it at a time. Such a lock is called a mutex (short for mutual exclusion).
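To make the idea of a lock concrete, here is a minimal sketch in Python using `threading.Lock`. The `bus` dictionary and the `publish` function are hypothetical stand-ins for the information bus and for a task writing to it; this only illustrates the concept and is not Pathfinder's actual code.

```python
import threading

bus = {"writes": 0}           # hypothetical stand-in for the shared information bus
bus_lock = threading.Lock()   # the "lock" guarding the bus


def publish(count):
    """Simulate a task that writes to the bus `count` times."""
    for _ in range(count):
        with bus_lock:        # only one task at a time gets past this line
            bus["writes"] += 1


# Four tasks share the same bus and write to it concurrently.
threads = [threading.Thread(target=publish, args=(10_000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(bus["writes"])  # 40000
```

Because every update happens while the lock is held, no write is lost, whichever way the threads interleave.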
   
So what was the Problem?
Let us consider two tasks sharing the same resource (R), here the information bus: a high priority task (H) and a low priority task (L).
Everything works fine until a medium priority task (M), and a long-running one at that (here, the interrupt caused by the communications task), enters the scene. Let us assume an instance where L is holding R, H is waiting for L to release it, and M becomes ready to run.
   
   
   
In general terms, a medium priority task preempts (pauses) a low priority task and thereby indirectly keeps a high priority task from running, because the high priority task is waiting for the shared resource still held by the low priority task.
Sounds familiar? Yes, this is the priority inversion problem: as the name suggests, the priorities are effectively inverted, with a medium priority task running ahead of a high priority one.
Priority inversion violates the priority model that high priority tasks can only be prevented from running by higher priority tasks and briefly by low priority tasks, which will quickly complete their use of a resource shared by the high and low priority tasks.
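The scenario can be replayed with a toy scheduler. The sketch below is a simplified simulation in Python, not the VxWorks scheduler: time advances in ticks, the highest priority runnable task runs on each tick, and a task that uses R is assumed to hold the lock for all of its work. The task timings are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Task:
    name: str
    priority: int   # larger number = more urgent
    release: int    # tick at which the task becomes ready
    work: int       # ticks of CPU time the task needs
    uses_r: bool    # True if the task holds R for all of its work
    done: int = 0   # ticks of work completed so far


def simulate(tasks, ticks):
    """Return one character per tick naming the task that ran ('-' if idle)."""
    holder = None   # task currently holding the lock on R
    timeline = []
    for tick in range(ticks):
        ready = [t for t in tasks if t.release <= tick and t.done < t.work]
        # A task that needs R cannot run while another task holds R.
        runnable = [t for t in ready
                    if not (t.uses_r and holder is not None and holder is not t)]
        if not runnable:
            timeline.append("-")
            continue
        current = max(runnable, key=lambda t: t.priority)
        if current.uses_r:
            holder = current              # take (or keep) the lock
        current.done += 1
        if current.done == current.work and holder is current:
            holder = None                 # release the lock when finished
        timeline.append(current.name)
    return "".join(timeline)


tasks = [
    Task("L", priority=1, release=0, work=3, uses_r=True),
    Task("H", priority=3, release=1, work=2, uses_r=True),
    Task("M", priority=2, release=2, work=4, uses_r=False),
]
timeline = simulate(tasks, ticks=9)
print(timeline)  # LLMMMMLHH
```

H becomes ready at tick 1 but does not get to run until tick 7: M preempts L at tick 2, L cannot finish and release R, and so H ends up waiting for all of M's work.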
   
How did the engineers find out the problem?
We saw the problem theoretically above, but the JPL engineers had to dig through tons of data to find the underlying cause of the system resets.
What helped them here was the software’s trace logging mode.
This mode was not switched on in the production software aboard the spacecraft (the logs would have been a lot of data to send back to Earth), so they ran the software on a replica of the spacecraft in the lab, with trace logging on, under the conditions they believed had led to the failure.
After about 18 hours of running the software, they successfully reproduced a system reset on the replica. Analysis of the trace logs pointed to priority inversion as the cause.
How was it fixed?
The priority inversion problem can be solved by using a priority inheritance protocol.
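Let us again assume the 3 tasks, a low priority task (L), a medium priority task (M), and a high priority task (H), where L and H need to use the same shared resource. With priority inheritance, while L holds the resource that H is waiting for, L temporarily runs at H’s priority, so M cannot preempt it. As soon as L releases the resource, it drops back to its own priority and H gets to run.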
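The effect of priority inheritance on these three tasks can be sketched with a toy scheduler in Python. This is a simplified simulation under made-up timings, not VxWorks code: the `inherit` flag is a hypothetical parameter, and a task that uses the shared resource R is assumed to hold its lock for all of its work. When `inherit` is on, the task holding R runs at the priority of the most urgent task it is blocking.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Task:
    name: str
    priority: int   # larger number = more urgent
    release: int    # tick at which the task becomes ready
    work: int       # ticks of CPU time the task needs
    uses_r: bool    # True if the task holds R for all of its work
    done: int = 0   # ticks of work completed so far


def make_tasks():
    return [
        Task("L", priority=1, release=0, work=3, uses_r=True),
        Task("H", priority=3, release=1, work=2, uses_r=True),
        Task("M", priority=2, release=2, work=4, uses_r=False),
    ]


def run(tasks, ticks, inherit):
    """Return one character per tick naming the task that ran ('-' if idle)."""
    holder = None   # task currently holding the lock on R
    timeline = []
    for tick in range(ticks):
        ready = [t for t in tasks if t.release <= tick and t.done < t.work]
        blocked = [t for t in ready
                   if t.uses_r and holder is not None and holder is not t]
        runnable = [t for t in ready if t not in blocked]

        def effective(t):
            # With inheritance, the holder of R borrows the priority of the
            # most urgent task waiting for R.
            if inherit and t is holder and blocked:
                return max(t.priority, max(b.priority for b in blocked))
            return t.priority

        if not runnable:
            timeline.append("-")
            continue
        current = max(runnable, key=effective)
        if current.uses_r:
            holder = current              # take (or keep) the lock
        current.done += 1
        if current.done == current.work and holder is current:
            holder = None                 # release the lock when finished
        timeline.append(current.name)
    return "".join(timeline)


without_inheritance = run(make_tasks(), ticks=9, inherit=False)
with_inheritance = run(make_tasks(), ticks=9, inherit=True)
print(without_inheritance)  # LLMMMMLHH
print(with_inheritance)     # LLLHHMMMM
```

Without inheritance, H has to wait for all of M's work. With inheritance, L finishes with the resource at tick 2, H runs right after it, and M only runs once H is done.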
In this particular case, VxWorks already had a boolean parameter to select whether the mutex should perform priority inheritance. This was switched off by default, which led to the priority inversion issue.
Hence, the fix was to switch ON the boolean parameter. But how do we do it on a spacecraft, millions of kilometers away from Earth?
Fortunately, VxWorks had a feature that came to the rescue, one that had been enabled on the spacecraft before launch: the C language interpreter. This allows the execution of C expressions and functions for on-the-fly system debugging.
By coding convention, the initialization parameters for the mutex in question (and for two others that could have caused the same problem) were stored in global variables, whose addresses were in symbol tables also included in the launch software and hence available to the C interpreter. A short C program was uploaded to the spacecraft which, when interpreted, changed these variables’ values from FALSE to TRUE.
And that’s it! No more system resets occurred.
Some Lessons?!
  