Was it just a “software glitch”?

In the winter of 1996, NASA’s Mars Pathfinder was hailed as a one-of-its-kind spacecraft. It aimed to land the first-ever robotic rover on the Red Planet’s surface, and the whole mission was designed to demonstrate the role technology can play in the “faster, better and cheaper” development and execution of space missions.

Yet, amidst all the exuberance of its successful automated landing at Mars’ Ares Vallis on July 4, 1997, the engineers came across a disturbing problem. Pathfinder was repeatedly undergoing system resets, and each reset wiped out data that had not yet been transmitted back to Earth.

Interestingly, it all boils down to a simple concept one studies in a Computer Science undergraduate course. Before we dive into this and the methods used by engineers to fix it on a lonely machine out on another planet in space, let’s get to know a bit more about Pathfinder’s software.

Pathfinder’s software

Pathfinder had a single-CPU computer system (meaning only one task could run at any instant), with VxWorks as its operating system. VxWorks is a real-time operating system developed by Wind River Systems.

Real-time operating systems are small, and they have to make the best use of the limited resources at hand to run multiple tasks, each on a separate thread. Pathfinder’s software was also built around an “information bus”: a shared memory area used for communication between different parts of the spacecraft.

Now that we know there was just one information bus but multiple tasks to execute, how do we decide which task runs first? Every task is assigned a “priority” according to its relative urgency, and the scheduler runs the highest-priority task that is ready, pausing (preempting) a lower-priority one if necessary.

So “switching” between tasks is possible. The next question is: how do we ensure that data on the information bus is accessed by only one task at a time? If we cannot, critical data may be corrupted by unintentional overwriting.

Hence, we need something analogous to a lock on the information bus, so that when two tasks share the same resource, one of them has to wait until the other has released it.

In operating systems, this lock is called a **mutex**. It ensures **mutual exclusion** of a resource: only one task can access the resource at a time.
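To make the idea concrete, here is a minimal sketch of mutual exclusion. It uses Python's `threading.Lock` as a stand-in for a VxWorks mutex; the `bus` dictionary and the `publish` function are hypothetical names invented for this illustration and have nothing to do with the flight software.

```python
import threading

bus = {"writes": 0}            # hypothetical stand-in for the shared information bus
bus_mutex = threading.Lock()   # the "lock on the information bus"

def publish(n_messages):
    """Write n_messages to the bus, holding the mutex for each write."""
    for _ in range(n_messages):
        with bus_mutex:                   # only one thread at a time gets past this line
            current = bus["writes"]       # read the shared value ...
            bus["writes"] = current + 1   # ... and write it back; unsafe without the lock

# Four tasks all publishing to the same bus.
tasks = [threading.Thread(target=publish, args=(10_000,)) for _ in range(4)]
for t in tasks:
    t.start()
for t in tasks:
    t.join()

print(bus["writes"])  # 40000
```

Because the read and the write happen while the lock is held, no update is lost, whichever order the threads run in.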

So what was the Problem?

Consider two tasks that share the same resource (R), here the information bus: a low priority task (L) and the high priority bus management task (H).

Everything works fine until a medium priority, long-running task (M), here the communications task, enters the scene. Consider the following sequence of events:

1. The low priority task (L) acquires the mutex on the shared resource R and starts using it.

2. The bus management task (H) becomes ready to run, but it has to wait for L to finish quickly and release the mutex on R.

3. The communications task (M), which does not need the shared resource R, becomes ready to run.

4. Since M has a higher priority than L, M pauses L and starts its execution.

5. L still holds the mutex on R because it was not allowed to finish. Hence H can only start once M has finished and L has then finished and released the mutex. In effect, **H is blocked by M.**

In general terms, a medium priority task preempted (paused) a low priority task and thereby indirectly kept a high priority task from running, because the high priority task was waiting for a shared resource still held by the low priority task.
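The sequence can be replayed with a toy scheduler. This is only a sketch under simplifying assumptions: time advances in whole ticks, the highest-priority runnable task always gets the CPU, and a task that uses the bus holds the mutex for its entire run. The `simulate` function, the task fields, and all tick counts are invented for illustration.

```python
def simulate(tasks):
    """Run a tick-based, priority-preemptive schedule with one shared mutex.

    Each task is a dict with: name, prio (bigger = more urgent), arrive (tick),
    work (ticks of CPU needed), uses_bus (needs the mutex while running).
    Returns a dict mapping task name -> tick at which it finished.
    """
    remaining = {t["name"]: t["work"] for t in tasks}
    finished = {}
    owner = None  # name of the task currently holding the bus mutex
    tick = 0
    while len(finished) < len(tasks):
        # A task is runnable if it has arrived, is not done, and is not
        # waiting for a mutex held by some other task.
        runnable = [
            t for t in tasks
            if t["arrive"] <= tick
            and t["name"] not in finished
            and not (t["uses_bus"] and owner not in (None, t["name"]))
        ]
        if runnable:
            task = max(runnable, key=lambda t: t["prio"])
            if task["uses_bus"]:
                owner = task["name"]          # take (or keep) the mutex
            remaining[task["name"]] -= 1      # run for one tick
            if remaining[task["name"]] == 0:
                finished[task["name"]] = tick + 1
                if owner == task["name"]:
                    owner = None              # release the mutex on completion
        tick += 1
    return finished

tasks = [
    {"name": "L", "prio": 1, "arrive": 0, "work": 3, "uses_bus": True},
    {"name": "H", "prio": 3, "arrive": 1, "work": 2, "uses_bus": True},
    {"name": "M", "prio": 2, "arrive": 2, "work": 5, "uses_bus": False},
]
done = simulate(tasks)
print(done)  # {'M': 7, 'L': 8, 'H': 10}
```

H is the most urgent task and arrives before M, yet it finishes last: it cannot start until M has run to completion and L has then released the mutex.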

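Sounds familiar? Yes, this is the priority inversion problem, which, as the name suggests, is an inversion of the intended priorities: a lower priority task effectively gets to run ahead of a higher priority one.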

Priority inversion violates the priority model, under which a high priority task can be kept from running only by a higher priority task or, briefly, by a low priority task that will quickly finish using a resource the two of them share.

Okay, so there was a problem. But why were there total system resets? That’s because the spacecraft had a **watchdog timer**. Whenever it noticed that a high priority task had been waiting for too long, it knew something was wrong but could not tell what, so it simply forced a system reset. This way, multiple system resets happened on the spacecraft (at least 4), which baffled the engineers because the resets started getting in the way of their experiments.
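A watchdog of this kind can be sketched in a few lines. The `Watchdog` class, its timeout, and the tick numbers below are assumptions made up for this illustration; they do not describe Pathfinder's actual implementation.

```python
class Watchdog:
    """Forces a reset when the monitored task has not reported in for too long."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.last_kick = 0

    def kick(self, now):
        """Called whenever the monitored high priority task completes a cycle."""
        self.last_kick = now

    def check(self, now):
        """Return True (a reset is forced) if the task has been silent too long."""
        if now - self.last_kick > self.timeout_ticks:
            self.last_kick = now   # the system restarts, so the countdown starts over
            return True
        return False

# The high priority task completes at ticks 2, 4 and 6, then gets stuck
# (for example, because it is waiting on a mutex that is never released).
completions = {2, 4, 6}
watchdog = Watchdog(timeout_ticks=5)
reset_ticks = []
for now in range(1, 21):
    if now in completions:
        watchdog.kick(now)
    if watchdog.check(now):
        reset_ticks.append(now)

print(reset_ticks)  # [12, 18]
```

The watchdog only sees the symptom (silence from the high priority task), not the cause, which is why a blunt reset is all it can do.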

How did the engineers find out the problem?

We saw the problem in theory above, but the JPL engineers had to dig through tons of data to figure out the underlying cause of the system resets.

What helped them here was trace logging: a mode in which the software records system events such as context switches, uses of synchronization objects, and interrupts.

Tracing was not switched on in the production software aboard the spacecraft (the logs would have been a lot of data to send back to Earth). Instead, the engineers ran the software on a replica of the spacecraft in the lab, under conditions they believed had led to the failure, with trace logging turned on.

After about 18 hours of running the software, they successfully reproduced a system reset on the replica. Analysis of the trace logs pointed to priority inversion as the cause.

How was it fixed?

The priority inversion problem is solved by using priority inheritance protocols.

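Let us again take the three tasks: the low priority task (L), the medium priority task (M), and the high priority task (H), where L and H need to use the same shared resource.

With priority inheritance, while H is waiting for the mutex that L holds, L temporarily runs at H’s priority. M can therefore no longer preempt L. L quickly finishes its work on the resource, releases the mutex, and drops back to its original priority, and H runs next.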
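Here is a toy simulation of priority inheritance, comparing a run without it to a run with it. The assumptions are deliberately simple: time moves in whole ticks, the most urgent runnable task gets the CPU, and a task that uses the bus holds the mutex for its whole run. The `simulate` function, its `inherit` flag, and the tick counts are hypothetical; the flag only mimics what a real mutex option would do.

```python
def simulate(tasks, inherit):
    """Tick-based priority scheduler with one mutex and optional priority inheritance.

    Each task is a dict with: name, prio (bigger = more urgent), arrive, work, uses_bus.
    Returns a dict mapping task name -> tick at which it finished.
    """
    remaining = {t["name"]: t["work"] for t in tasks}
    finished = {}
    owner = None  # name of the task currently holding the mutex
    tick = 0
    while len(finished) < len(tasks):
        active = [t for t in tasks
                  if t["arrive"] <= tick and t["name"] not in finished]
        # Tasks waiting for a mutex that another task holds cannot run.
        blocked = [t for t in active
                   if t["uses_bus"] and owner not in (None, t["name"])]
        runnable = [t for t in active if t not in blocked]

        def effective_prio(t):
            # With inheritance, the mutex owner is boosted to the priority
            # of the most urgent task waiting on that mutex.
            if inherit and t["name"] == owner and blocked:
                return max(t["prio"], max(b["prio"] for b in blocked))
            return t["prio"]

        if runnable:
            task = max(runnable, key=effective_prio)
            if task["uses_bus"]:
                owner = task["name"]
            remaining[task["name"]] -= 1
            if remaining[task["name"]] == 0:
                finished[task["name"]] = tick + 1
                if owner == task["name"]:
                    owner = None  # releasing the mutex also ends the boost
        tick += 1
    return finished

tasks = [
    {"name": "L", "prio": 1, "arrive": 0, "work": 3, "uses_bus": True},
    {"name": "H", "prio": 3, "arrive": 1, "work": 2, "uses_bus": True},
    {"name": "M", "prio": 2, "arrive": 2, "work": 5, "uses_bus": False},
]
without_inheritance = simulate(tasks, inherit=False)
with_inheritance = simulate(tasks, inherit=True)
print(without_inheritance)  # {'M': 7, 'L': 8, 'H': 10}
print(with_inheritance)     # {'L': 3, 'H': 5, 'M': 10}
```

With inheritance, H finishes at tick 5 instead of tick 10, and M is the one that waits, which is what the priorities asked for in the first place.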

In this particular case, VxWorks already had a boolean parameter to select whether a mutex should perform priority inheritance. It was switched off by default, which is what allowed the priority inversion to happen.

Hence, the fix was to switch ON the boolean parameter. But how do we do it on a spacecraft, millions of kilometers away from Earth?

Fortunately, VxWorks had a feature that came to the rescue, enabled in the spacecraft before launch: the C language interpreter. This allows the execution of C expressions and functions for on-the-fly system debugging.

By coding convention, the initialization parameter for the mutex in question (and those for two others that could have caused the same problem) was stored in global variables, whose addresses were in symbol tables also included in the launch software — hence available to the C interpreter. A short C program was uploaded to the spacecraft, which changed these variables’ values from FALSE to TRUE when interpreted.
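The mechanism can be mimicked in a few lines. This is an analogy only: the real patch was a short C program run by the VxWorks interpreter, whereas the sketch below uses a Python dictionary as a pretend symbol table. Every name in it (`symbol_table`, the three flag names, `create_mutex`, `patch`) is hypothetical.

```python
# Pretend symbol table: maps the names of global variables to their values.
# Each variable holds the "use priority inheritance?" parameter for one mutex.
symbol_table = {
    "bus_mutex_inherit": False,
    "second_mutex_inherit": False,
    "third_mutex_inherit": False,
}

def create_mutex(flag_name):
    """Initialization code reads the flag by name when it creates the mutex."""
    return {"flag": flag_name, "priority_inheritance": symbol_table[flag_name]}

def patch(names):
    """The 'uploaded program': look each variable up by name and set it to True."""
    for name in names:
        symbol_table[name] = True

before = create_mutex("bus_mutex_inherit")
patch(["bus_mutex_inherit", "second_mutex_inherit", "third_mutex_inherit"])
after = create_mutex("bus_mutex_inherit")

print(before["priority_inheritance"], after["priority_inheritance"])  # False True
```

The point of the design is that the patch never touches the mutex code itself. It only changes data that the existing code already reads, which is what made the fix small enough to upload.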

And that’s it! No more system resets occurred.

Some Lessons?!

You can read about it in more detail here:

The case of Mysterious System Resets on Mars Pathfinder