O’Reilly has a great interview up with NASA’s Peter Gluck, project software engineer for the Mars Phoenix Lander. I always find the design and implementation of mission-critical systems interesting. In short, they’re running a radiation-hardened system (the RAD 6000 board) with a 33MHz CPU, 128 megabytes of RAM, and a PCI peripheral interface… pretty advanced stuff for space. This usually surprises people when they first hear about these systems, but the circumstances require proven technology that is hardened against the perils of outer space (for example, the Hubble Space Telescope was recently upgraded to an Intel 486 processor… the Space Shuttle still runs on hardened PDP-11s).
The software is written in C and running on the VxWorks real-time OS… Lockheed Martin (who wrote the control systems) switched from ADA to C a few years back. There are plenty more interesting details in the article. Here are a few teasers:
The RAD 6000 has built in error detection and corrections. So the hardware does RAM scrubbing. There is a RAM scrubbing that occurs on a continuous basis. And beyond that, we have internal fault protection that monitors the health and safety of the software. And if a software task, for example, fails to respond to a ping, we have pings in the system, then the fault protection task will declare that a fault has occurred and will safe the spacecraft. And what that means, by “safeing”, we mean that the spacecraft will enter into a power and communications safe mode where it will just sit and wait for the ground to respond. It’ll basically phone home and say, I’ve got a problem; somebody tell me what to do.
So if it were to completely lock-up, the hardware has to be stroked every 64 seconds. There’s a watch-stop timer. And so if that 64 second period expires, then the hardware resets and the software is rebooted, and hopefully that clears whatever error occurred. Now in the event that that doesn’t work, we have a whole second set of avionics onboard. So the hardware will try to boot to the same side, and if the same side doesn’t come up and start stroking the watch-stop timer, then it will swap to the other side and boot the first side.
Interviewer: Am I right in assuming that there’s very little process separation in the older RAD 6000 boards?
Peter: Exactly… We have strict coding guidelines that we use. We don’t allow dynamic memory allocation, for example.
These are true fail-safe systems… not the stuff we mortal engineers play with. Click HERE to read the rest of the interview.