[meteorite-list] More Curiosity Computer Troubleshooting on Tap

From: Ron Baalke <baalke_at_meteoritecentral.com>
Date: Tue, 5 Mar 2013 08:29:45 -0800 (PST)
Message-ID: <201303051629.r25GTj2s017192_at_zagami.jpl.nasa.gov>

http://www.spaceflightnow.com/mars/msl/130304computer/

More Curiosity computer troubleshooting on tap
BY WILLIAM HARWOOD
STORY WRITTEN FOR CBS NEWS "SPACE PLACE" & USED WITH PERMISSION
March 4, 2013

Work to carry out what amounts to an electronic brain transplant aboard
the Curiosity Mars rover -- a complex sequence of steps to switch
operations to a backup flight computer -- is continuing this week amid
ongoing analysis to figure out how to resolve memory corruption
discovered last week in the rover's active computer.

The memory glitch interrupted science operations, forcing flight
controllers to put the craft in a low-activity "safe mode" while the
computer switch was implemented.

Richard Cook, the Mars Science Laboratory project manager at the Jet
Propulsion Laboratory in Pasadena, Calif., told CBS News Monday the
computer swap was going well and that limited science operations should
resume shortly.

"We spent the weekend kind of getting back, not totally to regular
operations, but at least out of the immediate safe mode kind of a
thing," he said. "We got it out of safe mode, got back to using the
high-gain antenna, so we're well along the way to restoring things."

The problem cropped up last Wednesday when Curiosity failed to send back
science data as expected and then failed to put itself to sleep during
scheduled downtime. Reviewing telemetry, engineers discovered data
corruption in the solid-state memory used by the rover's active flight
computer.

Curiosity is equipped with two redundant computer systems, known as
"side A" and "side B." Either one is capable of carrying out the rover's
mission and only one operates at a time with the other on standby as a
backup. The B-side computer was checked out during the cruise from Earth
to Mars while the A-side computer has been running operations since
before landing last August.

Cook said the switchover to side B is a complex procedure and that
engineers are taking their time to make absolutely sure the process is
carried out correctly.

"We have some more work to do to upload configuration files and
parameters, things like that, so it's going to be another few days or so
to kind of get things totally recovered," he said. "But basically, it's
going well."

Once the B-side computer is fully up and running, limited science
operations should resume. But Cook said the engineering team wants to
have a better idea of what went wrong with the A-side memory before
going "full throttle" on the B-side computer.

Engineers suspect the memory glitch might have been caused by space
radiation, a "single-event upset" in which an energetic particle made it
through radiation-hardened components and changed the state of one or
more memory cells. As luck would have it, the corruption was found
in the memory's directory, which tracks where data is stored.
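
To illustrate what a single-event upset looks like at the bit level, here
is a minimal Python sketch. It is not how Curiosity's flight computer
detects errors (the article does not say); it simply assumes a
hypothetical 8-bit word protected by one even-parity bit, and shows that
flipping any single bit makes the stored parity disagree with the data.

```python
def parity(word: int) -> int:
    """Return the even-parity bit (0 or 1) for a non-negative integer."""
    return bin(word).count("1") % 2


def flip_bit(word: int, position: int) -> int:
    """Simulate a single-event upset by inverting one bit of the word."""
    return word ^ (1 << position)


# Hypothetical stored word and the parity recorded when it was written.
stored = 0b10110010
stored_parity = parity(stored)

# An energetic particle flips bit 5 (an arbitrary choice for illustration).
upset = flip_bit(stored, 5)

# On readback, the recomputed parity no longer matches the stored parity.
error_detected = parity(upset) != stored_parity
print(f"before={stored:08b} after={upset:08b} error_detected={error_detected}")
```

A single parity bit only detects an odd number of flipped bits and cannot
say which bit changed; real systems typically use stronger error-correcting
codes.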

If that theory is correct, booting the A-side computer and its software
would be expected to re-write the memory blocks, presumably flushing the
corrupted data. In that case, assuming no other problems, the A-side
computer would be deemed healthy and cleared to serve as backup to the
B-side computer.

But before attempting a full re-boot, Cook said, engineers plan to
power up the A-side machine Wednesday, without loading software, to
check the status of the non-volatile memory.

"The first thing you can do is just turn it on without software running
and just treat it like it's an extended memory bank," he said. "That's
actually what we're going to do first, we're just going to read the
memory. If it comes back saying it's got a bit error, then that means
it's still corrupted."

Because the memory retains data when it is powered down, engineers
expect the corruption will still be present when they power the system
back up. The real question is whether data can be successfully stored in
the affected locations.

"If you then turned around and wrote to it, and it said, hey, I still
can't write to this memory cell without getting an error, then it would
tell you there's something more systemic going on, or more permanent,"
Cook said.
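
The read-first, write-later logic Cook describes can be sketched in
Python. Everything here is an assumption made for illustration: the
SimulatedMemory class, the stuck-bit model, and the test patterns are
hypothetical and are not taken from the rover's software. The sketch
dumps a cell's contents before testing it, then checks whether the cell
accepts new values, which separates a one-time corruption from a
permanent fault.

```python
class SimulatedMemory:
    """Toy byte-wide memory; some addresses may have bits stuck at a value."""

    def __init__(self, size, stuck_bits=None):
        self.cells = [0] * size
        # Hypothetical fault model: address -> (mask, forced_value).
        self.stuck_bits = stuck_bits or {}

    def write(self, address, value):
        mask, forced = self.stuck_bits.get(address, (0, 0))
        # Bits under the mask ignore the write and keep their forced value.
        self.cells[address] = (value & ~mask & 0xFF) | (forced & mask)

    def read(self, address):
        return self.cells[address]


def classify_cell(memory, address, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Dump the cell first, then test whether it stores new data correctly."""
    snapshot = memory.read(address)  # preserve the evidence before writing
    for pattern in patterns:
        memory.write(address, pattern)
        if memory.read(address) != pattern:
            return snapshot, "permanent"
    return snapshot, "transient"


# Address 1 holds corrupted data but is healthy; address 2 has bit 0 stuck at 1.
memory = SimulatedMemory(4, stuck_bits={2: (0x01, 0x01)})
memory.cells[1] = 0x7F

result_transient = classify_cell(memory, 1)
result_permanent = classify_cell(memory, 2)
print(result_transient, result_permanent)
```

Note that the write test overwrites the cell, so the snapshot taken
beforehand is the only surviving copy of the corrupted value.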

It's a bit of a "catch-22" for the computer experts at JPL, he added.
Letting the computer's software boot up and write data to the suspect
memory locations would destroy evidence that might help pin down what
went wrong in the first place.

"So the first thing we're going to do is just bring it up, read the
memory, dump memory from the areas where we think we had a problem and
take a look at that and then decide what to do next, whether or not to
write it," Cook said. "If it looks like it's all better, we may just
bring software up and then software will essentially do the same thing,
but for all the memory at once."

If the memory problem cannot be corrected, programmers could attempt to
bypass the corrupted locations with a software patch.

"There are multiple banks of memory, it's not a single monolithic
thing," Cook said. "So if you had an uncorrectable error in one place,
then you could effectively map it out, you would tell software when it's
booting up don't try to use this area of memory. That's an example of
something you could do."
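
As a rough illustration of "mapping out" a bad area, the Python sketch
below builds a memory map that skips any bank flagged as unusable. The
bank names, sizes, and the build_memory_map function are hypothetical;
the article does not describe how the flight software actually lays out
its memory.

```python
from dataclasses import dataclass


@dataclass
class Bank:
    """One independent bank of memory (name and size are illustrative)."""
    name: str
    size: int


def build_memory_map(banks, mapped_out):
    """Return (name, start_offset, size) for each usable bank.

    Banks listed in mapped_out are skipped, and the remaining banks are
    packed into one contiguous logical address range.
    """
    memory_map = []
    offset = 0
    for bank in banks:
        if bank.name in mapped_out:
            continue  # boot-time instruction: do not use this area
        memory_map.append((bank.name, offset, bank.size))
        offset += bank.size
    return memory_map


banks = [Bank("bank0", 1024), Bank("bank1", 1024), Bank("bank2", 1024)]

# Suppose bank1 showed an uncorrectable error and is excluded at boot.
memory_map = build_memory_map(banks, mapped_out={"bank1"})
usable_bytes = sum(size for _, _, size in memory_map)
print(memory_map, usable_bytes)
```

The cost of this approach is reduced capacity: the software keeps
running, but with less storage than before.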

Curiosity landed in Gale Crater on Aug. 6. The $2.5 billion mission is
devoted to searching for signs of past or present habitability and for
evidence of organic compounds like those necessary for life as it is
known on Earth.

Received on Tue 05 Mar 2013 11:29:45 AM PST

