



Hardware Failure Under Linux

or, The Curse of the Netherwog!!!

now with happy ending!

One day I was talking to my cousin Brian, and he suddenly offered me a free motherboard and processor. If I had known the chain of events it would trigger, I might have said no, but instead I said "sure!"

It seems that Brian, who works for Intel, had been given the motherboard combo for free at a corporate meeting/pep-rally as a reward for answering a question intelligently. (The question was "Why does Intel write Drivers?". Two people got motherboards for answering. I forget what Brian's answer was, but the other guy's answer was "Because we are better at it than Microsoft!")

Anyway. He didn't need the motherboard, and he knew that my old Windows box was aging, so he gave it to me.

I had been wanting to build a Linux box, and this seemed like a perfect base to build it from. It was an i810 motherboard with integrated everything, and an 800 MHz Pentium III processor (which at the time was almost top-of-the-line). I went shopping, and picked up a case, 128 MB of memory, a 10 GB Western Digital hard drive, an Intel EtherExpress Pro 10/100 PCI and a Visiontek NVidia Geforce2 AGP card (I don't hold with integrated stuff!)

It all worked, and soon I had my Linux Box, which I named netherwog after a goofy-looking yellow monster I drew for Wandering Hamster. But all was not well. After a while I started experiencing weird random errors, which are logged here. I replaced component after component, first thinking the memory was responsible, then the motherboard, then the video card, then the hard drive. Replacing these components sometimes temporarily solved the problem, but then weird random errors would pop up again. I have never been so frustrated with a Linux computer (but I have dozens of other similar computers both at home and at work that run Linux rock-solid!) So anyway, the current score is that I have NO clue what is wrong with this box.



What follows is a log of various problems I have had, and solutions I have attempted. I write these as they happen, so some of them express the opinion that the problem is solved, but that has not been the case... or, obviously, I am dealing with more than one problem here :)


Jan 2002

The Scary Meltdown

I was playing with my Linux box, trying to figure out how to permit SDL games to change the screen-resolution when in full-screen mode, and as I went to recompile something, make suddenly failed and filled my xterm with garbage characters. I tried to type clear but I just got more garbage. That seemed very unlinuxlike. I closed the xterm and tried to re-open it, and my window manager hung. That also seemed very unlinuxlike. So then I tried to kill my X-server with CTRL+ALT+BACKSPACE... no response. CTRL+ALT+DEL... no response. Reset and power buttons were dead too. I held down the power button for 10 seconds, BRS-ATX-style, and then powered on again. LILO booted, Linux booted, and I went into fsck because my partitions had not been cleanly unmounted. I went to amuse myself elsewhere while I waited, and when I came back I found that fsck had been interrupted by a scary error ending with a stack dump and a kernel panic. I rebooted again, and this time I was able to get all the way through fsck-ing, and into X. I wrote off the crash as cosmic rays or sunspots, and resumed playing. I went to erase a source-tree I didn't need anymore, and accidentally tried to remove it as a user with the wrong privileges to delete the files. At the end of the string of access denied messages I got a little more garbage, and another kernel panic. X locked hard. This was getting about as unlinuxlike as I could stand. If this had been happening to my Windows box, I would have shrugged it off, and considered it a bad day, nothing more. This being my Linux box, I was shaken to the very roots of my faith in Linux, my faith in the ability of human beings to make a computer system work, my faith in the ability of human beings to make ANY system work, be it computer, political, religious, or anything, and I began to fantasize about quitting my job, burning down my house, fleeing to the mountains, and living in a cave as a hermit who only ventures forth to steal bread and fling rocks at passers-by. I rebooted again.
Blank screen. No boot. Just the humming of the cooling-fans. I powered off. Waited for a moment, kneeling on the carpet in front of the box, gazing distractedly to one side, counting the seconds and slipping deeper and deeper into a grim depression. After I could wait no longer, I powered on again, an action without hope. BEEEEEEEEEEEEEEEEEEEEEEEEEEEEP! said my box. It was one long solid high pitched beep from the PC speaker that did not stop until I had collected my wits enough to hold down the ATX power button for another 10 seconds. When the painful noise was gone, I was happy again; its high-pitched shriek had lifted my spirits from the abyss, for now I knew without a doubt that this was no dreaded software meltdown, it was a wonderful catastrophic hardware failure! I hate system crashes, but when I am forced to endure one, I love to discover that it is the result of a hardware failure. Hardware is from the realm of the material world, and it can fail. It is in the nature of matter to break down, so a random meaningless failure ceases to be a thing of chaos, and takes on a sort of order that comes from the knowledge that all things die, even silicon chips. Software failures, on the other hand, are from the realm of logic. They are an abstract thing that intrudes into the physical realm, so a gruesome failure of a software system, most especially an inconsistent and unpredictable failure, is made all the more painful, bringing to mind the fear that logic itself is flawed, that chaos is the only rule, and that all man's works are doomed to greater and greater failure as they grow more and more complex. I was terribly relieved to know that this was a mere hardware failure. My faith in Linux, while not fully restored, was alive again.
I would be able to identify the faulty component (motherboard, CPU and memory were my first suspects), and then I would be able to indignantly call the appropriate tech-support department and demand vengeance, satisfaction, and an RMA number. I dug out my motherboard manual, and looked up the beep codes to see what a single long endless beep meant. To my delight, there was no such beep code. I turned on the power again, to listen to the sweet siren and glory in the fact that my motherboard was truly fried, but I only got silence. My monitor was dark, but the LED went green-- amber-- green-- amber-- and finally stayed green. A simple error message about a CMOS checksum error appeared on the screen, followed by a horrible thrashing from the floppy-drive, and a series of short cryptic messages that made me think of an execution pointer leaping about randomly in a string space like a cat lost in a paper bag. The PC speaker let out a low groan that would have taken a fair bit of programming to do on purpose. I didn't need the manual to tell me that one wasn't a valid beep-code. I powered off before it could get any further, for fear of what might come next. Reveling in the joy of wondering how recently I had backed up my critical files, and speculating how much of my /home partition might have survived, I unplugged the box, opened the case, and, inspired by the CMOS checksum error I had seen, I located and used the CMOS-clear jumper. After that I got a clean POST screen, and was able to easily fix my BIOS settings. LILO booted, and Linux booted, and fsck finished without crashing. It then prompted me to give the root password and manually fsck my /usr partition again. I did so, and carefully noted the path and filename of each file that it told me was corrupted. By remarkable luck, all of the afflicted files were located in two old source trees (including the one I had tried to remove as a wrongly-privileged user).
I was able to log in, and I took a moment to browse my /usr/lost+found before attempting to startx. It came up perfectly, and I tried each and every major utility, application, and game on my system and found them all to be working satisfactorily. I removed the two fried source-trees, and re-untarred the one I cared about, and then spent the rest of the evening compiling and playing games, just to make sure I wasn't going to lock or crash again. So I am secure in the knowledge that my crashes and lockups had been caused by a freak motherboard error, and I was left feeling very impressed at my box's resilience. Of course, I still don't know what might have caused that freak error, and I don't know if it will ever happen again. We shall see.

How the Hardware Failure Came Back and Bit Me In the Bum

But of course, just because the symptom wasn't happening, didn't mean the problem was gone. After a 12 hour torture-test session, my box locked again. I went a-searching and found that there was a BIOS update available for my motherboard that addressed incompatibilities with the GeForce 2 chipset. Delighted, I installed this, but it did not help. I locked up again, this time experiencing massive damage to my root partition. I was resigned to the fact that I was going to need to re-install the whole operating system, but decided I would continue testing, in the hopes I might locate the hardware problem. When one already knows that one's system is hosed, it is easier to debug aggressively, because you are not afraid to change things randomly. I was a bit surprised when my system booted just fine, and almost everything worked exactly as it should. A couple scripts and config files were missing, but I found that fsck had recovered them unharmed and attached them to /lost+found. So I moved all of my critical files such as financial info and writing-in-progress back to my (ugh) Windows box, and determined to continue to vigorously utilize the Linux box for games, web browsing, and MP3ing until either I located the problem or a lockup killed something that fsck couldn't fix.

On a long weekend drive to visit family, I thought about it logically, and remembered my debugging training. What was the last thing I had changed before the lockups started? I had added a soundcard the day before, but there was something even more recent. I had attempted to install NVidia's X driver in place of the stock nv driver that comes with X. I had compiled the driver, and I had edited my /etc/modules.conf to allow NVdriver to load, but I had not actually gotten as far as changing my XF86Config file to use the "NVidia" driver instead of the "nv" driver, so I had assumed it was not being loaded yet and could not be related to my problem... but the module was set up to be loadable. Perhaps it was getting loaded anyway, and wreaking havoc with my kernel. When I got home I erased the NVdriver module binary and remarked it out of /etc/modules.conf.
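Whether a module really is loaded can be checked instead of guessed at. Here is a minimal sketch, assuming a kernel that exposes its module list in /proc/modules (the same list that lsmod prints). The helper name module_loaded and its optional second argument are made up for illustration; NVdriver is the module name from the story above.

```shell
# Succeeds (exit status 0) if the named module appears in the module list.
# The first whitespace-separated field of each /proc/modules line is the module name.
module_loaded() {
    name=$1
    list=${2:-/proc/modules}   # hypothetical second argument: check a saved copy instead
    awk -v m="$name" '$1 == m { found = 1 } END { exit found ? 0 : 1 }' "$list"
}

if module_loaded NVdriver; then
    echo "NVdriver is loaded, even if X is still configured for nv"
else
    echo "NVdriver is not loaded"
fi
```

This only reports what is loaded at the moment it runs. It says nothing about whether the module was loaded and unloaded earlier, so it is a quick sanity check rather than proof.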

It is quite possible that that solved my problem. I have continued to use the box without any further lockups for several weeks now. Once I feel a little more ambitious/confident/masochistic, I may re-enable it to see if I can lock myself up again, thus proving the thing.

I have still not re-installed the operating system. Once I restored those few lost files, everything else has been running quite happily. I am impressed with the resilience of the Linux filesystem.


It is fair to say that something improved, because here I experienced about five months of uninterrupted stability.


July 2002

The Return of the Lockup

Shortly after I upgraded to Debian 3.0, my old lockup bug came back. The lockups are far less painful now, thanks to the wonderful wonderful ext3 filesystem, but a lockup is a lockup all the same. I immediately suspected my NVidia card, and noticed how toasty-warm it was getting. I think this may be the sort of problem that can be solved by adding cooling fans and heat sinks... plastering the puppies all over everything! :) More on this when I solve the problem.

No, I take that back. I gave up on my NVidia card for the time being. I switched to a Voodoo 5 for a few hours, but got lockups still. Then I pulled my old ATI Rage Pro All-in-Wonder card out of my Windows box and put that in there. It is PCI, and everything I had tested up to this point was AGP. It worked for several days, but I did get one lockup (might have been software. I was running some alpha stuff at the time) so I switched it back. Then I switched to a simple Diamond Speedstar A200 AGP, which I am still using today (2002-09-30) and I took special care to switch AGP to 1x mode in the BIOS, but no luck. I still had some lockups (although not terribly frequently (yeesh! That's how I talk about Windows computers! this is awful!))


August 2002

Filesystem Corruption

I had been suffering a few lockups. Not as bad as the old days of the "scary meltdown" but still bad enough to make me unhappy and occasionally resort to *gasp* <heresy>checking my e-mail from my Windows box!</heresy>

The wonderful EXT3 filesystem was making my crashes quite painless, but I discovered that even EXT3 is not bulletproof.

I noticed something was wrong when I tried to read the manpage for some command and it told me that the manpage did not exist! I knew it did, because I had read the same page before. I checked and discovered that /usr/share/man was unreadable. I don't mean the permissions were screwed up, I mean it was unreadable.

I thought this was very very odd. So I immediately backed up my most important files. I then rebooted, but it couldn't even boot anymore. Among the unreadable files on my hard drive was libext2fs.so.2.4 which the boot system needed in order to run the filesystem check.

I found a rescue disk and ran badblocks. It found no problems. Then I ran fsck and it found problems galore on my root and /usr partitions. It only found one problem on my /home partition, telling me the ext3 journal was corrupt, and that I would have to mount it as ext2.

Next I found Western Digital's DLDIAG utility, stuck it on a DOS bootdisk, and checked my drive with that. It came up clean.

I then had to reinstall Debian to restore the missing files, but my /home partition was still alive, so hey, I was happy.

Once I was sure everything was okay, and I could run all my programs, and access all my files, I went to rebuild the journal on /home. My thought process went something like this:


Me: "Okay, what command was I supposed to use to add
     an EXT3 journal to an EXT2 filesystem?"

types: man mke2fs

*glances at the manpage*

Me: "Yeah, that looks like it."

types: mke2fs -j /dev/hda7

Me: "Oops! silly me! I have to unmount that first."

types: umount /home

types: mke2fs -j /dev/hda7

*reads the messages scrolling by*

Me: "Oh, crap."

Me: "Did I just do what I think I did?"

types: mount /home

types: ls -l /home

total 0
drwx------    2 root     root        16384 Aug 27 15:47 lost+found

Me: "..."

But the moral of the story is that in spite of mysterious hardware failure, and in spite of my own tremendous stupidity, everything turned out okay, because I had backed up my most important files. ...I had about 700 megabytes of MP3s that I hadn't backed up, but hey, now I can go buy the CDs :)
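For the record, the difference between the two commands is that mke2fs -j creates a brand-new, empty filesystem with a journal, while tune2fs -j adds a journal to an existing ext2 filesystem in place. The sketch below rehearses both on a throwaway image file instead of a real partition. The function name journal_demo and the 16 MB image size are arbitrary choices for illustration, and it assumes mke2fs, tune2fs and dumpe2fs from e2fsprogs are on the PATH.

```shell
# Sketch only: rehearse on a scratch image file, never on a real partition.
journal_demo() {
    img=$(mktemp) || return 1               # stands in for a device such as /dev/hda7
    dd if=/dev/zero of="$img" bs=1024 count=16384 2>/dev/null
    mke2fs -q -F "$img" || return 1         # makes a NEW, EMPTY ext2 filesystem (wipes what was there)
    tune2fs -j "$img" >/dev/null || return 1    # adds a journal in place; existing files are kept
    # The feature list should now include has_journal.
    dumpe2fs -h "$img" 2>/dev/null | grep -i '^Filesystem features'
    rm -f "$img"
}
```

Running journal_demo prints the feature line of the scratch filesystem. On the real /home partition, the step I actually wanted was the tune2fs one alone.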


August 31 2002

Crashy Crashy!

So I set the system up again from scratch, clean, and had just started using it again, when it hung again.

I hard-reset, and when the video-bios screen popped up, the model name was visible in the first line as normal, but the rest of the screen was filled with garbage ascii chars.

Powered down, let it sit a few minutes, and tried again. It booted, and got to the login prompt, but typing my username produced a crash and a stack dump, and returned me to the login prompt. I tried again, and it told me the username was invalid. I tried logging in as root, and I got another stack dump.

So I calmly unplugged the box, and pushed it aside, and did my work on my laptop instead. Until further notice, I have given up (but I never give up for long).


Rant

In the past I have run into unreliable systems like this one that defy proper troubleshooting techniques. It's as if there is no specific problem with any one component; it's like the whole conceptual collection of components known as "this machine" has a problem. Like when this computer was "born" it got an evil soul.

I get the disturbing feeling that if I start over with all-new hardware (even if the hardware is exactly the same) and just give it a different name (oh, this is my "new" computer. That crashy one is my "old" computer) it will work flawlessly... and that if I cannibalize every single component of the crashy computer and put them in other systems one by one, none of them will exhibit any problems at all, because they aren't tainted by "being" that cursed computer.

Which of course leads me to direct you to: http://james.HamsterRepublic.com/technomancy/

January 8 2003

Pleasant Stability

I am pleased to say that the netherwog has been very stable lately. After I got home from my vacation in November I removed the VIA motherboard, and switched back to the original Intel i810 motherboard that started the whole thing. I reinstalled the operating system clean, and it has not crashed or locked once since. I even upgraded to Debian unstable without any loss of system stability. Everything is working wonderfully, and I use it for everything. I am still not sure what the problem was before, but I do NOT blame the VIA motherboard just because it was the last thing I changed before things started working again. The original symptoms started long before I tried the FIRST motherboard swap, back when I was using the same i810 I am using now, and I have other Linux systems at work (servers even!) that use exactly the same model of VIA motherboard. What I really credit the current stability to is the fact that when I was switching motherboards, I spoke aloud to the computer in calm soothing tones, telling it that it was getting a NEW motherboard, and assuring it that everything would be better from now on. I think it believed me.

September 12 2003

Glorious Stability

I haven't had any problems in months and months and months. THIS is the way a computer is SUPPOSED to work. YAY!

January 8 2004

Introducing Risk

Well, things have been working so well for so long, I think it is time to make things hard on myself. I am ready to risk setting up Software Suspend.