Monthly Archive for May, 2007

The Legend of the ‘Punctured Stripe’

Where I work we have a dedicated IIS web server we'll call IISServ.  IISServ is configured with a RAID 1 array plus a single non-RAID disk for extra log storage.  A while back, IISServ was moved to a new spot lower down in its rack.  Magically, after the move, the computer started reporting I/O and CRC errors on drive C.

I called Dell and told them about the errors, and they had me run a diagnostic on the whole RAID controller.  The diagnostic reported errors on disk 0, the first disk in the RAID 1 array.  Dell shipped me a new disk and I swapped it in, thinking about how easy the whole process was.

A few days later we received more I/O and read errors, which were canceling the shadow copy and remote backup processes for IISServ.  After I ran some updates, IISServ failed to reboot properly and hung while booting into the OS.  I removed the RAID 1 drive that had not just been replaced, and the computer booted normally.  After it booted, I re-inserted the bad drive.  I called Dell again and they had me run another diagnostic on the RAID array.  This time errors appeared on the second drive.  Dell sent me another good drive, and I hot-swapped it into IISServ.

After a day or two I was still receiving errors and the system state backup was still failing.  I ran a chkdsk.  It reported some errors, so I scheduled downtime to run chkdsk /r.  When I ran chkdsk again afterwards, a few sectors had been flagged as 'bad'. Great news, right?  STILL the errors persisted, and I assumed the only remaining problem could be the RAID controller itself.  The server is under warranty, so back to Dell I went again.

Dell had me run a reporting tool and send them log information from the RAID controller.  They scanned the logs and found that disks 0 and 1 (the two in the RAID 1 array) were both reporting bad sectors in exactly the same places.  Dell escalated me to a manager, or specialist, or whatever they have, and he explained his version of the problem.  He called it a 'Punctured Stripe', which nobody in my office, including me, had ever heard of.  He explained that the error was on the logical layer of the array and not on the physical disks, and that consistency checks are the preventative maintenance for it.  He went on to say that there was no application or process for repairing this, and that the only way to fix the problem was to RECREATE THE ARRAY from scratch.  That's right: destroy the entire array and all the data and start over.  I guess I should have run more consistency checks, but he admitted that even then this problem can still occur.
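To convince myself that swapping disks could never clear a fault like this, here is a toy model in Python. It is only a sketch of the idea as the specialist described it, not how the controller really works: the names `BAD`, `rebuild_from` and `read_block` are made up for illustration, and each "disk" is just a list of blocks.

```python
BAD = None  # marks a block that cannot be read (toy model only)


def rebuild_from(survivor):
    """Rebuild a replacement disk by copying every block from the surviving mirror.

    A block that cannot be read on the survivor has nothing to copy, so the
    replacement ends up with the same hole at the same position.
    """
    return list(survivor)


def read_block(disk0, disk1, index):
    """Return the block from whichever mirror can still read it."""
    for disk in (disk0, disk1):
        if disk[index] is not BAD:
            return disk[index]
    raise IOError(f"block {index} is unreadable on both mirrors")


# Disk 1 carries a latent bad block that no consistency check ever caught.
disk0 = ["a", "b", "c", "d"]
disk1 = ["a", "b", BAD, "d"]

# While disk 0 still holds a good copy, the mirror hides the problem.
assert read_block(disk0, disk1, 2) == "c"

# Disk 0 is swapped out and rebuilt from disk 1: the hole is copied across.
disk0 = rebuild_from(disk1)
# Disk 1 is swapped out and rebuilt from the new disk 0: the hole is still there.
disk1 = rebuild_from(disk0)

bad_on_both = [i for i in range(len(disk0)) if disk0[i] is BAD and disk1[i] is BAD]
print(bad_on_both)  # prints [2]
```

In this model both replacement disks are brand new, yet they report the same bad block in the same place, which is the pattern Dell found in our logs. A consistency check run while disk 0 still held the good copy is the only point at which the hole could have been noticed.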

A search for "Punctured Stripe" in quotes came up with about 4 results.  One guy was ranting about how people use RAID 1 as their backup solution, which it is not intended to be.  Another result explained that RAID 1 is only effective against complete drive hardware failure, and is susceptible to passing corrupt data between drives.  I even found a thread titled 'RAID1 is useless' from someone ranting about the same problem we have.  I also went so far as to call someone who does hard drive recovery and ask whether he had ever heard of a 'Punctured Stripe'.  He said he had not, but he had heard of corrupt data being replicated on the logical layer of a RAID 1 array, effectively ruining the entire array and requiring it to be recreated.

I am trying to figure out how this happened, how we could have prevented it, and what good RAID 1 is if corrupt data can ruin the array so easily.  From the looks of it, RAID 1 isn't such a great redundancy solution!

Update 6/14/07 - Found that Dell released a utility for dealing with this problem.

Error ID command line lookup (EventID.net)

This simple little script looks up an error code on www.EventID.net when you supply an event ID. I find it pretty useful when I am digging into server logs.

Sample:
id 2000
(looks up errorid 2000 on eventid.net)

id.bat


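@echo off
color 1f
rem Open the EventID.net page for the event ID given as the first argument.
rem The empty "" is the window title that start expects before a quoted URL.
start "" "http://www.eventid.net/display.asp?eventid=%~1"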
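For anyone who would rather not use batch, the same lookup can be sketched in Python. This is a minimal sketch, assuming the site still accepts the `eventid` query parameter that id.bat uses; the function name `eventid_url` and the digits-only check are my own additions, not part of the original script.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.eventid.net/display.asp"  # same page id.bat opens


def eventid_url(event_id):
    """Build the lookup URL for a numeric event ID, rejecting anything else."""
    text = str(event_id).strip()
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"event ID must be a number, got {event_id!r}")
    return f"{BASE_URL}?{urlencode({'eventid': text})}"


print(eventid_url(2000))
# prints http://www.eventid.net/display.asp?eventid=2000
```

Passing the result to `webbrowser.open()` from the standard library would open it in the default browser, much as `start` does in the batch file.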

PsExec Batch Shortcut

This batch file saves the trouble of getting all the context right for psexec and makes it easier to run things on multiple computers with varying usernames.  Help is built into the file, as you can see below, so just run the batch file without any parameters for a listing.

runremote.bat

@echo off
color 1f

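rem Show the help text unless both a username and a command were supplied.
if "%~2" == "" goto help

:runit
rem -c copies the program to each target, -f overwrites an existing copy,
rem -s runs it as SYSTEM, and -n 10 gives up connecting after 10 seconds.
psexec @serverlist.txt -c -s -f -n 10 -u %1 %2
goto end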

:help
echo ---
echo blog@integrii.net
echo PsExec Batch File Shortcut
echo http://blog.integrii.net
echo ---
echo PURPOSE:
echo Runs a file or command on a list of remote computers with a specified username.
echo ---
echo USAGE:
echo runremote [remote username] [command or file]
echo ---
echo NOTE:
echo You MUST CHANGE the serverlist.txt to contain a list of target computers. One computer per line, hostname or IP address.
echo ---
pause

:end


Read more about PsExec here:
http://www.microsoft.com/technet/sysinternals/Security/PsExec.mspx 
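If you want to drive the same psexec call from a script, here is a rough Python sketch that only builds the argument list used in the batch file above. The function name `psexec_args` and the `MYDOMAIN\admin` account are made up for illustration; the flags are the ones the batch file already passes.

```python
def psexec_args(username, command, server_list="serverlist.txt", timeout=10):
    """Return the psexec argument list that runremote.bat would run."""
    if not username or not command:
        raise ValueError("both a username and a command are required")
    return [
        "psexec",
        f"@{server_list}",    # file with one target computer per line
        "-c",                 # copy the program to the remote computer
        "-s",                 # run it under the SYSTEM account
        "-f",                 # copy even if the file already exists there
        "-n", str(timeout),   # connection timeout in seconds
        "-u", username,       # psexec prompts for the password itself
        command,
    ]


args = psexec_args("MYDOMAIN\\admin", "install.bat")
print(" ".join(args))
# prints psexec @serverlist.txt -c -s -f -n 10 -u MYDOMAIN\admin install.bat
```

Nothing is executed here. Handing the list to `subprocess.run(args)` on a machine with psexec on the PATH would be the next step, and keeping the arguments as a list avoids quoting trouble when a command contains spaces.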

Auto-Ban FTP Attacks with Remote Service Install

This .zip file contains everything you need (including cscript.exe) to deploy a service on a list of remote computers.  The service this script deploys is a .vbs script that looks for failed logon attempts to the Administrator account.  When the script detects a failed login to the Administrator account, it adds a route that breaks the connection almost immediately.  Then the hostile IP is added to the ban list for all FTP sites on the machine.  This ban is at the root level, not just the site level.

This script is very effective at stopping brute-force attackers because almost any brute-force attack tries the Administrator account right away.  The only problem with this script is that it will NOT stop failed attempts on other accounts.  In my experience, this script stops at least half of all attackers right away, and most eventually.
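The decision the service makes can be sketched in a few lines of Python. This is not Chrissy's .vbs code, just my reading of the behaviour described above: the event dictionaries, the function names, and the dead gateway address are all hypothetical, and the exact `route add` line the real script issues may differ.

```python
def ips_to_ban(events, watched_account="Administrator", already_banned=()):
    """Return new source IPs with a failed logon to the watched account."""
    banned = set(already_banned)
    new_bans = []
    for event in events:
        if event["success"]:
            continue  # successful logons are never banned
        if event["account"].lower() != watched_account.lower():
            continue  # failures on other accounts are ignored (the known gap)
        if event["ip"] not in banned:
            banned.add(event["ip"])
            new_bans.append(event["ip"])
    return new_bans


def blackhole_route_command(ip, dead_gateway):
    """Build a Windows route command sending replies for one host to a dead gateway."""
    return f"route add {ip} mask 255.255.255.255 {dead_gateway}"


# Hypothetical, already-parsed logon events (documentation-range addresses).
events = [
    {"account": "Administrator", "ip": "203.0.113.7", "success": False},
    {"account": "administrator", "ip": "203.0.113.7", "success": False},
    {"account": "bob", "ip": "198.51.100.23", "success": False},
    {"account": "Administrator", "ip": "192.0.2.10", "success": True},
]

for ip in ips_to_ban(events):
    # 10.0.0.254 stands in for an address on the local subnet that nothing answers on.
    print(blackhole_route_command(ip, "10.0.0.254"))
# prints route add 203.0.113.7 mask 255.255.255.255 10.0.0.254
```

The failed logon for `bob` produces no ban, which is exactly the limitation mentioned above: only attempts against the Administrator account trigger anything.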

Steps to install:

  1. Open serverlist.txt, put one server on each line, then save and close the file. (This installer works with psexec from Sysinternals.)
  2. Open runstuff.bat in your favorite text editor (Notepad++ for me) and change the 'administrator' username to whatever domain administrator account you use on your network.  Including the domain might or might not be required, depending on how your target computers are set up.
  3. Move the files to a place on the network that all the target computers can reach.
  4. Open install.bat in a text editor and change (INSERT NETWORK LOCATION HERE) to that UNC path on your network.
  5. Run runstuff.bat and enter your password when prompted.

Download .zip: BanIP Remote Installer

Thanks to the .vbs script's author, Chrissy, and to frijoles for the great instructions I used to write this.