Date: May 25, 2000
Problem: /Shannon/d3/ is full. Heather is loading the Exxon survey for st_54 and Nate is loading seismic files into Kileuea. We noticed that something was wrong when Heather loaded the 8-bit data and the .3dv files read 0MB. Nate and Kyle also noticed that when they read in a file in Landmark, the screen would blink repeatedly before loading the data. Nate finally got an error message that the disk was full. He also tried loading .3dv files, only to find that they were also empty.
Solution: Miley returned on May 30 (he was out for personal
reasons) and told us that he had taken Shannon off the HSM so that he could
make repairs. He completed the repairs and put Shannon back on the HSM. Shannon
was up and running again late in the afternoon on the 30th. The
problem could have been avoided if John and Heather had communicated more
effectively, so that John knew that Heather was loading data and Heather knew
that Shannon was off the HSM. There was no way for anyone else to fix this
problem while Miley was away; we couldn't even have migrated the data with the
HSM down. The long-term solution is to work out how to do scheduled
backups with the HSM running.
Duration of Problem: 5 days
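The empty .3dv files were the first visible symptom of the full disk. A check along the following lines might flag them sooner. This is only a minimal sketch in Python, which is an assumption, since the log does not say what scripting tools were available. The function name find_empty_files and the root path in the usage line are hypothetical.

```python
import os


def find_empty_files(root, suffix=".3dv"):
    """Return a sorted list of files under root that end with suffix and are 0 bytes."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) == 0:
                    empty.append(path)
            except OSError:
                # File vanished or is unreadable; skip it rather than abort the scan.
                continue
    return sorted(empty)


if __name__ == "__main__":
    # Hypothetical usage: scan the data disk after a load finishes.
    for path in find_empty_files("/Shannon/d3"):
        print("empty:", path)
```

Running it right after a load would list any seismic files that were created but never written.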
Date: May 25, 2000
Problem: Nate is not able to plot to wiggle. He is on hydro trying to print directly from Landmark. Heather tried from the command line on phase and was also unsuccessful. Heather also notices that the web page is down. She reboots using the sudo command, but it doesn't work.
Solution: Miley returned on the 30th
and explained that this had probably been down since the system reboot on May
19th. Miley had tried setting this up so that pecten would restart the
ZEH software. Sometimes it works and sometimes it doesn't; this time it didn't.
Miley restarted hydro and the ZEH software and we are now able to plot to
wiggle. Heather needs to learn how to restart the ZEH software in case John is not available.
Duration of Problem: 5 days
Date: May 28, 2000
Problem: Peter is working and centroid crashes. He also notices that hydro is down, including the web page.
Solution: Miley returns on the 30th and explains
that they were rebooted when there was a power outage. Miley
bought a backup for hydro, centroid, and phase and installed them. However, we still need to keep an
eye on hydro and hal and move them to a climate-controlled environment.
Duration of Problem: 2 days
Date: May 29, 2000
Problem: We notice a blinking problem in Landmark. When we click on Seismic-Select from Map and go over to the Map View window it blinks repeatedly.
Solution: Miley and I called Landmark and spoke with Charles
Fischer (#205370). It turns out we have a bug: one of our patches is
malfunctioning. We have to back out the 105284-33 patch and reinstall one of
the patches between 105284-16 and -24. It has something to do with the patch
and our running on the CDE. Miley is going to back out the patch and see what
happens. Backing out the patch worked and the screen is no longer blinking.
Duration of Problem: 5 days
Date: May 31, 2000
Problem: Rachel is not able to pull up the correct email addresses when writing email. There seems to be a cascading address problem in Lotus Notes.
Solution: You need to specify in your location document whether or not you want to
do lookups on the server. To get your location document, choose File-Tools-Edit Current Location.
Duration of Problem: 2 days
Date: June 2, 2000
Problem: Heather tried to back up Peter's laptops onto Wave via FTP. This had worked before, but the Wave FTP connection seemed to be down.
Solution: Heather contacted Kevin, who told her to type
"geosci/heatherj" as the username. This worked fine.
Duration of Problem: less than 1 day
Date: June 2, 2000
Problem: Rachel's computer is not responding. She gets an error message when booting up and the computer will not boot. It doesn't respond to being turned off or to Ctrl-Alt-Delete.
Solution: Kevin rebuilt the computer and reinstalled the
software.
Duration of Problem: 2 days
Date: June 5, 2000
Problem: Miley discovers that one of the leads on each of the UPSs on the geosystems Suns was disconnected. One of the problems was that rocky was rebooting every time there was a power flicker.
Solution: John reconnected the leads, reconfigured the
system and tested it to make sure that everything was working properly. Miley ordered
UPSs for hydro, centroid, and phase; they are connected and working properly. John
also checked the other existing UPSs to make sure that they were working.
Duration of Problem: 1 day
Date: June 12, 2000
Problem: There is no method by which a person is warned when they are nearing their quota in their home directory.
Solution: Miley is working on it.
Duration of Problem: ongoing
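One possible shape for such a warning is sketched below. This is a hedged illustration only, not what Miley implemented: Python is an assumption, the names directory_size and quota_warning are hypothetical, and the 90% warning level is an arbitrary choice. Real filesystem quotas are usually counted in disk blocks, so summing file sizes only approximates actual usage.

```python
import os


def directory_size(root):
    """Approximate usage: sum the sizes in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    return total


def quota_warning(used_bytes, quota_bytes, warn_fraction=0.9):
    """Return a warning string once usage reaches warn_fraction of the quota, else None."""
    if quota_bytes <= 0:
        raise ValueError("quota_bytes must be positive")
    fraction = used_bytes / quota_bytes
    if fraction >= warn_fraction:
        return "Warning: home directory at {:.0%} of quota".format(fraction)
    return None
```

A nightly job could call directory_size on each home directory and mail the owner whenever quota_warning returns a message.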
Date: June 20, 2000
Problem: Miley discovers that the ADSM has stopped. He goes to check it out and finds that two of the tapes are jammed.
Solution: John fixes the jam and later upgrades the software for the ADSM for $5,000. He gets the ADSM back
up and running with no more problems.
Duration of Problem: 2 days
Date: June 23, 2000
Problem: The hard drive on Muskat (Jim's PC) crashes. The computer was not backed up.
Solution: Kevin takes a look at it and tries to hook it up to his machine. His computer doesn't recognize the hard
drive either. John sends it to Ontrack in Utah. They take a look at it for 3 days; nothing is retrievable from the disk. This was the only computer in the BRG that was not on the ADSM system.
Everything else has been backed up. Fortunately, most of the things on Muskat were also on Wave and the Suns.
Duration of Problem: 21 days
Date: July 20, 2000
Problem: Lotus Notes went down due to a license issue.
Solution: John Miley spoke with Kevin Grainer and solved the problem. This section will be filled in further by John Miley.
Duration of Problem: 2 days
Date: July 20, 2000
Problem: Lava doesn't replicate with Austin (Notes).
Solution: Yet to be solved.
Duration of Problem: ongoing
Date: August 3, 2000
Problem: Brandon is loading seismic data from tape into /shannon/d3 and the process hangs along with all other processes on shannon.
Solution: John is notified and discovers that /shannon/d3 was filled to 100% while Brandon was loading data. John
changes the configuration and forces the data to migrate. He tells Brandon not to load anything until /shannon/d3 is below 80%. On
August 5, Brandon notices that /shannon/d3 is at 80%, so he tries to load the seismic from tape again. Around 4pm, everything goes down and
shannon seems to be non-functional. On August 8 Nate has trouble seeing some wells in Geolog. The error reads: Error file(3): Error reading from >
"/shannon/d3/mincom/w_kileuea/wells/6_tx_a-12.well": Permission Denied. Heather looks at permissions and thinks it might be from
moving and copying files. Nothing seems to work. She then restores one of the projects from the /hydro/d0/dlt2 disk and everything works
fine. After discussion with Miley, it was realized that the HSM had been taken offline and that the data had been written to tape and was
then not retrievable. On August 9 Miley runs a series of tests and restarts the ADSM. A volume gets stuck in one of the drives where it
then becomes "unavailable" to ADSM. Miley managed to free the tape by rebooting the drives but as soon as he checked it back into the
library it got stuck again. The tape might be broken. Miley fixed the tape and gets the ADSM up and running to get Nate back in action,
but this leads to a much bigger issue. After much investigation it was discovered that the files that had been migrated had been
corrupted. Miley worked with IBM support and on August 12 sent them a crash dump. The server was running high on CPU and there was a
problem with the expiration processing (e.p.). Miley turned the e.p. off and the system started working. It then became available for
reclamation processing. The HSM then started crashing, and ADSM support stated that the database was trashed. Miley then realized that
all of this had stemmed from Aug 3, when Brandon and Heather were loading data onto /shannon/d3 at the same time. There was a default
setting on the tape storage system set at 70 gig and when they exceeded that with the loading, the disk started filling up and eventually
became full. John fixed the default setting, but it seems that this may have caused more problems than originally thought. Miley
restored the backup from August 6 on August 14 and tried to do an incremental backup. Miley is still trying to assess which of our files
have been corrupted, which files were migrated, and how to get those files back onto the working system. On August 18 Brandon tries to
open a variety of Landmark projects and .3dv files and succeeds in opening only 1 out of 9. The error in Seisworks says: Unable to open
seismic data file one32b01.3dv. Reason: Please Correct Seismic Files using Seismic-Parameters. Victor is also unable to open some of
his files in his thesis directory. This is the first time that files other than Geolog or Landmark files have trouble being recalled.
Brandon then tries a test. He moves 4 files off his laptop hard drive that he had backed-up a long time ago on ADSM, and then restores them
using the ADSM client. They all restore. John again called IBM. It seems that we are able to migrate files but not recall them. We are
also unable to reconcile or catalogue the tapes with the disk. IBM main support couldn't figure out the problem, so they sent it to their
second-level people. They called Miley back on Thursday, August 24. The problem was with the daemons: they were running but needed to be
restarted. Miley did this and was then able to recall EVERYTHING that had been migrated to tape. John then did a database audit of the
system and discovered that he was still unable to do an incremental backup and that the reconcile process was still going to sleep. He was finally
able to complete an incremental backup on September 1 and then tried to reconcile the file system. The system failed to do another
incremental backup. On September 3 Brandon is unable to access files on /shannon/d3, John has to reboot shannon. On September 6 the
system completed an incremental backup with one failure in 2.5 hours. But the system still needs to be reconciled. On September 7 IBM
found another client doing HSM under Solaris who had the same problems that we are having and did resolve the issue. When we try to solve
our problem using the other client's solution, it doesn't work. IBM is still looking into the problem. On September 10 Peter looks at
/shannon/d3 capacity and it's at 84%. John says that he bumped it up to 85% for testing reasons. John tries to restore specific files
that he knows were migrated to server storage. 14 out of the 16 files fail to restore. On September 12 John and Peter decide
that it might be best to offload all of our data onto another disk (such as the ESSC Cray) and rebuild ADSM from the ground up. After
talking with the ESSC, it is found that it might be hard to separate our data into 10GB chunks and that it would be best to just wait for
the new Sun disk to get here. John talks to IBM on September 15 and finds out the following: It's down to the TCP/IP rejection issues.
There is an APAR from another client running our configuration whose problems were resolved by upgrading the Veritas Filesystem software,
and this is what IBM would like us to try. The version of Veritas they want us to upgrade to was not an option with the previous version
of the HSM client; it is now that we've upgraded. We'd like to try that too! But we're not comfortable doing it until we have secured
our data. On September 18 a Sun representative drives from Harrisburg to give us our new 160gig disk. This is the disk Miley will use to
offload our data. Also on September 18, Miley sends an email explaining that the current plan is to start moving our data out of server
storage and onto the new 160gig disk. Once we have the data in a consolidated file space and we are confident that it is intact,
Miley is going to remove all remnants of /shannon/d3 from ADSM, remove the /shannon/d3 and /shannon/d4 filesystems from
/shannon, remove the old version of Veritas from shannon, install the new version of Veritas, redo the disks, re-add space management, and
move our data back into it.
Duration of Problem: ongoing
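The rule John gave Brandon (do not load until /shannon/d3 is below 80%) could be checked by a small script before each load. The following is a minimal sketch under stated assumptions: Python is an assumption, the names percent_used and safe_to_load are hypothetical, and the figure is computed as used/total, which can differ slightly from what df prints. It only reports space on the local disk and knows nothing about what the HSM has migrated to tape.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total space."""
    return 100.0 * used_bytes / total_bytes


def safe_to_load(path, limit_percent=80.0):
    """Return True only if the filesystem holding path is below limit_percent full."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return percent_used(usage.total, usage.used) < limit_percent


if __name__ == "__main__":
    # Hypothetical usage: check the data disk before starting a tape load.
    if safe_to_load("/shannon/d3"):
        print("OK to load")
    else:
        print("Do not load: /shannon/d3 is at or above 80%")
```

Since the August 3 failure involved two people loading at once, a check like this would only help if both loaders ran it and coordinated with each other.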
Date: August 4, 2000
Problem: Pecten goes down.
Solution: John Miley rebooted Pecten. There were still some problems with Phase, so Miley then rebooted both
Shannon and Phase and everything was fine. This section will be filled in further by John Miley.
Duration of Problem: 1/2 day
Date: August 5, 2000
Problem: Shannon goes down. The fragmentation, which wasn't excessive, caused the OS to believe there was an out-of-inode condition. In reality a Veritas file system can't run out of inodes.
Solution: John Miley rebuilds shannon. John defrags /shannon/d3 and schedules the defrag to run every month from now on.
It has become necessary because of the excessive I/O that the HSM causes. He then backs up /shannon/d3 and the ADSM database, and restarts the HSM.
Duration of Problem: 1 day
Date: August 16, 2000
Problem: Peter's email was not forwarding from his .geosc account to Austin.
Solution: Miley checked the .forward mechanism. There was a problem with the configuration that caused the mail not to be forwarded.
He fixed the configuration and everything then forwarded.
Duration of Problem: 4 hours
Date: September 5, 2000
Problem: /hydro/d1 and /hydro/d0 are inaccessible.
Solution: The biggest problem seems to be with /hydro/d1. Miley reboots hal, but the directories seem to be trashed. He restores both /hydro/d0 and /hydro/d1 from backups and
links /hydro/d0/d1 back to /hydro/d1.
Duration of Problem: 1 day