The Computer Problems/Solutions Page


Date: May 25, 2000

Problem: /Shannon/d3/ is full. Heather is loading the Exxon survey for st_54 and Nate is loading seismic files into Kileuea. We noticed that something was wrong when Heather loaded the 8-bit data and the .3dv files read 0MB. Nate and Kyle also noticed that when they read in a file while working in Landmark, the screen would blink repeatedly before loading the data. Nate finally got an error message that the disk was full. He also tried loading .3dv files, only to find that they were also empty.


Solution: Miley returned on May 30 (he was out for personal reasons) and told us that he had taken Shannon off the HSM so that he could make repairs. He was then able to make the repairs and put Shannon back on the HSM. Shannon was up and running again late in the afternoon on the 30th. This could have been avoided if John and Heather had communicated more effectively, so that John knew that Heather was loading data and Heather knew that Shannon was off the HSM. There was no way for anyone else to fix this problem while Miley was away; we couldn't even have migrated the data with the HSM down. The long-term solution is to work out how to do scheduled backups with the HSM running.

Duration of Problem: 5 days
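
The first visible symptom above was .3dv files that read 0MB. A minimal sketch of a check for that symptom follows. It is only an illustration: the function name find_empty_files is made up for this page, and it looks at nothing but the size the file system reports, so it cannot say why a file is empty.

```python
from pathlib import Path


def find_empty_files(root, suffix=".3dv"):
    """Return sorted paths under root that end in suffix and are 0 bytes long.

    Hypothetical helper for illustration; not part of Landmark or ADSM.
    """
    empty = []
    for path in Path(root).rglob("*" + suffix):
        # Only regular files count; a directory named "x.3dv" is ignored.
        if path.is_file() and path.stat().st_size == 0:
            empty.append(path)
    return sorted(empty)
```

Run against a project directory right after a load, for example find_empty_files("/shannon/d3"), a non-empty result would be a hint to stop loading and check whether the disk is full.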

 


Date: May 25, 2000

Problem: Nate is not able to plot to wiggle. He is on hydro, trying to print directly from Landmark. Heather tried from the command line on phase and was also unsuccessful. Heather also notices that the web page is down. She reboots using the sudo command, but it doesn't work.


Solution: Miley returned on the 30th and explained that this had probably been down since the system reboot on May 19th. Miley tried setting this up so that pecten would restart the ZEH software; sometimes it works and sometimes it doesn't, and this time it didn't. Miley restarted hydro and the ZEH software, and we are now able to plot to wiggle. Heather needs to learn how to restart the ZEH software in case John is not available.

Duration of Problem: 5 days

 


Date: May 28, 2000

Problem: Peter is working when centroid crashes. He also notices that hydro is down, including the web page.


Solution: Miley returns on the 30th and explains that the machines were rebooted when there was a power outage. Miley bought backups for hydro, centroid, and phase and installed them. However, we still need to keep an eye on hydro and hal and move them to a climate-controlled environment.

Duration of Problem: 2 days

 


Date: May 29, 2000

Problem: We notice a blinking problem in Landmark. When we click on Seismic-Select from Map and go over to the Map View window it blinks repeatedly.


Solution: Miley and I called Landmark and spoke with Charles Fischer (#205370). It turns out we have a bug: one of our patches is malfunctioning. We have to back out the 105284-33 patch and reinstall one of the patches between 105284-16 and 105284-24. It has something to do with the patch and our running on the CDE. Miley is going to back out the patch and see what happens. Backing out the patch worked, and the screen is no longer blinking.

Duration of Problem: 5 days

 


Date: May 31, 2000

Problem: Rachel is not able to pull up the correct email addresses when writing email. There seems to be a cascading address problem in Lotus Notes.


Solution: You need to specify in your location document whether or not you want to do lookups on the server. To get your location document, choose File-Tools-Edit Current Location.

Duration of Problem: 2 days

 


Date: June 2, 2000

Problem: Heather tried to back up Peter's laptops onto Wave via FTP. This has worked before, but the Wave FTP connection seemed to be down.


Solution: Heather contacted Kevin, who told her to type "geosci/heatherj" as the username. This worked fine.

Duration of Problem: less than 1 day

 


Date: June 2, 2000

Problem: Rachel's computer is not responding. She gets an error message when booting up and the computer will not boot. It doesn't respond to being turned off or to Ctrl-Alt-Delete.


Solution: Kevin rebuilt the computer and reinstalled the software.

Duration of Problem: 2 days

 


Date: June 5, 2000

Problem: Miley discovers that one of the leads on each of the UPSs on the geosystems Suns was disconnected. One of the problems was that rocky was rebooting every time there was a power flicker.


Solution: John reconnected the leads, reconfigured the system, and tested it to make sure that everything was working properly. Miley ordered UPSs for hydro, centroid, and phase; they are connected and working properly. John also checked the other existing UPSs to make sure that they were working.

Duration of Problem: 1 day

 


Date: June 12, 2000

Problem: There is no method by which a person is warned when they are nearing their quota in their home directory.


Solution: Miley is working on it.

Duration of Problem: ongoing
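
Until a proper fix is in place, a warning like the one asked for above could be sketched roughly as follows. This is a sketch under assumptions, not Miley's solution: the helper names, the 90% warning level, and the idea that the quota is passed in by hand are all made up here, because this page does not say how quotas are configured.

```python
import os


def directory_usage_bytes(root):
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # File vanished or is unreadable; skip it rather than abort.
                continue
    return total


def quota_warning(used_bytes, quota_bytes, warn_fraction=0.9):
    """Return a warning string at or above warn_fraction of quota, else None.

    quota_bytes and warn_fraction are assumptions supplied by the caller.
    """
    if quota_bytes <= 0:
        raise ValueError("quota_bytes must be positive")
    fraction = used_bytes / quota_bytes
    if fraction >= warn_fraction:
        return "Warning: home directory is at {:.0%} of quota".format(fraction)
    return None
```

A login script or scheduled job could call quota_warning(directory_usage_bytes(home), quota) and print or mail the result when it is not None.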

 


Date: June 20, 2000

Problem: Miley discovers that the ADSM has stopped. He goes to check it out and finds that two of the tapes were jammed.


Solution: John fixes the jam and later upgrades the software for the ADSM for $5,000. He gets the ADSM back up and running with no more problems.

Duration of Problem: 2 days

 


Date: June 23, 2000

Problem: The hard drive on Muskat (Jim's PC) crashes. The computer was not backed up.


Solution: Kevin takes a look at it and tries to hook it up to his machine, but his computer doesn't recognize the hard drive either. John sends it to Ontrack in Utah. They look at it for 3 days, but nothing is retrievable from the disk. This was the only computer in the BRG that was not on the ADSM system; everything else has been backed up. Fortunately, most of the things on Muskat were also on Wave and the Suns.

Duration of Problem: 21 days

 


Date: July 20, 2000

Problem: Lotus Notes went down due to a license issue.


Solution: John Miley spoke with Kevin Grainer and solved the problem. This section will be filled in further by John Miley.

Duration of Problem: 2 days

 


Date: July 20, 2000

Problem: Lava doesn't replicate with Austin (Notes).


Solution: Yet to be solved.

Duration of Problem: ongoing

 


Date: August 3, 2000

Problem: Brandon is loading seismic data from tape into /shannon/d3 and the process hangs along with all other processes on shannon.


Solution: John is notified and discovers that /shannon/d3 filled to 100% while Brandon was loading data. John changes the configuration and forces the data to migrate. He tells Brandon not to load anything until /shannon/d3 is below 80%.
On August 5, Brandon notices that /shannon/d3 is at 80%, so he tries to load the seismic data from tape again. Around 4pm, everything goes down and shannon seems to be non-functional.
On August 8, Nate has trouble seeing some wells in Geolog. The error reads: Error file(3): Error reading from "/shannon/d3/mincom/w_kileuea/wells/6_tx_a-12.well": Permission Denied. Heather looks at the permissions and thinks the problem might come from moving and copying the files. Nothing seems to work. She then restores one of the projects from the /hydro/d0/dlt2 disk and everything works fine. After discussion with Miley, it was realized that the HSM had been taken offline and that the data had been written to tape and was then not retrievable.
On August 9, Miley runs a series of tests and restarts the ADSM. A volume gets stuck in one of the drives, where it becomes "unavailable" to ADSM. Miley managed to free the tape by rebooting the drives, but as soon as he checked it back into the library it got stuck again. The tape might be broken. Miley fixed the tape and got the ADSM up and running to get Nate back in action, but this led to a much bigger issue: after much investigation, it was discovered that the files that had been migrated had been corrupted.
Miley worked with IBM support and on August 12 sent them a crash dump. The server was running high on CPU, and there was a problem with the expiration processing (e.p.). Miley turned the e.p. off and the system started working. It then became available for reclamation processing. The HSM then started crashing, and ADSM support stated that the database was trashed. Miley then realized that all of this had stemmed from August 3, when Brandon and Heather were loading data onto /shannon/d3 at the same time.
There was a default setting on the tape storage system set at 70GB, and when they exceeded that with the loading, the disk started filling up and eventually became full. John fixed the default setting, but it seems that this may have caused more problems than originally thought. On August 14, Miley restored the backup from August 6 and tried to do an incremental backup. Miley is still trying to assess which of our files have been corrupted, which files were migrated, and how to get those files back onto the working system.
On August 18, Brandon tries to open a variety of Landmark projects and .3dv files and succeeds in opening only 1 out of 9. The error in Seisworks says: Unable to open seismic data file one32b01.3dv. Reason: Please Correct Seismic Files using Seismic-Parameters. Victor is also unable to open some of the files in his thesis directory. This is the first time that files other than Geolog or Landmark files have had trouble being recalled. Brandon then tries a test: he moves 4 files off his laptop hard drive that he had backed up a long time ago on ADSM, and then restores them using the ADSM client. They all restore.
John again called IBM. It seems that we are able to migrate files but not recall them. We are also unable to reconcile or catalogue the tapes with the disk. IBM main support couldn't figure out the problem, so they sent it to their second-level people, who called Miley back on Thursday, August 24. The problem was with the daemons: they were running, but needed to be restarted. Miley did this and was then able to recall EVERYTHING that had been migrated to tape. John then did a database audit of the system and discovered that he was still unable to do an incremental backup and that the reconcile process was still going to sleep. He was finally able to complete an incremental backup on September 1, and he then tried to reconcile the file system. The system failed to do another incremental backup.
On September 3, Brandon is unable to access files on /shannon/d3, and John has to reboot shannon. On September 6, the system completed an incremental backup, with one failure, in 2.5 hours, but the system still needs to be reconciled.
On September 7, IBM found another client doing HSM under Solaris who had the same problems that we are having and who did resolve the issue. When we try to solve our problem using the other client's solution, it doesn't work. IBM is still looking into the problem.
On September 10, Peter looks at the /shannon/d3 capacity and it's at 84%. John says that he bumped it up to 85% for testing reasons. John tries to restore specific files that he knows were migrated to server storage. 14 of the 16 files fail to restore.
On September 12, John and Peter decide that it might be best to offload all of our data onto another disk (such as the ESSC Cray) and rebuild ADSM from the ground up. After talking with the ESSC, it is found that it might be hard to separate our data into 10GB chunks and that it would be best to just wait for the new Sun disk to get here.
John talks to IBM on September 15 and finds out the following: it's down to the TCP/IP rejection issues. There is an APAR from another client running our configuration whose problems were resolved by upgrading the Veritas Filesystem software, and this is what IBM would like us to try. The version of Veritas they want us to upgrade to was not an option with the previous version of the HSM client; it is now that we've upgraded. We'd like to try that too, but we're not comfortable doing it until we have secured our data.
On September 18, a Sun representative drives from Harrisburg to give us our new 160GB disk. This is what Miley will use to offload our data onto. The same day, Miley sends an email explaining that the current plan is to start moving our data out of server storage and putting it on the new 160GB disk.
Once we have the data in a consolidated file space and we are confident that it is intact, Miley is going to remove all remnants of /shannon/d3 from ADSM, remove the /shannon/d3 and /shannon/d4 filesystems from /shannon, remove the old version of Veritas from shannon, install the new version of Veritas, redo the disks, re-add space management, and move our data back into it.

Duration of Problem: ongoing
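
The entry above started with a rule of thumb: do not load anything onto /shannon/d3 until it is below 80%. A minimal sketch of a pre-load check built on that rule follows. The function names are made up for this page, and the extra test that the disk would still be below the limit after the load is an assumption added here, not something John specified. The sketch also ignores HSM migration, which may free disk space on its own schedule.

```python
import shutil


def safe_to_load(total_bytes, used_bytes, incoming_bytes, max_fraction=0.80):
    """Return True if usage is below max_fraction now and would stay below it
    after adding incoming_bytes. The 0.80 default mirrors the 80% rule above;
    the projected check is an assumption of this sketch.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    current = used_bytes / total_bytes
    projected = (used_bytes + incoming_bytes) / total_bytes
    return current < max_fraction and projected < max_fraction


def check_filesystem(path, incoming_bytes, max_fraction=0.80):
    """Apply safe_to_load to the file system that holds path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return safe_to_load(usage.total, usage.used, incoming_bytes, max_fraction)
```

Two people loading at the same time, as on August 3, would each need to count the other's data in incoming_bytes for the check to mean anything, which is another reason to coordinate loads before starting them.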

 


Date: August 4, 2000

Problem: Pecten goes down.


Solution: John Miley rebooted Pecten. There were still some problems with Phase, so Miley then rebooted both Shannon and Phase and everything was fine. This section will be filled in further by John Miley.

Duration of Problem: 1/2 day

 


Date: August 5, 2000

Problem: Shannon goes down. The fragmentation, which wasn't excessive, caused the OS to believe there was an out-of-inodes condition. In reality, a Veritas file system can't run out of inodes.


Solution: John Miley rebuilds shannon. John defrags /shannon/d3 and schedules the defrag to run every month from now on. This has become necessary because of the excessive I/O that the HSM causes. He then backs up /shannon/d3 and the ADSM database and restarts the HSM.

Duration of Problem: 1 day

 


Date: August 16, 2000

Problem: Peter's email was not forwarding from his .geosc account to Austin.


Solution: Miley checked the .forward mechanism. There was a problem with the configuration that caused the mail not to be forwarded. He fixed the configuration and everything then forwarded.

Duration of Problem: 4 hours

 


Date: September 5, 2000

Problem: /hydro/d1 and /hydro/d0 are inaccessible.


Solution: The biggest problem seems to be with /hydro/d1. Miley reboots hal, but the directories seem to be trashed. He restores both /hydro/d0 and /hydro/d1 from backups and links /hydro/d0/d1 back to /hydro/d1.

Duration of Problem: 1 day

 



Last Modified September 18, 2000 - Heather Johnson