[Aces-support] Odd mpi errors relating to MNC package
cnh at mit.edu
Fri Dec 3 19:24:18 EST 2004
Can you send a copy of your job script?
On Fri, 2004-12-03 at 18:27, Daniel Enderton wrote:
> Hey Ed [and others],
> The issue with my jobs cutting out at 4am happened again last night.
> Has this happened with anyone else's ITRDA jobs? Should I continue
> to expect this, or is this just an issue with getting ITRDA fully online?
> Is it something about how my model trials are configured? If so,
> what can be done?
> >On Thu, 2004-12-02 at 11:34 -0500, Daniel Enderton wrote:
> >> I started three jobs last night around 1am within a range of about 20
> >> minutes of each other. They all came back with mpi errors (in the
> >> pbs error files) relating to netcdf and mnc that read something like:
> >> ABNORMAL END: package MNC
> >> forrtl: severe (28): CLOSE error, unit 60, file "Unknown"
> >> Image PC Routine Line Source
> >> mitgcmuv.O1 081F75F8 Unknown Unknown Unknown
> >> Stack trace terminated abnormally.
> >> p4_error: latest msg from perror: Bad file descriptor
> >> In the pbs standard out file, the problematic part looked like this:
> >> NetCDF ERROR: No such file or directory
> >> MNC ERROR: ending define mode in S/R MNC_FILE_ENDDEF
> >> p4_31316: p4_error: net_recv read: probable EOF on socket: 1
> >> p5_27711: p4_error: net_recv read: probable EOF on socket: 1
> >> p7_16293: p4_error: net_recv read: probable EOF on socket: 1
> >> p3_647: p4_error: net_recv read: probable EOF on socket: 1
> >> p2_12796: p4_error: net_recv read: probable EOF on socket: 1
> >> rm_l_1_1848: p4_error: listener select: -1
> >> p6_21651: p4_error: net_recv read: probable EOF on socket: 1
> >> P4 procgroup file is pr_group.
> >> All the STDERR files are there but of zero size. The STDOUT files
> >> have nothing in them of note at the end (just the usual sea ice
> >> monitor statistic for one of the packages that I am using).
> >> Something else odd: they all seemed to break down at almost exactly the
> >> same time (even though I did not start them all this close
> >> together):
> >> [enderton at itrda enderton]$ ls -l AquaC3O10/AqC3O10_C.*
> >> -rw------- 1 enderton aces 10409734 Dec 2 04:04 AquaC3O10/AqC3O10_C.e51362
> >> -rw------- 1 enderton aces 55986184 Dec 2 04:04 AquaC3O10/AqC3O10_C.o51362
> >> [enderton at itrda enderton]$ ls -l AquaC3O5/AqC3O5_C.*
> >> -rw------- 1 enderton aces 10104455 Dec 2 04:03 AquaC3O5/AqC3O5_C.e51361
> >> -rw------- 1 enderton aces 54339284 Dec 2 04:03 AquaC3O5/AqC3O5_C.o51361
> >> [enderton at itrda enderton]$ ls -l AquaC3O20/AqC3O20_C.*
> >> -rw------- 1 enderton aces 9799175 Dec 2 04:04 AquaC3O20/AqC3O20_C.e51363
> >> -rw------- 1 enderton aces 52693228 Dec 2 04:04 AquaC3O20/AqC3O20_C.o51363
> >Hi Daniel,
> >*Good* bug report!
> >It looks like the kernel ran out of file descriptors. It does not look
> >like a problem with MITgcm itself [and I'm not just saying that to pass
> >the blame off as the MNC author ;-)]
> >The 4:04am time frame is very suspicious. It's *right* after the system
> >usually kicks off some cron jobs that update the locate database, update
> >whereis, do the pre-linking, etc. At these times the system can be very
> >heavily loaded, and it seems that it ran out of file descriptors ("file
> >Here's a relevant link from the magic of Google:
> >Edward H. Hill III, PhD
> >office: MIT Dept. of EAPS; Rm 54-1424; 77 Massachusetts Ave.
> > Cambridge, MA 02139-4307
> >emails: eh3 at mit.edu ed at eh3.com
> >URLs: http://web.mit.edu/eh3/ http://eh3.com/
> >phone: 617-253-0098
> >fax: 617-253-4464
> >Aces-support mailing list
> >Aces-support at acesgrid.org
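The file-descriptor diagnosis quoted above could be checked on a compute node with something like the following. This is only a minimal sketch, assuming a Linux node that exposes /proc/sys/fs/file-nr (three whitespace-separated counters: allocated handles, unused handles, system-wide maximum). The helper names parse_file_nr and in_use_fraction are hypothetical and are not part of MITgcm, MNC, or the ACES setup.

```python
import resource
from pathlib import Path

# Linux-specific path; assumed (not confirmed by the thread) to exist on the node.
FILE_NR = Path("/proc/sys/fs/file-nr")


def parse_file_nr(text):
    """Return (allocated, unused, maximum) parsed from file-nr contents."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def in_use_fraction(text):
    """Fraction of the system-wide file-handle limit currently in use."""
    allocated, unused, maximum = parse_file_nr(text)
    return (allocated - unused) / maximum


if __name__ == "__main__":
    # Per-process limit on open files for the current process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"per-process open-file limit: soft={soft} hard={hard}")

    # System-wide usage, only if the kernel exposes the counter file.
    if FILE_NR.exists():
        contents = FILE_NR.read_text()
        allocated, unused, maximum = parse_file_nr(contents)
        print(f"system-wide handles: allocated={allocated} "
              f"unused={unused} max={maximum}")
        print(f"fraction in use: {in_use_fraction(contents):.1%}")
```

Running this from cron a few minutes either side of 4am and logging the output would show whether the system-wide count really approaches the maximum while the nightly jobs are active.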