Coarray Fortran programs hang with certain numbers of images

On our Slurm cluster, even the simplest coarray program either hangs or segfaults, depending on the number of nodes and the number of specified images. With 2 nodes (12 cores each) I cannot run more than 16 or 20 images (depending on how I launch it); beyond that it segfaults. With 3 or more nodes the number of images I can use increases, but it always stays below the maximum (the total core count).

The program is

program hello
      implicit none
      sync all
      write (*,*) "hello from image", this_image()
      sync all
end program hello
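
A variant that also reports num_images() from image 1 is a quick sanity check on how many images the runtime actually starts; this is only a sketch along the same lines (hello_count is a placeholder name):

program hello_count
      implicit none
      sync all
      ! image 1 reports how many images the runtime actually started
      if (this_image() == 1) write (*,*) "running with", num_images(), "images"
      write (*,*) "hello from image", this_image()
      sync all
end program hello_count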

 

$ ifort --version
ifort (IFORT) 19.0.1.144 20181018

$ ifort -coarray=distributed -coarray-num-images=20 hello.f90 -o hello

16 images work, 20 images hang, and the maximum of 24 images crashes:

$ ./hello
MPI startup(): I_MPI_SCALABLE_OPTIMIZATION environment variable is not supported.
MPI startup(): I_MPI_CAF_RUNTIME environment variable is not supported.
MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
In coarray image 20
Image              PC                Routine            Line        Source             
hello              00000000004060D3  Unknown               Unknown  Unknown
libpthread-2.12.s  00007F48D65F67E0  Unknown               Unknown  Unknown
libicaf.so         00007F48D6A8E086  for_rtl_ICAF_BARR     Unknown  Unknown
hello              00000000004051DB  Unknown               Unknown  Unknown
hello              0000000000405182  Unknown               Unknown  Unknown
libc-2.12.so       00007F48D6271D1D  __libc_start_main     Unknown  Unknown
hello              0000000000405029  Unknown               Unknown  Unknown

Abort(0) on node 19 (rank 19 in comm 496): application called MPI_Abort(comm=0x84000002, 0) - process 19

etc.

In this mode, if I set I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so and run ./hello, I get no output at all, but it does not hang. Running it multiple times sometimes gives:

srun: error: slurm_send_recv_rc_msg_only_one to tev0107:42260 : Transport endpoint is not connected
srun: error: slurm_receive_msg[192.168.76.7]: Zero Bytes were transmitted or received

 

In the no_launch mode:

$ ifort -coarray=distributed -switch no_launch hello.f90 -o hello

I can consistently run 20 images without problems:

$ srun --mpi=pmi2 -n 20 ./hello
MPI startup(): I_MPI_CAF_RUNTIME environment variable is not supported.
MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
 hello from image           3
 hello from image          11

...

but beyond 20 it will either hang or segfault again.

 

If I remove the sync all statements and keep only the write, I see a couple of "hello from image" messages, but after 5 or 6 of them it just hangs. Without the sync statements it does not segfault.
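
That is, the variant without synchronization is essentially just:

program hello_nosync
      implicit none
      ! same program with both sync all statements removed
      write (*,*) "hello from image", this_image()
end program hello_nosync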

 

We also have the 2018 release installed, and I do not see this problem there. However, with the 2018 release even this trivial program with just a write and a sync all takes 30 seconds of CPU time before finishing, whereas an equivalent program that calls MPI directly finishes in a fraction of a second. I can also run any number of images in shared mode (-coarray=shared) on a single machine, even with oversubscription.
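
For comparison, the MPI program I mean is essentially a barrier around a print; this is only a sketch and the actual test code may differ in details:

program hello_mpi
      use mpi
      implicit none
      integer :: rank, ierr
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! analogous to the first sync all
      write (*,*) "hello from rank", rank
      call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! analogous to the second sync all
      call MPI_Finalize(ierr)
end program hello_mpi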

The guide at https://software.intel.com/en-us/articles/distributed-memory-coarray-for... seems to be outdated, since the TCP fabric is no longer supported by Intel MPI.

Am I the first one to test the Coarray Fortran implementation with the 2019 suite, or is all of this just user error? Is the Coarray Fortran implementation considered stable?

Hope you can help,

Tobias