Channel: Intel® Fortran Compiler for Linux* and macOS*

ifort 13.0, 14.0 coarray extremely slow read/write between nodes

This is my test code:

$ cat ca_check.f90
program z
implicit none
integer :: x(10)[*], img, nimgs, i
real :: time1, time2
img = this_image()
nimgs = num_images()
x = img
if (img .eq. 1) then
  do i=1,nimgs
    call cpu_time(time1)
    x = x(:)[i]
    call cpu_time(time2)
    write (*,"(a,f)") "Remote read took, s : ", time2-time1
    call cpu_time(time1)
    x(:)[i] = x
    call cpu_time(time2)
    write (*,"(a,f)") "Remote write took, s : ", time2-time1
    write (*,"(99999(i0,tr1))") x
  end do
end if
sync all
write (*,"(a,i0,a,i0,a)") "Image: ", img, " out of ", nimgs, "completed ok"
end program z
$

Compiled with:

ifort -o ca_check.xcack ca_check.f90 -coarray=distributed -coarray-config-file=ca.conf -debug full -warn all

$ cat ca.conf
-envall -n 64 ./ca_check.xcack
$

$ cat zpbs
#!/bin/sh
#PBS -l walltime=00:01:00,nodes=4:ppn=16
#PBS -j oe
#PBS -m abe
cd $HOME/nobackup/cgpack/branches/coarray/tests
echo "LD_LIBRARY_PATH: " $LD_LIBRARY_PATH > zzz
echo "which mpirun: " `which mpirun` >> zzz
export I_MPI_DAPL_PROVIDER=ofa-v2-ib0
mpdboot --rsh=ssh --file=$PBS_NODEFILE -n 4
mpdtrace -l >> zzz
cm-launcher ./ca_check.xcack >> zzz
mpdallexit
$

$ cat zzz
LD_LIBRARY_PATH:  /cm/shared/apps/torque/4.2.4.1/lib:/cm/shared/apps/moab/7.2.2/lib:/cm/shared/tools/subversion-1.8.4/lib:/cm/shared/apps/intel-cluster-studio/impi/4.1.0.024/intel64/lib:/cm/shared/apps/intel-cluster-studio/composer_xe_2013.1.117/compiler/lib:/cm/shared/apps/intel-cluster-studio/composer_xe_2013.1.117/compiler/lib/intel64:/cm/shared/apps/intel-cluster-studio/composer_xe_2013.1.117/lib:/cm/shared/apps/intel-cluster-studio/composer_xe_2013.1.117/lib/intel64
which mpirun:  /cm/shared/apps/intel-cluster-studio/impi/4.1.0.024/intel64/bin/mpirun
node32-035_47536 (10.131.0.179)
node33-002_50475 (10.131.0.98)
node33-003_55287 (10.131.0.99)
node34-006_42324 (10.131.0.54)
Remote read took, s : 0.0010000
Remote write took, s : 0.0000000
1 1 1 1 1 1 1 1 1 1
Remote read took, s : 0.0000000
Remote write took, s : 0.0000000
0 0 0 0 0 0 0 0 0 0
Remote read took, s : 0.0000000
Remote write took, s : 0.0000000
3 3 3 3 3 3 3 3 3 3
Remote read took, s : 0.0000000
Remote write took, s : 0.0010000
0 0 0 0 0 0 0 0 0 0
Remote read took, s : 0.0000000
Remote write took, s : 0.0000000
5 5 5 5 5 5 5 5 5 5
Remote read took, s : 0.0000000
Remote write took, s : 0.0010000
0 0 0 0 0 0 0 0 0 0
Remote read took, s : 0.0000000
Remote write took, s : 0.0000000
7 7 7 7 7 7 7 7 7 7
Remote read took, s : 0.0000000
Remote write took, s : 0.0000000
0 0 0 0 0 0 0 0 0 0
Remote read took, s : 0.0000000
Remote write took, s : 0.0000000
9 9 9 9 9 9 9 9 9 9
Remote read took, s : 0.0000000
Remote write took, s : 0.0000000
10 10 10 10 10 10 10 10 10 10
Remote read took, s : 0.0000000
Remote write took, s : 0.0000000
11 11 11 11 11 11 11 11 11 11
Remote read took, s : 0.0000000
Remote write took, s : 0.0000000
12 12 12 12 12 12 12 12 12 12
Remote read took, s : 0.0000000
Remote write took, s : 0.0000000
13 13 13 13 13 13 13 13 13 13
Remote read took, s : 0.0009990
Remote write took, s : 0.0000000
14 14 14 14 14 14 14 14 14 14
Remote read took, s : 0.0000000
Remote write took, s : 0.0000000
15 15 15 15 15 15 15 15 15 15
Remote read took, s : 0.0000000
Remote write took, s : 0.0000000
16 16 16 16 16 16 16 16 16 16
Remote read took, s : 0.0000000
Remote write took, s : 0.0000000
17 17 17 17 17 17 17 17 17 17
Remote read took, s : 13.3259735
Remote write took, s : 12.9360342
18 18 18 18 18 18 18 18 18 18
Remote read took, s : 13.8728924
Remote write took, s : 12.5950813
19 19 19 19 19 19 19 19 19 19
Remote read took, s : 14.5117950
Remote write took, s : 12.9060364
20 20 20 20 20 20 20 20 20 20
$

Note that:
- The values read from processors 2, 4, 6 and 8 are just wrong. They are all zero, but must be equal to the processor number.
- There are 16 cores in a node. Reads/writes to/from the first 16 processors are very fast, < 1 us. Read/write to/from processor 17, which is probably the first processor on another node, is still fast, but every processor beyond that takes over 10 seconds per read or write.

I've checked with both 13.0 and 14.0. I'm happy to provide further details of the MPI setup.

Thanks
Anton
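A side note on the measurement itself (an observation about the test, not part of the original report): cpu_time reports processor time consumed by the calling image, so time an image spends blocked in remote communication may not be counted at all, which could account for the many 0.0000000 readings above. A minimal sketch of wall-clock timing with the standard system_clock intrinsic, where the remote coarray access is only indicated by a comment:

```fortran
program wall_timing
implicit none
integer :: count0, count1, count_rate
real :: elapsed
! system_clock measures wall-clock time, unlike cpu_time, which
! reports processor time and can read near zero for an image that
! is blocked waiting on remote communication.
call system_clock(count0, count_rate)
! ... remote coarray read/write under test would go here ...
call system_clock(count1)
elapsed = real(count1 - count0) / real(count_rate)
write (*,"(a,f12.6)") "Elapsed wall time, s: ", elapsed
end program wall_timing
```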

