Threadprivate allocatable performance issues

Hello,

I have a parallel section of code that uses a THREADPRIVATE ALLOCATABLE array of a derived type which, in turn, contains other ALLOCATABLE variables:

MODULE MYMOD
TYPE OBJ
  REAL, DIMENSION(:), ALLOCATABLE :: foo1
  REAL, DIMENSION(:), ALLOCATABLE :: foo2
END TYPE

TYPE(OBJ), DIMENSION(:), ALLOCATABLE ::  priv

TYPE(OBJ), DIMENSION(:), ALLOCATABLE ::  shared

!$OMP THREADPRIVATE(priv)

END MODULE

Each thread uses the variable "priv" as a buffer for heavy calculations, and the result is then copied to a shared variable.

MODULE MOD2

CONTAINS

SUBROUTINE DOSTUFF()

!$OMP PARALLEL PRIVATE(i,n,dim)

CALL ALLOCATESTUFF()

CALL HEAVYSTUFF()

CALL COPYSTUFFONSHARED()

!$OMP END PARALLEL

END SUBROUTINE DOSTUFF

SUBROUTINE ALLOCATESTUFF()
USE MYMOD, ONLY : priv

ALLOCATE(priv(n))
DO i=1,n
  ALLOCATE(priv(i)%foo1(dim))
  ALLOCATE(priv(i)%foo2(dim))
ENDDO

END SUBROUTINE ALLOCATESTUFF

SUBROUTINE COPYSTUFFONSHARED()
USE MYMOD
...
END SUBROUTINE COPYSTUFFONSHARED

SUBROUTINE HEAVYSTUFF()
USE MYMOD, ONLY : priv
...
END SUBROUTINE HEAVYSTUFF

END MODULE

I'm running this code on a machine with two CPUs, each with 10 cores, and I see a sharp loss of performance beyond 10 threads: the code scales linearly up to 10 threads, and past that point the slope of the speed-up curve drops sharply. I get very similar behavior on a machine with 8 CPUs, each with 4 cores, but there the loss appears at around 5 or 6 threads.

As an order of magnitude, "n" for priv is small (less than 10), whereas "dim" for each "foo" is on the order of a few million.

My guess is that there is some kind of bottleneck in memory access caused by the interconnect between the CPUs. The strange thing is that if I measure the time spent in HEAVYSTUFF and COPYSTUFFONSHARED separately, it is HEAVYSTUFF that slows down, whereas COPYSTUFFONSHARED shows an "almost linear" speed-up.
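A minimal sketch of how the two phases could be timed per thread, assuming OMP_GET_WTIME from OMP_LIB. The module name TIMING_SKETCH, the routine TIME_PHASES, and the array operations inside it are hypothetical stand-ins for HEAVYSTUFF and COPYSTUFFONSHARED, not the real routines:

```fortran
MODULE TIMING_SKETCH
  USE OMP_LIB
  IMPLICIT NONE
CONTAINS

  ! Returns per-thread wall-clock times for the two phases.
  ! Slot k of each output array belongs to thread number k-1.
  SUBROUTINE TIME_PHASES(dim, t_heavy, t_copy)
    INTEGER, INTENT(IN) :: dim
    DOUBLE PRECISION, ALLOCATABLE, INTENT(OUT) :: t_heavy(:), t_copy(:)
    REAL, ALLOCATABLE :: buf(:), shared_buf(:,:)
    INTEGER :: tid, nt
    DOUBLE PRECISION :: t0, t1, t2

    nt = OMP_GET_MAX_THREADS()
    ALLOCATE(t_heavy(nt), t_copy(nt), shared_buf(dim, nt))
    t_heavy = 0.0D0   ! stays zero for slots whose thread was not started
    t_copy  = 0.0D0

    !$OMP PARALLEL PRIVATE(buf, tid, t0, t1, t2)
    tid = OMP_GET_THREAD_NUM() + 1
    ALLOCATE(buf(dim))          ! each thread allocates its own buffer
    buf = 1.0

    t0 = OMP_GET_WTIME()
    buf = SQRT(buf + 2.0)       ! hypothetical stand-in for HEAVYSTUFF
    t1 = OMP_GET_WTIME()
    shared_buf(:, tid) = buf    ! hypothetical stand-in for COPYSTUFFONSHARED
    t2 = OMP_GET_WTIME()

    t_heavy(tid) = t1 - t0
    t_copy(tid)  = t2 - t1
    DEALLOCATE(buf)
    !$OMP END PARALLEL
  END SUBROUTINE TIME_PHASES

END MODULE TIMING_SKETCH
```

Calling TIME_PHASES(dim, t_heavy, t_copy) with increasing thread counts and comparing MAXVAL(t_heavy) against MAXVAL(t_copy) is one way to see which phase stops scaling.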

The question is: am I assured that the memory in a THREADPRIVATE derived type will be actually allocated locally on the CPU to which the thread belongs? If so, what else can be the explanation of this behavior? Otherwise, how can I force data locality?
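One thing that could be tried, sketched below under stated assumptions: the operating system uses a first-touch page placement policy (the default on Linux), and threads are pinned to cores (for example with OMP_PROC_BIND=true). In that case, having each thread write to its own buffers right after allocating them should place the pages on that thread's memory node. The names FIRSTTOUCH_SKETCH, ALLOCATE_AND_TOUCH and RELEASE are hypothetical, and this is not a guarantee given by the OpenMP standard:

```fortran
MODULE FIRSTTOUCH_SKETCH
  IMPLICIT NONE

  TYPE OBJ
    REAL, DIMENSION(:), ALLOCATABLE :: foo1
    REAL, DIMENSION(:), ALLOCATABLE :: foo2
  END TYPE

  TYPE(OBJ), DIMENSION(:), ALLOCATABLE :: priv
  !$OMP THREADPRIVATE(priv)

CONTAINS

  ! Meant to be called by every thread from inside a parallel region.
  SUBROUTINE ALLOCATE_AND_TOUCH(n, dim)
    INTEGER, INTENT(IN) :: n, dim
    INTEGER :: i

    ALLOCATE(priv(n))
    DO i = 1, n
      ALLOCATE(priv(i)%foo1(dim))
      ALLOCATE(priv(i)%foo2(dim))
      ! Assumption: ALLOCATE only reserves address space; the first write
      ! is what maps the pages, on the node of the thread doing the write.
      priv(i)%foo1 = 0.0
      priv(i)%foo2 = 0.0
    ENDDO
  END SUBROUTINE ALLOCATE_AND_TOUCH

  ! Frees this thread's copy of priv (allocatable components included).
  SUBROUTINE RELEASE()
    IF (ALLOCATED(priv)) DEALLOCATE(priv)
  END SUBROUTINE RELEASE

END MODULE FIRSTTOUCH_SKETCH
```

The idea is to call ALLOCATE_AND_TOUCH(n, dim) from inside the parallel region in place of ALLOCATESTUFF, so that the thread that later runs HEAVYSTUFF is also the one that first writes to its buffers.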

Thank you
