Adding a new dimension to DFT calculations of solids ...
Hardware Benchmarks for
  single cpu performance (serial lapw1c)
  parallel cpu performance (mpi-parallel lapw1_mpi)

Below you can find some timings of serial and parallel benchmarks run on various platforms and using different compilers. At present Intel processors (at least Core i7) seem to be the fastest processors, but modern AMD cpus seem to have caught up (at quite a good price). Using 8 cores of an Intel I9 14900K (6.0 GHz) processor, the benchmark time for a single k-point comes down to 2.7 seconds (while 6 cores of an Intel I7 3930K need 14 sec, and running 6 jobs in k-parallel mode on that processor still takes 49 sec for 6 k-points).

Please note that on multi-core (or multi-cpu) systems the performance can decrease drastically when running N lapw1 jobs on N cores in parallel, due to the limited memory-bus speed (Multi-core cpus). Thus, the memory bandwidth seems to be most important for the performance of a "k-parallel" scf-cycle and thus for the real "throughput".

The mpi-benchmark demonstrates that on a single 8-core node the sequential version (OpenMP parallel) is clearly faster than the mpi-version (29.4 sec versus 51.0 sec wall time, a factor of about 1.7). Thus mpi only pays off if you have more than one node (or a very large shared-memory machine).
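The throughput comparison above comes down to simple arithmetic: divide the wall time by the number of k-points finished in that time. The following is only a minimal sketch of that calculation, not part of the benchmark itself. The function name seconds_per_kpoint is made up for illustration, and the timings are the Intel I7 3930K numbers quoted on this page (14 sec for one job with 6 threads, 37 sec for one single-thread job, 49 sec for 6 single-thread jobs).

```python
def seconds_per_kpoint(wall_seconds, n_kpoints):
    """Effective wall time per k-point when n_kpoints finish within wall_seconds."""
    if n_kpoints <= 0:
        raise ValueError("n_kpoints must be positive")
    return wall_seconds / n_kpoints


# Intel I7 3930K, timings taken from the tables on this page.
# One job using 6 OpenMP threads: 1 k-point in 14 sec.
threaded = seconds_per_kpoint(14.0, 1)

# Six single-thread jobs in k-parallel mode: 6 k-points in 49 sec.
k_parallel = seconds_per_kpoint(49.0, 6)

# Hypothetical ideal case: six jobs that do not slow each other down
# would finish 6 k-points in the single-job time of 37 sec.
ideal = seconds_per_kpoint(37.0, 6)

# Fraction of the ideal k-parallel throughput that is actually reached.
efficiency = ideal / k_parallel

print(f"1 job, 6 threads : {threaded:.1f} sec/k-point")
print(f"6 jobs, 1 thread : {k_parallel:.1f} sec/k-point")
print(f"ideal 6 jobs     : {ideal:.1f} sec/k-point")
print(f"k-parallel efficiency: {efficiency:.2f}")
```

With these numbers the k-parallel run reaches about 8.2 sec/k-point, which is better than the 14 sec of the threaded run but short of the roughly 6.2 sec/k-point that six fully independent jobs would give. The gap is the memory-bus effect described above.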
Serial benchmark: NMAT=3481, complex

Intel Core I9-14900k (8 cores@6.0 + 12 cores@3.4GHz), oneapi-2021.1.1 (wall times)
  k=1, omp=1, 10.8 sec/k-point
  k=1, omp=2,  6.3 sec/k-point
  k=1, omp=4,  3.8 sec/k-point (58.0 for 16k)
  k=1, omp=8,  2.7 sec/k-point (39.8 for 16k)

Intel Core I9-12900k (8 cores@3.2 + 8 cores@2.4GHz), oneapi-2023
  k=1, omp=1, 13.3 sec/k-point
  k=1, omp=2,  7.4 sec/k-point
  k=1, omp=4,  4.6 sec/k-point
  k=1, omp=8,  3.7 sec/k-point

Intel Core i9-9900K 4.70GHz         15.4 sec   ifort 21.1 (1 job with 1 thread)
Intel Core i9-9900K 4.70GHz          9.1 sec   ifort 21.1 (1 job with 2 threads)
Intel Core i9-9900K 4.70GHz          7.1 sec   ifort 21.1 (1 job with 4 threads)
Intel Core i9-9900K 4.70GHz          6.4 sec   ifort 21.1 (1 job with 8 threads)

AMD Ryzen 5 5600G, 6 cores, CPU max MHz: 4464 (January 2023)
  gfortran 11.3 + openblas zenp-r0.3.21:
    k=1, omp_global=1, 20.9 sec/k-point
    k=1, omp_global=2, 13.2 sec/k-point
    k=1, omp_global=4,  9.5 sec/k-point
    k=1, omp_global=6,  8.8 sec/k-point
  OneAPI (ifort 2021.8.0 + mkl 2023.0.0):
    k=1, omp_global=1, 29.7 sec/k-point
    k=1, omp_global=2, 17.4 sec/k-point
    k=1, omp_global=4, 12.0 sec/k-point
    k=1, omp_global=6, 10.0 sec/k-point
  +++++++++++++++
  test_case.output1_gfortran: TIME HAMILT (WALL) = 5.7, HNS = 1.6, DIAG = 13.3
  test_case.output1_ifort:    TIME HAMILT (WALL) = 4.0, HNS = 2.7, DIAG = 21.3
  --- from these lines one sees that ifort is still faster than gfortran for AMD (HAMILT),
  --- but openblas is faster than mkl (HNS+DIAG) for AMD
  --- probably mkl switches off some optimizations for AMD

Intel i9-10980XE 3.0 GHz (18 cores, Hyperthreading on, Ifort 19.0.1.144)
  k=1, omp_global=1,  Total 27.1 sec, 27.1 sec/k-point
  k=1, omp_global=2,  Total 14.7 sec, 14.7 sec/k-point
  k=1, omp_global=4,  Total  9.1 sec,  9.1 sec/k-point
  k=1, omp_global=8,  Total  8.0 sec,  8.0 sec/k-point
  k=1, omp_global=18, Total  6.9 sec,  6.9 sec/k-point

Intel i7-8700K 3.7 GHz (Hyperthreading off, Ifort 19.0.1.144)
  omp_global=1, Total 20.4 sec
  omp_global=2, Total 12.4 sec
  omp_global=3, Total 11.4 sec
  omp_global=4, Total 10.2 sec
  omp_global=5, Total 10.2 sec
  omp_global=6, Total 10.5 sec

Intel(R) Core(TM) i7-7820X 3.6 GHz (Hyperthreading off, Ifort 19.0.1.144)
  k=1, omp_global=1, Total 24.6 sec, 24.6 sec/k-point
  k=1, omp_global=2, Total 14.6 sec, 14.6 sec/k-point
  k=1, omp_global=3, Total 11.2 sec, 11.2 sec/k-point
  k=1, omp_global=4, Total 10.1 sec, 10.1 sec/k-point
  k=1, omp_global=5, Total  9.0 sec,  9.0 sec/k-point
  k=1, omp_global=6, Total  9.0 sec,  9.0 sec/k-point
  k=1, omp_global=7, Total  7.9 sec,  7.9 sec/k-point
  k=1, omp_global=8, Total  8.0 sec,  8.0 sec/k-point

Intel Core i7-7820X 3.60GHz         23.5 sec   ifort 21.1 (Hyperthreading on) (1 job with 1 thread)
Intel Core i7-7820X 3.60GHz         13.4 sec   ifort 21.1 (1 job with 2 threads)
Intel Core i7-7820X 3.60GHz          8.2 sec   ifort 21.1 (1 job with 4 threads)
Intel Core i7-7820X 3.60GHz          6.1 sec   ifort 21.1 (1 job with 8 threads)
Intel Core i7-3930K 3.20GHz           36 sec   composerxe-2013.4.183 (1 job with 1 thread)
Intel Core i7-3930K 3.20GHz           23 sec   composerxe-2013.4.183 (1 job with 2 threads)
Intel Core i7-3930K 3.20GHz           16 sec   composerxe-2013.4.183 (1 job with 4 threads)
Intel Core i7-3930K 3.20GHz           14 sec   composerxe-2013.4.183 (1 job with 6 threads)
IBM p460 Power7                       53 sec   (1 thread only)
Intel Core i7-2600 3.40 GHz           37 sec   composerxe-2011.4.191 (1 job with 1 thread)
Intel Core i7-2600 3.40 GHz           26 sec   composerxe-2011.4.191 (1 job with 2 threads)
Intel Core i7-2600 3.40 GHz           22 sec   composerxe-2011.4.191 (1 job with 4 threads)
Intel Core i7 980x, 3.33GHz           65 sec   ifort11 (+mkl) (1 job with 1 thread)
Intel Core i7 920, 2.66 GHz           91 sec   ifort11 (+mkl) (1 job with 1 thread)
Intel Core i7 920, 2.66 GHz           57 sec   ifort11 (+mkl) (1 job with 2 threads)
Intel Core i7 920, 2.66 GHz           40 sec   ifort11 (+mkl) (1 job with 4 threads)
P4 dual-Xeon, 3.6 GHz                165 sec   ifort9 + mkl8 (1 job with 1 thread!)
P4 dual-Xeon, 3.6 GHz                125 sec   ifort9 + mkl8 (1 job with 2 threads!)
bi-Xeon 5320 (overcl 2.67GHz)        119 sec   ifort9.1 + mkl9.0 (1 job with 1 thread)
bi-Xeon 5320 (overcl 2.67GHz)         90 sec   ifort9.1 + mkl9.0 (1 job with 2 threads)
bi-Xeon 5320 (overcl 2.67GHz)         76 sec   ifort9.1 + mkl9.0 (1 job with 4 threads)
bi-Xeon 5320 (overcl 2.67GHz)         69 sec   ifort9.1 + mkl9.0 (1 job with 8 threads)
P4 Core2 Duo E6600, 2.4 GHz          128 sec   ifort10.1+mkl9.1, OMP_NUM_THREADS=1
P4 Core2 Duo E6600, 2.4 GHz          103 sec   ifort10.1+mkl9.1, OMP_NUM_THREADS=2
Xeon X3210 Quadcore 2.13GHz          140 sec   ifort10.1+cmkl10.0 1 job, 1 thread
Xeon X3210 1066 MHz FSB               88 sec   ifort10.1+cmkl10.0 1 job, 2 threads
Xeon X3210                           112 sec   ifort10.1+cmkl10.0 2 jobs, 2 threads
Xeon X3210                           228 sec   ifort10.1+cmkl10.0 4 jobs, 1 thread
IBM 52A 1.90GHz Power5+ (1 cpu)      135 sec   xlf10.1, -q64 -O5, ESSL4.2
IBM 52A (-"-, 2 cpus)                 83 sec   - " -
IBM 52A (-"-, 2 cpus, SMT=on)         80 sec   - " -
Itanium2 (1.6GHz, SGI Altix 3700)    122 sec   ifort9.0 + mkl8.0, libgoto_itanium2_64p-r1.00
Itanium2 (-"-, 2 threads)             90 sec   ifort9.0 + mkl8.0, libgoto_itanium2_64p-r1.00
AMD-Opteron, single cpu, 2.4Ghz      190 sec   ifort(9.1.40) + libgoto_opteron64p-r1.09.so
AMD-Opteron, single cpu, 2.8Ghz      167 sec   ifort(10.1.11) + libgoto_opteron64p-r1.23.so

Serial benchmark with parallel jobs (tests the "real" performance under full load with a "k-parallel" job): NMAT=3481, complex

Intel Core I9-14900k (8 cores@6.0 + 12 cores@3.4GHz), oneapi-2021.1.1 (wall times)
  k=2,  omp=1, 13.3 sec/2k-point
  k=2,  omp=2,  9.1 sec/2k-point (55.0 for 16k)
  k=2,  omp=4,  7.0 sec/2k-point (37.6 for 16k)
  k=2,  omp=8,  7.0 sec/2k-point (65.2 for 16k)
  k=4,  omp=1, 16.1 sec/4k-point
  k=4,  omp=2, 12.3 sec/4k-point (39.6 for 16k)
  k=4,  omp=4, 12.1 sec/4k-point (41.3 for 16k)
  k=4,  omp=8, 13.3 sec/4k-point
  k=8,  omp=1, 21.9 sec/8k-point
  k=8,  omp=2, 22.3 sec/8k-point
  k=8,  omp=4, 23.3 sec/8k-point
  k=8,  omp=8, 26.6 sec/8k-point
  k=16, omp=1, 40.3 sec/16k-point
  k=16, omp=2, 43.3 sec/16k-point
  k=16, omp=4, 45.9 sec/16k-point
  k=16, omp=8,    - sec/16k-point

Intel Core I9-12900k (8 cores@3.2 + 8 cores@2.4GHz), oneapi-2023
  k=2,  omp=8, 13.5 sec (108.0 sec/16k-points or 6.8 sec/k-point)
  k=4,  omp=4, 17.1 sec ( 68.4 sec/16k-points or 4.3 sec/k-point)
  k=8,  omp=2, 29.9 sec ( 59.8 sec/16k-points or 3.8 sec/k-point)
  k=16, omp=1, 55.8 sec ( 55.8 sec/16k-points or 3.5 sec/k-point)

Intel i9-10980XE 3.0 GHz (18 cores, Hyperthreading on, 1 thread, Ifort 19.0.1.144)
  k=1  27.1 sec; k=2 25.6 sec; k=4 31.2 sec; k=6 31.9 sec;
  k=9  44.0 sec (4.9 sec/k-point); k=15 62.8 sec (4.2 sec/k-point); k=18 77.5 sec (4.3 sec/k-point)

Intel(R) Core(TM) i7-8700K 3.7 GHz (1 thread)
  k=1 20.4 sec; k=2 22.2 sec; k=3 25.8 sec; k=6 40.8 sec (6.8 sec/k-point)

Intel Core i7-7820X 3.60GHz ifort 21.1 (1 thread, Hyperthreading off)
  k=1 24.6 sec; k=2 24.4 sec; k=4 27.0 sec; k=8 34.4 sec (4.3 sec/k-point)

Intel Core i7-7820X 3.60GHz ifort 21.1 (1 thread, Hyperthreading on)
  1 job: 24 sec; 2 jobs: 26 sec; 4 jobs: 27 sec; 8 jobs: 31 sec (3.9 sec/k-point)

Intel Core i7-7820X 3.60GHz ifort 21.1 (2 threads)
  1 job: 14 sec; 2 jobs: 15 sec; 4 jobs: 19 sec (4.9 sec/k-point)

Intel Core i7-3930K 3.20GHz composerxe-2013.4.183 (1 thread)
  1 job: 37 sec; 2 jobs: 38 sec; 4 jobs: 47 sec; 6 jobs: 49 sec

Intel Core i7-3930K 3.20GHz composerxe-2013.4.183 (2 threads)
  1 job: 23 sec; 3 jobs: 39 sec; 6 jobs: 50 sec

Intel Core i7-2600 3.40 GHz composerxe-2011.4.191 (1 thread)
  1 job: 37 sec; 2 jobs: 43 sec; 4 jobs: 62 sec

Intel Core i7 980x, 3.33 GHz ifort11 (+mkl)
  1 job (1 thread) 65 sec; 6 jobs (1 thread) 89 sec

Intel Core i7 920, 2.66 GHz ifort11 (+mkl)
  Jobs   1 Thread   2 Threads   4 Threads
  1          99         62          41
  2         100         70
  4         104

1333 FSB Dual-Clovertown X5355 @ 2.66GHz, 667 Memory
  Jobs   1 Thread   2 Threads   4 Threads   8 Threads
  1         132         88          66          62
  2         145        104          98
  4         177        163

1600 FSB Dual-Harpertown 2.8 GHz with 800 MHz Memory
  Jobs   1 Thread   2 Threads   4 Threads
  1         134         83          67
  2         123         94
  4         148        134

AMD-Opteron Dual CPU/Dual Core 2.8Ghz (IBM 3455), 8Gb RAM DDR2 667MHz
Ifort 10.1.11 + libgotoopteron64p-r1.23.so
  Jobs   1 Thread   2 Threads   4 Threads
  1         167        120         101
  2         168        122
  4         174

A "historical list"
can be found here.

MPI-parallel benchmark: NMAT=11571, real, full diagonalization

Intel Core I9-14900k (8 cores@6.0 + 12 cores@3.4GHz), oneapi-2021.1.1 (wall times)
  serial code, OMP=8:
    mpi-benchmark.output1: TIME HAMILT (CPU)  = 28.8, HNS = 44.5, HORB = 0.0, DIAG = 83.7
    mpi-benchmark.output1: TIME HAMILT (WALL) =  3.7, HNS =  5.6, HORB = 0.0, DIAG = 19.8
    > SUM OF WALL CLOCK TIMES: 29.4 (INIT = 0.3 + K-POINTS = 29.1)
  mpi-code (Elpa) OMP=1, 8 mpi-jobs:
    mpi-benchmark.output1_1: TIME HAMILT (CPU)  = 4.3, HNS = 3.3, HORB = 0.0, DIAG = 42.7
    mpi-benchmark.output1_1: TIME HAMILT (WALL) = 4.3, HNS = 3.3, HORB = 0.0, DIAG = 42.7
    > SUM OF WALL CLOCK TIMES: 51.0 (INIT = 0.3 + K-POINTS = 50.6)

Intel Core i7-7820X 3.60GHz ifort 21.1, elpa2020.10, 20Gb infiniband, openmpi (cores-per-node_nodes: total cores)
  serial 4_1:     TIME HAMILT (WALL) = 35.4, HNS = 26.9, HORB = 0.0, DIAG = 113.2, SYNC = 0.0
                  > SUM OF WALL CLOCK TIMES: 176.2 (INIT = 0.6 + K-POINTS = 175.6)
  serial 8_1:     TIME HAMILT (WALL) = 24.4, HNS = 21.9, HORB = 0.0, DIAG = 81.9, SYNC = 0.0
                  > SUM OF WALL CLOCK TIMES: 129.0 (INIT = 0.6 + K-POINTS = 128.4)
  4_1 (4 cores):  TIME HAMILT (WALL) = 32.9, HNS = 17.8, HORB = 0.0, DIAG = 106.7, SYNC = 0.2
                  ===> TOTAL CPU TIME: 157.8 (INIT = 0.5 + K-POINTS = 157.3)
  8_1 (8 cores):  TIME HAMILT (WALL) = 19.1, HNS = 10.0, HORB = 0.0, DIAG = 74.0, SYNC = 0.1
                  ===> TOTAL CPU TIME: 103.4 (INIT = 0.5 + K-POINTS = 102.9)
  8_2 (16 cores): TIME HAMILT (WALL) = 10.1, HNS = 6.0, HORB = 0.0, DIAG = 48.3, SYNC = 0.0
                  ===> TOTAL CPU TIME: 64.9 (INIT = 0.5 + K-POINTS = 64.4)
  8_4 (32 cores): TIME HAMILT (WALL) = 5.8, HNS = 3.5, HORB = 0.0, DIAG = 40.0, SYNC = 0.0
                  ===> TOTAL CPU TIME: 49.9 (INIT = 0.5 + K-POINTS = 49.4)
  8_8 (64 cores): TIME HAMILT (WALL) = 3.1, HNS = 2.1, HORB = 0.0, DIAG = 24.0, SYNC = 0.1
                  ===> TOTAL CPU TIME: 29.9 (INIT = 0.6 + K-POINTS = 29.3)

P4 dual-Xeon, 3.6 GHz, Infiniband, ifort9+cmkl8 (first number: jobs/node; 2nd number: nodes).
  aurora_serial: TIME HAMILT (CPU) = 346.7, HNS = 198.6, DIAG = 1188.6
  aurora_serial: TOTAL CPU TIME: 1737.0 (INIT = 3.1 + K-POINTS = 1733.9)
  aurora_1_2:    TIME HAMILT (CPU) = 169.9, HNS = 145.2, DIAG = 991.1
  aurora_1_2:    TOTAL CPU TIME: 1309.6 (INIT = 3.1 + K-POINTS = 1306.5)
  aurora_1_4:    TIME HAMILT (CPU) = 88.1, HNS = 78.4, DIAG = 514.4
  aurora_1_4:    TOTAL CPU TIME: 684.1 (INIT = 3.0 + K-POINTS = 681.1)
  aurora_1_8:    TIME HAMILT (CPU) = 44.7, HNS = 41.6, DIAG = 304.6
  aurora_1_8:    TOTAL CPU TIME: 394.3 (INIT = 3.1 + K-POINTS = 391.2)
  aurora_1_16:   TIME HAMILT (CPU) = 25.0, HNS = 23.7, DIAG = 196.8
  aurora_1_16:   TOTAL CPU TIME: 248.8 (INIT = 3.1 + K-POINTS = 245.7)
  aurora_1_32:   TIME HAMILT (CPU) = 14.7, HNS = 13.8, DIAG = 137.6
  aurora_1_32:   TOTAL CPU TIME: 169.4 (INIT = 3.1 + K-POINTS = 166.3)
  aurora_2_1:    TIME HAMILT (CPU) = 194.8, HNS = 171.4, DIAG = 1554.2
  aurora_2_1:    TOTAL CPU TIME: 1923.8 (INIT = 3.2 + K-POINTS = 1920.7)
  aurora_2_2:    TIME HAMILT (CPU) = 103.2, HNS = 90.1, DIAG = 816.6
  aurora_2_2:    TOTAL CPU TIME: 1013.3 (INIT = 3.1 + K-POINTS = 1010.2)
  aurora_2_4:    TIME HAMILT (CPU) = 46.5, HNS = 48.0, DIAG = 427.6
  aurora_2_4:    TOTAL CPU TIME: 525.4 (INIT = 3.1 + K-POINTS = 522.3)
  aurora_2_8:    TIME HAMILT (CPU) = 25.5, HNS = 27.6, DIAG = 287.9
  aurora_2_8:    TOTAL CPU TIME: 344.4 (INIT = 3.1 + K-POINTS = 341.2)
  aurora_2_16:   TIME HAMILT (CPU) = 15.2, HNS = 15.4, DIAG = 179.6
  aurora_2_16:   TOTAL CPU TIME: 213.5 (INIT = 3.1 + K-POINTS = 210.4)
  aurora_2_32:   TIME HAMILT (CPU) = 9.8, HNS = 9.4, DIAG = 127.7
  aurora_2_32:   TOTAL CPU TIME: 150.3 (INIT = 3.2 + K-POINTS = 147.1)

SUN AMD-2.4GHz dual-core/dual-cpu, Infiniband, SUNstudio10 (first number: jobs/node; 2nd number: nodes).
  luna-serial:  TIME HAMILT (CPU) = 763.3, HNS = 265.0, DIAG = 2255.3
  luna-serial:  TOTAL CPU TIME: 3286.3 (INIT = 2.5 + K-POINTS = 3283.8)
  luna-mpi_2_1: TIME HAMILT (CPU) = 384.5, HNS = 199.2, DIAG = 1504.9
  luna-mpi_2_1: TOTAL CPU TIME: 2091.4 (INIT = 2.5 + K-POINTS = 2088.9)
  luna-mpi_4_1: TIME HAMILT (CPU) = 196.6, HNS = 105.2, DIAG = 785.2
  luna-mpi_4_1: TOTAL CPU TIME: 1089.9 (INIT = 2.5 + K-POINTS = 1087.3)
  luna-mpi_2_2: TIME HAMILT (CPU) = 193.7, HNS = 103.7, DIAG = 752.3
  luna-mpi_2_2: TOTAL CPU TIME: 1052.6 (INIT = 2.5 + K-POINTS = 1050.1)
  luna-mpi_4_2: TIME HAMILT (CPU) = 102.2, HNS = 58.5, DIAG = 546.2
  luna-mpi_4_2: TOTAL CPU TIME: 709.8 (INIT = 2.6 + K-POINTS = 707.2)
  luna-mpi_4_4: TIME HAMILT (CPU) = 53.8, HNS = 31.6, DIAG = 251.6
  luna-mpi_4_4: TOTAL CPU TIME: 340.0 (INIT = 2.6 + K-POINTS = 337.4)
  luna-mpi_4_8: TIME HAMILT (CPU) = 31.6, HNS = 18.3, DIAG = 176.9
  luna-mpi_4_8: TOTAL CPU TIME: 229.7 (INIT = 2.6 + K-POINTS = 227.1)

Xeon X3210 2.13GHz Quad Core, 1066 MHz FSB (first number: jobs/node; 2nd number: nodes).
  1 MPI, 1 Thread   1423 Secs   HAMILT (CPU)  = 223.0, HNS = 174.1, DIAG = 1021.4
  2 MPI, 1 Thread   1242 Secs   HAMILT (WALL) = 120.3, HNS = 129.8, DIAG = 988.4
  4 MPI, 1 Thread   1175 Secs   HAMILT (WALL) = 80.1, HNS = 116.8, DIAG = 977.9

20 nodes AMD-Opteron Dual CPU/Dual Core 2.8Ghz (IBM 3455), 8Gb RAM DDR2 667MHz, Voltaire 20Gbps Infiniband
Ifort 10.1.11 + libgotoopteron64p-r1.23.so # Intel Cluster MKL, MPI-CH1.2
Even more detailed data can be found here.
   1 core/ 1 node: TIME HAMILT (CPU) = 314.8, HNS = 341.4, HORB = 0.0, DIAG = 1990.4
                   TOTAL CPU TIME: 2648.7 (INIT = 2.1 + K-POINTS = 2646.7)
   4 core/ 4 node: TIME HAMILT (CPU) = 79.6, HNS = 91.9, HORB = 0.0, DIAG = 483.8
                   TOTAL CPU TIME: 657.5 (INIT = 2.1 + K-POINTS = 655.4)
   8 core/ 8 node: TIME HAMILT (CPU) = 40.0, HNS = 48.6, HORB = 0.0, DIAG = 268.4
                   TOTAL CPU TIME: 359.3 (INIT = 2.1 + K-POINTS = 357.2)
  16 core/16 node: TIME HAMILT (CPU) = 22.0, HNS = 26.7, HORB = 0.0, DIAG = 159.9
                   TOTAL CPU TIME: 210.8 (INIT = 2.1 + K-POINTS = 208.7)
   4 core/ 1 node: TIME HAMILT (CPU) = 85.7, HNS = 96.0, HORB = 0.0, DIAG = 668.0
                   TOTAL CPU TIME: 851.8 (INIT = 2.1 + K-POINTS = 849.7)
   8 core/ 2 node: TIME HAMILT (CPU) = 40.5, HNS = 50.8, HORB = 0.0, DIAG = 344.8
                   TOTAL CPU TIME: 438.3 (INIT = 2.1 + K-POINTS = 436.3)
  12 core/ 3 node: TIME HAMILT (CPU) = 27.6, HNS = 35.8, HORB = 0.0, DIAG = 247.1
                   TOTAL CPU TIME: 312.6 (INIT = 2.1 + K-POINTS = 310.5)
  16 core/ 4 node: TIME HAMILT (CPU) = 22.1, HNS = 27.9, HORB = 0.0, DIAG = 194.3
                   TOTAL CPU TIME: 246.6 (INIT = 2.1 + K-POINTS = 244.5)
  20 core/ 5 node: TIME HAMILT (CPU) = 18.1, HNS = 22.6, HORB = 0.0, DIAG = 165.3
                   TOTAL CPU TIME: 208.3 (INIT = 2.1 + K-POINTS = 206.3)

* libgoto blas libraries are available from: http://www.tacc.utexas.edu/resources/software/

©2001 by P. Blaha and K. Schwarz