An Introduction & Reproducibility Within HPC – Some Thoughts
Dr. Kevin G. McIver, Ph.D. | August 19, 2025
An Introduction
This is the first post on Corvid HPC’s HPC Blog.
I’m your host, Dr. Kevin G. McIver, Ph.D. My background is primarily in explicit Finite Element Analysis (FEA) and High-Performance Computing (HPC) system architecture and operation. I earned my Mechanical Engineering Ph.D. from Purdue University studying human injury; my dissertation is titled “The Application of High-Performance Computing to Create and Analyze Simulations of Human Injury”. In it, I wrote a good deal about how to optimize a system design for a particular application (translating MRI images into Finite Element simulations that were run and analyzed as part of a pseudo-DevOps CI/CD pipeline), and that’s what I do here at Corvid HPC for our major accounts. We collaborate with our customers to design HPC systems, which we manage and host at the appropriate security levels, that meet the customer’s needs as cost- and time-efficiently as possible.
In my professional career I’ve had the opportunity to work on a cross section of different HPC applications, spanning from image processing to advanced missile design work, and I spend more time than most digging into the “why” behind performance to find how we can make our customers successful. I also support maintenance of our internal CFD and FEA code toolchains, as well as externally provided codes (e.g., our footprint of NASA code builds), to make sure we’re using the best-performing compiler and MPI combinations for our systems. More on that in future posts!
Reproducibility within HPC—Some Thoughts
HPC systems are used to solve some of the most complicated scientific and engineering problems of our time. It’s common within HPC to validate code outputs by comparing to test data (for example, strain gauges or accelerometers in FEA, or wind tunnel test data within CFD). This step of validating codes is critical to ensure the codes accurately represent the physical phenomena they’re intended to model. Validating codes against test data in similar environments enables research and development work to proceed without access to costly wind tunnel time or physical testing, and enables rapid Design Space Exploration (DSE) that, depending on the security level of the concept, may not otherwise be easily possible.
This makes HPC systems incredibly useful for bleeding-edge technology development, sometimes in cases without good physical test data to validate the codes. In those cases, engineers may simulate a “denser” parameter set around the area of interest, instead of just the “envelope,” to gain more confidence that the system will perform adequately.
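To make the “denser” parameter set idea concrete, here is a minimal sketch (in Python with NumPy, purely my own illustration rather than output from any particular solver) of the difference between sampling only the corners of a notional Mach/angle-of-attack envelope and sampling a denser grid around an operating point of interest. All of the variable names and ranges are hypothetical.

```python
import itertools
import numpy as np

# Hypothetical flight envelope for a DSE study: Mach number and angle of attack.
mach_bounds = (0.6, 1.4)   # assumed envelope bounds
aoa_bounds = (-2.0, 8.0)   # degrees, assumed

# "Envelope" approach: simulate only the corners of the parameter space.
envelope_cases = list(itertools.product(mach_bounds, aoa_bounds))

# "Denser" approach: cluster cases around the operating point of interest
# (assumed here to be Mach 0.95, 4 deg AoA) to build confidence in that region.
mach_dense = np.linspace(0.90, 1.00, 5)
aoa_dense = np.linspace(3.0, 5.0, 5)
dense_cases = list(itertools.product(mach_dense, aoa_dense))

print(f"Envelope-only cases: {len(envelope_cases)}")                  # 4
print(f"Dense cases near the point of interest: {len(dense_cases)}")  # 25
```

The denser set obviously costs more core-hours, which is exactly the kind of workload an appropriately designed HPC system exists to absorb.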
In cases like that, the reproducibility of those simulations and how sensitive they are to the underlying numerical methods and hardware comes into question. “Numerical reproducibility” specifically refers to the ability to obtain identical results from a given simulation, even when run on different hardware or software configurations.
Some of our customers maintain on-premises HPC systems of their own, or come to us from existing cloud resources with pre-defined benchmark cases. In some cases, customers have expressed frustration that different results are seen between their existing resources and Corvid HPC’s systems. Numerical reproducibility is also a hot-button topic within the academic HPC world (see the roughly 19,900 results for “Numerical Reproducibility HPC” on Google Scholar).
The Dirty Secret
Achieving exact numerical reproducibility with HPC simulations in different environments ranges from impractical to impossible. Achieving exact numerical reproducibility with HPC simulations running in the same environment on the same hardware with the same core count can be done, depending on the software. Why is that? Within HPC, we’re dealing with a multi-level complex system solving a single problem. There is variability that can be caused by:
Floating-point arithmetic—the finite precision of floating-point numbers can lead to rounding errors that can accumulate over time, especially in iterative algorithms.
Parallelism—different parallel execution orders (or quantities of solver engines) can lead to slight variations in the order of operations, which can affect the final result. One common example is partitioning of unstructured grids, which is highly dependent on solver engine quantity.
Hardware-specific optimizations—compiler optimizations, cache effects, and other hardware-level factors can introduce non-deterministic behavior that depends on the CPU (or GPU) vendor; for example, specific branch prediction algorithms can affect load orders.
Algorithm choice—some algorithms are inherently more sensitive to numerical errors and parallel execution order than others.
Software implementation—the quality of the software implementation, including the choice of numerical libraries and compiler settings, can significantly impact reproducibility. This is particularly true when stochastic variables are involved (for example, some material models in explicit FE codes), where replicating runs even on the same hardware and node count can be a challenge.
Random number generation—random number generators that are not carefully seeded can produce different sequences, and therefore different results, across otherwise identical runs.
Condition-specific sensitivity—within certain regimes (e.g., high-velocity impacts in explicit finite element codes, or material models that include stochasticity), the conditions being simulated can themselves lead to high sensitivity that negatively impacts numerical reproducibility (though the specific phenomena that exacerbate the issue will differ by software).
Differences in inputs—for any comparison between systems, or between different hardware in the same cluster, use the exact same inputs (confirm with md5sums if needed).
Without diving too deeply into the above (to keep the reader awake), the impacts of these factors range from night-and-day differences (+/- 100% variability) to what is referred to as “floating-point rounding”: small differences in the decimal places past roughly the tenth digit.
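As a quick, self-contained illustration of the floating-point and ordering effects listed above, the Python snippet below (a toy example of my own, not taken from any particular solver) sums the same array of numbers after splitting it into different numbers of chunks, mimicking a reduction across different solver engine counts. The totals typically agree to ten or more significant digits and then drift apart, because floating-point addition is not associative.

```python
import numpy as np

rng = np.random.default_rng(42)
# A toy "field" with a wide range of magnitudes, similar in spirit to
# element-level quantities being reduced across MPI ranks.
values = rng.lognormal(mean=0.0, sigma=6.0, size=1_000_000)

def partitioned_sum(data, n_chunks):
    """Sum each chunk separately, then combine the partial sums,
    mimicking a reduction across n_chunks solver engines."""
    partials = [chunk.sum() for chunk in np.array_split(data, n_chunks)]
    return sum(partials)

for n in (1, 64, 96, 128):
    print(f"{n:4d} chunks: {partitioned_sum(values, n):.15e}")
# Exact values depend on your machine and library versions, but the pattern
# is the point: the leading digits agree and the trailing digits do not.
```

Scale that up to billions of operations per time step across thousands of ranks, and it becomes clear why bitwise-identical answers across configurations are so hard to come by.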
Small floating-point error typically does not change the engineering or scientific decisions that are the core result of physics-informed simulations. Ultimately, many codes used within HPC are not much more than engineering design tools or scientific codes that inform things like weather prediction. The onus is still on the engineers doing the work to ensure that the work is sufficiently grounded in reality to be useful before proceeding with the design of the next missile or the newest forecasting model. Seeing differences between one HPC and another, regardless of vendor, isn’t a surprise; it’s to be expected!
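When you do see differences between two runs, comparing against an engineering tolerance is far more useful than demanding bitwise equality. Here is a minimal sketch, assuming the quantities of interest from each run have already been extracted into arrays; the tolerance values are placeholders you would set from your own validation requirements.

```python
import numpy as np

def runs_agree(result_a, result_b, rtol=1e-6, atol=1e-9):
    """Check whether two runs agree to within an engineering tolerance,
    and report the worst relative difference observed."""
    a = np.asarray(result_a, dtype=float)
    b = np.asarray(result_b, dtype=float)
    denom = np.maximum(np.abs(a), np.abs(b))
    rel_diff = np.abs(a - b) / np.where(denom > 0.0, denom, 1.0)
    return bool(np.allclose(a, b, rtol=rtol, atol=atol)), float(rel_diff.max())

# Example: peak accelerations (hypothetical values) from the same case
# run on two different systems.
ok, worst_rel = runs_agree([981.234567, 45.678901], [981.234570, 45.678901])
print(ok, worst_rel)
```

Whether 1e-6 relative is “close enough” is exactly the kind of engineering judgement call discussed at the end of this post.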
In explicit finite element analysis in particular (and indeed in any code that relies on an unstructured mesh partitioned into a solvable grid at runtime), there can be extreme sensitivity around element inversions in higher-strain-rate problems based on how the grid is partitioned. This can mean that a simulation runs on 64 or 128 CPU cores, while the exact same simulation errors out at 96 cores.
The simulation input didn’t change, but the partitioned grid did: it was sliced into different segments with different ghost regions and different strains per element. This can also mean that re-running the exact same simulation on 95 cores makes it work, which is a frustrating part of being an HPC engineer. All of these differences stem from the combination of hardware and software, but they are typically more sensitive to the software being used.
Strategies for Success
Knowing that dirty secret of HPC, and that this phenomenon is a well-known problem within the field and user community, what can you do to minimize its effects on your simulations?
Use consistent solver engine counts. Depending on the code, this could be CPU or GPU quantity, or CPU socket quantity. This won’t solve everything, but particularly for aero databases against the same mesh in the same software for the same application, each and every case should be solved with the same number of solver engines to have a self-consistent database. Using variable quantities of CPUs can lead to skewed outcomes that may or may not have design relevant impacts.
Use software flags where possible. If you are comparing between systems, or within a system, and see high variability, refer to your software manual for the options that minimize the hardware-specific or other optimizations that can trigger these reproducibility issues. Certain libraries, like Intel’s Math Kernel Library (MKL) or Intel MPI, have environment variables (discussed at length in Intel’s own literature for MKL and in slides for their MPI) that can be set to help reproducibility; see the sketch after these strategies. Be aware that these flags can cause substantial slowdowns in simulation throughput depending on the software in question, and this tradeoff is an active area of study within HPC.
Use the same software version AND build. For analyses between systems or within a study, use the same software version (e.g., ANSYS 2024R2), and for compiled codes like FUN3D or Metacomp’s CFD++, the same software version AND build, which helps minimize differences originating from different MPI/library/compiler optimizations.
Use the same hardware. Using the same CPU/GPU architecture and specific SKUs to compare vendors can minimize differences between systems.
Above all else—use engineering judgement.
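To tie the strategies above together, here is a minimal “pre-flight” sketch in Python of what pinning down a run before launch might look like. MKL_CBWR and I_MPI_CBWR are Intel’s conditional numerical reproducibility controls referenced earlier (check Intel’s documentation for the values and behavior appropriate to your library versions); everything else, including the solver command, its --version flag, and the input file names, is a hypothetical placeholder for whatever your code actually provides.

```python
import hashlib
import os
import platform
import subprocess

# Reproducibility-oriented environment settings for Intel MKL / Intel MPI.
# Accepted values and their performance cost depend on your library versions;
# consult Intel's documentation before relying on them.
env = dict(os.environ)
env["MKL_CBWR"] = "AUTO,STRICT"   # conditional numerical reproducibility in MKL
env["I_MPI_CBWR"] = "1"           # reproducible collective operations in Intel MPI

def md5sum(path):
    """md5 of an input file, to confirm the same inputs are used everywhere."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()

inputs = ["model.k", "materials.inc"]   # hypothetical input deck
solver = ["my_solver", "--version"]     # hypothetical solver CLI and version flag
cores = 128                             # keep constant across a study (see above)

# Record exactly what is being run: hardware, solver build, inputs, core count.
print("CPU:", platform.processor())
build = subprocess.run(solver, capture_output=True, text=True, env=env)
print("Solver build:", build.stdout.strip())
for path in inputs:
    print(path, md5sum(path))
print("Cores:", cores)
```

From there, launching every case in the study with the same environment, the same core count, and the same verified inputs removes most of the avoidable variability; what remains is the floating-point noise discussed above.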
Ultimately, the outcome of every simulation will vary and is only a representation of reality, regardless of the phenomenon being studied or the software tool being used. All models are wrong, but some are useful; this adage is still accurate today. Small differences shouldn’t affect engineering outcomes, and unstable simulations can be worked by the analyst to ensure any new simulation environment produces useful results.
Until next time,
-KGM