Today we are going to talk through our five-month journey conducting independent analysis and training-focused benchmarking of the MI300X, the H100, and the H200, engaging with both NVIDIA and AMD. We will provide a detailed overview of the numerous low-level benchmarks we ran; see the table of contents for a summary. Furthermore, we will compare the total cost of ownership of NVIDIA and AMD GPUs while factoring in performance. Ultimately, much of what we are doing amounts to an open, comprehensive public recommendation to AMD on what it needs to do to be competitive and fix its software issues, informed by five months of submitting and squashing bugs. The problem is not just immature software; AMD needs to change how it does development.
In short, when comparing NVIDIA's GPUs to AMD's MI300X, we found that the MI300X's on-paper advantage was not realized, due to shortcomings in AMD's publicly released software stack and a lack of testing on AMD's part.
AMD's software experience is riddled with bugs, rendering out-of-the-box training on AMD impossible. We were hopeful that AMD could emerge as a strong competitor to NVIDIA in training workloads, but, as of today, this is unfortunately not the case. The CUDA moat has yet to be crossed by AMD, owing to AMD's weaker-than-expected software quality assurance (QA) culture and its challenging out-of-the-box experience. And as fast as AMD tries to fill in the CUDA moat, NVIDIA engineers are working overtime to deepen it with new features, libraries, and performance updates.