Besides just messing around with 13.2.0, I've created KOS patches for, built, and tested many configurations of GCC over the past couple of months: 4.7.4, 4.9.4, 9.3.0, 9.4.0, 9.5.0, 10.1.0, 10.2.0, 10.3.0, 10.4.0, 10.5.0, 11.1.0, 11.2.0, 11.3.0, 11.4.0, 12.1.0, 12.2.0, 12.3.0, 13.1.0, 13.2.0, gcc-rs from git, and GCC 14 from git. I also tested different versions of newlib against different compiler versions. Patches are currently available in the KOS GitHub gccdev branch, and at some point in the near future I'll PR those to the main KOS.
Given that I tested a couple dozen toolchain builds with a couple dozen flag sets, with and without LTO, etc., this required compiling several hundred ELF binaries to test. Compiling them isn't so much of a problem, but looping through pvrmark requires a few minutes per build, so in the interest of not having to run my Dreamcast for several days straight, I reduced the number of iterations per test while running pvrmark, compared to the last benchmark round. Like in the last thread, keep this in mind when comparing results against one another, as you can sometimes get large swings of 1000-3000 or more between runs of the same build. Some builds were only run once; some of the more interesting points (like the high-end scores) I ran multiple times and averaged out, so I could at least attempt to get a more accurate representation.
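For anyone who wants to repeat this kind of averaging, here's a minimal sketch of how repeated scores for one build could be summarized. The function name and the scores are made up for illustration; they are not my actual results.

```python
from statistics import mean

def summarize_runs(scores):
    """Return (average, spread) for repeated pvrmark scores of one build.

    spread is max - min, a rough indicator of run-to-run noise.
    """
    if not scores:
        raise ValueError("need at least one score")
    return mean(scores), max(scores) - min(scores)

# Hypothetical scores from three runs of the same build (not real data).
runs = [100000, 102000, 101000]
avg, spread = summarize_runs(runs)
print(f"average={avg} spread={spread}")
```

If the spread for a single build is as large as the gap between two builds, that gap is probably noise rather than a real difference.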
Here are my observations so far:
- GCC 4.7.4, the KallistiOS "legacy" configuration, is still useful to keep around for old code, as it will compile things that modern GCC versions will not. A lot of old code in KOS and kos-ports had to be updated to compile properly on modern GCC. Most of that code was poor code, showing that GCC has become more strict over time.
- As mentioned in the previous thread, GCC 4.7.4 at O3 with LTO was the fastest compiler configuration of all of them, though with a tremendous binary size. Using it with LTO seems to choke on things more than modern GCC, and in fact, right now KallistiOS won't build with LTO under 4.7.4. I had to patch KOS manually to run this benchmark. Without LTO enabled, it's no longer the fastest option and isn't worth using unless, as mentioned above, you need it for old code compatibility reasons. As it ages, using it can bring up compatibility issues as well. For example, building GCC 4.7.4 on macOS is broken right now (and was broken on *nix a while back before KOS added a compatibility patch to make it build).
- GCC 4.9.4 was of interest because it supports fast-math and is available in Compiler Explorer, but it's too buggy and generates screwed up code, so I abandoned working with it at all.
- Comparing GCC 9.3.0 toolchain with newlib 3.3 and binutils 2.34 vs. GCC 9.3.0 with newlib 4.3.0 and binutils 2.40 didn't really produce anything of interest. I don't think it's worth messing around with anything other than the latest.
- Comparing within a GCC generation, e.g. 9.3.0 vs. 9.5.0, 10.1.0 vs. 10.5.0, 11.1.0 vs. 11.4.0, etc. does show some trends up or down in performance, but it's always pretty negligible, so it's not worth doing anything more with... there's no magical obscure point release with uber performance hidden away. Might as well use the most up to date version within a generation. Therefore, to keep things simple, I've omitted all the data for these point releases in my charts and just left the interesting bits.
- Without LTO, GCC 9 and 10 are at an obvious disadvantage. With the best configurations, 9.x is about 2% slower than 10.x, and 10.x is about 3% slower than 11.4.0. 11.4.0 through 13.2.0 are all about the same.
- The story is different with LTO. With LTO enabled, the GCC 10 series has an obvious reproducible speed advantage over 9/11/12/13, but even still, the best 10.5.0 score is only a 1.39% increase in speed over the best 13.2.0 score.
Also: the gray X boxes for 4.7.4 are because 4.7.4 doesn't support -freorder-blocks-algorithm=simple
- I didn't really do much comparison of bin size vs. performance this time, but 13.x's -Os flags produce the smallest bins.
- I think that's everything....
The fastest 4.7.4-LTO configuration is 4.11% faster than the fastest 13.2.0-LTO configuration (most up-to-date).
The fastest 10.5.0-LTO configuration is 1.39% faster than the fastest 13.2.0-LTO configuration (most up-to-date).
If those speed differences are worth it to you, and in the case of 4.7.4 you don't mind the huge bin size, you can pursue those compilers. Without LTO, though, 4.7.4 and 10.5.0 are no good, so just stick to the latest 13.2.0.
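For reference, the percentages above are relative differences between best scores, where a higher pvrmark score means faster. A minimal sketch of the arithmetic, using a hypothetical function name and made-up scores rather than my measured numbers:

```python
def percent_faster(score, baseline):
    """Percent by which `score` exceeds `baseline` (higher score = faster)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (score - baseline) / baseline * 100.0

# Hypothetical best scores (not real data), chosen to give a round result.
baseline_score = 100000   # stand-in for the newest compiler's best score
candidate_score = 101390  # stand-in for an older compiler's best score
gain = percent_faster(candidate_score, baseline_score)
print(f"{gain:.2f}% faster")
```

With swings of 1000-3000 between runs of the same build, a gain this small is only meaningful if it holds up across repeated runs.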
Next...
I wanted to test with and without fast-math functions, but pvrmark doesn't really do anything to emit those instructions, really showing the limitations of using pvrmark for something like this. The next step in this journey is to create a benchmark that more accurately represents an actual workload. In chat we discussed using Quake demos or maybe Harlequest dev edition, which I backed.