• V0ldek@awful.systems
    link
    fedilink
    English
    arrow-up
    25
    ·
    edit-2
    9か月前

    This is a really weird comment. Assembly is not faster than C, that’s a nonsensical statement, C compiles down to assembly. LLVM’s optimizations will most likely outperform or directly match whatever hand-crafted assembly you write. Why would BEQ 1000 be “considerably faster” than if (x == y) goto L_1000;? This collapses even further if you consider any application larger than a few hundred lines of code, any sensible compiler is going to beat you on optimizations if you try to write hand-crafted assembly. Try loading up assembly code and manually performing intraprocedural optimizations, lol, there’s a reason every compiled language goes through an intermediate representation.

    Saying that C# is slower than C is also nonsensical, especially now that C# has built-in PGO it’s very likely it could outperform an application written in C. C#'s JIT compiler is not somehow slower because it’s flexible in terms of hardware, if anything that’s what makes it fast. For example you can write a vectorized loop that will be JIT-compiled to the ideal fastest instruction set available on the CPU running the program, whereas in C or assembly you’d have to manually write a version for each. There’s no reason to think that manual implementation would be faster than what the JIT comes up with at runtime, though, especially with PGO.

    It’s kinda like you’re saying that a V12 engine is faster than a Ferrari and that they are both faster than a spaceship because the spaceship doesn’t have wheels.

    I know you’re trying to explain this to a non-technical person but what you said is so terribly misleading I cannot see educational value in it.

    • iltg@sh.itjust.works
      link
      fedilink
      English
      arrow-up
      1
      arrow-down
      7
      ·
      9か月前

      your statement is so extreme it gets nonsensical too.

      compilers will usually produce higher optimized asm than writing it yourself, but there is room to improve usually. it’s not impossible that deepseek team obtained some performance gains hand-writing some hot sections directly in assembly. llvm must “play it safe” because doesn’t know your use case, you do and can avoid all safety checks (stack canaries, overflow checks) or cleanups (eg, make memory arenas rather than realloc). you can tell LLVM to not do those, but it may happen in the whole binary and not be desirable

      claiming c# gets faster than C because of jit is just ridicolous: you need yo compile just in time! the runtime cost of jitting + the resulting code would be faster than something plainly compiled? even if c# could obtain same optimization levels (and it can’t: oop and .net runtime) you still pay the jit cost, which plainly compiled code doesn’t pay. also what are you on with PGO, as if this buzzword suddenly makes everything as fast as C?? the example they give is “devirtualization” of interfaces. seems like C just doesn’t have interfaces and can just do direct calls? how would optimizing up to C level make it faster than C?

      you just come off as a bit entitled and captured in MS bullshit claims

      • bitofhope@awful.systems
        link
        fedilink
        English
        arrow-up
        8
        ·
        9か月前

        GPU programs (specifically CUDA, although other vendors’ stacks are similar) combine code for the host system in a conventional programming language (typically C++), and code for the GPU written in CUDA language. Even if the C++ code for the host system can be optimized with hand written assembly, it’s not going to lead to significant gains when the performance bottleneck is on the GPU side.

        The CUDA compiler translates the high level CUDA code into something called PTX, machine code for a “virtual ISA” which is then translated by the GPU driver into native machine language for the proprietary instruction set of the GPU. This seems to be somewhat comparable to a compiler intermediate representation, such as LLVM. It’s plausible that hand written PTX assembly/IR language could have been used to optimize parts of the program, but that would be somewhat unusual.

        For another layer or assembly/machine languages, technically they could have reverse engineered the actual native ISA of the GPU core and written machine code for it, bypassing the compiler in the driver. This is also quite unlikely as it would practically mean writing their own driver for latest-gen Nvidia cards that vastly outperforms the official one and that would be at least as big of a news story as Yet Another Slightly Better Chatbot.

        While JIT and runtimes do have an overhead compared to direct native machine code, that overhead is relatively small, approximately constant, and easily amortized if the JIT is able to optimize a tight loop. For car analogy enjoyers, imagine a racecar that takes ten seconds to start moving from the starting line in exchange for completing a lap one second faster. If the race is more than ten laps long, the tradeoff is worth it, and even more so the longer the race. Ahead of time optimizations can do the same thing at the cost of portability, but unless you’re running Gentoo, most of the C programs on your computer are likely compiled for the lowest common denominator of x86/AMD64/ARMwhatever instruction sets your OS happens to support.

        If the overhead of a JIT and runtime are significant in the overall performance of the program, it’s probably a small program to begin with. No shame to small programs, but unless you’re running it very frequently, it’s unlikely to matter if the execution takes five or fifty milliseconds.

        • froztbyte@awful.systems
          link
          fedilink
          English
          arrow-up
          5
          ·
          9か月前

          For another layer or assembly/machine languages, technically they could have reverse engineered the actual native ISA of the GPU core and written machine code for it, bypassing the compiler in the driver. This is also quite unlikely as it would practically mean writing their own driver for latest-gen Nvidia cards that vastly outperforms the official one

          yeah, and it’d be a pretty fucking immense undertaking, as it’d be the driver and the application code and everything else (scheduling, etc etc). again, it’s not impossible, and there’s been significant headway across multiple parts of industry to make doing this kind of thing more achievable… but it’s also an extremely niche, extremely focused, hard-to-port thing, and I suspect that if they actually did do this it’d be something they’d be shouting about loudly in every possible PR outlet

          a look at every other high-optimisation field, from the mechanical sympathy lot stemming from HFT etc all the way through to where that’s gotten to in modern usage of FPGAs in high-perf runtime envs also gives a good backgrounder in the kind of effort cost involved for this shit, and thus gives me some extra reasons to doubt claims kicking around (along with the fact that everyone seems to just be making shit up)

    • justOnePersistentKbinPlease@fedia.io
      link
      fedilink
      arrow-up
      1
      arrow-down
      9
      ·
      9か月前

      I have have crafted assembly instructions and have made it faster than the same C code.

      Particular to if statements, C will do things push and pull values from the stack which takes a small but occasionally noticeable amount of cycles.

      • khalid_salad@awful.systems
        link
        fedilink
        English
        arrow-up
        10
        ·
        edit-2
        9か月前

        python, what are you doing?"

        idk, I’m written in C, it does things push and pull values from the stack, have you tried assembly, it’s faster

      • self@awful.systems
        link
        fedilink
        English
        arrow-up
        8
        ·
        9か月前

        Particular to if statements, C will do things push and pull values from the stack which takes a small but occasionally noticeable amount of cycles.

        holy fuck. llvm in shambles

        • bitofhope@awful.systems
          link
          fedilink
          English
          arrow-up
          6
          ·
          edit-2
          9か月前

          Meanwhile I’m reverse engineering some very much not performance sensitive video game binary patcher program some guy made a decade ago and Ghidra interprets a string splitting function as a no-op because MSVC decided calling conventions are a spook and made up a new one at link time. And it was right to do that.

          EDIT: Also me looking for audio data from another old video game, patiently waiting for my program to take about half an hour on my laptop every time I run it. Then I remember to add --release to cargo run and while the compilation takes three seconds longer, the runtime shrinks to about ten seconds. I wonder if the above guy ever tried adding -O2 to his CFLAGS?