I learned Z80 assembly back when the cutting edge of technology was a ZX Spectrum, and 68k assembly when I upgraded to an Amiga. That knowledge served me quite well in my early career in industrial automation - it was hard real-time coding on eZ80s and 65C02 processors, but the knowledge transfers.
Back in the day, when input was mapped straight into a memory location and the display output was another memory location, assembly seemed like magic. Read the byte that corresponds to the right-hand middle row of the keyboard, check whether a certain bit is set in that byte, and if it is, a key is held down. Call your subroutine that copies a sequence of bytes into a known location. Boom, pressing a key updates the screen. Awesome.
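In C terms the whole trick is just reading and writing fixed addresses. A minimal sketch, with made-up addresses and the function names my own (every 8-bit machine had its own memory map - the Spectrum's display file really did start at 0x4000, though its keyboard was actually read through an I/O port):

```c
#include <stdint.h>

/* Hypothetical memory-mapped addresses, for illustration only. */
#define KEY_ROW  (*(volatile uint8_t *)0x5C00)
#define SCREEN   ((volatile uint8_t *)0x4000)

/* Keyboard rows were typically active-low: a cleared bit means
   the key in that position is held down. */
static int key_held(unsigned bit)
{
    return (KEY_ROW & (1u << bit)) == 0;
}

/* "Call your subroutine that copies a sequence of bytes into a
   known location" - blit a glyph straight into display memory. */
static void blit(const uint8_t *glyph, unsigned len, unsigned offset)
{
    for (unsigned i = 0; i < len; i++)
        SCREEN[offset + i] = glyph[i];
}
```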
Modern assembly (x64 and the like) has masses of rules about stack pointer alignment, which you deal with so often you might as well write a macro for it. Since the OS doesn't let you write to system memory any more (a good thing), you need to make system calls and call library functions to do the same things. You do that so often that you might as well write a macro for that as well. Boom, now your assembly looks almost exactly like C. Might as well learn that instead.
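For a concrete taste: here's the modern equivalent of "poke the screen" - a sketch assuming x86-64 Linux, where even the rawest route to output goes through the kernel's defined gateway rather than a memory address:

```c
#define _GNU_SOURCE        /* glibc wants this for syscall() */
#include <unistd.h>
#include <sys/syscall.h>

/* No poking video memory on a modern OS: to get bytes on screen you
   ask the kernel. This is write(2) invoked as a raw syscall - the
   same thing an assembly version's macro would wrap. Linux-only. */
int main(void)
{
    const char msg[] = "hello\n";
    syscall(SYS_write, 1, msg, sizeof msg - 1);
    return 0;
}
```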
In fact, that's almost the purpose of C - a more readable, somewhat portable assembly language. Experienced C developers will know which sequence of opcodes to expect from any given language construct. It's quite a simple mapping in that regard.
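A sketch of that mapping (the exact opcodes depend on the compiler and flags; the comments show the sort of thing a typical x86-64 compiler emits, and the function is made up for illustration):

```c
/* A C construct with, in comments, roughly the x86-64 instructions
   a typical compiler emits for it. Details vary; the point is that
   the mapping is close to one-to-one. */
int clamp_sum(int a, int b)
{
    int sum = a + b;     /* lea  eax, [rdi+rsi]        */
    if (sum > 100)       /* cmp  eax, 100 ; jle .done  */
        sum = 100;       /* mov  eax, 100              */
    return sum;          /* .done: ret                 */
}
```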
It's handy to know a little assembly occasionally, but unless you're writing, e.g., crypto implementations, which must take exactly the same time and power to execute regardless of the input, it's impractical for almost any purpose nowadays.
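The classic example is a constant-time comparison - here's the idea sketched in C, though real implementations often drop to assembly precisely because a compiler is free to optimize the constant-time property away:

```c
#include <stddef.h>
#include <stdint.h>

/* Compare two buffers in time that depends only on their length,
   never on where they differ - unlike memcmp, which may bail out
   at the first mismatching byte and so leak timing information. */
int ct_equal(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= a[i] ^ b[i];   /* accumulate differences, no branching */
    return diff == 0;
}
```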

SIMD is pretty simple really, but it's been a standard-ish feature in CPUs for some 30 years now, and modern compilers are still only "just about able to sometimes" use SIMD if you've got a very simple loop with fixed endpoints that suits it. It's one thing you might still fall back to writing assembly for - the FFmpeg developers had an article not too long ago about getting a 10% speed improvement by writing all the SIMD by hand.
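This is roughly where the boundary sits: a loop like the scalar tail below is simple enough that a compiler might vectorize it on its own, while the intrinsics make the SIMD explicit. A sketch in C using SSE intrinsics (x86-only, function name my own):

```c
#include <stddef.h>
#include <immintrin.h>

/* Sum two float arrays, four lanes at a time with SSE. A compiler
   may well auto-vectorize the plain scalar loop by itself - this is
   what "doing it by hand" looks like when you don't trust it to. */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats  */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb)); /* 4 adds at once */
    }
    for (; i < n; i++)   /* scalar tail for the leftover elements */
        dst[i] = a[i] + b[i];
}
```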
Using an NPU means recognising algorithms that can be broken down into parallelizable, networkable steps with information passing between cells. Basically, you’re playing a game of TIS-100 with your code. It’s fragile and difficult, and there’s no chance that your compiler will do that automatically.
The best thing to hope for is that some standard libraries implement it, and then we can all benefit. It's an okay tool for 'jobs that can be broken down into separate cells that interact': some kinds of image processing, maybe things like liquid flow simulations. There's only a very small overlap, though, between 'things that are just algorithms the main CPU would do better' and 'things that can be broken down into many, many simple steps that a GPU would do better' where an NPU really makes sense.
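For a sense of what 'separate cells that interact' means, here's the shape of such a job sketched in plain C - one step of a 1D diffusion, where each output cell depends only on its immediate neighbours, so every cell can be computed independently:

```c
#include <stddef.h>

/* One step of a 1D diffusion: each new cell is a weighted average
   of itself and its two neighbours. No output depends on any other
   output, so the whole row can be computed in parallel - the kind
   of structure NPU/GPU-style hardware wants. Plain C for clarity. */
void diffuse_step(float *next, const float *cur, size_t n)
{
    if (n < 2)
        return;
    for (size_t i = 1; i + 1 < n; i++)
        next[i] = 0.25f * cur[i - 1]
                + 0.50f * cur[i]
                + 0.25f * cur[i + 1];
    next[0] = cur[0];          /* hold the boundaries fixed */
    next[n - 1] = cur[n - 1];
}
```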