What happens, though, if the signal is NOT steady-state, and music is not?
This observation isn't complete. Music (or any program material, for that matter), even though it is not steady-state, can be accurately described as the superposition (summation) of a set of sine (or cosine) waves, each with a different amplitude and phase relative to the others. Human hearing can only sense such waves if they fall in the range of about 20 Hz to 20,000 Hz (less than that for many of us). It is those component waves, each of which is steady-state over a period of time, that determine the digitizing requirements and must satisfy the Nyquist theorem.
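To make that idea concrete, here is a minimal sketch (Python with NumPy, purely illustrative) that builds a short waveform by summing a few sine components with different amplitudes and phases. The component frequencies, amplitudes, and phases are arbitrary choices for the example, not anything taken from a real recording.

```python
import numpy as np

fs = 48000                      # sample rate in Hz (arbitrary choice for this sketch)
t = np.arange(0, 0.01, 1 / fs)  # 10 ms of time

# Arbitrary set of components: (frequency Hz, amplitude, phase radians)
components = [(440, 1.0, 0.0), (880, 0.5, 1.2), (1320, 0.25, -0.7)]

# Superposition: the composite waveform is just the sum of the sine components
signal = sum(a * np.sin(2 * np.pi * f * t + p) for f, a, p in components)

print(signal[:5])  # first few samples of the summed waveform
```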
The description of successive samples being taken at different points in the cycle, "filling in" the waveform, is illustrative but not necessary. If you were sampling at 48 kHz, for example, and the material consisted of a 24 kHz sine wave (assuming both frequencies were exact), it would sample the same two points of the cycle over and over again. However, part of the "decoding" process is a low-pass filter, which would reconstruct a 24 kHz sine wave out of those two points.
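A quick way to see the "same two points over and over" behavior is to sample an exact 24 kHz sine at exactly 48 kHz and look at the raw sample values. This sketch (Python/NumPy) assumes both frequencies are exact, as in the example, and uses an arbitrary phase offset so the samples don't land on the zero crossings.

```python
import numpy as np

fs = 48000          # sample rate, Hz
f = 24000           # signal frequency, Hz (exactly fs / 2)
phase = 0.9         # arbitrary phase offset, radians

n = np.arange(16)                                   # 16 consecutive samples
samples = np.sin(2 * np.pi * f * n / fs + phase)    # sampled 24 kHz sine

# Every cycle hits the same two points: the values simply alternate in sign
print(np.round(samples, 4))
# e.g. [ 0.7833 -0.7833  0.7833 -0.7833 ... ]
```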
That example is useful, being the worst case. It does show one limitation of the process that I alluded to in a previous post. The sampling and reconstruction process introduces some amplitude changes that are frequency dependent, based on the formula sin(x)/x. They are most pronounced near the Nyquist frequency and diminish as the program content goes lower in frequency. In this case, if the phase were just right and the frequencies exact, the amplitude could be zero! However, this is the ONLY case that is that extreme. Any frequency less than the Nyquist frequency exhibits significantly less amplitude loss.
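Here is a small numerical check of that claim (Python/NumPy), looking only at the peak of the raw sample values, which is a rough stand-in for what the reconstruction has to work with. At exactly half the sample rate the captured amplitude depends entirely on the phase and can drop to zero; a little below that, the samples walk through different points of the cycle and the peak stays close to full amplitude. The specific frequencies and phases are illustrative choices.

```python
import numpy as np

fs = 48000
n = np.arange(4800)  # 100 ms of samples

def peak_sample(f, phase):
    """Largest absolute sample value of a unit-amplitude sine at frequency f, given phase."""
    return np.max(np.abs(np.sin(2 * np.pi * f * n / fs + phase)))

for phase in (0.0, 0.3, np.pi / 2):
    print(f"24 kHz (exactly Nyquist), phase {phase:0.2f}: peak {peak_sample(24000, phase):0.3f}")
    print(f"20 kHz (below Nyquist),   phase {phase:0.2f}: peak {peak_sample(20000, phase):0.3f}")
# At 24 kHz the peak ranges from 0.000 (phase 0) up to 1.000 (phase pi/2);
# at 20 kHz it stays near 1.0 regardless of phase.
```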
This is one (of several) reasons why oversampling is useful. IMO 44.1 kHz is marginal if one expects to accurately reconstruct frequencies up to 20 kHz; sin(x)/x variations are noticeable below 20 kHz. OTOH, if the expected upper limit is 15 kHz, then 44.1 kHz isn't as bad. Note also that it was chosen as an "odd" frequency. The worst frequencies for sin(x)/x errors are ones that are exact integer submultiples of the Nyquist frequency, and using an "odd" sampling rate makes it less likely that a real-world signal would fall on one of those frequencies. 48 kHz gives significantly more margin and, although not "odd", is high enough to have minimal impact in the audible range. Of course, higher rates like 96 kHz move the sin(x)/x problem further out. Note that in the example I gave, the 24 kHz signal that could totally disappear under the right conditions is well above 20 kHz. That means it does not contribute to any audible experience. It also means that it is probably going to be filtered out before digitization anyway, so it is of no concern.
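For a rough feel for the numbers, here is a sketch that evaluates the sin(x)/x factor at a few audio frequencies for 44.1 kHz, 48 kHz, and 96 kHz sampling, taking x = pi*f/fs (the usual form of that roll-off). How much of it actually shows up in practice depends on the converter and any compensation filtering, so treat these as illustrative figures only.

```python
import numpy as np

def sinx_over_x_db(f, fs):
    """sin(x)/x attenuation in dB at signal frequency f for sample rate fs, with x = pi*f/fs."""
    x = np.pi * f / fs
    return 20 * np.log10(np.sin(x) / x)

for fs in (44100, 48000, 96000):
    for f in (15000, 20000):
        print(f"fs = {fs/1000:5.1f} kHz, f = {f/1000:4.1f} kHz: {sinx_over_x_db(f, fs):6.2f} dB")
# Roughly -1.7 dB at 15 kHz and -3.2 dB at 20 kHz for 44.1 kHz sampling,
# shrinking as the sample rate goes up.
```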
Another method that is sometimes used in processing audio signals is the Fourier transform (in one of several forms). The Fourier transform also has some limitations of its own related to the OP's point. It transforms blocks of information (in time) between a time-domain signal (the waveform we hear) and its frequency-domain equivalent (the sum of cosine waves I mentioned above). That transformation is only exact if there is an integer relationship between the block of time being analyzed (which is related to the sample rate) and the frequencies in it. In most real-world cases there is no such relationship. When that happens, "leakage" occurs, which means that energy is "observed" in frequency bins that didn't really have any. Leakage is minimized by shaping the signal with a filter called a window. Several mathematical windows have been created, each of which has a different set of trade-offs. But the bottom line is that a Fourier transform does introduce some of the artifacts the OP mentioned, due to the lack of steady-state in the signal. Personally, I avoid using the Fourier transform in working with audio signals because of these limitations, although there are some things that can be done with it that are difficult or impossible without it.
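Here is a small sketch of that leakage effect (Python/NumPy): a 1 kHz tone analyzed in a block that does not contain an integer number of its cycles smears energy across many FFT bins, and applying a Hann window (one common choice, with its own trade-offs) concentrates the energy again. The block length, sample rate, and the -60 dB threshold are arbitrary choices for illustration.

```python
import numpy as np

fs = 48000
N = 1000                       # block length: 1000 samples is not an integer number
f = 1000.0                     # of cycles of a 1 kHz tone at 48 kHz, so leakage occurs
t = np.arange(N) / fs
x = np.sin(2 * np.pi * f * t)

def bins_above(spectrum, floor_db=-60):
    """Count FFT bins within floor_db of the peak (a crude measure of leakage spread)."""
    mag = np.abs(spectrum)
    mag_db = 20 * np.log10(mag / mag.max() + 1e-12)
    return np.sum(mag_db > floor_db)

rect = np.fft.rfft(x)                      # no window ("rectangular")
hann = np.fft.rfft(x * np.hanning(N))      # Hann window applied before the transform

print("bins within 60 dB of peak, rectangular:", bins_above(rect))
print("bins within 60 dB of peak, Hann:       ", bins_above(hann))
# The windowed transform confines the energy to far fewer bins.
```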
But simply digitizing an analog (music) signal, if done properly, with a high enough sampling rate and proper filters, is capable of creating a signal that is indistinguishable from the original to the human ear. The problem is doing it properly. Especially in the early days (such as when the CD was invented), the necessary hardware, if available at all, was expensive. Extra data storage is required. In the commercial (especially consumer) world there is a strong temptation to cut corners. How much can I shave before the customer notices the difference? How many customers will notice the difference? How many will be willing to pay the difference, not to notice the difference? So we end up with things like MP3 that are clearly not accurate, but good enough for most people most of the time. This is no different from, say, the trade-offs in designing a turntable. How much rumble is acceptable? How much wow and flutter are caused by the (lack of) accuracy in punching the center hole in the disk? Etc.