Sounds good - part 2

If you haven't yet, read Part 1 first, an introduction to what sound is.

I left off last time with a description of sound as changes in pressure, and notes as waves within these. One natural progression from this is: given the samples of a sound, how can you tell what notes are playing?

As I'm doing psychology, I'll take a small diversion here to rephrase that to: how does your ear do it? That is, given vibrations from your ear drum, how does your brain tell what notes are playing?
Before answering that, it's good to first know about resonant frequency - as you can see from this video, items that make noises when hit tend to have a resonant frequency, that is, the rate at which they naturally vibrate in a cycle. Similar to pushing someone on a swing, if you push an object at this frequency, you strengthen the resonance.

So how is this useful in regards to ears? The leading theory is that there's a section of your ear with lots of tiny hairs along a tube. Based on the tube shape, plus the hair positions and sizes, the hairs each resonate at slightly different frequencies, and it is this resonance that gets converted into neuronal signals processed by your brain. So, if you hear an A440 pure tone, the hairs around the 440Hz area will vibrate, which you will then interpret as that sound (similar to how an object looks 'blue' if the blue cone receptors are activated).

Fourier series - Wikipedia
Computer ears
So how does that work in computers? Unfortunately, computers don't have a large ear tube full of resonating hairs to calculate all the frequencies in parallel - even cochlear implants, which imitate this mechanism when connected to a brain, still don't quite match that ability. Computers do use a similar principle though; in maths it's called the Fourier transform. You can think of it along the same lines - for each frequency, you collect up the 'pushes' in the sound samples, and see which ones strengthen vs. cancel out.
[For those with more of a maths background: sine waves are orthogonal to each other, so you're changing your basis from values of x to values of sin(x) - same as with position vs. momentum in quantum physics!]

Fast ways of calculating this have been found (the Fast Fourier Transform), so that algorithm is what gets run, in series, on computers. After applying it over windows of samples through time, rather than a single pressure reading we instead have lots of readings - one for the 'strength' of each of the frequencies examined. Because of this, the representation looks a bit different, and is referred to as a...

Spectrogram (provided in go-sound by cq.Spectrogram).
Cage Piano Sonata II spectrogram from this youtube video
To read this, remember that left-right is time, just like in the previous post. The difference is that up-down is now pitch (up = higher notes), and brighter colour = stronger, so a white horizontal line is a single note held over time. For a live video of the go-sound example, see this G+ post.

Now that we have a way to see what notes are being played, there are a few things worth mentioning:

- Harmonics: When you strike a tuning fork, the result is pretty close to a pure sine wave. When you play a key on a piano however, or a fret on a guitar, or a violin, or sing a 'single' note, you get that tone but also higher pitches known as harmonics. For example, when you pluck a guitar string, it vibrates at a certain frequency, but also at multiples of that frequency, with a bunch of other random things thrown in. In fact, it's the harmonics that tend to make an instrument sound the way it does.

- Time/Frequency tradeoff: Unfortunately, due to the nature of the Fourier transform, we can't have both good time resolution (i.e. know exactly when certain frequencies start and end) and good frequency resolution (e.g. distinguish 550Hz from 550.1Hz). It's like a musical version of Heisenberg uncertainty, which means you have to balance how much of each you want.

- Non-linear in amplitude/energy: A brief mention here of a concept you may have heard before: decibels (dB). Given the spectrogram, you can convert the values to the decibel scale (dB = 20 * log10(value)) to find out how 'loud' each of the constituent tones is.

- Non-linear in frequency: As mentioned in the last post, the interesting thing with frequencies is the ratio (double = octave, 3:2 = perfect fifth, ...). The Fourier transform doesn't handle this very well, as it spaces frequencies out evenly. You get more information at higher frequencies, but most of it is useless (you don't really need both 1760Hz and 1764Hz - you can't tell the difference), and there's also wasted data for the low notes (if you halve the frequency, you need twice as many samples before you can identify it).

This leads me to the modification that go-sound provides, which is...

Constant Q
From this paper on pitch shifting with Constant Q
The way to get around this (while still having fast processing) is to still perform the pushing trick, but on a logarithmic frequency scale - this is how your ear does it, and it's similar to, say, doing it once per key on a piano. You can then also halve the information needed as you go down octaves, while still maintaining the same quality of detection (this is the Q that is constant! ...ish). This Constant-Q algorithm is the main analysis that go-sound provides (constantq.go), along with an inverse so you can convert the result back into a sound.

A good question to ask now is: with constant Q information telling you how 'strong' each semitone is throughout a piece of music, what can be done now? Thankfully, the answer is: lots!

This is reaching the limit of what is currently available in go-sound, but future plans are to use this data to:

- Harmonic analysis: If you have a recording of a note being played on an instrument, you can see what its harmonics are (i.e. how powerful the root tone is, and how audible the other tones are). This can be used to identify the instruments in a song (e.g. 'piano' harmonics should be distinguishable from 'guitar' harmonics), but also to create new sounds - even to the extent of: given a root note (A440), add the harmonics learnt from the electric guitar used to play Stairway to Heaven.

- Pitch modification: If you think about a spectrogram, the y axis (up/down) is pitch. By relabelling that axis (i.e. shifting up or down by a constant amount), you are essentially modifying the pitch. There are a few problems due to filling in the gaps in the image above, but the progress is promising (I uploaded this example to soundcloud, and some improvements are in the pipeline). One thought from this might be: what about shifting on the x axis (left/right)? Unfortunately this is the time axis, so it just changes where the song starts, but there are other related ways to speed up / slow down a song while maintaining pitch.
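With constant-Q bins one per semitone, the relabelling step really is just sliding each column of strengths up or down by n bins. A toy sketch of that part (the hard part, resynthesising a good-sounding result, is what the problems above are about):

```go
package main

import "fmt"

// shiftSemitones moves every bin of one spectrogram column up by n slots
// (down if n is negative), modifying the pitch by n semitones. Bins
// shifted off the end are dropped; vacated bins are left silent.
func shiftSemitones(column []float64, n int) []float64 {
	shifted := make([]float64, len(column))
	for i := range shifted {
		if src := i - n; src >= 0 && src < len(column) {
			shifted[i] = column[src]
		}
	}
	return shifted
}

func main() {
	column := []float64{0, 1, 0, 0, 0}             // one note, in bin 1
	fmt.Println(shiftSemitones(column, 2))          // note moves to bin 3
}
```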

- Automatic transcription: This is one of the holy grails of music analysis - to be given a song, and be able to write out the sheet music that produced it. This is something that Mozart famously could do at the age of 14, so computers still lose to humans here, but with harmonic analysis and a few other tools (melody detection, beat detection, ...), it may be possible eventually. Constant Q lets you detect how 'strong' each particular note on a piano is, so finding the most likely current key, for example, is not hard (if C, E and G are the strongest, chances are you're playing in C). Combining all these tools may lead to quite good transcriptions.
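The "C, E and G are the strongest" step can be sketched as: fold the per-semitone strengths into the 12 pitch classes, then pick the top few. The note strengths below are invented to illustrate a C major chord; a real detector would take them from the Constant Q output:

```go
package main

import (
	"fmt"
	"sort"
)

// strongestPitchClasses sorts pitch classes by strength and returns the
// top n - a toy chord/key detector.
func strongestPitchClasses(strengths map[string]float64, n int) []string {
	names := make([]string, 0, len(strengths))
	for name := range strengths {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		return strengths[names[i]] > strengths[names[j]]
	})
	return names[:n]
}

func main() {
	// Invented strengths for one moment of a song.
	strengths := map[string]float64{
		"C": 0.9, "D": 0.1, "E": 0.7, "F": 0.2, "G": 0.8, "A": 0.15, "B": 0.05,
	}
	fmt.Println(strongestPitchClasses(strengths, 3)) // C, G and E: C major
}
```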

- Reversal: A final example is more in the humor department, but someone has already asked: if you know how 'strong' each note on a piano is for a given sound, can you play those notes at the right time and recreate the sound? What if the original sound wasn't a piano song, but something else - like someone speaking? I leave you with a video of researchers and their talking piano:

Hopefully this made sense - it was a lot more complex than part 1 (intentionally), but I feel these are the more interesting/fun/exciting parts, so ideally the complexity is worth it. I'm not planning a part 3 (unless maybe once all these things are implemented), but as always, let me know if there's anything else you'd be interested in hearing about.
