Perceiving Frequencies

Fourier transforms are interesting things - a way of converting repetition to single values and back (roughly). Most commonly known from music (see my earlier post), they also show up, oddly, as the relationship between position and momentum in quantum mechanics. While audio perception takes patterns over time and converts them to notes, our eyes do likewise with patterns over space - called spatial frequencies. Just as with music, there are ways to convert an image into its spatial frequency representation and back again.

Part 1: Frequency swapped images
As an example, on the left is a (greyscale) photo of George Bush, and on the right is the frequency version. Note that each frequency is a complex number, representing both how strong the frequency is (the power - top right greyscale image) and its phase - i.e. where in the cycle it starts (bottom right). As phase is cyclical, I'll represent it as colours, from red -> green -> blue -> back to red. For colour pictures, you can do this three times, once per colour channel.
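The split into power and phase can be sketched in a few lines of numpy. This is a minimal illustration rather than the code behind the figures: the function names are made up for the example, "power" is taken here to mean the magnitude of each complex value, and a random array stands in for the photo.

```python
import numpy as np

def power_and_phase(image):
    """Split a 2D greyscale image into per-frequency power and phase."""
    spectrum = np.fft.fft2(image)   # complex array, same shape as the image
    power = np.abs(spectrum)        # how strong each spatial frequency is
    phase = np.angle(spectrum)      # where in its cycle it starts, in radians
    return power, phase

def from_power_and_phase(power, phase):
    """Rebuild the image from power and phase (the reverse direction)."""
    spectrum = power * np.exp(1j * phase)
    return np.real(np.fft.ifft2(spectrum))

# Stand-in for a real photo: random pixel values (an assumption for the demo).
rng = np.random.default_rng(0)
image = rng.random((64, 64))
power, phase = power_and_phase(image)
restored = from_power_and_phase(power, phase)
```

Going there and back recovers the original pixels up to rounding error, which is what makes the conversion bidirectional.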

There's a question that comes up when looking at this: what happens when I combine the power data from one image with the phase data of another, and convert that back into an image? We can test this exact result, by including a second image:
As you can see, the 'combined' photo looks like the image that provided the phase. This is a known result, but somewhat surprising - remember that phase is not how strong the pattern is (what sensors usually measure...), but rather where in the cycle it is. You could say we comprehend images based on how the frequencies within them align.
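For anyone wanting to try the swap themselves, here is a rough sketch of the idea, assuming both images are greyscale arrays of the same shape. The name `swap_power_and_phase` is hypothetical, and random arrays stand in for the two photos.

```python
import numpy as np

def swap_power_and_phase(power_source, phase_source):
    """Combine the power of one image with the phase of another."""
    spec_power = np.fft.fft2(power_source)
    spec_phase = np.fft.fft2(phase_source)
    # Keep the magnitude from the first image and the angle from the second.
    combined = np.abs(spec_power) * np.exp(1j * np.angle(spec_phase))
    # The imaginary part is only rounding error for real inputs, so drop it.
    return np.real(np.fft.ifft2(combined))

# Random stand-ins for the two photos (an assumption for the demo).
rng = np.random.default_rng(1)
first = rng.random((32, 32))
second = rng.random((32, 32))
mixed = swap_power_and_phase(first, second)
```

Feeding the same image in as both sources gives that image back, which is a handy sanity check before mixing two different ones.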

Part 2: Frequency swapped sound:
Hopefully this makes you think of a follow-up question: What about with audio? In the sound domain, we can do exactly the same thing: split a sound into its frequencies (power and phase), then merge the power of one file and the phase of another, and see what it sounds like when converted back into audio.
Above is the audio equivalent of the first image: top row is the sound file, next row is the frequency power across the entire file, and the bottom is the corresponding phase. Note that we've lost a dimension: whereas these were 2D in the image case, in the sound case it's just 1D. This is true in general: the frequency space will have the same dimensionality (and in fact, size) as the input, so 2D for images and 1D for sound. (N.B: If you're used to audio spectrograms being 2D, that'll get covered in part 3 below)
To swap the power from one and the phase from another, we do the same as above, but now in 1D. Using these two input files:
Bush.wav
Churchill.wav
Taking the power from the Bush quote and the phase from Churchill gives:
And likewise, the power from Churchill and phase from Bush gives:

Listening to these, it should be obvious that the result in sound is the same as in images: we recognize the content as coming from the phase signal, with a bit of noise added. The original power source is no longer recognizable.
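The 1D version of the swap looks much the same as the 2D one. The sketch below is illustrative only: `swap_audio` is a hypothetical name, two sine tones stand in for the speech clips, the 8000 Hz sample rate is made up, and trimming both clips to the shorter length is an assumption about how to handle files of different durations.

```python
import numpy as np

def swap_audio(power_source, phase_source):
    """Combine the power of one clip with the phase of another (1D)."""
    # Assumption: clips of different lengths are trimmed to the shorter one.
    n = min(len(power_source), len(phase_source))
    spec_power = np.fft.rfft(power_source[:n])
    spec_phase = np.fft.rfft(phase_source[:n])
    combined = np.abs(spec_power) * np.exp(1j * np.angle(spec_phase))
    return np.fft.irfft(combined, n=n)

# Two sine tones as stand-ins for the speech clips (an assumption for the demo).
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone_a = np.sin(2 * np.pi * 440 * t)
tone_b = np.sin(2 * np.pi * 660 * t[:6000])
mixed = swap_audio(tone_a, tone_b)
```

Since the inputs are real-valued, `rfft` and `irfft` are enough here and guarantee a real-valued result.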

Part 3: Chunked frequencies in sound:
It doesn't end there, however. Those familiar with music analysis (or who read my earlier post) might be saying: "The power signal identifies which notes are playing, and I recognize music based on what notes are playing!! Surely it should sound like the power source?". This objection is fair: the analysis in Part 2 cheated. Many of the frequencies used in the analysis are beyond the range of human hearing, either because the pattern repeats too quickly (i.e. too high pitch) to be picked up by our ears, or too slowly (i.e. too low pitch) for the repetition to be recognized.
The fix for this is the Short-Time Fourier Transform - you chunk the original file into shorter segments, and transform those independently. The image above is the result of chunking the Bush audio into blocks of duration 16ms, looking at the frequencies within each block, and plotting the result as columns progressing through the file. This should start to look more like a normal spectrogram, and is probably a better representation of what our brain is processing. (Ignore the vertical line artifacts: STFT fixes this with windowing, but that has been omitted for simplicity.)
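A bare-bones version of this chunking, without windowing or overlap as described, might look like the following. The function names and the 8000 Hz sample rate are assumptions for the example (the real rate of the clips isn't stated), and leftover samples at the end of the file are simply dropped.

```python
import numpy as np

def chunked_spectra(signal, chunk_size):
    """FFT of consecutive, non-overlapping chunks; one row per chunk."""
    n_chunks = len(signal) // chunk_size      # assumption: drop the leftover tail
    chunks = signal[:n_chunks * chunk_size].reshape(n_chunks, chunk_size)
    return np.fft.fft(chunks, axis=1)

def unchunk(spectra):
    """Inverse-transform each chunk and join them back into one signal."""
    return np.real(np.fft.ifft(spectra, axis=1)).reshape(-1)

sample_rate = 8000                            # assumed, not the post's real rate
chunk_size = sample_rate * 16 // 1000         # 16 ms -> 128 samples at this rate
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 500 * t)          # 500 Hz test tone

spectra = chunked_spectra(signal, chunk_size)
# Columns progress through the file; rows are the non-negative frequencies.
spectrogram = np.abs(spectra[:, :chunk_size // 2]).T
```

With these numbers each chunk can only distinguish frequencies 62.5 Hz apart (1 / 16ms), so the 500 Hz tone lands in row 8 of every column.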

You may have guessed the next question: what if we merge the two sources as above, but this time using the chunked spectrogram? The values in the spectrogram are different when we perform the chunking (for those familiar with signal processing, it's similar to a high-pass filter), so you might expect that the merged result would be different, and in fact it is. These are the results using the power from the Churchill track, and phases from Bush, for different sizes of chunk ranging from the whole file (i.e. should be the same as in part 2) to very very short ones:
42912 samples/chunk
10728 samples/chunk
2682 samples/chunk
672 samples/chunk
168 samples/chunk
42 samples/chunk
12 samples/chunk
4 samples/chunk

Most interesting to me is that it goes from sounding like the phase source (i.e. Bush) for large chunks, as found before, to a weird mix at 2682 samples/chunk, then to sounding like the power source at 168 samples/chunk, and finally back to sounding like the phase source for very short chunks.
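To experiment with different chunk sizes, the swap can be applied chunk by chunk. This is a sketch under a few assumptions: `chunked_swap` is a hypothetical name, both signals are trimmed to a common length that divides evenly into chunks, and random noise stands in for the two speech clips.

```python
import numpy as np

def chunked_swap(power_source, phase_source, chunk_size):
    """Swap power and phase independently within each chunk."""
    n = min(len(power_source), len(phase_source))
    n -= n % chunk_size                        # assumption: drop the leftover tail
    spec_power = np.fft.fft(power_source[:n].reshape(-1, chunk_size), axis=1)
    spec_phase = np.fft.fft(phase_source[:n].reshape(-1, chunk_size), axis=1)
    combined = np.abs(spec_power) * np.exp(1j * np.angle(spec_phase))
    return np.real(np.fft.ifft(combined, axis=1)).reshape(-1)

# Random noise as stand-ins for the two clips (an assumption for the demo).
rng = np.random.default_rng(2)
first = rng.standard_normal(4096)
second = rng.standard_normal(4096)
results = {size: chunked_swap(first, second, size) for size in (4096, 256, 16, 1)}
```

One limiting case is easy to reason about: with a single sample per chunk, the FFT of a chunk is just the sample itself, so the output is the magnitude of the power source carrying the sign of the phase source.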

Part 4: Chunked frequencies in images:
No surprises, we end by coming full cycle and applying the STFT chunking analysis back in the visual domain. It's a bit trickier however, as STFT doesn't seem to be a common approach with visual frequencies. The actual implementation is pretty similar (i.e. split the image into smaller images, transform each separately, and combine back) but representing the result is non-trivial. In the audio domain, we gained a dimension (1D to 2D) by laying the chunk spectrograms beside each other in order to put similar frequencies next to each other. In the image space, to do likewise we'd need four dimensions, which doesn't work well. If sticking to two, there seem to be two choices: either tile the spectrograms in the place they were taken from (easy to find the source, but hard to see how particular frequencies change over the image) or group them by frequency (which makes the opposite tradeoff). To see the results applied to a greyscale version of our original Bush image:
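The two layouts can be sketched with a 4D array indexed by tile position and frequency, then flattened to 2D in two different orders. This is one possible reading of the two options rather than a copy of the code behind the figures; all function names are hypothetical and a random array stands in for the photo.

```python
import numpy as np

def block_spectra(image, block):
    """FFT each block x block tile; indexed [tile_row, tile_col, freq_row, freq_col]."""
    rows, cols = image.shape[0] // block, image.shape[1] // block
    cropped = image[:rows * block, :cols * block]   # assumption: crop to whole tiles
    tiles = cropped.reshape(rows, block, cols, block).transpose(0, 2, 1, 3)
    return np.fft.fft2(tiles, axes=(2, 3))

def tiled_layout(values):
    """Option 1: each tile's spectrum sits where the tile was taken from."""
    rows, cols, block, _ = values.shape
    return values.transpose(0, 2, 1, 3).reshape(rows * block, cols * block)

def interleaved_layout(values):
    """Option 2: the same frequency from every tile is grouped together."""
    rows, cols, block, _ = values.shape
    return values.transpose(2, 0, 3, 1).reshape(block * rows, block * cols)

# Random stand-in for the photo (an assumption for the demo).
rng = np.random.default_rng(3)
image = rng.random((16, 16))
spectra = block_spectra(image, 4)
tiled = tiled_layout(np.abs(spectra))
interleaved = interleaved_layout(np.abs(spectra))
```

In the interleaved layout, each frequency gets its own small picture with one pixel per tile. The zero-frequency picture is just the sum of each tile, i.e. a downscaled copy of the image.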
First option: non-interleaved Bush STFT

Second option: interleaved by frequency
Firstly, these are small in the blog, so make sure to click on them to see the larger versions. I haven't come across these images at all in the literature (other than where they're talked about here), but there are a few interesting things to notice. In the first one, you can kind of make out the image of Bush in the 4x4 and 16x16 power graphs, but also, weirdly, in the 4x4 phase plot, and I can't figure out why that'd be... The phase plot for 64x64 is also surprisingly blocky in colour, while being more mixed at both larger and smaller sizes.

The interleaved version is even more interesting, as you end up getting little copies of Bush repeated in all directions, seemingly zooming out as the chunk size grows. The flat phase in the middle (as well as the more structured central axes) is also noteworthy, but might be a side-effect of how the numpy 2D FFT code I'm using works.

Finally, merging the two images (power from Bush, phase from Churchill) using spectrograms at different chunking levels:
Here we get a different effect from the sound example - as we saw before, the larger chunks (128x128) look like the phase source, but in this case, the smaller ones look like the power source, and there's a middle ground where they're both mixed together (like the 2682 samples/chunk mix in audio).
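The chunked image merge can be sketched by doing the power/phase swap independently inside each tile. As before, this is an illustration under stated assumptions: the helper names are hypothetical, both images are assumed to be the same shape, any partial tiles at the edges are cropped off, and random arrays stand in for the photos.

```python
import numpy as np

def to_tiles(image, block):
    """Cut an image into block x block tiles: [tile_row, tile_col, y, x]."""
    rows, cols = image.shape[0] // block, image.shape[1] // block
    cropped = image[:rows * block, :cols * block]   # assumption: crop to whole tiles
    return cropped.reshape(rows, block, cols, block).transpose(0, 2, 1, 3)

def from_tiles(tiles):
    """Put the tiles back in their original positions."""
    rows, cols, block, _ = tiles.shape
    return tiles.transpose(0, 2, 1, 3).reshape(rows * block, cols * block)

def blockwise_swap(power_source, phase_source, block):
    """Swap power and phase independently within each tile."""
    spec_power = np.fft.fft2(to_tiles(power_source, block), axes=(2, 3))
    spec_phase = np.fft.fft2(to_tiles(phase_source, block), axes=(2, 3))
    combined = np.abs(spec_power) * np.exp(1j * np.angle(spec_phase))
    return from_tiles(np.real(np.fft.ifft2(combined, axes=(2, 3))))

# Random stand-ins for the two photos (an assumption for the demo).
rng = np.random.default_rng(4)
first = rng.random((32, 32))
second = rng.random((32, 32))
merged = {block: blockwise_swap(first, second, block) for block in (32, 8, 2)}
```

Using a block the size of the whole image reduces this to the plain swap from Part 1, so the different chunking levels sit on one continuum.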

Final thoughts
Hopefully this has been an interesting journey through what you can do when you visit the (spatial or temporal) frequency domain rather than the usual way we consider these sensory inputs. Code is all up on my github account: https://github.com/padster/fftSwap - as always, it probably has some bugs, apologies in advance.

It's worth mentioning that the images chosen are both faces, and the audio samples are both people talking - it's very likely that the critical frequencies where perception changes would be different if other types of inputs were used (e.g. text/landscapes for images, or music/ambient noise for sound), as we tend to use different frequencies to identify different things (see this paper).

These have also made me want to look more into complex-valued neural networks - it's a fairly under-researched area of machine learning right now. Many things in ML (especially music-related) only look at the power signal of a spectrogram, and as you can see above, that can lose vital information. It also makes me wonder how the brain does all of this stuff, assuming it can't do operations with complex numbers...

Finally, for those more visually minded, it'd be cool to figure out if there is a good way to lay out the spectrograms in the chunked image case - as mentioned, you're kind of missing a few dimensions, but there may be a good way to display / animate it so that all the related information is as intuitive to follow as in the audio case. 


