Virtual music box

A little music project I was writing required a melody be played on a music box. However, the paper-programmable music box I had (pictured) could only play notes on the C major scale. I couldn't easily find a realistic-sounding synthesizer version either. They all seemed to be missing something. Maybe they were too perfectly tuned? I wasn't sure.

Perhaps, if I digitized the sound myself, I could build a flexible virtual instrument to generate just the perfect sample for the piece!

[Image: A paper programmable music box.]

I haven't really made a sampled instrument before, short of perhaps using Impulse Tracker clones with terrible single-sample ones. So I proceeded in an improvised manner. Below I'll post some interesting findings and sound samples of how the instrument developed along the way. There won't be any source code as for now.

By the way, there is a great explanatory video by engineerguy about the workings of music boxes that will explain some terminology ("pins" and "teeth") used in this post.

Recording samples

[Image: A recording setup with a microphone.]

The first step was, obviously, to record the sound to be used as samples. I damped my room using towels and mattresses to minimize room echo; this could be added later if desired, but for now it would only make it harder to cleanly splice the audio. The microphone used was the Audio Technica AT2020, and I digitized it using the Behringer Xenyx 302 USB mixer.

I perforated a paper roll to play all the possible notes in succession, and rolled the paper through. The sound of the paper going through the mechanism posed a problem at first, but I soon learned to stop the paper at just the right moment to make way for the sound of the tooth.

Now I had pretty decent recordings of the whole two-octave range. I used Audacity to extract the notes from the recording, and named the files according to the actual playing MIDI pitch. (The music box actually plays a G# major scale, contrary to what's marked on the blank paper rolls.)

The missing notes

Next, we'll need to generate the missing notes that don't belong in the scale of this music box. This could be done by simply speeding up or slowing down an adjacent note by just the right factor. In equal temperament tuning, this factor would be the 12th root of 2, or roughly 1.05946. Such scaling is straightforward to do on the command line using SoX, for instance (sox c1.wav c_sharp1.wav speed 1.05946).

[Image: Musical notation explaining transposition by multiplication by the 12th root of 2.]

This method can also be used to generate whole new octaves; for example, a transposition of +8 semitones would have a ratio of (12√2)8 ≈ 1.5874. Inter-note variance could be retained by using a random source file for each resampled note. But large-interval transpositions would probably not sound very good due to coloring in the harmonic series.

First test!

Now I could finally write a script to play my melody!

It sounds pretty good already - there's no obvious noise and the samples line up seamlessly even though they were just naively glued together sample by sample. There's a lot of power in the lower harmonics, probably because of the big cardboard box I used, but this can easily be changed by EQ if we want to give the impression of a cute little music box.

Adding errors

The above sound still sounded quite artificial, I think mostly because simultaneous notes start on the same exact millisecond. There seems to be a small timing variance in music boxes that is an important contributor to their overall delicate sound. In the below sample I added a timing error from a normal distribution with a standard deviation of 11 milliseconds. It sounds a lot better already!

Other sounds from the teeth

If you listen to recordings of music boxes you can occasionally hear a high-pitched screech as well. It sounds a bit like stopping a tuning fork or guitar string with a metal object. That's why I thought it must be the sound of the pin stopping a vibrating tooth just before playing another note on the same tooth.

[Image: Spectrogram of the beginning of a note with the characteristic screech, centered around 12 kilohertz.]

Sure enough, this sound could always be heard by playing the same note twice in quick succession. I recorded this sound for each tooth and added it to my sound generator. The sound will be generated only if the previous note sample is still playing, and its volume will be scaled in proportion to the tooth's envelope amplitude at that moment. Also, it will silence the note. The amount of silence between the screech and the next note will depend on a tempo setting.

Adding this resonance definitely brings about a more organic feel:

The wind-up mechanism

For a final touch I recorded sounds from the wind-up mechanism of another music box, even though this one didn't have one. It's all stitched up from small pieces, so the number of wind-ups in the beginning and the speed of the whirring sound can all be adjusted. I was surprised at the smoothness of the background sound; it's a three-second loop with no cross-fading involved. You can also hear the box lid being closed in the end.


[Image: VIM screenshot of a text file containing music box markup.]

The native notation of a music box is some kind of a perforated tape or drum, so I ended up using a similar format. There's a tempo marking and tuning information in the beginning, followed by notation one eighth per line. Arpeggios are indicated by a pointy bracket >.

This format could include additional information as well, perhaps controlling the motor sound or box size and shape (properties of the EQ filter).

This format could also potentially be useful when producing or transcribing music from music drums.

There could also be a tool to convert MIDI files into this format. But the number of notes in a music box loop is usually so small that it's not very hard to write manually.

Future developments

Currently the music box generator has a hastily written "engineer's UI", which means I probably won't remember how to use it in a couple months' time. Perhaps it could it be integrated into some music software, as a plugin.

Possibilities for live performances are limited, I think. It wouldn't work exactly like a keyboard instrument usually does. At least there should be a way to turn on the background noise, and the player should take into account the 300-millisecond delay caused by the pin slowly rotating over the tooth. But it could be used to play a roll in an endless loop and the settings could be modified on the fly.

As such, the tool performs best at pre-rendering notated music. And I'm happy with the results!


  1. this is very neat, would love to see this virtual instrument in action in the future.

  2. Nice idea and realisation :)

  3. Great post! I'm a near-to-zero noob, so I need to ask: Why should the acceleration of the tone be the 12th root of 2? And the 8th power of this value is the next octave? I'm confused.

    1. It's called twelve-tone equal temperament tuning. It's a way to divide the octave into 12 semitones (notes) with equal distances. This kind of tuning is used on the piano, for example. See twelve-tone equal temperament on Wikipedia for more.

      When you multiply the 12th root of two n times, like 12√2 * 12√2 * 12√2 * 12√2, this operation can be written more concisely as (12√2)^4. It's just a shorter way of expressing the series of multiplications (accelerations). That number 8 I used was a random example. 8 semitones, which is an augmented fifth interval, does not necessarily lead us to the next octave.

      (For some reason, Blogger's spam filter was a bit too harsh on this comment!)

    2. Thanks for your answer! It took me a bit of sine wave experiments to understand that you tuned the notes as a "controlled Doppler effect" of the original sound. I found that pretty funny.
      Your blog is awesome. Keep it on!


The comments section is pre-moderated; it will take some time for the comment to show up.

You might want to check out the FAQ first.