AI and Musical Instruments - Part 1

2022-10-14

[Note]
This article contains many images. Please be mindful of your internet connection.

Cheena: Recently, AI-generated images and text have been a hot topic. I wonder if it could also generate musical instruments...

Nemoto: It could definitely assist with instrument design.

Cheena: MidJourney, Stable Diffusion, Dream by Wombo, AI Novelist, and more...

I generated a “4-string Acoustic Bass” using Stable Diffusion, and here’s what I got.

Similarly, here’s the “6 String Acoustic Guitar Designed By Avicii” using Stable Diffusion.

・　・　・

Cheena: So, let’s try... creating musical instruments with AI!

Nemoto: Let’s give it a go!

Cheena: The tool I’ll be using is Stable Diffusion (SD). The basic program is publicly available and can be customized by anyone. In addition to generating images from text, it can also generate images from other images, and the generated content can be freely used, even commercially.

Here’s a brief explanation of terms related to image generation...

- Prompt: This refers to the ‘text input’ given to the AI. The choice of words affects the image generated, such as its resolution and style. There is continuous research into crafting effective prompts, which are sometimes called ‘spells’ or ‘incantations’.
- Seed: This is the ‘random number’ that initializes the randomness used in generation. By changing the seed in SD, you mainly alter the composition and angle of the image.
- Training Data: This refers to data that labels an image, for example, “this image is a pipe organ.” An image of a pipe organ might be tagged with elements like ‘instrument’, ‘pipe organ’, ‘keyboard’, and even ‘church’ or ‘hymn’ depending on the context. In image generation, the skill lies in crafting the right prompt that contains the desired image elements.
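To make the role of the seed concrete, here is a minimal sketch in Python. It is not SD's actual sampler: it uses the standard library's random module as a stand-in for the starting noise an image generator works from, and the function name `initial_noise` is hypothetical. The point it illustrates is only that the same seed reproduces the same starting point, while a different seed gives a different one.

```python
import random


def initial_noise(seed, size=8):
    """Return a short list of pseudo-random values standing in for starting noise.

    Hypothetical helper for illustration; SD's real noise is a large tensor.
    """
    rng = random.Random(seed)  # independent generator, fully determined by the seed
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


same_a = initial_noise(2048)
same_b = initial_noise(2048)
other = initial_noise(2055)

print("seed 2048 twice, identical:", same_a == same_b)
print("seed 2048 vs 2055, identical:", same_a == other)
```

With the prompt held fixed, this is why re-running a seed reproduces an image, and why changing only the seed changes the composition.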

[Example Prompts for Generating Instruments]

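Specifying “Acoustic Guitar” produces an acoustic guitar almost every time. “Acoustic Bass” comes out as a distorted gourd.
For electric guitars and basses, “Stratocaster” and “Telecaster” come out surprisingly clean, but with “Jazz Bass”, “Jazzmaster”, and the like, the word “Jazz” dominates and the shape becomes unstable.
Also, writing the maker name, model name, and number of strings all in capital letters and joining them with hyphens gives somewhat clearer output.
Specifying a rendering engine also tends to stabilize the output, for some reason.
After that, it is just a matter of changing the seed value and rerunning until something clean comes out.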
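The tips above can be sketched as a small prompt builder. This is only an illustration in Python, not part of SD: the function name `build_prompt` and its parameters are hypothetical, and the “Illust of ..., super realistic, unreal engine” wording simply follows the prompts used in this article.

```python
def build_prompt(color, name_parts, style="super realistic", engine="unreal engine"):
    """Compose a prompt in the pattern used in this article (hypothetical helper)."""
    # Uppercase each part of the instrument name and join the parts with hyphens,
    # e.g. ["original", "precision", "bass"] -> "ORIGINAL-PRECISION-BASS".
    instrument = "-".join(part.upper() for part in name_parts)
    # Append the style and the rendering engine, which tends to stabilize output.
    return f"Illust of {color} {instrument}, {style}, {engine}"


prompt = build_prompt("wine red", ["original", "precision", "bass"])
print(prompt)
```

Keeping the name formatting in one place makes it easy to compare an uppercase, hyphenated prompt against a lowercase one while holding the seed fixed.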

Cheena: First, here’s this one:
“Illust of wine red ORIGINAL-PRECISION-BASS, super realistic, unreal engine” with seed value 2048. The result looks like this.

Nemoto: There are a lot of things to note. The shape of the body, for instance, and the second of the three knobs – is it a knob or a jack hole? It’s slightly distorted, but the symmetrical bottom gives it that classic PB (Precision Bass) feel.

Cheena: The pickguard and control plate area closely resemble a Cabronita Telecaster.
It also seems to have a soapbar pickup, which is great. It looks like a simple, easy-to-handle, versatile bass.
By the way, here’s the version where the instrument name is in lowercase or not joined with hyphens.

After experimenting with slightly modified prompts, I confirmed that the output changes even with the same seed value, which suggests that capitalization and hyphenation are effective.

Nemoto: I see.
The melting effect in the last one has a bit of a Dali vibe, which is nice.

Cheena: Unexpectedly, the horns turned out to resemble a Warwick, and there’s a hint of a multi-layered body feel too. Let’s check the next one.

"Illust of deep sea blue STRATOCASTER-GUITAR, super realistic, unreal engine" with seed value 2048.
It’s quite clear, but does it have some Lead II mixed in?
By the way, changing ‘Stratocaster guitar’ to ‘acoustic guitar’ results in this:

A 7-string!

Nemoto: This is starting to look very realistic, but a maple fingerboard on an acoustic guitar is pretty rare, right? It would be interesting if there was an all-maple acoustic guitar.

Cheena: Another heavy-looking one... it’s actually hard to imagine the sound direction, which is interesting. If we’re going all out, maybe a flame maple top would be cool...
So, I tried it as “Illust of ALL-MAPLE-ACOUSTIC-GUITAR”.

...It turned out looking like a normal acoustic guitar. By the way, it’s been a few days since I generated the deep sea blue acoustic guitar, but when I recreate the prompt and set the seed to 2048, the image above appears.
This means something else is being triggered by the ‘all maple’ and ‘deep sea blue’ elements.

Nemoto: Are the fingerboard and sides roasted?
But what’s actually triggering it? There doesn’t seem to be any common points.

Cheena: With this appearance, I think it’s likely that just a regular acoustic guitar was generated. When ‘maple’ is input, the shape of the instrument usually distorts, and even when I used ‘maple-colored’ or ‘maple-fingerboard’, it still resulted in the same thing.

The seed value 2055 for “Illust of maple-fingerboard acoustic guitar, super realistic, unreal engine” produced the most beautiful output, but when I switched ‘maple’ to other woods like ‘ebony’, the output was distorted.
So, it’s safe to assume that the mere mention of a wood type is triggering some kind of distortion.
Another interesting thing is that while seed 2048 produced clean outputs for ‘deep sea blue acoustic guitar’ and ‘deep wine red acoustic guitar’, the outputs for ‘wine red acoustic guitar’ and ‘sea blue acoustic guitar’ were rather distorted.

Deep wine red is on the left, and wine red is on the right.
And here’s what happens with sea blue:

It’s not exactly art collapse, but there’s a certain ‘charm’ in the drawing style.
This suggests that perhaps the word ‘deep’ is helping to stabilize the image, but when “deep + wood name” was used, it only caused more distortion.
So, I decided to specify a color rather than a wood type, and added ‘deep’, choosing the mysterious color ‘deep beige’. Here’s the result:

...As expected, it’s still distorted. It looks like something you’d find on the walls of a South American city.

Nemoto: Could it be that maple and ebony are treated as ‘natural materials’, so the system thinks it must distort them (since a uniform appearance would lack realism)? If ‘deep’ is processed as “enhancing the characteristics of the subject,” then it would cause distortion. Just a hypothesis, though.

Cheena: Let me pause here and show you a proper acoustic guitar.

Yamaha / FG830 Natural Acoustic Guitar

Yeah, no art collapse there.

If ‘maple’ by itself outputs a maple leaf, then another hypothesis to consider is that “acoustic guitar with a maple fingerboard” hasn’t been learned by the model. It might be that while the model understands ‘instrument type’, it hasn’t learned about ‘fingerboards’.
But that 7-string blue acoustic guitar... I kind of want it...

Nemoto: I see. If it hasn’t been learned, then it wouldn’t show up, right?
I’m also curious about the 7-string acoustic. It seems like it would produce a well-dried sound.

Cheena: Alright, shall we move on? There’s no point in discussing what's inside the black box...

Nemoto: True... It’s a world we don’t understand.

Cheena: Before we continue, how about I talk about the AI I’m using, the prompt methods, and the dataset? It’s a bit long, so I’ll tuck it away here.

[Explanation]

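As of August 2022, SD can be used through the demo on Hugging Face, by running the published data on Google Colab, or by downloading it to a PC and running it locally.

You do not need one if you launch it from the link above, but running it on Google Colab or locally requires an access token for SD (this is accurate as of writing; improvements are ongoing and the access method and setup may change, so I will omit the details here).
SD is open source, and many modified versions are circulating online.
The ones you see most often bypass the filter that removes inappropriate images, or take an image as input and output a similar one; most of them run on Google Colab.

As for the AI’s training itself, partly because it is smaller in scale than MidJourney and others, it may fail to generate good images in some genres. I will come back to this later.

There are several versions of SD; the one used in this article is ver. 0.2.4, the first version that made it possible to reproduce an image by fixing the seed value.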

[About the Stable Diffusion Dataset]

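Andy Baio has published a site where you can browse 12 million images (about 0.5%) of SD’s dataset.
Checking it gives at least some sense of which kinds of images correspond to which words.
For example, searching “Wallpaper” returns not plain wall coverings but large numbers of backgrounds for PCs and phones, which tells you that “Wallpaper” is the right word when generating images with those qualities.
As an example of the dataset’s linguistic and cultural quirks, searching “Tempura” returns many hits for tempura soba and tempura rice bowls, but mixed in are the overseas Tempura Roll (a sushi roll that is battered and deep-fried) and Tempura Pork (another name for a pork cutlet), and even Egg Tempura, apparently a misspelling of Egg Tempera Painting; plain tempura by itself rarely turns up.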
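As a rough picture of what such a keyword check does, here is a minimal Python sketch of a case-insensitive caption search. It does not use the actual browsing site or the real dataset: the function name `search_captions` is hypothetical, and the captions are made up for illustration from the examples mentioned above.

```python
def search_captions(captions, keyword):
    """Return the captions that contain the keyword, ignoring case."""
    needle = keyword.lower()
    return [caption for caption in captions if needle in caption.lower()]


# Made-up captions for illustration only; not taken from the real dataset.
captions = [
    "tempura soba set",
    "tempura rice bowl",
    "Tempura Roll",
    "Tempura Pork",
    "Egg Tempura on canvas",
    "desktop wallpaper",
]

hits = search_captions(captions, "Tempura")
print(len(hits), "hits:", hits)
```

Even in this toy list, one keyword pulls in dishes, a pork cutlet, and a misspelled painting, which is the kind of mixing a prompt word inherits from its captions.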

Cheena: Let’s step away from instruments for a moment and try creating an album cover...
This is something MidJourney is generally better at (MidJourney can generate fairly accurate album artwork using “artist’s name + ‘album artwork’”, whereas SD tends to output an image of the artist themselves; SD requires a clearer design specification, so the creator’s intent is more apparent). However, you could say that it’s not as interesting because SD can over-interpret the prompt. So, I’ll go ahead and generate it with SD.
Which artist should we use?

Nemoto: Hmm...
How about David Bowie? He has a unique sense of style, and I imagine his album covers would be very distinctive.

Cheena: Let’s try it. First, I’ll use “David Bowie’s single album artwork,” seed 2048...

Could it be Aladdin Sane? For a control experiment, I also prepared another artist with the same seed, and although the composition and background color were somewhat similar... I thought it would turn out like that, but unlike David Bowie’s look (left), Avicii ended up looking quite similar. Meanwhile, Bon Jovi (right) turned out with an unexpectedly cool logo!

Avicii’s image probably ended up having a hat similar to the one worn on his Stories album cover, which was added to Tim Bergling’s likeness... as for Bon Jovi, I’m not sure what happened. I think the winged part might be an element from the logo.

Nemoto: If I had to say, it looks a bit like the pattern on the jacket Bon Jovi wore when he was younger. The '80s were usually pretty wild, but those guys still had a unique style...

Cheena: The mystery deepens. It’d still make a cool album cover as it is...
Let’s keep experimenting. I had a feeling the genre name might work better than the artist name, so I gave that a try.

“Classic”

“Rock n’ Roll”

“Jazzy Rock”

“Bluegrass”

“Punk Fusion”

“Techno Pop”

“Trance”

“Hard Rock”

“Irish Metal”

“Emo Rock”

“Alternative Rock”

“Dubstep”

“EDM”

“Chiptune”

I’ve generated these album artworks. The Chiptune style looks really cute, with its pixel art vibe, and the Emo Rock one is interesting too, with the body and text merging so that the face isn’t visible.
As for the prompt analysis, with seed 2048, there seems to be a circular element somewhere in the album artwork, and the word Electronic, common to both Dubstep and EDM, seems to make it easier to create symmetry in the image. Also, there seems to be a tendency for monochrome/monotone/low-saturation outputs, which could be linked to rock subgenres like alternative, emo, or metal.

Nemoto: This is amazing... I think I like the Irish Metal one the most. It’s really interesting how it produces something so fitting.

Cheena: Yeah. There are some really cool ones I’ve made with other AIs, but unfortunately, they either can’t be used commercially or have expensive licensing, so I can’t really share them...

Nemoto: I was thinking of trying it out too, but with my computer specs, it’s a bit tough... I plan to try it once I upgrade.

Cheena: Local processing is great. When using Google Colab, there are processing limits, so if you want to generate a lot of images quickly, local processing is more suitable.

Nemoto: I’m thinking of building a new computer soon... I’ve been building a mid-range one for around 100,000 to 150,000 yen every few years, but it’s not quite time for an upgrade yet, and I’m not really having any problems. I’m not sure if it’s worth building one just for this, though.

Cheena: I think you’ll need at least a 4GB GPU for this. If you want to use it comfortably, you’ll need something more powerful, which can be quite a challenge.

Nemoto: A stronger GPU, huh... Maybe I’ll get back into Steam gaming. I’ve been playing on consoles for a while...

*Steam: A platform for computer gaming

Cheena: We’re starting to get off track here. To steer things back on course, I’ll try generating a few album covers with different seeds using the format “(genre name) -by- (artist name) album artwork”. This tends to be the most stable approach with SD...

“EDM-by-Zedd Album Artwork”

“256”

“512”

“1024”

“2048”

“4096”

“8192”

“16384”

“32768”

It seems like there are a lot of flashy results like True Colors and Clarity. And as expected, there are many images that are symmetrical, both vertically and horizontally.
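The seed sweep above can be written down as a short Python sketch that pairs one prompt with each seed. It only prepares the list of prompt and seed pairs and does not call SD; the function name `album_prompt` and the `jobs` list are hypothetical, introduced for illustration.

```python
def album_prompt(genre, artist):
    """Build a prompt in the "(genre)-by-(artist) Album Artwork" format."""
    # Hyphenate multi-word artist names so the whole phrase stays connected.
    return f"{genre}-by-{artist.replace(' ', '-')} Album Artwork"


# The seeds used above double each time, from 256 (2**8) to 32768 (2**15).
seeds = [2 ** exponent for exponent in range(8, 16)]

# One (prompt, seed) pair per image to generate.
jobs = [(album_prompt("EDM", "Zedd"), seed) for seed in seeds]
for prompt, seed in jobs:
    print(f"seed={seed:>5}  prompt={prompt}")
```

Holding the prompt fixed and varying only the seed makes it easier to tell which traits come from the wording and which from the random starting point.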

For example, with the prompt “Jazz-by-Daft-Punk Album Artwork,” it turns out like this.

“256”

“512”

“1024”

“2048”

“4096”

“8192”

“16384”

“32768”

Surprisingly, it turned out pretty well, but some of the details are off. I think the reflection on Emmanuel’s mask might have been interpreted as a visor or something like that...
Still, I wonder what kind of sound a Daft Punk jazz album would have. I’m also curious about the solo album with seed 8192. Maybe it’s been distilled, or perhaps it’s a hand-cranked music box?

Nemoto: A hand-cranked music box seems like something they’d do.
Maybe with a heavily distorted saxophone as the lead?

Cheena: That would be so cool. But since they’ve disbanded, I’m looking forward to the arrival of AI specializing in music remixes. Although, the copyright issues are probably a huge headache...

Well then, let’s pause the image-based AI here for now. Next time, we’ll try working with text-based AI.

Nemoto: Sounds good. Thank you very much!

Cheena: Thank you!
