Simple MLP model for Guitar Steel vs Nylon strings classification


Luka Nguyen


November 26, 2022

1 Brief Introduction

Can a 2-hidden-layer MLP do a good job classifying musical instrument sounds? Let’s find out!

2 Introduction

For beginner guitar players, it’s sometimes difficult to tell apart the sound of steel strings and nylon strings on the guitar. In this article, I’ll walk you through some easy steps to build a Machine Learning model that classifies these two types of sound.

You can try a live DEMO via:

3 Import libraries & Setup constants

The first step is to install all required libraries. Even though torchaudio can handle audio, it lacks support for some media formats. That’s why we need two additional sound libraries, namely librosa and soundfile. Our main data source is YouTube, and pytube allows easy and fast audio extraction from the platform.

!pip install torchaudio librosa soundfile pytube torchsummary matplotlib pandas
!conda install libsoundfile
import torch
import torchaudio
import librosa
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import soundfile as sf
import threading
from glob import glob
import os

Use Google Drive if you need to store your data there.

# from google.colab import drive
# drive.mount('/content/drive')
Mounted at /content/drive

For simplicity, I will use 8000 Hz as the default sampling rate. This helps training run faster on modest hardware. Also, I’d like to split each YouTube audio clip into 5-second segments for training. This helps us enlarge our dataset and simplify our network architecture.

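TARGET_SR = 8_000
CLASSES = ["nylon", "steel"]
SEGMENT_DURATION = 5 # seconds
TRAIN_SIZE = 0.95 # fraction of segments used for training; matches the split sizes reported later
RANDOM_SEED = 42 # assumption: the original seed value is not shown in the source
torch.manual_seed(RANDOM_SEED)
<torch._C.Generator at 0x7faa2c165c90>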
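
To make these constants concrete, here is a small sketch of the arithmetic they imply. The helper names (`samples_per_segment`, `segments_in_clip`) and the 60-minute clip length are illustrative assumptions, not part of the original notebook.

```python
TARGET_SR = 8_000          # samples per second, as defined above
SEGMENT_DURATION = 5       # seconds, as defined above


def samples_per_segment(sr: int, duration_s: int) -> int:
    """Number of audio samples in one fixed-length segment."""
    return sr * duration_s


def segments_in_clip(clip_seconds: float, duration_s: int) -> int:
    """Whole segments that fit in a clip; the leftover tail is dropped."""
    return int(clip_seconds // duration_s)


# Hypothetical 60-minute clip, used only for illustration.
n_samples = samples_per_segment(TARGET_SR, SEGMENT_DURATION)
n_segments = segments_in_clip(60 * 60, SEGMENT_DURATION)
nyquist_hz = TARGET_SR / 2  # highest frequency representable at this rate

print(n_samples, n_segments, nyquist_hz)  # 40000 720 4000.0
```

Each training example therefore starts out as 40,000 raw samples, and at an 8 kHz sampling rate only frequency content below 4 kHz is preserved.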

Then, we will create some folders to store our sound files.

ROOT_DIR = "./"
DATA_DIR = f"{ROOT_DIR}data/" # Google Drive & Colab

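for subfolder in ["raw", "segments"]:
  for cls in CLASSES:
    new_dir = f"{DATA_DIR}{subfolder}/{cls}"
    if not os.path.exists(new_dir):
      os.makedirs(new_dir)

# RAW_CLIP_PATH is used by the download and segmentation steps; its definition
# is missing from the source and is reconstructed here from the folder layout.
RAW_CLIP_PATH = f"{DATA_DIR}raw/"
SEGMENT_DIR = f"{DATA_DIR}segments/"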

This function helps get the best device possible for training.

def get_best_torch_device():
  if torch.cuda.is_available():
      device = "cuda"
  else:
      device = "cpu"
  print(f"Using device {device}")
  return device

device = get_best_torch_device()
Using device cuda

4 Collect Data

4.1 Download audio from YouTube

These are the clips that I handpicked from YouTube. They are solo guitar recordings and were recorded in a professional studio. To watch any of them, just add the youtube url prefix. For example: “foIPN-T7RGo” ➡️ “”

steel_clips = ["foIPN-T7RGo","10ATKnZLg9c","IP8vBL5Q8Ac"]
nylon_clips = ["qgb-bdEEI-M","qXwvz-nTiog","6jQ34uTmA9s"]

Next, we define our function to download and extract audio from a YouTube url:

from pytube import YouTube

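def download_youtube_mp3(link, output_dir):
    """Download and extract audio from a clip from YouTube."""
    # Assumption: the line that created `yt` is missing from the source;
    # this builds it from the standard YouTube watch URL and the clip id.
    yt = YouTube(f"https://www.youtube.com/watch?v={link}")
    yt.streams.filter(only_audio=True).first().download(output_dir, link + ".mp3")
    print(f"Downloaded YouTube Audio from: {link}")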

Each clip is over 60 minutes long, which could take a long time to download. To speed things up, we create one download thread per clip and download all clips simultaneously.

download_thread_list = []

for link in steel_clips:
  new_thread = threading.Thread(target=download_youtube_mp3, args=(link, RAW_CLIP_PATH + "steel"))
  download_thread_list.append(new_thread)

for link in nylon_clips:
  new_thread = threading.Thread(target=download_youtube_mp3, args=(link, RAW_CLIP_PATH + "nylon"))
  download_thread_list.append(new_thread)

print("Download Raw Clips starting...")
# start each thread
for thread in download_thread_list:
  thread.start()

# wait for all to finish
for thread in download_thread_list:
  thread.join()

# successfully executed
print("Download Raw Clips finished!")
Download Raw Clips starting...
Downloaded YouTube Audio from: foIPN-T7RGo
Downloaded YouTube Audio from: 6jQ34uTmA9s
Downloaded YouTube Audio from: qgb-bdEEI-M
Downloaded YouTube Audio from: qXwvz-nTiog
Downloaded YouTube Audio from: 10ATKnZLg9c
Downloaded YouTube Audio from: IP8vBL5Q8Ac
Download Raw Clips finished!

4.2 Segmentize into 5-second clips

Now, let’s create some functions to split each audio clip into 5-second segments.

def segmentize_signal(signal, sr, dur):
    """Segmentize the 1-d signal (mono) to a list of clips with custom duration (dur)."""
    seg_len = dur * sr

    # calculate number of segments
    no_segs = len(signal) // seg_len

    # truncate input signal to have length divisible by seg_len
    trunc_len = int(no_segs * seg_len)

    # split equally
    return np.split(signal[:trunc_len], no_segs)

def save_audio(signal, sr, output_dir, filename):
    # write one segment to disk with soundfile
    output_path = os.path.join(output_dir, filename)
    sf.write(output_path, signal, sr)

def segment_audio_file(audio_path, output_dir,  target_sr=TARGET_SR, segment_duration=SEGMENT_DURATION):
    print(f"Processing raw clip: {audio_path}")
    signal, _ = librosa.load(audio_path, sr=target_sr, mono=True)
    # signal, target_sr = librosa.load(audio_path,sr=None,  mono=True)
    print(f"\tLoaded clip from disk")
    segments_list = segmentize_signal(signal, target_sr, segment_duration)
    print(f"\tSegmented clip into {len(segments_list)} segments")
    for seg_idx, seg in enumerate(segments_list):
        seg_name = f"{audio_path.split('/')[-1][:-4]}_{seg_idx}.wav"
        save_audio(seg, target_sr, output_dir, seg_name)
    print(f"\tSegments are saved completely")
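
As a quick sanity check of the segmentation logic above, the sketch below runs the same splitting idea on a tiny made-up signal. The toy sampling rate, the toy signal, and the empty-list guard for clips shorter than one segment are additions for illustration only.

```python
import numpy as np


def segmentize_signal(signal, sr, dur):
    """Split a mono signal into equal chunks of `dur` seconds, dropping the remainder."""
    seg_len = int(dur * sr)
    no_segs = len(signal) // seg_len
    if no_segs == 0:
        return []  # clip shorter than one segment (guard added for this sketch)
    return np.split(signal[: no_segs * seg_len], no_segs)


# Toy numbers so the result is easy to check by eye: 4 Hz "audio", 11 samples.
toy_sr = 4
toy_signal = np.arange(11)
segments = segmentize_signal(toy_signal, toy_sr, dur=1)

print(len(segments))  # 2 full segments; the last 3 samples are dropped
print(segments[0])    # [0 1 2 3]
print(segments[1])    # [4 5 6 7]
```

The same rule explains why each real clip yields a whole number of 5-second files: anything after the last full segment is discarded.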

Next, we use threading to segment all clips at the same time. Beware that if your system has less than 32 GB of RAM, this could exhaust memory and freeze the system. In that case, please modify the code to process the clips sequentially (i.e. without threading).

thread_list = []

for cls in CLASSES:
    # get all raw files from subfolders
    raw_audio_paths = glob(f"{RAW_CLIP_PATH}{cls}/*mp3")
    for audio_path in raw_audio_paths:
        output_dir = f"{SEGMENT_DIR}{cls}"
        new_thread = threading.Thread(target=segment_audio_file, args=(audio_path, output_dir))
        thread_list.append(new_thread)

print("Segmentation starting...")
# start each thread
for thread in thread_list:
    thread.start()

# wait for all to finish
for thread in thread_list:
    thread.join()

# successfully executed
print("Segmentation finished!")
Segmentation starting...
Processing raw clip: /workspace/data/raw/nylon/qXwvz-nTiog.mp3
Processing raw clip: /workspace/data/raw/nylon/6jQ34uTmA9s.mp3
Processing raw clip: /workspace/data/raw/nylon/qgb-bdEEI-M.mp3
Processing raw clip: /workspace/data/raw/steel/IP8vBL5Q8Ac.mp3
Processing raw clip: /workspace/data/raw/steel/foIPN-T7RGo.mp3
/opt/conda/lib/python3.7/site-packages/librosa/util/ UserWarning: PySoundFile failed. Trying audioread instead.
  return f(*args, **kwargs)
/opt/conda/lib/python3.7/site-packages/librosa/util/ UserWarning: PySoundFile failed. Trying audioread instead.
  return f(*args, **kwargs)
Processing raw clip: /workspace/data/raw/steel/10ATKnZLg9c.mp3
/opt/conda/lib/python3.7/site-packages/librosa/util/ UserWarning: PySoundFile failed. Trying audioread instead.
  return f(*args, **kwargs)
    Loaded clip from disk
    Segmented clip into 648 segments
    Segments are saved completely
    Loaded clip from disk
    Segmented clip into 742 segments
    Segments are saved completely
    Loaded clip from disk
    Segmented clip into 1230 segments
    Segments are saved completely
    Loaded clip from disk
    Segmented clip into 1251 segments
    Segments are saved completely
    Loaded clip from disk
    Segmented clip into 1427 segments
    Segments are saved completely
    Loaded clip from disk
    Segmented clip into 2647 segments
    Segments are saved completely
Segmentation finished!
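
If memory is tight, one option is to cap how many clips are decoded at once instead of starting one thread per clip. The sketch below uses `concurrent.futures.ThreadPoolExecutor` from the standard library. `fake_segment_job`, `run_jobs`, and the paths are hypothetical stand-ins (in the notebook the job would be `segment_audio_file`), and `max_workers=1` gives fully sequential behaviour.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_segment_job(audio_path):
    """Hypothetical stand-in for segment_audio_file: just reports the path."""
    return f"segmented {audio_path}"


def run_jobs(paths, job, max_workers=2):
    """Run `job` on every path with at most `max_workers` running at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order and re-raises any exception from a job
        return list(pool.map(job, paths))


paths = ["raw/nylon/a.mp3", "raw/nylon/b.mp3", "raw/steel/c.mp3"]  # made-up paths
results = run_jobs(paths, fake_segment_job, max_workers=1)
print(results)
```

Raising `max_workers` trades memory for speed, so it can be tuned to the machine rather than fixed by the number of clips.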

5 Dataset & Dataloader

PyTorch manages data through two types of classes: Dataset and DataLoader. A Dataset can be thought of as an indexable collection that gives us access to each individual data point, while a DataLoader loads data efficiently in batches, which is useful for mini-batch training. For a more detailed description, read here:

5.1 Create annotations

Before creating our own dataset class, we need to have a csv file to describe our training / val / test sets.

This annotation dataframe stores the paths to each audio sample and its label:

annotation_dict = {"audio_path": [], "label": []}

for label, cls in enumerate(CLASSES):
  wav_dirs = f"{SEGMENT_DIR}{cls}/*wav"
  audio_path_list = glob(wav_dirs)
  count_audio_files = len(audio_path_list)
  label_list = [label] * count_audio_files

  annotation_dict["audio_path"] += audio_path_list
  annotation_dict["label"]      += label_list
annotation_df = pd.DataFrame.from_dict(annotation_dict)
      audio_path                                  label
7940  ./data/segments/steel/foIPN-T7RGo_575.wav   1
7941  ./data/segments/steel/10ATKnZLg9c_545.wav   1
7942  ./data/segments/steel/10ATKnZLg9c_1035.wav  1
7943  ./data/segments/steel/10ATKnZLg9c_602.wav   1
7944  ./data/segments/steel/IP8vBL5Q8Ac_1146.wav  1

The dataset is quite large for an average system. That’s why I split the training set into full, half, quarter, and one-eighth versions. This allows me to build and test models quickly on a smaller training set. When I find something that works well, I can then use a larger training set to improve the results.

train_df_full = annotation_df.sample(frac=TRAIN_SIZE, random_state=RANDOM_SEED)
val_df = annotation_df.drop(train_df_full.index, axis=0)

# make smaller train datasets for quick experimentations
train_df_half = train_df_full.sample(frac=1/2, random_state=RANDOM_SEED)
train_df_quarter = train_df_full.sample(frac=1/4, random_state=RANDOM_SEED)
train_df_1eight = train_df_full.sample(frac=1/8, random_state=RANDOM_SEED)

We have 4816 samples of NYLON, and 3129 of STEEL

0    4816
1    3129
Name: label, dtype: int64
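
The two classes are not balanced, so it helps to know what a trivial classifier would score. The sketch below computes the accuracy of always predicting the majority class from the counts above. This baseline is added here for context and assumes the validation split has roughly the same class ratio as the full dataset.

```python
# Class counts reported above (label 0 = nylon, label 1 = steel).
counts = {"nylon": 4816, "steel": 3129}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_acc = counts[majority_class] / total

print(total)                  # 7945
print(majority_class)         # nylon
print(f"{baseline_acc:.1%}")  # 60.6%
```

Any model worth keeping should clear this figure of roughly 60.6% by a comfortable margin.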

Finally, let’s write them to CSV files for later use.

df_list = [train_df_full, train_df_half, train_df_quarter, train_df_1eight, val_df]

df_names = ["train_df_full", "train_df_half", "train_df_quarter", "train_df_1eight", "val_df"]

for df_name, df_content in zip(df_names, df_list):
    df_content.to_csv(f"{DATA_DIR}{df_name}.csv", index=False)

5.2 Dataset class

We create GuitarSoundDataset, which inherits from PyTorch’s Dataset. This class holds the annotations that we created earlier and helps us access and preprocess each individual input and label.

To create this class, I took inspiration from this awesome Deep Learning for Audio channel:

from torch.utils.data import Dataset

class GuitarSoundDataset(Dataset):

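    # The parameter list is reconstructed from how the class is instantiated later;
    # the audio_col / label_col defaults are assumptions that match the annotation columns.
    def __init__(self,
                 annotations_file,
                 transformation,
                 target_sample_rate,
                 num_samples,
                 device,
                 audio_col="audio_path",
                 label_col="label"):
        self.annotations = pd.read_csv(annotations_file)
        self.device = device
        if transformation:
          self.transformation = transformation.to(self.device)
        else:
          self.transformation = None
        self.target_sample_rate = target_sample_rate
        self.num_samples = num_samples
        self.audio_col = audio_col
        self.label_col = label_col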

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        audio_sample_path = self.__get_audio_sample_path(index)
        label = self.__get_audio_sample_label(index)
        signal, sr = torchaudio.load(audio_sample_path)
        if signal.dim() < 2:
          signal = signal[None, :]
        signal = signal.to(self.device)
        signal, sr = self.preprocess_signal(signal, sr)
        if self.transformation:
          signal = self.transformation(signal)
        return signal, label

    def preprocess_signal(self, signal, sr):
        signal = self.__resample_if_necessary(signal, sr)
        signal = self.__mix_down_if_necessary(signal)
        signal = self.__cut_if_necessary(signal)
        signal = self.__right_pad_if_necessary(signal)
        return signal, sr

    def __cut_if_necessary(self, signal):
        if signal.shape[1] > self.num_samples:
            signal = signal[:, :self.num_samples]
        return signal

    def __right_pad_if_necessary(self, signal):
        length_signal = signal.shape[1]
        if length_signal < self.num_samples:
            num_missing_samples = self.num_samples - length_signal
            last_dim_padding = (0, num_missing_samples)
            signal = torch.nn.functional.pad(signal, last_dim_padding)
        return signal

    def __resample_if_necessary(self, signal, sr):
        if sr != self.target_sample_rate:
            resampler = torchaudio.transforms.Resample(sr, self.target_sample_rate).to(self.device)
            signal = resampler(signal)
        return signal

    def __mix_down_if_necessary(self, signal):
        if signal.shape[0] > 1:
            signal = torch.mean(signal, dim=0, keepdim=True)
        return signal

    def __get_audio_sample_path(self, index):
        path = self.annotations.iloc[index, :][self.audio_col]
        return path

    def __get_audio_sample_label(self, index):
        label =  self.annotations.iloc[index, :][self.label_col]
        return torch.tensor(label, dtype=torch.float)
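
The cut and right-pad steps in the class above guarantee that every example has exactly `num_samples` values along the time axis. Here is a minimal sketch of that idea, written with NumPy arrays instead of torch tensors to keep it dependency-light. `fix_length` is a hypothetical helper name, not part of the original class.

```python
import numpy as np


def fix_length(signal, num_samples):
    """Return a (channels, num_samples) array: truncate if long, zero-pad on the right if short."""
    if signal.shape[1] > num_samples:
        signal = signal[:, :num_samples]
    elif signal.shape[1] < num_samples:
        missing = num_samples - signal.shape[1]
        signal = np.pad(signal, ((0, 0), (0, missing)))  # pad only the time axis
    return signal


long_clip = np.ones((1, 7))
short_clip = np.ones((1, 3))

print(fix_length(long_clip, 5).shape)  # (1, 5)
print(fix_length(short_clip, 5))       # [[1. 1. 1. 0. 0.]]
```

A fixed input length matters here because the MLP's first linear layer expects the same number of features for every example.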

5.3 DataLoader

from torch.utils.data import DataLoader

def create_data_loader(dataset, batch_size):
    dataset_loader = DataLoader(dataset, batch_size=batch_size)
    return dataset_loader

A mel spectrogram transforms our signal from the time domain into a time-frequency representation on the mel scale, which makes the characteristics of the sound easier for both humans and computers to interpret. Thus, we transform each audio input into a mel spectrogram before feeding it into the neural network.

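# Reconstructed call: the arguments are missing from the source. n_mels=64 and
# hop_length=512 reproduce the [1, 64, 79] sample shape printed later; n_fft=1024 is an assumption.
mel_spectrogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=TARGET_SR,
    n_fft=1024,
    hop_length=512,
    n_mels=64,
)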

train_dataset = GuitarSoundDataset(
                      annotations_file =f"{DATA_DIR}train_df_half.csv",
                      transformation = mel_spectrogram,
                      target_sample_rate = TARGET_SR,
                      num_samples = TARGET_SR * SEGMENT_DURATION,
                      device = device)
print(f"There are {len(train_dataset)} samples in the TRAIN dataset.")

val_dataset = GuitarSoundDataset(f"{DATA_DIR}val_df.csv",
                      transformation = mel_spectrogram,
                      target_sample_rate = TARGET_SR,
                      num_samples = TARGET_SR * SEGMENT_DURATION,
                      device = device)
print(f"There are {len(val_dataset)} samples in the VAL dataset.")
There are 3774 samples in the TRAIN dataset.
There are 397 samples in the VAL dataset.

We will take one sample out to find the exact input shape for our neural network:

signal_sample, _ = val_dataset[0]
print(signal_sample.shape)
torch.Size([1, 64, 79])
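
The shape [1, 64, 79] can be checked by hand. With centered STFT frames (the torchaudio default), the number of time frames is `num_samples // hop_length + 1`. The sketch below assumes `n_mels=64` and `hop_length=512`. These values are not shown in the source but are consistent with the printed shape, and `mel_spec_shape` is a hypothetical helper.

```python
TARGET_SR = 8_000
SEGMENT_DURATION = 5
N_MELS = 64        # assumption: matches the 64 in the printed shape
HOP_LENGTH = 512   # assumption: consistent with 79 time frames


def mel_spec_shape(num_samples, n_mels, hop_length, channels=1):
    """Shape of a centered mel spectrogram: (channels, n_mels, time_frames)."""
    time_frames = num_samples // hop_length + 1
    return (channels, n_mels, time_frames)


shape = mel_spec_shape(TARGET_SR * SEGMENT_DURATION, N_MELS, HOP_LENGTH)
flattened = shape[0] * shape[1] * shape[2]
print(shape, flattened)  # (1, 64, 79) 5056
```

The flattened size, 5056, is the number of input features the first linear layer of the network has to accept.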

6 Build Model

6.1 Training Loop

Because the training and validation loops are pretty basic, I won’t delve into this code too much. I took inspiration from the official tutorial:

def compute_accuracy(preds, target):
  _preds = preds.detach().cpu().numpy()
  _target = target.detach().cpu().numpy()
  return np.mean(_preds.squeeze().round() == _target.squeeze())
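
To see what this accuracy measure does, here is a NumPy-only sketch of the same rounding rule on made-up probabilities. `accuracy_from_probs` is a hypothetical stand-in for `compute_accuracy` that skips the tensor-to-array conversion.

```python
import numpy as np


def accuracy_from_probs(probs, targets):
    """Fraction of samples whose rounded probability equals the 0/1 target."""
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean(probs.squeeze().round() == targets.squeeze()))


# Made-up probabilities with shape (batch, 1), like the model output.
probs = [[0.9], [0.2], [0.6], [0.4]]
targets = [1, 0, 0, 0]
print(accuracy_from_probs(probs, targets))  # 0.75
```

Rounding acts as a 0.5 decision threshold. One detail worth knowing is that NumPy rounds halves to the nearest even value, so a prediction of exactly 0.5 counts as class 0.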

def train_single_epoch(model, data_loader, loss_fn, optimiser, device):
  train_losses = []
  train_accs = []

  for batch, (input, target) in enumerate(data_loader):
      input, target = input.to(device), target.to(device)

      # calculate loss
      preds = model(input)
      loss = loss_fn(preds.squeeze(), target.squeeze())

      # backpropagate error and update weights
      optimiser.zero_grad()
      loss.backward()
      optimiser.step()

      # calculate accuracy
      acc = compute_accuracy(preds, target)

      # record per-batch metrics
      train_losses.append(loss.item())
      train_accs.append(acc)

  return np.mean(train_losses), np.mean(train_accs)

def validate(model, data_loader, loss_fn, device):
  # model.train(False)
  val_losses = []
  val_accs = []
  with torch.inference_mode():
    for input, target in data_loader:
      input, target = input.to(device), target.to(device)

      # calculate loss
      preds = model(input)
      loss = loss_fn(preds.squeeze(), target.squeeze())

      # calculate acc
      acc = compute_accuracy(preds, target)

      # record per-batch metrics
      val_losses.append(loss.item())
      val_accs.append(acc)

  return np.mean(val_losses), np.mean(val_accs)

def save_model(model, model_dir):
  torch.save(model.state_dict(), model_dir)

def train(model, train_dataloader, test_dataloader, loss_fn, optimiser, device, epochs, save_best=True, model_dir="bestmodel.pth"):
  train_losses = []
  train_accs = []
  val_losses = []
  val_accs = []
  for i in range(epochs):
      # training
      train_loss, train_acc = train_single_epoch(model, train_dataloader, loss_fn, optimiser, device)
      # val
      val_loss, val_acc = validate(model, test_dataloader, loss_fn, device)
      print(f"Epoch {i+1} | train loss: {train_loss:.5f}, train acc: {train_acc:.3%} | val loss: {val_loss:.5f}, val acc: {val_acc:.3%}")

      # save best val acc
      if save_best and len(val_losses) > 0 and val_acc > np.max(val_accs):
        # save model
        print("-> Best Model found! Saving to disk...")
        save_model(model, model_dir)

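      # update losses
      train_losses.append(train_loss)
      train_accs.append(train_acc)
      val_losses.append(val_loss)
      val_accs.append(val_acc)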
  print("Finished training")
  return train_losses, train_accs, val_losses, val_accs

def plot_model(model_history):
    train_losses, train_accs, val_losses, val_accs = model_history
    # Plot Loss (in its own figure so it does not share axes with accuracy)
    plt.figure()
    plt.plot(range(len(train_losses)), train_losses, label='Training Loss')
    plt.plot(range(len(train_losses)), val_losses, label='Validation Loss')
    # Add in a title and axes labels
    plt.title('Training and Validation Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend(loc="upper left")
    plt.show()
    # Plot Acc
    plt.figure()
    plt.plot(range(len(train_accs)), train_accs, label='Training Acc')
    plt.plot(range(len(train_accs)), val_accs, label='Validation Acc')
    # Add in a title and axes labels
    plt.title('Training and Validation Acc')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend(loc="upper left")
    plt.show()

def describe_model_stats(model_history):
    train_losses, train_accs, val_losses, val_accs = model_history
    history = {"train_losses": train_losses, "train_accs": train_accs, "val_losses": val_losses, "val_accs": val_accs}
    # Reconstructed: the summary table printed after training matches DataFrame.describe()
    print(pd.DataFrame(history).describe())

6.2 MLP Model Building: 2 hidden layers with ReLU Activation

I define a simple MLP with 2 hidden fully connected layers with ReLU activation. The final output is then passed through a sigmoid to produce a probability prediction.

from torch import nn
from torchsummary import summary

class MLPNetwork(nn.Module):

    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear = nn.Sequential(
            nn.Linear(1 * 64 * 79, 256), # I got the number (1 * 64 * 79) as input size from the code above
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, input_data):
        x = self.flatten(input_data)
        logits = self.linear(x)
        predictions = torch.sigmoid(logits)
        return predictions

if __name__ == "__main__":
    model2 = MLPNetwork()
    summary(model2.to(device), (1, 64, 79), device=device)
        Layer (type)               Output Shape         Param #
           Flatten-1                 [-1, 5056]               0
            Linear-2                  [-1, 256]       1,294,592
              ReLU-3                  [-1, 256]               0
            Linear-4                  [-1, 128]          32,896
              ReLU-5                  [-1, 128]               0
            Linear-6                    [-1, 1]             129
Total params: 1,327,617
Trainable params: 1,327,617
Non-trainable params: 0
Input size (MB): 0.02
Forward/backward pass size (MB): 0.04
Params size (MB): 5.06
Estimated Total Size (MB): 5.13

Audio input is large: a 5-second sample at an 8000 Hz sampling rate contains 40,000 raw values, and even after the mel spectrogram transform each input still has 1 × 64 × 79 = 5056 features.

As a result, this simple MLP already has over 1.3 million parameters.
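
The parameter counts in the summary can be reproduced by hand: a fully connected layer has `in_features * out_features` weights plus `out_features` biases. A small sketch (`linear_params` is a hypothetical helper name):

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


layer_sizes = [1 * 64 * 79, 256, 128, 1]
per_layer = [linear_params(i, o) for i, o in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
size_mb = total * 4 / 1024 ** 2  # float32 = 4 bytes per parameter

print(per_layer)          # [1294592, 32896, 129]
print(total)              # 1327617
print(round(size_mb, 2))  # 5.06
```

Almost all of the parameters sit in the first layer, which is a direct consequence of flattening the whole spectrogram into 5056 inputs.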

Now, let’s create a folder to store our trained params.

MODEL_DIR = f"{ROOT_DIR}weights/"

if not os.path.exists(MODEL_DIR):
  os.makedirs(MODEL_DIR)

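Then, define some hyperparameters for training and create a dataloader for the training and validation datasets.

# Assumption: the original hyperparameter cell is missing from the source.
# EPOCHS = 15 matches the training log; BATCH_SIZE is a placeholder value.
BATCH_SIZE = 128
EPOCHS = 15

train_dataloader = create_data_loader(train_dataset, BATCH_SIZE)
val_dataloader = create_data_loader(val_dataset, BATCH_SIZE)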

Now, let’s train our model!

MODEL_SAVE_PATH = f"{MODEL_DIR}model_mlp1.pth"
print(f"Best models will saved to: {MODEL_DIR} (based on val acc)")

model1 = MLPNetwork()

if os.path.exists(MODEL_SAVE_PATH):
  model1.load_state_dict(torch.load(MODEL_SAVE_PATH, map_location=torch.device(device)))

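model1 = model1.to(device)

# initialise loss function + optimiser
loss_fn = nn.BCELoss()

# The learning-rate argument is missing from the source; this falls back to Adam's default (lr=1e-3).
optimiser = torch.optim.Adam(model1.parameters())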

# train model
history_model1 = train(model1, train_dataloader, val_dataloader, loss_fn, optimiser, device, EPOCHS, save_best=True, model_dir=MODEL_SAVE_PATH)
Best models will saved to: ./weights/ (based on val acc)
Epoch 1 | train loss: 18.60054, train acc: 74.353% | val loss: 18.41207, val acc: 79.943%
Epoch 2 | train loss: 16.17105, train acc: 80.607% | val loss: 14.15756, val acc: 82.287%
-> Best Model found! Saving to disk...
Epoch 3 | train loss: 16.04773, train acc: 79.716% | val loss: 22.44626, val acc: 72.251%
Epoch 4 | train loss: 15.91658, train acc: 80.841% | val loss: 24.57104, val acc: 72.446%
Epoch 5 | train loss: 15.68584, train acc: 81.720% | val loss: 26.59615, val acc: 71.274%
Epoch 6 | train loss: 17.09794, train acc: 80.188% | val loss: 15.84533, val acc: 81.671%
Epoch 7 | train loss: 15.53885, train acc: 82.014% | val loss: 15.38519, val acc: 80.364%
Epoch 8 | train loss: 12.86597, train acc: 84.409% | val loss: 19.71215, val acc: 75.931%
Epoch 9 | train loss: 12.15642, train acc: 85.247% | val loss: 13.57315, val acc: 81.926%
Epoch 10 | train loss: 11.81812, train acc: 84.438% | val loss: 12.96833, val acc: 81.145%
Epoch 11 | train loss: 10.44457, train acc: 86.577% | val loss: 10.41160, val acc: 82.677%
-> Best Model found! Saving to disk...
Epoch 12 | train loss: 8.54932, train acc: 88.063% | val loss: 7.47481, val acc: 84.826%
-> Best Model found! Saving to disk...
Epoch 13 | train loss: 7.18438, train acc: 88.478% | val loss: 6.82821, val acc: 83.263%
Epoch 14 | train loss: 5.54964, train acc: 89.264% | val loss: 4.69765, val acc: 84.405%
Epoch 15 | train loss: 2.19403, train acc: 89.026% | val loss: 1.49113, val acc: 84.075%
Finished training
       train_losses  train_accs  val_losses   val_accs
count     15.000000   15.000000   15.000000  15.000000
mean      12.388064    0.836627   14.304709   0.798988
std        4.770562    0.042507    7.314085   0.046287
min        2.194033    0.743532    1.491129   0.712740
25%        9.496942    0.807237    8.943201   0.779372
50%       12.865967    0.844086   14.157557   0.816707
75%       15.982155    0.873198   19.062112   0.829703
max       18.600537    0.892641   26.596150   0.848257
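
For reference, the quantity that `nn.BCELoss` averages over a batch is the binary cross-entropy, -[y log p + (1 - y) log(1 - p)]. Below is a plain-Python sketch on made-up values. The `eps` clamp is an addition of this sketch to avoid log(0) and is not how PyTorch handles that case internally.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over a batch of probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```

Confidently wrong predictions are penalised far more heavily than confidently right ones are rewarded, which is one way large loss values can coexist with a reasonable accuracy.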

7 Conclusion

With this simple architecture, we already achieve an acceptable validation accuracy of around 81%, peaking at 84.8% in epoch 12. Not bad for our first try.

In the live demo, I actually used a more complicated CNN model, which achieves over 90% validation accuracy. The training was made possible by running on a cloud GPU instance with an RTX 3090, an AMD EPYC CPU, and 83 GB of RAM. You can try them out here - RunPod (my affiliate link)

8 Acknowledgment

I would like to thank AI VIETNAM for providing the basic knowledge about mathematics and machine learning models, and Valerio Velardo for his tutorials about working with sound data. Last but not least, I give my thanks to all the artists whose clips I used in this model, which is for educational purposes only.