Getting Started with a New Codebase¶
If you are starting a new project, Scooch is an easy way to keep your code structured and configurable as you develop. This example provides a walkthrough of starting a new Scooch-configurable codebase for creating mini-batches to be used in a gradient descent algorithm. It highlights many of the features and benefits of using Scooch.
A completed version of this example is available in the examples directory on GitHub.
In this example we’ll need a few Python packages in our environment:
pip install scooch numpy scipy matplotlib
Let’s also create a directory to work in:
mkdir ./scooch_getting_started
cd ./scooch_getting_started
1 - Parameterize a class¶
A core component of this code will be a configurable Batcher
class. To make a new Scooch configurable class, you can simply inherit from scooch.Configurable
and define some parameters on that class’s definition. For the Batcher
class, we’ll start simple by placing the following class in ./batcher.py
:
import random
from scooch import Configurable
from scooch import Param
import numpy as np
class Batcher(Configurable):
"""
Constructs mini-batches for gradient descent.
"""
_batch_size = Param(type=int, default=128, doc="The number of samples in each mini-batch")
    _audio_samples_per_sample = Param(int, default=1024, doc="The number of audio samples to extract each feature from")
def set_data(self, data):
# Save a reference to a data array, to sample / batch from
self._data = data
def get_batch(self):
feature_data = []
while len(feature_data) < self._batch_size:
            start_idx = np.random.randint(0, self._data.shape[0]-self._audio_samples_per_sample)
            audio_segment = self._data[start_idx:(start_idx+self._audio_samples_per_sample)]
feature_data += [audio_segment]
random.shuffle(feature_data)
return np.vstack(feature_data[:self._batch_size])
This class will extract random samples from the provided test data.
2 - Write a config¶
Now that there is a class to configure, you can write a yaml file that will set its parameters, e.g., for the Batcher
class above:
Batcher:
batch_size: 8
Save this as ./config.yaml
in the current working directory.
Note that we need not configure a parameter if we want to use its default value. In this case we’ll use the default for audio_samples_per_sample
. Any parameter defined without a default value must be specified in the config file.
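For example, if the batch size had been declared without a default (a hypothetical variation, not the class defined above):

class Batcher(Configurable):
    ...
    # With no default value, batch_size must appear in the config file
    _batch_size = Param(int, doc="The number of samples in each mini-batch")

then Scooch would require batch_size to be set under Batcher in config.yaml.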
Note that the batch_size
parameter is private to the Batcher class, but no leading underscore is used in the config file. With Scooch classes it is preferred to keep parameters private to each class and expose them via @property where necessary, as in the sketch below. However, this is not required; for example, the _batch_size parameter could equivalently be named batch_size if the developer prefers public parameters.
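A minimal sketch of that convention, exposing the private _batch_size parameter through a read-only property (an optional addition, not part of the example code above):

class Batcher(Configurable):

    _batch_size = Param(type=int, default=128, doc="The number of samples in each mini-batch")

    @property
    def batch_size(self):
        # Read-only access to the configured batch size
        return self._batch_size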
3 - Construct and use the class¶
At this point, it is simple to instantiate the class with the specified configuration file and execute methods on it. For example, a script to pull mini-batches from the Batcher
class could be as simple as:
from batcher import Batcher
from scooch import Config
import argparse
import scipy.io.wavfile
import matplotlib.pyplot as plt
NUM_BATCHES = 3
def main(config, data):
# Load data
audio_data = scipy.io.wavfile.read(data)[1]/32767 # <= Normalize 16 bit wav format
# Batch samples
batcher_instance = Batcher(Config(config))
batcher_instance.set_data(audio_data)
batches = [batcher_instance.get_batch() for _ in range(NUM_BATCHES)]
# Plot batches for inspection
fig, axs = plt.subplots(1, NUM_BATCHES, sharey=True)
for batch_num in range(NUM_BATCHES):
axs[batch_num].plot(batches[batch_num].T)
axs[batch_num].set_title(f"Batch {batch_num}")
plt.show()
if __name__=='__main__':
parser = argparse.ArgumentParser(description='Produces a few example mini-batches')
parser.add_argument("--config", default="./config.yaml", type=str)
parser.add_argument("--data", default="./data/test_data.wav", type=str)
kwargs = vars(parser.parse_args())
processed_kwargs = {key: arg for key, arg in kwargs.items() if arg}
main(**processed_kwargs)
For this example, we’ll simply plot the data, though this could easily be extended to do something more useful like dump the data to .npy
files for training a model. Save this script as ./batch_it.py
.
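As a rough sketch of that alternative (assuming a hypothetical ./batches output directory), the plotting section of main could be replaced with something like:

import os
import numpy as np

os.makedirs("./batches", exist_ok=True)
for batch_num, batch in enumerate(batches):
    # Write each mini-batch to its own .npy file for later use in training
    np.save(f"./batches/batch_{batch_num}.npy", batch)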
Note that batch_it.py expects a ./data/test_data.wav
file by default. In this case you can use an example from the scooch repository by executing the following from the same directory as batch_it.py
.
mkdir ./data
wget -O ./data/test_data.wav https://raw.githubusercontent.com/pandoramedia/scooch/main/examples/batcher_example/data/test_data.wav
With the data in place, the script can then be executed as:
python ./batch_it.py --config ./config.yaml --data ./data/test_data.wav
There we have it: a script that uses Scooch to configure a class for producing mini-batches. So far, this could be done with many other Python config libraries. Next we’ll look into some of the benefits of Scooch’s object-oriented approach in particular.
4 - Encapsulation¶
One of the primary benefits of Scooch is that it constructs not only classes, but entire class hierarchies, with minimal code. Perhaps we want the Batcher
class above to produce augmentations of the data source it is reading from.
To get started it might make sense to place our Batcher
class in a Python package. We can do this by reorganizing our files into the following directory structure:
./batch_it.py
./config.yaml
./batcher/__init__.py
./batcher/batcher.py
and place the following in the ./batcher/__init__.py
file:
from .batcher import Batcher
For data augmentations we’ll want to parameterize that augmentation itself. Let’s create an augmenter class that takes in some feature data and augments it. Put the following in the file ./batcher/augmenters.py
.
import numpy as np
from scooch import Configurable
from scooch import Param
class NoiseAugmenter(Configurable):
"""
Takes in audio samples and augments them by adding noise, distributed uniformly on
a logarithmic scale between the minimum and maximum provided noise values.
"""
_noise_min = Param(float, default=-10.0, doc="Minimum RMS power of noise to be added to an audio sample (in dB)")
    _noise_max = Param(float, default=10.0, doc="Maximum RMS power of noise to be added to an audio sample (in dB)")
def augment(self, sample):
# Produce a random dB value for the noise
power_db = np.random.rand()*(self._noise_max - self._noise_min) + self._noise_min
# Convert to linear
power_linear = 10.0**(power_db/10.0)
# Synthesize and add the noise to the signal
noise_data = np.random.normal(scale=power_linear, size=sample.shape)
return sample + noise_data
We can now employ this new Configurable
inside the Batcher
class by adding a Param
whose type is another Configurable class (i.e., the NoiseAugmenter class) to the class definition of Batcher, e.g.:
import random
from scooch import Configurable
from scooch import Param
import numpy as np
from .augmenters import NoiseAugmenter
class Batcher(Configurable):
"""
Constructs mini-batches for gradient descent.
"""
_batch_size = Param(int, default=128, doc="The number of samples in each mini-batch")
    _audio_samples_per_sample = Param(int, default=1024, doc="The number of audio samples to extract each feature from")
_augmenter = Param(NoiseAugmenter, doc="An augmentation transformation to be applied to each sample")
...
Upon instantiation of Batcher
, a NoiseAugmenter instance will be constructed and assigned to the _augmenter attribute, so using it is simple. We can adjust the get_batch method of Batcher to use it:
...
def get_batch(self):
feature_data = []
while len(feature_data) < self._batch_size:
            start_idx = np.random.randint(0, self._data.shape[0]-self._audio_samples_per_sample)
            audio_segment = self._data[start_idx:(start_idx+self._audio_samples_per_sample)]
feature_data += [self._augmenter.augment(audio_segment)]
random.shuffle(feature_data)
return np.vstack(feature_data[:self._batch_size])
...
We can now adjust the ./config.yaml
to configure the new Configurable
class parameter:
Batcher:
  batch_size: 8
  augmenter:
    NoiseAugmenter:
      noise_min: -5.0
      noise_max: 5.0
Without any changes to the ./batch_it.py
script, Scooch will construct the new class hierarchy based on the parameters and configuration to produce noise-augmented samples. Try running the following again:
python ./batch_it.py --config ./config.yaml --data ./data/test_data.wav
Here we can see that Scooch has constructed the new class hierarchy based on the updated configuration and produced batches of what are now noisy samples.
5 - Inheritance¶
Scooch configures not only classes, but class hierarchies. As this codebase develops, it is likely that there will be several different types of Augmenter
s. To support this, let’s construct an Augmenter
base class that NoiseAugmenter
will inherit from. In this class we might also want to include some functionality that is common to all augmenters, e.g., the number of augmentations performed per input sample. To do this, adjust ./batcher/augmenters.py
like so:
import numpy as np
from scooch import Configurable
from scooch import Param
class Augmenter(Configurable):
"""
An abstract augmenter base class for all feature augmentations to derive from.
"""
_augmentations_per_sample = Param(int, default=3, doc="The number of augmentations returned for each input sample")
def augment(self, sample):
return [self._get_augmentation(sample) for _ in range(self._augmentations_per_sample)]
def _get_augmentation(self, sample):
raise NotImplementedError(f"The augmenter class {self.__class__.__name__} has no defined method to augment a feature.")
class NoiseAugmenter(Augmenter):
"""
Takes in audio samples and augments them by adding noise, distributed uniformly on
a logarithmic scale between the minimum and maximum provided noise values.
"""
_noise_min = Param(float, default=-10.0, doc="Minimum RMS power of noise to be added to an audio sample (in dB)")
    _noise_max = Param(float, default=10.0, doc="Maximum RMS power of noise to be added to an audio sample (in dB)")
def _get_augmentation(self, sample):
# Produce a random dB value for the noise
...
We now adjust the Configurable
Param
in the Batcher
class to refer to any class that derives from Augmenter
:
import random
from scooch import Configurable
from scooch import Param
import numpy as np
from .augmenters import Augmenter
class Batcher(Configurable):
"""
Constructs mini-batches for gradient descent.
"""
_batch_size = Param(int, default=128, doc="The number of samples in each mini-batch")
    _audio_samples_per_sample = Param(int, default=1024, doc="The number of audio samples to extract each feature from")
_augmenter = Param(Augmenter, doc="An augmentation transformation to be applied to each sample")
...
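Note that the base class’s augment method now returns a list of augmented samples rather than a single sample, so get_batch should extend feature_data with that list instead of appending a single element. A minimal sketch of the adjusted method:

...
    def get_batch(self):
        feature_data = []
        while len(feature_data) < self._batch_size:
            start_idx = np.random.randint(0, self._data.shape[0]-self._audio_samples_per_sample)
            audio_segment = self._data[start_idx:(start_idx+self._audio_samples_per_sample)]
            # augment(...) now returns a list of samples, so extend the batch with all of them
            feature_data += self._augmenter.augment(audio_segment)
        random.shuffle(feature_data)
        return np.vstack(feature_data[:self._batch_size])
...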
The config.yaml
file now specifies which type of Augmenter
to use, and may configure the parameters of that class and any of its Configurable
base classes:
Batcher:
  batch_size: 8
  augmenter:
    NoiseAugmenter:
      augmentations_per_sample: 2
      noise_min: -5.0
      noise_max: 5.0
The batch_it.py
script can be run again and will now produce two unique noise augmentations for each sample drawn from the data source.
6 - Abstraction and Polymorphism¶
Now that there is a class hierarchy set up for Augmenter
s, we can add new types of augmenters as we please. Because the interface and common parameters are defined in the base Augmenter
class, the Batcher
class will know how to use them, without any changes to that code.
Let’s create a DCOffsetAugmenter
to provide training examples with a non-zero offset. Add the following class to ./batcher/augmenters.py
:
class DCOffsetAugmenter(Augmenter):
"""
Adds random DC offsets to training samples.
"""
_offset_variance = Param(float, default=1.0, doc="The variance of random offset values applied as data augmentations")
def _get_augmentation(self, sample):
return sample + np.random.normal(scale=np.sqrt(self._offset_variance), size=sample.shape)
Simply by defining this class we can “select” it in the ./config.yaml
file like so:
Batcher:
batch_size: 8
augmenter:
DCOffsetAugmenter:
augmentations_per_sample: 2
offset_variance: 0.8
By running batch_it.py
again, we will see that there is no longer additive noise in the batches, but constant offsets.
7 - Explore Scooch hierarchies with the CLI¶
As codebases and class hierarchies grow, the number of configuration options can become daunting. To help with onboarding to a codebase that uses Scooch, you can view the options for a given Configurable
base class as follows:
scooch options -m batcher -f Augmenter
This will print out the doc strings for all subclasses of Augmenter
in the batcher
module, including the Scooch parameter information.
Note that any module referenced here will have to be installed or on your Python path. If you receive a ModuleNotFoundError
, you can add the batcher
module to your python path like so:
export PYTHONPATH=$PYTHONPATH:`pwd`
The structure of configuration for a Configurable
can become quite complex. To help new developers, it is recommended to include an example config.yaml file in your codebase. Alternatively, there is a wizard to produce config.yaml files for a given class via the CLI:
scooch construct -c ./default_config.yaml -m batcher -f Batcher
This will prompt for the type of each Param
that is of type Configurable
in the class hierarchy, construct a configuration for the Batcher class in the batcher module, and place it in the file ./default_config.yaml
.