The Sound class

class audiomath.Sound(source=None, fs=None, nChannels=2, dtype=None, bits=None, label=None)

Bases: object

Sound is a class for editing and writing sound files. (If your main aim is simply to play back existing sounds loaded from a file, you probably want to start with a Player instead.)

A Sound instance s is a wrapper around a numpy array s.y which contains a floating-point representation of a (possibly multi-channel) sound waveform. Generally the array values are in the range [-1, +1].

The wrapper makes it easy to perform certain common editing and preprocessing operations using Python operators, such as:

Numerical operations:

  • The + and - operators can be used to superimpose sounds (even if lengths do not match).

  • The * and / operators can be used to scale amplitudes (usually by a scalar numeric factor, but you can also use a list of scaling factors to scale channels separately, or window a signal by multiplying two objects together).

  • The +=, -=, *= and /= operators work as you might expect, modifying a Sound instance’s data array in-place.
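
For example (a sketch; `s1` and `s2` are assumed to be existing Sound instances):

mix = s1 + s2              # superimpose the two sounds
quieter = s1 * 0.5         # scale amplitude by a scalar factor
panned = s1 * [1.0, 0.25]  # scale each channel separately
s1 /= 2                    # in-place amplitude scaling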

Concatenation of sound data in time:

The syntax s1 % s2 is the same as Concatenate( s1, s2 ): it returns a new Sound instance containing a new array of samples, in which the samples of s1 and s2 are concatenated in time.

Either argument may be a scalar, so s % 0.4 returns a new object with 400 msec of silence appended, and 0.4 % s returns a new object with 400 msec of silence pre-pended.

Concatenation can be performed in-place with s %= arg or equivalently using the instance method s.Concatenate( arg1, arg2, ... ): in either case the instance s gets its internal sample array replaced by a new array.
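
For example (a sketch; `s1` and `s2` are assumed to be existing Sound instances):

seq = s1 % 0.4 % s2  # `s1`, then 400 msec of silence, then `s2`
s1 %= s2             # append `s2` to `s1` in-place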

Creating multichannel objects:

The syntax s1 & s2 is the same as Stack( s1, s2 ): it returns a new Sound instance containing a new array of samples, comprising the channels of s1 and the channels of s2. Either one may be automatically padded with silence at the end as necessary to ensure that the lengths match.

Stacking may be performed in-place with s1 &= s2 or equivalently with the instance method s1.Stack( s2, s3, ... ): in either case instance s1 gets its internal sample array replaced by a new array.
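
For example (a sketch; `left` and `right` are assumed to be single-channel Sound instances):

stereo = left & right  # a new two-channel Sound
left &= right          # in-place: `left` itself becomes two-channel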

Slicing, expressed in units of seconds:

The following syntax returns Sound objects wrapped around slices into the original array:

s[:0.5]   #  returns the first half-second of `s`
s[-0.5:]  #  returns the last half-second of `s`

s[:, 0]   # returns the first channel of `s`
s[:, -2:] # returns the last two channels of `s`
s[0.25:0.5, [0,1,3]] # returns a particular time-slice of the chosen channels

Where possible, the resulting Sound instances’ arrays are views into the original sound data. Therefore, things like s[2.0:-1.0].AutoScale() or s[1.0:2.0] *= 2 will change the specified segments of the original sound data in s. Note one subtlety, however:

# Does each of these examples modify the selected segment of `s` in-place?

    s[0.1:0.2,  :] *= 2            # yes
    q = s[0.1:0.2,  :];  q *= 2    # yes  (`q.y` is a view into `s.y`)

    s[0.1:0.2, ::2] *= 2           # yes
    q = s[0.1:0.2, ::2];  q *= 2   # yes  (`q.y` is a view into `s.y`)

    s[0.1:0.2, 0] *= 2             # yes (creates a copy, but then uses `__setitem__` on the original)
    q = s[0.1:0.2, 0];  q *= 2     # - NO (creates a copy, then just modifies the copy in-place)

    s[0.1:0.2, [1,3]] *= 2         # yes (creates a copy, but then uses `__setitem__` on the original)
    q = s[0.1:0.2, [1,3]];  q *= 2 # - NO (creates a copy, then just modifies the copy in-place)

A Sound instance may be constructed in any of the following ways:

s = Sound( '/path/to/some/sound_file.wav' )
s = Sound( some_other_Sound_instance )   # creates a shallow copy
s = Sound( y, fs )                       # where `y` is a numpy array
s = Sound( duration_in_seconds, fs )     # creates silence
s = Sound( raw_bytes, fs, nChannels=2 )

Parameters:
  • source – a filename, another Sound instance, a numpy array, a scalar numeric value indicating the desired duration of silence in seconds, or a buffer full of raw sound data, as in the examples above.

  • fs (float) – sampling frequency, in Hz. If source is a filename or another Sound instance, the default value will be inferred from that source. Otherwise the default value is 44100.

  • nChannels (int) – number of channels. If source is a filename, another Sound instance, or a numpy array, the default value will be inferred from that source. Otherwise the default value is 2.

  • dtype (str) – Sound data are always represented internally in floating-point. However, the dtype argument specifies the Sound instance’s dtype_encoded property, which dictates the format in which the instance imports or exports raw data by default.

  • bits (int) – This is another way of initializing the instance’s dtype_encoded property (see dtype, above), assuming integer encoding. It should be 8, 16, 24 or 32. If dtype is specified, this argument is ignored. Otherwise, the default value is 16.

  • label (str) – An optional string to be assigned to the label attribute of the Sound instance.

classmethod AddMethod(func)

Use this as a function decorator, to add a function as a new class method.

Amplitude(norm=2, threshold=0.0)

Returns a 1-D numpy array containing the estimated amplitude of each channel. With norm=2, that’s the root-mean-square amplitude, whereas norm=1 would get you the mean-absolute amplitude.

Note that, depending on the content, the mean may reflect not only how loud the content is when it happens, but also how often it happens. For example, loud speech with a lot of pauses might have a lower RMS than continuous quiet speech. This would make it difficult to equalize the volumes of the two speech signals. To work around this, use the threshold argument. It acts as a noise-gate: in each channel, the average will only include samples whose absolute value reaches or exceeds threshold times that channel’s maximum (so the threshold value itself is relative, expressed as a proportion of each channel’s maximum—this makes the noise-gate invariant to simple rescaling).
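
For example (a sketch):

rms = s.Amplitude()                 # per-channel root-mean-square amplitude
gated = s.Amplitude(threshold=0.1)  # ignore samples below 10% of each channel's maximum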

ApplyWindow(func=<function hanning>, axis=0, **kwargs)

If s is a numpy.ndarray, return a windowed copy of the array. If s is an audiomath.Sound object (for example, if this is being used as a method of that class), then its internal array s.y will be replaced by a windowed copy.

Windowing means multiplication by the specified window function, along the specified time axis.

func should take a single positional argument: length in samples. Additional **kwargs, if any, are passed through. Suitable examples include numpy.blackman, numpy.kaiser, and friends.
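
For example (a sketch):

import numpy
s.ApplyWindow()                         # the default Hann window
s.ApplyWindow(numpy.kaiser, beta=14.0)  # alternatively, a Kaiser window; `beta` is passed through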

AutoScale(max_abs_amp=0.95, center='median')

Remove the DC offset from each channel (see Center()) and then rescale the waveform so that its maximum absolute value (across all channels) is max_abs_amp.

Valid center options include 'median', 'mean', 'none', or numeric values (scalar, or one per channel).

The array self.y is modified in place.

Bulk(encoded=False)

Return the number of bytes occupied by the sound waveform: with encoded=False, as the waveform currently stands in memory; with encoded=True, as it would be if encoded according to the currently-specified dtype_encoded and written to disk in an uncompressed format (excluding the bytes required to store any format header).

Cat(*args)

Concatenate the instance, in time, with the specified arguments (which may be other Sound instances and/or numpy arrays and/or numeric scalars indicating durations of silent intervals in seconds). Replace the sample array self.y with the result. Similar results can be obtained with the %= operator or the global function Concatenate().

Center(center='median')

Remove the DC offset from each channel by subtracting its center value.

Valid center options include 'median', 'mean', 'none', or numeric values (scalar, or one per channel).

The array self.y is modified in place.

Concatenate(*args)

Concatenate the instance, in time, with the specified arguments (which may be other Sound instances and/or numpy arrays and/or numeric scalars indicating durations of silent intervals in seconds). Replace the sample array self.y with the result. Similar results can be obtained with the %= operator or the global function Concatenate().

Copy(empty=False)

Create a new Sound instance whose array y is a deep copy of the original instance’s array. With empty=True the resulting array will be empty, but will have the same number of channels as the original.

Note that most other preprocessing methods return self as their output argument. This makes it easy to choose between modifying an instance in-place and creating a modified copy. For example:

s.Reverse()             # reverse `s` in-place
t = s.Copy().Reverse()  # create a reversed copy of `s`

Cut(start=None, stop=None, units='seconds')

Shorten the instance’s internal sample array by taking only samples from start to stop. Either end-point may be None. Either may be a positive number of seconds (measured from the start) or a negative number of seconds (measured from the end).
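
For example (a sketch):

s.Cut(0.5, -0.25)  # discard the first 0.5 sec and the last 0.25 sec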

Detect(threshold=0.05, center=True, p=0, eachChannel=False, units='seconds')

This method finds time(s) at which the absolute signal amplitude exceeds the specified threshold.

If center is truthy, subtract the median of each channel first.

p denotes the location of interest, as a proportion of the duration of the above-threshold data. So, 0.0 means “the first time the signal exceeds the threshold”, 1.0 means “the last time the signal exceeds the threshold”, and 0.5 means halfway in between those two time points. p may be a scalar, in which case the method returns one output argument. Alternatively p may be a sequence, in which case a sequence of output arguments is returned, each one corresponding to an element of p.

If eachChannel is truthy, each output will itself be a sequence (one element per channel). If it is untruthy, each output will be a scalar (the signal is considered to have exceeded the threshold when any of its channels exceeds the threshold).

units may be 'seconds', 'milliseconds' or 'samples' and it dictates the units in which the outputs are expressed.
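
For example (a sketch):

onset, offset = s.Detect(threshold=0.1, p=[0.0, 1.0])  # first and last supra-threshold times, in seconds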

Duration()

Returns the duration of the sound in seconds.

Envelope(granularity=0.002, fs=None, includeDC=False, minThickness=0.01, bars=False)

This is both a global function (working on a numpy array s together with a sampling frequency fs) and a Sound method (working on a Sound instance s, in which case no separate fs argument is required).

It returns (timebase, lower, upper), where lower and upper are the lower and upper bounds on the signal amplitude. Amplitude is computed in adjacent non-overlapping bins, each bin being of width granularity (expressed in seconds).

lower and upper will be adjusted such that they are always at least minThickness apart. Also, if you supply per-channel values as includeDC, then the bounds will be adjusted such that the specified values are always included. (If you simply specify includeDC=True, then the per-channel median values of s will be used in this way.)

The return values are suitable for plotting as follows:

s = TestSound('12')
t, lower, upper = s.Envelope()
import matplotlib.pyplot as plt
for iChannel in range(s.nChannels):
    plt.fill_between(t, lower[:, iChannel], upper[:, iChannel])

(NB: You do not actually need to plot it by hand in this way, because the Sound.Plot method will do it for you: “envelope mode” is used by default for sounds longer than 60 seconds, and this decision can be overridden in either direction by passing envelope=True or envelope=False.)

By default, t is strictly increasing. But if you set bars=True then values will be repeated in t, lower and upper such that only horizontal and vertical edges will appear in the plots (Manhattan skyline style).

Fade(risetime=0, falltime=0, hann=False)

If risetime is greater than zero, it denotes the duration (in seconds) over which the sound is to be faded-in at the beginning.

If falltime is greater than zero, it denotes the duration of the corresponding fade-out at the end.

If hann is true, then a raised-cosine function is used for fading instead of a linear ramp.

The array self.y is modified in-place.
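
For example (a sketch):

s.Fade(risetime=0.01, falltime=0.05, hann=True)  # 10-msec raised-cosine fade-in, 50-msec fade-out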

GenerateWaveform(freq_hz=1.0, phase_rad=None, phase_deg=None, amplitude=1.0, dc=0.0, samplingfreq_hz=None, duration_msec=None, duration_samples=None, axis=None, waveform=<ufunc 'cos'>, waveform_domain='auto', **kwargs)

Create a signal (or multiple signals, if the input arguments are arrays) which is a function of time (time being defined along the specified axis).

If this is being used as a method of an audiomath.Sound instance, then the container argument is automatically set to that instance. Otherwise (if used as a global function), the container argument is optional—if supplied, it should be an audiomath.Sound object. With a container, the axis argument is set to 0, and the container object’s sampling frequency, number of channels and duration (if non-zero) are used as fallback values in case these are not specified elsewhere. The resulting signal is put into container.y and a reference to the container is returned.

Default phase is 0, but may be changed by either phase_deg or phase_rad (or both, as long as the values are consistent).

Default duration is 1000 msec, but may be changed by either duration_samples or duration_msec (or both, as long as the values are consistent).

If duration_samples is specified and samplingfreq_hz is not, then the sampling frequency is chosen such that the duration is 1 second—so then freq_hz can be interpreted as cycles per signal.

The default waveform function is numpy.cos which means that amplitude, phase and frequency arguments can be taken straight from the kind of dictionary returned by fft2ap() for an accurate reconstruction. A waveform function is assumed by default to take an input expressed in radians, unless the first argument in its signature is named cycles, samples, seconds or milliseconds, in which case the input argument is adjusted accordingly to achieve the named units. (To specify the units explicitly as one of these options, pass one of these words as the waveform_domain argument.)

In this module, SineWave(), SquareWave(), TriangleWave() and SawtoothWave() are all functions of cycles (i.e. the product of time and frequency), whereas Click() is a function of milliseconds. Any of these can be passed as the waveform argument.
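
For example (a sketch; the 440-Hz tone is an arbitrary choice):

s = Sound(1.0, fs=44100, nChannels=1)  # a one-second silent container
s.GenerateWaveform(freq_hz=440.0)      # fill it with a 440-Hz cosine (the default waveform)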

IsolateChannels(ind, *moreIndices)

Select particular channels, discarding the others. The following are all equivalent, for selecting the first two channels:

t = s.IsolateChannels(0, 1)       # ordinary 0-based indexing
t = s.IsolateChannels([0, 1])
t = s.IsolateChannels('1', '2')   # when you use strings, you
t = s.IsolateChannels(['1', '2']) # can index the channels
t = s.IsolateChannels('12')       # using 1-based numbering

Equivalently, you can also use slicing notation, selecting channels via the second dimension:

t = s[:, [0, 1]]
t = s[:, '12'] # again, strings mean 1-based indexing

Left()

Return a new Sound instance containing a view of alternate channels, starting at the first.

MakeHannWindow(plateau_duration=0)

Return a single-channel Sound object of the same duration and sampling frequency as self, containing a Hann or Tukey window—i.e. a raised-cosine “fade-in”, followed by an optional plateau, followed by a raised-cosine “fade-out”.

MixDownToMono()

Average the sound waveforms across all channels, replacing self.y with the single-channel result.

ModulateAmplitude(freq_hz=1.0, phase_rad=None, phase_deg=None, amplitude=0.5, dc=0.5, samplingfreq_hz=None, duration_msec=None, duration_samples=None, axis=None, waveform=<ufunc 'sin'>, **kwargs)

If s is a numpy.ndarray, return a modulated copy of the array. If s is an audiomath.Sound object (for example, if this is being used as a method of that class), then its internal array s.y will be replaced by a modulated copy.

Modulation means multiplication by the specified waveform, along the specified time axis.

Default phase is such that amplitude is 0 at time 0, which corresponds to phase_deg=-90 if waveform follows sine phase (remember: by default the modulator is a raised waveform, because dc=0.5 by default). To change phase, specify either phase_rad or phase_deg.

Uses GenerateWaveform().

NumberOfChannels()

Returns the number of channels.

NumberOfSamples()

Returns the length of the sound in samples.

PadEndTo(seconds)

Append silence to the instance’s internal array as necessary to ensure that the total duration is at least the specified number of seconds.

PadStartTo(seconds)

Prepend silence to the instance’s internal array as necessary to ensure that the total duration is at least the specified number of seconds.

Play(*pargs, **kwargs)

This quick-and-dirty method allows you to play a Sound. It creates a Player instance in verbose mode, uses it to play the sound, waits for it to finish (or for the user to press ctrl-C), then destroys the Player again.

You will get a better user experience, and better performance, if you explicitly create a Player instance of your own and work with that.

Arguments are passed through to the Player.Play() method.
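
For example (a sketch):

s.Play()       # quick and dirty
p = Player(s)  # better: create a reusable Player instance...
p.Play()       # ...and play through that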

Plot(zeroBased=False, maxDuration=None, envelope='auto', title=True, timeShift=0, timeScale=1, hold=False, finish=True)

Plot the sound waveform for each channel. The third-party Python package matplotlib is required to make this work; you may need to install it yourself.

Parameters:
  • zeroBased (bool) – This determines whether the y-axis labels show zero-based channel numbers (Python’s normal convention) or one-based channel numbers (the convention followed by almost every other audio software tool). In keeping with audiomath’s slicing syntax, zero-based indices are expressed as integers whereas one-based channel indices are expressed as string literals.

  • maxDuration (float, None) – Long sounds can take a prohibitive amount of time and memory to plot. If maxDuration is not None, this method will plot no more than the first maxDuration seconds of the sound. If this means some of the sound has been omitted, a warning is printed to the console.

  • envelope (bool, float, ‘auto’) – With envelope=True, plot the sound in “envelope mode”, which is less veridical when zoomed-in but which uses much less time and memory in the graphics back end. With envelope=False, plot the sound waveform as a line. With the default value of envelope='auto', only go into envelope mode when plotting more than 60 seconds’ worth of sound. You can also supply a floating-point value, expressed in seconds: this explicitly enforces envelope mode with the specified bin width.

  • title (bool, str) – With title=True, the instance’s label attribute is used as the axes title. With title=False or title=None, the axes title is left unchanged. If a string is supplied explicitly, then that string is used as the axes title.

  • timeShift (float) – The x-axis (time) begins at this value, expressed in seconds.

  • timeScale (float) – After time-shifting, the time axis is multiplied by this number. For example, you can specify timeScale=1000 to visualize your sound on a scale of milliseconds instead of seconds.

  • hold (bool) – With hold=False, the axes are cleared before plotting. With hold=True, the plot is superimposed on top of whatever is already plotted in the current axes.
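
For example (a sketch):

s.Plot()                               # plot the whole waveform
s.Plot(maxDuration=10, envelope=0.01)  # at most the first 10 sec, in envelope mode with 10-msec bins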

Read(source, raw_dtype=None, toolkit='auto', verbose=False)

Load sound data into the instance, either from a file or from a byte string containing raw audio data.

Parameters:
  • source – A filename, or a byte string containing raw audio data. With filenames, files are decoded according to their file extension, unless the raw_dtype argument is explicitly specified, in which case files are assumed to contain raw data without header, regardless of extension.

  • raw_dtype (str) – If supplied, source is interpreted either as raw audio data, or as the name of a file containing raw audio data without a header. If source is a byte string containing raw audio data, and raw_dtype is unspecified, raw_dtype will default to self.dtype_encoded. Examples might be float32 or even float32*2—the latter explicitly overrides the current value of self.NumberOfChannels() and interprets the raw data as 2-channel.

  • toolkit (str) – Must be 'auto' (which is the default), or 'wave', 'audioread' or 'avbin'. Determines which of the back-ends to use for decoding audio. AVbin binaries for common platforms are included, but the AVbin project is no longer maintained, so support may become sparser/buggier over time. It is recommended that you install the optional third-party package audioread which appears to offer better performance (you may also need to install the ffmpeg utility on some platforms/for some file types).

  • verbose (bool) – If set to True, report toolkit decision-making and decoding progress to stdout.
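
For example (a sketch; the file path is hypothetical, and an empty instance is assumed to be constructible with Sound()):

s = Sound()
s.Read('/path/to/compressed_file.mp3', toolkit='audioread')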

Resample(newfs)

Change the instance’s sampling frequency. Replace its internal array with a new array, interpolated at the new sampling frequency newfs (expressed in Hz).

Reverse()

Reverse the sound in time. The array self.y is modified in-place.

Right()

Return a new Sound instance containing a view of alternate channels, starting at the second (unless there is only one channel, in which case return that).

SamplesToSeconds(samples)

Convert a number of samples to seconds, given the sampling frequency of the instance. samples may be a scalar, or a sequence or array. The result is returned in floating-point.

See also: SecondsToSamples()

SecondsToSamples(seconds, rounding='round')

Convert seconds to samples given the Sound instance’s sampling frequency. seconds may be a scalar, or a sequence or array. The rounding approach may be 'floor', 'round', 'ceil', 'none' or 'int'. The 'int' option rounds in the same way as 'floor' but returns integers—all other options return floating-point numbers.

See also: SamplesToSeconds()
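
For example (a sketch, assuming s.fs is 44100):

n = s.SecondsToSamples(0.5, rounding='int')  # -> 22050, as an integer
t = s.SamplesToSeconds(n)                    # -> 0.5, as a float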

SplitChannels(nChannelsEach=1)

Return a list of new Sound instances, each containing a view into the original data, each limited to nChannelsEach consecutive channels.
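
For example (a sketch, assuming `s` is stereo):

left, right = s.SplitChannels()  # two single-channel views into the data of `s`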

Stack(*args)

Stack the instance, across channels, with the specified arguments (which may be other Sound instances and/or numpy arrays). Replace the sample array self.y with the result. Similar results can be obtained with the &= operator or the global function Stack().

Trim(threshold=0.05, tailoff=0.2, buildup=0)

Remove samples from the beginning and end of a sound, according to amplitude.

The new waveform will start buildup seconds prior to the first sample on which the absolute amplitude in any channel exceeds threshold. It will end tailoff seconds after the last sample on which threshold is exceeded.
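
For example (a sketch):

s.Trim(threshold=0.1, buildup=0.05, tailoff=0.2)  # keep 0.05 sec of lead-in and 0.2 sec of tail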

copy(empty=False)

Create a new Sound instance whose array y is a deep copy of the original instance’s array. With empty=True the resulting array will be empty, but will have the same number of channels as the original.

Note that most other preprocessing methods return self as their output argument. This makes it easy to choose between modifying an instance in-place and creating a modified copy. For example:

s.Reverse()             # reverse `s` in-place
t = s.Copy().Reverse()  # create a reversed copy of `s`

dat2str(data=None, dtype=None)

Converts from a numpy array to a string. data defaults to the whole of s.y.

The string output contains raw bytes which can be written, for example, to an open audio stream.
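
For example (a sketch; `stream` stands for a hypothetical open audio stream):

raw = s.dat2str()  # raw bytes, encoded according to `s.dtype_encoded`
stream.write(raw)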

property PitchShift

To use the PitchShift method of the Sound class, you must first explicitly import audiomath.StretchAndShift.

property TimeStretch

To use the TimeStretch method of the Sound class, you must first explicitly import audiomath.StretchAndShift.

property bits

Bit depth of each sample in each channel, when encoded (not necessarily as represented in memory for manipulation and visualization).

property bytes

Number of bytes used to represent each sample of each channel, when encoded (not necessarily as represented in memory for manipulation and visualization).

property duration

Returns the duration of the sound in seconds.

property fs

Sampling frequency, in Hz.

property nChannels

Returns the number of channels.

property nSamples

Returns the length of the sound in samples.

property nbits

Bit depth of each sample in each channel, when encoded (not necessarily as represented in memory for manipulation and visualization).

property nbytes

Number of bytes used to represent each sample of each channel, when encoded (not necessarily as represented in memory for manipulation and visualization).

property nchan

Returns the number of channels.

property nsamp

Returns the length of the sound in samples.

property numberOfChannels

Returns the number of channels.

property numberOfSamples

Returns the length of the sound in samples.

property rms

A 1-D numpy array containing the root-mean-square amplitude of each channel, i.e. the same as Amplitude(norm=2, threshold=0).

property y

numpy array containing the actual sound sample data.