Damavand
Motivation
The main motivation behind developing Damavand is to democratize intelligent predictive maintenance for rotary machinery, through an end-to-end, unified data processing framework that covers everything from downloading the raw data to preprocessing it. On this page, we compare the experience of solving a simple health state classification task with and without Damavand, to demonstrate the value the library provides.
Hands-on Example: Machinery Health Classification on the SEU Dataset
In this practical example, we compare the implementation of a machinery health state classification problem, with and without Damavand, on the gearbox and bearing fault dataset published by Southeast University (SEU). The following sections walk through each step and provide both versions of the implementation.
0. Downloading and extracting the dataset
Inevitably, the first step is to download the dataset; SEU is available as a single .zip file and can be downloaded as follows:
| with Damavand | without Damavand |
# Imports
from damavand.damavand.datasets.downloaders import read_addresses, ZipDatasetDownloader
# Instantiating a downloader object
addresses = read_addresses()
downloader = ZipDatasetDownloader(addresses['SEU'])
# Downloading and extracting
downloader.download_extract('SEU.zip', 'SEU/')
|
# Imports
import requests
from zipfile import ZipFile
# Specifying the address
download_link = "https://drive.usercontent.google.com/download?id=1y_f9si3fh7dhTYvTwjSWBTdottTXM6KN&export=download&confirm=t&uuid=ba143b27-5d6f-459e-a583-e1ff07b921b6"
# Downloading the .zip file
download_file_name = "SEU.zip"
response = requests.get(download_link)
with open(download_file_name, "wb") as file:
    file.write(response.content)
# Extracting the .zip file
extraction_path = "SEU/"
with ZipFile(download_file_name, "r") as zObject:
    zObject.extractall(extraction_path)
|
1. Mining the dataset
Once the dataset is downloaded and extracted, it comes to the mining step; mining involves transforming the raw dataset files into structured, annotated signals (annotations include working/loading conditions, health state, etc.). The following shows how it can be done for the SEU dataset:
| with Damavand | without Damavand |
# Imports
import pandas as pd
from damavand.damavand.datasets.digestors import SEU
# Instantiating a digestor object
seu = SEU('SEU/')
# Setting the mining parameters
mining_params = {
    'win_len': 10000,
    'hop_len': 10000
}
# Mining the dataset
seu.mine(mining_params)
# Aggregating the data over the second accelerometer
df = pd.concat(seu.data[1]).reset_index(drop = True)
|
import gc
import os
import numpy as np
import pandas as pd
# Defining a splitter function (splitting the super-long original signals into segments)
def splitter(array, win_len, hop_len, return_df = True):
    N = array.shape[0]
    m = 0
    ids = []
    while m + win_len <= N:
        ids.append([m, m + win_len])
        m = m + hop_len
    if return_df:
        return pd.DataFrame([array[i[0]: i[1]] for i in ids])
    else:
        return np.array([array[i[0]: i[1]] for i in ids])
# Setting the mining parameters
mining_params = {
    'win_len': 10000,
    'hop_len': 10000
}
# Setting the base directory
base_dir = 'SEU/'
# Setting the sensors of interest
channels = [1]
# Initializing the output data interface
data = {key: [] for key in channels}
# Mining the dataset
# Iterating over the subdirectories of the dataset
for sub_directory in os.listdir(base_dir):
    # Iterating over the files inside each subdirectory
    for file in os.listdir(base_dir + sub_directory):
        # Checking if the file is a valid measurement (.csv) file
        if file.endswith(".csv"):
            # Splitting the file name to extract metadata
            file_split = file.split(".csv")[0].split("_")
            test_bed = sub_directory
            state = file_split[0]
            rot_speed = file_split[1]
            # Opening the file
            with open(
                base_dir + sub_directory + "/" + file,
                "r",
                encoding="gb18030",
                errors="ignore",
            ) as f:
                # Loading the content
                content = f.readlines()
            # Handling the cases where the delimiter is ","
            if file == "ball_20_0.csv":
                arr = np.array(
                    [i.split(",")[:-1] for i in content[16:]]
                ).astype(float)
            # Handling the cases where the delimiter is "\t"
            else:
                arr = np.array(
                    [i.split("\t")[:-1] for i in content[16:]]
                ).astype(float)
            print("Mining: ", file)
            # Iterating over each channel
            for key in data.keys():
                # Segmenting the original signal according to the mining parameters
                temp_df = splitter(
                    arr[:, key],
                    mining_params["win_len"],
                    mining_params["hop_len"],
                )
                # Assigning annotations
                temp_df["test_bed"] = test_bed
                temp_df["state"] = state
                temp_df["rot_speed"] = rot_speed
                # Adding the mined signals to the data delivery interface, according to the channel
                data[key].append(temp_df)
            # Using the garbage collector to keep RAM usage manageable.
            gc.collect()
# Aggregating the data over the second accelerometer
df = pd.concat(data[1]).reset_index(drop = True)
|
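Either way, the resulting df holds one segmented signal per row: win_len (here 10000) sample columns followed by the three annotation columns. The sketch below shows how the frame can be sanity-checked; it runs on synthetic data standing in for the mined SEU signals, and the example annotation values ("gearset", "health", "chipped", "20") are merely plausible placeholders, not guaranteed to match the exact strings in the SEU file names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the mined SEU frame: 4 segments of 10000 samples,
# annotated with the same three metadata columns the mining step produces.
# The annotation values here are hypothetical placeholders.
win_len = 10000
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(4, win_len)))
df["test_bed"] = "gearset"
df["state"] = ["health", "health", "chipped", "chipped"]
df["rot_speed"] = "20"

# Sanity checks: every row is one fixed-length segment plus its annotations
assert df.shape[1] == win_len + 3
signals, metadata = df.iloc[:, :-3], df.iloc[:, -3:]
assert signals.shape == (len(df), win_len)
assert list(metadata.columns) == ["test_bed", "state", "rot_speed"]
```

The same signals/metadata split is used in the feature extraction step below.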
2. Extracting features
Transforming raw measurements into an enriched feature space with higher potential for the task at hand is standard practice in machine learning; in this section, we extract two different feature spaces:
- Frequency domain representation
- Time-domain statistical features
Damavand facilitates feature extraction in two ways: it not only provides a standard, unified interface for extracting any feature (as long as it is implementable as a Python function), but also gathers a wide range of specific features (from both the time and frequency domains) believed to be highly effective for rotating machinery fault classification. For this presentation, the following features will be used:
- Squared Mean of Square Roots of Absolutes: this will be implemented as a Python function, although it is also accessible via
from damavand.damavand.signal_processing.feature_extraction import smsa.
- Root Mean Square
- Crest Factor
- Skewness: This feature is extracted using
scipy.stats.skew.
- Kurtosis: This feature is extracted using
scipy.stats.kurtosis.
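Before wiring these features into the pipeline, they can be sanity-checked against known values; a sketch independent of Damavand, using a pure sine wave whose RMS, crest factor, skewness and excess kurtosis have closed-form values:

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Time-domain feature definitions as plain NumPy one-liners
def smsa(arr):
    return np.square(np.mean(np.sqrt(np.abs(arr))))

def rms(arr):
    return np.sqrt(np.mean(np.square(arr)))

def crest_factor(arr):
    return np.max(np.abs(arr)) / rms(arr)

# A unit-amplitude sine sampled over whole periods has known statistics:
# rms = 1/sqrt(2), crest factor = sqrt(2), skewness = 0, excess kurtosis = -1.5
x = np.sin(2 * np.pi * np.arange(10000) / 1000)
assert np.isclose(rms(x), 1 / np.sqrt(2))
assert np.isclose(crest_factor(x), np.sqrt(2))
assert np.isclose(skew(x), 0, atol=1e-7)
assert np.isclose(kurtosis(x), -1.5, atol=1e-6)
assert 0 < smsa(x) < 1  # smsa of a bounded signal stays below its peak
```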
| with Damavand | without Damavand |
# Imports
import numpy as np
from damavand.damavand.signal_processing.transformations import fft
from damavand.damavand.signal_processing.feature_extraction import rms, crest_factor, feature_extractor
from scipy.signal.windows import hann
from scipy.signal import butter
from scipy.stats import skew, kurtosis
# Separating signals from metadata
signals, metadata = df.iloc[:, : - 3], df.iloc[:, - 3 :]
### Extracting the frequency domain representation
# Defining a window (to avoid leakage error) and a frequency filter (to avoid aliasing)
window = hann(signals.shape[1])
freq_filter = butter(25, [15, 950], 'bandpass', fs = 2000, output = 'sos')
# Applying FFT to extract frequency domain signals
signals_fft = fft(signals, freq_filter = freq_filter, window = window)
### Extracting time-domain features
# Defining a custom feature (Squared Mean of Square Roots of Absolutes)
def smsa(arr):
    return np.square(np.mean(np.sqrt(np.abs(arr))))
# Defining the list of features to extract
time_features = {
    'smsa': (smsa, (), {}),
    'rms': (rms, (), {}),
    'skew': (skew, (), {}),
    'kurtosis': (kurtosis, (), {}),
    'crest_factor': (crest_factor, (), {}),
}
# Extracting time-domain features
time_features_df = feature_extractor(signals, time_features)
|
# Imports
import numpy as np
import pandas as pd
from scipy.fft import fft
from scipy.signal.windows import hann
from scipy.signal import butter, sosfilt
from scipy.stats import skew, kurtosis
# Separating signals from metadata
signals, metadata = df.iloc[:, : - 3], df.iloc[:, - 3 :]
### Extracting the frequency domain representation
# Defining a window (to avoid leakage error) and a frequency filter (to avoid aliasing)
window = hann(signals.shape[1])
freq_filter = butter(25, [15, 950], 'bandpass', fs = 2000, output = 'sos')
# Filtering and windowing the signals before the transform
signals_conditioned = sosfilt(freq_filter, signals.values, axis = 1) * window
# Applying FFT to extract the raw frequency domain representation
fft_raw = fft(signals_conditioned)
# Keeping only the positive frequencies, taking absolute values and fixing the scaling
signals_fft = pd.DataFrame(2.0 / signals.shape[1] * np.abs(fft_raw[:, 0 : signals.shape[1] // 2]))
### Extracting time-domain features
# Implementing "Squared Mean of Square Roots of Absolutes" as a Python function.
def smsa(arr):
    return np.square(np.mean(np.sqrt(np.abs(arr))))
# Implementing "Root Mean Square" as a Python function.
def rms(arr):
    return np.sqrt(np.mean(np.square(arr)))
# Implementing "Crest Factor" as a Python function.
def crest_factor(arr):
    return np.max(np.abs(arr)) / rms(arr)
# Organizing the functions as a feature set
time_features = {
    'smsa': (smsa, (), {}),
    'rms': (rms, (), {}),
    'skew': (skew, (), {}),
    'kurtosis': (kurtosis, (), {}),
    'crest_factor': (crest_factor, (), {}),
}
# Iterating over the dataframe to extract the desired features
all_extracted_features = []
for index, row in signals.iterrows():
    row_features = {}
    for feat_name, (func, args, kwargs) in time_features.items():
        row_features[feat_name] = func(row, *args, **kwargs)
    all_extracted_features.append(row_features)
time_features_df = pd.DataFrame(all_extracted_features)
|
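With signals_fft and time_features_df in hand, the remaining step of the health classification task is model training, which is outside the scope of this page; any tabular classifier applies. A minimal sketch, using a hand-rolled nearest-centroid classifier on synthetic features (stand-ins for time_features_df and the state annotation, since the mined SEU data is assumed unavailable here):

```python
import numpy as np

# Synthetic stand-ins for the extracted feature table and the health-state labels
rng = np.random.default_rng(42)
healthy = rng.normal(loc=0.0, scale=1.0, size=(50, 5))  # e.g. 5 time-domain features
faulty = rng.normal(loc=3.0, scale=1.0, size=(50, 5))
X = np.vstack([healthy, faulty])
y = np.array(["health"] * 50 + ["faulty"] * 50)

# Nearest-centroid classification: assign each sample to the closest class mean
classes = np.unique(y)
centroids = np.stack([X[y == c].mean(axis=0) for c in classes])

def predict(samples):
    # Distance of every sample to every class centroid
    dists = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Training accuracy on the well-separated synthetic classes
accuracy = (predict(X) == y).mean()
assert accuracy > 0.95
```

In practice, the frequency-domain and time-domain feature tables can be concatenated column-wise before training, and a proper train/test split should be applied.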
Takeaway
Our simple demonstration shows how much shorter and more efficient the operations involved in a simplified health classification problem become once Damavand is used. The true advantage Damavand brings, however, is that the mining and feature extraction steps become considerably easier to repeat with different sets of hyperparameters (and signal processing options), making experiments faster and more straightforward to implement.