How to develop a custom digestor?

In [ ]:

!git clone https://github.com/pydamavand/damavand

fatal: destination path 'damavand' already exists and is not an empty directory.

In [2]:

!pip install -r damavand/requirements.txt

Requirement already satisfied: certifi==2024.7.4 in /usr/local/lib/python3.11/dist-packages (from -r damavand/requirements.txt (line 1)) (2024.7.4)
Requirement already satisfied: charset-normalizer==3.3.2 in /usr/local/lib/python3.11/dist-packages (from -r damavand/requirements.txt (line 2)) (3.3.2)
Requirement already satisfied: idna==3.7 in /usr/local/lib/python3.11/dist-packages (from -r damavand/requirements.txt (line 3)) (3.7)
Requirement already satisfied: numpy==1.26.4 in /usr/local/lib/python3.11/dist-packages (from -r damavand/requirements.txt (line 4)) (1.26.4)
Requirement already satisfied: pandas==2.1.4 in /usr/local/lib/python3.11/dist-packages (from -r damavand/requirements.txt (line 5)) (2.1.4)
Requirement already satisfied: python-dateutil==2.9.0.post0 in /usr/local/lib/python3.11/dist-packages (from -r damavand/requirements.txt (line 6)) (2.9.0.post0)
Requirement already satisfied: pytz==2024.1 in /usr/local/lib/python3.11/dist-packages (from -r damavand/requirements.txt (line 7)) (2024.1)
Requirement already satisfied: rarfile==4.2 in /usr/local/lib/python3.11/dist-packages (from -r damavand/requirements.txt (line 8)) (4.2)
Requirement already satisfied: requests==2.32.3 in /usr/local/lib/python3.11/dist-packages (from -r damavand/requirements.txt (line 9)) (2.32.3)
Requirement already satisfied: scipy==1.13.1 in /usr/local/lib/python3.11/dist-packages (from -r damavand/requirements.txt (line 10)) (1.13.1)
Requirement already satisfied: six==1.16.0 in /usr/local/lib/python3.11/dist-packages (from -r damavand/requirements.txt (line 11)) (1.16.0)
Requirement already satisfied: tzdata==2024.1 in /usr/local/lib/python3.11/dist-packages (from -r damavand/requirements.txt (line 12)) (2024.1)
Requirement already satisfied: urllib3==2.2.2 in /usr/local/lib/python3.11/dist-packages (from -r damavand/requirements.txt (line 13)) (2.2.2)

A digestor is essentially a crawler: it walks the directories under the base directory of a downloaded dataset, extracts the data from the raw dataset files (usually files with .mat, .csv or .xlsx extensions) and organizes the corresponding metadata.
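
The crawling step can be sketched with the standard library alone. This is only an illustration under assumptions: the helper name `find_data_files` and its default extension list are hypothetical and not part of Damavand.

```python
import os


def find_data_files(base_dir, extensions=('.mat', '.csv', '.xlsx')):
    """Collect the paths of all files under base_dir whose extension matches.

    Hypothetical helper for illustration; not part of Damavand.
    """
    matches = []
    # os.walk visits base_dir and every sub-directory below it
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            # str.endswith accepts a tuple of suffixes
            if name.lower().endswith(tuple(extensions)):
                matches.append(os.path.join(root, name))
    return sorted(matches)
```

A real digestor would then open each returned path and attach the metadata, as developed step by step in the rest of this section.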

In this section, we demonstrate how to develop a digestor for the UoO dataset from scratch.

Let's start by downloading the dataset:

In [3]:

from damavand.damavand.datasets.downloaders import read_addresses, ZipDatasetDownloader

addresses = read_addresses()
downloader = ZipDatasetDownloader(addresses['UoO'])
downloader.download_extract('UoO.zip', 'UoO/')

The code snippet above downloads the dataset and extracts it into the UoO folder. The next step is to inspect its contents:

In [4]:

import os

os.listdir('UoO')

Out[4]:

['H-B-2.mat',
 'O-D-2.mat',
 'I-D-1.mat',
 'H-D-3.mat',
 'I-A-3.mat',
 'H-C-1.mat',
 'H-A-3.mat',
 'H-B-3.mat',
 'H-B-1.mat',
 'H-C-2.mat',
 'O-B-3.mat',
 'O-A-2.mat',
 'H-A-2.mat',
 'I-B-3.mat',
 'I-D-2.mat',
 'I-D-3.mat',
 'H-D-2.mat',
 'I-C-1.mat',
 'I-B-2.mat',
 'O-C-3.mat',
 'I-B-1.mat',
 'H-A-1.mat',
 'I-A-1.mat',
 'O-D-1.mat',
 'H-C-3.mat',
 'O-A-3.mat',
 'H-D-1.mat',
 'I-A-2.mat',
 'O-A-1.mat',
 'O-B-2.mat',
 'O-D-3.mat',
 'I-C-3.mat',
 'O-C-2.mat',
 'O-C-1.mat',
 'I-C-2.mat',
 'O-B-1.mat']

According to the official paper, the files follow a naming convention of the form C-L-R.mat, where:

C corresponds to the health class (H for healthy, O for outer-race faults and I for inner-race faults)
L corresponds to the loading pattern (A for increasing rotational speed, B for decreasing rotational speed, C for increasing then decreasing rotational speed and D for decreasing then increasing rotational speed)
R corresponds to the repetition number (from 1 to 3)
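
The convention above can be turned into a small parser. The sketch below is an illustration only: the helper `parse_uoo_filename` and the two lookup dictionaries are hypothetical names, not part of Damavand.

```python
import os

# Labels taken from the naming convention described above
STATES = {'H': 'healthy', 'O': 'outer-race fault', 'I': 'inner-race fault'}
LOADINGS = {
    'A': 'increasing speed',
    'B': 'decreasing speed',
    'C': 'increasing then decreasing speed',
    'D': 'decreasing then increasing speed',
}


def parse_uoo_filename(filename):
    """Split a name such as 'H-B-2.mat' into its three metadata fields.

    Hypothetical helper for illustration; not part of Damavand.
    """
    stem, _ext = os.path.splitext(os.path.basename(filename))
    state, loading, rep = stem.split('-')
    if state not in STATES or loading not in LOADINGS:
        raise ValueError(f'Unexpected file name: {filename!r}')
    return {'state': state, 'loading': loading, 'rep': int(rep)}


meta = parse_uoo_filename('H-B-2.mat')
print(meta, '->', STATES[meta['state']], '/', LOADINGS[meta['loading']])
```

Rejecting unknown codes early makes stray files in the dataset folder fail loudly instead of ending up as mislabeled rows.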

Let's examine what is stored in one of the files. A .mat file can be opened with scipy.io.loadmat, as below:

In [5]:

from scipy.io import loadmat

base_dir = 'UoO/'
mat_data = loadmat(base_dir + 'H-B-2.mat')
mat_data.keys()

Out[5]:

dict_keys(['__header__', '__version__', '__globals__', 'Channel_1', 'Channel_2'])

According to the authors, Channel_1 holds the accelerometer data and Channel_2 holds the encoder data.
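
The keys wrapped in double underscores are bookkeeping entries that loadmat adds to every file, so the signal channels can be picked out by filtering them away. The sketch below uses a stand-in dictionary with placeholder values instead of the real file, so it runs without the dataset.

```python
# Stand-in for the dictionary returned by loadmat; the lists are placeholder
# values, not real measurements.
mat_data = {
    '__header__': b'placeholder',
    '__version__': '1.0',
    '__globals__': [],
    'Channel_1': [0.0, 0.1, 0.2],
    'Channel_2': [1.0, 1.0, 0.0],
}

# Keep only the keys that are not loadmat bookkeeping entries
signal_keys = [key for key in mat_data if not key.startswith('__')]
print(signal_keys)   # ['Channel_1', 'Channel_2']
```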

Mining a dataset involves opening the files, splitting the signals into windows and attaching the metadata.

The following code snippet does this for Channel_1.

In [6]:

from damavand.damavand.utils import splitter

base_dir = 'UoO/'
win_len, hop_len = 10000, 10000
data = []

for file in os.listdir(base_dir):
  # Splitting the file name to extract three pieces of metadata: state, loading and repetition
  state = file.split('.')[0].split('-')[:-1][0]
  loading = file.split('.')[0].split('-')[:-1][1]
  rep = file.split('.')[0].split('-')[-1]

  # Opening the .mat file as a python dictionary
  mat_data = loadmat(base_dir + file)
  # Mining the data available in Channel_1, using the splitter function and saving it to the temp_df variable
  temp_df = splitter(mat_data['Channel_1'].reshape((-1)), win_len, hop_len)
  # Assigning the metadata, as new columns to the temp_df
  temp_df['state'], temp_df['loading'], temp_df['rep'] = state, loading, rep
  # Appending the temp_df to the data list
  data.append(temp_df)
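
To make the roles of `win_len` and `hop_len` concrete, here is a minimal sketch of sliding-window indexing. It does not reproduce Damavand's `splitter`: the helper `window_bounds` is hypothetical, and dropping an incomplete trailing window is an assumption.

```python
def window_bounds(n_samples, win_len, hop_len):
    """Return (start, stop) index pairs of all complete windows.

    Hypothetical helper for illustration; not Damavand's splitter.
    """
    if win_len <= 0 or hop_len <= 0:
        raise ValueError('win_len and hop_len must be positive')
    # The last valid start is n_samples - win_len; anything later is incomplete
    return [(start, start + win_len)
            for start in range(0, n_samples - win_len + 1, hop_len)]


# hop_len == win_len: consecutive windows do not overlap
print(window_bounds(10, 4, 4))   # [(0, 4), (4, 8)]
# hop_len < win_len: consecutive windows overlap by win_len - hop_len samples
print(window_bounds(10, 4, 2))   # [(0, 4), (2, 6), (4, 8), (6, 10)]
```

With win_len = hop_len = 10000, as in the snippet above, consecutive windows share no samples.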

The code snippet above mines only the Channel_1 data and goes through every available file. However, a user may need other channels or be interested in only a limited number of repetitions. To handle both cases, the snippet can be modified as below:

In [7]:

base_dir = 'UoO/'

channels = ['Channel_1', 'Channel_2']
win_len, hop_len = 10000, 10000
reps = [1]
data = {channel: [] for channel in channels}

for file in os.listdir(base_dir):
  if file.endswith('.mat'):
    rep = int(file.split('.')[0].split('-')[-1])
    if rep in reps:
      state = file.split('.')[0].split('-')[:-1][0]
      loading = file.split('.')[0].split('-')[:-1][1]
      mat_data = loadmat(base_dir + file)
      for channel in data.keys():
        temp_df = splitter(mat_data[channel].reshape((-1)), win_len, hop_len)
        temp_df['state'], temp_df['loading'], temp_df['rep'] = state, loading, rep
        data[channel].append(temp_df)
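
Before running the full loop, it can help to check which files a given `reps` value selects. The sketch below repeats the same filtering logic on file names only; the helper `select_files` is a hypothetical name, and the names are rebuilt from the naming convention rather than read from disk.

```python
def select_files(filenames, reps):
    """Keep the .mat files whose repetition number is in reps."""
    selected = []
    for name in filenames:
        if not name.endswith('.mat'):
            continue
        # The repetition number is the last dash-separated field of the stem
        rep = int(name.split('.')[0].split('-')[-1])
        if rep in reps:
            selected.append(name)
    return selected


# Rebuild the 36 file names: 3 states x 4 loadings x 3 repetitions
all_files = [f'{s}-{l}-{r}.mat' for s in 'HIO' for l in 'ABCD' for r in (1, 2, 3)]
print(len(select_files(all_files, reps=[1])))   # 12
```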

We regard object-oriented programming as essential to developing reusable and easily maintainable code; therefore, in Damavand, every digestor must be implemented as a Python class.

The code snippet above can easily be transformed into the following class:

In [8]:

class UoO():
  # Instantiating a digestor object requires the base directory, the channels and the repetitions the user is interested in.
  def __init__(self, base_directory, channels = ['Channel_1', 'Channel_2'], reps = list(range(1,4))):
    self.base_dir = base_directory
    self.channels = channels
    self.reps = reps

    # Once the dataset is mined, the data is stored in a Python dictionary whose keys are the channels the user specified during instantiation.
    self.data = {key: [] for key in self.channels}

  # To mine the dataset, the user must specify the window length and hop length, passed as a Python dictionary whose keys are 'win_len' and 'hop_len'.
  def mine(self, mining_params):
    for file in os.listdir(self.base_dir):
      if file.endswith('.mat'):
        # The repetition number is the last field of the file name
        rep = int(file.split('.')[0].split('-')[-1])
        if rep in self.reps:
          state = file.split('.')[0].split('-')[:-1][0]
          loading = file.split('.')[0].split('-')[:-1][1]
          # os.path.join works whether or not base_dir ends with a path separator
          mat_data = loadmat(os.path.join(self.base_dir, file))
          for channel in self.data.keys():
            temp_df = splitter(mat_data[channel].reshape((-1)), mining_params['win_len'], mining_params['hop_len'])
            temp_df['state'], temp_df['loading'], temp_df['rep'] = state, loading, rep
            self.data[channel].append(temp_df)
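
Because `mine` reads `mining_params['win_len']` and `mining_params['hop_len']` directly, a missing key only surfaces as a `KeyError` once the first file has already been loaded. One possible refinement is to validate the dictionary up front. The helper below is a hypothetical sketch, not part of Damavand, and the requirement that both values are positive integers is an assumption.

```python
def check_mining_params(mining_params):
    """Validate the dictionary passed to mine(); hypothetical helper."""
    for key in ('win_len', 'hop_len'):
        if key not in mining_params:
            raise KeyError(f"mining_params is missing the '{key}' entry")
        value = mining_params[key]
        # bool is a subclass of int, so it is excluded explicitly
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            raise ValueError(f"'{key}' must be a positive integer, got {value!r}")
    return mining_params['win_len'], mining_params['hop_len']


win_len, hop_len = check_mining_params({'win_len': 10000, 'hop_len': 10000})
```

Calling such a check as the first statement of `mine` would make configuration mistakes fail before any file is opened.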

The digestor can then be used as below:

In [9]:

dataset = UoO('UoO/', ['Channel_1', 'Channel_2'], [1])
mining_params = {'win_len': 10000, 'hop_len': 10000}
dataset.mine(mining_params)

# Concatenate all observations under `Channel_1`
import pandas as pd

df = pd.concat(dataset.data['Channel_1']).reset_index(drop = True)
df

Out[9]:

	0	1	2	3	4	5	6	7	8	9	...	9993	9994	9995	9996	9997	9998	9999	state	loading	rep
0	-0.026534	-0.012723	0.008981	0.025424	0.026410	0.016874	0.008324	0.021806	0.042195	0.051074	...	-0.032453	-0.040675	-0.037386	-0.018642	0.000102	0.008981	0.009968	I	D	1
1	0.014572	0.032658	0.059624	0.085274	0.097770	0.101716	0.104676	0.116843	0.137232	0.149399	...	0.002076	0.016545	0.022135	0.011941	-0.001871	-0.006474	-0.002199	I	D	1
2	0.003391	0.003720	-0.002199	-0.004501	0.001418	0.013585	0.019176	0.013585	0.004377	-0.001213	...	0.010297	-0.011078	-0.033111	-0.050211	-0.061392	-0.067311	-0.069942	I	D	1
3	-0.068627	-0.060077	-0.039688	-0.010749	0.017860	0.036933	0.040880	0.045812	0.049101	0.049430	...	-0.007132	0.007995	0.011283	0.000431	-0.014367	-0.016011	-0.000884	I	D	1
4	0.013914	0.015558	0.006022	-0.002857	0.005364	0.019176	0.022793	0.011941	-0.004173	-0.009434	...	0.010297	0.021478	0.029699	0.034303	0.035618	0.033645	0.036605	I	D	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2395	-0.002199	-0.001213	0.000431	0.001747	0.003062	0.003391	0.003391	0.003720	0.004049	0.004377	...	0.003391	0.003062	0.002404	0.001747	0.001089	0.001089	0.002404	O	B	1
2396	0.002404	0.000760	0.000102	-0.000884	-0.001213	-0.011736	-0.003186	-0.001542	-0.000226	0.000431	...	-0.002857	-0.001542	-0.000226	0.001418	0.003062	0.004377	0.006022	O	B	1
2397	0.006351	0.008324	0.005693	0.009310	0.007995	0.009310	0.008653	0.007666	0.007995	0.000102	...	-0.001542	-0.002528	-0.001871	-0.001213	-0.001213	-0.001213	-0.000884	O	B	1
2398	-0.001542	-0.002199	-0.000555	-0.000226	0.000431	0.001747	0.002076	0.002733	0.004049	-0.016669	...	0.002076	0.002404	0.005693	0.006351	0.006679	0.006679	0.007995	O	B	1
2399	0.008981	0.009968	0.008981	0.007666	0.006679	0.006022	0.006679	0.007995	0.007666	0.007995	...	-0.000884	-0.000555	-0.000884	-0.001542	-0.002528	-0.001213	-0.001871	O	B	1

2400 rows × 10003 columns
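
The reported shape can be cross-checked with a little arithmetic. The sketch below relies on one assumption that the output does not confirm directly: every selected file contributes the same number of windows.

```python
win_len = 10000
metadata_columns = ['state', 'loading', 'rep']

# One column per sample in a window, plus the metadata columns
n_columns = win_len + len(metadata_columns)
print(n_columns)          # 10003

# Files selected with reps=[1]: 3 health states x 4 loading patterns x 1 repetition
n_files = 3 * 4 * 1
n_rows = 2400             # row count reported above

# Assumption: every file contributes the same number of windows
windows_per_file = n_rows // n_files
print(windows_per_file)   # 200
```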
