How to develop a custom digestor?
!git clone https://github.com/pydamavand/damavand
!pip install -r damavand/requirements.txt
A digestor is essentially a crawler: it walks the directories under the base directory of a downloaded dataset, extracts data from the raw dataset files (usually files with .mat, .csv or .xlsx extensions) and organizes the corresponding metadata.
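To make the "crawler" idea concrete, here is a minimal sketch of walking a base directory and collecting raw data files by extension. It is not part of Damavand: the function name `find_raw_files` and its default extension list are assumptions made for illustration only.

```python
import os

def find_raw_files(base_dir, extensions=('.mat', '.csv', '.xlsx')):
    """Return the sorted paths of files under base_dir with a matching extension.

    Hypothetical helper for illustration; Damavand's digestors may
    traverse directories differently.
    """
    wanted = tuple(ext.lower() for ext in extensions)
    matches = []
    # os.walk visits base_dir and every subdirectory below it.
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            if name.lower().endswith(wanted):
                matches.append(os.path.join(root, name))
    return sorted(matches)
```

For a flat dataset such as the one used below, `os.listdir` is enough; `os.walk` matters only when the raw files are nested in subfolders.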
In this section, we demonstrate how to develop a digestor for the UoO dataset from scratch.
Let's start by downloading the dataset:
from damavand.damavand.datasets.downloaders import read_addresses, ZipDatasetDownloader
addresses = read_addresses()
downloader = ZipDatasetDownloader(addresses['UoO'])
downloader.download_extract('UoO.zip', 'UoO/')
The code snippet above downloads the dataset and extracts it into the UoO folder. The next step is to inspect its contents:
import os
os.listdir('UoO')
['H-B-2.mat', 'O-D-2.mat', 'I-D-1.mat', 'H-D-3.mat', 'I-A-3.mat', 'H-C-1.mat', 'H-A-3.mat', 'H-B-3.mat', 'H-B-1.mat', 'H-C-2.mat', 'O-B-3.mat', 'O-A-2.mat', 'H-A-2.mat', 'I-B-3.mat', 'I-D-2.mat', 'I-D-3.mat', 'H-D-2.mat', 'I-C-1.mat', 'I-B-2.mat', 'O-C-3.mat', 'I-B-1.mat', 'H-A-1.mat', 'I-A-1.mat', 'O-D-1.mat', 'H-C-3.mat', 'O-A-3.mat', 'H-D-1.mat', 'I-A-2.mat', 'O-A-1.mat', 'O-B-2.mat', 'O-D-3.mat', 'I-C-3.mat', 'O-C-2.mat', 'O-C-1.mat', 'I-C-2.mat', 'O-B-1.mat']
According to the official paper, the files use a naming convention of the form C-L-R.mat, where:
- C corresponds to the health class (H for healthy, O for outer-race faults and I for inner-race faults)
- L corresponds to the loading pattern (A for increasing rotational speed, B for decreasing rotational speed, C for increasing then decreasing rotational speed and D for decreasing then increasing rotational speed)
- R corresponds to the repetition number (from 1 to 3)
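The convention above can be captured in a small parsing function. This is only a sketch: `parse_uoo_filename`, `STATES` and `LOADINGS` are hypothetical names introduced here, not part of Damavand.

```python
import os

# Code-to-label maps following the naming convention described above.
STATES = {'H': 'healthy', 'I': 'inner-race fault', 'O': 'outer-race fault'}
LOADINGS = {
    'A': 'increasing speed',
    'B': 'decreasing speed',
    'C': 'increasing then decreasing speed',
    'D': 'decreasing then increasing speed',
}

def parse_uoo_filename(filename):
    """Split a 'C-L-R.mat' file name into its three metadata fields."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    state, loading, rep = stem.split('-')
    if state not in STATES or loading not in LOADINGS:
        raise ValueError(f'Unexpected file name: {filename!r}')
    return {'state': state, 'loading': loading, 'rep': int(rep)}

print(parse_uoo_filename('H-B-2.mat'))
# {'state': 'H', 'loading': 'B', 'rep': 2}
```

Validating the codes up front means a stray file in the folder fails with a readable message instead of producing mislabeled rows.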
Let's examine what is stored in one of the files. A .mat file can be opened with scipy.io.loadmat, as below:
from scipy.io import loadmat
base_dir = 'UoO/'
mat_data = loadmat(base_dir + 'H-B-2.mat')
mat_data.keys()
dict_keys(['__header__', '__version__', '__globals__', 'Channel_1', 'Channel_2'])
According to the authors, Channel_1 holds the accelerometer data and Channel_2 holds the encoder data.
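The keys wrapped in double underscores are bookkeeping entries that `scipy.io.loadmat` adds to every result; only the remaining keys hold recorded signals. A minimal sketch of separating the two, where `signal_keys` is a hypothetical helper and the dictionary is a stand-in with the same keys as the output above:

```python
def signal_keys(mat_dict):
    """Return the keys of a loadmat-style dict that hold recorded variables.

    Entries such as '__header__', '__version__' and '__globals__' start
    and end with double underscores and are skipped.
    """
    return [k for k in mat_dict if not (k.startswith('__') and k.endswith('__'))]

# Stand-in dictionary mirroring the keys shown above (values omitted).
example = {'__header__': b'', '__version__': '1.0', '__globals__': [],
           'Channel_1': None, 'Channel_2': None}
print(signal_keys(example))  # ['Channel_1', 'Channel_2']
```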
Mining a dataset involves opening each file, splitting the signals into windows and attaching the metadata.
The following code snippet does this for Channel_1.
from damavand.damavand.utils import splitter
base_dir = 'UoO/'
win_len, hop_len = 10000, 10000
data = []
for file in os.listdir(base_dir):
    # Split the file name to extract three pieces of metadata: state, loading and repetition
    state, loading, rep = file.split('.')[0].split('-')
    # Open the .mat file as a Python dictionary
    mat_data = loadmat(base_dir + file)
    # Mine the data in Channel_1 with the splitter function and store the result in temp_df
    temp_df = splitter(mat_data['Channel_1'].reshape((-1)), win_len, hop_len)
    # Attach the metadata to temp_df as new columns
    temp_df['state'], temp_df['loading'], temp_df['rep'] = state, loading, rep
    # Append temp_df to the data list
    data.append(temp_df)
The code snippet above mines only Channel_1 and processes every file in the folder. In practice, the user may want more than one channel, or only some of the repetitions. To handle both cases, the snippet can be modified as below:
base_dir = 'UoO/'
channels = ['Channel_1', 'Channel_2']
win_len, hop_len = 10000, 10000
reps = [1]
data = {channel: [] for channel in channels}
for file in os.listdir(base_dir):
    if file.endswith('.mat'):
        state, loading, rep = file.split('.')[0].split('-')
        rep = int(rep)
        if rep in reps:
            mat_data = loadmat(base_dir + file)
            for channel in data.keys():
                temp_df = splitter(mat_data[channel].reshape((-1)), win_len, hop_len)
                temp_df['state'], temp_df['loading'], temp_df['rep'] = state, loading, rep
                data[channel].append(temp_df)
We regard object-oriented programming as essential for developing reusable and easily maintainable code; therefore, in Damavand, every digestor must be implemented as a Python class.
The code snippet above can easily be turned into the following class:
class UoO():
    # Instantiating a digestor object requires the base directory, plus the channels and repetitions the user is interested in.
    # The defaults are tuples so that no mutable default is shared between instances.
    def __init__(self, base_directory, channels = ('Channel_1', 'Channel_2'), reps = (1, 2, 3)):
        self.base_dir = base_directory
        self.channels = list(channels)
        self.reps = list(reps)
        # Once the dataset is mined, the data is available in a Python dictionary whose keys are the channels specified at instantiation.
        self.data = {key: [] for key in self.channels}
    # To mine the dataset, the user passes the window length and hop length as a Python dictionary whose keys are 'win_len' and 'hop_len'.
    def mine(self, mining_params):
        for file in os.listdir(self.base_dir):
            if file.endswith('.mat'):
                state, loading, rep = file.split('.')[0].split('-')
                rep = int(rep)
                if rep in self.reps:
                    # os.path.join works whether or not base_dir ends with a slash
                    mat_data = loadmat(os.path.join(self.base_dir, file))
                    for channel in self.data.keys():
                        temp_df = splitter(mat_data[channel].reshape((-1)), mining_params['win_len'], mining_params['hop_len'])
                        temp_df['state'], temp_df['loading'], temp_df['rep'] = state, loading, rep
                        self.data[channel].append(temp_df)
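Because `mine` reads `mining_params['win_len']` and `mining_params['hop_len']` directly, a missing or invalid entry only surfaces once the loop has already started loading files. One possible safeguard is to check the dictionary first. The helper below, `check_mining_params`, is a hypothetical addition for illustration and is not part of Damavand.

```python
def check_mining_params(mining_params):
    """Validate a {'win_len': int, 'hop_len': int} dictionary.

    Hypothetical helper; raises early instead of failing mid-way
    through a mining loop.
    """
    missing = {'win_len', 'hop_len'} - set(mining_params)
    if missing:
        raise KeyError(f'Missing mining parameter(s): {sorted(missing)}')
    for key in ('win_len', 'hop_len'):
        value = mining_params[key]
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f'{key} must be a positive integer, got {value!r}')
    return mining_params
```

Calling it as the first statement of a `mine` method would be one way to use it.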
The digestor can then be used as follows:
dataset = UoO('UoO/', ['Channel_1', 'Channel_2'], [1])
mining_params = {'win_len': 10000, 'hop_len': 10000}
dataset.mine(mining_params)
# Concatenate all observations under `Channel_1`
import pandas as pd
df = pd.concat(dataset.data['Channel_1']).reset_index(drop = True)
df
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 9993 | 9994 | 9995 | 9996 | 9997 | 9998 | 9999 | state | loading | rep |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.026534 | -0.012723 | 0.008981 | 0.025424 | 0.026410 | 0.016874 | 0.008324 | 0.021806 | 0.042195 | 0.051074 | ... | -0.032453 | -0.040675 | -0.037386 | -0.018642 | 0.000102 | 0.008981 | 0.009968 | I | D | 1 |
1 | 0.014572 | 0.032658 | 0.059624 | 0.085274 | 0.097770 | 0.101716 | 0.104676 | 0.116843 | 0.137232 | 0.149399 | ... | 0.002076 | 0.016545 | 0.022135 | 0.011941 | -0.001871 | -0.006474 | -0.002199 | I | D | 1 |
2 | 0.003391 | 0.003720 | -0.002199 | -0.004501 | 0.001418 | 0.013585 | 0.019176 | 0.013585 | 0.004377 | -0.001213 | ... | 0.010297 | -0.011078 | -0.033111 | -0.050211 | -0.061392 | -0.067311 | -0.069942 | I | D | 1 |
3 | -0.068627 | -0.060077 | -0.039688 | -0.010749 | 0.017860 | 0.036933 | 0.040880 | 0.045812 | 0.049101 | 0.049430 | ... | -0.007132 | 0.007995 | 0.011283 | 0.000431 | -0.014367 | -0.016011 | -0.000884 | I | D | 1 |
4 | 0.013914 | 0.015558 | 0.006022 | -0.002857 | 0.005364 | 0.019176 | 0.022793 | 0.011941 | -0.004173 | -0.009434 | ... | 0.010297 | 0.021478 | 0.029699 | 0.034303 | 0.035618 | 0.033645 | 0.036605 | I | D | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2395 | -0.002199 | -0.001213 | 0.000431 | 0.001747 | 0.003062 | 0.003391 | 0.003391 | 0.003720 | 0.004049 | 0.004377 | ... | 0.003391 | 0.003062 | 0.002404 | 0.001747 | 0.001089 | 0.001089 | 0.002404 | O | B | 1 |
2396 | 0.002404 | 0.000760 | 0.000102 | -0.000884 | -0.001213 | -0.011736 | -0.003186 | -0.001542 | -0.000226 | 0.000431 | ... | -0.002857 | -0.001542 | -0.000226 | 0.001418 | 0.003062 | 0.004377 | 0.006022 | O | B | 1 |
2397 | 0.006351 | 0.008324 | 0.005693 | 0.009310 | 0.007995 | 0.009310 | 0.008653 | 0.007666 | 0.007995 | 0.000102 | ... | -0.001542 | -0.002528 | -0.001871 | -0.001213 | -0.001213 | -0.001213 | -0.000884 | O | B | 1 |
2398 | -0.001542 | -0.002199 | -0.000555 | -0.000226 | 0.000431 | 0.001747 | 0.002076 | 0.002733 | 0.004049 | -0.016669 | ... | 0.002076 | 0.002404 | 0.005693 | 0.006351 | 0.006679 | 0.006679 | 0.007995 | O | B | 1 |
2399 | 0.008981 | 0.009968 | 0.008981 | 0.007666 | 0.006679 | 0.006022 | 0.006679 | 0.007995 | 0.007666 | 0.007995 | ... | -0.000884 | -0.000555 | -0.000884 | -0.001542 | -0.002528 | -0.001213 | -0.001871 | O | B | 1 |
2400 rows × 10003 columns
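The reported shape can be sanity-checked with a little arithmetic. The sketch below is a plain-Python illustration of fixed-length windowing, not Damavand's `splitter`, which may treat the final partial window differently. The names `split_windows` and `window_count` are hypothetical, and the figure of 2,000,000 samples per recording is an assumption, chosen because it is consistent with the 2400 rows shown above.

```python
def split_windows(signal, win_len, hop_len):
    """Cut a 1-D sequence into fixed-length windows.

    Window i covers signal[i * hop_len : i * hop_len + win_len]; a trailing
    remainder shorter than win_len is dropped.
    """
    last_start = len(signal) - win_len
    return [signal[s:s + win_len] for s in range(0, last_start + 1, hop_len)]

def window_count(n_samples, win_len, hop_len):
    """Number of windows split_windows would return for n_samples points."""
    if n_samples < win_len:
        return 0
    return (n_samples - win_len) // hop_len + 1

# Tiny example: 10 samples, windows of 4, hop of 3 -> starts at 0, 3 and 6.
print(split_windows(list(range(10)), 4, 3))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]

# Shape check for reps=[1]: 3 health states x 4 loading patterns = 12 files.
# ASSUMPTION: each recording holds 2,000,000 samples.
n_files = 3 * 4
rows = n_files * window_count(2_000_000, 10_000, 10_000)
cols = 10_000 + 3  # window samples plus the state, loading and rep columns
print(rows, cols)  # 2400 10003
```

With `hop_len` equal to `win_len` the windows do not overlap; a smaller `hop_len` would produce overlapping windows and therefore more rows.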