paper:

http://zhouxiuze.com/pub/Transformer.pdf

code:

https://github.com/XiuzeZhou/RUL

reference:

D. Chen, W. Hong, and X. Zhou, "Transformer Network for Remaining Useful Life Prediction of Lithium-Ion Batteries," IEEE Access, vol. 10, pp. 19621-19628, 2022.

1. Introduction

Recently, Transformer-based models have become very popular in computer vision (CV) and natural language processing (NLP), but work applying them to Li-ion batteries is still scarce. Therefore, I wrote an article on using a Transformer for Remaining Useful Life (RUL) prediction of Li-ion batteries and released all the PyTorch code, hoping it will help others.

2. Problem Definition

As the number of charge-discharge cycles increases, the performance of Li-ion batteries generally degrades. Battery performance can be measured by capacity, so the State of Health (SOH), a health indicator of battery aging, can be defined by the following capacity ratio:

\(
{\textit{SOH}}(t)=\frac{C_t}{C_0}\times 100\%,
\)

where \(C_0\) denotes the rated capacity, and \(C_t\) denotes the measured capacity at cycle \(t\). As the number of charge/discharge cycles increases, the capacity degrades. The End of Life (EOL) of a battery, which is closely related to its capacity, is defined as the point when the remaining capacity reaches 70-80% of the initial capacity. Fig. 2 illustrates an example of RUL prediction.
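As a simple illustration of these definitions (my own example, not from the released code), the following sketch computes SOH from a capacity sequence and finds the EOL cycle under a 70% threshold:

import numpy as np

# Hypothetical capacity measurements (Ah) over cycles; C0 is the rated capacity.
C0 = 2.0
capacity = np.array([2.00, 1.95, 1.88, 1.75, 1.52, 1.38])

soh = capacity / C0 * 100              # SOH(t) = C_t / C_0 * 100%
below = np.where(soh <= 70)[0]         # cycles at or below the 70% EOL threshold
eol = below[0] if len(below) else None
print(soh, eol)                        # EOL is reached at cycle index 5 here (SOH = 69%)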

3. Proposed Model

Our proposed model consists of four parts: input and normalization, denoising, Transformer, and prediction. The architecture is shown in Fig. 3.

Input and Normalization. To reduce the influence of changes in the input data distribution on neural networks, the data must be normalized. Let \(\mathbf{x}=\left\{x_{1},x_{2},\dots,x_{n} \right\}\) denote the input capacity sequence of length \(n\), which is mapped into (0, 1]:

\(
\mathbf{x}'=\frac{\mathbf{x}}{C_0},
\)
where \(C_0\) denotes the rated capacity.

Denoising. Raw input always contains noise, especially when capacity regeneration occurs during charge/discharge. To maintain stability and robustness, the input data must be denoised before being fed into the deep neural network. The Denoising Auto-Encoder (DAE), an unsupervised method for learning useful features, is adopted by our method: it reconstructs the input data from a lower-dimensional representation while preserving as much information as possible.

Gaussian noise is added to the normalized input, \(\mathbf{x}'_t\), to obtain the corrupted vector, \(\widetilde{\mathbf{x}}_t\). The DAE is defined as follows:
\(
\mathbf{z}=a\left({W}^T\widetilde{\mathbf{x}}_t+{b} \right),
\)

\(
\widehat{\mathbf{x}}_t=f'\left({W}'\mathbf{z} + {b}'\right).
\)

The loss function of the DAE is defined as:
\(
\mathcal L_d=\frac{1}{n}\sum_{t=1}^{n}
\ell(\widetilde{\mathbf{x}}_t-\widehat{\mathbf{x}}_t) +
\lambda \left ( \left \| {W} \right \|^2_{F} + \left \| {W'} \right \|^2_{F} \right)
\)
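For concreteness, here is a minimal sketch of a DAE reconstruction loss in the spirit of \(\mathcal L_d\), using the Autoencoder class shown later in Section 4.2 (my own illustration; following the released training code, the reconstruction is compared against the clean input, and the \(\lambda\) regularization is usually handled via the optimizer's weight_decay rather than computed explicitly):

import torch
import torch.nn as nn

# assumes the Autoencoder class from Section 4.2 has been defined
ae = Autoencoder(input_size=16, hidden_dim=8, noise_level=0.01)
criterion = nn.MSELoss()

x = torch.rand(32, 16)              # 32 normalized capacity windows of length 16
_, decode = ae(x)                   # forward() corrupts x internally, then reconstructs it
recon_loss = criterion(decode, x)   # reconstruction term of L_d

lam = 1e-5                          # hypothetical lambda
reg = (ae.fc1.weight ** 2).sum() + (ae.fc2.weight ** 2).sum()  # Frobenius-norm penalty
loss_d = recon_loss + lam * reg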

Transformer. The Transformer part is a stack of Transformer encoder layers that extract degradation features from the reconstructed data. Each encoder layer has two sub-layers: Multi-Head Self-Attention and a position-wise Feed-Forward network.
\(
\textit{MultiHead}\left({H}^{l-1}\right) = [head_1;head_2;\cdots;head_h]{W}^{O};
\)

\(
head_i = \textit{Attention}\left({H}^{l-1}{W}^{l}_{Q},{H}^{l-1}{W}^{l}_{K},{H}^{l-1}{W}^{l}_{V} \right);
\)

\(
\textit{Attention}\left({Q},{K},{V} \right) = \textit{softmax} \left(\frac{{Q}{K}^T}{\sqrt{d_h}} \right){V}.
\)
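As a minimal sketch of the attention formula above (an illustration only; the released code simply uses nn.TransformerEncoder, as shown in Section 4.3):

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_h)) V
    d_h = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_h)
    return F.softmax(scores, dim=-1) @ V

# hypothetical shapes: (batch, seq_len, d_h)
Q = K = V = torch.rand(2, 16, 8)
out = scaled_dot_product_attention(Q, K, V)   # (2, 16, 8)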

Prediction. Finally, to predict the unknown capacity, a fully connected layer maps the representation learned by the last Transformer layer to the final prediction \(\widehat{x}_t\):

\(
\widehat{x}_t=f\left({W}_p{H}^h + {b}_p\right)
\)

Loss function. The learning procedure optimizes both tasks (denoising and prediction) simultaneously in a unified framework. Mean Squared Error (MSE) is used as the loss, and the objective function is defined as follows:

\(
\mathcal L=\sum_{t=T+1}^{n}\left(x_t-\widehat{x}_{t} \right)^2+\alpha\sum_{i=1}^{n}
\ell(\widetilde{\mathbf{x}}_i-\widehat{\mathbf{x}}_i)+\lambda\Omega \left(\Theta \right)
\)
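This objective maps directly onto the training code in Section 4.4: the first term is the prediction MSE, the second is the DAE reconstruction error weighted by \(\alpha\), and \(\lambda\Omega \left(\Theta \right)\) is realized through the optimizer's weight_decay. A condensed sketch of that correspondence (criterion, model, X, y, alpha, and feature_size as defined in Section 4.4):

output, decode = model(X)                  # prediction and DAE reconstruction
loss = criterion(output.reshape(-1, 1), y) \
       + alpha * criterion(decode, X.reshape(-1, feature_size))
# the lambda * Omega(Theta) term is handled by Adam's weight_decay argument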

4. Experiments

4.1 Data sets

We conducted experiments on two public data sets: NASA and CALCE. The NASA data set is available from the NASA Ames Research Center website, and the CALCE data set is available from the Center for Advanced Life Cycle Engineering (CALCE) at the University of Maryland.

4.2 Denoising Auto-Encoder

This module is mainly used for noise reduction. Data from sensors often contain a lot of noise, which degrades network training, so it is better to remove the noise before training.

import torch
import torch.nn as nn
import torch.nn.functional as F


class Autoencoder(nn.Module):
    def __init__(self, input_size=16, hidden_dim=8, noise_level=0.01):
        super(Autoencoder, self).__init__()
        self.input_size, self.hidden_dim, self.noise_level = input_size, hidden_dim, noise_level
        self.fc1 = nn.Linear(self.input_size, self.hidden_dim)
        self.fc2 = nn.Linear(self.hidden_dim, self.input_size)

    def encoder(self, x):
        x = self.fc1(x)
        h1 = F.relu(x)
        return h1

    def mask(self, x):
        # corrupt the input with Gaussian noise scaled by noise_level
        corrupted_x = x + self.noise_level * torch.randn_like(x)
        return corrupted_x

    def decoder(self, x):
        h2 = self.fc2(x)
        return h2

    def forward(self, x):
        out = self.mask(x)
        encode = self.encoder(out)
        decode = self.decoder(encode)
        return encode, decode
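A quick shape check (my own usage example, not part of the original post):

ae = Autoencoder(input_size=16, hidden_dim=8, noise_level=0.01)
x = torch.rand(4, 16)               # 4 normalized capacity windows of length 16
encode, decode = ae(x)
print(encode.shape, decode.shape)   # torch.Size([4, 8]) torch.Size([4, 16])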

4.3 Main Network

The main network combines the two components described above: the denoising auto-encoder and the Transformer encoder.

class Net(nn.Module):
    def __init__(self, feature_size=16, hidden_dim=32, num_layers=1, nhead=8,
                 dropout=0.0, noise_level=0.01, is_autoencoder=False):
        super(Net, self).__init__()
        self.is_autoencoder = is_autoencoder
        self.auto_hidden = int(feature_size/2)
        input_size = self.auto_hidden if is_autoencoder else feature_size
        self.pos = PositionalEncoding(d_model=input_size, max_len=input_size)
        encoder_layers = nn.TransformerEncoderLayer(d_model=input_size, nhead=nhead,
                                                    dim_feedforward=hidden_dim, dropout=dropout)
        self.cell = nn.TransformerEncoder(encoder_layers, num_layers=num_layers)
        self.linear = nn.Linear(input_size, 1)

        if self.is_autoencoder:
            self.autoencoder = Autoencoder(input_size=feature_size,
                                           hidden_dim=self.auto_hidden, noise_level=noise_level)

    def forward(self, x):
        batch_size, feature_num, feature_size = x.shape
        if self.is_autoencoder:
            # flatten each window to (batch_size, seq_len) before denoising
            encode, decode = self.autoencoder(x.reshape(batch_size, -1))
            out = encode.reshape(batch_size, -1, self.auto_hidden)
        else:
            decode = x.reshape(batch_size, -1)
            out = x
        out = self.pos(out)
        out = out.reshape(1, batch_size, -1)  # (1, batch_size, input_size)
        out = self.cell(out)
        out = out.reshape(batch_size, -1)     # (batch_size, input_size)
        out = self.linear(out)                # out shape: (batch_size, 1)

        return out, decode
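Net relies on a PositionalEncoding module defined elsewhere in the repository. For completeness, here is a minimal sinusoidal positional-encoding sketch in the style of the original Transformer paper; the repository's implementation may differ in detail:

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # add positional encodings along the sequence dimension of x: (batch, seq_len, d_model)
        return x + self.pe[:x.size(-2), :]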

4.4 Training

def train(lr=0.01, feature_size=8, hidden_dim=32, num_layers=1, nhead=8, 
          weight_decay=0.0, EPOCH=1000, seed=0, is_autoencoder=True, alpha=0.0, 
          noise_level=0.0, dropout=0.0, metric='re', is_load_weights=True):
    score_list, result_list = [], []
    setup_seed(seed)
    for i in range(4):
        name = Battery_list[i]
        window_size = feature_size
        train_x, train_y, train_data, test_data = get_train_test(Battery, name, 
                                                                 window_size)
        # print('sample size: {}'.format(train_size))

        model = Net(feature_size=feature_size, hidden_dim=hidden_dim, 
                    num_layers=num_layers, nhead=nhead, dropout=dropout,
                    is_autoencoder=is_autoencoder, noise_level=noise_level)
        model = model.to(device)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr, 
                                     weight_decay=weight_decay)
        criterion = nn.MSELoss()

        '''
        # save random initial weights for reproducibility
        if torch.__version__.split('+')[0] >= '1.6.0':
            torch.save(model.state_dict(), 'model_NASA'+str(seed)+'.pth')
        else:
            torch.save(model.state_dict(), 
            'model_CALCE.pth', _use_new_zipfile_serialization=False)        
        '''
        # load the random initial weights generated on my device
        if is_load_weights: 
            if torch.__version__.split('+')[0] >= '1.6.0':
                model.load_state_dict(torch.load('model_NASA.pth')) 
            else:
                model.load_state_dict(torch.load('model_NASA_1.5.0.pth'))

        test_x = train_data.copy()
        loss_list, y_ = [0], []
        rmse, re = 1, 1
        score_, score = [1],[1]
        for epoch in range(EPOCH):
            X = np.reshape(train_x / Rated_Capacity,
                           (-1, 1, feature_size)).astype(np.float32)
            # shape: (batch_size, seq_len, input_size)
            y = np.reshape(train_y[:, -1] / Rated_Capacity,
                           (-1, 1)).astype(np.float32)
            # shape: (batch_size, 1)

            X, y = torch.from_numpy(X).to(device), torch.from_numpy(y).to(device)
            output, decode = model(X)
            output = output.reshape(-1, 1)
            loss = criterion(output, y) + alpha * criterion(
                decode, X.reshape(-1, feature_size))
            optimizer.zero_grad()     # clear gradients for this training step
            loss.backward()           # backpropagation, compute gradients
            optimizer.step()          # apply gradients

            if (epoch + 1)%10 == 0:
                test_x = train_data.copy()
                point_list = []
                while (len(test_x) - len(train_data)) < len(test_data):
                    x = np.reshape(np.array(test_x[-feature_size:])/Rated_Capacity,
                                   (-1, 1, feature_size)).astype(np.float32)
                    x = torch.from_numpy(x).to(device) 
                    # (batch_size,feature_size=1,input_size)
                    pred, _ = model(x)      # pred shape (batch_size=1, feature_size=1)
                    next_point = pred.data.cpu().numpy()[0,0] * Rated_Capacity
                    test_x.append(next_point)     
                    point_list.append(next_point) 
                    # Saves the predicted value of the last point in the output sequence
                y_.append(point_list)       # Save all the predicted values
                loss_list.append(loss.item())  # store a plain float, not the graph-attached tensor
                rmse = evaluation(y_test=test_data, y_predict=y_[-1])
                re = relative_error(
                    y_test=test_data, y_predict=y_[-1], threshold=Rated_Capacity*0.7)
            if metric == 're':
                score = [re]
            elif metric == 'rmse':
                score = [rmse]
            else:
                score = [re, rmse]
            if (loss < 1e-3) and (score_[0] < score[0]):
                break
            score_ = score.copy()

        score_list.append(score_)
        result_list.append(y_[-1])
    return score_list, result_list

4.5 Setting and Running

The following parameters are the best ones obtained by grid search on my computer. If your results differ from mine, you can load the random initial weights generated on my device (is_load_weights=True). Of course, it is better to run the grid search yourself to obtain the optimal parameters.

Rated_Capacity = 2.0
window_size = 16
feature_size = window_size
is_autoencoder = True
dropout = 0.0
EPOCH = 2000
nhead = 8
hidden_dim = 16
num_layers = 1
lr = 0.01    # learning rate
weight_decay = 0.0
noise_level = 0.0
alpha = 1e-5
is_load_weights = True
metric = 're'
seed = 0

SCORE = []
print('seed:{}'.format(seed))
score_list, _ = train(lr=lr, feature_size=feature_size, hidden_dim=hidden_dim,
                      num_layers=num_layers, nhead=nhead, weight_decay=weight_decay,
                      EPOCH=EPOCH, seed=seed, dropout=dropout, 
                      is_autoencoder=is_autoencoder, alpha=alpha, 
                      noise_level=noise_level, metric=metric,
                      is_load_weights=is_load_weights)
print(np.array(score_list))
for s in score_list:
    SCORE.append(s)
print('------------------------------------------------------------------')
print(metric + ' mean: {:<6.4f}'.format(np.mean(np.array(SCORE))))

4.6 Grid Search Method

The grid search retrains the model over a range of hyper-parameters to find the optimal combination.

Rated_Capacity = 2.0
window_size = 16
feature_size = window_size
is_autoencoder = True
dropout = 0.0
EPOCH = 2000
nhead = 8
is_load_weights = False

weight_decay = 0.0
noise_level = 0.0
alpha = 0.0
metric = 're'

states = {}
for lr in [1e-3, 1e-2]:
    for num_layers in [1, 2]:
        for hidden_dim in [16, 32]:
            for alpha in [1e-5, 1e-4]:
                show_str = 'lr={}, num_layers={}, hidden_dim={}, alpha={}'.format(
                    lr, num_layers, hidden_dim, alpha)
                print(show_str)
                SCORE = []
                for seed in range(5):
                    print('seed:{}'.format(seed))
                    score_list, _ = train(lr=lr, feature_size=feature_size, 
                                      hidden_dim=hidden_dim, num_layers=num_layers, 
                                      nhead=nhead, weight_decay=weight_decay, 
                                      EPOCH=EPOCH, seed=seed, dropout=dropout, 
                                      is_autoencoder=is_autoencoder, alpha=alpha, 
                                      noise_level=noise_level, metric=metric, 
                                      is_load_weights=is_load_weights)
                    print(np.array(score_list))
                    print(metric + ': {:<6.4f}'.format(np.mean(np.array(score_list))))
                    print('----------------------------------------------------------------')
                    for s in score_list:
                        SCORE.append(s)

                print(metric + ' mean: {:<6.4f}'.format(np.mean(np.array(SCORE))))
                states[show_str] = np.mean(np.array(SCORE))
                print('===================================================================')

min_key = min(states, key = states.get)
print('optimal parameters: {}, result: {}'.format(min_key, states[min_key]))

5. More