6. PyTorch-Based Transformer Network for Remaining Useful Life Prediction of Lithium-Ion Batteries
paper:
http://zhouxiuze.com/pub/Transformer.pdf
code:
https://github.com/XiuzeZhou/RUL
reference:
D. Chen, W. Hong, and X. Zhou, "Transformer Network for Remaining Useful Life Prediction of Lithium-Ion Batteries", IEEE Access, 2022, vol. 10, pp. 19621-19628.
1. Introduction
Recently, Transformer-based models have become very popular in CV and NLP, but work applying them to Li-ion batteries is still scarce. Therefore, I wrote an article on using a Transformer for Remaining Useful Life (RUL) prediction of Li-ion batteries and released all the PyTorch code, hoping it will help others.
2. Problem Definition
As the number of charge-discharge cycles increases, the performance of Li-ion batteries generally degrades. Battery performance can be measured by capacity, so the State of Health (SOH), a health indicator of battery aging, can be defined by the following capacity ratio:
\(
\textit{SOH}(t)=\frac{C_t}{C_0}\times 100\%,
\)
where \(C_0\) denotes the rated capacity, and \(C_t\) denotes the measured capacity at cycle \(t\). As the number of charge/discharge cycles increases, the capacity degrades. For a battery, End of Life (EOL), which is closely related to its capacity, is defined as the point when the remaining capacity reaches 70-80% of the initial capacity. Fig. 2 illustrates an example of RUL prediction.
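For example, a cell rated at \(C_0=2.0\) Ah that measures \(C_t=1.6\) Ah at cycle \(t\) has \(\textit{SOH}(t)=1.6/2.0\times 100\%=80\%\); with an EOL threshold of 70-80% of the initial capacity, that cell is at or near the end of its life.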
3. Proposed Model
Our proposed model consists of four parts: input and normalization, denoising, Transformer, and prediction. The architecture is shown in Fig. 3.
Input and Normalization. To reduce the influence of changes in the input data distribution on the neural network, the data must be normalized. Let \(\mathbf{x}=\left\{x_{1},x_{2},\dots,x_{n} \right\}\) denote the input capacity sequence of length \(n\), which is mapped to (0,1]:
\(
\mathbf{x}'=\frac{\mathbf{x}}{C_0},
\)
where \(C_0\) denotes the rated capacity.
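As a concrete sketch (assuming NumPy and a rated capacity of 2.0 Ah, the value used for the NASA cells later; the capacity values here are made up):

import numpy as np

C0 = 2.0                                  # rated capacity (Ah)
x = np.array([1.86, 1.84, 1.81, 1.78])    # made-up measured capacities over cycles
x_norm = x / C0                           # normalized to (0, 1]
print(x_norm)                             # [0.93  0.92  0.905 0.89 ]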
Denoising. Raw input is always full of noise, especially when capacity regeneration occurs during charging/discharging. To maintain stability and robustness, the input data must be denoised before being fed into deep neural networks. Our method adopts a Denoising Auto-Encoder (DAE), an unsupervised method for learning useful features, which reconstructs the input data from a lower-dimensional representation while preserving as much information as possible.
Gaussian noise is added to the normalized input, \(\mathbf{x}'_t\), to obtain the corrupted vector, \(\widetilde{\mathbf{x}}_t\). The DAE is defined as follows:
\(
\mathbf{z}=a\left({W}^{T}\widetilde{\mathbf{x}}_t+{b}\right),
\)
\(
\widehat{\mathbf{x}}_t=f'\left({W}'\mathbf{z}+{b}'\right),
\)
The loss function of the DAE is:
\(
\mathcal L_d=\frac{1}{n}\sum_{t=1}^{n}\ell\left(\widetilde{\mathbf{x}}_t-\widehat{\mathbf{x}}_t\right)+\lambda\left(\left\|{W}\right\|^2_{F}+\left\|{W}'\right\|^2_{F}\right),
\)
where \(\ell\) is the reconstruction error and \(\lambda\) weights the Frobenius-norm regularization of the encoder and decoder weights.
Transformer. The Transformer layers are a stack of Transformer encoders that extract degradation features from the reconstructed data; each encoder consists of two sub-layers: Multi-Head Self-Attention and a Feed-Forward network.
\(
\textit{MultiHead}\left({H}^{l-1}\right)=\left[head_1;head_2;\cdots;head_h\right]{W}^{O};
\)
\(
head_i=\textit{Attention}\left({H}^{l-1}{W}^{l}_{Q},\,{H}^{l-1}{W}^{l}_{K},\,{H}^{l-1}{W}^{l}_{V}\right);
\)
\(
\textit{Attention}\left({Q},{K},{V}\right)=\textit{softmax}\left(\frac{{Q}{K}^{T}}{\sqrt{d_h}}\right){V},
\)
where \({H}^{l-1}\) is the output of the previous layer, \({W}^{l}_{Q}\), \({W}^{l}_{K}\), and \({W}^{l}_{V}\) are the projection matrices of layer \(l\), and \(d_h\) is the dimension of each head.
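To make the attention formula concrete, here is a minimal scaled dot-product attention sketch in PyTorch (my own illustration; the code in Section 4.3 instead relies on the built-in nn.TransformerEncoder, which implements multi-head attention internally):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_h)
    d_h = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_h ** 0.5   # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)             # each row sums to 1 over the keys
    return weights @ V                              # weighted sum of the values

Q = K = V = torch.randn(1, 16, 8)                   # e.g., a window of 16 cycles, d_h = 8
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                                    # torch.Size([1, 16, 8])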
Prediction. Finally, to predict the unknown capacity, a fully connected layer is used to map the representation learned by the last Transformer cell, \({H}^{h}\), to the final prediction \(\widehat{x}_t\):
\(
\widehat{x}_t=f\left({W}_p{H}^{h}+{b}_p\right).
\)
Loss function. The learning procedure optimizes both tasks (prediction and denoising) simultaneously in a unified framework. Mean Squared Error (MSE) is used to evaluate the loss, and the objective function is defined as follows:
\(
\mathcal L=\sum_{t=T+1}^{n}\left(x_t-\widehat{x}_{t}\right)^2+\alpha\sum_{i=1}^{n}\ell\left(\widetilde{\mathbf{x}}_i-\widehat{\mathbf{x}}_i\right)+\lambda\Omega\left(\Theta\right),
\)
where the first term is the prediction error over the predicted cycles, the second term is the DAE reconstruction error weighted by \(\alpha\), and \(\Omega(\Theta)\) is the regularization term weighted by \(\lambda\). In the training code of Section 4.4, the first two terms appear as criterion(output, y) + alpha * criterion(decode, ...), and the third is handled by the optimizer's weight_decay.
4. Experiments
4.1 Data sets
We conducted experiments on two public data sets: NASA and CALCE. The NASA data set is available from the NASA Ames Research Center website, and the CALCE data set is available from the Center for Advanced Life Cycle Engineering (CALCE) at the University of Maryland.
4.2 Denoising Autoencoder
This module is mainly used for noise reduction: data from sensors often contain a lot of noise, which degrades network training, so it is better to remove the noise before training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    def __init__(self, input_size=16, hidden_dim=8, noise_level=0.01):
        super(Autoencoder, self).__init__()
        self.input_size, self.hidden_dim, self.noise_level = input_size, hidden_dim, noise_level
        self.fc1 = nn.Linear(self.input_size, self.hidden_dim)   # encoder layer
        self.fc2 = nn.Linear(self.hidden_dim, self.input_size)   # decoder layer

    def encoder(self, x):
        x = self.fc1(x)
        h1 = F.relu(x)
        return h1

    def mask(self, x):
        # corrupt the input with Gaussian noise
        corrupted_x = x + self.noise_level * torch.randn_like(x)
        return corrupted_x

    def decoder(self, x):
        h2 = self.fc2(x)
        return h2

    def forward(self, x):
        out = self.mask(x)             # add noise
        encode = self.encoder(out)     # lower-dimensional representation
        decode = self.decoder(encode)  # reconstruction of the input
        return encode, decode
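A quick shape check (a minimal usage sketch; the input here is random data with the default dimensions):

x = torch.randn(4, 16)                       # a batch of 4 windows, 16 capacity values each
dae = Autoencoder(input_size=16, hidden_dim=8, noise_level=0.01)
encode, decode = dae(x)
print(encode.shape, decode.shape)            # torch.Size([4, 8]) torch.Size([4, 16])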
4.3 Main Network
The main network combines the two components: the denoising autoencoder and the Transformer encoder.
class Net(nn.Module):
    def __init__(self, feature_size=16, hidden_dim=32, num_layers=1, nhead=8,
                 dropout=0.0, noise_level=0.01, is_autoencoder=False):
        super(Net, self).__init__()
        self.is_autoencoder = is_autoencoder
        self.auto_hidden = int(feature_size/2)
        input_size = self.auto_hidden if is_autoencoder else feature_size
        self.pos = PositionalEncoding(d_model=input_size, max_len=input_size)
        encoder_layers = nn.TransformerEncoderLayer(d_model=input_size, nhead=nhead,
                                                    dim_feedforward=hidden_dim, dropout=dropout)
        self.cell = nn.TransformerEncoder(encoder_layers, num_layers=num_layers)
        self.linear = nn.Linear(input_size, 1)
        if self.is_autoencoder:
            self.autoencoder = Autoencoder(input_size=feature_size,
                                           hidden_dim=self.auto_hidden, noise_level=noise_level)

    def forward(self, x):
        batch_size, feature_num, feature_size = x.shape
        if self.is_autoencoder:
            # denoise and compress the window: (batch_size, seq_len) -> (batch_size, auto_hidden)
            encode, decode = self.autoencoder(x.reshape(batch_size, -1))
            out = encode.reshape(batch_size, -1, self.auto_hidden)
        else:
            decode = x.reshape(batch_size, -1)
            out = x
        out = self.pos(out)                   # add positional encoding
        out = out.reshape(1, batch_size, -1)  # (seq_len=1, batch_size, d_model)
        out = self.cell(out)                  # Transformer encoder stack
        out = out.reshape(batch_size, -1)     # (batch_size, d_model)
        out = self.linear(out)                # (batch_size, 1)
        return out, decode
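Net relies on a PositionalEncoding module that is defined elsewhere in the repository. For a self-contained run, here is a standard sinusoidal sketch (my own version, which may differ in detail from the repository's) followed by a quick shape check:

import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)                       # (max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model); add a positional code to every step
        return x + self.pe[:x.size(1), :]

net = Net(feature_size=16, hidden_dim=32, num_layers=1, nhead=8, is_autoencoder=True)
out, decode = net(torch.randn(4, 1, 16))     # input: (batch_size, feature_num, feature_size)
print(out.shape, decode.shape)               # torch.Size([4, 1]) torch.Size([4, 16])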
4.4 Training
def train(lr=0.01, feature_size=8, hidden_dim=32, num_layers=1, nhead=8,
          weight_decay=0.0, EPOCH=1000, seed=0, is_autoencoder=True, alpha=0.0,
          noise_level=0.0, dropout=0.0, metric='re', is_load_weights=True):
    score_list, result_list = [], []
    setup_seed(seed)
    for i in range(4):
        name = Battery_list[i]
        window_size = feature_size
        train_x, train_y, train_data, test_data = get_train_test(Battery, name,
                                                                 window_size)
        # print('sample size: {}'.format(train_size))
        model = Net(feature_size=feature_size, hidden_dim=hidden_dim,
                    num_layers=num_layers, nhead=nhead, dropout=dropout,
                    is_autoencoder=is_autoencoder, noise_level=noise_level)
        model = model.to(device)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr,
                                     weight_decay=weight_decay)
        criterion = nn.MSELoss()

        '''
        # save the random initial weights for reproducibility
        if torch.__version__.split('+')[0] >= '1.6.0':
            torch.save(model.state_dict(), 'model_NASA'+str(seed)+'.pth')
        else:
            torch.save(model.state_dict(),
                       'model_CALCE.pth', _use_new_zipfile_serialization=False)
        '''
        # load the random initial weights generated on my device
        if is_load_weights:
            if torch.__version__.split('+')[0] >= '1.6.0':
                model.load_state_dict(torch.load('model_NASA.pth'))
            else:
                model.load_state_dict(torch.load('model_NASA_1.5.0.pth'))

        test_x = train_data.copy()
        loss_list, y_ = [0], []
        rmse, re = 1, 1
        score_, score = [1], [1]
        for epoch in range(EPOCH):
            # X shape: (batch_size, seq_len=1, input_size)
            X = np.reshape(train_x/Rated_Capacity,
                           (-1, 1, feature_size)).astype(np.float32)
            # y shape: (batch_size, 1)
            y = np.reshape(train_y[:, -1]/Rated_Capacity, (-1, 1)).astype(np.float32)
            X, y = torch.from_numpy(X).to(device), torch.from_numpy(y).to(device)
            output, decode = model(X)
            output = output.reshape(-1, 1)
            # prediction loss plus the DAE reconstruction loss weighted by alpha
            loss = criterion(output, y) + alpha * criterion(
                decode, X.reshape(-1, feature_size))
            optimizer.zero_grad()   # clear gradients for this training step
            loss.backward()         # backpropagation, compute gradients
            optimizer.step()        # apply gradients

            if (epoch + 1) % 10 == 0:
                # re-predict the whole test sequence every 10 epochs
                test_x = train_data.copy()
                point_list = []
                while (len(test_x) - len(train_data)) < len(test_data):
                    x = np.reshape(np.array(test_x[-feature_size:])/Rated_Capacity,
                                   (-1, 1, feature_size)).astype(np.float32)
                    x = torch.from_numpy(x).to(device)  # (batch_size=1, seq_len=1, input_size)
                    pred, _ = model(x)                  # pred shape: (batch_size=1, 1)
                    next_point = pred.data.cpu().numpy()[0, 0] * Rated_Capacity
                    test_x.append(next_point)      # feed the prediction back in to predict the next point
                    point_list.append(next_point)  # save the predicted value of the last point
                y_.append(point_list)              # save all the predicted values of this pass
                loss_list.append(loss)
                rmse = evaluation(y_test=test_data, y_predict=y_[-1])
                re = relative_error(
                    y_test=test_data, y_predict=y_[-1], threshold=Rated_Capacity*0.7)
            if metric == 're':
                score = [re]
            elif metric == 'rmse':
                score = [rmse]
            else:
                score = [re, rmse]
            if (loss < 1e-3) and (score_[0] < score[0]):
                break
            score_ = score.copy()
        score_list.append(score_)
        result_list.append(y_[-1])
    return score_list, result_list
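The helpers setup_seed, get_train_test, evaluation, and relative_error above come from the repository notebooks. For orientation, here is a simplified sliding-window sketch of the kind of data get_train_test produces (my own illustration with a made-up capacity curve; the repository version additionally splits each battery's record into training and test portions and returns window-shaped targets, which is why train() indexes train_y[:, -1]):

def build_windows(capacity, window_size):
    # hypothetical helper: slide a window of length window_size over the capacity
    # sequence; the value right after each window is its prediction target
    x, y = [], []
    for i in range(len(capacity) - window_size):
        x.append(capacity[i:i + window_size])
        y.append(capacity[i + window_size])
    return np.array(x), np.array(y)

caps = [2.0 - 0.003 * t for t in range(100)]      # made-up, smoothly degrading capacities
train_x, train_y = build_windows(caps, window_size=16)
print(train_x.shape, train_y.shape)               # (84, 16) (84,)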
4.5 Setting and Running
The following parameters are the best ones found by grid search on my computer. If your results differ from mine, you can load the initial random weights generated on my device. Of course, it is better to run the grid search yourself to obtain the optimal parameters.
Rated_Capacity = 2.0
window_size = 16
feature_size = window_size
is_autoencoder = True
dropout = 0.0
EPOCH = 2000
nhead = 8
hidden_dim = 16
num_layers = 1
lr = 0.01 # learning rate
weight_decay = 0.0
noise_level = 0.0
alpha = 1e-5
is_load_weights = True
metric = 're'
seed = 0
SCORE = []
print('seed:{}'.format(seed))
score_list, _ = train(lr=lr, feature_size=feature_size, hidden_dim=hidden_dim,
num_layers=num_layers, nhead=nhead, weight_decay=weight_decay,
EPOCH=EPOCH, seed=seed, dropout=dropout,
is_autoencoder=is_autoencoder, alpha=alpha,
noise_level=noise_level, metric=metric,
is_load_weights=is_load_weights)
print(np.array(score_list))
for s in score_list:
    SCORE.append(s)
print('------------------------------------------------------------------')
print(metric + ' mean: {:<6.4f}'.format(np.mean(np.array(SCORE))))
4.6 Grid Search Method
The grid search retrains the model for each combination of hyperparameters and keeps the combination with the lowest mean error.
Rated_Capacity = 2.0
window_size = 16
feature_size = window_size
is_autoencoder = True
dropout = 0.0
EPOCH = 2000
nhead = 8
is_load_weights = False
weight_decay = 0.0
noise_level = 0.0
alpha = 0.0
metric = 're'
states = {}
for lr in [1e-3, 1e-2]:
    for num_layers in [1, 2]:
        for hidden_dim in [16, 32]:
            for alpha in [1e-5, 1e-4]:
                show_str = 'lr={}, num_layers={}, hidden_dim={}, alpha={}'.format(
                    lr, num_layers, hidden_dim, alpha)
                print(show_str)
                SCORE = []
                for seed in range(5):
                    print('seed:{}'.format(seed))
                    score_list, _ = train(lr=lr, feature_size=feature_size,
                                          hidden_dim=hidden_dim, num_layers=num_layers,
                                          nhead=nhead, weight_decay=weight_decay,
                                          EPOCH=EPOCH, seed=seed, dropout=dropout,
                                          is_autoencoder=is_autoencoder, alpha=alpha,
                                          noise_level=noise_level, metric=metric,
                                          is_load_weights=is_load_weights)
                    print(np.array(score_list))
                    print(metric + ': {:<6.4f}'.format(np.mean(np.array(score_list))))
                    print('----------------------------------------------------------------')
                    for s in score_list:
                        SCORE.append(s)
                print(metric + ' mean: {:<6.4f}'.format(np.mean(np.array(SCORE))))
                states[show_str] = np.mean(np.array(SCORE))
                print('===================================================================')

min_key = min(states, key=states.get)
print('optimal parameters: {}, result: {}'.format(min_key, states[min_key]))