!! ATTENTION !!
- For these custom metrics, the average across mini-batches is in general not equal to the metric evaluated on the whole dataset.
- The metric on the validation set is calculated in batches and then averaged. The model as it stands at the end of the epoch is used for this, in contrast to the training-set metric, which is accumulated over the batches while the weights are still being updated.
1. How to compute precision and recall in Keras? --> https://www.thinbug.com/q/43076609
2. How are metrics computed in Keras? --> https://stackoverflow.com/questions/49359489/how-are-metrics-computed-in-keras
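The first point can be checked by hand. The following is a minimal sketch in plain Python, with made-up labels and predictions (not data from these notes), comparing precision averaged over two mini-batches with precision computed on all samples at once:

```python
def precision(y_true, y_pred):
    """Precision = true positives / predicted positives (0.0 if nothing is predicted positive)."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pred_pos = sum(1 for p in y_pred if p == 1)
    return true_pos / pred_pos if pred_pos else 0.0


# Made-up labels and predictions, split into two mini-batches of four samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 1]
batch_size = 4

# Metric per mini-batch, then averaged (what a batch-wise custom metric reports).
batch_scores = [
    precision(y_true[i:i + batch_size], y_pred[i:i + batch_size])
    for i in range(0, len(y_true), batch_size)
]
batch_average = sum(batch_scores) / len(batch_scores)

# Metric on the whole dataset in one go.
full_precision = precision(y_true, y_pred)

print("per-batch precision:", batch_scores)
print("average over batches:", batch_average)
print("precision on all samples:", full_precision)
```

The two numbers differ because each batch contributes equally to the average, regardless of how many positive predictions it contains.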
filepath = model_snapshot_directory + '/' + 'lstm_model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
model.fit(X_train, y_train, epochs=100, batch_size=128,
          verbose=1, callbacks=[checkpoint], validation_data=(X_test, y_test))
The checkpoint monitors val_loss (monitor='val_loss'). With save_best_only=True, the model is not saved in epochs where val_loss does not improve.
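A rough sketch of what this checkpoint setup amounts to, without Keras: the filepath template is filled in from the epoch number and the logged values, and a file name is produced only when val_loss beats the best value seen so far. The log values are made up, and the loop is a simplification of ModelCheckpoint, not its actual implementation.

```python
filepath = 'lstm_model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'

# Made-up per-epoch logs, shaped like the dict Keras hands to callbacks.
epoch_logs = [
    {'loss': 0.9, 'val_loss': 0.80},
    {'loss': 0.6, 'val_loss': 0.85},        # val_loss got worse: nothing saved
    {'loss': 0.4123, 'val_loss': 0.52961},  # val_loss improved: saved
]

best = float('inf')  # mode='min' starts from +infinity
saved = []
for epoch, logs in enumerate(epoch_logs):
    current = logs['val_loss']
    if current < best:
        best = current
        # Epochs are numbered from 1 in the file name.
        saved.append(filepath.format(epoch=epoch + 1, **logs))

for name in saved:
    print(name)
```

Because each improvement writes a new file name, the last file written is the one with the lowest val_loss.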
best_run,best_model=optim.minimize(model=train,data=prepare_data,algo=tpe.suggest, max_evals=100,trials=Trials())
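Here max_evals=100 means that 100 different parameter combinations are evaluated during the search, each with its own set of model parameters. Choose the value according to how many parameters are actually being tuned; the larger it is, the more models are trained.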
best_model.evaluate(X_test,y_test)
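evaluate returns the pair (score, acc): score is the loss on the test data and acc is the accuracy. The 'loss' that Hyperas minimizes in the train() functions in these notes is -acc.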
hist=model.fit(X_train, y_train, epochs=100, batch_size={{choice([64, 128, 256])}}, verbose=1,
callbacks=callback_list, validation_data=(X_test, y_test))
h1=hist.history
acc_=np.asarray(h1['acc'])
loss_=np.asarray((h1['loss']))
val_acc=np.asarray(h1['val_acc'])
val_loss=np.asarray(h1['val_loss'])
acc_and_loss=np.column_stack((acc_,loss_,val_acc,val_loss))
save_file_mlp = model_snapshot_directory+'/mlp_run_' + '_' + str(globalvars.globalVar) + '.txt'
with open(save_file_mlp, 'w') as f:
    # write through the open file handle instead of reopening the path
    np.savetxt(f, acc_and_loss, delimiter=" ")
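One caveat about the snippet above: the keys 'acc' and 'val_acc' depend on the Keras version, and newer releases record 'accuracy' and 'val_accuracy' when metrics=['accuracy'] is used. A small sketch of a lookup that tolerates both spellings follows; pick_metric is a hypothetical helper, not part of Keras, and the history values are made up.

```python
def pick_metric(history_dict, *candidates):
    """Return the first series in history_dict whose key is one of candidates."""
    for key in candidates:
        if key in history_dict:
            return history_dict[key]
    raise KeyError('none of %r found in history' % (candidates,))


# Made-up history dict using the newer key names.
h1 = {
    'loss': [0.9, 0.6], 'accuracy': [0.55, 0.70],
    'val_loss': [0.8, 0.7], 'val_accuracy': [0.60, 0.65],
}

acc_ = pick_metric(h1, 'acc', 'accuracy')
val_acc = pick_metric(h1, 'val_acc', 'val_accuracy')

# One row per epoch: acc, loss, val_acc, val_loss (same column order as above).
rows = list(zip(acc_, h1['loss'], val_acc, h1['val_loss']))
for row in rows:
    print(' '.join('%.6f' % value for value in row))
```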
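fit() returns a History object (named history below). Its history attribute holds the loss trace and any other metrics specified when the model was compiled; these scores are recorded at the end of every training epoch.
The model's performance can be plotted with Matplotlib: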
from matplotlib import pyplot
pyplot.plot(history.history['loss'])
pyplot.plot(history.history['val_loss'])
pyplot.title('model train vs validation loss')
pyplot.ylabel('loss')
pyplot.xlabel('epoch')
pyplot.legend(['train', 'validation'], loc='upper right')
pyplot.show()
https://baijiahao.baidu.com/s?id=1577431637601070077&wfr=spider&for=pc
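Create a new .py file (alt_model_checkpoint.py, to match the import used in the training script) with the following content: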
from keras.callbacks import ModelCheckpoint
class AltModelCheckpoint(ModelCheckpoint):
    def __init__(self, filepath, alternate_model, **kwargs):
        """
        Additional keyword args are passed to ModelCheckpoint; see those docs for information on what args are accepted.
        :param filepath:
        :param alternate_model: Keras model to save instead of the default. This is used especially when training multi-
                                gpu models built with Keras multi_gpu_model(). In that case, you would pass the original
                                "template model" to be saved each checkpoint.
        :param kwargs:          Passed to ModelCheckpoint.
        """
        self.alternate_model = alternate_model
        super().__init__(filepath, **kwargs)

    def on_epoch_end(self, epoch, logs=None):
        model_before = self.model
        self.model = self.alternate_model
        super().on_epoch_end(epoch, logs)
        self.model = model_before
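The swap-and-restore idea in on_epoch_end can be tried without Keras. In the sketch below, FakeCheckpoint is a hypothetical stand-in for ModelCheckpoint that only records which model it was asked to save, and the models are plain strings. The try/finally is an addition of this sketch (the class above does not use it); it restores the original model even if saving raises.

```python
class FakeCheckpoint:
    """Hypothetical stand-in for ModelCheckpoint: records what it would save."""

    def __init__(self):
        self.model = None
        self.saved = []

    def on_epoch_end(self, epoch, logs=None):
        self.saved.append((epoch, self.model))


class AltCheckpointSketch(FakeCheckpoint):
    """Saves alternate_model instead of the model being trained."""

    def __init__(self, alternate_model):
        super().__init__()
        self.alternate_model = alternate_model

    def on_epoch_end(self, epoch, logs=None):
        model_before = self.model
        self.model = self.alternate_model
        try:
            super().on_epoch_end(epoch, logs)
        finally:
            # Always hand the training model back, even if saving failed.
            self.model = model_before


cb = AltCheckpointSketch(alternate_model='base_model')
cb.model = 'gpu_model'      # the training loop attaches the model being fitted
cb.on_epoch_end(0)
print(cb.saved)             # which model was "saved"
print(cb.model)             # which model the callback holds afterwards
```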
Then, in the training script:
from alt_model_checkpoint import AltModelCheckpoint
from keras.models import Model
from keras.utils import multi_gpu_model
base_model = Model(...)
gpu_model = multi_gpu_model(base_model,numbers_of_gpu)
gpu_model.compile(...)
gpu_model.fit(..., callbacks=[
AltModelCheckpoint('save/path/for/model.hdf5', base_model)
])
To add early-stopping-style checkpointing (saving only the best model), modify the call to fit, for example:
hist = gpu_model.fit(X_train, y_train,
                     batch_size={{choice([64, 128, 256])}},
                     epochs=100,
                     verbose=1,
                     callbacks=[
                         AltModelCheckpoint(
                             filepath,
                             model,
                             monitor='val_loss',
                             verbose=1,
                             save_best_only=True,
                             mode='min'
                         )
                     ],
                     validation_data=(X_test, y_test))
Because AltModelCheckpoint inherits from ModelCheckpoint, these keyword arguments can be passed straight through.
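This is an example of using multiple GPUs.
- It mainly relies on the multi_gpu_model function.
- During training, checkpoints are saved through a custom callback that saves base_model rather than gpu_model. To use the model after it has been saved:
1. Load the saved model file with load_model.
2. Rebuild the parallel model with gpu_model = multi_gpu_model(base_model), then call gpu_model.compile().
3. Predict with gpu_model.predict().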
In Keras's saving.py, add the following:
# ... earlier get_json_type code
# NOTE: Hacky fix to serialize Dimension objects.
from tensorflow.python.framework.tensor_shape import Dimension
if type(obj) == Dimension:
    return int(obj)
# original error raised here
raise TypeError('Not JSON Serializable:', obj)
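To see why this patch helps, here is a self-contained sketch of the mechanism: json.dumps calls the function given as default for every object it cannot encode, so converting the offending type to int there makes the config serializable. FakeDimension is a hypothetical stand-in for TensorFlow's Dimension, and this get_json_type is a cut-down illustration, not the Keras function.

```python
import json


class FakeDimension:
    """Hypothetical stand-in for tensorflow's Dimension: wraps an int."""

    def __init__(self, value):
        self.value = value

    def __int__(self):
        return self.value


def get_json_type(obj):
    """Fallback encoder: called by json.dumps for objects it cannot serialize."""
    if type(obj) == FakeDimension:
        return int(obj)
    raise TypeError('Not JSON Serializable:', obj)


config = {'units': FakeDimension(64), 'name': 'dense_1'}
encoded = json.dumps(config, default=get_json_type)
print(encoded)
```

Without the FakeDimension branch, the same call would raise the TypeError shown in the patch.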
An example of using multiple GPUs with Keras: https://www.jianshu.com/p/d57595dac5a9
Hyperparameter tuning with multiple GPUs + early stopping + Hyperas
# in the function train
def train():
    ...
    # first distribute GPUs according to the gpu which you possess
    gpu_model = multi_gpu_model(model, gpus=2)
    gpu_model.compile(loss=loss_fn, optimizer=optim, metrics=['accuracy'])
    # set earlystopping using ModelCheckPoint
    filepath='...'
    early_stopping = EarlyStopping(monitor='val_loss', patience=20, mode='min')
    checkpointer = ModelCheckpoint(filepath=filepath, monitor='val_loss', verbose=1,
                                   save_best_only=True, save_weights_only=True, mode='min')
    callback_list = [early_stopping, checkpointer]
    # train the model
    hist = gpu_model.fit(X_train, y_train, epochs=100, batch_size=128,
                         verbose=1, callbacks=callback_list, validation_data=(X_test, y_test))
    gpu_model.load_weights(filepath)
    score, acc = gpu_model.evaluate(X_test, y_test, verbose=0)
    print('Test accuracy:', acc)
    return {'loss': -acc, 'status': STATUS_OK, 'model': gpu_model}
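Here gpu_model is used for saving, and gpu_model is also what is returned.
Note: a model trained on multiple GPUs must still be run on multiple GPUs for prediction. For example, if it was trained on node g-1-4, prediction must also run on g-1-4, with the same number of GPUs.
Notes:
1. When using multiple GPUs together with ModelCheckpoint, saving the full model does not work; only with save_weights_only=True does training proceed normally and the script finish cleanly.
2. With multiple GPUs + Hyperas + ModelCheckpoint, point 1 means only the weights can be saved. In addition, the best_model returned by the Hyperas optimization cannot be saved normally; it fails with the error "can not pickle the module". Even if it could be saved, the model would have to be rebuilt for prediction, with the same number of GPUs as in training, and there is no way to tell which weight file is the best one. Prediction is therefore not possible.
The solution is: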
def train():
    ...
    # first distribute GPUs according to the gpu which you possess
    gpu_model = multi_gpu_model(model, gpus=2)
    gpu_model.compile(loss=loss_fn, optimizer=optim, metrics=['accuracy'])
    # set earlystopping using ModelCheckPoint
    filepath='...'
    early_stopping = EarlyStopping(monitor='val_loss', patience=20, mode='min')
    checkpointer = ModelCheckpoint(filepath=filepath, monitor='val_loss', verbose=1,
                                   save_best_only=True, save_weights_only=True, mode='min')
    callback_list = [early_stopping, checkpointer]
    # train the model
    hist = gpu_model.fit(X_train, y_train, epochs=100, batch_size=128,
                         verbose=1, callbacks=callback_list, validation_data=(X_test, y_test))
    score, acc = gpu_model.evaluate(X_test, y_test, verbose=0)
    model.save(filepath)  # this line must come after gpu_model.evaluate, otherwise an error is raised
    print('Test accuracy:', acc)
    return {'loss': -acc, 'status': STATUS_OK, 'model': model}
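Note that model is what gets saved and what gets returned. Although model itself is never compiled, once fit has finished its weights are the weights of gpu_model.
In my tests this does yield a usable best_model, and prediction can then be run on a single GPU.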
input_embed = Input(shape=(700,), name='input_embed')
input_extra = Input(shape=(700, 25,), name='input_extra')
embedded = Embedding(num_amino_acids, 50, input_length=700)(input_embed)
x = concatenate([embedded, input_extra], axis=2)
......
x = BatchNormalization()(x)
output = Activation(activation='sigmoid')(x)
model = Model(inputs=[input_embed, input_extra], outputs=output)
gpu_model = multi_gpu_model(model, 4)
gpu_model.compile(...)
callback_list = [early_stopping, checkpointer]
hist = gpu_model.fit(x={'input_embed': X_all, 'input_extra': X_extra},
y=y_all,
epochs=100,
batch_size=256,
verbose=1,
callbacks=callback_list,
class_weight=class_weights,
validation_split=0.2)
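Although validation_split is set, validation only runs once per epoch during training, not after every batch. Also, with multiple inputs I could not pass the multiple inputs through validation_data here (the fit_generator example further down does pass a list of inputs to validation_data).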
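For reference, validation_split holds out the last fraction of the samples, taken before any shuffling. A hand-made equivalent for several aligned inputs can be sketched as follows; split_last_fraction is a hypothetical helper, the data is made up, and the exact rounding Keras applies at the split point is an assumption. A fraction of 0.25 is used so that the arithmetic is exact.

```python
def split_last_fraction(arrays, validation_split):
    """Split every array at the same index, holding out the last fraction."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError('all arrays must have the same number of samples')
    split_at = int(n * (1.0 - validation_split))
    train = [a[:split_at] for a in arrays]
    val = [a[split_at:] for a in arrays]
    return train, val


# Made-up data: two aligned inputs and one target, eight samples each.
X_all = list(range(8))
X_extra = [x * 10 for x in X_all]
y_all = [x % 2 for x in X_all]

train_parts, val_parts = split_last_fraction([X_all, X_extra, y_all], 0.25)
X_tr, X_extra_tr, y_tr = train_parts
X_val, X_extra_val, y_val = val_parts

print('train samples:', X_tr)
print('validation samples:', X_val)
```

Splitting every array at the same index keeps the two inputs and the targets aligned, which is the property that matters for multi-input models.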
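Important (this tripped me up several times): the names X_train, X_extra and y_train in the code below matter. prepare_data() must use exactly these names for the values it returns, otherwise an error is raised saying the name does not exist.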
def prepare_data():
    ......
    return (X_train,X_extra,y_train)

if __name__ == "__main__":
    best_run,best_model=optim.minimize(model=train,data=prepare_data,algo=tpe.suggest,
                                       max_evals=30,trials=Trials())
    X_train,X_extra,y_train=prepare_data()
Method 1: in my tests, tuning with Hyperas raises an error saying that auc_roc does not exist.
import tensorflow as tf
from tensorflow.contrib.metrics import streaming_auc

def auc_roc(y_true,y_pred):
    value,update_op=streaming_auc(y_pred,y_true)
    # find all variables created for this metric
    metric_vars=[i for i in tf.local_variables() if 'auc_roc' in i.name.split('/')[1]]
    # Add metric variables to GLOBAL_VARIABLES collection.
    # They will be initialized for new session.
    for v in metric_vars:
        tf.add_to_collection(tf.GraphKeys.GLOBAL_VARIABLES, v)
    # force to update metric values
    with tf.control_dependencies([update_op]):
        value = tf.identity(value)
    return value
def train():
    ......
    gpu_model.compile(loss=loss_fn, optimizer=adam, metrics=['accuracy',auc_roc])
    early_stopping = EarlyStopping(monitor='val_auc_roc', patience=20, mode='max')
    checkpointer = ModelCheckpoint(filepath=filepath, monitor='val_auc_roc', verbose=1,
                                   save_best_only=True,save_weights_only=True, mode='max')
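As a sanity check on the values a streaming metric such as auc_roc reports, ROC AUC can be computed directly on a small set of predictions. The sketch below uses the pairwise-ranking definition (the fraction of positive/negative pairs in which the positive sample scores higher, ties counting half). This is not what streaming_auc does internally, since that function approximates the curve from a fixed set of thresholds. The function name roc_auc and the data are made up for illustration.

```python
def roc_auc(y_true, y_score):
    """Exact ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError('AUC needs at least one positive and one negative sample')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, y_score)
print('AUC:', auc)
```

The quadratic loop is fine for a spot check on a few hundred samples; it is not meant for use inside training.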
1. When tuning with Hyperas, the default return value is
# return {'loss': -acc, 'status': STATUS_OK, 'model': model}
However, when max_evals is set to a large value such as 100 or 200, GPU memory is easily exhausted. Changing the return statement to the following solves this:
return {'loss': -acc, 'status': STATUS_OK}
In addition, call K.clear_session() at the start of every tuning run.
See issue #259 of Hyperas.
def train():
    K.clear_session()
    model()
    return {'loss': -acc, 'status': STATUS_OK}
2. The default Keras training function is fit, which by default takes all of the data at once. Use fit_generator instead of fit to load the data in batches.
def data_generator(data1, data2, targets, batch_size):
    # number of batches per pass over the data (ceiling division)
    batches = (len(data1) + batch_size - 1) // batch_size
    while True:
        for i in range(batches):
            X = data1[i * batch_size:(i + 1) * batch_size]
            X_extra = data2[i * batch_size:(i + 1) * batch_size]
            Y = targets[i * batch_size:(i + 1) * batch_size]
            yield [X, X_extra], Y

# use fit_generator for training
hist=model.fit_generator(generator=data_generator(X_train,X_extra,y_train,batch_size=batch_size),
                         steps_per_epoch=(len(X_train)+batch_size-1)//batch_size,
                         epochs=150,verbose=1,callbacks=callback_list,
                         validation_data=([X_validate,X_validate_extra],y_validate))
3. Reduce batch_size.
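The generator from item 2 can be exercised on its own, without Keras, to confirm that steps_per_epoch batches cover every sample exactly once, including the shorter final batch. The sketch below repeats data_generator so that it is self-contained and runs it on made-up lists.

```python
def data_generator(data1, data2, targets, batch_size):
    """Yield ([X, X_extra], Y) batches forever, cycling over the data."""
    batches = (len(data1) + batch_size - 1) // batch_size  # ceiling division
    while True:
        for i in range(batches):
            X = data1[i * batch_size:(i + 1) * batch_size]
            X_extra = data2[i * batch_size:(i + 1) * batch_size]
            Y = targets[i * batch_size:(i + 1) * batch_size]
            yield [X, X_extra], Y


# Made-up data: ten samples, batch size four, so the last batch holds two.
X_train = list(range(10))
X_extra = [x * 10 for x in X_train]
y_train = [x % 2 for x in X_train]
batch_size = 4
steps_per_epoch = (len(X_train) + batch_size - 1) // batch_size

gen = data_generator(X_train, X_extra, y_train, batch_size)
epoch_batches = [next(gen) for _ in range(steps_per_epoch)]

batch_lengths = [len(y) for _, y in epoch_batches]
seen = [x for (xs, _extra), _y in epoch_batches for x in xs]
first_of_next_epoch = next(gen)  # the generator wraps around to the start

print('steps per epoch:', steps_per_epoch)
print('batch lengths:', batch_lengths)
print('samples seen in one epoch:', seen)
```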
See issue #16 of Hyperas.
Add this code after compiling the model to initialize the variables.
See issue #5427 of Keras for more details.
from keras import backend as K
import tensorflow as tf
K.set_session(tf.Session(graph=model.output.graph))
init = tf.global_variables_initializer()
K.get_session().run(init)
job exception: list index out of range
Traceback (most recent call last):
File "regressor.py", line 248, in <module>
data_args=(neighbor_obj,))
File "/root/anaconda3/envs/python367/lib/python3.6/site-packages/hyperas/optim.py", line 72, in minimize
data_args=data_args)
File "/root/anaconda3/envs/python367/lib/python3.6/site-packages/hyperas/optim.py", line 142, in base_minimizer
return_argmin=True),
File "/root/anaconda3/envs/python367/lib/python3.6/site-packages/hyperopt/fmin.py", line 482, in fmin
show_progressbar=show_progressbar,
File "/root/anaconda3/envs/python367/lib/python3.6/site-packages/hyperopt/base.py", line 686, in fmin
show_progressbar=show_progressbar,
File "/root/anaconda3/envs/python367/lib/python3.6/site-packages/hyperopt/fmin.py", line 509, in fmin
rval.exhaust()
File "/root/anaconda3/envs/python367/lib/python3.6/site-packages/hyperopt/fmin.py", line 330, in exhaust
self.run(self.max_evals - n_done, block_until_done=self.asynchronous)
File "/root/anaconda3/envs/python367/lib/python3.6/site-packages/hyperopt/fmin.py", line 286, in run
self.serial_evaluate()
File "/root/anaconda3/envs/python367/lib/python3.6/site-packages/hyperopt/fmin.py", line 165, in serial_evaluate
result = self.domain.evaluate(spec, ctrl)
File "/root/anaconda3/envs/python367/lib/python3.6/site-packages/hyperopt/base.py", line 892, in evaluate
print_node_on_error=self.rec_eval_print_node_on_error,
File "/root/anaconda3/envs/python367/lib/python3.6/site-packages/hyperopt/pyll/base.py", line 874, in rec_eval
rval_var = node.pos_args[int(switch_i) + 1]
IndexError: list index out of range
This is a known bug of ray (see issues #615 and #7058). The problem only occurs when ATPE is used instead of TPE.
- Keras
- Hyperas
- CSDN Blog by JNYxiaocao
- F. Chollet, Deep Learning With Python, New York, NY, USA:Manning Publications Co, 2017. - ISBN 9781617294433