[feature request] Add new evaluate_model function which can return a more generalized metric #6

Open
isaacgerg opened this issue May 24, 2019 · 3 comments


@isaacgerg

In evaluate_model, the code below can be used to return metrics which can only be computed on all of the data, as opposed to being averaged over batches as is currently done. For simplicity, you can set numThreads and qSize to 1.

import queue
import threading

import numpy as np
import tqdm


def evaluate(model, generator, steps, numThreads=2, qSize=5):
    # Evaluate `model` over `steps` batches pulled from an indexable generator;
    # generator.next(i) is expected to return the (x, y) pair for batch index i.
    numItemsPushed_predict = 0
    dataQueue = queue.Queue(maxsize=qSize)
    mutex = threading.Lock()

    def producer(steps):
        nonlocal numItemsPushed_predict
        while True:
            # Atomically claim the next batch index; stop once all batches are claimed.
            mutex.acquire()
            if numItemsPushed_predict < steps:
                numItemsPushed_predict += 1
                myUid = numItemsPushed_predict
                mutex.release()
            else:
                mutex.release()
                return
            x, y = generator.next(myUid - 1)
            dataQueue.put((x, y, myUid - 1))

    # Background threads keep the queue filled while the main thread evaluates.
    tVec = []
    for k in range(numThreads):
        t = threading.Thread(target=producer, args=(steps,))
        t.daemon = True
        t.start()
        tVec.append(t)

    resultVec = np.zeros(steps)
    batchSize = None
    pBar = tqdm.tqdm(range(steps), desc='EVALUATE')
    for k in pBar:
        currentQSize = dataQueue.qsize()
        x, y, uid = dataQueue.get()  # uid kept for debugging
        if batchSize is None:
            # Infer the batch size from the first batch (multi-input models pass x as a list).
            if isinstance(x, list):
                batchSize = x[0].shape[0]
            else:
                batchSize = x.shape[0]
        r = model.evaluate(x, y, batch_size=batchSize, verbose=0)
        # model.evaluate returns a list when the model was compiled with extra
        # metrics; keep only the loss in that case.
        resultVec[k] = r[0] if isinstance(r, (list, tuple)) else r
        pBar.set_description('EVALUATE | QSize: {0}/{1}'.format(currentQSize, qSize))

    return resultVec

evaluate_model(self, model) then becomes roughly the following, using a predict-style variant of the function above that returns the labels and predictions for the whole validation set (the data will have to be converted from a generator, which is easy):

y_true, y_pred = evaluate(model, data, steps, numThreads=1, qSize=1)
loss = lossFunction(y_true, y_pred)
# accuracy can be computed with sklearn.metrics.accuracy_score

You could also support metrics like AUC now.
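
To make that concrete, here is a minimal sketch of how the whole-dataset metrics could be computed, assuming a binary classifier with a single sigmoid output and the same generator.next(i) interface used above (the helper name and shapes are illustrative, not part of the repo):

import numpy as np
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

def evaluate_model_full(model, generator, steps):
    # Collect labels and predictions for the whole validation set instead of
    # averaging per-batch results.
    y_true_parts, y_pred_parts = [], []
    for i in range(steps):
        x, y = generator.next(i)
        y_true_parts.append(np.asarray(y).ravel())
        y_pred_parts.append(model.predict(x, verbose=0).ravel())
    y_true = np.concatenate(y_true_parts)
    y_pred = np.concatenate(y_pred_parts)

    loss = log_loss(y_true, y_pred)                               # cross-entropy over the full set
    accuracy = accuracy_score(y_true, (y_pred > 0.5).astype(int)) # threshold sigmoid outputs at 0.5
    auc = roc_auc_score(y_true, y_pred)                           # only well-defined on the full set
    return loss, accuracy, auc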

@Pattio
Owner

Pattio commented May 25, 2019

What is the problem with evaluating the model by averaging over batches? Sure, the results might be slightly different due to floating point error, but isn't that negligible? Furthermore, the Keras evaluate method has a batch_size argument which allows you to change the batch size, i.e. you could set it to 1.

@isaacgerg
Author

isaacgerg commented May 25, 2019

Not all metrics can be evaluated properly by batching and averaging. Area under the receiver operating characteristic curve (ROC AUC) is a popular metric which has to be computed over the whole validation set rather than averaged over batches; the batch-averaged value will be wildly inaccurate or undefined. For example, say your validation set has 2 imbalanced classes (a common setup): some batches may not contain both classes, so the AUC for those batches is undefined.
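
To illustrate the undefined case: sklearn.metrics.roc_auc_score raises when given only one class, so a per-batch average cannot even be formed for single-class batches (toy example, not code from this repo):

from sklearn.metrics import roc_auc_score

y_true_batch = [1, 1, 1, 1]           # imbalanced data: this batch contains only positives
y_score_batch = [0.9, 0.8, 0.7, 0.6]

try:
    roc_auc_score(y_true_batch, y_score_batch)
except ValueError as err:
    print(err)  # "Only one class present in y_true. ROC AUC score is not defined in that case."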

@Pattio
Owner

Pattio commented May 26, 2019

Oh, I see, that's a good point! In that case I think it would be better to introduce a new metric (ROC AUC) and then refactor all available metrics (accuracy, loss, ROC AUC) into a separate Metrics class which would be returned by the evaluate method. This refactored evaluate method could internally call your proposed method to obtain the labels and then, if needed, calculate ROC AUC using sklearn.metrics.roc_auc_score or tf.metrics.auc.
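
A rough sketch of what that Metrics refactor could look like (names are illustrative, not an existing API in this repo; it assumes the model was compiled with accuracy as a metric and has a single sigmoid output):

from dataclasses import dataclass
from sklearn.metrics import roc_auc_score

@dataclass
class Metrics:
    # Container returned by the refactored evaluate method.
    loss: float
    accuracy: float
    roc_auc: float = None

def evaluate_model(model, x_val, y_val):
    # Whole-dataset evaluation: loss/accuracy from Keras, ROC AUC from sklearn
    # computed on predictions for the full validation set.
    loss, accuracy = model.evaluate(x_val, y_val, verbose=0)
    y_score = model.predict(x_val, verbose=0).ravel()
    return Metrics(loss=loss, accuracy=accuracy, roc_auc=roc_auc_score(y_val, y_score))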

However, this means that we should also refactor create_early_stop_callback and create_checkpoint_callback to use the newly introduced metrics. Furthermore, a few places where cfg['metrics'] is used should also be refactored accordingly.

I'll add this request for new metrics to the roadmap; in the meantime, feel free to create your own PR.
