Default eos token not working & gap in clinical performance in reproduced results #11

Open
pjumruspun opened this issue Oct 10, 2021 · 0 comments

Comments

@pjumruspun

Hello, I'm currently doing research on medical report generation, and your work on CDGPT-2 has really caught my interest.

But I'm currently facing two issues, "Default EOS Token Not Working" and "Not being able to reproduce the exact predictions and large gap in clinical accuracy", which I elaborate on below.

I've also attached ipynb files for you to investigate and reproduce if needed; each one is listed below its issue heading.

Default EOS Token Not Working

CDGPT_2_Reproduce.ipynb

This Python notebook contains the code for my experiments with different eos tokens and my attempt to reproduce the results.

My code modification

I've modified part of the code in test.py to get predictions in batches, as follows:

import time

import pandas as pd
import tensorflow as tf
from tqdm import tqdm


def generate_batch(FLAGS, encoder, decoder, tokenizer_wrapper, images, eos_token_ids, no_repeat_ngram_size):
    """ This function was modified from evaluate_full in test.py to predict in batch
    """
    visual_features, tags_embeddings = encoder(images)
    # Every sequence in the batch starts from the same "startseq" prompt.
    dec_input = tf.convert_to_tensor([tokenizer_wrapper.GPT2_encode("startseq", pad=False)] * len(images))

    num_beams = FLAGS.beam_width

    # Tile the image features along the batch axis, one copy per beam.
    visual_features = tf.tile(visual_features, [num_beams, 1, 1])
    tags_embeddings = tf.tile(tags_embeddings, [num_beams, 1, 1])
    tokens = decoder.generate(dec_input, max_length=FLAGS.max_sequence_length, num_beams=num_beams, min_length=3,
                              eos_token_ids=eos_token_ids, no_repeat_ngram_size=no_repeat_ngram_size,
                              visual_features=visual_features,
                              tags_embedding=tags_embeddings, do_sample=False, early_stopping=True)

    # Decode the token ids back to text and drop the special words.
    sentences = [tokenizer_wrapper.filter_special_words(tokenizer_wrapper.GPT2_decode(toks)) for toks in tokens]
    return sentences

def generate_all_batch(enqueuer, FLAGS, encoder, decoder, tokenizer_wrapper,
                       test_steps, eos_token_ids, filename=None, no_repeat_ngram_size=None, verbose=False):
    """ This function was modified from evaluate_enqueuer in test.py to predict
        enqueuer data in batch.

    Parameters:
    test_steps (int): Number of batches to predict
    eos_token_ids: End-of-sequence token id(s) passed on to decoder.generate
    filename (string): Path of the csv file to save the predicted results to (not saved if None)
    no_repeat_ngram_size: Passed on to decoder.generate
    verbose (boolean): Set to True to print every predicted result for a quick preview

    Returns:
    pandas.DataFrame: Predicted results
    """
    # Switch Keras to inference mode for the duration of the evaluation.
    tf.keras.backend.set_learning_phase(0)

    if not enqueuer.is_running():
        enqueuer.start(workers=FLAGS.generator_workers, max_queue_size=FLAGS.generator_queue_length)
    start = time.time()
    csv_dict = {"image_path": [], "real": [], "prediction": []}
    generator = enqueuer.get()
    for i in tqdm(range(test_steps)):
        images, target, img_path = next(generator)
        if verbose:
            print(f'\n({i+1}/{test_steps}) predicting {img_path}...')

        start_batch = time.time()
        predicted_sentences = generate_batch(FLAGS, encoder, decoder, tokenizer_wrapper,
                                             images, eos_token_ids, no_repeat_ngram_size)
        time_taken = time.time() - start_batch

        csv_dict["prediction"].extend(predicted_sentences)
        csv_dict["image_path"].extend(img_path)

        # Decode the ground-truth reports the same way as the predictions.
        target_sentences = [tokenizer_wrapper.filter_special_words(tokenizer_wrapper.GPT2_decode(toks)) for toks in target]
        csv_dict["real"].extend(target_sentences)

        if verbose:
            print('predicted sentences: ')
            for sentence in predicted_sentences:
                print(f'Length: {len(sentence.split())}')
                print(sentence)
            print('')

        print(f'Time taken for this batch: {time_taken:.3f}s, ({time_taken/images.shape[0]:.3f}s/image)')

    enqueuer.stop()

    print('Time taken for evaluation {} sec\n'.format(time.time() - start))
    # Restore training mode before returning.
    tf.keras.backend.set_learning_phase(1)
    df = pd.DataFrame(csv_dict)
    if filename is not None:
        print(f"Saving to {filename}")
        df.to_csv(filename, index=False)
    return df

So what's not working?

With the default eos token used in this code repository here (test.py line 52), the generated sentences do not seem to end properly.

The eos token mentioned in the paper, the standard GPT2 end-of-sentence token tokenizer_wrapper.GPT2_encode('<|endoftext|>', pad=False)[0], also does not seem to work properly.

Both eos token variants generated exactly the same sentences, as shown below.
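A quick check that might help narrow this down is to count how often a candidate eos id actually occurs in the encoded training targets: if it never occurs, the decoder has had no chance to learn to emit it, and beam search would run until max_length. The sketch below only illustrates the idea. The helper count_sequences_with_token and the toy ids are hypothetical; real sequences would have to come from the repository's tokenizer_wrapper.

```python
from typing import Iterable, Sequence


def count_sequences_with_token(sequences: Iterable[Sequence[int]], token_id: int) -> int:
    """Return how many sequences contain token_id at least once."""
    return sum(1 for seq in sequences if token_id in seq)


# Toy ids, NOT real tokenizer output: assume 9 stands for the candidate eos id
# and 7 for some other token that happens to close every training target.
toy_targets = [[1, 4, 5, 7], [1, 6, 7], [1, 4, 7]]
candidate_eos_id = 9
closing_id = 7

print(count_sequences_with_token(toy_targets, candidate_eos_id))  # 0
print(count_sequences_with_token(toy_targets, closing_id))  # 3
```

A count of zero for the eos id in use would be consistent with the behaviour described here, although it would not prove the cause.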

Generated sentences of encoded "<|endoftext|>" as eos_token (tokenizer_wrapper.GPT2_encode('<|endoftext|>', pad=False)[0])

  0%|          | 0/3 [00:00<?, ?it/s]
(1/3) predicting ['CXR3247_IM-1538-1001.png']...
 33%|███▎      | 1/3 [00:14<00:28, 14.43s/it]predicted sentences: 
Length: 102
"no acute cardiopulmonary disease.
the heart, pulmonary xxxx and mediastinum are within normal limits. there is no pleural effusion or pneumothorax. there is no focal air space opacity to suggest a pneumonia. there are mild degenerative changes of the thoracic spine."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no

Time taken for this batch: 14.266s, (14.266s/image)

(2/3) predicting ['CXR3483_IM-1692-1001.png']...
 67%|██████▋   | 2/3 [00:28<00:14, 14.09s/it]predicted sentences: 
Length: 122
"no acute pulmonary disease.
the lungs are clear. there is no pleural effusion. the heart and mediastinum are normal. there are atherosclerotic changes of the thoracic aorta. arthritic changes of the skeletal structures are noted."  "1. no pneumothorax or pleural effusion. surgical clips are present in the arthritic changes of the skeletal structures. surgical clips are present in the arthritic changes of the skeletal structures."  "no pneumothorax or pleural effusion."  "no pleural surgical clips are present in the arthritic changes of the skeletal structures."  "no pneumothorax or pleural surgical clips are present in the arthritic changes of the skeletal structures."  "no pleural surgical clips are present in the arthritic changes of the skeletal structures."  "no surgical clips are present in the arthritic

Time taken for this batch: 13.843s, (13.843s/image)

(3/3) predicting ['CXR1353_IM-0230-2001.png']...
100%|██████████| 3/3 [00:42<00:00, 14.01s/it]predicted sentences: 
Length: 121
"right middle lobe infiltrate consistent with pneumonia.
the heart is normal in size. the pulmonary vascularity is within normal limits in the lungs are clear. a large hiatal hernia is noted. calcified left hilar lymph xxxx are noted. there are surgical clips in the left lung base. a hiatal hernia is noted."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line

Time taken for this batch: 13.740s, (13.740s/image)
Time taken for evaluation 42.03989219665527 sec

Generated sentences of default eos token (tokenizer_wrapper.GPT2_eos_token_id())

  0%|          | 0/3 [00:00<?, ?it/s]
(1/3) predicting ['CXR3247_IM-1538-1001.png']...
 33%|███▎      | 1/3 [00:14<00:28, 14.31s/it]predicted sentences: 
Length: 102
"no acute cardiopulmonary disease.
the heart, pulmonary xxxx and mediastinum are within normal limits. there is no pleural effusion or pneumothorax. there is no focal air space opacity to suggest a pneumonia. there are mild degenerative changes of the thoracic spine."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no radiographic evidence for thoracic injury."  "no

Time taken for this batch: 14.129s, (14.129s/image)

(2/3) predicting ['CXR3483_IM-1692-1001.png']...
 67%|██████▋   | 2/3 [00:28<00:14, 14.03s/it]predicted sentences: 
Length: 122
"no acute pulmonary disease.
the lungs are clear. there is no pleural effusion. the heart and mediastinum are normal. there are atherosclerotic changes of the thoracic aorta. arthritic changes of the skeletal structures are noted."  "1. no pneumothorax or pleural effusion. surgical clips are present in the arthritic changes of the skeletal structures. surgical clips are present in the arthritic changes of the skeletal structures."  "no pneumothorax or pleural effusion."  "no pleural surgical clips are present in the arthritic changes of the skeletal structures."  "no pneumothorax or pleural surgical clips are present in the arthritic changes of the skeletal structures."  "no pleural surgical clips are present in the arthritic changes of the skeletal structures."  "no surgical clips are present in the arthritic

Time taken for this batch: 13.827s, (13.827s/image)

(3/3) predicting ['CXR1353_IM-0230-2001.png']...
100%|██████████| 3/3 [00:41<00:00, 13.93s/it]predicted sentences: 
Length: 121
"right middle lobe infiltrate consistent with pneumonia.
the heart is normal in size. the pulmonary vascularity is within normal limits in the lungs are clear. a large hiatal hernia is noted. calcified left hilar lymph xxxx are noted. there are surgical clips in the left lung base. a hiatal hernia is noted."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line has been removed."  "left picc line

Time taken for this batch: 13.628s, (13.628s/image)
Time taken for evaluation 41.78707838058472 sec

Summary of what's wrong

Both eos token variants took around 40 seconds to generate 3 predictions with batch_size=1, and none of the predictions seemed to end properly.
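To flag this kind of runaway output automatically rather than by eye, one rough heuristic is to count how often the most frequent sentence repeats within a prediction. This is only a sketch: max_sentence_repeats is a hypothetical helper, and the two strings are shortened, made-up examples in the style of the outputs above.

```python
from collections import Counter


def max_sentence_repeats(text: str) -> int:
    """Return how often the most frequent sentence occurs in text."""
    # Split on periods, then strip spaces, quotes and newlines around each piece.
    sentences = [s.strip(' "\n') for s in text.split(".")]
    counts = Counter(s for s in sentences if s)
    return max(counts.values(), default=0)


# Shortened, made-up strings in the style of the outputs above.
runaway = '"no acute disease." "no injury." "no injury." "no injury." "no'
clean = '"no acute disease. the lungs are clear."'

print(max_sentence_repeats(runaway))  # 3
print(max_sentence_repeats(clean))  # 1
```

A value well above 1 suggests that generation continued past the point where the report should have ended.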

The possible fix

I've discovered that tokenizer_wrapper.GPT2_encode("seq", pad=False)[0] works as a valid eos token, in that it ends the generated sentences properly.

Generated sentences of encoded "seq" as eos token (tokenizer_wrapper.GPT2_encode("seq", pad=False)[0])

  0%|          | 0/3 [00:00<?, ?it/s]
(1/3) predicting ['CXR3247_IM-1538-1001.png']...
 33%|███▎      | 1/3 [00:04<00:08,  4.14s/it]predicted sentences: 
Length: 30
"no acute pulmonary disease.
the lungs are clear. there is no pleural effusion or pneumothorax. the heart and mediastinum are normal. the skeletal structures and soft tissues are normal." end

Time taken for this batch: 3.929s, (3.929s/image)

(2/3) predicting ['CXR3483_IM-1692-1001.png']...
 67%|██████▋   | 2/3 [00:09<00:04,  4.71s/it]predicted sentences: 
Length: 36
"no acute pulmonary disease.
the lungs are clear. there is no pleural effusion. the heart and mediastinum are normal. there are atherosclerotic changes of the thoracic aorta. arthritic changes of the skeletal structures are noted." end

Time taken for this batch: 5.100s, (5.100s/image)

(3/3) predicting ['CXR1353_IM-0230-2001.png']...
100%|██████████| 3/3 [00:13<00:00,  4.58s/it]predicted sentences: 
Length: 35
"right middle lobe and lower lobe pneumonia.
right middle lobe and lower lobe consolidation and bilateral costophrenic xxxx blunting is present. heart size normal. pulmonary vascularity is normal. there is a large hiatal hernia." end

Time taken for this batch: 4.461s, (4.461s/image)
Time taken for evaluation 13.733989238739014 sec

Generation is now much faster (around 13s in total for the 3 images, or 4 to 5s per image), and the sentences seem to end properly.

Summary of this section

Neither the default GPT2 end-of-sentence token nor the manually encoded string <|endoftext|> seems to work as a proper eos token. Instead, the manually encoded string seq works, for reasons unknown to me. If possible, I would like to know why.
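One guess that may be worth checking (I have not verified it): since generation starts from "startseq", the training reports may be closed with a matching "endseq" marker, and the tokenizer may split that marker into "end" + "seq". Stopping at the first "seq" would then cut the output right after "end", which would match the trailing end in the outputs above. The toy sketch below only illustrates that mechanism; truncate_at_eos and the token list are hypothetical and are not taken from the repository.

```python
from typing import List


def truncate_at_eos(tokens: List[str], eos: str) -> List[str]:
    """Keep tokens up to, but not including, the first occurrence of eos."""
    return tokens[: tokens.index(eos)] if eos in tokens else tokens


# Hypothetical token stream: assumes the model closes a report with "end" + "seq"
# and then keeps generating because the configured eos token never appears.
generated = ["no", "acute", "disease", ".", "end", "seq", "no", "injury", ".", "end", "seq"]

# The configured eos is never produced, so nothing is cut.
print(truncate_at_eos(generated, "<|endoftext|>"))
# Stopping at "seq" leaves the report followed by a stray "end".
print(truncate_at_eos(generated, "seq"))  # ['no', 'acute', 'disease', '.', 'end']
```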

Not being able to reproduce the exact predictions and large gap in clinical accuracy

VisualCheXbert_CDGPT2.ipynb

This Python notebook contains the code to evaluate the predictions generated by CDGPT-2 in terms of clinical accuracy.

The issue

I cannot seem to reproduce the exact prediction results that are included in your provided model checkpoint folder here.

Although I haven't yet evaluated my predictions with the metrics from the paper to check whether the results are close enough, I have evaluated them for clinical accuracy using VisualCheXbert.

To explain briefly, VisualCheXbert is a model that takes a chest x-ray report as input and outputs labels for the conditions found in the text, in the following categories: Fracture, Consolidation, Enlarged Cardiomediastinum, No Finding, Pleural Other, Cardiomegaly, Pneumothorax, Atelectasis, Support Devices, Edema, Pleural Effusion, Lung Lesion, and Lung Opacity.

I tried to reproduce the results with the same test set you used (testing_set.csv) and changed the configuration to be as close as possible to the config.json file you've provided. What I changed:

  • tokenizer_vocab_size from default of 1001 to 2000
  • tags_threshold from default of -1 to 0.1
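To make sure no other setting differs, the repository defaults can be compared with the checkpoint's config.json key by key. Below is a minimal sketch of such a comparison. diff_configs is a hypothetical helper, and only the two settings listed above are taken from my runs; the beam_width value is made up for illustration.

```python
from typing import Any, Dict, Tuple


def diff_configs(default: Dict[str, Any], target: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each key whose value differs to a (default, target) pair."""
    missing = object()  # sentinel so that a missing key never equals a real value
    keys = sorted(set(default) | set(target))
    return {
        k: (default.get(k), target.get(k))
        for k in keys
        if default.get(k, missing) != target.get(k, missing)
    }


# Only tokenizer_vocab_size and tags_threshold come from this issue;
# the beam_width value is invented for illustration.
repo_defaults = {"tokenizer_vocab_size": 1001, "tags_threshold": -1, "beam_width": 4}
checkpoint_config = {"tokenizer_vocab_size": 2000, "tags_threshold": 0.1, "beam_width": 4}

for key, (old, new) in diff_configs(repo_defaults, checkpoint_config).items():
    print(f"{key}: {old} -> {new}")
```

In practice both dictionaries would be loaded from the actual files, for example with json.load.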

However, the model seems to predict completely different sentences, which can be found here. I've also attached a side-by-side comparison between your predictions and my reproduced predictions here. The original column contains the predictions from your checkpoint folder, and the reproduced column contains my reproduction attempt.

I've evaluated the clinical accuracy of both your predictions.csv and my reproduction attempt, prediction_reproduced.csv, by comparing the labels VisualCheXbert generates for the ground-truth reports with the labels it generates for the predicted sentences.

The clinical accuracy evaluated with VisualCheXbert shows a large gap in performance between your provided predictions.csv and my prediction_reproduced.csv, despite my effort to make all settings as close as possible to the config.json file. The largest gap is in precision, while recall is closer between your predictions and the reproduced ones. This results in a significant difference in F1 score. I've attached bar charts in the next section for further information (they are the same charts found in VisualCheXbert_CDGPT2.ipynb).

Even without changing any settings from the default configs in this code repository, the clinical accuracy is still not close to your original results (unfortunately I didn't keep those results, but if needed I can try evaluating with different configs).
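To put a number on how different the two sets of predictions are, the share of rows with identical text in both columns can be computed. This is a minimal sketch: exact_match_rate is a hypothetical helper, the column names original and reproduced are the ones from the comparison file, and the report texts in the toy csv are invented.

```python
import csv
import io


def exact_match_rate(rows, col_a="original", col_b="reproduced"):
    """Fraction of rows whose two text columns are identical after stripping whitespace."""
    rows = list(rows)
    if not rows:
        return 0.0
    matches = sum(1 for r in rows if r[col_a].strip() == r[col_b].strip())
    return matches / len(rows)


# Toy stand-in for the side-by-side file; the report texts are invented.
toy_csv = io.StringIO(
    "original,reproduced\n"
    "no acute disease.,no acute disease.\n"
    "the lungs are clear.,heart size is normal.\n"
)
print(exact_match_rate(csv.DictReader(toy_csv)))  # 0.5
```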
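For reference, the per-label comparison can be sketched roughly as follows. This is not the notebook code: per_label_scores is a hypothetical helper and the label sets are invented. It treats the labels extracted from the ground-truth report as the truth and the labels extracted from the predicted report as the prediction.

```python
from typing import Dict, List, Set


def per_label_scores(truth: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> Dict[str, Dict[str, float]]:
    """Precision, recall and F1 per label; a score is 0.0 when its denominator is 0."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(truth, pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(truth, pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(truth, pred) if label in t and label not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Invented label sets for three reports (not real VisualCheXbert output).
truth = [{"Edema"}, {"Edema", "Fracture"}, set()]
pred = [{"Edema"}, {"Fracture"}, {"Edema"}]

print(per_label_scores(truth, pred, ["Edema", "Fracture"]))
```

Because F1 is the harmonic mean of precision and recall, a drop in precision lowers F1 even when recall stays about the same.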

Precision

original: Evaluation results of your provided predictions.csv
reproduced: Evaluation results of my reproduction attempt, prediction_reproduced.csv

image

image

Recall

original: Evaluation results of your provided predictions.csv
reproduced: Evaluation results of my reproduction attempt, prediction_reproduced.csv

image

image

F1 Score

original: Evaluation results of your provided predictions.csv
reproduced: Evaluation results of my reproduction attempt, prediction_reproduced.csv

image

image

@pjumruspun pjumruspun changed the title Reproducing results attempt question & default eos token not working Default eos token not working & gap in clinical performance in reproduced results Oct 10, 2021