
Audio Captioning with BEATs, Conformer & BART

Winning model of DCASE Challenge 2023 Task 6A, with the follow-up publication:

  • Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation
    Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, and Shinji Watanabe
    Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP) 2024
    [arXiv page] [DCASE results]
  • BibTeX citation
    @inproceedings{wu2024improving,
      title={Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation},
      author={Wu, Shih-Lun and Chang, Xuankai and Wichern, Gordon and Jung, Jee-weon and Germain, Fran{\c{c}}ois and Le Roux, Jonathan and Watanabe, Shinji},
      booktitle={Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP)},
      year={2024}
    }
    

Install Packages

  • (Recommended) Create a Conda environment with Python 3.9
  • Install PyTorch with the CUDA version that matches your system (example commands for these two steps are sketched after this list)
  • Install dependencies for the SPICE metric
    cd caption_evaluation_tools/coco_caption
    bash get_stanford_models.sh
    cd ../../
    
  • Install other dependencies
    pip install -r requirements.txt
    

Download Dataset & Pretrained Model

  • Install p7zip (required to unpack the dataset)
    # if using conda
    conda install bioconda::p7zip
    # if installing to system
    # sudo apt-get install p7zip-full
    
  • Download Clotho dataset
    bash download_clotho.sh
    
  • Install Git-LFS
    # if using conda
    conda install conda-forge::git-lfs
    git-lfs install
    
    # if installing to system
    # curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
    # sudo apt-get install git-lfs
    # git-lfs install
    
  • Get the pretrained model (stored on HuggingFace)
    bash download_model.sh
    

Reproduce Best Model Results

  • Run inference & evaluation code
    bash run_sampling_reranking.sh
    
    • Metrics can then be found at ./exp/inference_evaluation_nucleus_t0.5_p95/inference_metrics.json
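
Once the run finishes, a quick optional way to inspect that file is Python's built-in JSON pretty-printer; this is just a convenience, not part of the repository's scripts:

    python -m json.tool ./exp/inference_evaluation_nucleus_t0.5_p95/inference_metrics.json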

(Bonus) Augmented Dataset

Our 50K mix-up caption augmentations generated by ChatGPT (see paper Section 2.3 for details) can be found at:

Acknowledgements

Our model/repository would not have been possible without the following great open-source works. Thank you so much!
