FoundationVision/Liquid

Liquid: Language Models are Scalable and Unified Multi-modal Generators

Junfeng Wu1,2 · Yi Jiang2† · Chuofan Ma2,3
Yuliang Liu1 · Hengshuang Zhao3
Zehuan Yuan2 · Song Bai2* · Xiang Bai1*

1HUST   2ByteDance   3HKU
†project lead   *corresponding author

Paper PDF · Project Page

This repo implements Liquid, a scalable and unified autoregressive generation paradigm that seamlessly integrates multimodal comprehension and generation.

teaser

News

2025-02-28: Paper, demo, model, and project page for Liquid are all released.

📑 Open-Source Plan

  • Liquid-7B (Mix-pretrained Multimodal Model with T2I and Language Ability)
    • Web Demo
    • Inference
    • Checkpoints
  • Liquid-7B-Multiratio (Multi-Ratio Image Generation Model)
    • Web Demo
    • Inference
    • Checkpoints
  • Liquid-7B-IT (Instruction Tuned Multimodal Model with Instruction Following Ability)
    • Web Demo
    • Inference
    • Checkpoints

📖 Introduction

  • We present Liquid, an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation.

  • Unlike previous multimodal large language models (MLLMs), Liquid achieves this integration with a single large language model (LLM), eliminating the need for external pretrained visual embeddings such as CLIP.

  • For the first time, Liquid uncovers a scaling law: the performance drop brought by the unified training of visual and language tasks diminishes as the model size increases.

  • Furthermore, the unified token space enables visual generation and comprehension tasks to mutually enhance each other.
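The idea of a unified token space can be sketched as follows. This is a minimal, illustrative toy under stated assumptions: the vocabulary sizes, token-id ranges, and example ids below are placeholders, not Liquid's actual configuration. The point is that text tokens and discrete (e.g. VQ) image tokens share one vocabulary, so a single autoregressive LM models both with plain next-token prediction.

```python
# Toy sketch of a unified token space: text tokens and discrete image
# tokens live in one shared vocabulary. All sizes and id ranges here
# are illustrative assumptions, not Liquid's actual values.

TEXT_VOCAB = 32_000          # assumed text vocabulary size
IMAGE_CODEBOOK = 8_192       # assumed image codebook size
UNIFIED_VOCAB = TEXT_VOCAB + IMAGE_CODEBOOK

def image_code_to_token(code: int) -> int:
    """Shift an image codebook index into the shared vocabulary."""
    assert 0 <= code < IMAGE_CODEBOOK
    return TEXT_VOCAB + code

def token_is_image(token: int) -> bool:
    """Image tokens occupy the id range above the text vocabulary."""
    return token >= TEXT_VOCAB

# A multimodal example is just one id stream: the LM is trained with
# ordinary next-token prediction over text and image tokens alike.
text_tokens = [101, 2009, 2003]                        # a tokenized prompt (placeholder ids)
image_tokens = [image_code_to_token(c) for c in (5, 17, 4095)]
sequence = text_tokens + image_tokens

assert all(0 <= t < UNIFIED_VOCAB for t in sequence)
assert [token_is_image(t) for t in sequence] == [False] * 3 + [True] * 3
```

Because both modalities are ordinary ids in one stream, understanding (image tokens as input, text tokens as output) and generation (text in, image tokens out) reduce to the same training objective.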

🔥 Multimodal Generation

  • Liquid is a scalable and versatile unified multimodal generator that supports visual understanding, visual generation, and multi-modal generation.

teaser

  • Liquid can generate high-quality, photorealistic images at any aspect ratio from language prompts in an autoregressive paradigm.

teaser
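One simple way any-aspect-ratio generation can work in a discrete-token setting is to pick an image-token grid whose shape matches the requested ratio under a fixed token budget. The sketch below is an assumption for illustration only; the grid-selection scheme, budget, and function name are not taken from Liquid's code.

```python
import math

def token_grid_for_ratio(ratio: float, budget: int = 1024) -> tuple[int, int]:
    """Choose a (rows, cols) image-token grid approximating the requested
    width/height `ratio` under a fixed token budget. Illustrative only;
    not Liquid's actual scheme."""
    rows = max(1, round(math.sqrt(budget / ratio)))
    cols = max(1, round(budget / rows))
    return rows, cols

# A 1:1 request yields a square grid; 16:9 yields a wider-than-tall grid.
print(token_grid_for_ratio(1.0))      # → (32, 32)
print(token_grid_for_ratio(16 / 9))   # → (24, 43)
```

The decoder then emits rows × cols image tokens autoregressively, and the image tokenizer reconstructs pixels at the corresponding resolution.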

🔥 Scaling Law for Multimodal Generation

  • Liquid shows a clear scaling law in multimodal generation across model sizes from 0.5B to 32B parameters.

teaser
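A scaling trend like the one above is typically quantified by fitting a power law, loss ≈ a · N^(−b), to loss measurements at several model sizes N. The sketch below shows the fitting procedure only; the loss values are hypothetical placeholders, not Liquid's reported results.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of loss ≈ a * N**(-b) in log-log space.
    Returns (a, b). The data fed to it below are hypothetical
    placeholders, not Liquid's reported numbers."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return math.exp(my - slope * mx), -slope

# Hypothetical validation losses at 0.5B-32B parameters (placeholders):
sizes = [0.5e9, 1e9, 7e9, 32e9]
losses = [2.9, 2.7, 2.2, 1.9]
a, b = fit_power_law(sizes, losses)
print(f"loss ~= {a:.2f} * N^-{b:.3f}")   # b > 0: loss falls as N grows
```

A positive fitted exponent b confirms that loss decreases predictably with model size, which is what the figure's trend lines express.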

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you find this project useful, please consider citing:

@article{liquid,
  title={Liquid: Language Models are Scalable and Unified Multi-modal Generators},
  author={Wu, Junfeng and Jiang, Yi and Ma, Chuofan and Liu, Yuliang and Zhao, Hengshuang and Yuan, Zehuan and Bai, Song and Bai, Xiang},
  journal={arXiv preprint arXiv:2412.04332},
  year={2024}
}