Awesome-Foundation-Models

A foundation model is a large-scale pretrained model (e.g., BERT, DALL-E, GPT-3) that can be adapted to a wide range of downstream applications. This term was first popularized by the Stanford Institute for Human-Centered Artificial Intelligence. This repository maintains a curated list of foundation models for vision and language tasks. Research papers without code are not included.

Survey

2024

Language Agents (from Shunyu Yao's PhD thesis at Princeton. Blog1, Blog2)
A Systematic Survey on Large Language Models for Algorithm Design (from City Univ. of Hong Kong)
Image Segmentation in Foundation Model Era: A Survey (from Beijing Institute of Technology)
Towards Vision-Language Geo-Foundation Model: A Survey (from Nanyang Technological University)
An Introduction to Vision-Language Modeling (from Meta)
The Evolution of Multimodal Model Architectures (from Purdue University)
Efficient Multimodal Large Language Models: A Survey (from Tencent)
Foundation Models for Video Understanding: A Survey (from Aalborg University)
Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond (from GigaAI)
Prospective Role of Foundation Models in Advancing Autonomous Vehicles (from Tongji University)
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey (from Northeastern University)
A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models (from Lehigh)
Large Multimodal Agents: A Survey (from CUHK)
The Uncanny Valley: A Comprehensive Analysis of Diffusion Models (from Mila)
Real-World Robot Applications of Foundation Models: A Review (from University of Tokyo)
From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities (from Shanghai AI Lab)
Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey (from JHU)

Before 2024

Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision (from SDSU)
Multimodal Foundation Models: From Specialists to General-Purpose Assistants (from Microsoft)
Towards Generalist Foundation Model for Radiology (from SJTU)
Foundational Models Defining a New Era in Vision: A Survey and Outlook (from MBZ University of AI)
Towards Generalist Biomedical AI (from Google)
A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models (from Oxford)
Large Multimodal Models: Notes on CVPR 2023 Tutorial (from Chunyuan Li, Microsoft)
A Survey on Multimodal Large Language Models (from USTC and Tencent)
Vision-Language Models for Vision Tasks: A Survey (from Nanyang Technological University)
Foundation Models for Generalist Medical Artificial Intelligence (from Stanford)
A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT
A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT
Vision-language pre-training: Basics, recent advances, and future trends
On the Opportunities and Risks of Foundation Models (this survey first popularized the concept of foundation models; from Stanford)

Papers by Date

Papers by Topic

Large Language/Multimodal Models

LLaVA: Visual Instruction Tuning (from University of Wisconsin-Madison)
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models (from KAUST)
GPT-4 Technical Report (from OpenAI)
GPT-3: Language Models are Few-Shot Learners (175B parameters; enables in-context learning, in contrast to GPT-2; from OpenAI)
GPT-2: Language Models are Unsupervised Multitask Learners (1.5B parameters; from OpenAI)
GPT: Improving Language Understanding by Generative Pre-Training (from OpenAI)
LLaMA 2: Open Foundation and Fine-Tuned Chat Models (from Meta)
LLaMA: Open and Efficient Foundation Language Models (models ranging from 7B to 65B parameters; from Meta)
T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (from Google)

Linear Attention

Large Benchmarks

OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding (large-scale annotated video benchmark for ophthalmic surgery. from Monash, 2024)
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI (from Shanghai AI Lab, 2024)
BLINK: Multimodal Large Language Models Can See but Not Perceive (multimodal benchmark. from University of Pennsylvania, 2024)
CAD-Estate: Large-scale CAD Model Annotation in RGB Videos (RGB videos with CAD annotation. from Google, 2023)
ImageNet: A Large-Scale Hierarchical Image Database (vision benchmark. from Stanford, 2009)

Vision-Language Pretraining

FLIP: Scaling Language-Image Pre-training via Masking (from Meta)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (proposes a generic and efficient VLP strategy based on off-the-shelf frozen vision and language models. from Salesforce Research)
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (from Salesforce Research)
SLIP: Self-supervision meets Language-Image Pre-training (ECCV, from UC Berkeley and Meta)
GLIP: Grounded Language-Image Pre-training (CVPR, from UCLA and Microsoft)
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (PMLR, from Google)
RegionCLIP: Region-Based Language-Image Pretraining
CLIP: Learning Transferable Visual Models From Natural Language Supervision (from OpenAI)

Perception Tasks: Detection, Segmentation, and Pose Estimation

SAM 2: Segment Anything in Images and Videos (from Meta)
FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects (from NVIDIA)
SEEM: Segment Everything Everywhere All at Once (from University of Wisconsin-Madison, HKUST, and Microsoft)
SAM: Segment Anything (the first foundation model for image segmentation; from Meta)
SegGPT: Segmenting Everything In Context (from BAAI, ZJU, and PKU)

Training Efficiency

Green AI (introduces the concept of Red AI vs Green AI)
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (the lottery ticket hypothesis, from MIT)

Towards Artificial General Intelligence (AGI)

Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models (from Huawei)

AI Safety and Responsibility

Bounding the probability of harm from an AI to create a guardrail (blog from Yoshua Bengio)
Managing Extreme AI Risks amid Rapid Progress (from Science, May 2024)
