mew

A custom implementation of a GPT-like language model developed entirely from scratch. This project provides the fundamental building blocks for training and running inference on an autoregressive neural probabilistic language model.

Features

Custom Tokenization: Byte-Pair Encoding (BPE) implementation.
Efficient Data Loading: NumPy batch loaders that read from large memory-mapped (memmap) files without loading them fully into memory.
Neural Network Architecture: Custom Transformer blocks, Rotary Positional Embeddings (RoPE), and linear layers built with PyTorch.
Optimizers: Custom AdamW optimizer with learning rate scheduling.
Generators: Autoregressive text generation logic.
Trainer: Training loops with experiment tracking (W&B integration).
Configuration Management: Hydra-based configuration for easy parameter sweeping and experiment management.
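
To make the tokenization feature concrete, here is a minimal sketch of byte-level BPE training in plain Python. It is an illustration only and is not the code in mew/tokenization/: the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) and the choice to start from raw UTF-8 bytes with new ids from 256 upward are assumptions made for this example.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most frequent adjacent pair of token ids, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` (scanning left to right) with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step  # ids 0..255 are reserved for single bytes
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


if __name__ == "__main__":
    ids, merges = train_bpe("aaabdaaabac", num_merges=3)
    print(ids, merges)
```

Each merge shortens the sequence by replacing the most common adjacent pair with a single new token, which is the core idea a production tokenizer then builds on (pre-tokenization, special tokens, and faster pair counting).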

Project Structure

The repository is divided into two primary packages:

1. `@mew/` (Core Library)

The core engine behind the language model:

mew/data_loaders/: Efficient batching and data loading logic (numpy_batch_loader).
mew/generators/: Text generation utilities (conditional_generator).
mew/nn/: Neural network architectures, modules, and layers (Transformers, RoPE).
mew/optimizers/: Custom optimizers and schedulers (AdamW, LR scheduling).
mew/tokenization/: BPE tokenizer and text processing tools.
mew/trainers/: Implementations of the training loops (e.g., NPTTrainer).
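
As an illustration of what the RoPE component in mew/nn/ is for, the sketch below applies a rotary positional embedding to a single vector in plain Python. It is not the PyTorch implementation in this repository: the function name `rope_rotate`, the interleaved pairing of dimensions, and the base of 10000 are assumptions chosen for the example.

```python
import math


def rope_rotate(x, position, base=10000.0):
    """Rotate each consecutive pair (x[i], x[i+1]) by an angle that grows with `position`."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # Pair i // 2 uses frequency base ** (-i / d); later pairs rotate more slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.1]
    # Both pairs of positions are 2 apart, so the two scores should match.
    print(dot(rope_rotate(q, 5), rope_rotate(k, 3)))
    print(dot(rope_rotate(q, 12), rope_rotate(k, 10)))
```

The property worth noticing is that the score between a rotated query and a rotated key depends only on the difference of their positions, which is why RoPE encodes relative position inside attention.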

2. `@apps/` (Application Layer)

High-level scripts and configurations:

apps/cfgs/: Hydra configuration files (training.yaml, inference.yaml, tokenization.yaml).
apps/launch_training.py: Entry point for launching model training.
apps/tokenization.py: Entry point for running the data tokenization pipelines.

Setup and Installation

This project strictly uses uv for fast and reliable Python package management.

Ensure you have uv installed.
Install the project and its dependencies:
```
uv sync
```

Usage

Run the application scripts with `uv run`.

Tokenization:

uv run apps/tokenization.py

Training:

uv run apps/launch_training.py

Note: The launch scripts use Hydra, so you can override configuration values from the CLI (e.g., `uv run apps/launch_training.py wandb.enable=True`).
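
Conceptually, a dotted override such as `wandb.enable=True` addresses one leaf of a nested configuration. The sketch below shows that mapping on a plain dict. It is a simplified illustration and not Hydra's parser or API: `parse_value` and `apply_override` are hypothetical helpers, and the `project` key in the sample config is made up for the example.

```python
def parse_value(raw):
    """Interpret a raw override string as bool, int, or float, falling back to str."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_override(config, override):
    """Apply one 'dotted.key=value' override to a nested dict, in place."""
    key, sep, raw = override.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {override!r}")
    *parents, leaf = key.split(".")
    node = config
    for name in parents:
        # Create intermediate sections on demand so new keys can be set.
        node = node.setdefault(name, {})
    node[leaf] = parse_value(raw)
    return config


if __name__ == "__main__":
    # Hypothetical config; only wandb.enable appears in this README.
    cfg = {"wandb": {"enable": False, "project": "demo"}}
    apply_override(cfg, "wandb.enable=True")
    print(cfg)
```

Only the addressed leaf changes; sibling keys in the same section keep the values from the YAML files in apps/cfgs/.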

Development Guidelines

Formatting: Always format the code using black.
Linting: Check for lint errors using flake8, but ignore the "line too long" error (E501).

uv run black mew/ apps/
uv run flake8 --ignore=E501 mew/ apps/

See AGENTS.md for more details regarding instructions for AI agents and code contributors.
