A custom implementation of a GPT-like language model developed entirely from scratch. This project provides the fundamental building blocks for training and running inference on an autoregressive neural probabilistic language model.
- Custom Tokenization: Byte-Pair Encoding (BPE) implementation.
- Efficient Data Loading: Memory-efficient NumPy batch loaders that can read from very large memmap files.
- Neural Network Architecture: Custom Transformer blocks, Rotary Positional Embeddings (RoPE), and linear layers built with PyTorch.
- Optimizers: Custom AdamW optimizer with learning rate scheduling.
- Generators: Autoregressive text generation logic.
- Trainer: Training loops with experiment tracking (W&B integration).
- Configuration Management: Hydra-based configuration for easy parameter sweeping and experiment management.
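To make the tokenization feature above more concrete, here is a minimal sketch of byte-level BPE training and encoding. It is an illustration under stated assumptions, not the code in mew/tokenization/: the function names `train_bpe`, `merge_pair`, and `encode` are hypothetical, there is no pre-tokenization or special-token handling, and ties between equally frequent pairs are broken by byte order.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges from `text` (toy version, hypothetical API)."""
    # Start from raw bytes: every token is initially a single byte.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs in the current sequence.
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Pick the most frequent pair; break ties by byte order for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        if pair_counts[best] < 2:
            break  # nothing left that repeats, so further merges add no value
        merges.append(best)
        tokens = merge_pair(tokens, best)
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the learned merges in training order."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


if __name__ == "__main__":
    learned = train_bpe("low lower lowest low low", num_merges=10)
    print(learned)
    print(encode("lowest low", learned))
```

Because merges only concatenate adjacent byte strings, joining the encoded tokens always reproduces the original UTF-8 bytes, which is a handy sanity check for any BPE implementation.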
The repository is divided into two primary packages:
mew/ is the core engine behind the language model:
- mew/data_loaders/: Efficient batching and data loading logic (numpy_batch_loader).
- mew/generators/: Text generation utilities (conditional_generator).
- mew/nn/: Neural network architectures, modules, and layers (Transformers, RoPE).
- mew/optimizers/: Custom optimizers and schedulers (AdamW, LR scheduling).
- mew/tokenization/: BPE tokenizer and text processing tools.
- mew/trainers/: Implementations of the training loops (e.g., NPTTrainer).
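As an illustration of the kind of learning rate scheduling that can sit next to an AdamW optimizer, the sketch below computes a linear warmup followed by cosine decay. This is an assumption about a common schedule shape, not a description of what mew/optimizers/ implements; the function name `lr_at_step` and all of its parameters are hypothetical.

```python
import math


def lr_at_step(step, max_lr, min_lr, warmup_steps, total_steps):
    """Learning rate for `step`: linear warmup, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Linear ramp from 0 up to max_lr over the warmup phase.
        return max_lr * step / warmup_steps
    if step >= total_steps:
        # After the schedule ends, hold the floor value.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)


if __name__ == "__main__":
    for step in (0, 5, 10, 55, 100):
        lr = lr_at_step(step, max_lr=1e-3, min_lr=1e-4,
                        warmup_steps=10, total_steps=100)
        print(step, lr)
```

In a PyTorch training loop, a function like this would typically be evaluated once per step and the result written to each entry of `optimizer.param_groups` under the `"lr"` key before calling `optimizer.step()`.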
apps/ holds the high-level scripts and configurations:
- apps/cfgs/: Hydra configuration files (training.yaml, inference.yaml, tokenization.yaml).
- apps/launch_training.py: Entry point for launching model training.
- apps/tokenization.py: Entry point for running the data tokenization pipelines.
This project strictly uses uv for fast and reliable Python package management.
- Ensure you have uv installed.
- Install the project and its dependencies:
uv sync
You can run the application scripts using uv run.
Tokenization:
uv run apps/tokenization.py
Training:
uv run apps/launch_training.py
Note: The launch scripts use Hydra, so you can override configurations via the CLI (e.g., uv run apps/launch_training.py wandb.enable=True).
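To show what a dotted override such as wandb.enable=True does conceptually, here is a small pure-Python sketch that applies `key.path=value` strings to a nested dictionary. It is not Hydra's implementation or API: `apply_override` is a hypothetical helper, the `optimizer.lr` key is made up for illustration, and value parsing is deliberately simplified. In particular, Hydra has its own override grammar and rejects keys that are missing from the config unless they are prefixed with `+`, whereas this sketch silently creates them.

```python
import ast


def apply_override(config, override):
    """Apply one `dotted.key=value` override to a nested dict, in place."""
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Walk down the tree, creating intermediate dicts as needed
        # (a simplification; Hydra would require the `+` prefix here).
        node = node.setdefault(name, {})
    try:
        # Interpret Python literals such as True, 3e-4, or 128.
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        # Anything else is kept as a plain string.
        value = raw_value
    node[leaf] = value


if __name__ == "__main__":
    # Hypothetical defaults standing in for a loaded YAML config.
    cfg = {"wandb": {"enable": False}, "optimizer": {"lr": 1e-3}}
    for arg in ["wandb.enable=True", "optimizer.lr=3e-4"]:
        apply_override(cfg, arg)
    print(cfg)
```

The point of the sketch is only the mapping from a dotted path to a position in the config tree; the real parsing and validation rules are whatever Hydra applies when the launch script starts.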
- Formatting: Always format the code using black.
- Linting: Check for lint errors using flake8, but ignore the "line too long" error (E501).
uv run black mew/ apps/
uv run flake8 --ignore=E501 mew/ apps/
See AGENTS.md for more details regarding instructions for AI agents and code contributors.
