I'm trying to load an LLM (model weights) to GPU memory and use it multiple times. Here's the code I'm working on:

import {
  ChatModelResponse,
  ChatUserMessage,
  Llama,
  LlamaChat,
  LlamaContext,
  LlamaModel,
  getLlama,
} from 'node-llama-cpp'

/*
type Message = {
  role: "user" | "assistant";
  content: string;
}
*/
import { Message } from '../types'

// Object to keep reference to everything
type LlamaState = {
  llama: Llama | null
  currentlyLoadedModel: LlamaModel | null
  context: LlamaContext | null
  llamaChat: LlamaChat | null
}

export const llamaCppState: LlamaState = {
  llama: null,
  currentlyLoadedModel: null,
  context: null,
  llamaChat: null,
}

/**
 * Load a LLM to GPU memory
 */
export const loadAndInitializeModel = async () => {
  const llama = await getLlama()

  console.log('Loading model...')
  const startTime = performance.now()

  const model = await llama.loadModel({
    modelPath:
      '/Users/mikko/.node-llama-cpp/models/hf_bartowski_Llama-3.3-70B-Instruct-Q8_0-00001-of-00002.gguf',
  })

  const duration = Math.round(performance.now() - startTime)
  console.log(`Model loaded! Loading took ${duration} ms`)

  const context = await model.createContext()
  const llamaChat = new LlamaChat({
    contextSequence: context.getSequence(),
  })

  // keep a reference to everything
  llamaCppState.llama = llama
  llamaCppState.currentlyLoadedModel = model
  llamaCppState.context = context
  llamaCppState.llamaChat = llamaChat
}

/**
 * Generate a response using the loaded model
 */
export const respond =
  () =>
  async (
    messages: Array<Message>,
    systemPrompt: string,
    onTextChunk: (messagePart: string) => void,
    temperature: number
  ): Promise<Message> => {
    const lastMessage = messages[messages.length - 1]
    if (!lastMessage) {
      throw new Error('Tried to prompt LlamaCpp without messages')
    }

    const llamaChat = llamaCppState.llamaChat
    if (!llamaChat) {
      throw new Error('Model is not loaded')
    }

    const chatHistory = llamaChat.chatWrapper.generateInitialChatHistory()
    chatHistory.push({ type: 'system', text: systemPrompt })

    messages.forEach(({ content, role }) => {
      if (role === 'user') {
        const item: ChatUserMessage = { text: content, type: 'user' }
        chatHistory.push(item)
      } else {
        const item: ChatModelResponse = { response: [content], type: 'model' }
        chatHistory.push(item)
      }
    })

    const message = await llamaChat.generateResponse(chatHistory, {
      onTextChunk,
      temperature,
    })

    return { role: 'assistant', content: message.response }
  }

Basically I would expect loadAndInitializeModel to load the model into GPU memory once, so that it stays there as I call respond repeatedly. Instead, it appears to me as if the model is only really being loaded when I'm generating, and it gets trashed/freed from memory after each response.
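For reference, this is roughly how I intend to call these two functions (a simplified sketch; the import path, messages, and handler here are just placeholders):

import { loadAndInitializeModel, respond } from './llamaCpp' // illustrative path
import { Message } from '../types'

const main = async () => {
  // Load the model once at startup; this should be the only expensive step.
  await loadAndInitializeModel()

  const generate = respond()
  const systemPrompt = 'You are a helpful assistant.'
  const onTextChunk = (chunk: string) => process.stdout.write(chunk)

  // The first generation reuses the already-loaded model...
  const history: Array<Message> = [{ role: 'user', content: 'Hello!' }]
  const firstReply = await generate(history, systemPrompt, onTextChunk, 0.7)

  // ...and so does every following generation, without reloading anything.
  history.push(firstReply, { role: 'user', content: 'Tell me a joke.' })
  await generate(history, systemPrompt, onTextChunk, 0.7)
}

main()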
Replies: 1 comment
When you load a model, useMmap is enabled by default if your system supports it.

mmap (memory-mapped file) allows mapping a file from the disk into virtual memory managed by the OS, so the OS can load and unload the file from memory as it sees fit. It also allows the system to skip caching large regions of memory to the disk, since it can use the file instead, which makes everything more efficient and smooth.
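For example, here is a minimal sketch of opting out of mmap inside loadAndInitializeModel from the question, assuming useMmap is accepted as a loadModel option in the version you're using (verify against the documented options), so the whole file is read up front rather than mapped lazily:

const model = await llama.loadModel({
  modelPath:
    '/Users/mikko/.node-llama-cpp/models/hf_bartowski_Llama-3.3-70B-Instruct-Q8_0-00001-of-00002.gguf',
  // Assumption: useMmap is the loadModel option referred to above; with it disabled,
  // the model file is read into memory during loadModel instead of being mapped lazily.
  useMmap: false,
})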
However, it also means that the file might finish loading (or even start, depending on what the OS decides) only when it's used for the first time, which is why the loading of the model is very fast, but the first response begins with a delay.
Most of the memory consumption you see …

To make the response start sooner, I recommend either calling …

I'll make sure to document …
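A generic way to move that first-response delay to load time is to run a tiny throwaway generation right after loading, which forces a full forward pass and therefore makes the OS read the memory-mapped weights into RAM. This is only a warm-up sketch reusing the state object from the question; the maxTokens option passed to generateResponse is an assumption, so check it against the version you use:

export const warmUpModel = async () => {
  const llamaChat = llamaCppState.llamaChat
  if (!llamaChat) {
    throw new Error('Model is not loaded')
  }

  // A one-token generation is enough to run a full forward pass over the model,
  // which touches the mapped weights and pages them in.
  const chatHistory = llamaChat.chatWrapper.generateInitialChatHistory()
  chatHistory.push({ type: 'user', text: 'Hi' })

  await llamaChat.generateResponse(chatHistory, {
    maxTokens: 1, // assumed option; limits the warm-up to a single generated token
  })
}

Calling warmUpModel() at the end of loadAndInitializeModel would shift the delay to startup instead of the first chat request.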