I'm trying to load an LLM (model weights) to GPU memory and use it multiple times. Here's the code I'm working on:

import {
  ChatModelResponse,
  ChatUserMessage,
  Llama,
  LlamaChat,
  LlamaContext,
  LlamaModel,
  getLlama,
} from 'node-llama-cpp'

/*
type Message = {
  role: "user" | "assistant";
  content: string;
}
*/
import { Message } from '../types'

// Object to keep reference to everything
type LlamaState = {
  llama: Llama | null
  currentlyLoadedModel: LlamaModel | null
  context: LlamaContext | null
  llamaChat: LlamaChat | null
}

export const llamaCppState: LlamaState = {
  llama: null,
  currentlyLoadedModel: null,
  context: null,
  llamaChat: null,
}

/**
 * Load a LLM to GPU memory
 */
export const loadAndInitializeModel = async () => {
  const llama = await getLlama()

  console.log('Loading model...')
  const startTime = performance.now()

  const model = await llama.loadModel({
    modelPath:
      '/Users/mikko/.node-llama-cpp/models/hf_bartowski_Llama-3.3-70B-Instruct-Q8_0-00001-of-00002.gguf',
  })

  const duration = Math.round(performance.now() - startTime)
  console.log(`Model loaded! Loading took ${duration} ms`)

  const context = await model.createContext()
  const llamaChat = new LlamaChat({
    contextSequence: context.getSequence(),
  })

  // keep a reference to everything
  llamaCppState.llama = llama
  llamaCppState.currentlyLoadedModel = model
  llamaCppState.context = context
  llamaCppState.llamaChat = llamaChat
}

/**
 * Generate a response using the loaded model
 */
export const respond =
  () =>
  async (
    messages: Array<Message>,
    systemPrompt: string,
    onTextChunk: (messagePart: string) => void,
    temperature: number
  ): Promise<Message> => {
    const lastMessage = messages[messages.length - 1]
    if (!lastMessage) {
      throw new Error('Tried to prompt LlamaCpp without messages')
    }

    const llamaChat = llamaCppState.llamaChat
    if (!llamaChat) {
      throw new Error('Model is not loaded')
    }

    const chatHistory = llamaChat.chatWrapper.generateInitialChatHistory()
    chatHistory.push({ type: 'system', text: systemPrompt })

    messages.forEach(({ content, role }) => {
      if (role === 'user') {
        const item: ChatUserMessage = { text: content, type: 'user' }
        chatHistory.push(item)
      } else {
        const item: ChatModelResponse = { response: [content], type: 'model' }
        chatHistory.push(item)
      }
    })

    const message = await llamaChat.generateResponse(chatHistory, {
      onTextChunk,
      temperature,
    })

    return { role: 'assistant', content: message.response }
  }

Basically I would expect loadAndInitializeModel to load the model into GPU memory once, so that it stays there as I call respond repeatedly. Instead, it appears to me as if the model is only really being loaded when I'm generating, and it gets trashed/freed from memory after each response.
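For reference, this is roughly how I intend to call these two functions (a simplified sketch; the import path, messages, and handler here are just placeholders):

import { loadAndInitializeModel, respond } from './llamaCpp' // illustrative path
import { Message } from '../types'

const main = async () => {
  // Load the model once at startup; this should be the only expensive step.
  await loadAndInitializeModel()

  const generate = respond()
  const systemPrompt = 'You are a helpful assistant.'
  const onTextChunk = (chunk: string) => process.stdout.write(chunk)

  // The first generation reuses the already-loaded model...
  const history: Array<Message> = [{ role: 'user', content: 'Hello!' }]
  const firstReply = await generate(history, systemPrompt, onTextChunk, 0.7)

  // ...and so does every following generation, without reloading anything.
  history.push(firstReply, { role: 'user', content: 'Tell me a joke.' })
  await generate(history, systemPrompt, onTextChunk, 0.7)
}

main()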
Replies: 1 comment
When you load a model, useMmap is enabled by default if your system supports it.

mmap (memory-mapped file) allows mapping a file from the disk into virtual memory managed by the OS, so the OS can load and unload the file from memory as it sees fit. It also allows the system to skip caching large regions of memory to the disk, since it can use the file instead, which makes everything more efficient and smooth.
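For example, here is a minimal sketch of opting out of mmap inside loadAndInitializeModel from the question, assuming useMmap is accepted as a loadModel option in the version you're using (verify against the documented options), so the whole file is read up front rather than mapped lazily:

const model = await llama.loadModel({
  modelPath:
    '/Users/mikko/.node-llama-cpp/models/hf_bartowski_Llama-3.3-70B-Instruct-Q8_0-00001-of-00002.gguf',
  // Assumption: useMmap is the loadModel option referred to above; with it disabled,
  // the model file is read into memory during loadModel instead of being mapped lazily.
  useMmap: false,
})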
However, it also means that the file might finish loading (or even start, depending on what the OS decides) only when it's used for the first time, which is why the loading of the model is very fast, but the first response begins with a delay.
Most of the memory consumption you see …

To make the response start sooner, I recommend either calling …

I'll make sure to document …
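A generic way to move that first-response delay to load time is to run a tiny throwaway generation right after loading, which forces a full forward pass and therefore makes the OS read the memory-mapped weights into RAM. This is only a warm-up sketch reusing the state object from the question; the maxTokens option passed to generateResponse is an assumption, so check it against the version you use:

export const warmUpModel = async () => {
  const llamaChat = llamaCppState.llamaChat
  if (!llamaChat) {
    throw new Error('Model is not loaded')
  }

  // A one-token generation is enough to run a full forward pass over the model,
  // which touches the mapped weights and pages them in.
  const chatHistory = llamaChat.chatWrapper.generateInitialChatHistory()
  chatHistory.push({ type: 'user', text: 'Hi' })

  await llamaChat.generateResponse(chatHistory, {
    maxTokens: 1, // assumed option; limits the warm-up to a single generated token
  })
}

Calling warmUpModel() at the end of loadAndInitializeModel would shift the delay to startup instead of the first chat request.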