Why is the inference time longer?
Why does inference take longer with this model than with the original model on the same dataset? Also, how can the model be converted to the TF architecture?
How about TensorRT Edge-LLM? Have you tried it?
I have two systems:
1. System 1 has 16 GB of VRAM.
2. System 2 has 12 GB of VRAM.
I want to test on system 2 by saving R1 in 4-bit, because 12 GB is not enough to load the original model.
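Some back-of-the-envelope arithmetic (my own estimate, not from the repo) shows why the bf16 checkpoint cannot fit in 12 GB of VRAM while a 4-bit one can:

```python
# Rough VRAM needed for the 10B-parameter weights alone
# (activations, KV cache, and quantization overhead come on top).
params = 10e9

bf16_gb = params * 2 / 1024**3    # bf16: 2 bytes per parameter
int4_gb = params * 0.5 / 1024**3  # 4-bit: 0.5 bytes per parameter

print(f"bf16 weights: {bf16_gb:.1f} GB")   # ~18.6 GB -> exceeds 12 GB
print(f"4-bit weights: {int4_gb:.1f} GB")  # ~4.7 GB -> fits in 12 GB
```

So even before activations, the bf16 weights alone overflow a 12 GB card, which is why quantizing to 4-bit first is the practical route for system 2.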
import torch
from transformers import BitsAndBytesConfig
from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1

# Load in 4-bit via bitsandbytes (assumes AlpamayoR1.from_pretrained
# accepts a Hugging Face-style quantization_config; the original
# snippet had load_in_4bit commented out, so it actually saved bf16
# weights into the "-4bit" directory).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AlpamayoR1.from_pretrained(
    "./Alpamayo-R1-10B",
    device_map="auto",
    quantization_config=quant_config,
)

save_dir = "./Alpamayo-R1-10B-4bit"
model.save_pretrained(save_dir)
"Thanks very much, but I'm having trouble following your idea. Could you clarify what you mean? Specifically, how does the model transform (the data / the input)?"