A new approach to running LLM inference and training on GPU/NPU backends, implemented in C++ and compiled for high performance and ease of use