| license: apache-2.0 | |
| datasets: | |
| - moca-embed/dclm_20b | |
| - openbmb/UltraChat | |
| language: | |
| - en | |
| library_name: transformers | |
| # Model Card for Lulu245M-Mobile | |
| <!-- Provide a quick summary of what the model is/does. --> | |
| ## Model Details | |
| - 245M parameters | |
| - 4 Layers | |
| - D_size 1280 | |
| - 16 MoE | |
| - 8 KV | |
| - FP32, 2.3 GB, ONNX export | |
| Trained on only 20B tokens of web text data. | |
| Fine-tuned on 80K of UltraChat, no LoRA or similar tricks. | |
| ### Model Description | |
| # Lulu Local Android Demo | |
| **Lulu Local** is an offline Android AI demo by **Open Machine**. | |
| This release runs a local Lulu language model directly on an Android phone using **ONNX Runtime CPU inference**. | |
| No cloud. | |
| No server. | |
| No GPU. | |
| No NPU. | |
| No internet required after install. | |
| Runs on the Samsung A25 5G. | |
| This is a raw early proof that a custom local model can run directly on consumer Android hardware. | |
| For the record, this is a completely unoptimized model with heavy Python loops, exported as a pure 2.3 GB FP32 ONNX file. It currently runs on the CPU only; we haven't touched the NPU, Vulkan, or anything else yet. | |
| Generation currently takes about three minutes (a full forward pass over a 128-token context; as mentioned, it is unoptimized). The APK file is available here, and GitHub repositories for the ONNX model and the Android app will follow. Again, no custom runtimes: just the standard ONNX format loaded straight into Android memory. | |
| This runs on the phone's Exynos chip. After we chatted with it for 10 minutes, the battery level had not moved and the phone had not heated up. | |
| We completed everything in the last two days: training, benchmarks, fine-tuning, and the ONNX runtime, all for less than €1000. | |
| Why this is interesting | |
| Most mobile LLM demos rely on one or more of the following: | |
| - heavily quantized models | |
| - GPU acceleration | |
| - NPU acceleration | |
| - server-side inference | |
| - vendor SDKs | |
| - cloud APIs | |
| This demo is intentionally simple and direct: | |
| Android app | |
| + ONNX Runtime | |
| + local tokenizer | |
| + local ONNX model | |
| + CPU only | |
| The current model is not small, not heavily optimized, and not using mobile accelerator tricks. | |
| That is the point of the demo. | |
| Model architecture note | |
| The Android build uses a stateful single-token step ONNX export. | |
| The runtime loop is: | |
| token_id + position + cache tensors | |
| → ONNX step model | |
| → logits + updated cache tensors | |
| → sample next token | |
| → repeat | |
| This replaced the earlier full-sequence ONNX path, which was much slower and used much more memory during generation. | |
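The runtime loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the app's actual code: `step_fn`, `toy_step`, `sample_token`, and `generate` are hypothetical names, and `toy_step` is a stand-in for the real ONNX step model so the loop can run without the 2.3 GB file. The sketch also assumes the prompt is fed one token at a time through the same step model to fill the cache.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical step signature: (token_id, pos, cache) -> (logits, updated_cache).
# In the real app this would wrap one ONNX Runtime call on the step model.
StepFn = Callable[[int, int, Optional[list]], Tuple[Sequence[float], list]]


def sample_token(logits: Sequence[float], temperature: float, rng: random.Random) -> int:
    """Greedy pick when temperature <= 0, otherwise softmax sampling."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp((x - peak) / temperature) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(step_fn: StepFn, prompt_ids: Sequence[int], max_new_tokens: int,
             max_context: int = 128, temperature: float = 0.0, seed: int = 0) -> List[int]:
    """Run the stateful single-token loop: step, sample, feed back, repeat."""
    if not prompt_ids:
        raise ValueError("prompt_ids must contain at least one token")
    if len(prompt_ids) > max_context:
        raise ValueError("prompt is longer than the context window")
    rng = random.Random(seed)
    cache = None  # the real model would start from empty K/V cache tensors
    logits: Sequence[float] = ()
    pos = 0
    # Assumption: the prompt goes through the same step model, one token per call.
    for token_id in prompt_ids:
        logits, cache = step_fn(token_id, pos, cache)
        pos += 1
    generated: List[int] = []
    while len(generated) < max_new_tokens and pos < max_context:
        token_id = sample_token(logits, temperature, rng)
        generated.append(token_id)
        logits, cache = step_fn(token_id, pos, cache)
        pos += 1
    return generated


def toy_step(token_id: int, pos: int, cache: Optional[list]):
    """Stand-in for the ONNX step model: always favours (token_id + 1) % 8."""
    cache = (cache or []) + [token_id]
    logits = [0.0] * 8
    logits[(token_id + 1) % 8] = 1.0
    return logits, cache


print(generate(toy_step, [1, 2], max_new_tokens=3))  # [3, 4, 5]
```

Because each call carries the cache forward, the cost per new token stays roughly constant instead of growing with the full sequence, which is the motivation given for moving away from the full-sequence path.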
| Current ONNX interface: | |
| Inputs: | |
| - token_id: [1, 1] int64 | |
| - pos: [1] int64 | |
| - k_0, v_0 ... k_23, v_23 | |
| Outputs: | |
| - logits: [1, 32000] float32 | |
| - out_k_0, out_v_0 ... out_k_23, out_v_23 | |
| Cache shape per K/V tensor: | |
| [1, 16, 128, 80] | |
| Total runtime cache is about 31 MB. | |
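The cache figure can be checked with a little arithmetic from the interface above. The sketch below assumes the cache tensors are float32 (4 bytes per element), which matches the FP32 export but is not stated explicitly for the cache; `cache_input_names` and `cache_bytes` are hypothetical helper names.

```python
import math

NUM_CACHE_PAIRS = 24            # k_0/v_0 ... k_23/v_23, from the interface above
CACHE_SHAPE = (1, 16, 128, 80)  # shape of each K or V tensor
BYTES_PER_ELEMENT = 4           # assumption: cache tensors are float32


def cache_input_names(num_pairs: int = NUM_CACHE_PAIRS) -> list:
    """Return the cache input names in order: k_0, v_0, k_1, v_1, ..."""
    names = []
    for i in range(num_pairs):
        names += [f"k_{i}", f"v_{i}"]
    return names


def cache_bytes(num_pairs: int = NUM_CACHE_PAIRS,
                shape: tuple = CACHE_SHAPE,
                bytes_per_element: int = BYTES_PER_ELEMENT) -> int:
    """Total size of all K and V tensors in bytes."""
    return 2 * num_pairs * math.prod(shape) * bytes_per_element


total = cache_bytes()
print(len(cache_input_names()))  # 48
print(total)                     # 31457280
print(round(total / 1e6, 1))     # 31.5 (decimal MB)
print(total / 2**20)             # 30.0 (MiB)
```

Under that assumption the 48 tensors come to about 31.5 MB, in line with the figure quoted above.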
| - **Developed by:** The Open Machine | |
| - **Model type:** The Open Machine Transformers Version | |
| - **Language(s) (NLP):** English | |
| - **License:** Apache 2.0 | |
| ### Model Sources [optional] | |
| <!-- Provide the basic links for the model. --> | |
| - **Repository:** [Will be provided in the upcoming days] | |
| - **Paper [optional]:** [Coming Soon] | |
| - **Demo [optional]:** [More Information Needed] | |
| ## Uses | |
| Demo highlights | |
| - Fully offline Android assistant | |
| - Runs on mobile CPU only | |
| - Stateful single-token ONNX generation | |
| - Live token streaming UI | |
| - Battery / RAM / speed display | |
| - Cool / Turbo mode | |
|   - Cool: 2 CPU threads | |
|   - Turbo: 4 CPU threads | |
| - No GPU acceleration | |
| - No NPU acceleration | |
| - No network calls required for inference | |
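The Cool / Turbo modes listed above boil down to a thread-count setting. A minimal sketch of that mapping follows; `THREADS_BY_MODE` and `threads_for_mode` are hypothetical names, and how the app actually applies the value is an assumption. With ONNX Runtime, a thread count like this is typically set through the session options (for example `SessionOptions.intra_op_num_threads` in the Python API) before the session is created.

```python
# Thread counts taken from the mode list above.
THREADS_BY_MODE = {"cool": 2, "turbo": 4}


def threads_for_mode(mode: str) -> int:
    """Map a UI mode name to the number of CPU threads to request."""
    key = mode.strip().lower()
    if key not in THREADS_BY_MODE:
        raise ValueError(f"unknown mode: {mode!r}")
    return THREADS_BY_MODE[key]


print(threads_for_mode("Cool"))   # 2
print(threads_for_mode("Turbo"))  # 4
```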
| Tested device | |
| Early demo testing was done on a Samsung A25-class Android phone. | |
| Observed behavior: | |
| - Model loads locally from app storage | |
| - Generation works fully offline | |
| - CPU-only generation is slow but usable for demo purposes | |
| - Observed speed is around 0.20 tok/s, depending on temperature, prompt length, and thread mode | |
| This is not yet optimized. | |
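For a sense of scale, the observed rate translates into wall-clock time as follows. This is plain arithmetic on the 0.20 tok/s figure above; `generation_seconds` is a hypothetical helper, and the estimate counts token steps only, ignoring model load time.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Estimate wall-clock seconds to run num_tokens single-token steps."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# At about 0.20 tok/s, 36 steps take roughly three minutes,
# and a full 128-token context takes a bit under eleven minutes.
print(generation_seconds(36, 0.20) / 60)
print(generation_seconds(128, 0.20) / 60)
```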
| Install | |
| Download the APK: | |
| LuluLocal-Android-CPU-fp32.apk | |
| On Android: | |
| - Open the APK file. | |
| - Allow install from unknown sources if Android asks. | |
| - Install. | |
| - Open Lulu. | |
| - Wait for the model to load. | |
| - Ask a question. | |
| First load may take longer because the app prepares the local ONNX model. | |
| ### Direct Use | |
| <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. --> | |
| Privacy | |
| Inference is local. | |
| The demo is designed so prompts are processed on-device. | |
| No cloud inference is required. | |
| If you build or modify the app, review the source code and Android permissions yourself. | |
| ### Out-of-Scope Use | |
| Important warning | |
| This is an experimental local AI demo. | |
| The model may: | |
| - hallucinate | |
| - answer incorrectly | |
| - repeat itself | |
| - generate incomplete text | |
| - be slow on low-end hardware | |
| - consume significant battery and RAM | |
| Do not use this for medical, legal, financial, emergency, or safety-critical decisions. | |
| ## Bias, Risks, and Limitations | |
| <!-- This section is meant to convey both technical and sociotechnical limitations. --> | |
| Current limitations | |
| - CPU only | |
| - fp32 ONNX model is large | |
| - no NPU backend yet | |
| - no GPU/Vulkan backend yet | |
| - no quantization yet | |
| - context length currently limited | |
| - APK size is large | |
| - generation quality is still experimental | |
| ## Model Card Authors [optional] | |
| Credits | |
| Built by Open Machine. | |
| Lulu is an experimental local AI assistant project focused on running useful AI directly on personal devices. | |
| ## Model Card Contact | |
| Open Machine | |
| info@theopenmachine.com | |