Model Card for Lulu Local
Model Details
- 245M parameters
- 4 layers
- Hidden size 1280
- 16 MoE experts
- 8 KV heads
- FP32 ONNX export: 2.3 GB
Trained on only 20B tokens of web text data.
Fine-tuned on 80K UltraChat examples, with no LoRA or similar tricks.
Model Description
Lulu Local Android Demo
Lulu Local is an offline Android AI demo by Open Machine.
This release runs a local Lulu language model directly on an Android phone using ONNX Runtime CPU inference.
No cloud. No server. No GPU. No NPU. No internet required after install.
Runs on the Samsung A25 5G.
This is a raw early proof that a custom local model can run directly on consumer Android hardware.
For the record, this is a literally un-optimized model: heavy Python loops remain in the pipeline, and the export is a pure 2.3 GB FP32 ONNX file. Inference currently runs on the CPU; we haven't touched the NPU, Vulkan, or anything else yet. Generation currently takes about three minutes (a full forward pass at 128 context length, unoptimized as mentioned). The APK file is here, with GitHub repositories for the ONNX model and the Android app to follow. Again, no custom runtimes: just the standard ONNX format loaded straight into Android memory. This runs on your Exynos, and after we chatted for 10 minutes the battery level didn't move and no heating occurred. We completed everything, training, benchmarks, fine-tuning, and the ONNX runtime integration, in the last two days for less than €1000.
Why this is interesting
Most mobile LLM demos rely on one or more of the following:
- heavily quantized models
- GPU acceleration
- NPU acceleration
- server-side inference
- vendor SDKs
- cloud APIs
This demo is intentionally simple and direct:
- Android app
- ONNX Runtime
- local tokenizer
- local ONNX model
- CPU only
The current model is not small, not heavily optimized, and not using mobile accelerator tricks. That is the point of the demo.
Model architecture note
The Android build uses a stateful single-token step ONNX export.
The runtime loop is:
token_id + position + cache tensors → ONNX step model → logits + updated cache tensors → sample next token → repeat
This replaced the earlier full-sequence ONNX path, which was much slower and used much more memory during generation.
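The step loop above can be sketched as follows. This is a minimal illustration, not the app's actual code: the step model is replaced by a stubbed `run_step` function standing in for one ONNX Runtime `session.run()` call, and sampling is simplified to greedy argmax.

```python
import numpy as np

NUM_LAYERS = 24                  # k_0..k_23 / v_0..v_23 in the export
CACHE_SHAPE = (1, 16, 128, 80)   # per-tensor cache shape from the interface
VOCAB = 32000

def run_step(token_id, pos, cache):
    """Stand-in for one ONNX step-model call.
    The real app feeds token_id, pos, and all k_i/v_i tensors,
    and reads back logits plus the updated out_k_i/out_v_i tensors."""
    logits = np.zeros((1, VOCAB), dtype=np.float32)
    logits[0, (token_id * 7 + pos) % VOCAB] = 1.0  # deterministic toy logits
    return logits, cache

def generate(prompt_ids, max_new_tokens):
    # zero-initialized KV cache, one k and one v tensor per layer
    cache = {f"{kv}_{i}": np.zeros(CACHE_SHAPE, np.float32)
             for i in range(NUM_LAYERS) for kv in ("k", "v")}
    ids = list(prompt_ids)
    pos = 0
    # prefill: feed the prompt one token at a time through the step model
    for tok in prompt_ids:
        logits, cache = run_step(tok, pos, cache)
        pos += 1
    # decode: one forward pass per new token, greedy sampling
    for _ in range(max_new_tokens):
        next_tok = int(np.argmax(logits))
        ids.append(next_tok)
        logits, cache = run_step(next_tok, pos, cache)
        pos += 1
    return ids

out = generate([1, 2, 3], max_new_tokens=4)
print(len(out))  # 7 = 3 prompt tokens + 4 generated tokens
```

The stateful structure is the point: each iteration does one cheap single-token pass and carries the cache forward, rather than re-running the whole sequence.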
Current ONNX interface:
Inputs:
- token_id: [1, 1] int64
- pos: [1] int64
- k_0, v_0 ... k_23, v_23
Outputs:
- logits: [1, 32000] float32
- out_k_0, out_v_0 ... out_k_23, out_v_23
Cache shape per K/V tensor:
[1, 16, 128, 80]
Total runtime cache is about 31 MB.
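The ~31 MB figure follows directly from the tensor list above: 24 layers, each with one k and one v tensor of shape [1, 16, 128, 80], in fp32.

```python
layers = 24                          # k_0..k_23 / v_0..v_23
tensors_per_layer = 2                # one k and one v tensor
elems = 1 * 16 * 128 * 80            # elements per cache tensor
bytes_total = layers * tensors_per_layer * elems * 4  # fp32 = 4 bytes
print(bytes_total / 1e6)             # ≈ 31.5 MB
```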
- Developed by: The Open Machine
- Model type: The Open Machine Transformers version
- Language(s) (NLP): English
- License: Apache 2.0
Model Sources [optional]
- Repository: Will be provided in the coming days
- Paper [optional]: Coming soon
- Demo [optional]: [More Information Needed]
Uses
Demo highlights
- Fully offline Android assistant
- Runs on mobile CPU only
- Stateful single-token ONNX generation
- Live token streaming UI
- Battery / RAM / speed display
- Cool / Turbo mode (Cool: 2 CPU threads, Turbo: 4 CPU threads)
- No GPU acceleration
- No NPU acceleration
- No network calls required for inference
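The Cool/Turbo toggle maps onto ONNX Runtime's intra-op thread count. A sketch using the Python API for illustration (the Android app would use the equivalent Java/Kotlin session options; the model filename here is hypothetical):

```python
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 2   # Cool mode; Turbo mode would set 4
# session = ort.InferenceSession("lulu_step.onnx", sess_options=opts)
```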
Tested device
Early demo testing was done on a Samsung A25-class Android phone.
Observed behavior:
- Model loads locally from app storage
- Generation works fully offline
- CPU-only generation is slow but usable for demo purposes
- Example speed observed: around 0.20 tok/s, depending on temperature, prompt length, and thread mode
This is not yet optimized.
Install
Download the APK:
LuluLocal-Android-CPU-fp32.apk
On Android:
- Open the APK file.
- Allow install from unknown sources if Android asks.
- Install.
- Open Lulu.
- Wait for the model to load.
- Ask a question.
First load may take longer because the app prepares the local ONNX model.
Direct Use
Privacy
Inference is local.
The demo is designed so prompts are processed on-device. No cloud inference is required.
If you build or modify the app, review the source code and Android permissions yourself.
Out-of-Scope Use
Important warning
This is an experimental local AI demo.
The model may:
- hallucinate
- answer incorrectly
- repeat itself
- generate incomplete text
- be slow on low-end hardware
- consume significant battery and RAM
Do not use this for medical, legal, financial, emergency, or safety-critical decisions.
Bias, Risks, and Limitations
Current limitations
- CPU only
- fp32 ONNX model is large
- no NPU backend yet
- no GPU/Vulkan backend yet
- no quantization yet
- context length currently limited
- APK size is large
- generation quality is still experimental
Model Card Authors [optional]
Credits
Built by Open Machine.
Lulu is an experimental local AI assistant project focused on running useful AI directly on personal devices.
Model Card Contact
Open Machine info@theopenmachine.com