
Model Card for OpenMachineAI/Lulu245M-Mobile

Model Details

  • 245M parameters
  • 24 layers (the ONNX export exposes cache tensors k_0 … v_23)
  • Hidden size (d_model) 1280
  • 16 MoE experts
  • 8 KV heads
  • FP32, 2.3 GB ONNX export

Trained on only 20B tokens of web text data.

Fine-tuned on 80K UltraChat examples, with no LoRA or similar tricks.

Model Description

Lulu Local Android Demo

Lulu Local is an offline Android AI demo by Open Machine.

This release runs a local Lulu language model directly on an Android phone using ONNX Runtime CPU inference.

No cloud. No server. No GPU. No NPU. No internet required after install.

Runs on the Samsung A25 5G.

This is a raw early proof that a custom local model can run directly on consumer Android hardware.

For the record, this is a literally unoptimized model: heavy Python loops, a pure ONNX export of 2.3 GB in FP32. It currently runs on the CPU; we haven't touched the NPU, Vulkan, or anything else yet. Generation takes about three minutes (a full forward pass over the 128-token context, as mentioned, it's unoptimized). The APK file is here, and GitHub repositories for the ONNX model and the Android app will follow.

Again, no custom runtimes: just the standard ONNX format loaded straight into Android memory, running on your Exynos. Worth noting: after we chatted with it for 10 minutes, the battery level didn't move and the phone didn't heat up. We completed everything in the last two days, training, benchmarks, fine-tuning, and the ONNX runtime, all for less than €1000.

Why this is interesting

Most mobile LLM demos rely on one or more of the following:

  • heavily quantized models
  • GPU acceleration
  • NPU acceleration
  • server-side inference
  • vendor SDKs
  • cloud APIs

This demo is intentionally simple and direct:

  • Android app
  • ONNX Runtime
  • local tokenizer
  • local ONNX model
  • CPU only

The current model is not small, not heavily optimized, and not using mobile accelerator tricks. That is the point of the demo.

Model architecture note

The Android build uses a stateful single-token step ONNX export.

The runtime loop is:

token_id + position + cache tensors → ONNX step model → logits + updated cache tensors → sample next token → repeat

This replaced the earlier full-sequence ONNX path, which was much slower and used much more memory during generation.

Current ONNX interface:

Inputs:

  • token_id: [1, 1] int64
  • pos: [1] int64
  • k_0, v_0 ... k_23, v_23

Outputs:

  • logits: [1, 32000] float32
  • out_k_0, out_v_0 ... out_k_23, out_v_23

Cache shape per K/V tensor:

[1, 16, 128, 80]

Total runtime cache is about 31 MB (48 tensors × 16 × 128 × 80 floats × 4 bytes ≈ 31.5 MB).
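To make the loop concrete, here is a minimal sketch of the single-token decode step against this interface, using the ONNX Runtime Java API from Kotlin. The tensor names and shapes match the interface above; the zero-initialised cache, greedy argmax sampling, and the resource-handling style are assumptions, not the app's actual code (the demo exposes temperature, for example).

```kotlin
import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession
import java.nio.FloatBuffer
import java.nio.LongBuffer

const val LAYERS = 24                          // k_0..k_23 / v_0..v_23
val CACHE_SHAPE = longArrayOf(1, 16, 128, 80)  // per K/V tensor
const val CACHE_FLOATS = 16 * 128 * 80

fun generate(
    env: OrtEnvironment,
    session: OrtSession,
    promptIds: LongArray,
    maxNewTokens: Int
): LongArray {
    // Zero-initialised KV cache: one K and one V tensor per layer (an assumption).
    val cache = HashMap<String, OnnxTensor>()
    for (l in 0 until LAYERS) {
        cache["k_$l"] = OnnxTensor.createTensor(env, FloatBuffer.allocate(CACHE_FLOATS), CACHE_SHAPE)
        cache["v_$l"] = OnnxTensor.createTensor(env, FloatBuffer.allocate(CACHE_FLOATS), CACHE_SHAPE)
    }

    val out = mutableListOf<Long>()
    var token = promptIds[0]
    var pos = 0L

    while (out.size < maxNewTokens) {
        // token_id: [1, 1] int64, pos: [1] int64, plus the current cache tensors.
        val tokenTensor = OnnxTensor.createTensor(env, LongBuffer.wrap(longArrayOf(token)), longArrayOf(1, 1))
        val posTensor = OnnxTensor.createTensor(env, LongBuffer.wrap(longArrayOf(pos)), longArrayOf(1))
        val inputs = HashMap<String, OnnxTensor>(cache)
        inputs["token_id"] = tokenTensor
        inputs["pos"] = posTensor

        val result = session.run(inputs)

        // Copy the updated cache tensors out before closing the result.
        for (l in 0 until LAYERS) {
            for (name in listOf("k_$l", "v_$l")) {
                val updated = result.get("out_$name").get() as OnnxTensor
                cache[name]?.close()
                cache[name] = OnnxTensor.createTensor(env, updated.floatBuffer, CACHE_SHAPE)
            }
        }

        // logits: [1, 32000] float32 -> greedy argmax (an assumption;
        // the demo also exposes temperature-based sampling).
        @Suppress("UNCHECKED_CAST")
        val logits = (result.get("logits").get().value as Array<FloatArray>)[0]
        val next = logits.indices.maxByOrNull { logits[it] }!!.toLong()

        result.close()
        tokenTensor.close()
        posTensor.close()

        pos++
        token = if (pos < promptIds.size) {
            promptIds[pos.toInt()]  // still prefilling the prompt
        } else {
            out.add(next)           // decoding: feed the sampled token back in
            next
        }
    }
    cache.values.forEach { it.close() }
    return out.toLongArray()
}
```

Re-creating the cache tensors each step copies about 31 MB, but that is still far cheaper than the earlier full-sequence path, which re-ran the whole forward pass over every position for each generated token.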

  • Developed by: The Open Machine
  • Model type: The Open Machine Transformers version
  • Language(s) (NLP): English
  • License: Apache 2.0

Model Sources

  • Repository: will be provided in the coming days
  • Paper: coming soon
  • Demo: more information needed

Uses

Demo highlights:

  • Fully offline Android assistant
  • Runs on mobile CPU only
  • Stateful single-token ONNX generation
  • Live token streaming UI
  • Battery / RAM / speed display
  • Cool / Turbo mode (Cool: 2 CPU threads, Turbo: 4 CPU threads; see the sketch below)
  • No GPU acceleration
  • No NPU acceleration
  • No network calls required for inference
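The Cool / Turbo modes map naturally onto ONNX Runtime's CPU thread count. A minimal sketch, assuming the app uses setIntraOpNumThreads (the standard thread knob for CPU-only sessions); the function and parameter names are illustrative:

```kotlin
// Sketch: Cool (2 threads) vs. Turbo (4 threads) as ONNX Runtime options.
// Whether the app configures threads exactly this way is an assumption.
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession

fun openSession(modelPath: String, turbo: Boolean): OrtSession {
    val opts = OrtSession.SessionOptions()
    opts.setIntraOpNumThreads(if (turbo) 4 else 2)  // Turbo: 4, Cool: 2
    return OrtEnvironment.getEnvironment().createSession(modelPath, opts)
}
```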

Tested device

Early demo testing was done on a Samsung A25-class Android phone.

Observed behavior:

  • Model loads locally from app storage
  • Generation works fully offline
  • CPU-only generation is slow but usable for demo purposes
  • Example speed observed: around 0.20 tok/s, depending on temperature, prompt length, and thread mode

This is not yet optimized.

Install

Download the APK:

LuluLocal-Android-CPU-fp32.apk

On Android:

  • Open the APK file.
  • Allow install from unknown sources if Android asks.
  • Install.
  • Open Lulu.
  • Wait for the model to load.
  • Ask a question.

First load may take longer because the app prepares the local ONNX model.
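One common pattern that explains a slower first load is copying the bundled model out of the APK into app storage so ONNX Runtime can open it by file path. A sketch of that preparation step; whether the demo ships the model this way, and the asset file name, are assumptions:

```kotlin
// Sketch: first-launch preparation. Copy the bundled ONNX model from the
// APK assets into app storage so ONNX Runtime can open it by path.
import android.content.Context
import java.io.File

fun prepareModel(context: Context): File {
    val target = File(context.filesDir, "lulu_step_fp32.onnx")  // hypothetical name
    if (!target.exists()) {
        context.assets.open("lulu_step_fp32.onnx").use { input ->
            target.outputStream().use { output -> input.copyTo(output) }
        }
    }
    return target
}
```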

Direct Use

Privacy

Inference is local.

The demo is designed so prompts are processed on-device. No cloud inference is required.

If you build or modify the app, review the source code and Android permissions yourself.

Out-of-Scope Use

Important warning

This is an experimental local AI demo.

The model may:

  • hallucinate
  • answer incorrectly
  • repeat itself
  • generate incomplete text
  • be slow on low-end hardware
  • consume significant battery and RAM

Do not use this for medical, legal, financial, emergency, or safety-critical decisions.

Bias, Risks, and Limitations

Current limitations:

  • CPU only
  • fp32 ONNX model is large
  • no NPU backend yet
  • no GPU/Vulkan backend yet
  • no quantization yet
  • context length currently limited
  • APK size is large
  • generation quality is still experimental

Model Card Authors

Credits

Built by Open Machine.

Lulu is an experimental local AI assistant project focused on running useful AI directly on personal devices.

Model Card Contact

Open Machine info@theopenmachine.com
