FUTO Swipe

Demo | Release Announcement | Technical Overview | Technical Report

Mobile-oriented models for decoding swipe gestures into text.

[Figure: swipe decode of the word 'computer']

See the technical report for more details.

Models

This repository contains three CNN models that compose. Only the encoder is required; the decoder and language model are optional refinements that exploit layout-specific and language-specific information. The encoder works with any keyboard layout, whereas the decoder is English/QWERTY-only and the language model is English-only.

Model	Codename	Role	Params	Size (fp32)
Encoder	`honorable_sturgeon`	Maps a swipe trajectory to per-timestep character emissions. Layout-agnostic — works on any keyboard supplied at runtime.	635 K	2.65 MB
Decoder	`magic_macaw`	(Optional) per-layout refinement over the encoder's frozen features. Improves top-k accuracy where layout-specific training data exists (English here).	304 K	1.25 MB
Context LM	`hungry_jellyfish`	(Optional) next-word and beam-rerank language model that blends sentence context into candidate ranking.	1.5 M	6.25 MB

Encoder — `honorable_sturgeon`

A 1D temporal convolutional network (TCN) reads the raw (x, y) touch trajectory and emits a 64-coefficient spectral pattern and a scalar "intention" gate for each timestep. Per-key character scores are read off by evaluating a fixed cosine (DCT) basis at the layout key centers. Switching layouts on device requires no retraining, just a different key-coordinate tensor.
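As a rough illustration of reading per-key scores off a spectral pattern, the sketch below assumes the 64 coefficients form an 8×8 grid of 2D cosine modes over the normalized [0, 1] keyboard frame. That arrangement, the coefficient ordering, the function name `key_scores`, and the omission of the intention gate are all assumptions made for illustration; the technical report is the reference for the actual formulation.

```python
import math

GRID = 8  # assumption: 64 coefficients arranged as an 8 x 8 grid of cosine modes


def key_scores(coeffs, key_centers):
    """Evaluate a 2D cosine expansion at each key center.

    coeffs: 64 floats, row-major over (u, v) mode indices (assumed ordering).
    key_centers: dict mapping a key label to its (x, y) center in [0, 1].
    """
    scores = {}
    for label, (x, y) in key_centers.items():
        total = 0.0
        for u in range(GRID):
            for v in range(GRID):
                basis = math.cos(math.pi * u * x) * math.cos(math.pi * v * y)
                total += coeffs[u * GRID + v] * basis
        scores[label] = total
    return scores


# One timestep whose only non-zero coefficient is the (u=1, v=0) mode,
# cos(pi * x): it scores keys on the left high and keys on the right low.
coeffs = [0.0] * (GRID * GRID)
coeffs[1 * GRID + 0] = 1.0
layout = {"q": (0.05, 0.167), "t": (0.45, 0.167), "p": (0.95, 0.167)}
scores = key_scores(coeffs, layout)
```

Under this reading, the coefficients never mention specific keys, so scoring a different layout only means passing a different `key_centers` mapping.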

Decoder (English/QWERTY) — `magic_macaw`

A small DFSMN (deep feedforward sequential memory network) decoder that runs over the frozen encoder features and refines the character distribution for a specific layout. We currently have training data only for an English/QWERTY decoder.

Context LM (English) — `hungry_jellyfish`

A causal DFSMN language model with a hash embedding for large vocabularies. It can supply a rerank score and perform next-word prediction. This model can be used with or without a decoder.
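For readers unfamiliar with the idea, the following is a minimal sketch of a hashing-based embedding: each word is hashed to a few rows of a small shared table and the rows are summed, so the table size does not grow with the vocabulary. The table size, width, number of hashes, use of SHA-256, and all names here are assumptions for illustration; how `hungry_jellyfish` actually builds its hash embedding is not specified in this card.

```python
import hashlib
import random

NUM_BUCKETS = 1024  # assumed table size, far smaller than a large vocabulary
DIM = 8             # assumed embedding width
NUM_HASHES = 2      # assumed number of hash functions per word

# Stand-in for learned parameters: a trained model would learn these rows.
_rng = random.Random(0)
TABLE = [[_rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]


def bucket(word, seed):
    """Deterministically hash (seed, word) to a row index of TABLE."""
    digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % NUM_BUCKETS


def hash_embed(word):
    """Embed a word as the sum of the table rows chosen by each hash."""
    vec = [0.0] * DIM
    for seed in range(NUM_HASHES):
        row = TABLE[bucket(word, seed)]
        vec = [a + b for a, b in zip(vec, row)]
    return vec


vector = hash_embed("computer")
```

Using more than one hash reduces the chance that two words share every row and therefore end up with identical embeddings.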

Getting started

The example below runs the encoder on CPU (x86) with the ExecuTorch runtime and greedy-decodes a swipe into characters. Greedy decoding is fairly inaccurate; in practice, constrain decoding to a lexicon (e.g., with a trie or WFST).

```python
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from executorch.runtime import Runtime

# QWERTY letter centers in the normalized [0,1] keyboard frame. This is the
# only layout-specific input (can swap in another layout's key coordinates to
# decode a keyboard the encoder never saw at training time).
QWERTY = {
    "a": (0.10, 0.500), "b": (0.60, 0.833), "c": (0.40, 0.833), "d": (0.30, 0.500),
    "e": (0.25, 0.167), "f": (0.40, 0.500), "g": (0.50, 0.500), "h": (0.60, 0.500),
    "i": (0.75, 0.167), "j": (0.70, 0.500), "k": (0.80, 0.500), "l": (0.90, 0.500),
    "m": (0.80, 0.833), "n": (0.70, 0.833), "o": (0.85, 0.167), "p": (0.95, 0.167),
    "q": (0.05, 0.167), "r": (0.35, 0.167), "s": (0.20, 0.500), "t": (0.45, 0.167),
    "u": (0.65, 0.167), "v": (0.50, 0.833), "w": (0.15, 0.167), "x": (0.30, 0.833),
    "y": (0.55, 0.167), "z": (0.20, 0.833),
}
LETTERS = sorted(QWERTY)
MAX_KEYS = 64  # export-time padding bound

# A real swipe for the word "computer": normalized x, y and timestamps (ms).
PX = [0.4141, 0.4478, 0.5, 0.5741, 0.6599, 0.7256, 0.7744, 0.8098, 0.8485, 0.867,
      0.8737, 0.8653, 0.8418, 0.8182, 0.8098, 0.7963, 0.7946, 0.8081, 0.8418, 0.8704,
      0.9057, 0.9259, 0.9545, 0.9697, 0.968, 0.9529, 0.9141, 0.8468, 0.7811, 0.7273,
      0.6869, 0.6616, 0.6582, 0.6431, 0.6061, 0.5572, 0.5067, 0.4663, 0.4495, 0.4461,
      0.4411, 0.4192, 0.3872, 0.362, 0.3283, 0.2795, 0.2391, 0.2323, 0.2407, 0.2593,
      0.2879, 0.3249, 0.3468, 0.3569]
PY = [0.8991, 0.858, 0.7876, 0.6702, 0.5352, 0.4237, 0.3357, 0.2653, 0.1655, 0.142,
      0.142, 0.2183, 0.3709, 0.588, 0.7347, 0.8462, 0.8697, 0.811, 0.6115, 0.4707,
      0.3122, 0.2066, 0.1303, 0.1068, 0.1068, 0.1068, 0.1185, 0.1596, 0.1772, 0.1772,
      0.1772, 0.189, 0.189, 0.189, 0.1831, 0.189, 0.189, 0.189, 0.189, 0.189,
      0.1831, 0.1831, 0.1831, 0.1831, 0.1831, 0.1948, 0.189, 0.1948, 0.189, 0.189,
      0.189, 0.1831, 0.1831, 0.1831]
PT = [0.0, 100, 149, 197, 246, 297, 348, 399, 449, 498, 548, 598, 648, 698, 749, 799,
      849, 949, 999, 1047, 1100, 1152, 1197, 1248, 1314, 1364, 1414, 1465, 1515, 1565,
      1614, 1666, 1715, 1851, 1898, 1951, 1998, 2049, 2097, 2165, 2231, 2279, 2331,
      2382, 2431, 2481, 2532, 2584, 2649, 2700, 2751, 2798, 2848, 2899]


def resample(px, py, pt, T=64):
    """Resample a variable-length trajectory to T evenly-spaced points -> [2, T]."""
    x, y, t = map(np.asarray, (px, py, pt))
    t = t - t[0]
    if t[-1] > 1e-3:  # uniform 60 Hz resample, then to T points
        n60 = max(2, round(t[-1] / (1000.0 / 60.0)) + 1)
        tt = np.linspace(0.0, t[-1], n60)
        x, y = np.interp(tt, t, x), np.interp(tt, t, y)
    idx = np.linspace(0, len(x) - 1, T)
    rx = np.interp(idx, np.arange(len(x)), x)
    ry = np.interp(idx, np.arange(len(y)), y)
    return np.stack([rx, ry], axis=0).astype(np.float32)


def greedy_ctc(log_emissions):
    """Collapse the per-timestep argmax into a string (blank is the last class)."""
    blank = log_emissions.shape[-1] - 1
    out, prev = [], -1
    for c in log_emissions[0].argmax(axis=-1):
        if c != prev and c != blank and c < len(LETTERS):
            out.append(LETTERS[c])
        prev = c
    return "".join(out)


# Load the encoder .pte and run one forward pass.
pte = hf_hub_download("futo-org/futo-swipe", "honorable_sturgeon/model_fp32.pte")
encoder = Runtime.get().load_program(pte).load_method("forward")

features = torch.from_numpy(resample(PX, PY, PT)[None])     # [1, 2, 64]
keys = torch.zeros(1, MAX_KEYS, 2)                          # [1, 64, 2]
mask = torch.zeros(1, MAX_KEYS, dtype=torch.bool)           # [1, 64]
for i, ch in enumerate(LETTERS):
    keys[0, i] = torch.tensor(QWERTY[ch])
    mask[0, i] = True

log_emissions, coefficients, lambda_ = encoder.execute((features, keys, mask))
print("greedy decode:", greedy_ctc(log_emissions.numpy()))  # -> "computer"
```

Example output:

```
greedy decode: computer
```
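One simple way to constrain decoding to a lexicon is to score each candidate word by its CTC log-likelihood under the emissions and keep the best one. The sketch below does this by brute force over a word list with the standard CTC forward recursion. It is a stand-in for the idea only: it is not the trie-constrained beam search used by swipe-library, and the function names and the toy emissions are hypothetical.

```python
import math

NEG_INF = float("-inf")


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = (a, b) if a > b else (b, a)
    return hi + math.log1p(math.exp(lo - hi))


def ctc_log_likelihood(log_probs, target, blank):
    """log P(target | emissions) via the CTC forward recursion.

    log_probs: T rows of per-class log-probabilities.
    target: list of class indices (no blanks).
    """
    ext = [blank]
    for c in target:
        ext += [c, blank]  # blank, c1, blank, c2, ..., blank
    alpha = [NEG_INF] * len(ext)
    alpha[0] = log_probs[0][blank]
    if len(ext) > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [NEG_INF] * len(ext)
        for s, c in enumerate(ext):
            a = alpha[s]
            if s >= 1:
                a = logaddexp(a, alpha[s - 1])
            # Skipping the blank is allowed only between different symbols.
            if s >= 2 and c != blank and c != ext[s - 2]:
                a = logaddexp(a, alpha[s - 2])
            new[s] = a + log_probs[t][c]
        alpha = new
    return logaddexp(alpha[-1], alpha[-2]) if len(ext) > 1 else alpha[0]


def best_word(log_probs, lexicon, letters, blank):
    """Return the (word, score) pair with the highest CTC log-likelihood."""
    index = {ch: i for i, ch in enumerate(letters)}
    scored = [(w, ctc_log_likelihood(log_probs, [index[ch] for ch in w], blank))
              for w in lexicon]
    return max(scored, key=lambda pair: pair[1])


# Toy emissions over classes [a, b, blank] for three timesteps.
toy = [[math.log(p) for p in row]
       for row in ([0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.1, 0.8, 0.1])]
word, score = best_word(toy, ["ab", "ba", "a"], letters="ab", blank=2)
```

With the encoder output from the example above, `log_probs` would be `log_emissions[0]` converted to nested lists, `letters` would be `LETTERS`, and `blank` would be the last class index. Scoring every word separately is linear in lexicon size; a trie lets words with a common prefix share work, which is what makes large lexicons practical.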

Encoder inputs and outputs

Direction	Tensor	Shape	Meaning
input	`features`	`[1, 2, 64]`	Swipe trajectory `(x, y)` resampled to 64 points
input	`layout_keys`	`[1, 64, 2]`	Per-key `(x, y)` centers, padded to 64 keys
input	`layout_mask`	`[1, 64]`	Boolean mask of valid keys
output	`log_emissions`	`[1, 32, 65]`	Log-probabilities over 64 keys + blank
output	`coefficients`	`[1, 32, 64]`	Spectral coefficients (decoder features)
output	`lambda`	`[1, 32, 1]`	Intention gate (decoder features)

The output time dimension is 32, half the 64 input points. The encoder applies a 2× temporal downsample (a stride-2 adapter) inside the network, so the 64 trajectory steps become 32 emission steps.
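To decode a different keyboard, only `layout_keys` and `layout_mask` change. The helper below is a small sketch of building both for an arbitrary layout, padded to the 64-key bound from the table. The helper name `build_layout` and the three-key toy layout are hypothetical; the shapes follow the table above.

```python
MAX_KEYS = 64  # export-time padding bound, as in the example above


def build_layout(key_centers, max_keys=MAX_KEYS):
    """Return (labels, keys, mask) for a layout, padded to max_keys.

    key_centers: dict mapping a key label to its (x, y) center in [0, 1].
    keys has shape [1, max_keys, 2] and mask has shape [1, max_keys],
    both as nested lists with a leading batch dimension of 1.
    """
    labels = sorted(key_centers)
    if len(labels) > max_keys:
        raise ValueError(f"layout has {len(labels)} keys, limit is {max_keys}")
    keys = [[0.0, 0.0] for _ in range(max_keys)]
    mask = [False] * max_keys
    for i, label in enumerate(labels):
        x, y = key_centers[label]
        keys[i] = [float(x), float(y)]
        mask[i] = True
    return labels, [keys], [mask]


# Hypothetical three-key layout, for illustration only.
toy_layout = {"a": (0.2, 0.5), "b": (0.5, 0.5), "c": (0.8, 0.5)}
labels, layout_keys, layout_mask = build_layout(toy_layout)
```

Passing the nested lists through `torch.tensor(layout_keys)` and `torch.tensor(layout_mask)` yields a float32 `[1, 64, 2]` tensor and a boolean `[1, 64]` tensor. Class index `i` in `log_emissions` then corresponds to `labels[i]`, with the last class reserved for blank.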

C++ Library

We release an open-source C++ library, swipe-library, that handles the full pipeline: input processing, inference, and trie-constrained beam search.

License

Released under the FUTO Model Weights License 1.0.

Paper for futo-org/futo-swipe

FUTO Swipe: Layout-Agnostic Neural Swipe Decoding

Paper ID: 2606.25247