ExecuTorch
swipe-typing
gesture-typing
shape-writing
keyboard
on-device
ctc
mobile

FUTO Swipe

Mobile-oriented models for decoding swipe gestures into text.

Swipe decode of the word 'computer'

See the paper (coming soon) for more details.

Models

This repository contains 3 CNN models that compose together. Only the encoder is required. The decoder and language model are additional refinements, leveraging specific layout and language information. The encoder can decode for any keyboard layout, while the decoder is English/QWERTY-only and the language model is English-only.

Model Codename Role Params Size (fp32)
Encoder honorable_sturgeon Maps a swipe trajectory to per-timestep character emissions. Layout-agnostic β€” works on any keyboard supplied at runtime. 635 K 2.65 MB
Decoder magic_macaw (Optional) per-layout refinement over the encoder's frozen features. Lifts top-k where layout-specific training data exists (English here). 304 K 1.25 MB
Context LM hungry_jellyfish (Optional) next-word and beam-rerank language model that blends sentence context into candidate ranking. 1.5M 6.25 MB

Encoder β€” honorable_sturgeon

A 1D temporal convolutional network (TCN) reads the raw (x, y) touch trajectory and emits a 64-coefficient spectral pattern and a scalar "intention" gate for each timestep. Per-key character scores are read off by evaluating a fixed cosine (DCT) basis at the layout key centers. Switching layouts on device requires no retraining, just a different key-coordinate tensor.

Decoder (English/QWERTY) β€” magic_macaw

A small DFSMN decoder over the frozen encoder features. It refines the character distribution on specific layouts. Currently we only have data for training an English/QWERTY decoder.

Context LM (English) β€” hungry_jellyfish

A causal DFSMN language model with a hash embedding for large vocabularies. It can supply a rerank score and perform next-word prediction. This model can be used with or without a decoder.

Getting started

The example below demonstrates the encoder on CPU (x86) with the ExecuTorch runtime and greedy-decodes a swipe into characters. Note that greedy decoding is fairly inaccurate and should generally be improved by constraining to a lexicon (eg. trie, WFST).

import numpy as np
import torch
from huggingface_hub import hf_hub_download
from executorch.runtime import Runtime

# QWERTY letter centers in the normalized [0,1] keyboard frame. This is the
# only layout-specific input (can swap in another layout's key coordinates to
# decode a keyboard the encoder never saw at training time).
QWERTY = {
    "a": (0.10, 0.500), "b": (0.60, 0.833), "c": (0.40, 0.833), "d": (0.30, 0.500),
    "e": (0.25, 0.167), "f": (0.40, 0.500), "g": (0.50, 0.500), "h": (0.60, 0.500),
    "i": (0.75, 0.167), "j": (0.70, 0.500), "k": (0.80, 0.500), "l": (0.90, 0.500),
    "m": (0.80, 0.833), "n": (0.70, 0.833), "o": (0.85, 0.167), "p": (0.95, 0.167),
    "q": (0.05, 0.167), "r": (0.35, 0.167), "s": (0.20, 0.500), "t": (0.45, 0.167),
    "u": (0.65, 0.167), "v": (0.50, 0.833), "w": (0.15, 0.167), "x": (0.30, 0.833),
    "y": (0.55, 0.167), "z": (0.20, 0.833),
}
LETTERS = sorted(QWERTY)
MAX_KEYS = 64  # export-time padding bound

# A real swipe for the word "computer": normalized x, y and timestamps (ms).
PX = [0.4141, 0.4478, 0.5, 0.5741, 0.6599, 0.7256, 0.7744, 0.8098, 0.8485, 0.867,
      0.8737, 0.8653, 0.8418, 0.8182, 0.8098, 0.7963, 0.7946, 0.8081, 0.8418, 0.8704,
      0.9057, 0.9259, 0.9545, 0.9697, 0.968, 0.9529, 0.9141, 0.8468, 0.7811, 0.7273,
      0.6869, 0.6616, 0.6582, 0.6431, 0.6061, 0.5572, 0.5067, 0.4663, 0.4495, 0.4461,
      0.4411, 0.4192, 0.3872, 0.362, 0.3283, 0.2795, 0.2391, 0.2323, 0.2407, 0.2593,
      0.2879, 0.3249, 0.3468, 0.3569]
PY = [0.8991, 0.858, 0.7876, 0.6702, 0.5352, 0.4237, 0.3357, 0.2653, 0.1655, 0.142,
      0.142, 0.2183, 0.3709, 0.588, 0.7347, 0.8462, 0.8697, 0.811, 0.6115, 0.4707,
      0.3122, 0.2066, 0.1303, 0.1068, 0.1068, 0.1068, 0.1185, 0.1596, 0.1772, 0.1772,
      0.1772, 0.189, 0.189, 0.189, 0.1831, 0.189, 0.189, 0.189, 0.189, 0.189,
      0.1831, 0.1831, 0.1831, 0.1831, 0.1831, 0.1948, 0.189, 0.1948, 0.189, 0.189,
      0.189, 0.1831, 0.1831, 0.1831]
PT = [0.0, 100, 149, 197, 246, 297, 348, 399, 449, 498, 548, 598, 648, 698, 749, 799,
      849, 949, 999, 1047, 1100, 1152, 1197, 1248, 1314, 1364, 1414, 1465, 1515, 1565,
      1614, 1666, 1715, 1851, 1898, 1951, 1998, 2049, 2097, 2165, 2231, 2279, 2331,
      2382, 2431, 2481, 2532, 2584, 2649, 2700, 2751, 2798, 2848, 2899]


def resample(px, py, pt, T=64):
    """Resample a variable-length trajectory to T evenly-spaced points -> [2, T]."""
    x, y, t = map(np.asarray, (px, py, pt))
    t = t - t[0]
    if t[-1] > 1e-3:  # uniform 60 Hz resample, then to T points
        n60 = max(2, round(t[-1] / (1000.0 / 60.0)) + 1)
        tt = np.linspace(0.0, t[-1], n60)
        x, y = np.interp(tt, t, x), np.interp(tt, t, y)
    idx = np.linspace(0, len(x) - 1, T)
    rx = np.interp(idx, np.arange(len(x)), x)
    ry = np.interp(idx, np.arange(len(y)), y)
    return np.stack([rx, ry], axis=0).astype(np.float32)


def greedy_ctc(log_emissions):
    """Collapse the per-timestep argmax into a string (blank is the last class)."""
    blank = log_emissions.shape[-1] - 1
    out, prev = [], -1
    for c in log_emissions[0].argmax(axis=-1):
        if c != prev and c != blank and c < len(LETTERS):
            out.append(LETTERS[c])
        prev = c
    return "".join(out)


# Load the encoder .pte and run one forward pass.
pte = hf_hub_download("futo-org/futo-swipe", "honorable_sturgeon/model_fp32.pte")
encoder = Runtime.get().load_program(pte).load_method("forward")

features = torch.from_numpy(resample(PX, PY, PT)[None])     # [1, 2, 64]
keys = torch.zeros(1, MAX_KEYS, 2)                          # [1, 64, 2]
mask = torch.zeros(1, MAX_KEYS, dtype=torch.bool)           # [1, 64]
for i, ch in enumerate(LETTERS):
    keys[0, i] = torch.tensor(QWERTY[ch])
    mask[0, i] = True

log_emissions, coefficients, lambda_ = encoder.execute((features, keys, mask))
print("greedy decode:", greedy_ctc(log_emissions.numpy()))  # -> "computer"

Example output:

greedy decode: computer

Encoder inputs and outputs

Tensor Shape Meaning
input features [1, 2, 64] Swipe trajectory (x, y) resampled to 64 points
input layout_keys [1, 64, 2] Per-key (x, y) centers, padded to 64 keys
input layout_mask [1, 64] Boolean mask of valid keys
output log_emissions [1, 32, 65] Log-probabilities over 64 keys + blank
output coefficients [1, 32, 64] Spectral coefficients (decoder features)
output lambda [1, 32, 1] Intention gate (decoder features)

The output time dimension is 32, half the 64 input points. The encoder applies a 2Γ— temporal downsample (a stride-2 adapter) inside the network, so the 64 trajectory steps become 32 emission steps.

License

Released under the FUTO Model Weights License 1.0.

Downloads last month
25
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Datasets used to train futo-org/futo-swipe