AbstractPhil 
posted an update 9 days ago
The first large-scale distillation is coming, using the geolip-aleph-void architecture as the mathematical aleph procrustes geofractal addressed language latent.

In short, a single geometric patchwork vocabulary chunk, which ironically needs chunking to prepare properly.

The address structure I have been meticulously refining is about to show its genuine distillation muscle.

This is largely due to the discovery and refinement of a specific logit I've named an aleph logit. This logit is baked clean into the architecture with the void-based codebook, and is available for review at https://github.com/AbstractEyes/geolip-svae/blob/main/geolip_svae/aleph_model.py

This model provides solid MSE, recon, cosine sim, and many other elements directly aligned to the SVD and H2 procrustes paradigm. Prelims are not smart, but the scaling principle is perfectly attuned to scale.
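For readers who want a concrete picture of the procrustes side of this, here is a minimal sketch of a standard SVD-based orthogonal Procrustes alignment, scored with MSE and cosine similarity. It is not the code from aleph_model.py. The names `source`, `target`, and `orthogonal_procrustes` are made up for illustration, and the data is random.

```python
import numpy as np


def orthogonal_procrustes(source, target):
    """Return the orthogonal matrix R minimizing ||source @ R - target||_F."""
    # SVD of the cross-covariance between the two point sets.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


rng = np.random.default_rng(0)

# Hypothetical latents: 64 points in 8 dimensions (illustrative sizes only).
source = rng.normal(size=(64, 8))

# Hide a random orthogonal map and try to recover it.
true_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
target = source @ true_rotation

R = orthogonal_procrustes(source, target)
aligned = source @ R

# The two alignment metrics mentioned in the post.
mse = float(np.mean((aligned - target) ** 2))
cosine = float(
    np.mean(
        np.sum(aligned * target, axis=1)
        / (np.linalg.norm(aligned, axis=1) * np.linalg.norm(target, axis=1))
    )
)

print(f"mse={mse:.3e} cosine={cosine:.6f}")
```

Because the target here is an exact rotation of the source, the recovered map should match it up to floating point error. Real latents will not align this cleanly.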

This invention will allow for direct internalized tokenization and utilization of compressed information, entirely within the model's latent structure. This allows direct control capabilities baked into the model itself, which requires a few robustness tests to solidify the full structure. The first validation tests run clean, so it will work when correctly aligned.

In short, the first step towards the geometric encoder system that will work with all tested data types.

The 9-experiment sweep is currently running on the first conversion from SDXL epsilon prediction to SDXL ODE flow matching.

Using the same formula as was used to train SD15-Flow-Lune, the predictions match identically and the format will be directly relational to the results as if SDXL was never touched by David.
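For anyone unfamiliar with the target objective, here is a minimal sketch of a flow matching setup on a straight-line (rectified-flow style) path, with Euler ODE sampling. The linear path is an assumption. The exact formula used for SD15-Flow-Lune is not given in this post, and all function names below are illustrative.

```python
def interpolate(x0, noise, t):
    """Point on the straight path: t=0 is data, t=1 is pure noise (assumed convention)."""
    return [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]


def velocity_target(x0, noise):
    """Regression target for a flow matching model on the straight path."""
    return [n - a for a, n in zip(x0, noise)]


def euler_sample(noise, velocity_fn, steps):
    """Integrate the ODE from t=1 (noise) down to t=0 (data) with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


# Toy two-dimensional "latent" and noise sample (made-up numbers).
x0 = [1.0, -2.0]
noise = [0.5, 0.25]

x_t = interpolate(x0, noise, 0.25)
v = velocity_target(x0, noise)

# With a perfect velocity prediction, Euler integration recovers the data.
sample = euler_sample(noise, lambda x, t: v, steps=4)
print(x_t, v, sample)
```

An epsilon-prediction model regresses the noise itself, while the flow model regresses the velocity along the path, so sampling becomes plain ODE integration.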

Yesterday's tests showed that I needed independent tests, so I began a 9-configuration sweep. With that, the trainer for the sweep was uploaded to the repo as well.

https://huggingface.co/AbstractPhil/geolip-sdxl-aleph

This experiment will prove without a doubt whether the alephs help direct tokenization distillation at the small size, or whether they help at a higher-fidelity scale. I've just prepared a new variant of geolip-aleph-transformer specifically to scale them up in a multiscale lensed upscale fashion similar to what David provides.

These conclusions will arrive together by this afternoon, and they decide which configuration is best to convert SDXL. The base run is already done: baseline clip_l and baseline clip_g with no alephs. Its results aren't promising compared to yesterday's, which showed explicit results by epoch 2, while today's tests without the alephs show invalid results by epoch 100.

As it stands the alephs are eons ahead, but the results today will determine the route.

With the first major experiment I release the notebooks: the massive amounts of information and pure empirical data accumulated to determine what alephs are, why they exist, how they help, how they hinder, and how I defeated all of the weaknesses over time through pure mathematics, determination, heavy-handed failures, minor successes, and an absolute ton of analysis.

I could never have done all of this in a lifetime without Claude.

https://huggingface.co/AbstractPhil/geolip-hypersphere-experiments/tree/main/aleph

Massive expansion to optimization happening today. I can't spend all these upcoming days training when optimization can happen now. My target today is a markedly improved speed, as well as enabling accelerate training for the upcoming heavy runpod expansion. Switching to 8 A40s will likely be a more reasonable use of cost and training effectiveness.

I advise whenever using qwen 3.5 to install fast path linear attention.

I am currently upscaling everything in my big diffusion pretrain dataset to start training some real structure.
If a couple of epochs of that data don't activate the model, I'll need to employ a David structure and attempt to teach global attention to a shared battery set.

Simultaneously, heavy experimentation on the geolip-aleph-void structure is underway, and potential offshoot objectives are being transcribed and curated. There are multiple prototypes based on known functional structures that have potential, and today's discoveries include a stable attention mechanism that can be curated further. This is based on an earlier experiment named cantor fractal routing.

https://github.com/AbstractEyes/geofractal/blob/main/src/geofractal/model/layers/attention/cantor_multiheaded_fusion_fp64_v2.py

This system was a badly optimized prototype that managed to stabilize deep-complexity fractal routes with low VRAM at the cost of time. The primary problem with it was that the time cost only pays off if you're training a massive model. Small models like the ones I usually train get no benefit, so it was mothballed.

The geolip aleph routed attention is a viable option to train a David, and it can in fact handle small models, but it needs much testing. As it stands, it does not benefit from the same large-model VRAM routing optimizations as the cantor fractal routing. This essentially means that it will OOM like traditional attention. However, because it's based on the aleph structure, it'll stabilize point clouds for Q and K, which when employed structurally can provide a cached V. I'm testing structural changes that will allow the structure to bind deterministic systems to K so KV caching can happen and Q can operate normally.
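As a reference point for the KV caching goal, here is a minimal sketch of conventional KV caching in ordinary softmax attention. It is not the aleph routed attention, and it contains no aleph structure. The `KVCache` class and the toy shapes are illustrative assumptions. It only shows the property being targeted: once K and V for past positions are fixed, they can be stored and reused while each new Q is processed as normal.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def causal_attention(q, k, v):
    """Full causal attention over a whole sequence at once."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v


class KVCache:
    """Stores past keys and values so each step only handles one new query."""

    def __init__(self):
        self.k = None
        self.v = None

    def step(self, q_t, k_t, v_t):
        # Append the new key/value row to the cache.
        self.k = k_t[None] if self.k is None else np.vstack([self.k, k_t[None]])
        self.v = v_t[None] if self.v is None else np.vstack([self.v, v_t[None]])
        # Attend from the single new query to everything cached so far.
        scores = self.k @ q_t / np.sqrt(q_t.shape[-1])
        return softmax(scores) @ self.v


rng = np.random.default_rng(0)
q = rng.normal(size=(5, 4))
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 4))

full = causal_attention(q, k, v)

cache = KVCache()
stepped = np.stack([cache.step(q[t], k[t], v[t]) for t in range(5)])

print(np.allclose(full, stepped))
```

The cached path gives the same outputs as full causal attention because the keys and values for earlier positions never change. Any scheme that makes K deterministic has to preserve that property for the cache to stay valid.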

With the aleph routed attention worked out, I'll be able to provide an actual backbone to SDXL instead of just a partial one through tokens. This will allow the model to directly differentiate tokens through gated learning and attention anchoring, which in theory could enable surge training through procrustes. They are essentially different towers though, so until after the experiments I'm still uncertain whether the effect will transcribe or be topical.
