ProgramerSalar / fineTune_dit_checkpoint

	---
	license: apache-2.0
	library_name: diffusers
	tags:
	- text-to-video
	- dit
	- diffusion-transformer
	- education
	- zulense
	---

	# 🧠 DiT (Diffusion Transformer) Fine-Tuning Experiments

	Core Backbone for the [Zulense Z1 Foundation Model](https://huggingface.co/zulense/z1)

	This repository hosts the Diffusion Transformer (DiT) checkpoints trained to generate educational video content. These models operate in the latent space of our [Causal VAE](https://huggingface.co/ProgramerSalar/causal_vae_checkpoint) and are responsible for the temporal consistency and logical flow of the generated math lectures.

	## 📂 Model Ledger & Performance

	We are releasing the training logs to demonstrate the optimization curve of the "Imagination Engine."

	### 1. `finetune_2_pytorch_model.bin` (🌟 Production Candidate)
	* Role: The Z1 Foundation Backbone
	* Status: ✅ Converged / High Fidelity
	* Performance:
	    * This checkpoint represents our stable run. It successfully learned to align temporal attention with the "teacher's movement" and "blackboard writing" logic.
	    * Metrics: Achieved target validation loss on the Class 5 & 8 Math dataset.
	    * Behavior: Shows strong temporal coherence (objects don't disappear randomly) and adheres to the physics of writing on a board.
	* Recommendation: Use this file for all inference tasks related to Zulense Z1.

	### 2. `finetune_1_pytorch_model.bin` (Experimental / Deprecated)
	* Role: Initial Warmup Run
	* Status: ⚠️ Underfitted / High Noise
	* Performance:
	    * This was an early checkpoint where the model struggled to decouple the background (classroom) from the foreground (teacher).
	    * Issues: Resulted in "flickering" artifacts and poor text alignment.
	* Archived: Kept here for research comparison to show the impact of our improved data scheduling in `finetune_2`.
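
For the research comparison mentioned above, a quick structural check of the two checkpoints can be a useful first step before comparing outputs. The sketch below is a hypothetical helper, not part of this repository: `diff_checkpoints` and the key names in the usage note are illustrative assumptions. It only relies on each value exposing a `.shape` attribute, as PyTorch tensors do.

```python
from typing import Any, Mapping


def diff_checkpoints(old: Mapping[str, Any], new: Mapping[str, Any]) -> dict:
    """Compare two state-dict-like mappings by key and tensor shape.

    Hypothetical helper: it inspects structure only, not weight values.
    """
    old_keys, new_keys = set(old), set(new)
    shared = old_keys & new_keys
    # A key counts as "reshaped" when both checkpoints contain it
    # but the stored shapes differ.
    reshaped = sorted(
        key
        for key in shared
        if tuple(getattr(old[key], "shape", ()))
        != tuple(getattr(new[key], "shape", ()))
    )
    return {
        "only_in_old": sorted(old_keys - new_keys),
        "only_in_new": sorted(new_keys - old_keys),
        "reshaped": reshaped,
        "shared": len(shared),
    }
```

With PyTorch installed, the two arguments would be the dictionaries returned by `torch.load(...)` for `finetune_1_pytorch_model.bin` and `finetune_2_pytorch_model.bin`. If both runs used the same architecture, all three lists should come back empty, which confirms that any quality difference comes from training rather than from a structural change.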

	## 🏗️ Architecture Context

	The Zulense Video Pipeline follows a two-stage generation process:
	1. Stage 1 (VAE): Compresses video into latents (See: `causal_vae_checkpoint`).
	2. Stage 2 (DiT): This model (`finetune_2`) acts as the denoising backbone, predicting the latent patches over time based on text prompts (e.g., "Draw a triangle with 3 angles").
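
To make the hand-off between the two stages concrete, the sketch below works out how many latent patches (tokens) the DiT would see for a given clip. Every number in it is an assumption for illustration: the temporal stride, spatial stride, and patch size are hypothetical defaults, not values taken from the Zulense configuration, and the "first frame encoded on its own" rule is a common causal-VAE convention that may not match this VAE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentShape:
    """Hypothetical compression settings for a two-stage video pipeline."""

    temporal_stride: int = 4  # assumed: frames merged per latent frame
    spatial_stride: int = 8   # assumed: pixels per latent cell, per axis
    patch_size: int = 2       # assumed: latent cells per DiT patch, per axis

    def latent_grid(self, frames: int, height: int, width: int) -> tuple:
        # Assumed causal-VAE convention: the first frame maps to its own
        # latent frame, and the remaining frames are grouped by the stride.
        latent_frames = 1 + (frames - 1) // self.temporal_stride
        return (
            latent_frames,
            height // self.spatial_stride,
            width // self.spatial_stride,
        )

    def token_count(self, frames: int, height: int, width: int) -> int:
        # Stage 2 sees one token per spatial patch per latent frame.
        t, h, w = self.latent_grid(frames, height, width)
        return t * (h // self.patch_size) * (w // self.patch_size)
```

Under these assumed settings, a 17-frame 256×256 clip becomes a 5×32×32 latent grid and 1280 tokens, which shows why the attention cost in Stage 2 depends so heavily on how aggressively Stage 1 compresses.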

	## 💻 Usage (Loading Weights)

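	```python
	import torch

	# Path to the best-performing checkpoint
	model_path = "finetune_2_pytorch_model.bin"

	# Load the raw weights onto the CPU. This returns a state dict only;
	# the DiT model class must be built separately before load_state_dict().
	# weights_only=True (PyTorch 1.13+) restricts unpickling to tensors and plain containers.
	state_dict = torch.load(model_path, map_location="cpu", weights_only=True)

	print(f"✅ Loaded DiT Backbone: {model_path}")
	print(f"Tensor keys found: {len(state_dict)}")
	```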