
Quick Note
Core Methodology?
- Hybrid Attention (Compressed Sparse Attention + Heavily Compressed Attention)
- Manifold-Constrained Hyper-Connections (mHC)
- Muon Optimizer
- Pretrain
- Post-Train
- Train domain-specific experts, then unify them via on-policy distillation (the unified model acts as the student: it samples its own outputs and is trained to minimize the reverse KL divergence against the expert teacher models)
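The on-policy reverse-KL objective in the last bullet can be sketched numerically. Everything below is an illustrative toy (the vocabulary size, logits, and single-step setup are my assumptions, not the paper's): the student samples tokens from its own distribution, and each sample is scored as log p_student - log p_teacher, whose expectation is KL(student || teacher).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

vocab = 8
positions = 5
# Hypothetical logits; in the real setup these come from the student
# and teacher models on the same sampled trajectory.
student_logits = rng.normal(size=(positions, vocab))
teacher_logits = rng.normal(size=(positions, vocab))

p_s = softmax(student_logits)
p_t = softmax(teacher_logits)

# On-policy: sample each token from the STUDENT's own distribution...
tokens = np.array([rng.choice(vocab, p=p) for p in p_s])

# ...then the per-token reverse-KL estimate is log p_s(x) - log p_t(x)
# for x ~ p_s; its expectation over samples is KL(p_s || p_t).
per_token = (np.log(p_s[np.arange(positions), tokens])
             - np.log(p_t[np.arange(positions), tokens]))
mc_loss = per_token.mean()

# Exact per-position reverse KL for comparison (always >= 0).
exact_kl = (p_s * (np.log(p_s) - np.log(p_t))).sum(axis=-1)
print(mc_loss, exact_kl.mean())
```

Note the direction: reverse KL is mode-seeking, which is why it pairs naturally with on-policy sampling from the student rather than forward KL on teacher samples.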
Motivation
Strengths
Core Architecture
Related Works
- <Main baseline>
- Other methods to solve the same problems
Relation to My Research
Main Research Gap