
Quick Note
Core Methodology?
- Hybrid Attention (Compressed Sparse Attention + Heavily Compressed Attention)
- Manifold-Constrained Hyper-Connections (mHC)
- Muon Optimizer
- Pretrain
- Post-Train
- Train domain-specific experts, then unify them via on-policy distillation (the unified model acts as the student: it samples its own outputs and is trained to minimize the reverse KL divergence against the expert teacher models)
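The on-policy reverse-KL objective in the last bullet can be sketched numerically. Everything below is an illustrative toy (the vocabulary size, logits, and single-step setup are my assumptions, not the paper's): the student samples tokens from its own distribution, and each sample is scored as log p_student - log p_teacher, whose expectation is KL(student || teacher).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

vocab = 8
positions = 5
# Hypothetical logits; in the real setup these come from the student
# and teacher models on the same sampled trajectory.
student_logits = rng.normal(size=(positions, vocab))
teacher_logits = rng.normal(size=(positions, vocab))

p_s = softmax(student_logits)
p_t = softmax(teacher_logits)

# On-policy: sample each token from the STUDENT's own distribution...
tokens = np.array([rng.choice(vocab, p=p) for p in p_s])

# ...then the per-token reverse-KL estimate is log p_s(x) - log p_t(x)
# for x ~ p_s; its expectation over samples is KL(p_s || p_t).
per_token = (np.log(p_s[np.arange(positions), tokens])
             - np.log(p_t[np.arange(positions), tokens]))
mc_loss = per_token.mean()

# Exact per-position reverse KL for comparison (always >= 0).
exact_kl = (p_s * (np.log(p_s) - np.log(p_t))).sum(axis=-1)
print(mc_loss, exact_kl.mean())
```

Note the direction: reverse KL is mode-seeking, which is why it pairs naturally with on-policy sampling from the student rather than forward KL on teacher samples.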
Motivation
Strengths
Core Architecture
Related Works
- <Main baseline>
- Other methods to solve the same problems
Relation to My Research
Main Research Gap