How Minimax-01 Achieves 1M Token Context…
I've dug into the internals of an MIT-licensed MoE system that uses Linear Attention (Lightning Attention) to extend its context length to 1M input tokens.
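To set the stage, here is a minimal sketch of the core linear-attention idea that makes such long contexts feasible (this is my own illustrative toy code, not MiniMax's actual Lightning Attention kernel): by swapping softmax for a feature map, the attention product can be regrouped so that no n x n score matrix is ever materialized, dropping the cost from quadratic to linear in sequence length.

```python
import torch

def softmax_attention(q, k, v):
    # Standard attention: builds an (n x n) score matrix, so cost grows quadratically with n.
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Linear attention replaces softmax with a feature map phi (here: elu + 1), then
    # uses associativity: (phi(Q) @ phi(K)^T) @ V == phi(Q) @ (phi(K)^T @ V).
    # The right-hand grouping keeps only a (d x d) summary, so cost is O(n * d^2).
    q = torch.nn.functional.elu(q) + 1
    k = torch.nn.functional.elu(k) + 1
    kv = k.transpose(-2, -1) @ v                                   # (d x d) key-value summary
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps    # per-query normalizer
    return (q @ kv) / z

# Tiny usage example
n, d = 8, 4
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
print(linear_attention(q, k, v).shape)  # torch.Size([8, 4])
```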