Statement from ByteDance TikTok
Reinforcement learning in Seed-Thinking-v1.5 is powered by two custom frameworks, the actor-critic VAPO and the policy-gradient DAPO, developed to address known instabilities in RL training. These techniques reduce reward-signal sparsity and improve training stability, especially in long chain-of-thought (CoT) settings. For supervised fine-tuning (SFT), the team curated 400,000 samples, including 300,000 verifiable (STEM, logic, and coding tasks) …