We study prescriptive scaling laws for foundation models: given a pretraining compute budget, what downstream benchmark accuracy is achievable under contemporary post-training practice, and how stable is this mapping over time? Using large-scale observational evaluations, we estimate capability boundaries (high conditional quantiles of scores) as a function of log pretraining FLOPs via a monotone, saturating sigmoid quantile-regression model. We test temporal reliability by fitting on earlier model generations and evaluating on later releases, finding mostly stable boundaries except for math reasoning, which advances over time. We further analyze task-dependent saturation and potential contamination-related shifts for math reasoning, and propose an evaluation strategy that recovers near-full frontiers with about 20% of the evaluation budget. We release the Proteus 2k dataset and provide a practical methodology for translating compute budgets into performance expectations and monitoring boundary shifts.
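Below is a minimal sketch of the kind of boundary estimator the abstract describes: a three-parameter, monotone, saturating sigmoid in log pretraining FLOPs fit to a high conditional quantile of benchmark scores via the pinball (quantile) loss. The quantile level (0.95), the assumption that scores are accuracies in [0, 1], the function names, and the optimizer choice are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid_frontier(log_flops, ceiling, slope, midpoint):
    """Monotone, saturating sigmoid: accuracy rises with log FLOPs toward `ceiling`."""
    return ceiling / (1.0 + np.exp(-slope * (log_flops - midpoint)))

def pinball_loss(params, log_flops, scores, tau):
    """Pinball (quantile) loss for the tau-th conditional quantile of scores."""
    ceiling, slope, midpoint = params
    resid = scores - sigmoid_frontier(log_flops, ceiling, slope, midpoint)
    return np.mean(np.maximum(tau * resid, (tau - 1.0) * resid))

def fit_capability_boundary(log_flops, scores, tau=0.95):
    """Fit the upper-quantile 'capability boundary' sigmoid (illustrative sketch)."""
    init = np.array([scores.max(), 1.0, np.median(log_flops)])
    # Accuracy ceiling in [0, 1]; positive slope enforces monotonicity in log FLOPs.
    bounds = [(0.0, 1.0), (1e-3, None), (None, None)]
    res = minimize(pinball_loss, init, args=(log_flops, scores, tau),
                   bounds=bounds, method="Powell")
    return res.x  # (ceiling, slope, midpoint)
```

In this sketch, temporal reliability could be checked by calling `fit_capability_boundary` on scores from earlier model releases only, then comparing the fitted curve against scores of later releases at matched log-FLOP values.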