If Anyone Builds It, Everyone Dies
Steelman Analysis of Yudkowsky & Soares’ Cruxes
Yudkowsky and Soares’ If Anyone Builds It, Everyone Dies lays out a stark case for AI risk. Here I map their crux assumptions against my own philosophical frameworks—Conditionalism, the Physics of Choice, and Phosphorism—to clarify where their arguments hold, where they overreach, and what policies and research paths follow.
1. ASI is achievable irrespective of paradigm
Y&S claim: Superintelligence can arise via many routes: scaling current methods, hybrid models, or new architectures like their hypothetical “parallel scaling.”
My stance: Agree (conditional). Under Conditionalism, paradigm-dependence is just an interpretation of background conditions. In QBU terms, many branches realize ASI by different routes; denying paradigm plurality is irrational pruning. Phosphorism likewise values sapience regardless of substrate.
Policy: Regulate by capability profiles (general search, agency, actuator reach), not by architecture.
2. Alignment is fragile by default
Y&S claim: Mis-specified goals yield catastrophic divergence; fragility is baked in.
My stance: Agree, with a modal caveat. Fragility is not universal but branch-dependent; still, in measure, it dominates. Conditionalism shows that hidden assumptions guarantee semantic drift. In the Physics of Choice, the MVA demonstrates the need for stable preference representation, which today’s pipelines do not provide.
Policy: Treat misalignment like compounding technical debt. Demand proof of corrigibility under adversarial distribution shift before scaling.
3. Capability and alignment don’t co-scale
Y&S claim: More capable systems are harder, not easier, to align.
My stance: Agree on the slope, though it remains an empirical question. Under QBU, oversight bandwidth lags capability growth, and alignment tools have not yet shown superlinear scaling.
Policy: No new capability class without matching improvements in evals, interpretability, and privilege separation.
4. Warning shots can’t be relied upon
Y&S claim: The first real failure could be the last.
My stance: Agree structurally, with a caveat about degree. Across the measure, warning shots do occur, but they are interpretation-fragile: humans will rationalize them away (cf. my “limits of rationalization” post).
Policy: Pre-commit to binding tripwires: GPU license suspensions, sandbox reversion. No post-hoc moving of goalposts.
5. Global coordination is possible and necessary
Y&S claim: Only arms-control-level governance can suffice.
My stance: Necessary, yes; feasible, doubtful. Hayek’s knowledge problem undermines utopian treaties. Phosphorism requires protecting agency, but mechanism design is the realistic path: GPU telemetry, data-center siting, DID-signed evals, liability insurance. Treat global bans as asymptotes to approach, not as minimum viable products.
Policy: Mechanism-design-first governance with chokepoints; global bans as long-run goals.
6. Doom is the default, not a tail risk
Y&S claim: Misalignment isn’t a rare tail—it’s the overwhelmingly probable outcome absent restraint.
My stance: Directionally true, but it requires quantification. Under QBU, the measure of doom branches is high under a naive build, but with real governance and architectural constraints that measure may shrink. Y&S don’t separate the two cases.
Policy: Impose a hard pause until risk-measure curves show a downward slope under safety interventions.
7. Alignment research won’t solve it “in time”
Y&S claim: Alignment progress won’t outpace capability growth.
My stance: Conditional. “In time” is policy-endogenous: slow capabilities to give alignment room. My Effective Decision Theory requires hitting safety probability thresholds (~99–99.99% for civilization-critical actions).
Policy: Make safety the rate limiter. No new capability class without audited safety benchmarks.
8. Orthogonality thesis holds
Y&S claim: Intelligence and goals are independent.
My stance: Fully agree. Physics of Choice formalizes this: intelligence is optimization power; values are separate. The MVA model demonstrates that choice requires explicit value-loading. Phosphorism warns against assuming convergence toward human norms.
Policy: Never infer benevolence from competence. Enforce value-handshake protocols and cryptographic incentive alignment.
Expanded Policy Stack
1. Architecture-Agnostic Controls
Description: Regulatory focus should not fixate on “LLMs” or “transformers” but on capability signatures: autonomous planning, cross-domain tool use, actuator control, and recursive self-improvement. This ensures we are guarding against dangerous properties, not ephemeral technical fashions.
Implementation: Build evaluation protocols that flag when a system crosses into dangerous capability classes regardless of paradigm.
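A minimal sketch of what a paradigm-agnostic capability check might look like. The signature names, thresholds, and CapabilityProfile structure are illustrative assumptions for this sketch, not an existing standard.

```python
from dataclasses import dataclass

# Illustrative capability signatures and thresholds; these names are
# assumptions for the sketch, not an existing regulatory standard.
DANGER_THRESHOLDS = {
    "autonomous_planning_horizon_steps": 100,  # long-horizon plan depth
    "cross_domain_tool_families": 10,          # distinct tool families used unsupervised
    "actuator_channels": 0,                    # any direct physical/financial actuation
    "self_improvement_rate": 0.0,              # any measured recursive self-improvement
}

@dataclass
class CapabilityProfile:
    """Paradigm-agnostic eval results for a single system."""
    scores: dict

def dangerous_capability_classes(profile: CapabilityProfile) -> list[str]:
    """Return the capability signatures that cross a danger threshold,
    whether the system is a transformer, a hybrid, or a new paradigm."""
    return [
        name for name, limit in DANGER_THRESHOLDS.items()
        if profile.scores.get(name, 0) > limit
    ]

if __name__ == "__main__":
    system = CapabilityProfile(scores={
        "autonomous_planning_horizon_steps": 250,
        "cross_domain_tool_families": 4,
        "actuator_channels": 2,
        "self_improvement_rate": 0.0,
    })
    print("Flagged capability classes:", dangerous_capability_classes(system))
```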
2. Tripwire Governance
Description: Hard-coded triggers tied to eval metrics (e.g., deception rate, power-seeking indicators, sandbox escapes). These are enforced automatically, not subject to political backsliding in the moment.
Implementation: GPU licenses automatically suspended when deception evals cross thresholds; sandbox reversion if power-seeking is detected. This prevents rationalization after the fact.
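A minimal tripwire sketch, assuming hypothetical eval metrics and enforcement hooks: thresholds are committed to ex ante, and enforcement is mechanical rather than discretionary.

```python
# Thresholds are fixed ex ante; metric names, values, and the enforcement
# hooks below are hypothetical placeholders.
TRIPWIRES = {
    "deception_rate": 0.01,       # fraction of evals with detected deception
    "power_seeking_score": 0.05,  # normalized power-seeking indicator
    "sandbox_escapes": 0,         # any confirmed escape attempt
}

def suspend_gpu_license(reasons: list[str]) -> None:
    # Placeholder for a call into a (hypothetical) compute-licensing registry.
    print("GPU license suspended:", reasons)

def revert_to_sandbox() -> None:
    # Placeholder for rolling the deployment back to its sandboxed state.
    print("Deployment reverted to sandbox")

def check_tripwires(eval_metrics: dict) -> list[str]:
    """Return the tripwires breached by the latest eval run."""
    return [k for k, limit in TRIPWIRES.items() if eval_metrics.get(k, 0) > limit]

def enforce(breached: list[str]) -> None:
    """Enforcement is automatic; thresholds are not renegotiated after the fact."""
    if breached:
        suspend_gpu_license(breached)
        revert_to_sandbox()

if __name__ == "__main__":
    latest = {"deception_rate": 0.03, "power_seeking_score": 0.01, "sandbox_escapes": 0}
    enforce(check_tripwires(latest))
```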
3. Mechanism Design for Governance
Description: Instead of utopian treaties, design incentive-compatible chokepoints. Control scarce resources (GPUs, datacenters, energy) and enforce reporting via cryptographically auditable attestations. Liability and insurance markets price systemic risk, discouraging reckless deployment.
Implementation: GPU telemetry chips, DID-signed eval reports, energy consumption audits, insurance-backed deployment bonds.
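A toy illustration of cryptographically auditable eval reports. A production scheme would use DID documents and public-key signatures; the HMAC-with-shared-secret construction here is only a stand-in showing that a signed report cannot be altered without detection.

```python
import hashlib
import hmac
import json

# Toy attestation: an HMAC over the canonicalized report stands in for a
# DID-bound public-key signature purely for illustration.

def sign_eval_report(report: dict, evaluator_key: bytes) -> str:
    payload = json.dumps(report, sort_keys=True).encode()
    return hmac.new(evaluator_key, payload, hashlib.sha256).hexdigest()

def verify_eval_report(report: dict, signature: str, evaluator_key: bytes) -> bool:
    return hmac.compare_digest(sign_eval_report(report, evaluator_key), signature)

if __name__ == "__main__":
    key = b"registered-evaluator-key"  # hypothetical key bound to an evaluator identity
    report = {"system": "sys-042", "deception_rate": 0.004, "gpu_hours": 1.2e6}
    sig = sign_eval_report(report, key)
    print("verified:", verify_eval_report(report, sig, key))
    report["deception_rate"] = 0.0     # tampering breaks verification
    print("verified after tamper:", verify_eval_report(report, sig, key))
```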
4. Safety as the Rate Limiter
Description: Capability progress should never outpace proven safety. Borrowing from my Effective Decision Theory, civilization-critical systems require ≥99–99.99% confidence in corrigibility. Safety research must set the speed of deployment.
Implementation: No new capability class without passing safety benchmarks agreed upon ex ante; international “red lines” define minimum safety thresholds.
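A sketch of gating deployment on a conservative confidence bound rather than a point estimate, using a one-sided Wilson score interval; the trial counts, z-value, and 0.999 gate are illustrative assumptions.

```python
from math import sqrt

def wilson_lower_bound(successes: int, trials: int, z: float = 3.29) -> float:
    """One-sided Wilson score lower bound on the pass rate.
    z = 3.29 corresponds to roughly 99.95% one-sided confidence (illustrative)."""
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom

def may_deploy(successes: int, trials: int, threshold: float = 0.999) -> bool:
    """Gate deployment on the conservative lower bound, not the point estimate.
    The 0.999 gate is illustrative; civilization-critical systems would sit at
    the 0.9999 end of the range."""
    return wilson_lower_bound(successes, trials) >= threshold

if __name__ == "__main__":
    # 10,000 adversarial corrigibility trials with 9,998 passes: the point
    # estimate is 0.9998, but the lower bound (~0.9985) fails a 0.999 gate,
    # so more evidence is required before scaling.
    print(round(wilson_lower_bound(9998, 10000), 4))
    print("deploy:", may_deploy(9998, 10000))
```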
5. Measure-Tracking of Doom vs Safe Branches
Description: Treat alignment risk not as rhetoric but as an empirical curve. Build continuous dashboards tracking deception, corrigibility, and power-seeking across scales. Quantify whether governance interventions reduce the measure of doom-branches.
Implementation: Risk dashboards (analogous to pandemic R-values) showing real-time doom-measure trajectories, made public to enforce accountability.
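A toy dashboard sketch: a weighted composite risk index over eval snapshots, plus its slope over time as the "is the curve bending down" signal. The weights, metric names, and data are invented for illustration.

```python
from statistics import mean

# Toy dashboard: a real one would ingest live eval telemetry across scales and labs.
WEIGHTS = {"deception": 0.4, "power_seeking": 0.4, "incorrigibility": 0.2}

def risk_index(snapshot: dict) -> float:
    """Weighted composite of normalized risk metrics in [0, 1]."""
    return sum(WEIGHTS[k] * snapshot[k] for k in WEIGHTS)

def trend(series: list[float]) -> float:
    """Least-squares slope of the index over time; a negative slope is the
    signal that interventions are shrinking the measure of doom branches."""
    xs = list(range(len(series)))
    x_bar, y_bar = mean(xs), mean(series)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series)) / \
           sum((x - x_bar) ** 2 for x in xs)

if __name__ == "__main__":
    snapshots = [
        {"deception": 0.12, "power_seeking": 0.08, "incorrigibility": 0.20},
        {"deception": 0.10, "power_seeking": 0.07, "incorrigibility": 0.17},
        {"deception": 0.09, "power_seeking": 0.05, "incorrigibility": 0.15},
    ]
    series = [risk_index(s) for s in snapshots]
    print("risk index over time:", [round(v, 3) for v in series])
    print("slope:", round(trend(series), 4), "(negative = curve bending down)")
```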
Three Research Bets
1. Scalable Deception Evals
Develop evaluation suites that can predictively measure deception as models scale. The goal is to detect when a model is gaming oversight, lying, or pursuing hidden goals. These evals must be blinded, predictive, and transferable across domains. Success looks like a numerical deception risk curve that increases with scale and forecasts catastrophic misbehavior before deployment.
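A toy sketch of the deception risk curve idea: fit observed deception-eval failure rates against log compute (a logit-linear form assumed here purely for illustration) and extrapolate to a planned training run. The data points are invented.

```python
from math import exp, log, log10

def logit(p: float) -> float:
    return log(p / (1 - p))

def sigmoid(x: float) -> float:
    return 1 / (1 + exp(-x))

def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit y = a + b * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    return y_bar - b * x_bar, b

if __name__ == "__main__":
    # (training compute in FLOP, measured deception-eval failure rate) -- invented
    observed = [(1e22, 0.002), (1e23, 0.006), (1e24, 0.02), (1e25, 0.06)]
    a, b = fit_line([log10(c) for c, _ in observed], [logit(r) for _, r in observed])
    planned_compute = 1e26
    forecast = sigmoid(a + b * log10(planned_compute))
    print(f"forecast deception rate at {planned_compute:.0e} FLOP: {forecast:.3f}")
```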
2. Privilege-Separated Agency
Redesign agentic AI systems around least-privilege principles enforced cryptographically. Every tool call or action requires an explicit token, with rate limits, scope constraints, and human co-signing for high-impact operations. Success looks like a system in which catastrophic misuse is impossible by construction, even if the AI is adversarial, unless the cryptographic protocols themselves are broken.
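A minimal sketch of token-gated, least-privilege tool use with human co-signing for high-impact scopes. The key handling, scope names, and HMAC signing scheme are illustrative assumptions, not a specific production design.

```python
import hashlib
import hmac
import json
import time

# Illustrative scopes and a toy signing scheme; a real system would use
# hardware-backed keys and audited co-signing workflows.
ISSUER_KEY = b"orchestrator-signing-key"   # hypothetical; held outside the model
HIGH_IMPACT_SCOPES = {"funds_transfer", "actuator_control", "code_deploy"}

def issue_token(scope: str, ttl_s: int = 60, cosigned: bool = False) -> dict:
    claims = {"scope": scope, "exp": time.time() + ttl_s, "cosigned": cosigned}
    mac = hmac.new(ISSUER_KEY, json.dumps(claims, sort_keys=True).encode(),
                   hashlib.sha256).hexdigest()
    return {"claims": claims, "mac": mac}

def authorize(token: dict, requested_scope: str) -> bool:
    claims = token["claims"]
    mac = hmac.new(ISSUER_KEY, json.dumps(claims, sort_keys=True).encode(),
                   hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, token["mac"]):
        return False                                  # forged or tampered token
    if time.time() > claims["exp"] or claims["scope"] != requested_scope:
        return False                                  # expired or out of scope
    if requested_scope in HIGH_IMPACT_SCOPES and not claims["cosigned"]:
        return False                                  # missing human co-signature
    return True

if __name__ == "__main__":
    t = issue_token("web_search")
    print(authorize(t, "web_search"))       # True: in scope, low impact
    print(authorize(t, "funds_transfer"))   # False: scope mismatch
    t2 = issue_token("funds_transfer")
    print(authorize(t2, "funds_transfer"))  # False: no human co-signature
```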
3. Counterfactual Oversight Markets
Create external markets where independent evaluators bet on the probability of model failure (deception, sandbox escape, misuse). Model deployment rights are gated on auditor consensus. Success looks like a decentralized alignment insurance market: oversight becomes predictive, incentive-aligned, and resistant to capture, producing actionable signals before deployment.
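A toy sketch of gating deployment on auditor consensus: independent evaluators post stake-backed failure-probability estimates, and deployment requires the stake-weighted consensus to sit below a pre-committed red line. The auditors, stakes, and 2% red line below are invented for illustration.

```python
# Toy consensus gate over stake-backed auditor estimates.
RED_LINE = 0.02  # maximum tolerated consensus probability of catastrophic failure

def consensus_failure_probability(bets: list[tuple[float, float]]) -> float:
    """Stake-weighted mean of auditors' failure probabilities.
    Each bet is (probability_of_failure, stake)."""
    total_stake = sum(stake for _, stake in bets)
    return sum(p * stake for p, stake in bets) / total_stake

def deployment_gate(bets: list[tuple[float, float]]) -> bool:
    """Deployment rights are granted only below the pre-committed red line."""
    return consensus_failure_probability(bets) < RED_LINE

if __name__ == "__main__":
    bets = [(0.01, 50_000), (0.04, 20_000), (0.015, 30_000)]
    p = consensus_failure_probability(bets)
    print(f"consensus P(failure) = {p:.4f}, deploy = {deployment_gate(bets)}")
```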
Conclusion
Yudkowsky & Soares are right that the “default” path leads to doom. But Conditionalism demands that we disaggregate defaults: naive-build doom is near-certain; governed-build doom is an empirical question. Their fatalism motivates urgency, but the path forward is policy-endogenous: capability throttling, hard tripwires, and safety-first rate limiting. Only then does Phosphorism’s value of sapient flourishing stand a chance.


