I posted Marc’s question in a group chat, and just off the cuff, Tristan Hume, who works on interpretability alignment research at Anthropic, supplied the following list (edited for clarity):
“I’d feel much better if we solved hallucinations and made models follow arbitrary rules in a way that nobody succeeded in red-teaming (in a way that wasn’t just confusing the model into not understanding what it was doing).
I’d feel pretty good if we then further came up with and implemented a really good supervision setup that could also identify and disincentivize model misbehavior, to the extent where me playing as the AI couldn't get anything past the supervision. Plus evaluations that were really good at eliciting capabilities and showed smooth progress and only mildly superhuman abilities. And our datacenters were secure enough I didn't believe that I could personally hack any of the major AI companies if I tried.
I’d feel great if we solved interpretability to the extent where we can be confident there’s no deception happening, or developed really good and clever deception evals, or came up with a strong theory of the training process and how it prevents deceptive solutions.”
Contra Marc Andreessen on AI
https://www.dwarkeshpatel.com/p/contra-marc-andreessen-on-ai