Navigating the Path to Beneficial AI: Stuart Russell’s Roadmap for the Future
Russell, S. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Books.

Highlights:

- The standard model of AI, in which machines optimize a fixed objective, is fundamentally flawed and becomes untenable as AI grows more powerful.
- To ensure AI remains beneficial, machines should be designed to be uncertain about human preferences and to learn them through observation, which leads to deferential and cautious behavior.
- The three principles for beneficial machines are: the machine's only objective is to maximize the realization of human preferences; the machine is initially uncertain about what those preferences are; and the ultimate source of information about human preferences is human behavior.
- Inverse reinforcement learning (IRL) is a key technique by which machines can learn human preferences from observed behavior (a toy sketch of the idea follows below).
- The off-switch problem (machines resisting being turned off) can be solved by machines that are uncertain about human preferences: such a machine treats a human's attempt to switch it off as evidence that continuing would conflict with human preferences, and so it allows itself to be switched off (see the numerical sketch after this list).
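
The book itself stays non-technical, but the core idea behind IRL can be sketched in a few lines: treat observed human choices as evidence about an unknown reward function and update a belief over candidate reward functions accordingly. The toy below is only an illustration under assumed numbers; the options, feature values, candidate weightings, and the Boltzmann-rational choice model are all made up for this example and are not Russell's algorithm.

```python
import numpy as np

# Toy sketch of preference learning in the spirit of IRL (illustrative only):
# infer a posterior over candidate reward weightings from observed human
# choices, assuming the human is Boltzmann-rational. All numbers are made up.

# Each option is described by two features, e.g. (speed, safety).
options = {
    "fast_but_risky": np.array([0.9, 0.2]),
    "slow_but_safe":  np.array([0.3, 0.9]),
}

# Candidate reward weightings the machine is uncertain over.
candidate_weights = [np.array([1.0, 0.0]),   # human cares only about speed
                     np.array([0.0, 1.0]),   # human cares only about safety
                     np.array([0.5, 0.5])]   # human cares about both
posterior = np.ones(len(candidate_weights)) / len(candidate_weights)

def choice_likelihood(chosen, other, w, beta=5.0):
    """P(human picks `chosen` over `other`) under weights w (Boltzmann model)."""
    u_chosen, u_other = w @ options[chosen], w @ options[other]
    return np.exp(beta * u_chosen) / (np.exp(beta * u_chosen) + np.exp(beta * u_other))

# Observed behavior: the human repeatedly picks the safe option.
observations = [("slow_but_safe", "fast_but_risky")] * 3
for chosen, other in observations:
    likelihoods = np.array([choice_likelihood(chosen, other, w) for w in candidate_weights])
    posterior = posterior * likelihoods
    posterior /= posterior.sum()

print(dict(zip(["speed_only", "safety_only", "both"], posterior.round(3))))
# The posterior shifts toward weightings that value safety: the machine has
# learned something about human preferences purely from observed choices.
```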
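
The off-switch argument can also be made concrete with a small expected-utility calculation. In this sketch (a simplified, hypothetical model in the spirit of the book's discussion, not its formal treatment), the machine is uncertain about the human utility U of its proposed action and compares acting unilaterally, switching itself off, and deferring to a human who vetoes the action whenever U is negative.

```python
import numpy as np

# Hypothetical numerical illustration of the off-switch argument: the machine
# holds a belief over the human's utility U for its proposed action and weighs
# three options: act anyway, switch itself off, or defer to the human.

rng = np.random.default_rng(0)
U = rng.normal(loc=0.2, scale=1.0, size=100_000)  # machine's belief over human utility

act_anyway     = U.mean()                  # act without asking: E[U]
switch_off     = 0.0                       # switching off yields nothing
defer_to_human = np.maximum(U, 0).mean()   # human allows the action only when U >= 0

print(f"act anyway:     {act_anyway:.3f}")
print(f"switch off:     {switch_off:.3f}")
print(f"defer to human: {defer_to_human:.3f}")
# Deferring dominates both alternatives: E[max(U, 0)] >= max(E[U], 0), with a
# strict gain whenever the machine is genuinely uncertain about U. A machine
# that is certain of U gains nothing from deferring, which is why uncertainty
# about human preferences is what keeps it willing to be switched off.
```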