Reward is Enough for Collective Artificial Intelligence

Posted June 12, 2023
Dr. Peter Cotton

The article “Reward is Enough for Collective Artificial Intelligence” by Peter Cotton appeared on Microprediction via Medium.

Excerpt

This essay responds to the well-known Reward is Enough article by David Silver, Satinder Singh, Doina Precup, and Richard S. Sutton.

The title of their work is a high-quality provocation but somewhat futuristic. In this response, I argue that “reasonably impressive collective intelligence” is more imminent and feasible than the outcome they hope for — one seemingly premised on very powerful reinforcement learning (RL) agents. In my counter-scenario many people contribute only “reasonably intelligent” algorithms and a web is spun between them.

There is a well-appreciated economic reason why reward isn’t already enough in my scenario. But fortunately, this also suggests a solution. This post originally appeared here, incidentally; I’m moving my notes to Medium.

A bold solution to artificial intelligence?

It is invigorating to see professors formulating a strong hypothesis about a field they pioneered — and at the same time a hypothesis about multiple fields. This paper might be seen as both a mini-survey for those doing related work and an introduction to reinforcement learning principles for those in different fields that might benefit. It invites researchers to consider whether they have underestimated a line of thinking. And it says, or at least I read it as saying, “listen, this might not work but the payoff is rather large — so lend me your ears (and maybe a little more funding).”

Oh yes, the list of flippant responses to this paper is long. Rejected titles might have included, “Is another $3 billion enough?”. Coming from DeepMind, the paper might also be cynically viewed as “reward hacking”. The project might be seen as an agent that has a huge, legitimate scientific goal. But like a robot in a maze that is seemingly too hard to solve, it might be accused of overfitting to less scientific intermediate rewards it has created for itself.

If that harsh view is in any way accurate, then I blame the press and corporate faux data scientists for that, not the researchers. Funding for research is hard to come by, and science is rarely advanced by those with lukewarm enthusiasm for their own work. And let’s not forget the accomplishments. Solving protein folding has immense implications, even if that doesn’t translate immediately into general intelligence. It doesn’t matter if DeepMind hasn’t “finished” that — for they seem to be leading.

Why not entertain speculation on the future of AI from these authors?

In this article we hypothesise that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward. Accordingly, reward is enough to drive behaviour that exhibits abilities studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language, generalization and imitation.

Sure, what differentiates this paper from a reinforcement learning survey is its somewhat aggressive style. But the fact that someone is “talking their book” doesn’t make them wrong. Nor should it disqualify the potential, which is so large that it almost feels like a Pascalian wager:

This is in contrast to the view that specialised problem formulations are needed for each ability, based on other signals or objectives. Furthermore, we suggest that agents that learn through trial and error experience to maximise reward could learn behaviour that exhibits most if not all of these abilities, and therefore that powerful reinforcement learning agents could constitute a solution to artificial general intelligence.

I share the authors’ desire for unifying beauty — who doesn’t? And in particular, the elegant idea that reward maximization is “sufficient” is attractive. I’m less sure that it is quite as clearly delineated from every other possible emphasis as the authors would like, but they make the case as follows.

One possible answer is that each ability arises from the pursuit of a goal that is designed specifically to elicit that ability. For example, the ability of social intelligence has often been framed as the Nash equilibrium of a multi-agent system; the ability of language by a combination of goals such as parsing, part-of-speech tagging, lexical analysis, and sentiment analysis; and the ability of perception by object segmentation and recognition.

In other words, the need for seemingly disparate approaches is disheartening as far as general AI goes, and it reduces the chance of a “wow” answer. So let’s bet on RL. The situation is not dissimilar to the meditation in Pedro Domingos’ book The Master Algorithm, in which the author invites the reader to generalize to an all-encompassing algorithm subsuming special cases like nearest-neighbor, genetic programming, or backpropagation. (I note that Domingos pursues the crowd-sourcing approach, and isn’t asking for forgiveness on a $1.5 billion loan … but I promise to stop with the jibes now).

In “Reward is Enough”, the speculation does not quite follow Domingos’ line, because the authors stop short of suggesting a singular RL approach — or even the possibility of one. At least in my reading, the point is more subtle. They suggest:

In this paper, we consider an alternative hypothesis: that the generic objective of maximising reward is enough to drive behaviour that exhibits most if not all abilities that are studied in natural and artificial intelligence.

So, the broad category of reinforcement learning, which presumably needs to be delineated from more model-intensive ways of guiding people and machines, can be enough when it comes to creating and explaining generalized intelligence. I don’t think they are suggesting that every function in numpy be rewritten as a reward optimization (it is mathematically obvious that all of them can be).
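
To labor that aside with a toy of my own (nothing from the paper): even np.mean can be recast as reward maximization, since the mean is the scalar m that maximizes the reward r(m) = -sum((x - m)^2), and a few steps of gradient ascent on that reward recover what np.mean computes directly.

```python
import numpy as np

# Toy recasting of np.mean as "reward maximization": the mean of x is the
# scalar m that maximizes the reward r(m) = -sum((x - m)**2).  A few steps
# of gradient ascent on that reward recover what np.mean computes directly.
x = np.array([1.0, 2.0, 4.0, 7.0])

m = 0.0                            # initial guess
for _ in range(200):
    grad = 2.0 * np.sum(x - m)     # derivative of the "reward" with respect to m
    m += 0.01 * grad               # gradient ascent step

print(m, np.mean(x))               # both approximately 3.5
```

Nobody would write it that way, of course, which is rather the point.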

Are other things not enough?

The authors mention some examples where reward is enough for imitation, perception, learning, social intelligence, generalization, and language. A danger here is that it can feel a little like a survey of applied statistics or control theory in which an author concludes that computing the norm of a vector is enough (it is, admittedly, common to many activities).

The possible circularity of defining intelligence as goal attainment and then claiming that goal attainment explains intelligence has been noted in a rather withering style by Herbert Roitblat (article), who also provides a historical perspective on reward-based explanations going back to B. F. Skinner. Skinner copped the People’s Elbow from Noam Chomsky and the fight was stopped immediately.

As the authors are keenly aware, “Reward is Enough” sails dangerously close to that other well-known thesis normally attributed to Charles Darwin. Is evolution enough? The authors suggest that because crossover and mutation aren’t the only mechanisms, this is substantially different. Okay.

It is certainly true that rewards help in many places — I’ll get back to Economics 101 in a moment. However, for me, the least persuasive section of the paper is the authors’ somewhat glib dismissal of other things that might be “enough”. For instance, they reject the idea that prediction is enough on the grounds that prediction, alongside supervised learning, does not provide a principle for action selection and therefore cannot be enough for goal-oriented intelligence.

What about action-conditional prediction? What about the conditional prediction of value functions? Is that considered so different from prediction? This feels too much like a game with strange rules and rewards: an imagined competition in which single-word explanations are to be set against each other. We might as well typeset them on Malcolm Gladwell-style book covers. (That generator used to exist, by the way, but maybe the rewards for maintaining the site were insufficient.)
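
To make the quibble concrete, here is a minimal sketch (my own toy, with made-up logged data, not anything from the paper) in which ordinary regression of observed returns, conditioned on the action taken, already yields an action-selection rule: fit one linear predictor per action, then act greedily on the predictions.

```python
import numpy as np

# Hypothetical logged data: state features, the action taken (0 or 1),
# and the observed return.
rng = np.random.default_rng(0)
states = rng.normal(size=(500, 3))
actions = rng.integers(0, 2, size=500)
true_w = {0: np.array([1.0, -0.5, 0.0]), 1: np.array([-1.0, 0.5, 0.2])}
returns = np.array([states[i] @ true_w[a] for i, a in enumerate(actions)])
returns = returns + 0.1 * rng.normal(size=500)

# Action-conditional prediction: one least-squares estimate of the return
# per action, i.e. a crude Q-hat(s, a) learned by plain regression.
w_hat = {a: np.linalg.lstsq(states[actions == a], returns[actions == a],
                            rcond=None)[0]
         for a in (0, 1)}

def act(s):
    # Prediction plus argmax is already a principle for action selection.
    return max((0, 1), key=lambda a: s @ w_hat[a])

print(act(np.array([1.0, 0.0, 0.0])))   # prefers action 0 for this state
```

But continuing with the paper: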

Optimization is a generic mathematical formalism that may maximise any signal, including cumulative reward, but does not specify how an agent interacts with its environment.

Except that “mere optimization” certainly does specify how an agent interacts, or can. Anyone with the creativity to design ingenious optimization algorithms that work miraculously well in high-dimensional spaces and beat benchmarks is certainly capable of making the relatively trivial conceptual step of marrying this to hyper-parameters in some model-rich approach. I suppose, however, that the argument is that there is no model-rich approach in the vicinity.
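
A minimal sketch of what I mean, assuming nothing more than an invented one-dimensional tracking “environment” and plain random search: treat the policy’s two parameters as decision variables for a generic optimizer, score them by whole rollouts, and the optimization already dictates how the agent interacts with its environment.

```python
import numpy as np

# A deliberately tiny, invented "environment": each step the agent emits a
# scalar action and is rewarded for tracking a drifting target.  A generic
# black-box optimizer (random search) tunes the two policy parameters by
# scoring complete rollouts.
def rollout(params, steps=50, seed=0):
    rng = np.random.default_rng(seed)     # same noise for every candidate
    gain, bias = params
    target, total = 0.0, 0.0
    for _ in range(steps):
        target += rng.normal(scale=0.1)   # environment dynamics
        action = gain * target + bias     # the policy *is* the interaction
        total -= (action - target) ** 2   # reward signal
    return total

search_rng = np.random.default_rng(1)
best, best_score = np.zeros(2), -np.inf
for _ in range(200):                      # "mere optimization"
    cand = best + search_rng.normal(scale=0.3, size=2)
    score = rollout(cand)
    if score > best_score:
        best, best_score = cand, score

print(best, best_score)                   # gain drifts toward 1, bias toward 0
```

Whether one files that under reinforcement learning or under mere optimization seems largely a matter of labelling.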

Still, rather than elevate one area of study over another, I’d be inspired by the theory of computational complexity. This has taught us that many seemingly different problems are equally hard, and equally general. Solve one and you might well solve them all.

That said, it is legitimate for the authors to point out the distinction, and chief advantage, of reinforcement learning compared with various flavours of control theory. RL attempts a shortcut to intelligent behaviour that can sometimes avoid the self-imposed limitations of an incorrect, or inconvenient, model of reality.

Avoiding a normative model of reality is a line many of us can be sympathetic to — and not just in real-time decision making. In my case, I recall noodling on ways to avoid models in a derivative pricing setting. (I’m not sure it came to much, but unlike the Malcolm Gladwell book generator, at least the page still exists.) I’m guessing many modelers have grown frustrated over the years at their own inability to mimic nature’s generative model. That is, after all, the genesis of the machine learning revolution and Breiman’s “second culture” of statistics. Enter reinforcement learning:

By contrast, the reinforcement learning problem includes interaction at its heart: actions are optimised to maximise reward, those actions in turn determine the observations received from the environment, which themselves inform the optimisation process; furthermore optimisation occurs online in real-time while the environment continues to tick.

I can’t say I’ve ever completely bought into this taxonomy. Working backward in the above paragraph, there are plenty of things that perform optimization incrementally in real-time. I was just working on one here but I wouldn’t call it RL — maybe Darwinian. Nor would I call it reward-based even though yes, algorithms are rewarded for having a lower error. The above passage makes it sound like most of the novelty in RL springs from online versus batch computation. I’m a huge fan of the former but it’s pretty old stuff.
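
For concreteness, here is the kind of thing I have in mind (a toy of my own, not the linked project): a pool of naive forecasters competes on a data stream, and an exponential-weights update shifts weight toward whichever had the lower recent error, incrementally and in real time, with no reinforcement learning machinery in sight.

```python
import numpy as np

# Toy online competition: three naive forecasters predict the next value of
# a stream; after each observation an exponential-weights update shifts
# weight toward whichever had the lower squared error.
rng = np.random.default_rng(2)
stream = np.cumsum(rng.normal(size=300))     # stand-in for live data

forecasters = [
    lambda hist: hist[-1],                   # persistence
    lambda hist: float(np.mean(hist[-5:])),  # short moving average
    lambda hist: 0.0,                        # stubborn constant
]
weights = np.ones(len(forecasters))

for t in range(5, len(stream)):
    hist = stream[:t]
    preds = np.array([f(hist) for f in forecasters])
    errors = (preds - stream[t]) ** 2
    weights *= np.exp(-0.1 * errors)         # lower error earns a larger share
    weights /= weights.sum()

print(weights)   # persistence tends to dominate on a random-walk stream
```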

The problem is that the more you introduce specific tricks into RL for creating and predicting reward functions or advantage functions, the more it starts to look like there might be other holy grails, like online optimization, or conditional prediction (of intermediate rewards, sure).

Or maybe it’s even simpler. If you predict for me how long it will take to travel through the Lincoln Tunnel versus an alternative, I can certainly make a decision. It seems that here prediction is the real open-ended task, not the “reinforcement logic”. Perhaps a discussion of temporal difference learning, and other specific devices, might help make the authors’ case crisper. Otherwise, Captain Obvious from the Hotels.com commercials enters stage left and declares “prediction is enough!”
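
To belabor Captain Obvious’s point with made-up numbers (a toy, not real traffic data): once a predictive model has produced travel-time estimates, the decision logic that remains is a one-liner.

```python
# Hypothetical predicted travel times in minutes, from whatever model you like.
predicted_minutes = {"lincoln_tunnel": 24.0, "holland_tunnel": 31.0}

route = min(predicted_minutes, key=predicted_minutes.get)
print(route)   # the prediction did the heavy lifting; the choice was trivial
```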

Read the rest of the article here: https://microprediction.medium.com/reward-is-enough-for-collective-artificial-intelligence-8b7ae45eb044.

About your author

I work for Intech Investments. I stir up trouble on LinkedIn on occasion, in the hope that people might contribute to open-source code serving the broad goal of collective artificial intelligence.

Addendum May 2023: This article appeared in August 2021 on my old blog and I’m moving it over. Since then I did, in fact, finish the book referenced above, which comprises the longer-form discussion, and I’m pleased to report it has won some awards.

Disclosure: Interactive Brokers

Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from Dr. Peter Cotton and is being posted with his permission. The views expressed in this material are solely those of the author and/or Dr. Peter Cotton and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.
