Discrimination & RL
If a person decides to throw their life savings into buying lottery tickets and then wins (assuming the person isn’t buying an appreciable percentage of the float for some mad arbitrage opportunity), does that make it a good decision?
According to proponents of reinforcement learning (in its current state), intelligence can be solved by choosing actions that lead to the greatest reward, both now and in the discounted future. Most humans’ internal reward signal would fire in a big way if they won the lottery: imagine the huge flood of dopamine, the euphoria, and of course the extrinsic reward signal of a huge bankroll. We can all agree that reinforcing such behaviour to the point of shaping policy to repeat it will most likely end poorly. Yes, there are implementations that accommodate this, such that the inevitable catastrophe from consistently engaging in this behaviour will eventually shape the optimal policy. But feelings of regret, or sentiments along the lines of “that was incredibly dumb”, are sometimes independent of the outcome; I can definitely attest to that. We sometimes feel no regret for actions that didn’t lead to rewarding outcomes, and sometimes regret actions even when they produce positive reward signals. I honestly wonder whether there are more people who don’t regret following their dreams even though it didn’t work out, or more who are successful but still have regrets. Evidently, independent of the reward, we apply some form of discriminative filter when it comes to credit assignment.
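For concreteness, the textbook formalisation of that claim is the discounted-return objective (standard RL notation, not something specific to this post):

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad \pi^* = \arg\max_{\pi} \, \mathbb{E}_{\pi}\!\left[G_t\right]$$

where γ ∈ [0, 1) discounts future rewards and the optimal policy π* simply maximises this expectation. There is no term in it for “that was incredibly dumb”.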
There’s a chance this is adequately addressed by actor-critic methods, where the policy is shaped by the expectation of reward. In that framing, selecting actions by their high expected reward rather than their realised reward would explain the lack of remorse we feel for reasonable bets that didn’t pan out. But this feels disappointingly inadequate. For one, how does it explain our willingly taking actions while acknowledging their negative expected reward? Startup founders notoriously claim they go in knowing there is a slim chance of success, and a great deal of pain even in success. The reply that their mental model must nevertheless attribute a positive expected reward to the endeavour just doesn’t feel like an adequate account of human policy. People who talk about being guided by specific goals seem to chase after them the same way nematodes chase sugar (funnily enough, to their own demise). But people who instead see their lives as devoted to a purpose do not seem at all fixated on the rewards. They don’t talk about the fancy cars or the house that awaits them, but about the need for that purpose to be achieved. People who talk about changing society for the better, improving this or that, appear fixated on the purpose being achieved independent of whether they are the ones to achieve it, or whether they are around to see its fruition. Where is the reward expectation in this?
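To make the actor-critic point concrete, here is a minimal one-step advantage actor-critic update — a tabular numpy sketch of the general technique, not anything from this post; all the names are mine:

```python
import numpy as np

n_states, n_actions = 5, 2
gamma, lr_actor, lr_critic = 0.99, 0.1, 0.1

logits = np.zeros((n_states, n_actions))   # learnable actor parameters
values = np.zeros(n_states)                # learnable critic estimate V(s)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def act(s):
    # Actions are sampled according to the actor's current preferences.
    return np.random.choice(n_actions, p=softmax(logits[s]))

def update(s, a, r, s_next, done):
    # The critic's *expectation* of return, not the raw outcome alone,
    # shapes the policy: the learning signal is the TD error (advantage).
    target = r + (0.0 if done else gamma * values[s_next])
    advantage = target - values[s]
    values[s] += lr_critic * advantage
    # Policy-gradient step: grad of log softmax is onehot(a) - probs.
    probs = softmax(logits[s])
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    logits[s] += lr_actor * advantage * grad_log_pi
```

Note what the advantage does here: if a bad outcome was already priced into the critic’s expectation, the TD error is near zero and the policy barely moves — the RL analogue of feeling no remorse for a reasonable bet. What it cannot express is knowingly acting against a negative expectation.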
It is this propulsive force I’m calling discrimination. Discrimination on the bodily platform is objectively dumb. But discrimination in the sense of having an internal model of what is the right thing to do and what isn’t, one that remains unaffected by reward signals, is in my opinion a better candidate for the ‘necessary ingredient’ of an intelligence model than reinforcement learning is. When it comes to humans, I also believe this discriminatory model should be cultivated from an objective source of truth. It should be the same for everyone, but that’s a discussion for another time.
It would be interesting to codify this concept: to devise a discriminatory component for RL that is not itself learnable, but that, based on the state, influences the policy. If the remainder of the actor model is learnable, I suppose it could always learn to ‘tune’ this discriminator out. Then again, perhaps the explainability of that tuning-out would itself be telling about the alignment of the whole agent-environment system.
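One hypothetical way to wire this up — everything below, including the discriminator interface and the FORBIDDEN table, is invented purely for illustration — is to let a fixed judgement function modulate the actor’s logits at action-selection time:

```python
import numpy as np

n_states, n_actions = 5, 2
logits = np.zeros((n_states, n_actions))   # learnable actor parameters

# Hypothetical fixed knowledge of which actions are 'wrong' per state.
FORBIDDEN = {3: 1}   # e.g. action 1 is never acceptable in state 3

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def discriminator(state):
    """Fixed, non-learnable judgement of right and wrong.
    Returns per-action penalties: 0 = permitted, -inf = vetoed.
    Crucially, nothing here is ever updated by a reward signal."""
    penalties = np.zeros(n_actions)
    if state in FORBIDDEN:
        penalties[FORBIDDEN[state]] = -np.inf
    return penalties

def act(state):
    # The learnable actor proposes; the fixed discriminator disposes.
    adjusted = logits[state] + discriminator(state)
    return np.random.choice(n_actions, p=softmax(adjusted))
```

The tune-out worry maps onto a design choice here: a hard veto (-inf) survives any finite logit the actor learns, so it cannot be tuned out at selection time, whereas a soft penalty could be compensated for with large enough logits — and watching whether the actor learns to do exactly that is the kind of explainability signal mentioned above.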