If a person decides to throw their life savings into buying lottery tickets and then wins (assuming the person isn’t buying an appreciable percentage of the float for some mad arbitrage opportunity), does that make it a good decision?

According to proponents of reinforcement learning (in its current state), intelligence can be solved by using reward signals to shape policy. Certainly most humans’ internal reward signal would go off big if they won the lottery; you can imagine the huge flood of dopamine, the euphoria, and of course the extrinsic reward signal of a huge bankroll (though is getting money really an extrinsic reward?). We can all agree that reinforcing such behaviour to the point of shaping policy to repeat it will most likely end poorly. RL-is-mimetic proponents will argue that the inevitable catastrophe from consistently repeating this behaviour will eventually shape the optimal policy. But feelings of regret, or sentiments along the lines of “that was incredibly dumb”, are sometimes independent of the outcome; I can definitely attest to that. We sometimes lack regret for actions that don’t lead to rewarding outcomes, and sometimes regret actions even when they lead to positive reward signals. I honestly wonder whether there are more people who don’t regret following their dreams even though it didn’t work out, or people who succeeded but still carry regrets. Evidently, independent of the reward, we are applying some form of discriminative filter when it comes to credit assignment.
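
To make that concrete with a toy example (the numbers, action names, and update rule below are invented purely for illustration, not any particular algorithm): a purely outcome-driven value estimate will rate “buy lottery tickets” as a great action the moment it happens to pay off, even though its expected value is terrible.

```python
import random

# Toy sketch: an outcome-driven value estimate for a 'lottery' action that
# almost always loses a little (-1) but very rarely pays out hugely (10,000).
# Its expected value is negative, yet a single lucky win makes the estimate
# look far better than the steady 'safe' option. All numbers are made up.

def lottery_payout(rng):
    return 10_000.0 if rng.random() < 1e-4 else -1.0

rng = random.Random(0)
value = {"safe": 1.0, "lottery": 0.0}   # running value estimates ('safe' kept fixed as a reference)
lr = 0.1                                 # step size for the update

# Many ordinary pulls: the lottery estimate settles near its usual -1 outcome.
for _ in range(100):
    value["lottery"] += lr * (lottery_payout(rng) - value["lottery"])

# Now force the rare jackpot and apply exactly the same outcome-driven update.
value["lottery"] += lr * (10_000.0 - value["lottery"])

print(value)
# e.g. {'safe': 1.0, 'lottery': ~999}: judged purely by the reward it just
# produced, buying the ticket now looks like the thing to repeat.
```

Whether the agent later unlearns this depends entirely on sampling more outcomes, not on any judgement that the action was dumb in the first place.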

Discrimination on the bodily platform is objectively dumb. But discrimination in the sense of having an internal model of what is the right thing to do and what isn’t, one that remains unaffected by reward signals, is, in my opinion, a better candidate for the ‘necessary ingredient’ of intelligence than the concept of reinforcement learning. When it comes to humans, I also believe this discriminatory model should be cultivated from an objective source of truth. It should be the same for everyone, but that’s a discussion for another time.

It would be interesting to codify this: a discriminatory component for RL that may not itself be learnable, but that, based on the state, influences the policy. Though if the remainder of the actor model is learnable, I suppose it could always learn to ‘tune’ this discriminator out. Then again, being able to inspect whether it has done so would itself be telling about the alignment of the whole agent-environment system.
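
A minimal sketch of what I mean, in plain Python (every name, rule, and number here is made up for illustration, not a real library or a definitive design): a fixed discriminator scores actions from the state alone, and its score is added to the actor’s output after the fact as a hard veto, so no finite learned logit can cancel it.

```python
import math

# A fixed, non-learnable 'discriminator' combined with a learnable actor.
# Because the combination is an additive hard veto applied after the actor's
# output (rather than a learned weight), the actor cannot scale the
# discriminator's contribution away, though it can still shift probability
# mass among the permitted actions.

ACTIONS = ["save", "invest", "buy_lottery_tickets"]

def discriminator(state, action):
    # Fixed rule, independent of any reward ever observed: spending the whole
    # bankroll on tickets is never acceptable. Purely illustrative.
    if action == "buy_lottery_tickets" and state["savings"] > 0:
        return float("-inf")   # hard veto
    return 0.0                 # no opinion

def policy(state, actor_logits):
    # Add the fixed discriminator scores to the learnable actor logits,
    # then softmax over the combined scores.
    scores = {a: actor_logits[a] + discriminator(state, a) for a in ACTIONS}
    m = max(scores.values())
    exps = {a: (math.exp(s - m) if s != float("-inf") else 0.0)
            for a, s in scores.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

state = {"savings": 50_000}
# Suppose the actor has 'learned to love' lottery tickets from a lucky win:
actor_logits = {"save": 0.2, "invest": 0.1, "buy_lottery_tickets": 5.0}
print(policy(state, actor_logits))   # the vetoed action gets zero probability
```

A hard veto is the crudest possible way to combine the two; a softer, differentiable penalty would be easier for the actor to optimise around, which is exactly the ‘tuning it out’ worry, and watching whether and how that happens is where the explainability angle would come in.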