Reinforcement Learning (RL) follows a simple premise: there is a world, which you can sense and interact with, and a measure of goodness or badness which you can influence through your actions. Everything else is variations on trial and error. The actual algorithms are intuitive and often provably optimal, given their assumptions. All of the complexity comes from determining the constraints, sensors, actions, and values — setup that happens before the actual computer science begins. A straightforward example would be a chess board — the state of the world is the arrangement of pieces on the board, and we can influence the world by making moves. That’s sensors and actions, but what about value? Value is a little more complicated, but there are a few reasonable choices: probability of securing a win in the given position, material advantage, positional advantage (bonus points for connected pawn chains!), and so on. Any of these, or all of them in combination, can be used to estimate the “value” of a particular board state to us.
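The simplest of those choices, material advantage, can be sketched in a few lines. This is an illustrative toy, not how any particular chess engine works; the piece weights are the conventional heuristic values (pawn = 1, knight = bishop = 3, rook = 5, queen = 9), and the board representation is invented for the example.

```python
# Conventional heuristic piece values; the king gets 0 since both sides
# always have exactly one.
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9, "K": 0}

def material_value(board):
    """Sum piece values over a board: uppercase squares are our pieces,
    lowercase are the opponent's, '.' is empty."""
    score = 0
    for square in board:
        if square == ".":
            continue
        value = PIECE_VALUES[square.upper()]
        score += value if square.isupper() else -value
    return score

# A toy position: we have a queen and a pawn, the opponent has a rook.
position = ["Q", "P", ".", "r", "."]
print(material_value(position))  # 9 + 1 - 5 = 5
```

A real engine would combine several such signals — material, position, win probability — into one scalar, but each one is just a function from board state to number, exactly as here.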
Having a value estimate is crucial because it tells us which actions to pursue. Without making value judgments about the state of the world, we have no way of making meaningful decisions through our actions. Algorithms cannot define value — it has to be defined for them by a human being. We decide what to reward the algo for and how, translating success into a numeric value, entirely on the basis of what we are motivated to accomplish. A research team wanted to win chess games, so they made AlphaZero want to win chess games. This kind of subjective value judgment underlies every number-crunching machine learning algorithm, indeed every application of computer science. The actor is informed by its creator.
A chess board is entirely visible to the computer model — it is directly accessible, and we can be omniscient about the state of the board. Most interesting problems involve worlds which are not directly accessible. The computer agent makes an action, then gets an observation, which is based on the state of the world but is not the state itself. We get partial information, or an indirect measurement, or a noisy signal. Here we start making assumptions about the state of the world, about how it is structured, about how the observations reflect the underlying reality. Even without knowing the world directly, we can still make value judgments based on the information we do have access to. Even if we’re not certain of our observations, we can still move toward good, or away from bad. Even if our sensors are unreliable and our actions are unreliable — if sometimes we get bad data about the world state, or an action happens that we didn’t want to do — that doesn’t stop us from being able to improve on random behavior. Maybe our final result will not always be very good, but it will be the best we can do, and it will be better than giving up.
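One way to see how noisy observations still let us close in on a hidden state is a simple Bayesian belief update — a minimal sketch, not any particular RL algorithm, with a made-up two-state world and a sensor that reports truthfully only 80% of the time:

```python
import random

random.seed(0)

TRUE_STATE = "good"      # hidden; the agent never reads this directly
SENSOR_ACCURACY = 0.8    # the observation matches the state 80% of the time

def observe():
    """A noisy sensor: returns the true state with probability 0.8,
    and the wrong state otherwise."""
    if random.random() < SENSOR_ACCURACY:
        return TRUE_STATE
    return "bad" if TRUE_STATE == "good" else "good"

# Start with no information: 50/50 belief over the hidden state.
belief_good = 0.5
for _ in range(50):
    obs = observe()
    # Likelihood of this observation under each hypothesis.
    lik_good = SENSOR_ACCURACY if obs == "good" else 1 - SENSOR_ACCURACY
    lik_bad = 1 - lik_good
    # Bayes' rule: update the belief in proportion to the likelihoods.
    numerator = lik_good * belief_good
    belief_good = numerator / (numerator + lik_bad * (1 - belief_good))

print(round(belief_good, 3))  # after 50 noisy readings, very close to 1.0
```

No single observation can be trusted, but each one nudges the belief in the right direction, and the nudges compound — which is all "better than random" requires.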
When dealing with unreliable systems, information we can’t always trust, or a problem liable to change underneath us, it’s not helpful to try to achieve 100% correctness. There is an optimal way to play chess that we could discover by machine if we worked hard enough; there is no optimal way to play poker. But there is a best way to play poker, and we can find that — a strategy that might not get the best result every time, but will tend to get better results than anything else in the long run. RL has a class of algorithms which deal with situations like this, called PAC algos, for Probably Approximately Correct. Formally, these algos are defined mostly by two parameters called δ and ε, both very small. Each algorithm guarantees that, with probability at least (1-δ), it will end with a value within ε of the optimal value possible. Setting the specific values of these two parameters is up to the discretion of the person applying the algorithm, and the smaller they are, the longer the algorithm generally takes to run. They are mathematical guarantees of things that can’t be fully guaranteed, promises that we are almost certain of the correctness of our assumptions about the world, even though we cannot know the world directly. Which brings us to Kant.
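The δ–ε trade-off can be made concrete with the simplest possible PAC-style guarantee: estimating the mean of a bounded reward from samples. The sample-size formula below is the standard Hoeffding bound — one elementary instance of a (δ, ε) guarantee, not any specific RL algorithm:

```python
import math
import random

def pac_sample_size(epsilon, delta):
    """Hoeffding bound: number of i.i.d. samples of a [0, 1] reward needed
    so the empirical mean is within epsilon of the true mean with
    probability at least 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

n = pac_sample_size(epsilon=0.05, delta=0.01)
print(n)  # 1060: enough samples for this (epsilon, delta) pair

# Tighten either parameter and the required samples grow fast:
print(pac_sample_size(epsilon=0.01, delta=0.01))  # 26492

# Empirical sanity check on a biased coin with true mean 0.3.
random.seed(1)
estimate = sum(random.random() < 0.3 for _ in range(n)) / n
print(abs(estimate - 0.3) < 0.05)  # True at least 99% of the time
```

Note how shrinking ε from 0.05 to 0.01 multiplies the cost twenty-five-fold — exactly the “smaller parameters, longer runtime” trade-off described above.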
Kant’s Critique of Pure Reason is built on the distinction between phenomena and noumena, which correspond nicely to observations about a hidden state and the existence of that state itself. To Kant, the sciences are purely concerned with phenomena — with explaining the world as we experience it — and only metaphysics grapples with noumena. We cannot make any kind of a priori judgment about the noumena based on our experience. Kant stresses this point because he only wants absolute truth, knowledge beyond a shadow of a doubt. He cannot stomach any PAC estimations — but we can, and do. In our effort to know things, we satisfy ourselves with somewhat-correct estimations of the actual underlying reality which informs our senses, and we work longer and longer until we build even better estimates, and so on. The laws of physics as we know them are not uncorrelated with noumena, they just aren’t infinitely precise. But infinite precision is a mathematician’s luxury — there’s no need even for computers to have 100% certainty before deciding something is true, let alone for us human beings. We are built with the capacity to handle uncertainty for a reason: so that we can pursue our goals in the world-as-it-is without omniscience.
And where do those goals come from? Machines cannot generate value systems, they must be programmed in by a creator. Something outside the system of action and observation has to give ethics to the actors which operate inside it. Something with an understanding not just of the actor but of the world in which they will act, the noumena, what makes it better, what makes it worse. Motivation comes from somewhere, even for base processes like entropic decay. Anything which acts does so for a reason given to it by something which constructs it, otherwise it would not act at all. If we accept that our value judgments are informed by understanding of the world-as-it-is, even if we do not have any understanding of that world ourselves beyond what we can observe, then our efforts to describe rules of nature through observing phenomena do depend on, and meaningfully inform us about the underlying noumena. They are not just a convenient axiom to allow for reason to exist, as Kant believes — noumena are things we interact with and which inform our observations, if they exist at all, and we can make plenty of claims about them, as long as we allow the caveat of PAC. Allowing uncertainty, we defeat it. Perhaps Kant would have understood this, had he spent more time in the gambler’s den.