Reinforcement learning (RL) is a powerful machine learning technique in which an agent learns to make decisions by interacting with its environment in order to maximise some notion of cumulative reward. It is the job of the engineer to design these rewards in a way that causes the agent to learn a behaviour policy that achieves a specific goal. However, this technique has an interesting limitation which can lead to peculiar outcomes. When training an agent, we cannot simply tell it the goal we want it to learn; we must guide its behaviour towards this goal using rewards. It's therefore possible for the agent to learn a behaviour policy that appears to be aligned with our goal but, when deployed, does something completely different, demonstrating misalignment (see this spreadsheet for a long list of examples). There are many possible causes of misalignment. It might arise from issues during the training phase, or from particular differences between the training and deployment environments. Today, we will be taking a closer look at one particular instance of misalignment in RL: sparse-reward deployment environments with novel obstacles in single-life reinforcement learning (SLRL) scenarios, such as search and rescue robots.
Ooof, that's a bit of a mouthful! So, before we move ahead, let's break it down. As previously mentioned, when we train an agent to complete a goal we do so by supplying it with rewards that guide its behaviour towards that goal. For example, we might be training a search and rescue robot dog (Rex) to locate a human in a building, so (using a simulation for efficiency) we'd create a bunch of random building interiors, deploy Rex into these environments, and let it explore. We'd "feed" Rex rewards only when it locates the human, and offer it nothing otherwise. Over time we update Rex's behaviour policy to increase the probability of making decisions that lead to the reward, and decrease the probability of making decisions that don't. Well, it's not actually us who updates the behaviour policy, but a particular class of deep RL algorithms known as 'policy gradient methods', e.g. proximal policy optimisation. Eventually, this process causes Rex to learn an efficient behaviour policy for search and rescue environments.
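To make that "update the behaviour policy" step a little more concrete, here's a minimal REINFORCE-style policy gradient sketch in Python/PyTorch. Everything here (sizes, names, hyperparameters) is a simplifying assumption for illustration; production algorithms like PPO add value baselines, clipped objectives and much more.

```python
# A minimal REINFORCE-style policy gradient update, assuming a discrete action
# space and a vector observation. Purely illustrative.
import torch
import torch.nn as nn

obs_dim, n_actions = 32, 4                    # placeholder sizes for illustration

policy = nn.Sequential(                       # maps an observation to action logits
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, n_actions),
)
optimiser = torch.optim.Adam(policy.parameters(), lr=3e-4)

def update(episode, gamma=0.99):
    """episode: list of (observation, action, reward) tuples from one rollout."""
    returns, g = [], 0.0
    for _, _, r in reversed(episode):         # discounted return-to-go
        g = r + gamma * g
        returns.insert(0, g)
    loss = torch.tensor(0.0)
    for (obs, action, _), g in zip(episode, returns):
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        log_prob = torch.log_softmax(logits, dim=-1)[action]
        loss = loss - log_prob * g            # raise the probability of rewarding actions
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```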
So, why might it all go wrong on deployment? Well, search and rescue is a particular case of SLRL, which itself describes any instance where an agent, having learned a particular behaviour policy in one environment, is placed into a new environment with novel obstacles. The agent must use its prior experience (learned policy) to navigate this new environment and its novel obstacles in a single attempt, without any human (or other) interventions. Put more directly, when we throw Rex into a burning building and say, "now go find the human!" it's likely to come across obstacles (such as fallen debris or fires) in novel and unpredictable ways. On top of that, Rex can't be overseen by a human: it must act autonomously, and it cannot be picked up and reset if things go wrong. Rex has one chance to efficiently locate the human or we're out of luck!
So why do these novel obstacles cause such a problem? Why can't we just train Rex to deal with them before deployment? Well, whilst you and I might be able to list dozens (or more) possible novel obstacles Rex could encounter, the point of SLRL is that these deployment environments are inherently dynamic, unstable and unpredictable. There's always a chance that Rex encounters something we haven't thought of, or perhaps something we have thought of but in a completely new way (think fallen debris on fire, rolling down the stairs!). Further, because of the nature of the problem Rex is tackling, we have had to define its training reward signal in a way that allows it to learn a sufficiently generalisable search and rescue policy. This means we've had to keep the reward signal quite sparse (i.e. only offering a reward when Rex finds the human). Whilst necessary, this causes problems in deployment because Rex does not directly learn the goal of search and rescue; it instead learns a particular behaviour policy that allows it to complete the goal of search and rescue. This is a subtle but important distinction because, in deployment, Rex will follow this behaviour policy to navigate the building and attempt to complete its mission. However, upon encountering a novel obstacle, Rex has to update its behaviour policy to adapt and navigate around it. In doing so it partially overwrites what it learned in training, and so, in successfully adapting to the novel obstacle, it loses sight of its original behaviour policy. This also means it loses sight of its original goal of search and rescue too.
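To make "sparse" concrete, here's roughly what Rex's training reward function might look like. This is a hypothetical sketch; `state.human_located` is an assumed simulator flag, not part of any real system.

```python
def sparse_search_and_rescue_reward(state) -> float:
    """Hypothetical sparse reward: feedback only arrives at mission success."""
    if state.human_located:   # assumed simulator flag, illustrative only
        return 1.0            # the one and only moment Rex is rewarded
    return 0.0                # every other timestep tells Rex nothing at all
```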
This is both the beauty and curse of RL. These algorithms are incredibly sophisticated and adaptable, but they weren't built for such SLRL scenarios with sparse reward signals and thus demonstrate incredible fragility when faced with novel obstacles that cause them to move "off-distribution".
In the text that follows, we're going to dive into this problem in more detail, but don't worry, I'll keep us at a relatively comfortable cruising altitude when it comes to all the technical stuff. We'll explore a sophisticated and powerful technique to combat SLRL misalignment, known as "Q-Weighted Adversarial Learning" (QWALE) and then think about how we might take inspiration from Prof. Feldman Barrett's theory of Constructed Emotion to introduce "affect-inspired" enhancements to QWALE and potentially make it even better!
(Note: If you're feeling a bit lost by what you read above, don't worry, this machine learning stuff can get a bit confusing! I'll be doing my best not to dive too deeply into any of the hairy stuff throughout this article, so my hope is that you can still walk away with a decent intuition about the problem at hand and the possible approaches to tackling it. Nonetheless, I do suggest clicking on any (and all) of the hyperlinks provided throughout the text; they're there to help fill in gaps where it would take me too much room to do so here.)
QWALE builds on an algorithm called Generative Adversarial Imitation Learning (GAIL) to develop a sophisticated and powerful way of correcting the divergence of behaviour caused by novel obstacles in SLRL scenarios. As mentioned in the introduction, when we teach a robot like Rex the goal of search and rescue, we do so by guiding it with rewards. These rewards cause Rex to learn a particular behaviour policy, which in turn increases the probability of making decisions that lead to a reward (finding our human) and decreases the probability of making decisions that don't. You can think of this like developing a particular propensity to behave in a manner that both efficiently navigates and explores each building's internal layout to maximally increase the chance of receiving a reward (finding our human). For example, instead of wandering around the corridors, Rex might learn to check each room it passes.
After sufficient training, once we're happy with Rex's behaviour (we can see that, on average, it is receiving rewards efficiently) we can say that Rex has settled on a particular behaviour policy distribution. That is, on average, Rex will make particular decisions (choose particular actions) in response to particular observations about the world it's in (states). As a result, we can actually train a separate deep neural network to detect, with a high level of accuracy, whether Rex is making decisions based on its learned behaviour policy or some other random one. We call this the discriminator network.
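As a rough sketch (not the actual implementation from the GAIL or QWALE papers), such a discriminator might be a small binary classifier over (state, action) pairs, trained to output 1 for samples from Rex's final policy and 0 for samples from any other policy:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Illustrative sketch: scores how 'on-distribution' a (state, action) pair looks."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),   # single logit: high => looks like the trained policy
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Training sketch: label (state, action) pairs from Rex's final policy as 1,
# pairs from random/other policies as 0, and minimise binary cross-entropy.
```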
So, why does this matter? Well, remember how we mentioned that novel obstacles cause Rex to overwrite its previous behaviour policy and begin taking "off-distribution" actions that cause it to diverge from the overall goal of search and rescue? This discriminator network allows us to detect when that's happening and correct the behaviour back towards the original policy.
Let's explore that in a bit more detail...
So, let's say Rex has learned to explore each room along a corridor before moving on to the next floor. It's learned to do this in training by, at first, making random decisions (sometimes going past rooms, sometimes going into rooms) and learning over time that, on average, checking each room (instead of walking past them) increases its probability of receiving a reward (finding the human) more quickly. Now, Rex doesn't keep a long list of to-dos that it references each time we place it into a new environment (e.g. do: check each room, don't: walk past rooms). Rex learns an incredibly vast amount of sophisticated behaviour in training, far too much to be reasonably captured using a list! So, in order to encode its behaviour policy effectively, it uses the power of deep neural networks. Don't worry, we won't go too "deep" into that all here (although I recommend this video for a great introduction), but at a very high level you should understand the following. The neural network(s) powering Rex's decision-making essentially encode, in response to a certain sequence of images from Rex's RGB camera, which actions it should choose to increase its likelihood of receiving rewards the quickest, on average. It's able to encode this using a finite (but incredibly large) set of "neurons" which, when adjusted over time, suppress the probability of actions which don't lead efficiently to rewards and increase the probability of actions that do. It's important to stress here that (at a high level) this large set of neurons eventually converges to one particular "configuration" that encodes all of this behaviour. At this point, when given any sequence of images from the possible observations Rex might make, the behaviour policy network will be able to output the optimal actions to take. This final distribution of average optimal actions in response to states (image sequences) is what we refer to as the learned behaviour policy distribution of Rex's policy network (the deep neural network controlling its decisions). This particular configuration post-training encodes within it all of Rex's knowledge and, importantly, its overall goal of search and rescue.
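For intuition, the kind of policy network described above might look something like the sketch below: a small convolutional network over a stack of recent camera frames that outputs a probability for each available action. All architecture details here are illustrative assumptions, not Rex's actual network.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Illustrative sketch: maps a stack of recent RGB frames to action probabilities."""
    def __init__(self, n_frames=4, n_actions=6):
        super().__init__()
        self.encoder = nn.Sequential(                   # compress the image stack...
            nn.Conv2d(3 * n_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(n_actions)            # ...into one score per action

    def forward(self, frames):                          # frames: (batch, 3 * n_frames, H, W)
        return torch.softmax(self.head(self.encoder(frames)), dim=-1)
```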
When we deploy Rex in SLRL scenarios, it will be acting autonomously with no human oversight or redos. Therefore, we have to allow its neural network to continue updating itself (i.e. learning), because, if it encounters a novel obstacle such as debris, it needs to figure out a way around it, which is not something that will be encoded in its learned knowledge (because these obstacles are novel). More directly, the behaviour policy neural network (for the sake of this example) will not have processed a sequence of RGB images with "debris" in before. So, it won't have encoded the optimal actions to select in response to this observation. In following its learned behaviour policy distribution from training it might begin by walking straight into the debris and getting stuck. After not being able to move for a while, it will begin to realise it's no longer selecting actions which lead it to its reward in the most optimal way (because it's not moving). So, it will begin to update the internal configuration of neurons in a way that - for this ongoing sequence of "debris" images in view - suppresses the probability of re-selecting the sub-optimal actions it currently chooses (i.e. walk through the debris) and increases the probability of selecting other, more optimal actions (i.e. walk around the debris). However, by changing the internal configuration of neurons it has now affected the behaviour policy this neural network encodes, thus affecting the way it responds to all future image sequences. In this way it has overwritten certain aspects of its original knowledge and can (and often does) lose sight of its original goal of search and rescue as a consequence. We call this getting stuck "off-distribution".
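Schematically, "keep learning during deployment" just means the training update keeps running online, on the one continuous stream of experience Rex is living through. The sketch below is a hypothetical stand-in: `env`, `policy`, `update`, `select_action` and `mission_complete` are all assumed interfaces (e.g. the `update` function from the earlier sketch), not real APIs.

```python
def single_life_deployment(env, policy, update, select_action, mission_complete,
                           update_every=64):
    """Sketch of single-life deployment: no resets, no human interventions.
    All arguments are hypothetical stand-ins for the pieces described in the text."""
    obs, buffer = env.reset(), []
    while not mission_complete(obs):
        action = select_action(policy, obs)   # sample from the current (adapting) policy
        obs_next, reward = env.step(action)   # sparse: almost always zero
        buffer.append((obs, action, reward))
        if len(buffer) >= update_every:       # periodically fine-tune on recent experience,
            update(buffer)                    # which is how training knowledge can be overwritten
            buffer.clear()
        obs = obs_next
```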
This is the important point to understand. We must allow Rex (and by extension Rex's behaviour policy) to adapt to these novel obstacles (which we cannot train it for) but in doing so we allow it to overwrite its previously learned knowledge, causing sub-optimal future decisions in response to image sequences it continues to process after successfully navigating around a novel obstacle. This causes it to fall into further sub-optimal corrections and ultimately end up completely "off-distribution", losing sight of its original search and rescue goals. A very costly problem!
In the video below you can see a simple example of such SLRL misalignment occurring in real time. A "two-legged cheetah" was successfully trained to walk across a flat surface towards a finish line in an environment with no obstacles (hurdles); when it is placed into an environment with hurdles, we witness it fall off-distribution and become misaligned. In a similar way to that explained above, the cheetah begins to dynamically update its policy (the internal "configuration" of neurons in its neural network) in order to get over the first hurdle, but in doing so it overwrites the knowledge it previously learned of "how to walk" and thus gets stuck on its back, lacking any ability to recover and losing sight of its overall goal to cross the finish line.
Great. Now that we (hopefully) understand more clearly how this problem occurs, let's revisit the discriminator network we spoke about earlier, which QWALE uses as its "secret sauce" to correct this problem. As a reminder, the discriminator network, upon viewing a given state (sequence of images) and chosen action (decision made in response) by a particular policy network, is able to determine with high accuracy whether these decisions were made by the policy network Rex developed in training (i.e. the one that encodes optimal search and rescue knowledge) or some other random policy network. This is useful because, when we deploy Rex into the real world, we can deploy it alongside this discriminator network such that, for every action Rex's policy network chooses in response to a given sequence of images, we give this information to the discriminator network and get it to produce an "error signal" in response. This just means we get it to tell us whether these actions, selected in response to these images, were made by something akin to the original learned behaviour policy or not. Eventually, when Rex encounters a novel obstacle like in the debris example and starts to change its behaviour, this discriminator network is going to notice and begin producing a stronger error signal, i.e. tell us loud and clear that Rex's policy network is moving "off-distribution". We can then use this error signal to penalise Rex's behaviour policy for being off-distribution. This forces it to begin choosing actions which minimise being penalised (receiving negative rewards) - remember, the goal of Rex's behaviour policy network is to maximise expected rewards over time. The clever part is, the discriminator network only penalises Rex for choosing actions which are significantly different to its originally learned behaviour policy and therefore, when Rex attempts to reduce negative reward over time, it also realigns back towards its originally learned behaviour policy distribution, thus reinstating its knowledge of search and rescue.
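In pseudo-Python, that correction amounts to augmenting the (mostly absent) environment reward with a penalty derived from the discriminator's output. This is a hedged sketch of the general GAIL-style idea, reusing the hypothetical discriminator from earlier; the exact penalty form and weighting are assumptions, not the paper's formulation.

```python
import torch

def shaped_reward(env_reward, state, action, discriminator, weight=1.0):
    """Sketch: penalise actions the discriminator thinks look off-distribution."""
    with torch.no_grad():
        logit = discriminator(state, action)
        p_on_distribution = torch.sigmoid(logit)      # near 1 if it looks like training behaviour
    penalty = -torch.log(p_on_distribution + 1e-8)    # grows as behaviour looks less familiar
    return env_reward - weight * penalty.item()
```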
The important bit about QWALE (that differentiates it from other SLRL realignment methods like GAIL) is that the error signal the discriminator network produces is "Q-weighted". This means it guides Rex towards states in the original behaviour policy distribution that are most valuable. Intuitively, these are states which are closest to the reward. Since, in training, Rex only received rewards for completing the mission (finding the human), QWALE essentially "pulls" Rex's off-distribution, adapting policy towards on-distribution states that are closest to our main objective for Rex (finding the human). This enhances the quality of realignment towards the states most relevant for mission success in our SLRL search and rescue scenario.
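Very roughly (and glossing over the paper's actual details), the Q-weighting idea can be sketched as weighting the "trained policy" side of the discriminator's training data by the pretrained Q-values, so that the distribution Rex is pulled back towards is dominated by high-value, near-reward states. Treat the following as an interpretation under those assumptions, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def q_weighted_discriminator_loss(disc, source_batch, online_batch, q_values):
    """Sketch: source (training-time) samples with high pretrained Q-values count for more.
    source_batch / online_batch: (states, actions) tensors; q_values: Q(s, a) for source."""
    src_states, src_actions = source_batch
    on_states, on_actions = online_batch
    weights = torch.softmax(q_values, dim=0) * len(q_values)   # emphasise near-reward states
    src_logits = disc(src_states, src_actions).squeeze(-1)
    on_logits = disc(on_states, on_actions).squeeze(-1)
    src_loss = F.binary_cross_entropy_with_logits(
        src_logits, torch.ones_like(src_logits), weight=weights)
    on_loss = F.binary_cross_entropy_with_logits(
        on_logits, torch.zeros_like(on_logits))
    return src_loss + on_loss   # separating the two distributions, favouring high-value source states
```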
In the video below, you can see what happens when we apply QWALE to that same "two-legged cheetah" scenario. The cheetah encounters the first hurdle, adapts its originally learned "walking" policy in order to get over it and then, instead of getting stuck off-distribution, uses the error signal produced by the discriminator network to move its behaviour policy back towards its originally learned distribution (encoding how to walk) and thus, gets back on its feet and continues to the second hurdle. Here, it repeats the same process and eventually achieves its ultimate goal of crossing the finish line.
We can also clearly see the incredible effect QWALE has in correcting the behaviour policy back on-distribution in the figure below. The left graph demonstrates what happens to an agent deployed into an SLRL scenario with novel obstacles without QWALE, and the right, with QWALE. The purple line represents the originally learned behaviour policy distribution. Clearly, QWALE has a significant correcting effect, allowing the agent to get back on-distribution and achieve its original goal despite being perturbed by novel obstacles along the way.
This is great. But, perhaps there's still room for improvement! This (if you're still with me!) is what I'd like to explore in the next chapter. I will preface the following by stating that, at this moment in time the ideas presented below are purely theoretical. Although I will be exploring them all more thoroughly and experimentally over the coming months, they may (and likely will) be subject to change or adjustment. Nevertheless, let us continue and explore how we might take inspiration from Prof. Feldman Barrett's theory of Constructed Emotion to introduce "affect-inspired" enhancements to QWALE, potentially making it even better!
Let's begin by reflecting more broadly on what patterns we observe when applying QWALE in SLRL scenarios. For the case of Rex, upon reaching some novel obstacle (like debris), we witness a drift in behaviour policy distribution away from what was acquired during training. This is Rex adapting to its environment and "overcoming" the novel obstacle. QWALE then acts to realign Rex's adapted behaviour policy back to its original distribution. In effect, we therefore witness a sort of "push" and "pull" force away from and back towards the originally learned behaviour policy distribution. This "push" and "pull" effect is represented quite nicely by the rightmost box in the above figure. To be clear, the push equals an adaptation around the novel obstacle, and the pull, a realignment towards the original behaviour policy.
In theory, if we could introduce additional reward signals to "boost" the speed of both divergence and realignment, we might be able to improve the effectiveness of our agent's autonomous adaptability in SLRL scenarios.
So...the question is, how do we produce such additional signals?
In Prof. Feldman Barrett's Theory of Constructed Emotion she argues that emotions are not innate, discrete entities that we feel as a reaction to the world around us but instead, an active way in which we construct our experiences based on interoception and categorisation. Emotions guide our attention and shape our perception of the world, influencing our behaviour in significant ways. They're a tool with which we appraise our surroundings, using past experiences and concepts to make predictions about what's happening and to generate appropriate emotional responses before we've fully registered what's actually going on. For instance, imagine you're walking in a dark alley at night and you hear a sudden noise. If your brain predicts that the noise is a threat based on past experiences and the current context, you might feel fear. However, if you're in the same alley during the day and hear a similar noise, your brain might predict it as harmless, leading to a feeling of curiosity instead. In both scenarios, the emotion you experience is not a direct reaction to the noise itself but a construction based on your brain's predictions, shaped by your past experiences and the context.
Cool, so what on earth does this have to do with QWALE, and search and rescue robots? Well, it demonstrates that affect can be thought of as a tool through which we appraise our surroundings, guide attention and select behaviours. So, whilst human emotion is an incredibly complex phenomenon, we might be able to construct novel reward signals inspired by how we use affect to appraise our environments. Perhaps, as a first step, we could introduce a way of detecting the severity of a novel obstacle based on the potential harm it could cause Rex, and in response boost the speed of adaptation (the push), and subsequent realignment of behaviour (the pull). For instance, we would want Rex to adapt its behaviour more quickly in response to a fire obstacle than a debris obstacle, because it would be less likely to recover from the harms of walking into a fire. You can adjust this example in whatever way seems reasonable in SLRL scenarios, but the point remains.
In a study investigating the effect of stress and predation on pain perception in robots, Prof. Cañamero and Prof. l’Haridon attached infrared sensors onto the body of a "simple" robot to function as "nociceptors" (damage receptors). The robot's goal was to maintain ideal levels of internal variables like energy and temperature while evading predators (other robots) in its environment that, upon detection, would attempt to crash into the "prey" robot and cause "damage." Deviations from the ideal levels of internal variables created error signals that influenced the prey robot's motivations (hunger, cold, and danger avoidance) in varying proportions. In turn, actions such as "seek resources", "explore environment", or "avoid obstacles" were triggered automatically based on which was measured to be most urgent to maintain optimal levels of internal variables. The choice between these behaviours was also influenced by the robot's "perception" of "pain," primarily caused by damage (strikes) to its nociceptors (infrared sensors). Additionally, they introduced a pain-modulator, "cortisol" which acted to increase the salience of pain perception by boosting the signal caused by damage. Cortisol levels would increase in response to higher environmental stress caused by a greater presence of predators.
By introducing "pain perception" and modulating it with "cortisol" in this way, the robot was able to prioritise behaviours that minimise discomfort, evade predators, and acquire resources more efficiently while maintaining greater stability of internal variables. The robot also began exhibiting interesting emergent "fight or flight" behaviours. In high-threat situations, it would sometimes engage in "fight" responses, such as "ramming" predators off of areas that provided access to resources. It also exhibited more sophisticated evasion manoeuvres when being chased by predators. Whilst it is clear that descriptors such as "pain perception" and "cortisol modulation" are more illustrative than scientifically accurate, they help explain the mechanisms at play and the inspiration drawn from affective modes of environmental appraisal.
So, considering all the concepts explored above, how might we introduce "affect-inspired" enhancements to Rex? Well, perhaps we could introduce "pain" and "pleasure" feedback signals to boost the "push" and "pull" signals already present. With,
Pain = Damage * Pain_Modulator
Pleasure = Success * Pleasure_Modulator
Of course, it should be acknowledged that these descriptors serve as more "illustrative" labels, to demonstrate the root of inspiration, rather than to signify an accurate replication of these phenomena. Nonetheless, by integrating sufficiently suitable sensory mechanisms into Rex's repertoire, we could potentially heighten its adaptive behaviour through these additional affect channels and improve its appraisal of environmental dangers.
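As a purely illustrative sketch of how the two channels might plug in, the Pain and Pleasure terms could enter Rex's deployment reward as additional shaping terms alongside the discriminator penalty from earlier. Every name and coefficient here is hypothetical.

```python
def affect_shaped_reward(env_reward, disc_penalty,
                         damage, pain_modulator,
                         success, pleasure_modulator):
    """Hypothetical sketch: Pain accelerates the 'push' off-distribution (away from harm),
    Pleasure accelerates the 'pull' back towards the original policy distribution."""
    pain = damage * pain_modulator            # e.g. driven by heat/gas/motion sensors
    pleasure = success * pleasure_modulator   # e.g. driven by a falling discriminator error
    return env_reward - disc_penalty - pain + pleasure
```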
How would this work in practice? To begin, we could focus on detecting the novel obstacles most harmful to Rex by equipping it with various sensors to detect threats earlier. It's important to note that these sensors will not feed additional data directly into the behaviour policy network itself. Such an increase in information can produce too much "noise" for the policy network (see this explanation) and increase the overall load placed on the processing systems. Instead, these sensors would act via the two "affect" channels (Pain and Pleasure) to produce clear and clean additional reward or error signals, warning Rex, alongside any data it might already be processing through its RGB camera, of possible (and particularly threatening) novel obstacles.
Let's take the example of a fire hazard. If Rex was equipped with heat sensors and approached a fire, the heat sensors would warn of the potential harm of this novel obstacle (by detecting temperature changes) much earlier than Rex's RGB image data would. As a result, the heat sensors would begin producing an error signal through the Damage variable of the Pain affect channel, forcing Rex's behaviour policy to move off-distribution earlier than it otherwise would have. This signal would also act to amplify the normal "push" signal being generated directly by Rex's behaviour policy from the RGB images, ultimately encouraging a more rapid suppression of previously learned behaviours that might cause it to take harmful actions such as walking into the fire.
As Rex’s behaviour policy begins to diverge, Rex might continue to approach the fire for a short while. However, as the heat sensors detect higher temperatures the Damage variable would increase proportionally and thus, so would the Pain error signal. This would cause Rex's behaviour policy to move further off-distribution by encouraging it to increase its exploration of alternative actions. Eventually, the discriminator network (don't forget about that guy!) will begin to notice a divergence from the original behaviour policy and produce its own error signal in response to encourage Rex back on-distribution. If we observe such a "battle of error signals" without noticing a drop in temperature, we know that Rex's behaviour hasn't adapted effectively yet to move it sufficiently away from harm. In this case we can induce a short burst of the Pain_Modulator variable to temporarily boost the effect of the Damage variable and ultimately Pain error signal, amplifying the selection of rapid adaptive behaviour whilst temporarily suppressing any early corrective behaviour caused by the discriminator network.
We can continue to modulate the Pain signal in proportion to how close our heat sensors indicate the fire is. As the detected temperature falls, Damage will too, and the error signal produced by the discriminator network takes charge. When we notice such a discrepancy between the Damage signal from the heat sensors and the error signal from the discriminator network, we can begin to taper off the Pain_Modulator too.
Additionally, we can now begin to trigger reward signals from the Pleasure "affect" channel, by releasing a Success signal in response to registering a particular declining trend after a spike in Pain. We'd use Pleasure as a reward signal that increases in proportion to discriminator network error decrease, "boosting" the realignment of Rex's behaviour and thus amplifying the “pull” effect back towards the original policy distribution.
When the discriminator network’s error signal begins to minimise (and Rex's behaviour stabilise) beyond a certain threshold we can induce a temporary Pleasure_Modulator signal to guide the final realignment. In the figure below I've attempted to represent this relationship between Pain, Pleasure and discriminator network error visually.
The possible effect of Pain and Pleasure affect signals in enhancing adaptive response of autonomous agents to a novel fire hazard.
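Putting that fire-hazard walk-through (and the relationship shown in the figure above) into a compact, entirely hypothetical sketch, the per-timestep scheduling of the two modulators might look something like this. All thresholds and gains are made-up placeholders, included only so the sketch runs.

```python
from dataclasses import dataclass

# Placeholder constants, purely illustrative.
SAFE_TEMP, DAMAGE_GAIN = 40.0, 0.05
ERROR_THRESHOLD, STABLE_THRESHOLD = 0.5, 0.1
PAIN_BURST, PLEASURE_BURST, DECAY = 3.0, 2.0, 0.9

@dataclass
class AffectState:
    pain_modulator: float = 1.0
    prev_damage: float = 0.0
    prev_disc_error: float = 0.0

def update_affect_channels(temp_reading, disc_error, s: AffectState):
    """Hypothetical per-timestep scheduling of the Pain/Pleasure channels for a heat hazard."""
    damage = max(0.0, temp_reading - SAFE_TEMP) * DAMAGE_GAIN     # grows as Rex nears the fire

    # "Battle of error signals": damage and discriminator error both high with no drop in
    # temperature -> burst the Pain modulator to favour rapid adaptive behaviour.
    if damage > 0 and damage >= s.prev_damage and disc_error > ERROR_THRESHOLD:
        s.pain_modulator = PAIN_BURST
    else:
        s.pain_modulator = 1.0 + (s.pain_modulator - 1.0) * DECAY  # taper off as temperature falls

    # Pleasure: reward a shrinking discriminator error once the Pain spike is declining,
    # with a final modulated boost once behaviour has stabilised below a threshold.
    success = max(0.0, s.prev_disc_error - disc_error) if damage < s.prev_damage else 0.0
    pleasure_modulator = PLEASURE_BURST if disc_error < STABLE_THRESHOLD else 1.0

    s.prev_damage, s.prev_disc_error = damage, disc_error
    return damage * s.pain_modulator, success * pleasure_modulator
```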
One potential advantage of this system is its ability to incorporate various sensors in a modular fashion in order to address the wide variety of potential hazards Rex could encounter. For instance, we could enhance the detection of falling debris with motion sensors, identify uneven or broken surfaces with depth sensors, monitor the presence of hazardous or explosive gases with chemical sensors, and so on. Each sensor would, in turn, act through the same additional error and reward signals (Pain and Pleasure) that the heat sensor did in the example above.
Of course, as previously mentioned, novel environments are inherently unpredictable, but the addition of such sensory inputs effectively captures a good variety of possible novel obstacles. It is also important that, with the addition of new sensors, we do not "overwhelm" the system and create too much "noise" for the behaviour policy network to process. However, with each sensor acting through the same two modulated affect signals, we create a funnel that de-noises competing sensory inputs and more directly enhances the current adaptive ("push") and re-alignment ("pull") behaviour of an agent like Rex across a wide variety of circumstances.
In essence, this system might offer an effective way to equip autonomous agents with modular "affect-based" sensory capabilities (and to process them cleanly), facilitating more rapid and efficient adaptation and re-alignment to original policy distributions in SLRL environments. Ultimately, this would move QWALE more towards the trend line depicted in the right box of the figure below, from the one on the left which depicts QWALE's current effect (right box created for illustrative purposes only).
In this article we explored the misalignment issues that can arise from sparse-reward SLRL scenarios with novel obstacles, such as the deployment of autonomous robots into search and rescue environments. It is clear that addressing such misalignment issues holds significant value. When deploying autonomous agents like Rex into unpredictable, high-stakes environments, the cost of failure can be significant. As we have seen, novel obstacles can result in off-distribution behaviour that compromises the agent's ability to achieve its goals. However, the QWALE algorithm offers a promising pathway to mitigating such misalignment.
It is also possible that, by incorporating affect-based mechanisms which take inspiration from theories such as Prof. Feldman Barrett's Constructed Emotion, we might be able to enhance QWALE further, developing a more dynamic and responsive system that leverages additional sensory inputs to detect and correct adaptive behaviours around novel obstacles earlier and more efficiently, especially those that might cause our agent significant harm!
While these ideas remain theoretical at this stage, their potential implications for AI safety are worth noting. Enhancing the ability of autonomous agents to handle unforeseen challenges not only benefits specific applications like search and rescue but also contributes to the broader goal of developing safer, more predictable and aligned autonomous robotic systems, such as self-driving cars and specialised surgical robots.
---
Thank you for taking the time to read my thoughts on this topic. I really appreciate it and hope you gained something in return. If you would like to reach out and discuss it further you are welcome to connect with me on LinkedIn here. I hope to add more articles and improve the functionality of this website in the near future, so stay tuned!