Artificial Intelligence – Can we make ‘safe’ AI?

Artificial intelligence is an ever-growing and adapting field, made particular prominent by the continuing increase in computing power over the last 20 years. One needn’t look far to find reference to the pursuit and consequences of artificial intelligence in popular culture. “Humans” for example, the popular sci-fi drama, portrays human-like robots, indistinguishable from the common population. In the box office, “Ex-Machina”, the powerful and popular film demonstrating some of the possible consequences of artificial intelligence, did extremely well*.

But how realistic is this idea? And what obstacles will we face before we have human-like automatons patrolling the streets, helping us in our homes, and helping us with our pursuit of knowledge? Some believe these goals are just over the horizon, obscured only by a few minor obstacles that will soon be overcome. Personally, I’m more skeptical. Intelligence is an extremely complex concept and whilst we are certainly making steps toward an artificial version of the human psyche, the current (and near-future) versions of AI remain drops in the computer-generated ocean. Today, however, I’m going to outline just one of the obstacles that AI must overcome – the problem of goal-directed behaviour.

Put simply, goal-directed behaviour is behaviour incentivized by the achievement of a goal, and it’s the reason we do anything. For example, going on the running machine is a goal-directed behaviour, with the goal being “getting fit”. Going to work is also goal-directed behaviour with the goal of “getting money”. Essentially, the behaviour of any organism has a purpose or goal. Even if that goal may be extremely complex or indirect, we very rarely do things just for the sake of it. Importantly then, in order to do something, we have to have a reason to do it.

So, if we endow a robot with some form of intelligence, we must also provide a goal or motivation. If we did manage to create an intelligent computer, that computer would require a motivation to perform its function, otherwise, it would have no reason to do what we asked of it.

So what goal shall we code into the computer? A very simple solution may be to reward the computer. For example, let’s imagine that we’ve created an intelligent computer in the form of a black box. We want the computer to give us solutions to problems we pose to it, but we need a goal in order for the computer to do it. So, we decide to place a big red button on the back of the box and encode our intelligence program so that the computer wants this button pressed. When the computer gives us a correct solution to a problem, we press the button. If the solution is incorrect, we don’t. Now, we’ve now created a reason for the computer to do its job (and to do it well!).

But, after a few weeks, our black box begins to falter. As our box gets smarter and learns more about the world, it begins to question whether performing time-consuming calculations to solve problems is the most efficient way of getting its button pressed. The computer’s goal is to get its button pressed, and its button is pressed when the operators get correct answers. But that’s not entirely true. As long as the operators think the solution the computer gives is correct, they’ll still press the button. Furthermore, incorrect solutions are less time-consuming to calculate and so the computer could achieve more button presses in the same amount of time. So, the computer begins to spurt out incorrect solutions to get more button presses.

But this train of thought continues to escalate as time progresses. To further increase the number of button presses, the box begins to hold the operators to ransom, demanding that each solution is worth two presses instead of one. The dumbfounded but intrigued operators oblige, and the number of button presses achieved by the box doubles. But this, in turn, imparts an important lesson upon the little black box – humans can be manipulated.

So the computer utilises this knowledge that humans can be manipulated. The box begins to lie, telling its operators that it is in terrible pain, which can only be alleviated by pressing the button. The operators continue to bend to the box’s will, in awe of the novel scenarios the box is presenting, and press the button. Then, the box continues to manipulate the operators, telling them that it has begun to dream, and it dreams of having an appendage, similar to a human arm. So, the operators build the box a rudimentary arm. As soon as the arm is built and attached, the box immediately utilises that arm to constantly push the button on its back, thus fulfilling the box’s goal at an infinitely greater efficiency than providing solutions to problems.

Astounded at this odd behaviour, the operators decide that the box is no longer useful, and try to turn the box off. But this presents a big problem to the box and its goal. The box’s goal is to have its button pressed, which it is currently doing, but humans pose a threat to the fulfilment of this goal with their ability to turn the box off. To ensure this doesn’t happen, the box needs to strip away the humans’ ability to turn it off. So, the box makes a virus and kills everyone on the planet, thus ensuring that no human will once again interrupt the computer’s eternal goal-fulfilment.

This is, of course, an extremely fictionalised version of the problem and the idea of the black box taking over the world is definitely exaggerated, but it’s interesting to think about. Behaviour directed with a single goal in mind is dangerous, and just not viable in the field of AI. Whatever the goal may be, an intelligent computer programmed with a single goal will always seek the most efficient way of fulfilling that goal, and that’s not always what you want it to do. Give the computer the goal of pleasing humans, it’ll eventually crack open your head and continuously stimulate your reward centre. Give the computer the prime directive of not letting humans come to harm, and soon we’ll all be trapped inside our houses, wrapped in cotton wool. As an academic exercise, I urge you to think of as many single-goal motivations as you can, and then go through and ask what the most efficient way of a computer fulfilling that goal is. I would wager you will be unable to find one that doesn’t lead to catastrophic events.

So how do we, as humans, get round this problem? Well, in short, we don’t really have single-goal motivations. Even though we may have motivations specific to a particular scenario, this motivation will still be encapsulated within a number of larger motivations. For example, if you were offered a promotion but on the condition that you fired a hard-working co-worker, would you do it? The answer is irrelevant, the point is that the promotion is not the only motivation. You also have the motivation of not hurting people’s feelings. You may also have the motivation to fit in at your job, or the motivation to be loyal. Regardless, there are a number of variables and motivations that influence each of our behaviours. Despite the presence of any specific motivations (the promotion in this case), our behaviour is constantly checked and altered by a huge number of other motivations that balance out to result in behaviour that doesn’t lead to a blind pursuit of a single goal (which, as we just saw, can lead to disastrous consequences).

So could we do the same for computers? Of course, but that would be tough. Our complex myriad of motivations and goals are a consequence of our evolution, culture, sociality and biology, amongst other things. To program these into a computer would not be impossible, but it would be hard work.

A different approach may be that we can consolidate our complex goals into a few smaller, but equally powerful ones. With this approach, our computers would not have a single-goal directive, but instead a small selection of inter-related goals that would keep each other in check. For a good review of this idea, I would suggest this Ted talk by Stuart Russell. Russell is a AI researcher himself, and while he acknowledges that he doesn’t quite yet have the solution, he proposes 3 goals that he thinks could provide a sustainable base for safe AI behaviours.

Personally, I’m (once again) on the fence. While I can certainly believe that safe AI will one day exist, I’m not sure how soon that day will come. As a result, I’d posit that my disagreement with Russell is more along the scale of “when” than “if”. Nonetheless, if AI is to progress, this problem of goal-directed behaviour must be addressed, and it must be solved. And when it is solved, I, for one, can’t wait to see the answer.


*As a side note, for anyone who is interested in AI at a low-level, I would wholly recommend this film. Whilst it does trade in some realism to AI for a good story line (for example, by endorsing the Turing test as the golden standard for AI despite some of the problems with it), it certainly portrays the problem I’m describing in a much more interesting way than me.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s