Case Study: GitHub Copilot and the deceiving ladder

By on 7 March 2024

I recently gave a workshop at the Developer Open Space un-conference about GitHub Copilot and how developers can benefit from these AI assistants. I was fairly new to Copilot myself, having only tried it for a few months. But I was already impressed by its often spot-on suggestions, its understanding of context, and its rich feature set.

At the end of the course, a participant asked me an interesting question about how to approach a typical extreme value or optimization task from school mathematics. I had a gut feeling that this type of task would be challenging for an AI assistant, and it sparked a great discussion about the limitations of AI and problem-solving in general. I didn’t have time to think about it back then, but now I can dive deeper. Come along and see how it works out.

So let’s see if GitHub Copilot can be used to solve mathematical puzzles. We will see that we need to learn to recognise the signs of when AI assistants are overconfident.

Figure: The mathematical problem of finding the highest point a ladder can reach while leaning against a cube.

The problem

You’ll find a good description of the problem here, along with several solutions, but don’t worry, you have to deliberately click to see the solution, so there’s no risk of accidentally discovering it. As with all good problems, the statement is very simple, but there is a lot to think about: There is a cube leaning against a wall, and a ladder leaning against the same wall, but at the same time touching the cube. The question is: how far up the wall can the ladder go when it has to touch the cube, so that we can get the ladder as close to the wall as the cube will allow. Ideally I would like to find a solution for every pair of ladder and cube.

In terms of variables, we have the length \( l \) of the ladder and the side length of the cube \( d \). What we want to know is the height \( h \) from the ground to the point where the ladder touches the wall. That shouldn’t be too hard, should it? Well, try to solve it yourself!

A first attempt

Well, GitHub Copilot or AI assistants to the rescue, then! Why bother with middle school mathematics? Long forgotten or never learned! Surely we’re not going to do the hard intellectual work! The whole point of AI is to assist us, isn’t it? So let’s ask GitHub Copilot to solve this problem for us.

GitHub Copilot can be used in many different ways, but the most satisfying and natural way is to simply write a comment asking for something and wait for Copilot’s suggestions. Alternatively, you can launch an inline chat prompt, which is also a nice way to go, and for most of my workflows it doesn’t feel too disruptive when I’m already in the editor window.

This article is not about how to get started with GitHub Copilot. There is plenty of material already available. You can follow the official instructions here and start with a test trial, or you can use similar services like plugins that use an OpenAI API developer key or other services like Codium. For the rest of this article, I will demonstrate my approach using GitHub Copilot in Visual Studio Code with the following extensions installed: GitHub Copilot and GitHub Copilot Chat. But the nice thing about the current landscape is that it doesn’t really matter which tool you use, as long as you start using it. There is a lot going on at the moment and different service providers are trying to dethrone each other every day, giving us developers a wide variety of providers and tools to choose from. In my personal opinion, GitHub Copilot offers one of the best integrations with Visual Studio Code and it has a very nice update cycle that regularly delivers new and helpful features (you should not miss the official YouTube channel, which is a great source of newly available features).

That said, let’s see what GitHub Copilot does if we ask it in a very simplistic way:

As you can see in the screenshot, GitHub Copilot suggests a solution (hence the grey color) based on our input. The suggested solution is

def max_ladder_height(l, d):
    return (l**2 - d**2)**0.5

Not bad! But wait… this solution is so simple that even without having studied the problem in detail, we might start to get suspicious. And we would be right to do so, because this is a Pythagorean equation that assumes \( l^2 = h^2 + d^2 \), which basically says that the squared length of the ladder is equal to the sum of the squared height and the squared side length of the cube. But this is wrong, because the side length of the cube is not the whole distance between the ladder and the wall. This would only be the case if the ladder leaned directly against the cube, but then there would be no point of contact with the wall (and no triangle), the ladder would stand vertically like a candle! No, this equation, although stated with great confidence, is simply wrong.
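A quick way to see the problem is to take Copilot’s formula at face value and ask where the foot of the ladder would have to stand. The following is only a minimal sketch: `foot_distance` is a hypothetical helper introduced for illustration, and the values \( l = 3 \), \( d = 1 \) are just an example.

```python
import math


def max_ladder_height(l, d):
    """Copilot's first suggestion, copied from above."""
    return (l**2 - d**2) ** 0.5


def foot_distance(l, h):
    """Distance between the wall and the ladder's foot (hypothetical helper)."""
    return math.sqrt(l**2 - h**2)


l, d = 3, 1
h = max_ladder_height(l, d)
a = foot_distance(l, h)

# By Pythagoras the foot ends up exactly d away from the wall, i.e. at the
# outer bottom edge of the cube. A ladder standing there and leaning towards
# the wall would have to pass through the cube itself.
print(round(h, 3), round(a, 3))
```

In other words, the suggested height is only reachable if the cube is not in the way, which is exactly the situation the problem rules out.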

If you are interested in why GitHub Copilot came up with this solution, you can try out the chat feature and give it the same prompt. Here is the first part of the answer I got back from it:

“To solve this problem, we need to use the Pythagorean theorem. The highest point the ladder can reach is when the ladder is perpendicular to the ground. This is because the ladder’s length forms the hypotenuse of a right triangle, and the height is one of the sides. The maximum height is achieved when the other side (the base of the triangle) is minimized. In this case, the base of the triangle is the edge of the cube, so the maximum height is when the ladder is directly above the cube’s edge.”

GitHub Copilot Chat

Alternatively, if you do not want to think too much about the suggested solution, you can ask Copilot to generate some outputs for you. This will allow you to do some basic plausibility checks yourself:

Normally with GitHub Copilot and a prompt (comment) like this, you end up with suggestions coming in line by line. If you want Copilot to generate the whole block of code at once, there are better ways than relying on the inline suggestions. One alternative is the Completion Panel, which can be opened with Ctrl + Enter. This panel shows up to 10 different suggestions, offering a greater variety than the inline suggestions, and you can choose which suggestion you want to accept. In my case, I chose suggestion 3, which complied with my request that d be only one of two values. Funnily enough, Copilot directly lists the output of each function call as a comment in front of the call, so we don’t even need to run the code, right? Wrong, let’s see what happens in the console:

2.8284271247461903
2.23606797749979
3.872983346207417
3.4641016151377544
4.898979485566356
4.58257569495584

The output is different from the comments! What bad behavior, Copilot, shame on you! Giving me test cases with the wrong output is not a nice way to build trust between us. But honestly, I expected as much, so let’s move on.

To make the comparison easier, let’s put the predicted and the actual outputs of the function calls side by side:

| Function call | Suggested output | Actual output |
| max_ladder_height(3,1) | \( 2.828 \) | \( 2.828 \) |
| max_ladder_height(3,2) | \( 2.646 \) | \( 2.236 \) |

Note that in the second case, the suggested output of the function call is not the actual output of the function call, something you should always be aware of when working with Copilot.
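Rather than comparing comments and console output by eye, the same check can be automated. This is a small sketch, not part of the original session: the dictionary `claimed_outputs` simply holds the two suggested values from the table above, and the tolerance is an assumption chosen to match three printed decimals.

```python
import math


def max_ladder_height(l, d):
    # Copilot's first suggestion, copied from the article.
    return (l**2 - d**2) ** 0.5


# Values Copilot wrote into its comments, as listed in the table above.
claimed_outputs = {(3, 1): 2.828, (3, 2): 2.646}

mismatches = {}
for args, claimed in claimed_outputs.items():
    actual = max_ladder_height(*args)
    # Compare at the precision of the table (three decimals).
    if not math.isclose(actual, claimed, abs_tol=5e-4):
        mismatches[args] = (claimed, round(actual, 3))

# Only the call with l=3, d=2 ends up in the mismatch dictionary.
print(mismatches)
```

Turning generated comments into real assertions like this makes the discrepancy visible without having to trust either side.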

Back to the question of whether this code is actually a correct solution (which we have already concluded it is not). In the first example, a ladder of length 3 (metres, if you like) leaning against a cube of side length 1 can reach a height of 2.83. This sounds wrong. Consider that a ladder leaning against a cube with a side length of 1 metre would most likely have to be tilted a bit until it actually touches the wall, which means that it is not plausible to reach a height of 2.83 metres in this scenario.

The second example is even more problematic: A ladder of length 3 and a cube of side length 2. Let’s imagine this for a moment. You can do some drawing, but it will turn out to be completely impractical for a simple reason: the ladder must be at least twice as long as the side length of the cube, otherwise it will never reach the wall if it has to lean against the cube.

The (geometric) reason for this is that the cube divides the ladder into two segments: the upper segment, which touches the wall and the cube, and the lower segment, which touches the cube and the ground. Both of these segments are part of a right triangle with the cube and either the wall or the floor. In both cases, the ladder segment is the hypotenuse, that is, the longest side of the triangle (this is easier to follow if you look at the picture below, where the problem is shown in 2D space). Since each ladder segment is part of a triangle where the side length of the cube is one of the two shorter sides of the triangle, it follows that each segment must be at least as long as the side length of the cube. So, the whole ladder must be at least twice the side length of the cube.

So not only does Copilot generate code that does not solve the problem correctly, it also violates some important physical constraints, such as the ladder being at least twice the size of the cube’s side length. Of course, once we have derived this knowledge ourselves, it would be quite easy to make it part of our prompts (comments) and let Copilot fix the existing function by implementing some plausibility checks. But the main point of this case study is to show how much we can rely on Copilot without spending too much effort analysing the problem first and wasting time on prompt engineering.
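Such a plausibility check could look roughly like the following sketch. The function names are hypothetical, and the check only encodes the necessary condition derived above (ladder at least twice the cube’s side length); passing it does not by itself guarantee that a valid ladder position exists.

```python
def check_ladder_and_cube(l, d):
    """Reject inputs that violate the constraint derived above (hypothetical helper)."""
    if l <= 0 or d <= 0:
        raise ValueError("ladder length and cube side length must be positive")
    if l < 2 * d:
        raise ValueError(
            f"a ladder of length {l} cannot reach the wall over a cube of side {d}"
        )


def is_plausible(l, d):
    """Boolean wrapper around the check, convenient for filtering test cases."""
    try:
        check_ladder_and_cube(l, d)
    except ValueError:
        return False
    return True


# The first test case passes the check, the second one does not.
print(is_plausible(3, 1), is_plausible(3, 2))
```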

Deriving a working solution

I’ve already included a link to the solution in the problem statement, but because I just love maths, and because I love self-contained articles where you can get the whole story and all the information in one go, we’re going to derive the solution by hand. However, if you don’t want to see the maths behind the problem, all you need to follow this article is the final equation we derive at the end of this section.

There are several ways to solve this problem, but one elegant way is to strip away unnecessary information and think of this as a 2D problem (see the next picture). Imagine a common Cartesian coordinate system with \( y \) and \( x \) axes. In our case, the wall represents the \( y \) axis and the floor represents the \( x \) axis. The cube is no longer a cube but a simple square with a side length \( d \), and the ladder becomes a straight line that touches the square in a single point and intersects the axes. The point where the ladder touches the wall becomes the \( y \) intercept \( P(0, h) \). This is exactly the point we are interested in, because \( h \) gives us the height we want to know. The point where the ladder touches the floor becomes the \( x \) intercept \( P(a, 0) \). We do not yet know \( a \), but it represents the distance of the ladder from the wall so we keep this unknown variable for now. There is also a third point of interest that will be important, and that is the point where the ladder touches the cube: \( P(d, d) \).

Figure: The original problem in 2D. The ladder can be understood as a straight line intersecting both the x- and y-axis.

If we look at the figure above, we can see that \( a \) is the base of a right triangle, \( h \) the height, and \( l \) is the hypotenuse. Thanks to Pythagoras, we know that \( l^2 = a^2 + h^2 \). But we don’t know \( a \) and we need another equation to get closer to a solution. We now do the “trick” (I should really avoid calling it a trick, because people often fear that mathematics is all about tricks and that you can only be good at it if you know those tricks, but that is not true. It is more about trying out different perspectives and representations until the problem becomes solvable or simple enough to see a solution) and think of the ladder as a straight line intersecting the two axes. The slope of a straight line \( m \) is the ratio of how much you go up or down when you go left or right. Mathematically it is expressed as the quotient of the difference in \( y \) and the difference in \( x \), so \( m = \frac{\Delta y}{\Delta x}=\frac{y_2 - y_1}{x_2 - x_1} \). We only need to fill in two points on the line to get the slope! Lucky for us, we already have two points \( P(0, h) \) and \( P(a, 0) \). Going from the first point to the second, we get:

$$ m = \frac{0 - h}{a - 0} = -\frac{h}{a} $$

That’s splendid! The slope is negative because the ladder descends from the wall towards the floor. But wait, what is \( a \)? We don’t know it, so maybe we can express it with variables that we already know? Yes we can, see the Pythagorean equation above! Thanks to

$$ a^2 = l^2 - h^2 \Rightarrow a = \sqrt{l^2-h^2} $$

we get the following final expression for the slope

$$ m = -\frac{h}{a} = -\frac{h}{\sqrt{l^2 - h^2}} $$

That is good because we know \( l \) and thus only have an equation with the one variable that we actually want to find out: the height \( h \). But we only have the slope of the straight line, how do we get a full equation?

Consider the full equation of a straight line, \( y = mx + b \), with \( m \) being the slope and \( b \) being the y-axis intercept (which we already know, because it is exactly \( h \), so \( b=h \)). We cannot reuse the two points we already used to get the slope, but luckily we don’t have to, because we have another point \( P(d,d) \) that we can just plug into this equation:

$$ d = -\frac{h}{\sqrt{l^2 - h^2}}d + h $$

Look at the beauty of that equation! The square root can look a bit scary, but it is actually quite straightforward to solve this equation for \( h \), which is the final step we have to do to solve our problem. What follows are some algebraic finger exercises: isolating the square root, squaring both sides and expanding a few bracket expressions. Once we have done all that, we arrive at

$$ h^4 - 2dh^3 + h^2 (2d^2-l^2) + 2dl^2h - d^2l^2 = 0 $$

And that is a fourth degree polynomial equation, quite different from Copilot’s first attempt! This equation looks complex because it still contains \( l \) and \( d \) as two parameters that we can plug in, so to make this equation more tangible we could plug in \( l = 3 \) and \( d=1 \) to get the equation for the height when the ladder is 3 meters and the cube’s side length is 1 meter. This would give us

$$ h^4 - 2h^3 - 7h^2 + 18h - 9 = 0 $$
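For readers who want to retrace the algebraic finger exercises, here is one possible sequence of intermediate steps (a sketch, other orderings work just as well). After isolating the square-root term in the equation for \( P(d,d) \), multiplying by the root and squaring both sides, we have

$$ (h - d)^2 (l^2 - h^2) = d^2 h^2 $$

Expanding the first bracket and multiplying out gives

$$ (h^2 - 2dh + d^2)(l^2 - h^2) = d^2 h^2 $$

$$ -h^4 + 2dh^3 + (l^2 - d^2)h^2 - 2dl^2h + d^2l^2 = d^2 h^2 $$

Moving \( d^2 h^2 \) to the left-hand side and multiplying by \( -1 \) yields the general fourth degree equation

$$ h^4 - 2dh^3 + h^2 (2d^2-l^2) + 2dl^2h - d^2l^2 = 0 $$

Note that squaring can introduce additional roots that do not satisfy the original equation, so every root found later still has to be checked against the geometry.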

How to solve such a polynomial equation is beyond the scope of this article. In short, although closed-form solutions exist for fourth degree polynomials, they are far too unwieldy to work out by hand, so in practice you use numerical methods to solve this equation. Good for us: That is something Copilot can do easily for us, finally!
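To give an idea of what “numerical methods” means here, the following sketch finds one root of the quartic with plain bisection. This is an illustration only and not the method used later in the article: the helper names are hypothetical, and the search interval \( [2, 3] \) is an assumption picked by hand for \( l = 3 \), \( d = 1 \).

```python
def quartic(h, l, d):
    """Left-hand side of the fourth degree equation derived above."""
    return (
        h**4
        - 2 * d * h**3
        + (2 * d**2 - l**2) * h**2
        + 2 * d * l**2 * h
        - d**2 * l**2
    )


def bisect_root(f, lo, hi, tol=1e-12):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("no sign change in the given interval")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid  # the root lies in the lower half
        else:
            lo, f_lo = mid, f_mid  # the root lies in the upper half
    return (lo + hi) / 2


l, d = 3, 1
# The polynomial is negative at h = 2 and positive at h = 3,
# so a root must lie in between.
h = bisect_root(lambda x: quartic(x, l, d), 2, 3)
print(round(h, 6))
```

Bisection only finds one root per interval and needs a bracket with a sign change, which is why a general-purpose polynomial root finder is the more convenient choice in practice.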

Bringing it all together

With our newly derived knowledge about the general solution to our problem for any pair of \( (l,d) \) values, let’s ask Copilot again to solve this problem for us:

import numpy as np

# solve the equation h^4 - 2dh^3 + h^2*(2d^2 - l^2) + 2dl^2h - d^2l^2=0
# where d is the length of the cube and l is the length of the ladder
def get_highest_point(l, d):
    a = 1
    b = -2 * d
    c = 2 * d**2 - l**2
    e = 2 * d * l**2
    f = -(d**2 * l**2)
    roots = np.roots([a, b, c, e, f])
    max_h = max(roots)
    # return the roots, the number of roots, the maximum root and the distance of the ladder from the wall
    return roots, len(roots), max_h, (l**2 - max_h**2) ** 0.5

As you can see, I gave it the full equation as a prompt and explained the meaning of both variables. And in one go, with a single suggestion, Copilot completed the whole function for me! The only thing I entered was def and Copilot did the rest. You can see a clear structure in the code, which first defines all coefficients of the polynomial.

Did you notice that Copilot started with a, b and so on, but after c continues with e and f? In the first attempt, Copilot wasn’t so clever and used d as the name of the fourth coefficient, overwriting the function parameter d. As a result, it got the wrong value for the next coefficient e, which is calculated from \( d^2 \cdot{} l^2 \), where \( d \) has to be the cube’s side length and not the value of the fourth coefficient. I used Copilot’s inline chat functionality to ask for a simple fix and it refactored the code into the version you can see above. But again, lesson learned: Copilot is not always clever enough to avoid simple mistakes like accidentally overwriting important function parameters.
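The effect of that mistake is easy to reproduce. The snippet below is a reconstruction for illustration only, not Copilot’s actual first output: `coefficients_buggy` reuses the name d for the fourth coefficient, while `coefficients_fixed` follows the naming of the refactored version.

```python
def coefficients_buggy(l, d):
    a = 1
    b = -2 * d
    c = 2 * d**2 - l**2
    d = 2 * d * l**2  # overwrites the cube's side length!
    e = -(d**2 * l**2)  # silently uses the overwritten d
    return [a, b, c, d, e]


def coefficients_fixed(l, d):
    a = 1
    b = -2 * d
    c = 2 * d**2 - l**2
    e = 2 * d * l**2
    f = -(d**2 * l**2)
    return [a, b, c, e, f]


# Only the last coefficient differs, which makes the bug easy to overlook.
print(coefficients_buggy(3, 1))
print(coefficients_fixed(3, 1))
```

For \( l = 3 \) and \( d = 1 \), the fixed version reproduces the coefficients of the equation derived earlier, whereas the buggy version only goes wrong in the constant term.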

Looking at the final solution, it is actually quite nice. Using NumPy’s np.roots (the documentation actually says there is a newer API we should use) is a great fit for this problem, because the roots of the polynomial are exactly the values of \( h \) that solve our equation. Since it is a fourth degree polynomial equation, we already know that there can be at most four different solutions, and under certain conditions there can be fewer than four. (When you think about it, in the real world we can put the ladder at any angle we want, not only in four discrete positions, so the solutions to this equation are not the only possible heights the ladder can take; only the maximum is meaningful in this context.) To us, only the maximum possible height is interesting, so we use max(roots) to get the maximum of all possible heights.

You may notice the second comment inside the function saying “return the roots, the number of roots, the maximum root and the distance of the ladder from the wall”. Here I prompted Copilot to generate for me the desired output and it did so flawlessly. This function returns four things: all found solutions for \( h \), the number of found solutions, the maximum height, and the distance of the ladder from the wall, so I could picture things more easily in my head!

Let’s do a quick run for one of our test cases from the beginning:

print(get_highest_point(3, 1))
>>> (array([-2.9062796 ,  2.49206604,  1.67021162,  0.74400194]), 4, 2.4920660376475325, 1.670211622520849)

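For a ladder with length 3 and a cube with side length 1, the polynomial has four roots, but not all of them are physically plausible. The negative root is obviously impossible, and 0.744 lies below the top of the cube, so it is a by-product of squaring the equation rather than a real ladder position. The remaining two heights, 2.492 and 1.670, are valid solutions (they mirror each other, with height and distance from the wall swapped), and 2.492 is the maximum height we were looking for. In that position, the foot of the ladder is 1.67 away from the wall. Copilot’s suggested solution for this case was 2.828, which we knew was wrong, but only now can we assess how wrong it actually was.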
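Which roots correspond to a real ladder position can also be checked programmatically. The following is a minimal sketch under the 2D model from the derivation: `is_valid_height` is a hypothetical helper, the roots are copied from the console output above, and the tolerance is an assumption that accounts for the rounded values.

```python
import math


def is_valid_height(h, l, d, tol=1e-5):
    """Check a candidate height against the geometry (hypothetical helper)."""
    if not (d < h < l):
        return False  # the ladder must meet the wall above the cube
    a = math.sqrt(l**2 - h**2)  # distance of the ladder's foot from the wall
    # Height of the straight line through (0, h) and (a, 0) at x = d.
    y_at_d = h * (1 - d / a)
    # The ladder has to pass exactly through the cube's corner (d, d).
    return math.isclose(y_at_d, d, abs_tol=tol)


# Roots copied from the console output above.
roots = [-2.9062796, 2.49206604, 1.67021162, 0.74400194]
valid = [h for h in roots if is_valid_height(h, l=3, d=1)]
print(valid, max(valid))
```

A filter like this is a cheap safeguard whenever an equation has been squared along the way, because it tests each candidate against the original conditions instead of the transformed ones.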

Final words about Copilot

What have I learnt from this case study? Copilot is a powerful tool that works surprisingly well far more often than it fails. It increases my daily productivity considerably, especially with mundane, repetitive tasks such as writing tests and documentation, writing well-defined methods, fixing common bugs, and the like. It is also very helpful when doing things I have not done for a while, such as working with a lesser-known library, or testing a new library I am not yet familiar with. In both cases, it can save me a lot of time by not requiring me to read tutorials or API documentation. Instead, it can directly generate examples and code snippets that show me how to use them. Copilot also shines when it comes to very common tasks like implementing algorithms. Since problems like finding a prime number, sorting a list of numbers and so on have been solved literally hundreds of times and are available in public repositories, Copilot has seen enough examples of how to generate this code and how to generate it well. Finally, Copilot is also great for interactive workflows, where I write part of a function or refactor an existing function and let it help me finish or revise the existing code using the inline chat functionality. Because Copilot knows the context, the surrounding code, its suggestions can be spot on.

Yet there is a reason why GitHub Copilot always warns of the real possibility that the generated code might be wrong. Whatever wrong means. Wrong can mean many things, as we have seen. It can mean that the code does not work at all and produces some kind of runtime error. It can mean that the code does not do exactly what we want it to do. Of course, there is always the possibility that our description was ambiguous and that Copilot simply took advantage of the freedom given by the vagueness of our description. It can mean that the code does what we asked it to do, but not in the best possible way, or in a way that is unsafe or otherwise considered harmful. It can mean that the code does what we asked it to do, but the generated comments and documentation are wrong. And it is difficult to detect all these things if we are too trusting. We need to be aware of this kind of misbehaviour, and we need to develop a gut feeling about which tasks will prove easy for AI assistants, and which tasks will prove difficult. Copilot gets a lot of things right the first time, often flawlessly. And then there are times when it does not get it right, even after many attempts with different prompts and strategies, adding more and more information, details, examples and guidance.

Let’s summarise:

  • Copilot (and any other AI tool) can produce working code that seems correct and reasonable at first glance, but does not provide the correct solution to a given problem when tested thoroughly.
  • Copilot can create test cases for existing code, but those test cases may be wrong. Similarly, it may suggest tests for parameter combinations that are not plausible.
  • Copilot (and especially more generalised models) can be very persuasive and assertive in the way it presents its solutions, so you need to be aware that it can be wrong even when it argues that it is right.
  • In general, Copilot is not good at solving problems that don’t have a clear solution or a known algorithm. If you ask it to sort numbers, it will do it without breaking a sweat. But if you ask it to solve a mathematical puzzle for you, well…it tries its best. Our problem was particularly hard, when you think about it, because the solution could not be derived without finding a better representation of the problem. If you have a list of numbers, you can just start sorting them in different ways, there is no step you have to take first to understand the problem. But in the case of our ladder and cube problem, the facts about the length of the ladder and the side length of the cube are not helpful in finding the solution. We had to first transform this problem into a geometric problem and define some helper variables to be able to solve it (not to mention that we relied on several mathematical facts such as equations about right triangles and straight lines).
  • When generating code, Copilot can introduce errors such as overwriting existing function parameters.
  • However, once we had an equation and wanted Copilot to solve it for us using any method it deemed appropriate, it could do it in a single pass on the first try. I doubt it had ever seen this exact equation before, but it drew parallels from the hundreds of similar polynomial equations found in other public code repositories.

Side note: And what about ChatGPT?

If you are familiar with Copilot, you may know that it is based on OpenAI’s Codex model. OpenAI states that “OpenAI Codex is most capable in Python, but it is also proficient in over a dozen languages including JavaScript, Go, Perl, PHP, Ruby, Swift and TypeScript, and even Shell.” Basically, it is a fine-tuned foundational Large Language Model (LLM) that excels at generating code. The astute reader may object that Copilot is simply the wrong model for such a task for precisely this reason. And they might be right about that. So I decided to ask ChatGPT the same question: Can you tell me how to solve this maths problem (without the code generation)?

I will not include any more screenshots of my attempts here, you can see the chat history via this link (very nice feature of ChatGPT!). As you can see, at no point was ChatGPT able to arrive at the same equation as we did. However, each solution it presented was well explained and well argued. So, basically, we can see some of the same problems we already noticed with GitHub Copilot.

There are other LLMs I could have tried, such as more powerful ones like GPT-4 or fine-tuned ones like MetaMath-Mistral-7B. It has become a sport to find the best fine-tuned model for the task at hand, and it makes sense to spend time fine-tuning for a specific problem. But the point of this article was to learn about the inherent limitations of models and AI assistants like GitHub Copilot, and when to be cautious about the answers you get and the code you generate. There’s always a better tool for the job, but the truth is that we don’t usually use all the tools available to us. Humans are creatures of habit and like to follow the path of least resistance. We try to do as much as we can with one tool, and are happy if that tool works well for almost any task we throw at it. In this respect, GitHub Copilot has many strengths, but we should never let our guard down and always remember that these tools are assistants, not experts that replace us humans.
