Large Language Models Learn to Use Tools at 775M Parameters

The ability of LLMs to learn how to use tools seems to emerge at around 775M parameters. In a study where this learning capacity was given to LLMs of various sizes, the authors found that the models could only use tools effectively once they reached 775M parameters and above.

LLMs can use APIs effectively at 775M parameters and above, according to a 2023 paper by the Meta AI research lab

A very interesting result emerged from the paper “Toolformer: Language Models Can Teach Themselves to Use Tools” by the Meta AI Research Lab in early 2023.

Great paper, I highly recommend it: arxiv

This specific result caught my attention for its seemingly neuroscientific explanation (which might not even be relevant).

I like brains and I couldn't help but be fascinated by the potential implications.

The result is that LLMs' ability to teach themselves how to use tools emerges at around 775M parameters and not before.

The distinction is important here.

The training regimen used in this research had the LLMs teaching themselves how to use new tools through API calls.

At 775M parameters, they can finish this course and earn a certificate!

Toolformer is what the authors call this learned capacity that can be tacked on top of any model, but only the models with more than 775M parameters can use it effectively.

The tool learning is really about making the right API call to augment a standard LLM output, as seen in Figure 1. In color, before the arrow is the API call, and after the arrow, its result.
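In the paper, an API call is written inline as plain text in the form `[Tool(input) → result]`. Here's a minimal sketch of parsing that notation; the regex and function name are my own, not from the paper:

```python
import re

# Toolformer-style inline call: [ToolName(input) → result]
# The pattern and helper below are illustrative, not the paper's code.
CALL_PATTERN = re.compile(r"\[(\w+)\((.*?)\)\s*→\s*(.*?)\]")

def parse_api_calls(text: str):
    """Extract (tool, input, result) triples from annotated text."""
    return CALL_PATTERN.findall(text)

example = "The capital of France is [QA(What is the capital of France?) → Paris] Paris."
print(parse_api_calls(example))
# [('QA', 'What is the capital of France?', 'Paris')]
```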

We’ll dive into the methodology in a few, but first, let’s look at the result!

Result: LLMs Learn Tools 🐒

There are a bunch of interesting results throughout this paper, however, the one I'm highlighting happens in Figure 4!

Beautiful figure, but lots to unpack!

There are three main subgraphs of interest for this specific result. Each of the subgraphs has the same composition:

Five different models are examined and compared against GPT-3 (175B parameters).

On the Y axis, you have the performance for a given benchmark.

On the X-axis, you have the size of the model in millions of parameters.

For the lines we have:

  1. The dotted line is GPT-3 results.
  2. The orange line is each model baseline with no ability to learn how to use tools.
  3. The blue line is each model with the ability to learn how to use tools.

On the blue and orange lines, the dots represent the actual models.

Let’s take a look at each subgraph one by one.

LAMA Benchmarks

Figure 4: LAMA benchmarks result in focus.

The LAMA benchmark is a set of tasks where the model needs to complete a statement with a fact, such as a date or a name. It was devised to probe the knowledge stored in a language model.

{"masked_sentences": ["To emphasize the 50th anniversary of the Super Bowl the [MASK] color was used."], "obj_label": "gold", "id": "56be4db0acb8001400a502f0_0", "sub_label": "Squad"}
{"masked_sentences": ["Von Miller plays in [MASK] position for the Denver Broncos."], "obj_label": "linebacker", "id": "56bf17653aeaaa14008c9513_0", "sub_label": "Squad"}
{"masked_sentences": ["The TGIF comedy Family Matters for the 1997-98 season was originally aired by [MASK]."], "obj_label": "CBS", "id": "57273b69dd62a815002e99d8_0", "sub_label": "Squad"}
A couple of raw examples for the SQuAD sub-benchmark of LAMA
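The raw examples above are JSON lines; turning one into a cloze prompt is straightforward (field names are taken from the snippet above):

```python
import json

# One raw line from the SQuAD sub-benchmark of LAMA, as shown above.
line = ('{"masked_sentences": ["Von Miller plays in [MASK] position '
        'for the Denver Broncos."], "obj_label": "linebacker", '
        '"id": "56bf17653aeaaa14008c9513_0", "sub_label": "Squad"}')

record = json.loads(line)
# The model must fill [MASK] with the gold object label.
prompt = record["masked_sentences"][0]
gold = record["obj_label"]
print(prompt)  # Von Miller plays in [MASK] position for the Denver Broncos.
print(gold)    # linebacker
```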

As we can see in Figure 4, the two smallest models have the same performance in LAMA whether they have their toolformer ability disabled or not (blue and orange dots are at the same performance).

Meaning that their new ability does nothing to improve performance.

At 775M parameters, the performance of the model with toolformer ability is vastly superior to the baseline.

This trend keeps up with bigger models, culminating with the 1.6B model outperforming the 175B model GPT-3 (by like a lot)!

Math Benchmarks

Figure 4: Math benchmarks result in focus

A similar story plays out in the math benchmarks, with an even greater relative gain for models past 775M parameters.

This result is interesting because LLMs are notoriously poor at math, especially models with small capacity. This makes sense, as they are trained on words, not explicitly to understand all the intricacies of numbers and operators.

However, by allowing them to use tools, like a calculator, they are able to more easily overcome this inherent limitation (as long as they are above 775M parameters).
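The calculator tool itself can be tiny. The paper describes a calculator supporting the four basic operations with results rounded to two decimal places; the implementation below is my own sketch in that spirit, not the paper's code (no operator precedence, single operator per expression):

```python
import operator

def calculator(expr: str) -> str:
    """Toy calculator tool: two operands, one of + - * /,
    result rounded to two decimals (as described in the paper).
    This sketch ignores precedence and nested negatives."""
    ops = {"+": operator.add, "-": operator.sub,
           "*": operator.mul, "/": operator.truediv}
    for sym, fn in ops.items():
        if sym in expr[1:]:  # skip a leading minus sign
            left, right = expr.rsplit(sym, 1)
            return str(round(fn(float(left), float(right)), 2))
    return expr  # nothing to compute

print(calculator("1400 / 7"))  # 200.0
```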

Livefeed of model < 775M parameters using the calculator API

QA Benchmarks

Figure 4: QA benchmarks result in focus

This result is different from the others and adds a nuance that paints a great picture.

As you can see, all models are improving with their newfound toolformer ability. This wasn’t the case with the previous two benchmarks!

What does it mean then if during the QA benchmarks all models are improving?

Let's look at the main tool used during the QA type of task by toolformer models:

The interface for the QA API is literally just asking a question about the token the LLM is unsure about!

As we can see above in blue, the interface for the tool is very simple. It's literally asking "What/Who/Why" followed by the term the model needs to complete.

This makes it a very easy tool to use compared to the other interface which requires a bit more general understanding.
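The whole QA interface is one string in, one string out: the model writes a natural-language question and gets an answer back. A toy sketch of that interface (the backend here is a stand-in for the paper's actual QA system; names are mine):

```python
from typing import Callable

def qa_call(question: str, qa_backend: Callable[[str], str]) -> str:
    """The QA tool interface is just a question -> answer mapping.
    Any callable works as a backend; the paper uses a real QA model."""
    return qa_backend(question)

# Toy backend for illustration only:
toy_backend = {"Who developed Toolformer?": "Meta AI"}.get
print(qa_call("Who developed Toolformer?", toy_backend))  # Meta AI
```

There is nothing to format and no syntax to get right beyond asking a question, which is plausibly why even the sub-775M models manage to use it.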

Therefore, as the authors point out, there seems to be a tight connection between the complexity of the tool required to complete a given task and the capacity of the models.

Since the models aren't explicitly taught when to use what tool, the model needs to have just enough internal capacity to figure out which tool to use and how to use it.

In the case of QA tasks, the tool happens to be simple enough for that chain of events to flow correctly for a model smaller than 775M parameters!

As for the other types of tasks, the tools are too complicated for the smaller models to do the toolforming movement properly.

Neuroscience Tangent 🧠

The cool thing about this result is that it seems to mirror the fact that brains with low capacity don't really use tools and those with higher capacity do.

A whole bunch of toolforming is happening in these brains (image stolen from here)

It also is similar in the sense that brains with high capacity, like human brains, can figure out when and where to use the tools they have been shown a few times. They don't have to be explicitly trained for all types of tasks.

All in all, it's a pretty cool result. Let’s check out the methodology.

Methodology: API + In-Context Learning 🎓

Without going through the full paper, let’s elucidate how the authors arrived at this result.

Figure 2: Rough idea to create the toolformer ability

Firstly, as pointed out earlier, the authors were trying to teach the models how to teach themselves to use tools.

By tools, we mean those wrapped in APIs:

  • Question answering
  • Calculator
  • Wikipedia Search
  • Translation System
  • Calendar

They weren’t teaching the model one type of tool for each possible situation, but a multitude of them without any target task.

The training regimen to learn the interfaces they could work with consisted of:

  1. A few examples of how to use the APIs.
  2. These examples were then augmented through the use of in-context learning.
  3. Then this augmented dataset was used for fine-tuning the model.
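The key trick in step 2 is the filter: the model samples candidate API calls, executes them, and keeps only the calls whose results actually make the following tokens easier to predict. A simplified sketch of that filtering criterion (the threshold name `tau` follows the paper; the exact loss bookkeeping here is condensed):

```python
def keep_api_call(loss_with_call: float,
                  loss_with_empty_call: float,
                  loss_without_call: float,
                  tau: float = 1.0) -> bool:
    """Toolformer-style filtering, simplified: keep a sampled API call
    only if inserting the call *with its result* lowers the loss on the
    subsequent tokens by at least tau, compared to the better of
    (no call at all, call without its result)."""
    baseline = min(loss_with_empty_call, loss_without_call)
    return baseline - loss_with_call >= tau

# A call whose result genuinely helps prediction is kept:
print(keep_api_call(loss_with_call=1.2,
                    loss_with_empty_call=2.9,
                    loss_without_call=3.1))  # True
```

Only the surviving calls make it into the fine-tuning dataset, which is why no human labeling is needed.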

This is pretty cool because it means that there wasn't a massive data labeling effort going on! A few examples were enough for the model to effectively learn how to use the APIs!

Model after the fine-tuning, ready to use that calculator!

After fine-tuning, the toolformer models had to figure out which tool they should be using for whatever task was in front of them.

They did so by making the API call and inserting the result at the right spot in the output.
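At inference time, when the model starts emitting an API call, decoding pauses, the tool runs, and the result is spliced back in before generation continues. A one-pass sketch over a finished string (the real system interrupts decoding token by token; regex and helper names are mine):

```python
import re

def expand_api_calls(generated: str, tools) -> str:
    """Find '[Tool(input)]' spans, run the tool, and splice the
    result back in as '[Tool(input) → result]'. Illustrative only."""
    def run(match: re.Match) -> str:
        name, arg = match.group(1), match.group(2)
        return f"[{name}({arg}) → {tools[name](arg)}]"
    return re.sub(r"\[(\w+)\((.*?)\)\]", run, generated)

tools = {"Calculator": lambda e: "200.0"}
print(expand_api_calls("The answer is [Calculator(1400 / 7)] calories.", tools))
# The answer is [Calculator(1400 / 7) → 200.0] calories.
```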

That’s the condensed version of the methodology! Do check the paper for all the details, it's well worth the read!


To conclude, the authors taught models how to use tools and let them decide when and which tool to use. The toolformer models were able to leverage this newfound knowledge to perform various benchmark tasks.

The models with 775M+ parameters were the only ones able to gain an advantage, as they understood how to use the tools effectively (except for very simple tools on tasks like QA).

If you have any questions don’t hesitate to email me at !

I'm on vacation right now, so I'm pretty much chilling and reading research papers all day at the beach.

Having a great time right now!

Have a great week! 🌹
