A Survey of Techniques for Maximizing LLM Performance

Hello, everyone.
I hope you all enjoyed the keynote, and I hope you're all enjoying your time here at OpenAI's first developer conference.
So in this breakout session,
we're going to be talking about all the different techniques that you can use to maximize LLM
performance when solving the problems that you care about most.
So just to introduce myself, my name is John. And it's really been a very exciting few months for fine-tuning at OpenAI, right?
So back in August, we launched 3.5 Turbo fine-tuning, and we were just blown away by the reception from the developer community.
We followed that up with a few important features, so there was fine-tuning on function calling data.
There was continuous fine-tuning where you can take an existing fine-tuned model and continue the fine-tuning process on it.
We even launched a full UI for fine-tuning.
Over these last few months, we've been able to work closely with developers from all corners of industry, right?
So indie developers, developers from startups, and developers from some of the largest enterprises on Earth.
And we've been able to see, you know, what are the problems that they're trying to solve?
How are they trying to use LLMs to solve these problems?
And how are they trying to use fine-tuning of LLMs to solve these problems?
So I hope to share some of these insights with you all today.
With that said, I'm going to get us started.
Thanks, John.
Hey, folks. I'm Colin.
Nice to meet you.
I head up our solutions practice in Europe, which means working with some of our strategic customers to try to solve their most complex problems.
And you'll probably be unsurprised to hear that over the last year, optimization has been the most constant focus for everybody trying to get LLMs reliably into production.
And why is that such a focus?
Well, optimizing LLMs is hard.
Despite all the frameworks,
despite all the content that everybody is releasing,
all the metrics and all the different tools that people have provided,
it's still one of the biggest focuses, and there's still no one-stop shop for how to optimize.
It depends on what category of problem you've got and how to approach it.
And that's what we're hoping to show you today: a framework to figure out what is going wrong and how to approach it, and then the tools that you can use to solve things.
So the first reason this is hard is that it can be difficult to separate the signal from the noise, to know exactly what's going on. That's the first thing.
The second thing is that performance can be really abstract and difficult to measure with LLMs.
So it can be really difficult to know how much of the problem you have.
And even if you know what the problem is and how much of a problem you have,
it's also difficult to know which approach to use to solve the problem that you've identified.
So that's really the focus for today.
So today's talk is all about maximizing performance.
So what we're hoping you'll leave here with is a mental model of what the different options are,
an appreciation of when to use one above the other, and the confidence to continue in this optimization journey yourselves.
So let's start.
Optimizing LLM performance is not always linear.
So a lot of folks present a kind of chart like this where you start off with prompt engineering,
you move on to retrieval augmented generation, and then you move on to fine tuning.
This is kind of like the way that you approach optimizing LLMs.
But this is problematic, because retrieval augmented generation and fine tuning solve different problems.
Sometimes you need one, sometimes you need the other, and sometimes you need both, depending on what you're dealing with.
So we think of it more like this.
So there's kind of two axes you can optimize on.
One of them is context optimization.
So what does the model need to know to solve your problem?
The other is LLM optimization.
How does the model need to act?
Like what's the method that it needs to carry out or what's the action that it needs to deliver to actually solve your problem.
So a typical flow that you see is starting in the bottom left with prompt engineering.
With prompt engineering, you can sort of do both, but you can't scale either that well.
Still, prompt engineering is always the best place to start.
You test and learn very quickly.
And the first thing you should do is start off with a prompt, get to an evaluation, and figure out how you're going to consistently evaluate your outputs.
And from there you can decide: is this a context problem, or is this a how-we-need-the-model-to-act problem?
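To make that concrete, here is a minimal sketch of what "start with a prompt and get to an evaluation" can look like in code. This is an illustrative assumption rather than anything shown in the talk: the test set, prompt, and exact-match grading are placeholders you would swap for your own task.

```python
# Minimal eval-loop sketch: run one prompt over a small hand-built test
# set and report accuracy. The baseline number is what every later
# change (few-shot, RAG, fine-tuning) gets compared against.
from openai import OpenAI

client = OpenAI()

test_set = [  # (input, expected output) pairs, curated by hand
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer with a single word or number."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

correct = sum(ask(question) == expected for question, expected in test_set)
print(f"Baseline accuracy: {correct / len(test_set):.0%}")
```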
If you need more context, or more relevant context, then go up to retrieval augmented generation, or RAG.
If you need more consistent instruction following, then go right to fine tuning.
And these two things stack together.
They're additive.
So sometimes your problem requires both.
We're going to give you examples of where folks have used one or two, and where folks have used all of them, to solve their problems.
So a typical optimization journey often looks a lot like this.
So you start off in the bottom left corner.
You've got a prompt.
You create an evaluation and then you figure out what your baseline is.
Then, typically, the simplest next step is to add few-shot examples.
So you show the model a couple of input-output pairs of how you want the model to act.
And let's say at this point that those few-shot examples actually increase the performance quite a bit.
So we want to hook that up to some kind of knowledge base so that we can industrialize that process.
And that's usually where folks will add some kind of retrieval augmented generation.
Let's say that now, OK, it's got context, but it's not producing the output in exactly the format or style that we want every time.
So we might then fine-tune a model.
Then a classic next step is that maybe the retrieval is not quite as good as you want it to be.
Maybe the content could be more relevant to what the model needs.
So then go back and optimize the retrieval augmented generation again.
And once you've optimized the retrieval augmented generation again, you might want to fine-tune your model again with the new examples that you've introduced with your updated retrieval augmented generation.
So that's a bit of an example of the classic optimization flow that we see. If I could summarize it in the simplest possible terms: you try something, you evaluate, then you try something else.
That's it, in the simplest possible terms.
So let's dive into each of these quadrants now. We're going to start in the bottom left with prompt engineering.
Then we're going to move on to retrieval augmented generation and fine-tuning.
And then we're going to take all this for a spin with a practical challenge that John and I took on, and show you how this works in practice.
So, prompt engineering.
Now, I know most of you in the audience are going to be very, very familiar with this, so we're going to kind of skip through it at a fair rate.
But it's always best to start here and just make sure everybody knows the principles.
So for prompt engineering, there are a few strategies here.
This comes from the best practices on our documentation, but just to recap them.
So first of all, writing clear instructions.
I'll show an example of what that means, but this is often where folks fall down in the first place.
Secondly, splitting complex tasks into simpler subtasks.
If you can imagine that the model is trying to make a prediction, or a series of predictions, for every subunit or subtask that you're giving it to solve, you should give instructions that break down that problem as much as possible, so it has a better chance of carrying it out.
Similarly, giving GPTs time to think; I'll give you an example of a very common framework that people use to do that.
And the last thing, I've kind of alluded to this already: test changes systematically.
So many times we see our customers end up in this sort of like whack-a-mole situation.
where they change one thing, then they change another thing, and they're jumping all around on their evaluation matrix and don't feel like they're going in the right direction.
And this is really where you need a solid set of evals, and typically some kind of evaluation pipeline set up, so that you can systematically measure these things as you change them.
After that, the most common next step is to extend it with reference text, or to give it access to external tools, which takes us more into the field of retrieval augmented generation.
But first of all, let's recap what these look like in practice.
So, first of all, a couple of intuitions for prompt engineering.
So, prompt engineering. I've said it a couple of times and I'll say it again: it's the best place to start, and it can actually be a pretty good place to finish, depending on your use case.
So, what is it good for?
Testing and learning early, and, when paired with evaluation, providing a baseline to set up further optimization. It should always be where you start.
What is it not good for?
Well, a few things.
So introducing new information.
You can pack quite a bit of information into the prompt,
and with GPT-4 Turbo, you can now pack a ton of information into the prompt.
But that said,
it's not a super scalable way to do that using prompt engineering,
and we'll see a couple of ways that we can approach that using other methods.
Also reliably replicating a complex style or method,
again, limited by the context window in terms of the number of examples that you can actually show to the model.
So it's a great place to start, but depending on the complexity of the task, it might not get you all the way there.
And the last thing is minimizing token usage.
A common pattern with prompt engineering is that you keep hitting problems and keep adding more and more facets to your prompt to deal with them.
And you end up using more and more tokens, which costs you latency, costs you money, all sorts of things.
So again, prompt engineering is not a great way of dealing with that particular problem.
So, a quick recap of things to do and not to do with prompts. We've got a pretty bad prompt here, with some vague instructions and some fairly random output.
Just recapping a couple of ways to improve that: clear instructions, telling it exactly what it will be presented with and what its task is.
Giving it time to think. This isn't a particularly good example here; I'm just telling it to approach the task step by step.
For really giving it time to think, I would think of something more like the ReAct framework, where you get it to write out its reasoning steps, so it's basically helping itself get to the answer. The ReAct framework is just one way that you can approach that.
But giving GPTs time to think is a great way of dealing with cases where you have some kind of very complex or logical reasoning that you need it to do. Because, at the end of the day, it's a next-token predictor, and it's printing the tokens that it needs to help it get closer to that answer, depending on the strength of your prompt.
And last thing is breaking down complex tasks into simpler tasks.
In this case, as I mentioned, think of each step almost as a prediction, and just lay the steps out as clearly as possible.
And on the right side, we see a fairly nicely formatted JSON output.
So, again, just the basics, but I wanted to recap those before we move on.
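As a small illustration of those principles, here is a hedged before-and-after sketch; the support-ticket task and the wording are made up for illustration, not the prompt from the slide.

```python
# A vague prompt versus one with clear instructions, decomposed steps,
# room to think, and a required JSON output format. The triage task
# here is a hypothetical example.
vague_prompt = "Look at this ticket and tell me about it."

improved_prompt = """You will be given a customer support ticket.
Your task is to triage it. Work through these steps in order:
1. Summarize the issue in one sentence.
2. Classify severity as one of: low, medium, high.
3. Suggest the next action.
Think step by step, then return only a JSON object with the keys
"summary", "severity", and "next_action"."""
```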
So, a common next step.
With prompt engineering, you're basically trying to tell the model how you want it to act.
But often, it's very difficult to know which of those tokens is actually influencing the model the most.
And a great way to start is actually by approaching it as a show-rather-than-tell problem: just provide few-shot examples, giving it input and output pairs and actually showing it the kind of behavior that you want it to have.
And this kind of leads us on nicely to the next step. Few-shot examples typically give a good performance improvement; we're going to see in the practical section that that's what gives us some very good lift with the practical task that we take on.
But then folks want to industrialize it. You want those few-shot examples to be contextual, based on a user's question or on whatever the context of this particular problem is.
And that's where folks typically take few-shot and move on to retrieval augmented generation, or RAG.
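To make the show-rather-than-tell idea concrete, here is a minimal few-shot sketch in the OpenAI chat format; the sentiment task and the example pairs are made up for illustration.

```python
# Few-shot prompting: show the model input/output pairs instead of only
# describing the behavior you want. The real input goes last.
from openai import OpenAI

client = OpenAI()

few_shot_messages = [
    {"role": "system", "content": "Classify the sentiment of the product review."},
    {"role": "user", "content": "The battery died after a week."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Setup took two minutes and it just works."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Decent screen, but the speakers crackle."},
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo", messages=few_shot_messages
)
print(response.choices[0].message.content)
```

To industrialize this, you would select those pairs dynamically, for instance by embedding similarity to the incoming question, which is exactly the step into retrieval augmented generation.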
So, for RAG, let's jump right in.
Actually, before I jump into RAG, I just want to give you a quick mental model for how to think about where to go, basically.
We started with prompt engineering.
We've evaluated.
We've identified the gaps.
Now we're trying to decide whether it's a retrieval augmented generation that we need or whether it's fine-tuning that we need.
It's sometimes useful to think of it as a short-term memory versus long-term memory problem.
So, if we think of it as, perhaps, like, trying to prepare for an exam.
So, your prompt is like giving someone the instructions that they need to complete the exam.
Fine-tuning is like all the studying that you do beforehand to learn the methodology and the actual framework that you need to answer those questions.
And retrieval augmented generation is kind of like giving them an open book when they actually go to the exam.
So if they know the methodology and they know what they need to look for,
and then retrieval augmented generation means that they can just open up the book,
go to the right page, and pull the content that they need.
And that's why these two things solve very different problems: without the methodology, or without the content, it can be impossible to solve certain problems.
And in this case, we're assuming that we've got a short-term memory problem.
So we want to give the model the right context that it needs to answer the question.
So, retrieval augmented generation, or RAG, is all about giving the model access to domain-specific content.
So, a quick recap of what RAG is. I know, again, most people in the audience are going to be familiar, but I'm just going to recap this for the benefit of all.
So you'll typically start with some kind of knowledge base, or some kind of area that you want to get content from, that you then want to use to answer these questions.
So in this case,
we're going to use like a fairly typical flow,
which is we've got some documents, we're going to embed them, we're going to stick them somewhere.
Again, I know folks out there probably have their own search service, or their own sources of documents that they would use, and that's fair enough.
For this example, we'll assume that we have some documents, we embed them, and we have a knowledge base.
Then when the user comes in, they're going to ask a question, let's say, "What's the population of Canada?"
Instead of giving that directly to the LLM, we're going to fire it at our knowledge base using some kind of search. Let's imagine we do a similarity search: we're going to pull back some content, so we have some content that says what the population of Canada is.
We're then going to combine that with a prompt, give it to the LLM, and say: here's the question, here's the content, answer this question with this content. And we're going to end up with, hopefully, a correct answer.
So that's a recap of RAG.
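Here is a minimal sketch of that exact flow: embed the documents, run a similarity search with the question, and answer using only the retrieved content. The document snippets and model names are placeholder assumptions, and a real system would use a vector store rather than an in-memory list.

```python
# Minimal RAG sketch: embed documents, retrieve the closest one to the
# question by cosine similarity, then answer with only that content.
import numpy as np
from openai import OpenAI

client = OpenAI()

documents = [
    "Canada's population was about 40 million in 2023.",  # made-up snippet
    "The capital of Canada is Ottawa.",
]

def embed(text: str) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(result.data[0].embedding)

doc_vectors = [embed(doc) for doc in documents]

question = "What's the population of Canada?"
q_vec = embed(question)

# Cosine similarity search over the tiny knowledge base.
scores = [v @ q_vec / (np.linalg.norm(v) * np.linalg.norm(q_vec)) for v in doc_vectors]
best_doc = documents[int(np.argmax(scores))]

answer = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer using only the provided content."},
        {"role": "user", "content": f"Content: {best_doc}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```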
So, as we did with prompt engineering, I want to share a little bit of intuition that we've developed in terms of when you should use RAG and when you shouldn't.
So what RAG is good for?
Again, introducing new information to the model to update its knowledge.
This is one of the few ways you can do that right now, and it's actually one of the biggest problems that customers come to us with.
They're like, hey, I've got 100,000 documents.
I want the model to just know what's in these documents.
And unfortunately right now there's no super scalable way to take those hundred thousand documents and give the model knowledge of all those at one time.
Retrieve augmented generation is probably about as close as you're going to get right now,
which is we're going to give some contextual knowledge to it based on the particular problem that you want it to solve.
Similarly, reducing hallucinations by controlling content is one of the very common use cases of retrieval augmented generation. We'll see a bit later how that pairs really nicely with fine-tuning.
A typical use case is: you give the model content, and you give it instructions to only use that content to answer questions, not its own knowledge. That's a typical way that folks try to constrain the model to a particular knowledge base and reduce hallucination.
What it's not good for.
I alluded to it there, but: embedding understanding of a broad domain.
Retrieval augmented generation will not allow you to teach the model what a whole domain, like law or medicine, is.
Unfortunately, that's not one of the things that retrieval augmented generation will let you do.
Similarly, teaching the model a new language, format, or style. This is probably where you're more in the fine-tuning realm, where you're trying to teach it a methodology or a way of approaching the problem.
And again, reducing token usage. In fact, you're going to add many, many more tokens with RAG; you're going to keep adding retrieved context and input-output examples.
I often see folks go prompt engineering and then RAG, because the first thing they're trying to do is get the accuracy to a level that they're comfortable with, and then they'll try to strip tokens back out of the process.
We're going to tell you a lot more about that later, but with RAG you're really just trying to give the model as much context as it needs to answer the question.
So, I want to share a success story here.
Because with prompt engineering and RAG, these things can look quite simple, but they're really quite hard, and it takes a lot of iterations and a lot of testing and learning to actually make this happen for real.
In this example, a customer had a RAG pipeline with two different knowledge bases and an LLM.
Its job was to take a user's question, decide which knowledge base to use, fire a query at it, and use that to answer the question.
When we started, we just implemented simple retrieval. We'd had loads of talks with them, and we were all really excited about how good embeddings were going to be and how easy this was going to be.
And our baseline was 45% accuracy.
So, not so great.
What we then tried was a whole bunch of stuff, and I've put little ticks and crosses next to them to show how many things we tried and how many actually made it into production.
The things with ticks were things that we actually took to production; the things with crosses were things that we tried and discarded.
In this round we managed to boost it to 65%. We tried hypothetical document embeddings, where instead of doing a similarity search with the question, you generate a hypothetical answer and do a similarity search with that.
For certain use cases that works really well; for this one, it did not.
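For reference, here is a minimal sketch of the hypothetical document embeddings idea just described; it assumes the `embed()` helper and `client` from the RAG sketch earlier, and the prompt wording is an illustrative assumption.

```python
# Hypothetical document embeddings (HyDE): instead of embedding the
# question, generate a plausible (possibly wrong) answer and embed
# that for the similarity search.
def hyde_query_vector(question: str):
    fake_answer = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write a short, plausible answer to: {question}",
        }],
    ).choices[0].message.content
    # Search the knowledge base with the hypothetical answer.
    return embed(fake_answer)
```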
We also tried fine-tuning the embeddings, so actually changing the embedding space based on a training set that we had, to help the model get to the correct answer.
And this actually worked okay from an accuracy perspective, but it was so expensive and so slow that we had to discard it for non-functional reasons.
And the last thing we did was experiment with chunking and embedding: trying different-size chunks of information and embedding different bits of content to try to help the model discern which were the most relevant.
So we got a 20% bump, but we were still fairly far from something we could put in front of customers.
And this was maybe like 20 iterations that we'd gone through to get to 65%.
And at this stage, we were kind of like, right, are we going to pull the plug on this thing?
But we stuck with it.
And we then tried re-ranking: applying a cross-encoder to re-rank the results, or using rules-based stuff, like, oh, it's research, so we want the latest document, something like that.
And we actually got a really big performance bump from that.
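Here is one way a cross-encoder re-ranking step can look, as a hedged sketch using the open-source sentence-transformers library; the checkpoint name is one public example model, not necessarily what this customer used.

```python
# Cross-encoder re-ranking: score each (query, document) pair jointly,
# then keep only the top-scoring candidates for the context window.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```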
And also classification. So basically, having the model classify which of the two domains the question belongs to, and then giving it extra metadata in the prompt depending on which domain it was classified to, to help it decide which content was most likely to be relevant.
And in this case, again, a pretty good bump: at 85%, we were now looking like we were just on the cusp of getting to production.
The next thing we tried was further prompt engineering.
So we went back to the start and actually tried to engineer that prompt a lot better.
We then looked at the category of questions that we were getting wrong, and then we introduced tools.
So, for example, we noticed there were structured-data questions where it needed to go and pull figures out of these documents, and what we decided to do instead was give it access to a SQL database, where it would just fill in the variables, execute a query, and bring back structured data answers.
And the last thing was query expansion, where somebody asks, say, three questions in one, and you parse those out into a list of queries, execute them all in parallel, bring back the results, and synthesize them into one result. A rough sketch of that is below.
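This sketch assumes the model returns a bare JSON list when asked to split the question; the prompts and helper names are illustrative, not the customer's actual pipeline.

```python
# Query expansion sketch: split a compound question into sub-queries,
# run them in parallel, then synthesize one combined answer.
import json
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def chat(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def answer_compound_question(question: str) -> str:
    # 1. Parse the multi-part question into standalone queries.
    raw = chat(f"Split this into standalone questions. Return only a JSON list: {question}")
    sub_queries = json.loads(raw)  # assumes the model complied with "JSON list"
    # 2. Execute them all in parallel (each would normally hit retrieval too).
    with ThreadPoolExecutor() as pool:
        partial_answers = list(pool.map(chat, sub_queries))
    # 3. Synthesize the partial answers into one result.
    return chat(f"Combine these notes into one answer.\nQuestion: {question}\nNotes: {partial_answers}")
```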
And these things together got us to the point where we got to 98% accuracy.
And at no point in this process did we use fine-tuning.
And I want to call that out because,
again, as we said at the start, the assumption is often you need fine-tuning to go to production.
And actually, in this case, every problem we were dealing with was a context problem. Either we weren't giving it the right context, or it didn't know which of those context blocks was the right one.
And that's why it's so critical to know what problem we're solving for here.
Because if at any point we had gone to fine-tuning, that would have been wasted money and wasted time.
And that's why this is a success story that we're happy with.
So I wanted to give a cautionary tale as well,
because sometimes RAG sounds so great: you have all this great content, and you use that to answer the question.
But it can also backfire massively.
And I'll give you a different customer example.
So we had a customer who was trying to reduce hallucination by using retrieval augmented generation.
They told the model it was to use only the retrieved content.
And they had human labelers who would check and flag things as hallucination.
So in one case, there was a funny guy at the customer who asked: what's a great tune to get pumped up to?
And the model came back with "Don't Stop Believin'" by Journey.
And the labelers were like, right, this is definitely a hallucination.
But in fact, it was not a hallucination. It was actually in their content: somebody had written a document that said something like, what's the optimal song to energize financial analysis? And "Don't Stop Believin'" was the answer.
And this, while sort of funny, is also an example of where RAG can hurt you.
If you tell the model to only use the content, and your search is not good, then your model has zero percent chance of getting the correct answer.
And the reason I call this out is that when you're evaluating RAG, you've actually added a whole other axis of things that can go wrong.
We have our LLM, which can make mistakes, and then we have search, which, you know, is not a solved problem.
That's why I wanted to call out a couple of the evaluation frameworks that the open source community have come up with.
And I want to call out especially Exploding Gradients.
They developed this framework called RAGAS, which is cool, and it basically breaks down these different evaluation metrics.
You can pull it down from GitHub, use it straight out of the box, or adapt it to your needs.
Basically, there are four metrics that it measures.
Two of them measure how well the LLM answers the question, and two of them measure how relevant the content actually was to the question.
So if we start off with the LLM side, the first one is faithfulness.
So it takes the answer and it breaks it into facts and it reconciles each of those facts to the content.
And if it can't reconcile a fact, that counts as a hallucination.
It returns a score, and if that score crosses your threshold, you can block the answer, because you've found hallucinations, basically.
This is one very useful metric that comes out of it.
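To make the mechanics concrete, here is a hand-rolled sketch of the faithfulness idea, using an LLM as the judge; this is an illustrative version of the concept, not the actual RAGAS API.

```python
# Faithfulness sketch: break the answer into claims, ask a judge model
# whether each claim is supported by the retrieved content, and return
# the supported fraction. Low scores suggest hallucination.
from openai import OpenAI

client = OpenAI()

def faithfulness(answer: str, content: str) -> float:
    claims_text = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
                   f"List the factual claims in this answer, one per line:\n{answer}"}],
    ).choices[0].message.content
    claims = [line for line in claims_text.splitlines() if line.strip()]
    supported = 0
    for claim in claims:
        verdict = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content":
                       f"Content:\n{content}\n\nClaim: {claim}\n"
                       "Is the claim supported by the content? Answer yes or no."}],
        ).choices[0].message.content
        supported += verdict.strip().lower().startswith("yes")
    return supported / max(len(claims), 1)

# If the score falls below your threshold, flag or block the answer.
```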
The other one is answer relevancy.
A lot of the time, the model will get a bunch of content, and it will produce an answer that makes good use of that content but actually has nothing to do with what the user originally asked.
That's what this metric measures. So if you find that the answer is all factually correct, but you have very low relevancy, that means you probably need to prompt engineer, or do something else here, to make the model pay more attention to the question, and to decide not to use the content if it's not relevant.
So that's on the LLM side.
The other side is how relevant the content is.
And this is where I've found it most useful for my customers.
As we alluded to earlier, a common instinct with RAG is just putting more and more context into the context window: hey, if we give it 50 chunks, it'll get the right answer.
But that actually ends up producing the wrong result a lot of the time. There's a paper written on this called "Lost in the Middle": the more content you give the model, the more it starts to hallucinate, or starts to forget the content in the middle.
What you actually want is the most precise pieces of content.
And that's where this metric, context precision, evaluates the signal-to-noise ratio of the retrieved content.
So it takes every content block and compares it to the answer and figures out whether that content was used in that answer.
And this is sort of where you start to figure out: okay, we're getting really high accuracy, but we've maybe got a 5% context precision. Can we actually bring back much less content and still get correct answers?
This is one of the areas where I think it's really useful for folks, because sometimes folks get to production, or get close to production, and the instinct is just more and more context.
This metric gives you a very solid way to ask: is adding more context actually helping us here?
And the last one is context recall. Did it retrieve all the relevant information required?
Basically, is the relevant information that it needed to answer the question actually in the retrieved content?
So this is the opposite problem: the stuff that our search is pushing to the top, the stuff we're actually putting in the context window, does it actually answer the question?
If this is very low, it usually tells you that you need to optimize your search: add re-ranking, fine-tune your embeddings, or try different embeddings, to actually surface the more relevant content.
And I guess I wanted to leave you with that, because that's kind of us trying to squeeze as much performance as we can out of prompt engineering and RAG.
But sometimes, again, the problem that you're trying to answer is different.
Sometimes it's actually the task that you're trying to get it to perform, which is the problem.
And that's where you would take a sideways step and actually try fine tuning.
And that's where I'm going to hand you over to John to take you further.
So, up until this point in the talk, we've been focusing on the prompting family of techniques, right?
And this is where you figure out clever ways of packing the context window of the LLM at sample time in order to optimize the LLM's performance on your task.
Fine-tuning is really a different technique altogether from prompting.
So just to start off with the definition,
right, so fine-tuning, especially in the context of large language models, is when we take an existing trained model and we continue the training process on a new data set that is smaller and often more domain-specific than the original data set that the model was trained on.
And fine-tuning is really a transformative process, where we essentially take a base model, we fine-tune it, and we end up with a different model.
And just to step back for a second, the name fine-tuning is really like a great description of this process, right?
So we start off with some model that has been trained on an enormous and diverse data set.
And models like 3.5 Turbo or GPT-4 have a lot of very general knowledge about the world.
So we take one of these very general models and we specialize them and we essentially hone
their abilities to make them better suited for a task that we care about.
So why would one fine-tune in the first place?
And I want to highlight the two primary, and related, benefits of fine-tuning.
So, first off, fine-tuning can often allow you to achieve a level of performance that would be impossible without it.
And just to plant a little bit of intuition here: when you're using prompting techniques, you're limited by the context size of the model when it comes to the amount of data that you can show the model, right?
So, at the low end, that's maybe a few thousand tokens; at the high end, maybe it's 128,000 tokens if you're using GPT-4 Turbo.
But really, this is nothing compared to the amount of data that you can show a model while you're fine-tuning.
It's pretty trivial to fine-tune over millions or hundreds of millions of tokens of data.
So you can just show many more examples to a model while fine-tuning than you ever could
hope to pack into the context window of even the largest LLM.
The second benefit is that fine-tuned models are often more efficient to interact with than their corresponding base models, and there are kind of two ways that this efficiency shows up.
To start us off, when you're interacting with a fine-tuned model, you often don't have to use prompting techniques that are as complex in order to reach a desired level of performance.
So you often don't have to provide as complex of instructions.
You don't have to provide explicit schemas.
You don't have to provide in-context examples.
What this means is that you're sending fewer prompt tokens per request, which means that each request is both cheaper and generally gets a response quicker.
So it's more latency- and cost-efficient to interact with fine-tuned models.
Next is that a common use case for fine-tuning is essentially the distillation of knowledge from a very large model like GPT-4 to a smaller model like 3.5 Turbo, and it's always going to be more efficient from a cost and latency perspective to interact with a smaller model than a larger model.
So let's look at an example here,
right, so this is an example of a common task that someone might want to solve with an LLM.
What we're doing here is essentially taking a natural language description of a real estate listing and trying to extract some structured information about that listing.
Okay, and so if we were going to try to solve this without fine-tuning, we would essentially open up the toolbox of prompting techniques: we would write some complex instructions, we would provide an explicit schema that we want the model to output (here it's defined as a Pydantic model in Python), and we would provide some in-context examples to the model.
We would then give the model a new real estate listing in natural language, and it would provide us some output.
And the output's pretty good, but here, there's actually a mistake.
And it's a pretty trivial mistake, right?
Instead of extracting the date that we desired, it templated it to be the current date.
So this would be pretty trivial to fix, right?
We could add a new rule, or essentially add a new in-context example, and we could probably fix this problem.
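For flavor, here is a hedged sketch of that prompting approach: an explicit schema as a Pydantic model plus instructions, with a rule targeting the date bug. The field names, listing text, and the assumption that the model replies with bare JSON are all illustrative; the talk did not show its schema in full.

```python
# Prompting approach to structured extraction: complex instructions
# plus an explicit schema injected into the system message.
import json
from openai import OpenAI
from pydantic import BaseModel

class Listing(BaseModel):  # hypothetical schema fields
    bedrooms: int
    price_usd: int
    available_from: str

client = OpenAI()

listing_text = "Sunny 2-bed apartment, $2,400/month, available March 1, 2024."

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": (
            "Extract listing details as JSON matching this schema: "
            f"{Listing.model_json_schema()}. "
            "Use the date stated in the listing, never today's date."
        )},
        {"role": "user", "content": listing_text},
    ],
)
listing = Listing(**json.loads(response.choices[0].message.content))
print(listing)
```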
Let's see how we would approach this with fine tuning.
So for fine-tuning, what we're going to do is start with a relatively simple data set, right?
So here we have examples and I want you to notice the simplicity of these examples.
There's no complex instructions.
There's no formal schema.
There's no kind of like in context examples.
All we're giving are natural language descriptions of the real estate listing and then the desired structured output.
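As a sketch, one training example in the chat-style JSONL format the OpenAI fine-tuning API expects might look like this; the listing text and output fields are made up for illustration.

```python
# One fine-tuning example: bare natural-language input, desired
# structured output, and no instructions, schema, or in-context examples.
training_example = {
    "messages": [
        {"role": "user",
         "content": "Sunny 2-bed apartment, $2,400/month, available March 1, 2024."},
        {"role": "assistant",
         "content": '{"bedrooms": 2, "price_usd": 2400, "available_from": "2024-03-01"}'},
    ]
}
# The training file contains one JSON object like this per line.
```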
So we take this data set and we fine-tune a model. Then we take this fine-tuned model, we give it a new real estate listing, and we can see that it essentially gets the problem right in this case, right?
So this is just a simple example.
But in this case, the model is both performant and efficient.
So at sampling time,
we don't have to provide the complex instructions,
no in-context learning, no explicit schema, and the model does better than we were doing with just prompting techniques.
So, fine-tuning can be a rather involved process, and it's important to set appropriate expectations
about when fine-tuning is likely to work for a use case and when it's not likely to work.
So fine-tuning is really good for emphasizing knowledge that already exists in the base model.
And so, an example of this might be a text-to-SQL task.
We have these very powerful general base models like 3.5 Turbo and GPT-4,
and they really understand everything there is to understand about SQL, right?
The SQL syntax, the different dialects of SQL, how databases work, all of these things.
But you might want to fine-tune the model to essentially emphasize a certain SQL dialect
or to coerce the model to not work its way into edge cases that are specifically error-prone, right?
So you're essentially taking the knowledge that exists in the base model and you're emphasizing a subset of it.
Fine-tuning is also really great for modifying the structure or tone of a model's output.
So one of the early killer use cases for fine-tuning was to coerce a model to output valid JSON, right?
Because when you're trying to interact programmatically with a model, something that is valid JSON is very easy to deal with; invalid JSON opens up many error cases.
And finally, teaching a model to follow complex instructions; well, this is for the reasons I mentioned earlier, right?
You can show a model during the fine-tuning process
many more examples than you could ever hope to pack into the context window of a model.
So on the other side, fine tuning is really not going to be good for adding new knowledge to the model, right?
The knowledge that exists in an LLM was impressed into that LLM during these very large pre-training runs, and you're essentially just not going to be able to get new knowledge into it during these limited fine-tuning runs.
So if you're trying to get new knowledge into the model,
you really want to look at something like RAG for all the reasons that Colin just mentioned.
Fine-tuning is not great for quickly iterating on a new use case.
If you're fine tuning, it's a relatively slow feedback loop.
There's a lot of investment for creating the data set and all these other components of fine tuning.
So don't start off with it.
So I want to essentially look at a success story of fine tuning.
And one comes from our friends at Canva.
So the task here was essentially to take a natural language description of a design mock that the user wanted,
to give it to an LLM and have the LLM output like a structured set of design guidelines.
They could then use those structured design guidelines to generate a full-size mock and present that to the user.
So it's essentially a quick way to just throw out some ideas and get a full-size mock.
So here the user says something like: I want a red gradient, I want a profile photo, maybe in the style of an Instagram post.
It goes to the LLM, and it's supposed to output something that's very structured here.
So it has a title, it has a style drawn from a known set of keywords, it has a description of the hero image, and it has an actual search query that they would give to an image search engine to generate images for these full-size mocks.
So what Canva did is they essentially started off with base 3.5 Turbo, and they started off with GPT-4.
They wanted to see what the performance was on this task, and the performance wasn't great.
And they were essentially evaluated by expert human evaluators.
And what they found were that while these models could output sensible outputs,
the outputs were actually irrelevant when looked at from like a design point of view.
So, they then went on to fine-tuning. They essentially fine-tuned 3.5 Turbo for this task, and they were really blown away by the results, right?
It not only beat the base 3.5 Turbo model, but it actually drastically outperformed GPT-4.
And so, what we're seeing here on the scale is that while 3.5 Turbo and GPT-4 often output sensible but irrelevant design mocks, fine-tuned 3.5 Turbo was often outputting rather good design mocks when evaluated by expert evaluators.
So, if we want to think about why this use case worked: no new knowledge was needed. All the knowledge needed to solve this problem existed in the base model, but the model needed to output a very specific structure.
They used very high quality training data, and they had really good baselines to compare to. They evaluated GPT-4, they understood where they were succeeding and where they weren't, and they knew that fine-tuning was going to be a good technique to use for this task.
So now I want to talk about a cautionary tale for fine-tuning.
And so, there's this author of this great blog post that I really liked.
And they had been experimenting with AI assistants as writing assistants, right?
So, they tried ChatGPT, they tried a few base models from the API.
They were impressed, but they were disappointed that these models weren't capturing their tone, right?
They have a very specific tone that they use when they're writing a blog post or social media posts or drafting emails,
and the base models just weren't capturing this tone.
So, they had a great idea, and they said: I'm going to download two years' worth of my Slack messages.
They wrote a script to format the Slack messages into a data format that's compatible with fine-tuning, and then they fine-tuned 3.5 Turbo on those Slack messages.
So, it's a process: you've got to collect the data, aggregate the data, massage it into a format that's compatible with fine-tuning, and fine-tune the model.
They finally go through this process, they get a fine-tuned model, and they ask it to do something, right?
So, they're saying, can you write me a 500-word blog post on prompt engineering?
And this model, this personalized writing assistant, responds: "Sure, I'll do it in the morning."
So, I'm sure they were a little surprised and shocked.
They follow up and say: I'd prefer you wrote it now, please.
And the model says okay, and then does nothing else, right?
So, I mean, we really got a kick out of this on the fine tuning team.
And the author was really great about it.
But if you take a step back for a second, fine-tuning really worked here.
What the author wanted was a model that could replicate their writing style.
And what they got was a model that could replicate their writing style, but their Slack writing style.
And if you think about how you communicate on Slack, it's very terse, it's in a stream-of-consciousness style, you're often forgoing punctuation and grammatical correctness. And what they got was a model that replicated that, right?
So while fine-tuning a model to replicate your tone is actually a relatively good use case for fine-tuning,
the error here was that they didn't fully think through whether the data they were providing really reflected the end behavior that they wanted from the model.
So what they probably should have done here was, you know, take 100 or 200 Slack messages, fine-tune the model, experiment with it, and see: is it moving in the right direction? Is it getting closer to the tone that I want the model to replicate?
They would have seen pretty quickly that that was not the case, and maybe they would have gone and fine-tuned on their emails, or their blog posts, or their social media posts, and maybe that would have been a better fit.
We've seen some examples.
We've developed some intuition.
So how does one actually go about fine-tuning a model?
So like any ML problem, the first step is you've got to get your data set.
And with most ML problems, this is actually the most difficult part.
Some ways of getting a data set, right, you can download an open-source data set.
You can buy data on the private markets.
You can pay human labelers to essentially collect the data and label it for you.
You can distill it from a larger model, if the terms of service of that model support that specific use case.
But some way or another, you essentially have to come up with a data set to fine-tune on.
Next, you're going to actually kick off the training process, and so this varies a lot depending on how you're trying to do the training.
If you use a turnkey solution, like OpenAI's fine-tuning API, this can be relatively simple.
If you're trying to fine-tune an open-source model, that's totally doable; you're just going to have to get your own GPUs and use a framework.
It's a little bit more involved.
But it's important while you're training to essentially understand the hyperparameters that are available for you to tune during the training process, right?
Are you more likely to overfit?
Are you more likely to underfit?
Are you going to fine-tune it to the point of catastrophic forgetting?
So it's important to just understand the available hyperparameters and the impact they have on the resulting fine-tuned model.
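As a sketch of the turnkey path, kicking off a job with the OpenAI fine-tuning API looks roughly like this; the file name is a placeholder, and the epoch setting is just an example of one hyperparameter you can adjust.

```python
# Upload a JSONL training file and start a fine-tuning job. Defaults
# are usually a reasonable starting point; hyperparameters let you
# trade off underfitting against overfitting.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("listings_train.jsonl", "rb"),  # placeholder file name
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={"n_epochs": 3},  # more epochs raises overfitting risk
)
print(job.id)  # poll the job; when it finishes, sample from the new model
```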
Next, I want to point out that it's really important to understand the loss function, right?
When you're fine-tuning an LLM, the loss function is a proxy for next-token prediction.
This is great when you're training an LLM, but next-token prediction is often not super well correlated with performance on the downstream task that you care about.
If you think about code generation, there are many different ways to write code that solves a single problem. So if you're just doing next-token prediction with exact token matching, the loss, or a change in the loss function, might not correlate with the change in performance on the downstream task.
It's important to understand that.
Next, you want to evaluate the model.
There are a few different ways of evaluating the model.
You can get expert humans to look at the outputs and actually rank them.
Another technique is that you can essentially take different models, generate outputs from them, and just rank them against one another, right?
So you're not getting an absolute ranking, but something like an Elo score, like you get in chess.
You can also do something like have a more powerful model, rank the outputs for you.
It's really common to use GPT-4 to rank the outputs of fine-tuned open-source models or GPT-3.5 Turbo.
And finally, you want to actually deploy it and then sample from it at inference time.
And these last three points can form something of a feedback loop and a data feedback loop, right?
So you can essentially train the model, evaluate it, deploy it to production, collect samples from it in production, use those to build a new data set, downsample and curate that data set a bit, and fine-tune further on it, and you get something of a flywheel going here.
So we've talked about a few of these up until this point, but I want to formalize some of the best practices that we recommend when it comes to fine-tuning, right?
So first off is just start with prompt engineering and few shot learning, right?
These are very simple, low-investment techniques.
They're going to give you some intuition for how LLMs operate and how they work on your problem.
And it's just a great place to start.
Next is that it's really important to establish a baseline before you move on to fine tuning, right?
So this kind of ties back to the success story for Canva.
They tried 3.5 Turbo.
They tried GPT-4.
They got a really good understanding for the ins and outs of their problem that they were trying to solve.
They understood the failure cases of those models.
They understood where the models were doing well.
So they understood exactly what they wanted to target with fine tuning.
And finally, when it comes to fine-tuning, start small. You know, don't download 140,000 Slack messages and just kind of do it in a single shot. Develop a small, high-quality data set, perform the fine-tuning, and evaluate your model to see if you're moving in the right direction.
You can do something like an active learning approach here, right, where you fine-tune the model, you look at its outputs, you see in what areas it's failing, and then you specifically target those areas with new data.
So you're investing very intentionally.
And it's really important to remember that when it comes to LLMs and fine-tuning, data quality trumps data quantity.
So the data quantity part of the training process was done in pre-training, right?
Now, it's like, you really want to focus on fewer high-quality examples.
And just to talk about fine-tuning and RAG: if you combine these together, for certain use cases it can be the best of both worlds.
So how this works is that you fine-tune a model to understand complex instructions, and then you no longer have to provide those complex instructions, or few-shot examples, at sample time.
So you've essentially fine-tuned a model that's very efficient to interact with.
Well, this means that you've essentially minimized the prompt tokens that need to be provided
at sample time, because you no longer need to do complex prompt engineering, it's baked into the fine-tuned model.
And this means that you have more space for retrieved context.
And so you can then use RAG to inject relevant knowledge into the context, and the context that's available has essentially been maximized at this point.
Now, of course, you have to be careful not to oversaturate the context, as some of it might have spurious correlations with the actual problem that you're trying to solve.
But this essentially opens up room in the context window to be used for more important purposes.
So with that said, we've been talking about theory up until this point in the talk, but we're now going to talk about application, so I'll turn it back over to Colin to get us started.
Thanks.
Thanks, John.
Cool.
Yeah, so let's take all this theory for a spin.
So the problem we decided to take on was the Spider 1.0 benchmark.
Effectively: given a natural language question and a database schema, can you produce a SQL query that answers that question?
So an example looks something like this.
So given this database schema and given this question at the bottom, can we produce that SQL query on the right?
So it's a classic problem, with lots of different attempts on it.
And what we did was follow the kind of advice that we've given you folks. We started off with prompt engineering and RAG.
If I just share some of the different methods we used and get into the details of what we tried: we started off with the simplest possible RAG approach.
So, simple retrieval: cosine similarity, using the question to find SQL queries for similar questions, basically. A similarity search with the question.
We also tried formatting the embeddings differently, and we tried a bunch of prompt engineering, just with a few isolated examples.
And the results were, as you'll see in a second, not super good.
So we thought about this problem, and realized that a question could have a totally different answer if it has a different database schema.
So doing a similarity search on the question doesn't make a lot of sense for this problem, but using a hypothetical answer to that question to search might give us much better results.
So what we did was use hypothetical document embeddings: we generated a hypothetical SQL query, and we used that for the similarity search.
And we actually got a really large performance bump with that for this particular problem.
We also tried contextual retrieval, with a simple filter: we ranked the hardness of the question that came in, and then only brought back examples of equal hardness in our RAG, basically, if you see what I mean.
And that got us slightly better improvements.
We then tried a couple of more advanced techniques.
There are a lot of different things here. You can try chain-of-thought reasoning, where you maybe get it to identify the columns, then identify the tables, and then build the query at the end.
But what we settled on was fairly simple: we went with a self-consistency check.
We went with self consistency check.
So got it to build a query,
run the query,
and we gave it the error message if it messed up and gave it a little bit of annotation and got a try again.
Getting the model to fix itself is, again, sort of funny, but it's something that we see actually work fairly well, if you have a use case where latency or cost is not a huge concern. A minimal sketch of that loop is below.
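This sketch uses sqlite3 as a stand-in for whatever database the Spider questions actually run against; the prompts and retry budget are illustrative assumptions.

```python
# Self-consistency check sketch: generate SQL, execute it, and on error
# feed the message back to the model for another attempt.
import sqlite3
from openai import OpenAI

client = OpenAI()

def generate_sql(question: str, feedback: str = "") -> str:
    prompt = f"Write a SQL query answering: {question}\nReturn only the SQL."
    if feedback:
        prompt += f"\nYour previous query failed with: {feedback}\nFix it and try again."
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def answer_with_retries(question: str, conn: sqlite3.Connection, max_tries: int = 3):
    feedback = ""
    for _ in range(max_tries):
        query = generate_sql(question, feedback)
        try:
            return conn.execute(query).fetchall()
        except sqlite3.Error as exc:
            feedback = str(exc)  # annotate the retry with the error message
    return None  # give up; this loop costs latency and tokens, as noted
```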
And the results we got looked something like this, so let me come over here and talk through it. On the far right is where we got to with prompt engineering. So, not so great: we started off with 69 percent.
Then we added few-shot examples and got a couple of points of improvement, which told us that RAG could actually give us further improvement here.
So we tried retrieval with the question, and you can see that we got a 3% performance bump.
Then, using the answer, the hypothetical document embeddings, we got a further 5%, which is pretty cool.
So actually, just by using a hypothetical answer to search, rather than the actual input question, we got a massive performance bump over what we started with.
Then all we did was increase the number of examples, and we got up to four points shy of the state of the art with this approach.
So again, this was just a couple of days of hacking around, starting off with prompt engineering and moving to RAG, and it shows you just how much performance you can squeeze out of these very basic starting approaches.
And at that point, we decided to turn to fine-tuning and see whether we could take it any further.
And that's where I'll hand over to John.
Yeah, so for fine-tuning, we turned it over to our preferred partners for fine-tuning at Scale AI.
And they started off by establishing a baseline as we recommend, right, so the same baseline that we saw in the previous slide of 69%.
This is just with simple prompt engineering techniques.
They then fine-tuned GPT-4 with simple prompt engineering techniques, where you just kind of reduce the schema as it goes into the example, right?
So a very simple fine-tuned model, a little bit of prompt engineering, and they got a bump of a couple of percent, right?
So we're now kind of within striking distance of state of the art.
They then used RAG with that model, to essentially dynamically inject a few examples into the context window, based on just the question.
So not even very advanced RAG techniques.
And they got 83.5%, which put us right within striking distance of state of the art.
And if you look at the Spider leaderboard on the dev set, the techniques used are very complex, right?
There's a lot of data pre-processing, a lot of data post-processing, often hard-coding edge cases into the script being used to actually evaluate the model.
We didn't actually have to use any of that here, right?
Just simple fine-tuning, simple prompt engineering, just following the best practices, and we got within striking distance of state of the art on this really well-known benchmark.
And it kind of shows the power of fine-tuning and RAG when combined.
So just a recap,
right, when you're working on a problem and you want to improve your LLM's performance, start off with prompt engineering techniques, right?
These are very low-investment; they allow you to iterate quickly, and they allow you to validate LLMs as a viable technique for the problem that you're trying to solve.
So you iterate on the prompt until you hit something like a performance plateau, and then you analyze the types of errors that you're getting.
So if you need to introduce new knowledge or more context to the model, go down the rag route, right?
If the model is inconsistently following instructions,
or needs to adhere to like a strict or novel output structure,
or you just generally need to interact with the model in a more effective way, time to try fine tuning.
And it's important to remember that this process is not linear. That's really what we want to stress, right?
It might take many iterations to get to a point that you're really happy with, and you're going to be jumping back and forth between these techniques on your journey.
So with that said, we hope you enjoyed this talk.
Colin and I will be here for the rest of the day, so come find us if you have any questions.
Okay.
Thank you.