Llama 3 RAG Demo with DSPy Optimization, Ollama, and Weaviate!
Hey everyone, the day is finally here: Llama 3 is live.
For a little bit of background:
Meta completely changed the game when they open-sourced Llama 2, showing that we can have a super capable large language model with open-sourced weights.
And after announcing Llama 2, Meta had kind of put it out there that they were going to be working on Llama 3,
which would be even more capable and feature larger models.
And we'll see later on in this blog post that they're working on a 400-billion-parameter large language model,
with the intention to open source it.
So maybe the most interesting aspect of this is that they're open-sourcing the model.
This opens the opportunity for all sorts of third-party inference API providers.
In this video, we'll look at a demo with Ollama, but you could also serve the model with, say, Together AI, Anyscale, or Groq.
So there's this interesting opportunity around serving these models.
And then of course, since the weights are open,
we can fine-tune these models with gradients, using tools like Hugging Face, and generally, you know, control the model.
So the AI world has kind of collectively been holding its breath for this announcement of Llama 3.
I'm so excited that, you know, here we are today and the announcement has been made.
So in this video, we're going to first dive into the blog post and how Meta is presenting their new large language model.
And then in the second half,
we'll look at some demos:
how to build a retrieval-augmented generation system with it,
as well as the super exciting new idea of using DSPy to find the optimal prompt for this particular large language model.
So I'm so excited to dive in.
Let's see what Meta is telling us.
All right, let's cover everything in the release blog: Introducing Meta Llama 3, the most capable openly available large language model to date.
They begin by telling you how you can use it on Amazon, Databricks, Google Cloud, Hugging Face, and so on.
Quickly, I'd add that it's also on Ollama now.
Thank you so much to the Ollama team for getting this done so quickly,
and then updating the chat template so that responses don't leak the trailing assistant token.
And just being able to go to the Ollama Discord and
watch everyone hacking on it and asking, "Hey, is it ready yet? When do I update?"
was a super cool experience.
And through Ollama, you can use it in Weaviate as well as DSPy, which we'll see later in this video.
Okay, so next up: in addition to announcing the model, they're giving some tools to help with guardrails and safety,
namely Llama Guard 2, Code Shield, and CyberSec Eval 2. I'm not super familiar with those
personally, but they're some things being released around the model as well.
So in addition to the model, they're also teasing that in the coming months we
can expect them to introduce new capabilities. What does that mean?
They clarify later in the article that by that they mean improved reasoning ability and improved coding ability.
When you're talking about features or capabilities in a large language model, the
analogy from software is that the training data is how you program the features of the model.
So when they say new capabilities, what they likely mean is that they're curating a new dataset
specialized for reasoning and coding tasks, and they're going
to keep training the model on that, maybe with a larger share of the dataset devoted to it as they go forward,
and that's how they'll introduce new capabilities.
Next, longer context windows.
Llama 3 is launching with an 8,000-token input window,
which is a bit short compared to, say, Cohere's Command R+ from the last video, which has 128,000.
So that's
one thing to take note of if you're thinking about switching your models around to integrate Llama 3 into what you're working on.
So then they're also proposing additional model sizes; at the end of the blog post we'll see that they're working on a 400-billion-parameter open-source large language model.
That's a pretty amazing update from just, you know, following along with this for a while.
Okay, and they're also saying that the research paper is coming out soon.
Next up, a pretty major thing
is that we have another one of these chatbot systems that come with a new model.
We saw Mistral recently unveil Le Chat alongside their models.
Obviously we have ChatGPT and then, say, Anthropic's Claude.
So we also now have Meta AI.
Maybe the interesting part is how you log in with Facebook, linking this massive social media platform with the
chatbot and all that kind of stuff.
So an interesting development.
Personally,
as we'll see in the DSPy demo, I'm really more fascinated with the systems you build around large language models by connecting them into pipelines, or agents, if you prefer that term.
So I always really like seeing these kinds of systems as well as just the model itself.
Okay, so next up, they're releasing the first two models:
an 8-billion-parameter and a 70-billion-parameter model.
It's super,
super important to note that the pre-trained model and the instruction-tuned model are two different things.
If you're going to be fine-tuning with,
say, your own dataset of preferences, and you're going to RLHF it and all that kind of stuff, then you go for the pre-trained model.
But if you're going to be plugging this into your applications and, say,
prompt-optimizing it with few-shot examples, then I highly recommend using the instruction-tuned model.
The pre-trained model has just been language modeling the internet, or maybe also
a more curated dataset,
but the instruction-tuned models
have been through RLHF, or something like DPO or PPO, on preference data.
So they're going to be better at following instructions.
Okay, so next up are the goals for Llama 3.
Something that stood out to me is that they're starting off with text-based models, with the goal in the near future of making
Llama 3 multilingual and multimodal later on.
I think they say something like 5% of the training data is non-English; we'll see it when we come to it.
So: adding multilingual and multimodal support, longer context, and then again improving the capabilities, by which they mean reasoning and coding.
Okay, so probably my favorite thing about these new large language model announcements is the state-of-the-art performance.
What's the state of how people are reporting this?
Meta is going to be using academic benchmarks,
so we're looking at MMLU, GPQA, HumanEval, GSM-8K, and the MATH dataset.
So let's take a look.
So here is the MMLU benchmark; this is probably the most used measuring stick for large language models.
I'm going to skip ahead to the end of the paper, where they have a section that kind
of explains it. The MMLU benchmark is designed for massive diversity in evaluating large language models.
You have tasks like abstract algebra, anatomy, marketing, or nutrition, so it covers an insane diversity,
and there's a summary of all 57 tasks included in this dataset.
Here are some examples of what the dataset looks like;
it's mostly multiple choice.
So you have a question like,
"what is the..." -- I'm not going to read these out, but you can see examples from this dataset.
So the first thing they're reporting is this one, Measuring Massive Multitask Language Understanding, and then another one that's very interesting to me is HumanEval.
HumanEval comes from the paper Evaluating Large Language Models Trained on Code, which introduced it.
It's got questions like this,
where you have coding problems like what you'd see on, say,
LeetCode, and the model has to write code that passes the tests.
So it's a pretty interesting way of measuring these models.
So they're comparing the 8B model to Gemma from Google and Mistral 7B Instruct, and then they also have the 70-billion-parameter model,
which is compared against Gemini Pro 1.5 and Claude 3 Sonnet.
The results are probably the thing that will catch your eye most:
the fact that this open-source model is on par with Gemini Pro 1.5
on the datasets that kind of everyone is using as the measuring stick.
Maybe one of the things you'd be interested in knowing is that five-shot means there
are five examples of the task included in the input, whereas zero-shot means instructions only.
Chain of thought means that you also add
"let's think step by step" to the prompt,
and eight-shot chain of thought means that the examples also have chain-of-thought reasoning in them.
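To make the distinction concrete, here is a minimal Python sketch of how those prompting styles differ; the question and examples are made up for illustration and are not from Meta's actual evaluation harness.

```python
# Illustrative only: a made-up MMLU-style question, not from Meta's eval harness.
question = (
    "Which element has the atomic number 6?\n"
    "(A) Oxygen (B) Carbon (C) Helium (D) Iron"
)

# Zero-shot: instructions only, no worked examples.
zero_shot = f"Answer the multiple-choice question.\n\n{question}\nAnswer:"

# Five-shot: five solved examples of the task are prepended to the input.
examples = [
    ("What is 2 + 2?\n(A) 3 (B) 4 (C) 5 (D) 6", "B"),
    # ...four more (question, answer) pairs would go here...
]
few_shot = (
    "\n\n".join(f"{q}\nAnswer: {a}" for q, a in examples)
    + f"\n\n{question}\nAnswer:"
)

# Chain of thought: the prompt also asks for step-by-step reasoning; in the
# 8-shot CoT setting, each example additionally contains its own reasoning.
cot = f"Answer the multiple-choice question.\n\n{question}\nLet's think step by step."
```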
Okay, so next up, they publish a new evaluation set. It consists of 1,800 prompts
across 12 use cases: asking for advice, brainstorming, and so on. So it's another dataset for the sake of comparing these models.
This is my favorite way of evaluating large language models.
Instead of just having a benchmark, where there are all sorts of concerns around
whether you have trained on the test set,
they even have a part in the blog post where they say something like:
to prevent accidental overfitting of our models,
even our own modeling teams do not have access to this set. So it's kept locked away,
all that kind of stuff.
I think this is a better way of evaluating these models:
you have the responses from two models, and either a human judges
which one is better, or you have another large language model do the judging,
and you report a win rate.
Here we've got Meta Llama 3 70B Instruct versus Claude Sonnet,
achieving a 53%
win rate, alongside comparisons against Mistral Medium, GPT-3.5, and Llama 2. I personally really like this way of evaluating large language models.
Okay, so next up we see more of the pre-trained model
performance. Maybe coming back to what I was talking about earlier: if you look at the difference between the 8-billion pre-trained-only score
compared to the instruction-fine-tuned 68.4 above, it's not a super massive difference.
But maybe let's look at the zero-shot scores; the zero-shot HumanEval is 62.2... oh, these aren't the same datasets.
This one is AGIEval.
Okay, so anyway, hopefully you get this idea of pre-trained versus instruction-tuned.
The reason the pre-trained model is so useful is that if you're going to be fine-tuning this model,
it's a better starting point
compared to the already instruction-tuned version of it,
and that's kind of the common consensus around thinking about pre-trained versus instruction-tuned models.
Then we have some details about the model architecture,
such as how many tokens are in the tokenizer vocabulary:
128,000.
It was super interesting in the last video, covering Cohere's Command R+, to see how their tokenizer has this kind of multilingual support, with the result that you need fewer tokens to
encode non-English text.
So I think the multilingual story is maybe the most interesting one around tokenizers, at least that I'm currently aware of.
Next up, you have another detail in the architecture.
Last year there was a bit of a moment around FlashAttention;
everyone was saying they're innovating on attention and that's going to result in these totally new models.
I'd say now it's mostly mixture-of-experts and a bit of rising interest in state space models.
But here the announcement is grouped query attention, GQA, and they describe how it relates to the choice of 8 billion parameters.
You know,
there's this kind of dark-magic thing where Tim Dettmers had published work on quantization
suggesting that around 6.7 billion is the magic number, where a model has to be larger than that to be capable of reasoning, and,
you know, Llama 2 was seven billion and then Mistral was seven billion.
So it was kind of like, okay, seven is going to be the number, but
here they're explaining how this grouped query attention is part of why it's eight billion.
Okay, so next up is the training data, which is pretty interesting, and then, you know,
scaling laws and all this kind of stuff.
Starting off: it's pre-trained on 15 trillion tokens,
an absolutely enormous dataset, seven times larger than what they used to train Llama 2.
And then here's the nugget about multilingual.
Again, programming features into large language models is about having the training data for the model to learn that kind of thing.
For multilingual use cases, over 5%
of the Llama 3 pre-training dataset consists of high-quality non-English data that covers over 30 languages.
So, you know, they're beginning to build that part of the corpus: the 5%, the 30 languages.
And by increasing that percentage,
as well as
the number of languages covered,
that's how they would add more multilingual support, and all that kind of stuff.
They also describe how they do things like heuristic filters,
NSFW filters,
and semantic deduplication. There are, I mean, so many niche categories of specialized skill sets, and data curation is definitely one of them.
I think of a lot of cool things from Nomic AI and how their embedding visualizations can be used to achieve that kind of semantic deduplication,
but there's definitely a lot that goes into how they curate these pre-training datasets.
Next up is scaling up pre-training. Scaling laws are maybe the most interesting topic in AI, because if you can just keep
scaling the model, it magically gets smarter: it's predicting the next token and forms
this kind of magical world model as a result of doing so. So there is, you know, a fascination around, well,
what happens if we just keep scaling it?
And that's the bitter lesson from Richard Sutton, just saying, you know,
algorithmic innovation gets us somewhere, but scaling up the models is maybe the bigger lever. So then there's Chinchilla optimal compute,
a paper that came out from DeepMind.
I'm not super familiar with all of it,
but they have plots of compute and tokens against performance,
and they extrapolate what to expect based on, say, a run with 50 million tokens, then another at 500 million tokens, and so on.
So they have these plots for how they predict, because you're not just going to spend millions of dollars training one of these models without
trying to forecast what the additional compute will buy you.
So that hopefully gives you a background
of what they're talking about here
with these scaling laws and how they try to fit these coefficients on model performance as a function of model size,
as well as training steps, and perhaps the size of the dataset as well.
I'm not sure if they do multiple passes
through the data. Epochs make a ton of sense when you have your cat-versus-dog
classifier, because you're going through the same dataset,
your thousand cats and dogs, multiple times, but
I'm not sure if they go through those 15 trillion tokens multiple times.
Anyway, I'm not really fully caught up on that, but that gives you a bit of background, if helpful.
Okay, so then they describe the two custom-built 24,000-GPU clusters.
The computational resources here are just... all right, so combined, these improvements increased the efficiency of Llama 3 training by three times.
They're describing how things like grouped query attention and the new data centers and all this kind of stuff make them more efficient,
better at pre-training these models.
And of course, we saw, say, MosaicML make tons of breakthroughs in pre-training models.
So yeah, that's what I'd say.
Okay, so then instruction fine-tuning.
This is a topic I wish I knew more about, and I can't wait to dive into it further, especially with DSPy's
BootstrapFinetune teleprompter and compiler.
They describe their thoughts on supervised fine-tuning, rejection sampling, PPO, and so on: basically how we get these instruction-tuned models.
There will probably be more details on this when the research paper comes out.
Okay, so then they have some things around guardrails, like harmful output detection, as well as this kind of structured output decoding.
I'm a huge fan of, say, Instructor, as well as SGLang and DSPy's typed predictors and assertions.
So there are a lot of tools now around enforcing structured outputs,
and then also, of course, checking for harmful content and so on.
Another framing that's really good in this kind of thinking: Harrison Chase describes this as a state machine,
because, you know,
Harrison and LangChain have done a ton of work on these language model graphs
where you have, say, eight calls to language models, and he describes really nicely how you're trying to manage
some kind of complex state transition,
and how the language model transforms the text at each step in that processing,
so maybe that's another useful way of thinking about this.
Okay, so they've also co-developed Llama 3 with torchtune, the new PyTorch-native library for easily authoring, fine-tuning, and experimenting with LLMs.
I think this is pretty fascinating.
We saw,
I think, AutoTrain from Abhishek and Hugging Face,
which was, in my understanding at least, one of the most successful efforts at this kind of automated fine-tuning library.
But it looks like we have another one for PyTorch, with integrations with Hugging Face, Weights & Biases, and EleutherAI.
So I think this is another library for facilitating fine-tuning models with gradient descent using PyTorch.
Okay, and then a little more discussion of the system and how they intend for you to,
you know, put these kinds of guardrails around the input as well as the output.
And of course, with DSPy we think a lot about these kinds of pipelines,
and this is mostly the same thinking.
And then,
okay, more on intentions for deployment, with a hat tip to grouped query attention and how that maintains inference efficiency on par with
Llama 2 7B, even with the extra billion parameters.
Then we have llama-recipes, a collection of things you can do with Llama 3.
Okay.
And then here's the big thing:
what's next for Llama 3?
Okay.
So this is the announcement.
I mean, it's all great, but here's probably the craziest one:
their largest models are over 400 billion
parameters.
Back when GPT-3 was around 175 billion parameters
and cost millions of dollars to train, I don't think anyone had even fathomed the idea.
I don't think anyone had predicted that this scale of model was going to be open-sourced.
So it's pretty remarkable to see a 400-billion-plus-parameter model.
They're saying it's still training, and here's where it's at now.
It's pretty interesting with the scaling laws thing:
they're still training it, and they expect it to keep getting better as they continue training.
Okay, so: try Llama 3 today.
Let's dive into a demo with DSPy.
All right, let's dive into a demo showing you how you can use Llama 3 with Ollama and connect it to DSPy and Weaviate.
First up, we'll have a quick overview.
Again, you can see the release notes that we just looked at, and a huge thank you to Ollama for supporting this model.
It was really funny seeing how, when they first put this live, you saw like a thousand pulls right away.
So this is how you can get it: you go to your command line, install Ollama, and then just run ollama run llama3 and you
put one of these tags on the model name.
So let me see where I'm typing here.
You just have ollama run llama3 and then you put a tag on the model.
So if you want the instruct model,
you do it like this, or whichever tag you need. You'll see they have this plain 8b tag,
but the one you probably want to use is one of these quantized instruct builds. I'm not exactly sure why,
but just from seeing the Discord, talking to some people, and trying it myself, you probably want to use this model,
and this is the one we'll be using. The way you do that, if you see me typing up here, would be ollama run llama3 with the tag.
Oh, well, you see it right here, actually.
So you just run this in your terminal.
Okay.
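The video drives Ollama from the command line; if you would rather script the same check from Python, here is a hedged sketch assuming the `ollama` Python package is installed and a local Ollama server is running. Treat the response access pattern as an assumption about that package rather than something shown in the video.

```python
# Hedged sketch: pull and smoke-test the quantized instruct build from Python.
# Assumes the `ollama` package (pip install ollama) and a running Ollama server.
import ollama

MODEL = "llama3:8b-instruct-q5_1"  # instruction-tuned, quantized tag used in the demo

ollama.pull(MODEL)  # downloads the weights if they are not already local

reply = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(reply["message"]["content"])
```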
Anyway, so thank you so much to all of them.
All right.
So next up,
we're going to start by looking at how to build a RAG system with Llama 3, Ollama, Weaviate, and DSPy.
And then we're going to use DSPy's
MIPRO optimizer to find the optimal RAG
prompt for using Llama 3.
This is one of my favorite stories: this idea that the same prompt isn't optimal across all these large language models.
So I'm so excited to be taking some more steps toward comprehensively exploring that, with this notebook, as new language models come out.
Anyway, the first thing you do is connect to dspy.OllamaLocal, and we pass in the model.
I'm using 8b-instruct with the q5_1 tag.
I'm not sure what the difference is between the quantization levels, q0 through q8, but
this is the instruction-tuned model, and it's been quantized so that you're able to run it on your laptop this quickly.
Okay, so then we connect to Weaviate, and then we set the defaults in our DSPy settings to use this LM and our retrieval model.
The first thing I always like to do is check that it's connected by just passing the LM a prompt:
"Say hello."
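Here is a minimal sketch of that setup step. It assumes DSPy's OllamaLocal client and WeaviateRM retriever, a local Weaviate instance, and a collection named "WeaviateBlogChunk"; the exact client version and collection name in the recipe notebook may differ.

```python
# Hedged sketch of connecting Llama 3 (served by Ollama) and Weaviate to DSPy.
import dspy
import weaviate
from dspy.retrieve.weaviate_rm import WeaviateRM

# Language model: the quantized instruct build served locally by Ollama.
llama3 = dspy.OllamaLocal(model="llama3:8b-instruct-q5_1")

# Retrieval model: a local Weaviate instance holding chunked blog posts.
# "WeaviateBlogChunk" is an assumed collection name.
weaviate_client = weaviate.connect_to_local()
retriever = WeaviateRM("WeaviateBlogChunk", weaviate_client=weaviate_client, k=5)

# Make these the defaults for every DSPy module in the program.
dspy.settings.configure(lm=llama3, rm=retriever)

# Quick sanity check that the language model is connected.
print(llama3("Say hello!"))
```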
Okay, so there are four key steps to using DSPy's compilers, and I'm kind of entangling the compiler setup with the basic RAG demo.
The first thing you need is a dataset.
So I have this Weaviate blog RAG dataset; it's in the Weaviate Recipes folder,
and it just has questions and answers synthesized from the Weaviate blog posts.
I separate these into train, development, and test splits so that I can optimize on the training set,
tune hyperparameters by monitoring the development set, and then, when I'm finished with my optimization, evaluate on the held-out test set.
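A sketch of that split, assuming the synthesized question/answer pairs are stored as JSON records; the file name, field names, and split sizes below are placeholders rather than the exact ones in the recipe.

```python
# Hedged sketch: load synthesized Q&A pairs and split into train / dev / test.
import json
import dspy

with open("WeaviateBlogRAG.json") as f:  # hypothetical file name
    rows = json.load(f)

examples = [
    dspy.Example(
        question=row["question"],
        gold_answer=row["gold_answer"],
    ).with_inputs("question")
    for row in rows
]

# Illustrative split sizes; the recipe may use different counts.
trainset, devset, testset = examples[:20], examples[20:30], examples[30:50]
```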
The next thing you need is a metric.
This is a pretty
hand-wavy metric for now,
and, as a teaser, I'm planning for the next video in the DSPy series
to be about diving into these LLM metrics.
For now we just have:
evaluate the quality of the answer to a question
according to a given criterion, and you pass in the criterion, the question, the ground-truth answer, and so on.
So it's basically just asking how aligned the predicted answer is with the ground truth,
and it outputs a rating on a scale of one to five for how aligned it is.
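Here is a hedged sketch of that LLM-as-judge metric in DSPy. The field names and criterion wording are illustrative, not the exact ones in the recipe; the key idea is that the judge returns a 1-to-5 rating that is rescaled into a score the optimizer can use.

```python
# Hedged sketch of an LLM metric that rates answer alignment on a 1-5 scale.
import dspy

class AssessAnswer(dspy.Signature):
    """Evaluate the quality of the predicted answer to the question, according to the given criterion."""

    criterion = dspy.InputField()
    question = dspy.InputField()
    gold_answer = dspy.InputField()
    predicted_answer = dspy.InputField()
    rating = dspy.OutputField(desc="A single number from 1 to 5.")

def llm_metric(gold, pred, trace=None):
    """Standard DSPy metric signature; returns a score in [0, 1]."""
    assessment = dspy.Predict(AssessAnswer)(
        criterion="How aligned is the predicted answer with the ground-truth answer?",
        question=gold.question,
        gold_answer=gold.gold_answer,
        predicted_answer=pred.answer,
    )
    # The rating comes back as text, so parse the first digit defensively.
    digits = [c for c in assessment.rating if c.isdigit()]
    rating = int(digits[0]) if digits else 1
    return rating / 5.0
```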
Okay, so then our DSPy RAG program. For generate-answer, we have our initial prompt, "Assess
the context and answer the question,"
our two input fields, the context and the question, and then our output field, a detailed answer that is supported by the context.
Then we build our RAG program, where we declare our modules: we have our retrieval engine, and then we have our generate-answer prompt.
In the forward pass, we take the question and use it to retrieve the context, and then we pass the context
to our generate-answer prompt, get the answer, and return it.
This is one idea of DSPy
for building these complex systems: you can have several different prompts where the output of one call is the input to the next call, and you can integrate
tools like retrieval, calculators, Python interpreters, and all this kind of stuff.
Once you have this program, you can just create it and then run it with a question like "What is binary quantization?"
This retrieves from a Weaviate index filled with chunks of the Weaviate blog posts,
so it can retrieve some information about binary quantization to answer the question of what it is.
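Here is a sketch of that program. The signature text mirrors the prompt described above; whether the recipe wires the generator as dspy.Predict or dspy.ChainOfThought is a detail I am assuming, and it relies on the LM and retriever configured earlier.

```python
# Hedged sketch of the RAG program: retrieve context, then generate an answer.
import dspy

class GenerateAnswer(dspy.Signature):
    """Assess the context and answer the question."""

    context = dspy.InputField(desc="Chunks from the Weaviate blog posts.")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="A detailed answer that is supported by the context.")

class RAG(dspy.Module):
    def __init__(self, k=5):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=k)          # uses the default Weaviate retriever
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages  # output of one call feeds the next
        return self.generate_answer(context=context, question=question)

rag = RAG()
print(rag(question="What is binary quantization?").answer)
```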
Alright, so next up.
Compiling with MIPRO.
We've recently published a blog post on weaviate.io covering the DSPy optimizers:
BootstrapFewShot, BootstrapFewShotWithRandomSearch (or Optuna), COPRO, and MIPRO.
So MIPRO is the heavyweight of the DSPy optimizers, and it's doing four things.
The first thing it does is look at your training set and make observations to summarize
what's in it, and then it takes that information to propose paraphrasings of your task instruction.
Now, for each paraphrasing of your task,
you're going to use bootstrap few-shot, where you bootstrap an example with that particular instruction.
So you had "Assess the context and answer the question,"
and then it writes a paraphrasing, something like "Please carefully read these documents."
Now you're going to bootstrap an example with the "please carefully read these documents" instruction,
and you had another paraphrasing that was something like "Carefully review all the context and form a coherent response,"
so you would also bootstrap for that other prompt.
And so you get more insight into how effective each of the paraphrased instructions is by also pairing them with examples.
Then it does the same kind of thing that COPRO does, where it observes how these different prompts are performing and uses that
to propose yet another paraphrasing of what the optimal prompt might be.
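A hedged sketch of what kicking off that compilation looks like. MIPRO's constructor and compile arguments have shifted across DSPy versions, so treat the parameter names and values below as illustrative rather than exact; it reuses the llm_metric, RAG program, and trainset sketched earlier.

```python
# Hedged sketch: compile the RAG program with MIPRO to search over instructions
# paired with bootstrapped examples. Argument names/values are illustrative.
from dspy.teleprompt import MIPRO

teleprompter = MIPRO(metric=llm_metric, num_candidates=10)

compiled_rag = teleprompter.compile(
    RAG(),
    trainset=trainset,
    num_trials=25,                 # how many instruction/demo combinations to try
    max_bootstrapped_demos=1,      # examples bootstrapped per candidate instruction
    max_labeled_demos=0,
    eval_kwargs={"num_threads": 1, "display_progress": True},
)

print(compiled_rag(question="What are cross encoders?").answer)
```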
So in this case, using the dataset, we start off with "Assess the context and answer the question,"
and we end up with:
"Given the provided context, your task is to understand the content and accurately answer the question based on the information available in the context.
You should use formal English with technical terminologies where necessary and provide a detailed, relevant response."
That prompt ends up performing better than the original one when using it for RAG.
This is a huge concept, I think.
For me,
I was quick to understand it,
but I want to slow it down and help other people understand it too:
when you're prompting an LLM, the details of the language for how you're going to communicate the task matter tremendously.
So this is how you use it with MIPRO: it's going to run through all these candidate prompts,
and then in the end you just have this new optimized prompt. Asking "What are cross encoders?",
you can see, when you inspect the history, the actual call to the language model:
it plugs in the optimized prompt,
and then
follows the format with the input and output fields, then the context, and then the answer.
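If you want to see exactly what was sent to Llama 3, DSPy's LM clients keep a history of their calls; asking for the most recent one prints the optimized instruction, the formatted fields, the retrieved context, and the answer.

```python
# Print the most recent call made to the Llama 3 client, including the
# optimized instruction and the formatted input/output fields.
llama3.inspect_history(n=1)
```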
Thank you so much for watching this video on the long-awaited release of Meta's Llama 3.
Thank you so much to the Ollama team for supporting this so quickly, and of course
thank you to the Meta researchers who created this model and open-sourced it.
It's all just so exciting for the state of AI, whether we're running it locally, reproducing results
for the sake of science, or generally, you know, owning these models, so to speak, when you
have the weights and all that. It's just such an exciting development, and all of this is really interesting to keep up with.
Really quickly before ending the video,
we're hosting a meetup with Arize AI and Cohere featuring Omar Khattab,
the lead author of DSPy, in San Francisco on Wednesday, May 1st.
It's going to be so much fun.
I'm excited for it, and I really hope to see you there.
The event page will be linked in the description of this video, as well as, of course, the Weaviate Recipes repository,
where all of this DSPy series can be found.
Thank you so much for watching.