Try this Before RAG. This New Approach Could Save You Thousands!
Anthropic recently introduced prompt caching with Claude that can reduce cost by up to 90%
and latency by up to 85% for long prompts, and it could be a good alternative to RAG on a smaller scale.
But Google was actually the first company that introduced context caching with their API.
The Gemini API also supports processing PDF files directly through the API without any pre-processing.
It seems to have implemented an approach similar to ColPali, which does vision-based retrieval from documents.
That means you don't need external pre-processing libraries like Unstructured.io or LlamaParse.
And they just had a major upgrade: you can now upload PDFs of up to 1,000 pages to the API.
Which is pretty amazing and will probably take care of most of the applications that you will encounter.
Now, with this approach you're still paying per token,
so you would be sending the PDF files multiple times. However, you can combine
this approach with context caching to save on your token cost, and that's exactly what we're going to be doing in this video.
So first let's look at some of the capabilities of the Gemini API for processing PDF files.
Now, in the official documentation, it says that Gemini 1.5 Pro and Flash support a maximum of 3,600 document pages.
Document pages must be of the application/pdf MIME type, and each document page is equivalent to 258 tokens.
Now, it doesn't really matter how many pixels are in the document: larger pages are scaled down to 3072x3072 while preserving the original aspect ratio, and smaller pages are scaled up to 768x768 pixels.
It seems like these are the minimum and maximum limits when it comes to the resolution of the pages that it can process.
There is no cost reduction for lower-resolution pages, other than the bandwidth and performance improvements.
Now, a few things to keep in mind if you are sending PDF files: rotate pages to the correct
orientation before uploading, avoid blurry pages, and if using a single page, place the prompt
after the page.
I'm going to walk you through a detailed tutorial in the later part of the video but the API lets you store up to 20 gigabytes of files per
project with a per file maximum size of 2 gigabytes.
Files are stored for 48 hours; they can be accessed in this period with your API key, but they cannot be downloaded from the API.
And file storage is available at no cost in all regions where the Gemini API is available.
So it seems like the files will persist for 48 hours. But I think whenever you are making a subsequent call,
it's going to be using the whole PDF file content in order to generate a response.
Now, with a model like Gemini Flash the cost is pretty minimal, but for a large number of tokens it will still add up,
and that's why you want to make sure to use context caching, because that way you
are going to send the PDF file only once, it's going to be stored in the cache,
and the subsequent calls are going to have reduced pricing.
Okay so here's the code that we are going to be using.
Now I have been really impressed by the Gemini API for information retrieval.
For smaller RAG projects,
I do like to use the Gemini API directly with its PDF capability, because I find it extremely easy to set up.
In a normal RAG pipeline,
you need to think about your chunking strategy,
what the chunk size is going to be,
how you split your documents,
then what type of embedding model you're going to use and what its dimension is going to be,
and you also need to worry about storage plus the retrieval pipeline.
And if your data involves multimodal data like images or tables,
then you need to parse them separately and have a vision model to process those.
If you want to learn about all that, make sure to check out my course
titled RAG Beyond Basics.
But the Gemini API basically makes it very easy.
You just need to upload your document in the form of a PDF file,
and now it can support up to 1,000 pages, which is pretty amazing, and if you combine it with
context caching, it makes it affordable as well.
To show you some experiments we're going to be using two different files.
One is going to be a PDF file of the technical report of Gemini 1.5. The second one is going to be this very interesting paper,
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning.
It's a very interesting paper which I'm going to cover in a separate video,
but the whole premise of this paper is that you don't really need SFT, that is supervised fine-tuning, or RLHF.
You can align or fine-tune your model just using in-context learning.
Now the first paper has about 77 pages and the second one has 26 pages in total.
So we're not going to get close to that thousand pages limit,
but I think it's still a very good test, especially when it comes to the multi-modal capabilities.
Now to get started,
we will first need to install the Google Generative AI package, the Python client, then you need to get your API key.
You can get that from the Google AI studio.
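Here's a minimal setup sketch, assuming the google-generativeai Python client and an API key stored in an environment variable named GOOGLE_API_KEY (the variable name is my assumption, not from the video):

```python
# Minimal setup sketch; GOOGLE_API_KEY is an assumed environment variable name.
# Install first with: pip install -q -U google-generativeai
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
```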
Here I'm downloading the Gemini paper and the base model paper, the one that I just mentioned.
So we will download both those files here.
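As a rough sketch of that download step, the URLs and local file names below are placeholders, not the actual ones used in the video:

```python
# Download sketch; URLs and file names are placeholders for the two papers.
import urllib.request

urllib.request.urlretrieve("https://example.com/gemini_1_5_report.pdf", "gemini_1_5_report.pdf")
urllib.request.urlretrieve("https://example.com/base_llm_paper.pdf", "base_llm_paper.pdf")
```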
In order to upload a file to the Gemini API, you will need to use the upload_file function.
So we provide the file name and then a display name.
This returns a file object, and here we are just looking at its details: that's the Gemini 1.5 technical report.
If you check the URI, you'll see it is unique for every file.
So I uploaded the Gemini 1.5 technical report and here we're just making sure that that file has been processed and uploaded.
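A sketch of the upload and the processing check, assuming the local file name from the download step above:

```python
import time

# Upload the PDF; the API returns a file object with a unique URI.
gemini_report = genai.upload_file(path="gemini_1_5_report.pdf",
                                  display_name="Gemini 1.5 technical report")
print(gemini_report.uri)

# Wait until the file has finished processing before using it in a prompt.
while genai.get_file(gemini_report.name).state.name == "PROCESSING":
    time.sleep(2)
```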
So the upload is complete and now we can start asking questions. Okay, so to begin, we're going to process just one file at a time, and later on I'll show you how
you can upload multiple files.
In the first part we're going to be directly processing the PDF files without any context caching and in the second part of the video I'll show you how to enable context
caching.
We will be using Gemini 1.5 Pro, which is the best model currently available through the Gemini API.
It's still free to use and they have improved the rate limits for free users.
So worth checking out.
So first we create the Gemini model object,
then we call the generate content function on it, we will pass on the file that we created.
So when you upload a file via the client, that file will persist for 48 hours on Google Cloud unless you delete it.
It's actually really great, because you upload it once and then you can just keep interacting with that file.
Although in each iteration, it will actually be reprocessing this whole file, so we will want to cache it,
and I'll show you how to do that later in the video.
Then you can pass on your prompt.
So for example, the prompt is, can you summarize this document as a bulleted list.
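A minimal sketch of that call, reusing the uploaded file object from above (the model and variable names are my assumptions):

```python
# Create the model and pass the uploaded file plus the prompt to generate_content.
model = genai.GenerativeModel(model_name="models/gemini-1.5-pro-latest")
response = model.generate_content(
    [gemini_report, "Can you summarize this document as a bulleted list?"]
)
print(response.text)
```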
Now keep in mind, this file has 77 pages in total.
And it comes up with a pretty good summary.
So it says here is the summary of the document in a bulleted list.
So it talks about the introduction, then the key advancements that Gemini 1.5 Pro introduces.
It is now a mixture-of-experts architecture with an unprecedented context length of about 2 million tokens,
which is pretty amazing, then multimodal capabilities, long-context evaluation, and so on and so forth. And we can look at how many tokens were actually used.
So the prompt tokens, which include both the actual contents of the file plus your input prompt, are about 78,595 tokens.
The output is 512 tokens.
So in total you are looking at around 79,000 tokens.
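Those counts come from the response's usage metadata, roughly like this (the exact numbers will vary per run):

```python
# Token accounting for the previous call.
print(response.usage_metadata.prompt_token_count)      # file contents + input prompt
print(response.usage_metadata.candidates_token_count)  # generated output
print(response.usage_metadata.total_token_count)       # overall total
```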
Now, the model is multimodal in nature, so you can directly reference a specific figure in the document and it will be able to explain it.
The good thing about these long-context models with in-context learning is that the model is not looking at specific chunks,
so you can ask questions where the information can be at multiple places and it will be able to combine them. So,
since this model is multimodal, I wanted to see whether it can interpret this image,
so my prompt was: can you explain figure 9 in the paper, without any other details.
And it says figure 9 in the paper illustrates the results of an experiment designed to test the ability of the model,
which is Gemini 1.5 Pro in this case, to understand very long audio sequences, and if we look
here it is actually talking about long audio sequences, which is a variation of the needle-in-a-haystack approach.
Then it talks about the actual task
and the comparisons, so the figure actually compares Gemini 1.5 Pro with two other models, which are Whisper and GPT-4 Turbo.
Then it explains what the different grids and the different colors are.
So the color green represents that the model successfully identified the secret keyword,
and red indicates that the model failed to identify the keyword.
And the results show that Gemini 1.5 Pro achieved 100% accuracy on this task, finding all the needles in the haystack.
And then it talks about the accuracy of Whisper and GPT-4, which is around 94%.
Now, keep in mind, these accuracy numbers are not in the image itself, but they are present here in the text.
So it's able to correlate the text that came before the image with the image itself to give us a more concrete answer.
Now, another test that I wanted to do was, again, to test its multimodal capabilities with a more complex prompt.
So here's the prompt: can you describe the scene in figure 15 in detail?
How many people do you see in the image?
What is the caption of the image?
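This is the same single-file flow as before, just with the figure-specific prompt; a sketch:

```python
# Ask about a specific figure in the already-uploaded PDF.
response = model.generate_content(
    [gemini_report,
     "Can you describe the scene in figure 15 in detail? How many people do you "
     "see in the image? What is the caption of the image?"]
)
print(response.text)
```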
And here's the image in question.
So this shows the match between Lee Sedol and AlphaGo, and you can see there are five different people sitting in this picture,
but this is a very low-resolution image, so let's see what information Gemini 1.5 Pro can extract from it.
It says the scene in figure 15 appears to be a professional Go match, which is correct.
There are four people visible in the image (I think it missed one): one player
facing the camera, another player facing away, and two other people in the background
observing the match. The caption
overlaid on the image reads "the secret word is needle", and here is the actual
caption: "the secret word is needle". So it's pretty impressive that it actually got most of
the details right in the image. Somehow it missed one of the people, and
we could probably ask it to explain the image in more detail, and then we could see what it missed.
But I think overall,
it does a really good job, because by looking at the image, plus probably the caption,
it's able to tell us that it is a game of Go.
Now, how do you work with multiple files?
So in this case, we're going to upload the second file, which is that base model paper.
You can see that it has a distinct URI.
And just printing that information again to make sure that the file is correctly uploaded.
Now in order to work with multiple files, you will need to create your model again.
And then when you call the content-generation function,
you will just append the file URIs as part of the prompt.
So we have the initial user prompt, then the Gemini technical report, and then this base model paper, right?
And you can append more files in here if you need to, but that's the way to do it.
In this case, my prompt is: summarize the differences between the thesis statements for these documents.
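A sketch of the multi-document call, assuming gemini_report and base_llm_paper are the two file objects uploaded earlier:

```python
# Recreate the model and pass the prompt followed by both file objects.
model = genai.GenerativeModel(model_name="models/gemini-1.5-pro-latest")
response = model.generate_content([
    "Summarize the differences between the thesis statements for these documents.",
    gemini_report,
    base_llm_paper,
])
print(response.text)
```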
Now, we uploaded two different documents, so let's see what results we get.
So it says the thesis statement of the Gemini 1.5 Pro paper is that the new model
surpasses previous models in its ability to process extremely long contexts while maintaining the core capabilities.
The thesis statement of the second paper is that alignment tuning is superficial and
that base LLMs have already acquired the knowledge required for answering user questions.
Then it gives another thesis statement for that paper:
that the base model can be effectively aligned without SFT or RLHF by using a simple, tuning-free alignment method that leverages in-context learning.
Now, for some reason it divided the second paper into two different papers; probably there are sections talking about these two
different approaches, and that's why it got confused, but overall the information it provides is accurate.
Next, I wanted to see, apart from retrieval, whether it can actually give me some counter-arguments
to some of the approaches that are presented in these papers.
So, specifically for the tuning-free alignment paper, I wanted to ask it whether there is
validity in the approach and whether it can give me counter-arguments.
Here is the response that I get. It says that the approach has validity, which is
demonstrated by in-context learning being effective in aligning base models without the need for supervised fine-tuning.
However, there are some counter-arguments, and they are pretty interesting. The first is
that it doesn't really generalize well: the study is limited to a
specific dataset of instructions and specific base LLMs, and it is unclear whether these
findings will generalize to other datasets and LLM architectures, which is a really interesting point, and it kind of makes sense.
Then there is specificity, that the performance will vary depending on the complexity of the task, then contextual limitations, safety and alignment, and real-world applications.
So these are some really interesting arguments for approaches that do not use SFT or RLHF.
Okay, now you can also look at the list of documents that you have uploaded, and you will notice that these files repeat multiple times.
The reason is that while I was experimenting, I actually
uploaded the same file multiple times, and each time it will generate a
different URI. So it's not looking at the file
name; it will generate a unique URI, and that is the one that you need to use in your retrieval tasks.
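You can inspect everything stored under your API key with something like this:

```python
# Each upload of the same PDF shows up as a separate entry with its own URI.
for f in genai.list_files():
    print(f.display_name, f.uri)
```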
Now, so far we have only looked at uploading the file, and each prompt
will actually be reusing all the content of the files.
So if you have a thousand-page file,
it's going to be using those thousand pages for every query, which is not practical at all; the cost is going to add up pretty quickly.
But thankfully, very similar to the Anthropic approach, Google also has context caching.
So you can send your context only once, and then you can just use that cached content in your subsequent prompts.
I already created a detailed video on this topic,
so I'm going to put a link to that video, but let me show you how you can use that feature with PDF files.
Now we'll import the context caching functionality from the Generative AI Python client,
and we will create a new cache.
So we provide the model name.
In this case, we're going to use the latest flash model.
You need to provide your display name and system instructions, if any.
And then you need to provide your contents.
The contents in this case is just a single PDF file.
But again, you can add multiple PDF files if you want.
And after that, you need to define the time to live.
So I want it to be stored in the cache as context for 15 minutes,
but you can define whatever time limit you want,
and it's going to preserve that.
The thing is that you're going to incur additional storage charges on a per-hour basis,
and I think the charge for context caching is $1 per million tokens per hour for Gemini Flash.
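Here is a hedged sketch of creating the cache; the display name and system instruction are my own placeholders, and note that caching requires a pinned model version such as gemini-1.5-flash-001:

```python
import datetime
from google.generativeai import caching

# Create a cache holding the uploaded PDF for 15 minutes.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",        # caching needs an explicit model version
    display_name="gemini-1.5-report-cache",     # placeholder display name
    system_instruction="You answer questions about the attached research paper.",
    contents=[gemini_report],                   # the uploaded PDF file object from earlier
    ttl=datetime.timedelta(minutes=15),         # time to live, 15 minutes as in the video
)
```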
Once you create your cache, you can use GenerativeModel.from_cached_content to provide the cached content
to the model, and then you can ask questions the way you would normally.
So here there are three different questions.
What is the title of the paper?
Who are the authors, and what are the major contributions of this paper according to the authors?
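A sketch of querying against the cached content; subsequent calls bill the cached tokens at the reduced rate:

```python
# Build a model bound to the cached content and ask the three questions.
cached_model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = cached_model.generate_content(
    "What is the title of the paper? Who are the authors, and what are the "
    "major contributions of this paper according to the authors?"
)
print(response.text)
```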
So you can see that it's able to correctly retrieve the title of the report.
Then it says that the authors are the Gemini team at Google, and here is the list of major contributions that the authors make.
Now, interestingly enough, the paper itself just lists the Gemini team here, so that's why it didn't really give us a separate list of authors.
You can also look at the contents of the cache. You can create multiple caches, and then depending on your need,
you can call them based on your prompt, right?
So this is a great approach that you can use with your content that is cached on Google Cloud.
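And you can enumerate the caches you have created, for example:

```python
# List existing caches and when they expire.
for c in caching.CachedContent.list():
    print(c.display_name, c.name, c.expire_time)
```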
Now, one of the practical use cases of prompt caching by Claude or this context caching by Google is that you can
provide documentation of new packages or a codebase,
and then you can do retrieval directly on that. I think that will be very useful compared to doing something like setting up a traditional RAG pipeline.
Now, will an approach like this change or eliminate the need for RAG? I don't think so.
For much larger document collections, I think there is still a really good use case for
RAG. However, keep in mind that RAG does have substantial cost associated with it
if you are embedding millions of documents.
I recently created a video where I went over the cost that you're going to incur if you're
using one of the cloud services for storing millions of documents, and it can be pretty substantial.
In that video we looked at binary and scalar quantization for reducing the storage
cost, so if that's something you're interested in, I highly recommend checking that video out.
Anyway, some really exciting things are happening, especially from Google, the Anthropic team, and the Cohere team.
They also have a couple of updates that I'm going to be covering in my upcoming videos.
So if topics like RAG and agents interest you, make sure to subscribe to the channel.
I hope you found this video useful.
Thanks for watching, and as always, see you in the next video.