Transcript: The Next Era Of Semantic Search Auto Embedding In Vector Search

Hi everyone, welcome to Automated Embedding in Vector Search. We’re going to talk about how the landscape has changed for search and AI-native applications, and what role MongoDB is playing with today’s launch. Just to get started: I am Praul, a product manager here at MongoDB, and joining me will be my colleague James, who is on the engineering side of things.
Let’s start with how AI has transformed search. At its core, search has become less about keywords and more about user intent. In a short while, users have been trained to type the actual question they have, which is replete with relationships and semantics, and to get a rich result back. On the left, what you see is the old world of search, where you ask a specific keyword-based question, “Q2 sales results,” and you get a narrow result set: the documents that match those keywords. On the right is the new world of search, where the user asks “How did revenue trend in spring?”, which is exactly what they mean, and the expectation is to get back rich, relevant results pertaining to that query. Those could contain the May forecast, the April financial summary, and so on.
This AI-powered search has been powering transformative use cases. A lot of you here are building these applications, and we have been seeing a lot of usage of semantic search, which is search based on user intent. This is what powers intelligent search interfaces on e-commerce websites, for instance. There is RAG, or retrieval-augmented generation, for large language models, which lets you ground responses in your proprietary data. This is how your copilots, intelligent assistants, and chatbots are powered. There has also been an explosion of applications leveraging AI agent memory, which lets an agent build context using a long-term memory store.
AI-powered search has so far been built in the following way. It’s a two-tower approach: there’s an indexing step and there’s a querying step. In indexing, you take your data source, the knowledge corpus on which you want to enable search, and pass it into an indexing pipeline. That pipeline involves a lot of data processing, some orchestration for talking to the embedding models, and the calls to the embedding model itself. Once the embeddings are prepared, you store them in a database and a vector search index. As a lot of you in the room are aware, MongoDB has native vector search integrated into the database itself, so that indexing flow exists. On the querying side, you take a user query, convert it into an embedding, and that’s how you get your search results.
MongoDB had already simplified this workflow by giving you a single place for your transactional database and your vector search. But we have been hearing about all sorts of pains from our users: the setup exists, but it can be brittle, error-prone, and fragile. So let me lay out a few of the common challenges we have been hearing about in building AI-powered search.
Firstly, there’s a steep learning curve. Getting started can be a pain for teams new to embeddings and machine learning. Secondly, there’s the integration complexity. Every time you insert a document, you have to reach out to an external embedding model endpoint from your application code, which involves all sorts of retries, authentication, payload management, etc. Thirdly, there’s a synchronization challenge. As you saw, there are a lot of moving pieces: your data, your embeddings, and what’s in your vector search all need to stay in sync. Stale embeddings lead to bad results. And fourthly, there’s the production-scale headache.
Making an embedding model call can be trivial if you have a few documents. But when you’re looking at millions of documents, this becomes a nightmare. You’re essentially building task queues for handling that data, retry mechanisms, and specialized workers. You need alerts and monitoring to make sure the whole process of embedding generation is running and not failing. Also, your database operations cannot get blocked by the slow process of embedding generation. These are two different paths, and you don’t want your database to wait on embeddings that are still being generated.
That is what brings us to automated embedding in vector search. With automated embedding, our goal was simple: generating and managing these embeddings should not be a developer concern in application code. Let the database handle all of those operations. You as a developer should be able to focus on building differentiated applications, not on the infrastructure challenges that come with them.
The solution is the introduction of our automated embedding search index. As a lot of you are aware, MongoDB has had lexical (full-text) search as well as vector search, and both work via secondary search indexes, defined as a search index or a vector search index. The auto embedding search index is a subtype of vector search that we’re releasing right now. You define the index with a type of vector search, and within it you set a type of auto embed. You define the modality and choose a model. As a lot of you are aware, MongoDB acquired a company called Voyage, which makes state-of-the-art embedding models. And you define the path: this is the path in your MongoDB document on which you want to enable semantic search.
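The index definition described here can be sketched as a plain Python dictionary. This is only an illustration: the talk names the concepts (a vector search type, an auto embed type, a modality, a model, and a path), but the exact key names, the model identifier string `voyage-4-large`, and the `plot` path used below are assumptions, and the helper `auto_embed_index_definition` is hypothetical.

```python
import json


def auto_embed_index_definition(path, model, modality="text"):
    """Build a hypothetical auto-embedding index definition as a plain dict.

    Key names and values are assumptions based on the talk's description,
    not a confirmed schema.
    """
    return {
        "type": "vectorSearch",          # index type named in the talk
        "definition": {
            "fields": [
                {
                    "type": "autoEmbed",   # assumed spelling of "auto embed"
                    "modality": modality,  # the talk mentions text only for now
                    "model": model,        # assumed Voyage model identifier
                    "path": path,          # document field to embed
                }
            ]
        },
    }


index_definition = auto_embed_index_definition(path="plot", model="voyage-4-large")
print(json.dumps(index_definition, indent=2))
```

A definition shaped like this would then be handed to whatever tool creates the search index; no embedding call appears in application code, which is the point the talk is making.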
This could also be a view, where you specify which fields of your document you want to combine into a composite, and that composite is what semantic search runs on. With this index definition you are good to go, and the embedding process starts.
The approach on the query side is similar. When you query with auto embedding, all you need is MongoDB’s $vectorSearch operator, which is an MQL stage. We are introducing new query parameters: you pass in the natural language text you want to search for, and you can provide the model you want to use for embedding. The model is an optional parameter. Some of you will have seen that we released a new class of embedding models with some special capabilities, so you can use a different model at query time than at indexing time. We’ll talk more about that later. With that, you’re good to go. This is your semantic search setup, and it lets you build AI-powered search on your data. So with auto embedding, you put your data into MongoDB, define a search index, and you’re running a query. Data in, data out, in a one-step process.
To get started, we are releasing auto embedding in public preview on MongoDB Community. As you’re aware, MongoDB Community is a self-managed deployment that you can run on your local system. You need three things. One is mongod, the database daemon, which is available as a Docker image, as a Linux binary, etc. The second is mongot, the search daemon. It was released a few months back, and today we are open-sourcing it as well. It is similarly available as a Docker image and as a Linux binary. The third is your Voyage model API key, which can be obtained on MongoDB Atlas itself (that’s a new product we announced today) or on voyage.com.
Get the API key, initialize it when you’re setting up, and you’re essentially good to go. This is a one-time setup. With that, there are three things this enables. One is intuitive setup: define an index and MongoDB handles the rest. All the infrastructure challenges are handled by us. Secondly, it radically simplifies your architecture. You just write the raw text to MongoDB in a high-throughput, low-latency manner, and the search index gets generated in an eventually consistent manner. Your writes are not blocked. You get quick write performance, and embedding generation, which can be a slow and resource-intensive process, completes eventually in the background. Thirdly, there’s complete automation. As you keep updating documents, writing more documents, and deleting them, your embeddings and your search index are kept up to date.
At the very core, these are powered by state-of-the-art Voyage models. We released these models today, and auto embedding comes with day-zero support for them. The reason I call them state-of-the-art is the benchmark graph I have on the screen. These are industry benchmarks from the RTEB (retrieval text embedding benchmark) leaderboard. The voyage 4 large model is the flagship model in this release. It is the first production-grade embedding model to use a mixture-of-experts architecture. It outperforms OpenAI’s v3 large embedding model by an average of 14%. Next up is Cohere v4: voyage 4 large is better than that by an average of 8%. And for Google’s Gemini Embedding 001, the latest embedding model from Google, voyage 4 large outperforms it by 4 percentage points on the leaderboard. Auto embedding comes with these models, and there’s also a voyage code 3 model, which is specially trained for code search.
So if you have code search use cases, you can use it. Within the voyage 4 family, you have the option to get maximum accuracy with voyage 4 large, a balance of accuracy and latency with voyage 4, and high-throughput, low-latency applications with voyage 4 lite. Obviously these models have different price points, and our system lets you make that choice. With that, let me bring my colleague on stage. I talked about the why and the what. James will talk about how the solution works.

Thanks, Praul. Can folks hear me? Is my mic working? Okay, it also fell off of me, so bear with me a moment. All right. As Praul mentioned, auto embedding is a feature we’re releasing that allows users to perform semantic search without needing to worry about vector representations, or about keeping those representations consistent with source documents. My name is James. I was one of the engineers on the project, and I want to take some time to talk about the technical challenges we needed to address when implementing auto embedding, and then walk through the solution we came up with for the two towers that Praul mentioned: the index-time operation and the query-time operation.
This slide speaks to three of the technical challenges we wanted to address. First and foremost, we wanted the auto embedding system to be performant. That means low-latency search results: we didn’t want to introduce any unnecessary latency in the indexing or the query pipeline. Additionally, we wanted vector representations of source documents to be durable and consistent. That means vectors should persist as elements of a deployment are spun up and down, and vector representations should remain consistent with the source collection as the source collection changes. Finally, we wanted to reduce network I/O across a deployment by deduplicating calls to the Voyage API service. This ensures there is one call to the API service per document per index, and one per query.
So I want to take some time now to walk through the system as it pertains to indexing. Here we see several actors. There’s the user here.
Praul also mentioned that there are multiple processes that run when search is deployed in Community. There’s mongod, the database instance, represented by the disks. There’s also a special process, mongot, which handles search operations. If you’re using auto embedding and you have a replica set set up, you will need to designate a leader mongot that has some special responsibilities when building the index. Finally, we have the Voyage API endpoint, which is called by the search process.
When an index is defined, that index definition is proxied through mongod to mongot, and the leader mongot picks up the definition, performs a collection scan, and monitors the change stream. During the collection scan, mongot batches the documents to be indexed and calls the embedding service for each batch. When the leader mongot receives vector representations of the source documents, it writes them back to a reserved namespace in mongod. Because we’re using mongod to persist vector representations, they stick around even as components are spun up and down within the system. Additionally, when the vectors are received from the embedding endpoint, they are put into an underlying Lucene vector index. Lucene is a vector search engine that’s used at MongoDB, and it’s highly performant for vector search. When the vectors are written back to the search namespace, those writes can be picked up by other nodes in the replica set, and the vectors can then be put into any other mongot’s Lucene vector index.
What this gives us is performance, because we’re using Lucene as our underlying vector search engine. We have consistency and durability, because we are persisting our vectors in mongod. Finally, we’re deduplicating calls to the embedding endpoint, since there’s one mongot that’s responsible for reaching out to the embedding API.
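The leader and follower roles described above can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not mongot’s implementation: `fake_embed` is a hypothetical stand-in for the embedding endpoint, plain dictionaries stand in for the source collection, the reserved namespace, and each follower’s local vector index, and the change stream is left out.

```python
from typing import Callable, Dict, List


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for the embedding endpoint: one call per batch."""
    fake_embed.calls += 1
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


fake_embed.calls = 0  # counts round trips to the "endpoint"


def leader_index(
    source: Dict[int, str],
    embed: Callable[[List[str]], List[List[float]]],
    batch_size: int,
) -> Dict[int, List[float]]:
    """Leader role: scan the collection in batches, embed each batch once,
    and persist the vectors in a dict standing in for the reserved namespace."""
    reserved: Dict[int, List[float]] = {}
    doc_ids = sorted(source)
    for start in range(0, len(doc_ids), batch_size):
        batch = doc_ids[start:start + batch_size]
        vectors = embed([source[doc_id] for doc_id in batch])
        for doc_id, vector in zip(batch, vectors):
            reserved[doc_id] = vector
    return reserved


def follower_sync(reserved: Dict[int, List[float]]) -> Dict[int, List[float]]:
    """Follower role: build a local index from persisted vectors only,
    without calling the embedding endpoint."""
    return dict(reserved)


source_collection = {i: f"plot text {i}" for i in range(5)}
reserved_namespace = leader_index(source_collection, fake_embed, batch_size=2)
follower_indexes = [follower_sync(reserved_namespace) for _ in range(2)]

print("embedding calls:", fake_embed.calls)  # 3 batches for 5 documents
```

Adding more followers does not change the number of embedding calls, because only the leader talks to the endpoint; followers read what was already persisted.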
Next, I want to talk about searching and how that’s handled. For searching, we don’t need to worry about a mongot being a leader or a follower, because every mongot has a consistent view of the vector representations from the source collection. A user starts by issuing a query. That query is proxied to the search node, mongot. It can then look within that query, extract the plain text, reach out to the embedding API, get the vector representation of that plain text, look at the underlying vector index, and return the IDs of the documents that are the nearest neighbors to the query vector. It’s important to note that in this workflow the user just sends a plain text query and gets back a list of documents from the source collection. The whole notion of calling an embedding model and maintaining vector indexes is hidden from the user. They interact with the system entirely in plain text. I’ll hand it over to Praul for a demo now.

Thank you, James. Can we switch over to the demo, please? All right. I’m actually going to use a recorded demo, since there have been some issues with Wi-Fi. To get started, I have the cluster set up, and this is localhost. We’ve connected through Compass. This database, with mongod running, comes with a few sample datasets. For this purpose we are going to use a movies dataset: 21,000 documents containing metadata about movies, such as the title, the plot, the genre, etc. Think about the use case where you’re talking to a friend and want to tell them about a movie you watched, but you don’t remember the title, only the plot and the things that happened in it. Or you’re trying to find a recommendation and just have a mood you want to search for. That’s what we’re going to demonstrate today. To build that out, the first thing we want is to be able to do a semantic search on just the plot.
To do that, we go to Indexes and create a search index. This can be done programmatically or in tools like Compass, among other ways. Create a vector search index and put in the index definition I talked about: this is your type equal to auto embed. Choose the model you care about and click to create the search index. As you kick off that process, you will see the status: the index is pending, then building. What you’re watching is the actual real-time process. It takes a bit of time, since 21,000 documents are all being embedded with the voyage 4 large model, but you can monitor it or go away in the meantime. These are just 21,000 documents, so it will finish soon. If these were 100,000 documents, you would still not be babysitting this process. A lot of the builders in the room know what kind of pain that can be when there are millions of documents. Essentially, this will take a while, you can monitor the progress, and once it is done you will see the index get into the ready state. Let me just make sure we capture that. Yeah, there we go. You can see the index has now moved into the ready state. Once it is ready, let’s run some queries.
To run queries, you go to your favorite query tool. This could be a programmatic experience. Automated embedding also comes with day-zero support in MongoDB’s MCP server, in our LangChain integration, in our LangGraph integration, and in language drivers like Java, Go, etc. The query I’m running here is a vector search query, and the text I have is “a fairy tale where someone falls in love.” There is no model defined here. The results are on the left. I realize this may be a little blurry for anyone to see, so I’ll read them out. For that query, the first result is procity.
The second is Stardust, and the third is Reconstruction. Now let’s modify the query a little bit by adding a different model. The first query used the default model, the one used at indexing time. I can go ahead and add a model here, which is voyage 4 lite. Now you will see the order of the results change, and changing that model is super straightforward. A lot of you builders will have seen that iterating on this can be such a pain. Depending on your use case, do you care about accuracy? Do you care about latency? You can make these changes in real time, at query time, while the index stays the same. This is a new innovation: we are the first model provider to offer this capability in a commercial setting.
Now let’s make the query more interesting. Right now the model is voyage 4 and the query is “fairy tale where a monster falls in love and lives in a swamp.” The movie enthusiasts among you will know this is a very popular movie called Shrek. So right now we are searching for that with voyage 4. The first result I’m getting is Swamp Thing, which matches that description. The second is Tammy and the Bachelor. The third is A Monster in Paris. We do see Shrek, but it is further down the list, at around the sixth result. Now I want to change the model I’m using at query time back to voyage 4 large and see how the results change. What happens is that the first result I see is Shrek. When we talk about the accuracy of retrieval and how models impact it, this is that process in action. Your first result is Shrek, followed by the other results. These are all the things that make it easy.
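The query-time model switch shown in the demo can be sketched as a small helper that builds a `$vectorSearch` stage. This is a hedged illustration: `$vectorSearch`, `index`, `path`, `numCandidates`, and `limit` are existing parts of the operator, but the shape of the new text and model parameters (`query` and `model` below), the model identifier strings, and the index name `plot_auto_embed` are assumptions, and `auto_embed_query_stage` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def auto_embed_query_stage(
    index: str,
    path: str,
    text: str,
    limit: int = 5,
    num_candidates: int = 100,
    model: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a $vectorSearch stage that carries plain text instead of a vector.

    The "query" and "model" keys are assumed names for the new parameters
    described in the talk; omitting the model falls back to the index default.
    """
    stage: Dict[str, Any] = {
        "index": index,
        "path": path,
        "query": {"text": text},  # assumed shape for the natural language text
        "numCandidates": num_candidates,
        "limit": limit,
    }
    if model is not None:
        stage["model"] = model  # optional query-time override
    return {"$vectorSearch": stage}


text = "fairy tale where a monster falls in love and lives in a swamp"

# Same index, three query-time choices: default model, lite, and large.
default_stage = auto_embed_query_stage("plot_auto_embed", "plot", text)
lite_stage = auto_embed_query_stage("plot_auto_embed", "plot", text, model="voyage-4-lite")
large_stage = auto_embed_query_stage("plot_auto_embed", "plot", text, model="voyage-4-large")

pipeline = [large_stage, {"$project": {"title": 1, "plot": 1}}]
print(pipeline)
```

Only the stage changes between runs; the index definition is untouched, which mirrors the accuracy-versus-latency iteration shown in the demo.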
I know we are a little over time, so we’re going to break for questions, but there is more you can do. The original document on which we enabled search also had fields like year, for example 1982. Suppose your query is “give me a fairy tale where a monster falls in love and lives in a swamp, and the movie was released before 2000.” The builders among you will know that embedding models don’t really work well on dates. They work well on natural language context, but the date part will be ignored. So you can create a pre-filter and define it definitively in an MQL stage, stating that you need movies with a year less than 2000. You can also do things like combine plot and title. This is a JSON object, and you can combine multiple fields using MongoDB views. The data model will be impacted, but you can now create an auto embedding search index on that view, so that a composite of plot and title is used to create the search index. This way you can iterate on things. With that, let me open it up to any questions. Can you go back to the slides, please? All right, thank you so much. Any questions?
Yeah, great question. The question is: is it English only? These embedding models are trained on more than 80 languages, so vector search by definition becomes multilingual and cross-lingual by default. You can index documents in English and search for them in French. We have language-specific benchmarks available for these models as well, and almost all the languages present on the internet are very well represented. James, do you also want to come over for questions and answers?

Sure. Hopefully my mic is still working. I think the question was: at query time, are queries embedded once, or are they embedded multiple times?
During the query path, a query is picked up by one mongot, which reaches out to the embedding service, gets the vector embedding, and then sends the results back. The system does not cache that vector, however. If the same query comes in again, we go to the embedding service again to fetch it. But yes, we hit the embedding endpoint once per query.

Right. So the question is whether that vector is cached somewhere if a user asks a similar question. The answer is no, we don’t do that. That’s something you can implement yourself. We have a tutorial on something called semantic caching, which is available in MongoDB’s LangChain package, or you can build it yourself. It is something you will want to implement based on your use case, because with a semantic cache there is the question of how similar two queries must be for you to tolerate serving a cached result.

One question: is it multimodal already? Does it understand images or videos as well?

This is just text. We have a multimodal model, 3.5, that we announced today. But if you go back to the slides, there is a modality parameter, and right now it is text only. In the future we’ll add multimodal options. We’ve got time for one more question, folks. Do you want to take that question?

I think the question was how many mongot processes are running at one time. That really depends on your deployment. The image I showed represented a one-to-one relationship between mongod, the database, and mongot, the search process. But you can actually attach multiple search processes to one instance of the database. So it’s really up to you, when you configure your cluster, how many replicas you want and how many search processes you want to attach to each replica.

Yeah. If you’re running on Atlas, by default you will have a replica set of three.
That’s three mongod processes, to make sure there is durability and consistency. If you enable search, there will be three search processes in a coupled architecture, or if you enable search nodes, there can be 2 to 32. But this release is self-managed, so you can deploy it however you want. All the flexibility is there. With that, thank you so much. Hopefully this was useful. We’re going to be around for a bit, so if you have any questions, please come and talk to us. Thank you.