Vector Database and its Architecture
This blog gives an overview of a generalized vector database management system (VDBMS). It is intended to give a general idea of how VDBMSs are implemented.
Prerequisites: A basic understanding of vectors and database management systems.
To get familiar with vectors, you may refer to this blog.
Goals of this blog
By the end of this blog, you will have answers to the following questions:
- What are vector databases?
- Why do we need VDBMS?
- What is the generalized architecture of VDBMS?
- What are the use cases of VDBMS?
- What are the limitations of VDBMS?
Actual article for in-depth analysis: link
Without further ado, let’s get started.
What are vector databases?
Vector databases, as the name suggests, store vectors. As we know, to perform any type of NLP task, we need to represent text data in the form of vectors. This is true not just for text but also for images. As we will see further, vector databases can hold not only images but also videos and audio, each stored as a vector that captures its content rather than as a plain array of raw values.
Why do we need vector databases?
As we build more advanced AI systems, regular databases show some drawbacks. Regular databases just store information as it is, like images, videos, text, or audio, without doing much else (apart from compressing it). Vector databases, on the other hand, store vector representations of the data, so that how different pieces of information relate to each other is captured by how close their vectors are, rather than just storing the actual info.
For example, in a regular database with lots of images, finding all the ‘cat’ images without external tools is tricky. Vector databases make this easier with something called similarity search.
Note: It’s crucial to understand the difference between vector databases and Vector Database Management Systems (VDBMS). A VDBMS does everything from converting data into vectors and storing them to running similarity searches. A vector database only handles storing vectors. But for this blog, I’ll use both terms interchangeably.
General Architecture of VDBMS
There are three parts to any VDBMS:
- Vectorization (query & insert)
- Indexing
- Hardware Handling & real-world components
Vectorization
First, we need to perform the vectorization of the data. For most VDBMSs, this process is included in the system itself. Essentially, the database performs vectorization on the datapoints, and these vectors are then stored in the database.
Some algorithms for vectorization include Word2Vec (Word2Vec and its smartest overview), FastText, and Doc2Vec.
Vectorization is done not only for the datapoints but also for the query. The query, as we know from SQL database basics, is the input we search with; here it is the key used for similarity operations. For example, “search images of a cat” could be a query for any image database. Depending on the task, we may or may not store the query vector in the database.
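To make the insert and query paths concrete, here is a minimal sketch. It does not use Word2Vec or any real embedding model: `vectorize` is a hypothetical toy hashing vectorizer, and `DIM` and `store` are names made up for this example. The only point it shows is that datapoints and queries go through the same vectorizer.

```python
import hashlib
import math

DIM = 16  # hypothetical vector size, chosen only for this sketch


def vectorize(text: str, dim: int = DIM) -> list[float]:
    """Toy stand-in for a real embedding model: hash each word into a bucket."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    # Normalize to unit length so vectors of different texts are comparable.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


# Insert path: vectorize each datapoint and keep the vector next to its payload.
store: list[tuple[list[float], str]] = []
for doc in ["a cat sitting on a sofa", "a dog running in the park"]:
    store.append((vectorize(doc), doc))

# Query path: the query goes through the same vectorizer.
# Whether query_vec is also stored depends on the task.
query_vec = vectorize("cat on a sofa")
```

A real VDBMS would swap the toy function for a trained model, but the shape of the pipeline stays the same.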
Indexing
Indexing is the part where the similarity search happens. The index organizes the stored vectors so that the system can find the ones most similar to the query vector. To measure similarity, we calculate the distance between the two vectors.
Some common measures used for similarity search are cosine similarity and Euclidean distance.
The indexing operation in a VDBMS returns multiple vectors as a result, ranked by similarity: the vector with the highest similarity score gets the highest priority, and vectors with lower scores rank below it.
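As a small illustration of the two measures named above, here is a plain-Python sketch. The function names are my own, and returning 0.0 for a zero vector is an assumption made for this example, not a standard.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b: 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a, b = [3.0, 4.0], [6.0, 8.0]
print(cosine_similarity(a, b))   # → 1.0 (same direction)
print(euclidean_distance(a, b))  # → 5.0 (different length)
```

The two vectors point in the same direction but have different lengths, so cosine similarity treats them as identical while Euclidean distance does not. Which one fits depends on whether magnitude carries meaning in your vectors.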
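The ranking described above can be sketched as a brute-force top-k search. This is the simplest possible approach: it compares the query with every stored vector, which real indexes try to avoid. The `top_k` function, the ids, and the three-dimensional vectors are all made up for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, vectors, k=2):
    """Return the k (id, score) pairs with the highest similarity, best first."""
    scored = ((cosine_similarity(query, vec), key) for key, vec in vectors.items())
    return [(key, score) for score, key in heapq.nlargest(k, scored)]


# Hypothetical stored vectors; imagine they came from an image embedding model.
vectors = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "car_1": [0.0, 0.1, 0.9],
}

results = top_k([1.0, 0.0, 0.0], vectors, k=2)
for key, score in results:
    print(key, round(score, 3))
```

With this data the two cat vectors come back first, in order of their score, and the car vector is left out.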
As we will see further, performing vectorization and similarity search requires a large amount of compute power as the data gets bigger.
Handling Hardware
The hardware requirements of a VDBMS grow with the size of the data. As we move towards more complex data, the data generation process becomes more intricate, leading to higher dimensionality in data points. To perform similarity searches on high-dimensional vectors, more compute resources (often GPUs) are required.
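To alleviate the burden on compute resources, vector databases can use product quantization techniques, which compress each vector into a short code so that it takes less memory and distances can be approximated more cheaply.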
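Here is a rough sketch of the idea behind product quantization, in plain Python. It assumes the vector length divides evenly into sub-vectors, uses a very basic k-means, and all function names and sizes (8 dimensions, 4 sub-vectors, 16 centroids) are choices made for this example. Production libraries do this far more efficiently.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def split(vector, num_subspaces):
    """Cut a vector into equal-sized sub-vectors (length must divide evenly)."""
    size = len(vector) // num_subspaces
    return [vector[i * size:(i + 1) * size] for i in range(num_subspaces)]


def train_codebooks(vectors, num_subspaces, num_centroids):
    """Learn one small codebook (list of centroids) per sub-vector position."""
    codebooks = []
    for s in range(num_subspaces):
        sub_vectors = [split(v, num_subspaces)[s] for v in vectors]
        codebooks.append(kmeans(sub_vectors, num_centroids, seed=s))
    return codebooks


def encode(vector, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    codes = []
    for sub, codebook in zip(split(vector, len(codebooks)), codebooks):
        codes.append(min(range(len(codebook)),
                         key=lambda c: squared_distance(sub, codebook[c])))
    return codes


def decode(codes, codebooks):
    """Rebuild an approximate vector from the stored codes."""
    approx = []
    for code, codebook in zip(codes, codebooks):
        approx.extend(codebook[code])
    return approx


rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(200)]
codebooks = train_codebooks(data, num_subspaces=4, num_centroids=16)

codes = encode(data[0], codebooks)   # 4 small integers instead of 8 floats
approx = decode(codes, codebooks)    # lossy reconstruction of data[0]
print(codes, squared_distance(data[0], approx))
```

Each vector is stored as four small integers instead of eight floats. The reconstruction is not exact, which is the price paid for the smaller footprint.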
In the real world, it is important to consider endpoint failures. Similar to traditional databases, vector databases create redundancy (replicas) of the data to prevent data loss due to hardware failure.
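The replica idea can be sketched with a toy in-memory store. `Replica` and `ReplicatedStore` are hypothetical classes written for this post, not the API of any real vector database, and the sketch ignores harder problems such as re-syncing a replica after it recovers.

```python
class Replica:
    """One copy of the data, living on one (imaginary) machine."""

    def __init__(self, name):
        self.name = name
        self.vectors = {}
        self.healthy = True


class ReplicatedStore:
    """Writes go to every healthy replica; reads use the first healthy one."""

    def __init__(self, replicas):
        self.replicas = replicas

    def insert(self, key, vector):
        for replica in self.replicas:
            if replica.healthy:
                replica.vectors[key] = list(vector)

    def get(self, key):
        for replica in self.replicas:
            if replica.healthy and key in replica.vectors:
                return replica.vectors[key]
        raise KeyError(key)


db = ReplicatedStore([Replica("node-a"), Replica("node-b")])
db.insert("cat_1", [0.9, 0.1, 0.0])

db.replicas[0].healthy = False      # simulate a hardware failure on node-a
recovered = db.get("cat_1")         # still served from node-b
print(recovered)
```

Because the vector was written to both replicas, losing one of them does not lose the data.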
Uses of Vector Databases
We discussed how vector databases come in handy for storing and locating similar images, videos, text, and audio. Real-life examples include Shazam, which recognizes music, and Google Photos, which identifies images in videos.
VDBMSs play a crucial role in chatbots. They help in keeping track of the ongoing chat “context,” making conversations more natural and effective when chatbots interact with real people. A prime example would be ChatGPT.
Challenges and Limitations of Vector Databases
Here are a few challenges with vector databases today:
- Speed vs. Accuracy Trade-off: Many indexing algorithms are very accurate at similarity search but are also very slow. On the other hand, some algorithms are faster but offer lower accuracy. Lower accuracy could be disastrous in medical sciences or other high-stakes applications. This is an active area of research where we are still searching for algorithms and techniques that are both fast and highly accurate.
- Increasing Dimensionality Reduces Performance: As the data we store becomes more complex, the dimensionality of its vectors increases. The full impact of high-dimensional data is beyond the scope of this blog. However, a basic rule of thumb is that with higher dimensions, both vectorization and indexing algorithms need considerably more computing power. Dimensionality reduction techniques like PCA can be used to address this issue, but reducing dimensions discards some information, which can itself lower accuracy.
- Yet to Mature in the Industry: Vector databases are a relatively new concept, dating back to just 2019-20. Hence, they are yet to acquire all the functionality needed for different types of applications. Developers are actively adding more features to vector databases. Some widely known vector databases include Pinecone, Chroma, and Deep Lake.
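To make the speed-versus-accuracy point from the list above concrete, here is a small experiment. It compares an exact scan with one simple approximate technique, random-hyperplane locality-sensitive hashing. I picked this technique only for illustration, since the post does not name a specific index. The data is random, and the sizes (500 vectors, 16 dimensions, 4 hash bits) are arbitrary.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def make_hyperplanes(dim, bits, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def signature(vector, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(dot(vector, plane) >= 0 for plane in planes)


def build_buckets(vectors, planes):
    """Group vector indices by signature; similar vectors tend to share one."""
    buckets = {}
    for index, vector in enumerate(vectors):
        buckets.setdefault(signature(vector, planes), []).append(index)
    return buckets


def exact_search(query, vectors, k):
    """Accurate but slow: score every stored vector."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_similarity(query, vectors[i]),
                    reverse=True)
    return ranked[:k]


def approximate_search(query, vectors, buckets, planes, k):
    """Faster but lossy: score only the vectors in the query's bucket."""
    candidates = buckets.get(signature(query, planes), [])
    ranked = sorted(candidates,
                    key=lambda i: cosine_similarity(query, vectors[i]),
                    reverse=True)
    return ranked[:k], len(candidates)


rng = random.Random(1)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(500)]
planes = make_hyperplanes(dim=16, bits=4)
buckets = build_buckets(vectors, planes)

query = vectors[0]
exact = exact_search(query, vectors, k=5)
approx, scanned = approximate_search(query, vectors, buckets, planes, k=5)
recall = len(set(exact) & set(approx)) / len(exact)
print(f"scanned {scanned} of {len(vectors)} vectors, recall@5 = {recall:.2f}")
```

The approximate search looks at only a fraction of the data, but it can miss true neighbours that landed in other buckets. Using more hash bits makes it faster and less accurate, and using fewer does the opposite, which is the trade-off in miniature.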
Conclusion
The vector database space is growing to handle a special way of representing information: vectors. These vectors describe complex data in spaces with many dimensions. This overview explained the basic ideas behind vector databases and the systems that manage them. It covered ways to compare vectors, how indexing works, and the important parts of a VDBMS.
There are challenges with VDBMSs, like dealing with data that has a lot of dimensions and is sparse. We also talked about how vector database systems are still pretty new, and what that means for the people using and building them. In short, this exploration gave a glimpse into the world of vector databases, explaining the basics, showing practical uses, and pointing out the challenges that come with managing this kind of data.
References:
Article: VDBMS: Fundamental concepts, use-cases, and current challenges
Video: Building Production-Ready RAG Applications: Jerry Liu
Pinecone: Documentation
Please feel free to reach out to me on X or LinkedIn!
Discover more from Arshad Kazi