Personal Notes on Vector Databases

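Traditional search, such as MySQL with its B+ tree indexes or Elasticsearch with its inverted indexes, is essentially exact matching. Vector retrieval instead judges similarity and looks for the closest items, which works better for unstructured data such as images and video. The whole game is played in many dimensions.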

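What is vector retrieval?

When looking for the content most similar to the current item, vector search can be reduced to these three steps:

First, convert both the candidate items and the selected item into vectors.
Iterate over the candidate vectors, compute the cosine similarity of each with the selected vector, and sort by that similarity.
Pick the top N most similar.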
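
The three steps above can be sketched with plain NumPy. This is a minimal brute-force illustration, not how Milvus or Faiss work internally; the function name `top_n_cosine` and the toy vectors are made up for the example.

```python
import numpy as np

def top_n_cosine(query, candidates, n=3):
    """Return (indices, scores) of the n candidates most similar to query."""
    query = np.asarray(query, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    scores = candidates @ query / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
    )
    # Sort by descending similarity and keep the first n.
    order = np.argsort(-scores)[:n]
    return order, scores[order]

# Toy data (hypothetical): four 3-dimensional candidates and one query.
candidates = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [-1, 0, 0]]
query = [1, 0, 0]
idx, sims = top_n_cosine(query, candidates, n=2)
print(idx, sims)
```

The cost grows linearly with the number of candidates, which is why real systems build an index instead of scanning everything.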

The Milvus vector database

Milvus & Llama_index

Llama_index integrates with Milvus and, thanks to its many loader plugins, is very good at NLP workloads.

# Read the data with llama_index
# and import it into the Milvus database via pymilvus


#step1 - read the data into documents
from llama_index import download_loader
from glob import glob

MarkdownReader = download_loader("MarkdownReader")
markdownreader = MarkdownReader()

docs = []
for file in glob("./milvus-docs/site/en/**/*.md", recursive=True):
    docs.extend(markdownreader.load_data(file=file))


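#step2 - upload the documents to the database
# HOST and PORT are placeholders (assumed values) for your Milvus address; 19530 is Milvus's default port
from llama_index import GPTMilvusIndex
HOST, PORT = "localhost", 19530
index = GPTMilvusIndex.from_documents(docs, host=HOST, port=PORT, overwrite=True)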


#step3 - query the data
s = index.query("What is a collection?")
# Output:
# A collection in Milvus is a logical grouping of entities, similar to a table in a relational database management system (RDBMS). It is used to store and manage entities.


#step4 - save the connection info for reuse
saved = index.save_to_dict()
index = GPTMilvusIndex.load_from_dict(saved, overwrite=False)
s = index.query("What communication protocol is used in Pymilvus for communicating with Milvus?")

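Milvus & Towhee

Towhee is a neural-network data processing pipeline framework. It looks a lot like llama_index and also supports many data-loading plugins, but its strength is CV, i.e. processing images and video.

Searching text with Towhee

# Import text data with Towhee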

import pandas as pd
import numpy as np
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility
from openai.embeddings_utils import get_embedding, get_embeddings
from towhee import pipe, ops
import csv
from towhee.datacollection import DataCollection

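# TODO: wrap this into a reusable utility


# Create the collection in the database
# The vector field must specify dim, the dimensionality of the vectors; it equals the number of elements in each embedding you generate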
def create_collection(collection_name, dim):
    connections.connect(host='172.24.15.115', port='19530')
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)

    fields = [
        FieldSchema(name='id', dtype=DataType.VARCHAR, description='ids', max_length=500, is_primary=True, auto_id=False),
        FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='embedding vectors', dim=dim)
    ]
    schema = CollectionSchema(fields=fields, description='reverse image search')
    collection = Collection(name=collection_name, schema=schema)

    # create IVF_FLAT index for collection.
    index_params = {
        'metric_type':'L2',
        'index_type':"IVF_FLAT",
        'params':{"nlist":768}
    }
    collection.create_index(field_name="embedding", index_params=index_params)
    return collection




def write_collection():
    insert_pipe = (
        pipe.input('id', 'question', 'answer')
            .map('question', 'vec', ops.text_embedding.dpr(model_name='facebook/dpr-ctx_encoder-single-nq-base'))
            .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
            .map(('id', 'vec'), 'insert_status', ops.ann_insert.milvus_client(host='172.24.15.115', port='19530', collection_name='question_answer'))
            .output()
    )
    
    with open('question_answer.csv', encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader)
        for row in reader:
            insert_pipe(*row)
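    # load the collection into memory so it can be searched
    # (reuses the connection opened in create_collection)
    Collection('question_answer').load()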




def query_collection(qs, collection_name, source_dict):
    # source_dict maps each id to its answer text
    ans_pipe = (
        pipe.input('question')
            .map('question', 'vec', ops.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base"))
            .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
            .map('vec', 'res', ops.ann_search.milvus_client(host='172.24.15.115', port='19530', collection_name=collection_name, limit=1))
            .map('res', 'answer', lambda x: [source_dict[int(i[0])] for i in x])
            .output('question', 'answer')
    )

    ans = ans_pipe(qs)
    ans = DataCollection(ans)
    # ans is a top-k list; access a field with ans[0][field_name]
    return ans

if __name__ == '__main__':
    # create the collection
    create_collection('question_answer', 768)

    # write the data
    write_collection()

    # map each id to its answer so search hits can be turned back into text
    df = pd.read_csv('question_answer.csv')
    id_answer = df.set_index('id')['answer'].to_dict()

    # query the data
    result = query_collection(qs='what is collection?', collection_name='question_answer', source_dict=id_answer)
    print(result)
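
One detail in the pipelines above: the vectors are divided by their norm before being written, while the index uses the L2 metric. For unit-length vectors the squared L2 distance equals 2 - 2·cos, so ranking by L2 gives the same order as ranking by cosine similarity. A small NumPy check of that identity, using random data made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 8))   # hypothetical stored embeddings
query = rng.normal(size=8)            # hypothetical query embedding

def normalize(x):
    # Scale each vector to unit length, as the pipelines do before insert/search.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

v, q = normalize(vectors), normalize(query)

cosine = v @ q                           # cosine similarity (unit vectors)
sq_l2 = np.sum((v - q) ** 2, axis=1)     # squared L2 distance

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = np.allclose(sq_l2, 2 - 2 * cosine)
# Smallest distance and largest similarity pick the same neighbours.
same_top5 = np.array_equal(np.argsort(sq_l2)[:5], np.argsort(-cosine)[:5])
print(identity_holds, same_top5)
```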

Faiss

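Faiss is a fairly low-level vector search library with high performance: it can reach millisecond-level retrieval on indexes on the order of a billion vectors, and it supports GPU acceleration.

Searching with Faiss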

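import faiss

# step0 - start from an existing vector dataset
# (get_vector() is a placeholder for your own loader; it must return a float32 array of shape (n, 64))
xb = get_vector()

# step1 - create the index
dim, measure = 64, faiss.METRIC_L2
param = 'Flat'
index = faiss.index_factory(dim, param, measure)
index.is_trained  # a Flat index needs no training

# step2 - add the vectors
index.add(xb)

# step3 - query the top 4 most similar
# (xq is your float32 query array of shape (m, 64))
D, I = index.search(xq, 4)  # D: distances, I: indices into xb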
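
Conceptually, a 'Flat' index with METRIC_L2 does an exhaustive comparison against every stored vector and reports squared L2 distances. The following NumPy sketch mimics that behaviour so the shapes of the results are easier to see; `flat_l2_search` is a made-up helper and the data is random, so this is an illustration rather than Faiss itself.

```python
import numpy as np

def flat_l2_search(xb, xq, k):
    """Exhaustive search: squared L2 distances and indices of the k nearest rows of xb per query."""
    # Pairwise differences, shape (num_queries, num_database_vectors, dim).
    diff = xq[:, None, :] - xb[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    # Indices of the k smallest distances for each query.
    idx = np.argsort(dist, axis=1)[:, :k]
    return np.take_along_axis(dist, idx, axis=1), idx

rng = np.random.default_rng(0)
xb = rng.random((1000, 64), dtype=np.float32)  # hypothetical database vectors
xq = xb[:3].copy()                             # query with the first three database vectors
D, I = flat_l2_search(xb, xq, 4)
print(I)  # each row lists the 4 nearest database indices for one query
print(D)  # matching squared distances, smallest first
```

Because the queries are copies of stored vectors, the first hit of each row is the vector itself at distance 0, which is a handy sanity check when trying out a real index as well.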

Deploying the vector database and integrating a Web UI

# Install Docker
$ yum install docker
$ service docker start

# Install docker-compose
$ sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose
$ sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose

# Download the yml file and start the containers
$ wget https://github.com/milvus-io/milvus/releases/download/v2.2.8/milvus-standalone-docker-compose.yml -O docker-compose.yml
$ docker-compose up -d 

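# Deploy Attu
# (inside the container, localhost is the container itself; if Attu cannot reach Milvus, set MILVUS_URL to the host machine's IP)
$ docker run -p 8000:3000 -e HOST_URL=http://localhost:8000 -e MILVUS_URL=localhost:19530 zilliz/attu:latest

# Open the Attu web UI
http://localhost:8000

# Enter the Milvus cluster connection address
Milvus address: ip:19530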
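
If the Attu page does not come up, a quick way to tell whether the containers are listening at all is to probe the two ports used above (19530 for Milvus, 8000 for Attu). A minimal sketch using only the standard library; `is_port_open` is a made-up helper, and `localhost` assumes you run it on the machine hosting the containers.

```python
import socket

def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False

if __name__ == "__main__":
    print("Milvus reachable:", is_port_open("localhost", 19530))
    print("Attu reachable:", is_port_open("localhost", 8000))
```

An open port only shows that something is listening; it does not prove the service behind it is healthy.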

References

Milvus https://github.com/milvus-io/pymilvus
Autofaiss https://github.com/criteo/autofaiss
FaissSearcher https://github.com/mechsihao/FaissSearcher
Towhee https://github.com/towhee-io/towhee/tree/main
Faiss Index https://zhuanlan.zhihu.com/p/357414033

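Personal summary

I have not written much Python lately, so some pandas and numpy operations have slipped my mind.
There is no need to understand the most advanced algorithms; what matters is a grasp of the basic algorithms and how to apply them in engineering.

The next post will probably cover using GPTCache together with vector retrieval.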