[langchain] Multi-Vector Retriever (MultiVectorRetriever)

Storing several vectors for a single document is often useful, and there are a number of use cases for it. Creating and managing multiple vectors per document by hand can get complicated, so LangChain provides MultiVectorRetriever, which makes this kind of setup easy to query.

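There are several ways to create multiple vectors per document:

Smaller chunks: split the document into smaller chunks and embed those (this is what ParentDocumentRetriever does).
Summary: generate a summary of each document and embed it along with (or instead of) the document.
Hypothetical questions: generate questions that each document could answer well and embed them along with (or instead of) the document.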
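What the three approaches share is that every piece of searchable text carries the ID of the document it came from. The sketch below shows that layout in plain Python. The IndexEntry class, the sample sentences, and the dict used as a document store are all made up for illustration and are not part of LangChain.

```python
import uuid
from dataclasses import dataclass


@dataclass
class IndexEntry:
    text: str    # the text that would be embedded and searched
    doc_id: str  # key of the parent document in the document store
    kind: str    # "chunk", "summary", or "question"


# One parent document, stored once under a generated ID.
parent_text = "Vitamins regulate body functions in small amounts. Most must come from food."
doc_id = str(uuid.uuid4())
docstore = {doc_id: parent_text}

# Three different kinds of searchable text, all pointing at the same parent.
entries = [
    IndexEntry("Vitamins regulate body functions in small amounts.", doc_id, "chunk"),
    IndexEntry("Overview of what vitamins do and where they come from.", doc_id, "summary"),
    IndexEntry("Why do vitamins have to come from food?", doc_id, "question"),
]

for entry in entries:
    # Whichever entry matches a query, the lookup lands on the same parent.
    assert docstore[entry.doc_id] == parent_text
```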

Loading documents

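from dotenv import load_dotenv

# Import paths assume a recent LangChain release with the split packages;
# older releases expose the same classes under different modules.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load environment variables (for example the OpenAI API key) from .env
load_dotenv()

loaders = [
    TextLoader("./data/news2.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

# The parent documents should not be too large either, so cap them at 4000 characters.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=4000)
split_docs = text_splitter.split_documents(docs)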

This loads the news2.txt file. Because a parent document that is too large also causes problems, it is split into chunks of at most 4000 characters.

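Smaller chunks

It is often useful to retrieve larger chunks of information but embed smaller ones. The small chunks let each embedding capture the semantic meaning as closely as possible, while the larger parent passes as much context as possible to the downstream step. This is exactly what ParentDocumentRetriever does.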
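Before the LangChain version, the parent/child relationship can be sketched in plain Python. This uses a naive fixed-width splitter rather than RecursiveCharacterTextSplitter, and the sizes 40 and 10 are toy stand-ins for 4000 and 400; the split_text helper is hypothetical.

```python
import uuid


def split_text(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


text = "abcdefghij" * 10        # 100 characters of placeholder text
parents = split_text(text, 40)  # stands in for the 4000-character split
parent_ids = [str(uuid.uuid4()) for _ in parents]

children = []  # (child_text, parent_id) pairs; only these would be embedded
for parent_id, parent in zip(parent_ids, parents):
    for child in split_text(parent, 10):  # stands in for the 400-character split
        children.append((child, parent_id))

print(len(parents), len(children))  # 3 10
```

Each child remembers which parent it was cut from, so a match on a small chunk can later be exchanged for the larger parent.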

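import uuid

# Import paths assume a recent LangChain release; adjust to your installed version.
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_chroma import Chroma
from langchain_core.stores import InMemoryByteStore
from langchain_openai import OpenAIEmbeddings

# Vectorstore that indexes the child chunks
vectorstore = Chroma(
    persist_directory="chroma_multi_vector_store",
    collection_name="full_documents",
    embedding_function=OpenAIEmbeddings(),
)

# Storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

# One ID per parent document
doc_ids = [str(uuid.uuid4()) for _ in split_docs]

# Splitter used to create the child chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

sub_docs = []
for i, doc in enumerate(split_docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        # Tag every child chunk with the ID of its parent
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

# Child chunks go into the vectorstore, parents into the docstore
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, split_docs)))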

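question = "비타민"  # "vitamin"
# Querying the vectorstore directly returns the small child chunks
child_docs = retriever.vectorstore.similarity_search(question)
print("====== child docs result ========")
for i, doc in enumerate(child_docs):
    # Flatten newlines outside the f-string: a backslash inside an f-string
    # expression is a syntax error before Python 3.12.
    text = doc.page_content.replace("\n", " ")
    print(f"[문서 {i}][{len(doc.page_content)}] {text}")  # "문서" = "document"

# The retriever maps the child hits back to their parent documents
full_docs = retriever.invoke(question)
print("====== parent docs result ========")
for i, doc in enumerate(full_docs):
    text = doc.page_content.replace("\n", " ")
    print(f"[문서 {i}][{len(doc.page_content)}] {text}")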

The code above does almost the same thing as ParentDocumentRetriever.

Output

====== child docs result ========
[문서 0][335] 2. 비타민은 탄xxxxx
[문서 1][393] 3. 비타민은 우리몸의 xxxxx
[문서 2][165] 1950년대 노벨상 xxxxx
[문서 3][304] 한국어에서 xxxxx
====== parent docs result ========
[문서 0][3094] 1. 비타민은 소량으로 xxxxx

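The retriever takes the chunks found in the child search and returns their parent document. The child chunks were split at 400 characters, so each is at most 400 characters long. The parent was split at 4000 characters, which is why its length is shown as 3094.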
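Four child hits collapse into a single parent because they all carry the same doc_id. The following is a minimal sketch of that lookup step, assuming hits are plain metadata dicts and the document store is a dict. The resolve_parents helper is hypothetical and only approximates what a retriever of this kind has to do; it is not LangChain's implementation.

```python
def resolve_parents(hits, docstore, id_key="doc_id"):
    """Map child search hits to their parent documents, without duplicates."""
    seen_ids = []
    for metadata in hits:
        parent_id = metadata.get(id_key)
        if parent_id is not None and parent_id not in seen_ids:
            seen_ids.append(parent_id)  # keep the order of the first hit
    # Skip IDs that are missing from the store instead of failing.
    return [docstore[pid] for pid in seen_ids if pid in docstore]


docstore = {"p1": "parent document 1", "p2": "parent document 2"}
# Four child hits, all cut from the same parent, as in the output above.
hits = [{"doc_id": "p1"}, {"doc_id": "p1"}, {"doc_id": "p1"}, {"doc_id": "p1"}]
print(resolve_parents(hits, docstore))  # ['parent document 1']
```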

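Summary

A summary can distill what a document is about more accurately, which can lead to better retrieval. The code below generates a summary for each document and embeds it. When results are returned, the retriever returns the original document, not the summary.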
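The important detail in this approach is that the ID list, the summaries, and the original documents line up one to one. Here is a small sketch of that pairing, with a hypothetical fake_summarize function standing in for the LLM chain and invented sample texts.

```python
import uuid


def fake_summarize(text: str) -> str:
    """Hypothetical stand-in for the LLM chain: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


originals = [
    "Vitamins act in small amounts. They differ from hormones in origin.",
    "Scurvy is caused by a lack of vitamin C. Citrus fruit prevents it.",
]
doc_ids = [str(uuid.uuid4()) for _ in originals]
summaries = [fake_summarize(text) for text in originals]

# The ID list, the summaries, and the originals must line up one to one.
assert len(doc_ids) == len(summaries) == len(originals)

summary_index = list(zip(summaries, doc_ids))  # what would be embedded
docstore = dict(zip(doc_ids, originals))       # what is returned to the caller

# A hit on a summary is resolved to the full original text.
matched_summary, matched_id = summary_index[1]
print(matched_summary)       # Scurvy is caused by a lack of vitamin C.
print(docstore[matched_id])  # the full second original
```

If the document store were filled from a list other than the one the summaries were generated from, zip would silently pair summaries with the wrong texts, so the length check is cheap insurance.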

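import uuid

# Import paths assume a recent LangChain release; adjust to your installed version.
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Chain that turns one document into a summary string
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(max_retries=0)
    | StrOutputParser()
)

# Summarize all parent documents, at most 5 requests at a time
summaries = chain.batch(split_docs, {"max_concurrency": 5})

# The vectorstore to use to index the summaries
vectorstore = Chroma(
    persist_directory="chroma_multi_vector_store",
    collection_name="summaries",
    embedding_function=OpenAIEmbeddings(),
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in split_docs]
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

retriever.vectorstore.add_documents(summary_docs)
# Store the same documents the summaries were generated from (split_docs),
# so that each ID points at the right original.
retriever.docstore.mset(list(zip(doc_ids, split_docs)))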

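question = "비타민"  # "vitamin"
# Querying the vectorstore directly returns the summaries
sub_docs = vectorstore.similarity_search(question)
print("====== summary child docs result ========")
for i, doc in enumerate(sub_docs):
    # Flatten newlines outside the f-string: a backslash inside an f-string
    # expression is a syntax error before Python 3.12.
    text = doc.page_content.replace("\n", " ")
    print(f"[문서 {i}][{len(doc.page_content)}] {text}")

# The retriever returns the original documents behind the matching summaries
retrieved_docs = retriever.invoke(question)
print("====== summary docs result ========")
for i, doc in enumerate(retrieved_docs):
    text = doc.page_content.replace("\n", " ")
    print(f"[문서 {i}][{len(doc.page_content)}] {text}")

Output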

====== summary child docs result ========
[문서 0][303] 비타민은 신체 기능을 조절하는데 xxxxx
====== summary docs result ========
[문서 0][3095] 1. 비타민은 소량으로 신체 기능을 xxxxx

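Hypothetical queries

An LLM can also be used to generate a list of hypothetical questions that could be asked about a particular document. These questions are then embedded and indexed in place of the document text.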
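Since each document yields a list of questions, the per-document lists have to be flattened into individual records that each carry the parent ID before they can be indexed. The sketch below shows that step with plain dicts instead of Document objects. The flatten_questions helper, the sample questions, and the IDs are hypothetical.

```python
def flatten_questions(question_lists, doc_ids, id_key="doc_id"):
    """Turn per-document question lists into flat records tagged with the parent ID."""
    if len(question_lists) != len(doc_ids):
        raise ValueError("need exactly one ID per document")
    records = []
    for doc_id, questions in zip(doc_ids, question_lists):
        for question in questions:
            records.append({"text": question, id_key: doc_id})
    return records


question_lists = [
    ["What do vitamins do?", "How were vitamins discovered?"],
    ["What causes scurvy?"],
]
records = flatten_questions(question_lists, ["doc-a", "doc-b"])
print(len(records))          # 3
print(records[2]["doc_id"])  # doc-b
```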

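# Import path assumes a recent langchain-core release; adjust to your installed version.
from langchain_core.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

# Function schema that forces the model to return a list of question strings
functions = [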
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["questions"],
        },
    }
]

chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-4").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    | JsonKeyOutputFunctionsParser(key_name="questions")
)

print("split_docs[0]", split_docs[0])
hypothetical_docs = chain.invoke(split_docs[0])
print("====== hypothetical docs result ========")
for i, doc in enumerate(hypothetical_docs):
    print(f"[문서 {i}] {doc}")

hypothetical_questions = chain.batch(split_docs, {"max_concurrency": 5})
print("====== hypothetical_questions result ========")
print(hypothetical_questions)

# The vectorstore to use to index the hypothetical questions
vectorstore = Chroma(
    persist_directory="chroma_multi_vector_store",
    collection_name="hypo-questions",
    embedding_function=OpenAIEmbeddings(),
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
# One ID per document the questions were generated from (split_docs)
doc_ids = [str(uuid.uuid4()) for _ in split_docs]
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    # Every question generated for document i carries that document's ID
    question_docs.extend(
        [
            Document(page_content=s, metadata={id_key: doc_ids[i]})
            for s in question_list
        ]
    )

retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, split_docs)))

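question = "비타민"  # "vitamin"
# Querying the vectorstore directly returns the generated questions
sub_docs = vectorstore.similarity_search(question)
print("====== hypothetical_questions result ========")
for i, doc in enumerate(sub_docs):
    # Flatten newlines outside the f-string: a backslash inside an f-string
    # expression is a syntax error before Python 3.12.
    text = doc.page_content.replace("\n", " ")
    print(f"[문서 {i}][{len(doc.page_content)}] {text}")

# The retriever returns the original documents behind the matching questions
retrieved_docs = retriever.invoke(question)
print("====== hypothetical_retrieved docs result ========")
for i, doc in enumerate(retrieved_docs):
    text = doc.page_content.replace("\n", " ")
    print(f"[문서 {i}][{len(doc.page_content)}] {text}")

Output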

====== hypothetical docs result ========
[문서 0] 비타민 C는 어떤 동물에게는 비타민이고, 어떤 동물에게는 호르몬인가요?
[문서 1] 비타민은 어떻게 체내에서 합성되나요?
[문서 2] 비타민과 호르몬의 차이점은 무엇인가요?
====== hypothetical_questions result ========
[['비타민과 호르몬의 차이점은 무엇인가요?', '비타민의 역사와 그것이 어떻게 발견되었는지 설명해주세요.', '비타민 과다증에 대해 설명해주세요.']]
====== hypothetical_questions result ========
[문서 0][19] 비타민 과다증에 대해 설명해주세요.
[문서 1][31] 비타민의 역사와 그것이 어떻게 발견되었는지 설명해주세요.
[문서 2][21] 비타민과 호르몬의 차이점은 무엇인가요?
[문서 3][50] How was the existence of vitamins first confirmed?
====== hypothetical_retrieved docs result ========
[문서 0][3095] 1. 비타민은 소량으로 신체 기능을 조절한다는 점에서 호르몬과 비슷하지만 xxxxxx