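[langchain] Building a question answering system for PDFs

This post summarizes content from the official LangChain documentation. Some of the content and examples have been changed, but the structure has been kept as close to the original as possible.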

Build a PDF ingestion and Question/Answering system - https://python.langchain.com/docs/tutorials/pdf_qa/

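PDF files often hold crucial unstructured data that is unavailable from other sources. They can be quite lengthy, and unlike plain text files, they generally cannot be fed directly into the prompt of a language model.

In this tutorial, we will build a system that can answer questions about PDF files. More specifically, we will use a document loader to load text in a format an LLM can use, then build a retrieval-augmented generation (RAG) pipeline that answers questions and includes citations from the source material.

This tutorial only briefly covers concepts that the RAG tutorial treats in more depth, so it is a good idea to go through that tutorial first if you have not already.

Let's get started!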

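Loading documents

First, we need to choose a PDF to load. We will use a document from Nike's annual public SEC report. It is over 100 pages long and mixes crucial data with longer explanatory text. However, feel free to use a PDF of your choosing.

Once you have chosen a PDF, the next step is to load it into a format that an LLM can handle more easily. LLMs generally require text input, and LangChain has several built-in document loaders for this purpose. Below, we use a document loader that is built on the pypdf package and reads from a file path.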

%pip install -qU pypdf langchain_community

from langchain_community.document_loaders import PyPDFLoader

file_path = "../example_data/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

print(docs[0].page_content[0:100])
print(docs[0].metadata)

Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K

{'source': '../example_data/nke-10k-2023.pdf', 'page': 0}

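What just happened?

Loading the PDF: The loader reads the PDF at the specified path into memory.
Extracting text: It then extracts the text data using the pypdf package.
Creating documents: Finally, it creates one LangChain document per page of the PDF, containing the page's content and metadata about where in the document the text came from.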
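
The three steps above can be pictured with a small stand-in. This is only a sketch: `PageDocument` and `pages_to_documents` are hypothetical names introduced for illustration, not LangChain's `Document` class or the internals of `PyPDFLoader`, and the page texts are assumed to have been extracted already.

```python
from dataclasses import dataclass, field


@dataclass
class PageDocument:
    """Hypothetical stand-in for a LangChain document (not the real class)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_to_documents(page_texts, source):
    """Wrap each extracted page text in one document and record its origin."""
    return [
        PageDocument(page_content=text, metadata={"source": source, "page": index})
        for index, text in enumerate(page_texts)
    ]


# Placeholder page texts, assumed to be already extracted from a PDF.
sample_pages = ["text of the first page", "text of the second page"]
sample_docs = pages_to_documents(sample_pages, "example.pdf")

print(len(sample_docs))
print(sample_docs[0].metadata)
```

The point of the sketch is the shape of the result: one object per page, each carrying its text plus a `source` and a `page` entry, which matches the metadata printed above.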

LangChain has many other document loaders for other data sources, and you can also create a custom document loader if needed.

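Question answering with RAG

Next, we prepare the loaded documents for later retrieval. Using a text splitter, we split the loaded documents into smaller documents that fit more easily into an LLM's context window, then load them into a vector store. After that, we can create a retriever from the vector store for use in the RAG chain.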
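
To get a feel for what chunk size and chunk overlap mean, here is a deliberately simplified sketch. It uses a fixed sliding window over characters, which is not how `RecursiveCharacterTextSplitter` works (that splitter prefers to break on separators such as paragraphs and lines); `split_with_overlap` is a hypothetical helper written only for this illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample_text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample_text, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The last four characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is less likely to be cut off from its surrounding context in both chunks.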

import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")

from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = InMemoryVectorStore.from_documents(
    documents=splits, embedding=OpenAIEmbeddings()
)

retriever = vectorstore.as_retriever()
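
Conceptually, a retriever backed by a vector store embeds the query and returns the chunks whose embeddings are most similar to it. The sketch below illustrates that idea with a toy word-count embedding and cosine similarity. It is an assumption-laden illustration, not the implementation of `InMemoryVectorStore` or `OpenAIEmbeddings`; `embed`, `cosine_similarity`, and `retrieve` are hypothetical names.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (an assumption for illustration)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, k=4):
    """Return the k chunks most similar to the query, best match first."""
    query_vector = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


# Made-up chunk texts, not content from the PDF.
toy_chunks = [
    "revenue grew in fiscal 2023",
    "new stores opened in europe",
    "footwear revenue by region",
]
top = retrieve("fiscal 2023 revenue", toy_chunks, k=2)
print(top)
# ['revenue grew in fiscal 2023', 'footwear revenue by region']
```

A real embedding model captures meaning rather than exact word overlap, but the ranking step works the same way: score every chunk against the query and keep the top few.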

Finally, we will use a few built-in helper functions to construct rag_chain.

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)


question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
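
The "stuff" in `create_stuff_documents_chain` refers to placing all retrieved chunks into the `{context}` slot of the prompt. The following sketch shows the idea with plain strings. It assumes the chunk texts are joined with a blank line between them, and `stuff_documents` and `build_messages` are hypothetical helpers, not LangChain functions.

```python
# Shortened version of the system prompt, kept only for illustration.
SYSTEM_TEMPLATE = (
    "Use the following pieces of retrieved context to answer "
    "the question."
    "\n\n"
    "{context}"
)


def stuff_documents(page_contents, separator="\n\n"):
    """Join all retrieved chunk texts into one context string."""
    return separator.join(page_contents)


def build_messages(question, page_contents):
    """Return (role, text) pairs with the context slot filled in."""
    context = stuff_documents(page_contents)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


# Placeholder chunk texts standing in for retrieved documents.
messages = build_messages(
    "What was the revenue?",
    ["chunk one text", "chunk two text"],
)
print(messages[0][1])
```

Because every retrieved chunk ends up in a single prompt, this approach only works while the combined chunks fit in the model's context window, which is one reason the documents are split into smaller pieces first.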

results = rag_chain.invoke({"input": "What was Nike's revenue in 2023?"})

results

{'input': "What was Nike's revenue in 2023?",
 'context': [Document(page_content='Table of Contents\nFISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS\nThe following tabl...quivalent basis.', metadata={'page': 35, 'source': '../example_data/nke-10k-2023.pdf'}),
  Document(page_content='Enterprise Resource Planning Platform, data and analytics, demand sensing, insight gathering, and.... strategic pricing actions', metadata={'page': 30, 'source': '../example_data/nke-10k-2023.pdf'}),
  Document(page_content="Table of Contents\nNORTH AMERICA\n(Dollars in millions) FISCAL 2023FISCAL 2022 % CHANGE....and the addition of new stores.", metadata={'page': 39, 'source': '../example_data/nke-10k-2023.pdf'}),
  Document(page_content="Table of Contents\nEUROPE, MIDDLE EAST & AFRICA\n(Dollars in millions) FISCAL 2023FISCA...store sales growth of 22%.", metadata={'page': 40, 'source': '../example_data/nke-10k-2023.pdf'})],
 'answer': 'According to the financial highlights, Nike, Inc. achieved record revenues of $51.2 billion in fiscal 2023, which increased 10% on a reported basis and 16% on a currency-neutral basis compared to fiscal 2022.'}

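The result dictionary contains both the final answer, under the answer key, and the context the LLM used to generate that answer, under the context key.

Looking at the values under context, you can see that each one is a document containing a chunk of the loaded page content. Usefully, these documents also preserve the original metadata from when they were first loaded.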

print(results["context"][0].page_content)

Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS
The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.
The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,
2 and 1 percentage points to NIKE, Inc. Revenues, respectively.
•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This
increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale
equivalent basis.

print(results["context"][0].metadata)

{'page': 35, 'source': '../example_data/nke-10k-2023.pdf'}

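This particular chunk came from page 35 of the original PDF, as recorded in its metadata. Note that the first loaded page above has 'page': 0, so this value appears to be a zero-based index. You can use this data to show which page of the PDF an answer came from, which lets users quickly verify that the answer is grounded in the source material.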
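
One way to surface this to users is to turn the metadata of the retrieved documents into a short list of citations. The sketch below uses plain dictionaries as stand-ins for the documents under the context key; `format_citations` is a hypothetical helper, and adding 1 to the page value assumes it is a zero-based index.

```python
def format_citations(context_docs):
    """Render unique (source, page) pairs as readable citations."""
    seen = []
    for doc in context_docs:
        meta = doc["metadata"]
        key = (meta["source"], meta["page"])
        if key not in seen:
            seen.append(key)
    # Assumption: 'page' is zero-based, so add 1 for display.
    return [f"{source}, p. {page + 1}" for source, page in seen]


# Sample metadata in the same shape as the output above; the repeated
# entry is made up to show de-duplication.
sample_context = [
    {"metadata": {"page": 35, "source": "../example_data/nke-10k-2023.pdf"}},
    {"metadata": {"page": 30, "source": "../example_data/nke-10k-2023.pdf"}},
    {"metadata": {"page": 35, "source": "../example_data/nke-10k-2023.pdf"}},
]

for citation in format_citations(sample_context):
    print(citation)
```

With real results, the same loop would read `doc.metadata` from each document in `results["context"]` instead of indexing a dictionary.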
