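[langchain] Building a Question-Answering System over SQL Data

This post summarizes the official LangChain documentation. Some of the content and examples have been changed, but the original structure is kept where possible.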

Build a Question/Answering system over SQL data - https://python.langchain.com/docs/tutorials/sql_qa/

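Enabling an LLM system to query structured data can be qualitatively different from working with unstructured text. With unstructured text, it is common to generate text that can be searched against a vector database. With structured data, the usual approach is for the LLM to write and execute queries in a DSL (domain-specific language) such as SQL. This guide covers the basics of building a Q&A system over tabular data in a database. It covers implementations using both chains and agents, which let us ask a question about the data and get back a natural-language answer. The main difference between the two is that an agent can query the database in a loop, as many times as it needs, until it can answer the question.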

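Security note

Building a Q&A system over a SQL database requires executing model-generated SQL queries, and doing so carries inherent risks. Scope the database connection permissions as narrowly as possible for the needs of your chain or agent. This mitigates, but does not eliminate, the risks of building a model-driven system. For more on general security best practices, see the LangChain security documentation.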
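One way to narrow permissions with SQLite is to open the database in read-only mode. The following is a minimal sketch using only the standard library; the file name, table, and row are hypothetical and exist only to make the example runnable. Other databases would typically use a dedicated read-only user instead.

```python
import os
import sqlite3
import tempfile

# Build a throwaway database (hypothetical table and row, for illustration only).
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(db_path)
with conn:  # commits on success
    conn.execute("CREATE TABLE passengers (name TEXT, age REAL)")
    conn.execute("INSERT INTO passengers VALUES ('A', 30.0)")
conn.close()

# Reopen the same file read-only through a SQLite URI.
ro = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
rows = ro.execute("SELECT name, age FROM passengers").fetchall()

# Any write attempted through this connection is rejected by SQLite itself.
try:
    ro.execute("DELETE FROM passengers")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True
ro.close()

print(rows, write_blocked)
```

A connection opened this way can still read everything in the file, so it limits damage from destructive queries but not data exposure.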

Setup

Dependencies for this guide

%pip install -qU langchain langchain-openai langchain-community langchain-experimental pandas tabulate python-dotenv

Set the required environment variables

# Using LangSmith is recommended but not required. Uncomment below lines to use.
# import getpass
# import os
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()

Download the Titanic dataset if you don't already have it.

!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv -O titanic.csv

import pandas as pd

df = pd.read_csv("titanic.csv")
print(df.shape)
print(df.columns.tolist())

Output

(887, 8)
['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']

SQL

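Using SQL to interact with CSV data is the recommended approach, because it is easier to limit permissions and sanitize queries than it is with arbitrary Python code.

Most SQL databases (DuckDB, SQLite, and others) make it easy to load a CSV file as a table. Once that is done, you can use all of the chain and agent creation techniques described in the SQL tutorial. Here is a quick example of doing this with SQLite.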
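As an illustration of why SQL is easier to constrain, SQLite lets a connection register an authorizer callback that approves or denies each action a statement would perform. The sketch below uses the standard library's `sqlite3.Connection.set_authorizer`; the table contents and the `select_only` helper are hypothetical, and this is one possible guard rather than the approach LangChain itself takes.

```python
import sqlite3

# Actions a plain SELECT needs: the statement itself, column reads,
# and calls to SQL functions such as AVG().
ALLOWED_ACTIONS = {
    sqlite3.SQLITE_SELECT,
    sqlite3.SQLITE_READ,
    sqlite3.SQLITE_FUNCTION,
}


def select_only(action, arg1, arg2, db_name, source):
    """Authorizer callback: permit read-only actions and deny everything else."""
    return sqlite3.SQLITE_OK if action in ALLOWED_ACTIONS else sqlite3.SQLITE_DENY


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titanic (Name TEXT, Age REAL, Survived INTEGER)")
conn.executemany(
    "INSERT INTO titanic VALUES (?, ?, ?)",
    [("A", 30.0, 1), ("B", 20.0, 1), ("C", 40.0, 0)],  # made-up rows
)
conn.commit()

# From here on, only read-only statements are accepted on this connection.
conn.set_authorizer(select_only)

avg_age = conn.execute(
    "SELECT AVG(Age) FROM titanic WHERE Survived = 1"
).fetchone()[0]

try:
    conn.execute("DROP TABLE titanic")
    drop_blocked = False
except sqlite3.DatabaseError:
    drop_blocked = True

print(avg_age, drop_blocked)
```

There is no comparably simple switch for restricting what arbitrary Python code may do, which is the asymmetry the paragraph above points to.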

from langchain_community.utilities import SQLDatabase
from sqlalchemy import create_engine

engine = create_engine("sqlite:///titanic.db")
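# if_exists="replace" lets this cell be re-run without failing on an existing table.
df.to_sql("titanic", engine, index=False, if_exists="replace")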

db = SQLDatabase(engine=engine)
print(db.dialect)
print(db.get_usable_table_names())
print(db.run("SELECT * FROM titanic WHERE Age < 2;"))

sqlite
['titanic']
[(1, 2, 'Master. Alden Gates Caldwell', 'male', 0.83, 0, 2, 29.0), (0, 3, 'Master. Eino Viljami Panula', 'male', 1.0, 4, 1, 39.6875), (1, 3, 'Miss. Eleanor Ileen Johnson', 'female', 1.0, 1, 1, 11.1333), (1, 2, 'Master. Richard F Becker', 'male', 1.0, 2, 1, 39.0), (1, 1, 'Master. Hudson Trevor Allison', 'male', 0.92, 1, 2, 151.55), (1, 3, 'Miss. Maria Nakid', 'female', 1.0, 0, 2, 15.7417), (0, 3, 'Master. Sidney Leonard Goodwin', 'male', 1.0, 5, 2, 46.9), (1, 3, 'Miss. Helene Barbara Baclini', 'female', 0.75, 2, 1, 19.2583), (1, 3, 'Miss. Eugenie Baclini', 'female', 0.75, 2, 1, 19.2583), (1, 2, 'Master. Viljo Hamalainen', 'male', 0.67, 1, 1, 14.5), (1, 3, 'Master. Bertram Vere Dean', 'male', 1.0, 1, 2, 20.575), (1, 3, 'Master. Assad Alexander Thomas', 'male', 0.42, 0, 1, 8.5167), (1, 2, 'Master. Andre Mallet', 'male', 1.0, 0, 2, 37.0042), (1, 2, 'Master. George Sibley Richards', 'male', 0.83, 1, 1, 18.75)]

Then create a SQL agent to interact with it.

from dotenv import load_dotenv

load_dotenv()

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

from langchain_community.agent_toolkits import create_sql_agent

agent_executor = create_sql_agent(llm, db=db, agent_type="openai-tools", verbose=True)

agent_executor.invoke({"input": "what's the average age of survivors"})

> Entering new SQL Agent Executor chain...

Invoking: `sql_db_list_tables` with `{}`

titanic

Invoking: `sql_db_schema` with `{'table_names': 'titanic'}`

CREATE TABLE titanic (
    "Survived" BIGINT, 
    "Pclass" BIGINT, 
    "Name" TEXT, 
    "Sex" TEXT, 
    "Age" FLOAT, 
    "Siblings/Spouses Aboard" BIGINT, 
    "Parents/Children Aboard" BIGINT, 
    "Fare" FLOAT
)

/*
3 rows from titanic table:
Survived    Pclass    Name    Sex    Age    Siblings/Spouses Aboard    Parents/Children Aboard    Fare
0    3    Mr. Owen Harris Braund    male    22.0    1    0    7.25
1    1    Mrs. John Bradley (Florence Briggs Thayer) Cumings    female    38.0    1    0    71.2833
1    3    Miss. Laina Heikkinen    female    26.0    0    0    7.925
*/

Invoking: `sql_db_query` with `{'query': 'SELECT AVG(Age) AS Average_Age FROM titanic WHERE Survived = 1'}`

[(28.408391812865496,)]

The average age of survivors in the Titanic dataset is approximately 28.41 years.

> Finished chain.

Output

{'input': "what's the average age of survivors",
 'output': 'The average age of survivors in the Titanic dataset is approximately 28.41 years.'}

This approach generalizes easily to multiple CSV files: simply load each CSV file into the database as its own table. See the "Multiple CSVs" section below for details.
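A minimal sketch of that idea, using only the standard library, is shown below. The two in-memory "files" and their columns are hypothetical. Unlike `df.to_sql`, this loader does no type inference, so every value is stored as text and numeric columns need an explicit `CAST`.

```python
import csv
import io
import sqlite3

# Two tiny in-memory "CSV files" (hypothetical data, for illustration only).
CSV_FILES = {
    "passengers": "id,name,age\n1,Ann,30\n2,Bob,4\n",
    "tickets": "passenger_id,fare\n1,71.5\n2,12.0\n",
}

conn = sqlite3.connect(":memory:")
for table, text in CSV_FILES.items():
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    # One table per CSV file, named after the file.
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)

# Once both tables exist, a single query can join across the files.
result = conn.execute(
    "SELECT p.name, CAST(t.fare AS REAL) "
    "FROM passengers p JOIN tickets t ON p.id = t.passenger_id "
    "ORDER BY p.name"
).fetchall()

print(result)
```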

Pandas

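Instead of SQL, we can also use a data analysis library such as Pandas together with the LLM's code-generation ability to interact with CSV data. However, this approach is not suitable for production use without extensive safeguards. For this reason, the code-execution utilities and constructors live in the langchain-experimental package.

Chain

Most LLMs have been trained on enough Pandas Python code that they can generate it just by being asked.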

ai_msg = llm.invoke(
    "I have a pandas DataFrame 'df' with columns 'Age' and 'Fare'. Write code to compute the correlation between the two columns. Return Markdown for a Python code snippet and nothing else."
)
print(ai_msg.content)

```python
correlation = df['Age'].corr(df['Fare'])
correlation
```
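A reply in this form cannot be executed directly, because the fence markers are not Python. The sketch below shows one way to strip them with a regular expression. The `extract_code` helper is hypothetical and not part of LangChain; the tutorial itself avoids this step by switching to tool calling.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays readable


def extract_code(markdown: str) -> str:
    """Return the body of the first fenced block, or the whole text if there is none."""
    pattern = re.escape(FENCE) + r"(?:python)?\s*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, markdown, flags=re.DOTALL)
    return (match.group(1) if match else markdown).strip()


# A reply shaped like the model output shown above.
reply = f"{FENCE}python\ncorrelation = df['Age'].corr(df['Fare'])\ncorrelation\n{FENCE}"
code = extract_code(reply)
print(code)
```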

We can combine this ability with a Python-executing tool to create a simple data analysis chain. First we load the CSV table as a dataframe and give the tool access to that dataframe.

import pandas as pd
from langchain_core.prompts import ChatPromptTemplate
from langchain_experimental.tools import PythonAstREPLTool

df = pd.read_csv("titanic.csv")
tool = PythonAstREPLTool(locals={"df": df})
tool.invoke("df['Fare'].mean()")

32.30542018038331
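To give a feel for how a tool like this can return the value of the last expression, here is a simplified sketch of an AST-based evaluator. It is an assumption about the general idea, not the actual `PythonAstREPLTool` implementation, and `run_snippet` is a hypothetical name. It offers no sandboxing at all.

```python
import ast


def run_snippet(code: str, namespace: dict):
    """Execute code in namespace and return the value of its final expression, if any."""
    tree = ast.parse(code)
    if not tree.body:
        return None
    *setup, last = tree.body
    if setup:
        # Run every statement except the last one.
        module = ast.Module(body=setup, type_ignores=[])
        exec(compile(module, "<snippet>", "exec"), namespace)
    if isinstance(last, ast.Expr):
        # The last statement is a bare expression: evaluate it and return the value.
        expression = ast.Expression(body=last.value)
        return eval(compile(expression, "<snippet>", "eval"), namespace)
    # Otherwise the last statement has no value to return; just run it.
    module = ast.Module(body=[last], type_ignores=[])
    exec(compile(module, "<snippet>", "exec"), namespace)
    return None


# A plain list stands in for the dataframe column here.
mean_fare = run_snippet("total = sum(fares)\ntotal / len(fares)", {"fares": [10.0, 20.0, 30.0]})
print(mean_fare)
```

Because the snippet runs with full interpreter privileges, this pattern is exactly why the surrounding text warns against production use without safeguards.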

To make sure the Python tool is used correctly, we will use tool calling.

llm_with_tools = llm.bind_tools([tool], tool_choice=tool.name)
response = llm_with_tools.invoke(
    "I have a dataframe 'df' and want to know the correlation between the 'Age' and 'Fare' columns"
)
response

AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_SBrK246yUbdnJemXFC8Iod05', 'function': {'arguments': '{"query":"df.corr()[\'Age\'][\'Fare\']"}', 'name': 'python_repl_ast'}, 'type': 'function'}]}, response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 125, 'total_tokens': 138}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_3b956da36b', 'finish_reason': 'stop', 'logprobs': None}, id='run-1fd332ba-fa72-4351-8182-d464e7368311-0', tool_calls=[{'name': 'python_repl_ast', 'args': {'query': "df.corr()['Age']['Fare']"}, 'id': 'call_SBrK246yUbdnJemXFC8Iod05'}])

response.tool_calls

[{'name': 'python_repl_ast',
  'args': {'query': "df.corr()['Age']['Fare']"},
  'id': 'call_SBrK246yUbdnJemXFC8Iod05'}]

To extract the function call as a dict, we add the tool's output parser.

from langchain_core.output_parsers.openai_tools import JsonOutputKeyToolsParser

parser = JsonOutputKeyToolsParser(key_name=tool.name, first_tool_only=True)
(llm_with_tools | parser).invoke(
    "I have a dataframe 'df' and want to know the correlation between the 'Age' and 'Fare' columns"
)

{'query': "df[['Age', 'Fare']].corr()"}
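Conceptually, the parser selects the first tool call with the given name and returns its arguments. The plain-Python sketch below mimics that effect on the `tool_calls` list shown earlier; `first_tool_args` is a hypothetical helper, not the parser's real implementation.

```python
from typing import Optional


def first_tool_args(tool_calls: list, key_name: str) -> Optional[dict]:
    """Return the args of the first tool call named key_name, or None if absent."""
    for call in tool_calls:
        if call["name"] == key_name:
            return call["args"]
    return None


# Same shape as response.tool_calls in the output above.
tool_calls = [
    {
        "name": "python_repl_ast",
        "args": {"query": "df.corr()['Age']['Fare']"},
        "id": "call_SBrK246yUbdnJemXFC8Iod05",
    }
]

args = first_tool_args(tool_calls, "python_repl_ast")
print(args)
```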

Then we combine this with a prompt, so that we can pass in just a question without restating the dataframe information every time.

system = f"""You have access to a pandas dataframe `df`. \
Here is the output of `df.head().to_markdown()`:

```
{df.head().to_markdown()}
```

Given a user question, write the Python code to answer it. \
Return ONLY the valid Python code and nothing else. \
Don't assume you have access to any libraries other than built-in Python ones and pandas."""
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])
code_chain = prompt | llm_with_tools | parser
code_chain.invoke({"question": "What's the correlation between age and fare"})

{'query': "df[['Age', 'Fare']].corr()"}

Finally, we add the Python tool so that the generated code is actually executed.

chain = prompt | llm_with_tools | parser | tool
chain.invoke({"question": "What's the correlation between age and fare"})

0.11232863699941621

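That gives us a simple data analysis chain. The intermediate steps can be inspected in the LangSmith trace.

We can add one more LLM call at the end to generate a conversational response, so that the reply is not just the raw tool output. To do this, we need to add a MessagesPlaceholder for the chat history to the prompt.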

from operator import itemgetter

from langchain_core.messages import ToolMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough

system = f"""You have access to a pandas dataframe `df`. \
Here is the output of `df.head().to_markdown()`:

```
{df.head().to_markdown()}
```

Given a user question, write the Python code to answer it. \
Don't assume you have access to any libraries other than built-in Python ones and pandas.
Respond directly to the question once you have enough information to answer it."""
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            system,
        ),
        ("human", "{question}"),
        # This MessagesPlaceholder allows us to optionally append an arbitrary number of messages
        # at the end of the prompt using the 'chat_history' arg.
        MessagesPlaceholder("chat_history", optional=True),
    ]
)


def _get_chat_history(x: dict) -> list:
    """Parse the chain output up to this point into a list of chat history messages to insert in the prompt."""
    ai_msg = x["ai_msg"]
    tool_call_id = x["ai_msg"].additional_kwargs["tool_calls"][0]["id"]
    tool_msg = ToolMessage(tool_call_id=tool_call_id, content=str(x["tool_output"]))
    return [ai_msg, tool_msg]


chain = (
    RunnablePassthrough.assign(ai_msg=prompt | llm_with_tools)
    .assign(tool_output=itemgetter("ai_msg") | parser | tool)
    .assign(chat_history=_get_chat_history)
    .assign(response=prompt | llm | StrOutputParser())
    .pick(["tool_output", "response"])
)

chain.invoke({"question": "What's the correlation between age and fare"})

{'tool_output': 0.11232863699941616,
 'response': 'The correlation between age and fare is approximately 0.1123.'}

LangSmith trace for this run: https://smith.langchain.com/public/14e38d70-45b1-4b81-8477-9fd2b7c07ea6/r
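The chained `.assign(...)` calls can be read as a dict that gains one key per step, with each step able to see the keys added before it, and `.pick(...)` as a final key filter. The sketch below models that data flow with plain functions. The `assign` and `pick` helpers and all the stand-in values are hypothetical; no model or tool is actually called.

```python
def assign(state: dict, **steps) -> dict:
    """Return a copy of state with one new key per step; each step reads the incoming state."""
    new_state = dict(state)
    for key, step in steps.items():
        new_state[key] = step(state)
    return new_state


def pick(state: dict, keys: list) -> dict:
    """Keep only the requested keys."""
    return {key: state[key] for key in keys}


state = {"question": "What's the correlation between age and fare"}
# Stand-in for `prompt | llm_with_tools`: the model proposes code to run.
state = assign(state, ai_msg=lambda s: {"query": "df['Age'].corr(df['Fare'])"})
# Stand-in for `itemgetter("ai_msg") | parser | tool`: a made-up tool result.
state = assign(state, tool_output=lambda s: 0.1123)
# Stand-in for `_get_chat_history`: the tool call followed by its output.
state = assign(state, chat_history=lambda s: [s["ai_msg"], str(s["tool_output"])])
# Stand-in for `prompt | llm | StrOutputParser()`: a final natural-language reply.
state = assign(state, response=lambda s: f"The correlation is about {s['tool_output']}.")

result = pick(state, ["tool_output", "response"])
print(result)
```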

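Agent

For complex questions, it can help for the LLM to run code iteratively while keeping the inputs and outputs of its previous executions. This is where agents come in. An agent lets the LLM decide how many times a tool needs to be invoked and keeps track of the executions made so far. create_pandas_dataframe_agent is a built-in agent that makes it easy to work with dataframes.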

from langchain_experimental.agents import create_pandas_dataframe_agent

agent = create_pandas_dataframe_agent(llm, df, agent_type="openai-tools", verbose=True)
agent.invoke(
    {
        "input": "What's the correlation between age and fare? is that greater than the correlation between fare and survival?"
    }
)

> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': "df[['Age', 'Fare']].corr().iloc[0,1]"}`

0.11232863699941621

Invoking: `python_repl_ast` with `{'query': "df[['Fare', 'Survived']].corr().iloc[0,1]"}`

0.2561785496289603

The correlation between Age and Fare is approximately 0.112, and the correlation between Fare and Survival is approximately 0.256.

Therefore, the correlation between Fare and Survival (0.256) is greater than the correlation between Age and Fare (0.112).

> Finished chain.

{'input': "What's the correlation between age and fare? is that greater than the correlation between fare and survival?",
 'output': 'The correlation between Age and Fare is approximately 0.112, and the correlation between Fare and Survival is approximately 0.256.\n\nTherefore, the correlation between Fare and Survival (0.256) is greater than the correlation between Age and Fare (0.112).'}

LangSmith trace for this run: https://smith.langchain.com/public/6a86aee2-4f22-474a-9264-bd4c7283e665/r

Multiple CSVs

To handle multiple CSVs (or dataframes), we only need to pass multiple dataframes to the Python tool. The create_pandas_dataframe_agent constructor supports this out of the box: we can pass a list of dataframes instead of a single one. If we are building the chain ourselves, we can do something like the following.
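The loop behind a run like this can be sketched in a few lines. In the version below, a scripted function stands in for the LLM and a lookup table stands in for the Python tool; both are hypothetical, and the two numbers are rounded from the output above. It illustrates only the control flow, not how AgentExecutor is implemented.

```python
# Made-up tool results, rounded from the run shown above.
FAKE_RESULTS = {"corr(Age, Fare)": 0.112, "corr(Fare, Survived)": 0.256}


def fake_tool(query: str) -> float:
    """Stand-in for the Python tool: look up a precomputed result."""
    return FAKE_RESULTS[query]


def scripted_model(question: str, history: list) -> dict:
    """Stand-in for the LLM: request each planned query once, then answer."""
    planned = list(FAKE_RESULTS)
    if len(history) < len(planned):
        return {"type": "tool_call", "query": planned[len(history)]}
    first, second = (output for _, output in history)
    return {"type": "final", "answer": "yes" if second > first else "no"}


def run_agent(question: str, model, tool, max_steps: int = 5):
    """Call the model repeatedly, running the tool it asks for, until it gives a final answer."""
    history = []  # (query, output) pairs from earlier iterations
    for _ in range(max_steps):
        decision = model(question, history)
        if decision["type"] == "final":
            return decision["answer"], history
        history.append((decision["query"], tool(decision["query"])))
    raise RuntimeError("agent did not finish within max_steps")


answer, steps = run_agent(
    "Is corr(Fare, Survived) greater than corr(Age, Fare)?", scripted_model, fake_tool
)
print(answer, steps)
```

The `max_steps` cap is a design choice worth keeping in any real loop, since a model that never produces a final answer would otherwise run indefinitely.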

df_1 = df[["Age", "Fare"]]
df_2 = df[["Fare", "Survived"]]

tool = PythonAstREPLTool(locals={"df_1": df_1, "df_2": df_2})
llm_with_tool = llm.bind_tools(tools=[tool], tool_choice=tool.name)
df_template = """```python
{df_name}.head().to_markdown()
>>> {df_head}
```"""
df_context = "\n\n".join(
    df_template.format(df_head=_df.head().to_markdown(), df_name=df_name)
    for _df, df_name in [(df_1, "df_1"), (df_2, "df_2")]
)

system = f"""You have access to a number of pandas dataframes. \
Here is a sample of rows from each dataframe and the python code that was used to generate the sample:

{df_context}

Given a user question about the dataframes, write the Python code to answer it. \
Don't assume you have access to any libraries other than built-in Python ones and pandas. \
Make sure to refer only to the variables mentioned above."""
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])

chain = prompt | llm_with_tool | parser | tool
chain.invoke(
    {
        "question": "return the difference in the correlation between age and fare and the correlation between fare and survival"
    }
)