[langchain] Output Parser

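In LangChain, an OutputParser is the component that parses and post-processes model output. LangChain is a framework for building text-based applications on top of large language models, and the OutputParser converts the model's raw output into a suitable format and hands it to the next step of the application.

It lets you receive the result in the shape you want, such as a string, JSON, or a List, which makes later data manipulation easier.

The output parsers covered here provide get_format_instructions(), are Runnables, and can be plugged into a chain. You can also inject the instructions directly through a PromptTemplate.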
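
To make the two responsibilities concrete, here is a minimal plain-Python sketch of the same shape: one method that produces text for the prompt and one that parses the reply. SimpleListParser and fake_model are hypothetical names invented for this illustration; they are not LangChain classes, and no real model is called.

```python
class SimpleListParser:
    """Hypothetical parser, not a LangChain class: it only mirrors the two-method shape."""

    def __init__(self, separator: str = ";") -> None:
        self.separator = separator

    def get_format_instructions(self) -> str:
        # Text meant to be pasted into the prompt so the model knows the format.
        return f"Reply with the items separated by '{self.separator}' and nothing else."

    def parse(self, text: str) -> list[str]:
        # Turn the raw reply into a Python list, dropping empty pieces.
        return [part.strip() for part in text.split(self.separator) if part.strip()]


def fake_model(prompt: str) -> str:
    # Stand-in for an LLM call; it always returns the same reply.
    return "apple; banana ;cherry"


parser = SimpleListParser()
prompt = "List three fruits.\n" + parser.get_format_instructions()
result = parser.parse(fake_model(prompt))
print(result)
```

In a real chain the `|` operator wires these steps together; the sketch just calls them in order.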

Pydantic Output Parser

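PydanticOutputParser tells the LLM to answer in an output format that matches a given schema, then parses the reply into an instance of that schema's Pydantic model.

Consider the following example.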

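from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class Person(BaseModel):
    name: str = Field(description="person's name")
    hometown: str = Field(description="person's hometown")
    birthday: str = Field(description="person's birthday")


# Build the prompt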
chat_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an AI that provides information about historical figures.\n{format_instructions}",
        ),
        ("human", "Tell me about {name}"),
    ]
)

# Create the chat model
chat = ChatOpenAI()

# Create the output parser
output_parser = PydanticOutputParser(pydantic_object=Person)

# Build the chain
runnable = chat_prompt | chat | output_parser

# Run the chain
res = runnable.invoke(
    {
        "name": "소녀시대 윤아",  # Yoona of Girls' Generation
        "format_instructions": output_parser.get_format_instructions(),
    }
)
print(res)

The output format is set with PydanticOutputParser and the target type is Person, so the result comes back with values filled in for name, hometown, and birthday.

Output

name='윤아' hometown='서울, 대한민국' birthday='May 30, 1990'
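
Conceptually, the parsing step takes the JSON text returned by the model and loads it into a typed object, rejecting replies that do not fit. The sketch below imitates that idea with the standard library only; it uses a dataclass as a stand-in for the Pydantic model, the raw_reply string is written by hand for illustration, and parse_person is a hypothetical helper rather than anything LangChain provides.

```python
import json
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    hometown: str
    birthday: str


def parse_person(text: str) -> Person:
    """Load a JSON reply into a Person, failing loudly if the fields do not match."""
    data = json.loads(text)
    try:
        return Person(**data)
    except TypeError as err:
        # Raised for missing or unexpected fields, or a non-object reply.
        raise ValueError(f"reply does not match Person: {err}") from err


# Hypothetical raw model reply, written by hand for this sketch.
raw_reply = '{"name": "윤아", "hometown": "서울, 대한민국", "birthday": "May 30, 1990"}'
person = parse_person(raw_reply)
print(person)

# A reply with missing fields is rejected instead of being passed along.
rejected = False
try:
    parse_person('{"name": "윤아"}')
except ValueError:
    rejected = True
print("incomplete reply rejected:", rejected)
```

Unlike this stand-in, a Pydantic model also validates field types, which is one reason to prefer it over hand-rolled checks.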

Next, consider a model that contains a List field.

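from typing import List

from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class Actor(BaseModel):
    name: str = Field(description="name of an actor")
    film_names: List[str] = Field(description="list of names of films they starred in")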

parser = PydanticOutputParser(pydantic_object=Actor)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

model = ChatOpenAI()

chain = prompt | model | parser

# Query text: "films starring Song Kang-ho"
actor_query = "송강호의 출연작"
result = chain.invoke({"query": actor_query})
print(result)

The model is defined so that the films starring Song Kang-ho (송강호) are returned as name<str> and film_names<List>.

Output

name='송강호' film_names=['기생충', '살인의 추억', '광해, 왕이 된 남자', '반도', '엽기적인 그녀']

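Some of the listed films appear to be wrong, but the output does follow the format defined by the model.

The prompt actually sent to the model looks like the following. Note that the schema in this capture lists only titles and types; the field descriptions are absent.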

{
  "prompts": [
    "Human: Answer the user query.\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\nthe object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\"foo\": [\"bar\", \"baz\"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{\"properties\": {\"name\": {\"title\": \"Name\", \"type\": \"string\"}, \"film_names\": {\"items\": {\"type\": \"string\"}, \"title\": \"Film Names\", \"type\": \"array\"}}}\n```\n송강호의 출연작"
  ]
}
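
The dump shows that the format instructions are ordinary text spliced into the template before the query. A rough standard-library sketch of that assembly follows. The schema dictionary is copied by hand from the dump, build_format_instructions is a hypothetical helper, and its wording is abbreviated compared with what LangChain actually emits.

```python
import json

# Hand-written copy of the schema from the dump above (an assumption for this
# sketch; LangChain derives it from the Pydantic class instead).
actor_schema = {
    "properties": {
        "name": {"title": "Name", "type": "string"},
        "film_names": {
            "items": {"type": "string"},
            "title": "Film Names",
            "type": "array",
        },
    }
}

TEMPLATE = "Answer the user query.\n{format_instructions}\n{query}\n"


def build_format_instructions(schema: dict) -> str:
    # Abbreviated wording; the real instructions also include a worked example.
    return (
        "The output should be formatted as a JSON instance that conforms to "
        "the JSON schema below.\n\nHere is the output schema:\n" + json.dumps(schema)
    )


# Braces inside the substituted value are safe: only TEMPLATE itself is parsed.
prompt_text = TEMPLATE.format(
    format_instructions=build_format_instructions(actor_schema),
    query="송강호의 출연작",
)
print(prompt_text)
```

Passing the instructions through partial_variables, as the example above does, has the same effect: the placeholder is filled once, and only the query remains to be supplied at invoke time.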

CommaSeparatedListOutputParser

CommaSeparatedListOutputParser is useful when you need the result returned as a list of comma-separated items.

import langchain
from dotenv import load_dotenv
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import (
    CommaSeparatedListOutputParser,
)
from langchain_openai import ChatOpenAI

load_dotenv()

output_parser = CommaSeparatedListOutputParser()

format_instructions = output_parser.get_format_instructions()
prompt = PromptTemplate(
    # Korean template text: "List 5 items about {subject}."
    template="{subject}에 대해서 5개를 나열.\n{format_instructions}",
    input_variables=["subject"],
    partial_variables={"format_instructions": format_instructions},
)

llm = ChatOpenAI()

# Connect the prompt, the model, and the output parser into a chain
chain = prompt | llm | output_parser

# Query text: "good restaurants near Seocho Station"
question = "서초역 맛집"
result = chain.invoke({"subject": question})
print(result)

Output

['마카롱티', '미스터서왕만두', '더플레이스', '더플레이스', '더플레이스']
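
The parser only splits the reply; it does not check the items, which is why the same name appears three times above. The sketch below shows a simplified version of the split plus an optional de-duplication step. Both helpers are hypothetical, the reply string is reconstructed by hand from the output above, and the split is naive (it does not handle quoted items that contain commas).

```python
def parse_comma_separated(text: str) -> list[str]:
    """Split a reply such as 'foo, bar, baz' into ['foo', 'bar', 'baz']."""
    return [part.strip() for part in text.split(",") if part.strip()]


def drop_duplicates(items: list[str]) -> list[str]:
    """Remove repeated entries while keeping first-seen order."""
    return list(dict.fromkeys(items))


# Reply text reconstructed by hand from the output shown above.
reply = "마카롱티, 미스터서왕만두, 더플레이스, 더플레이스, 더플레이스"
items = parse_comma_separated(reply)
unique_items = drop_duplicates(items)
print(items)
print(unique_items)
```

A step like drop_duplicates could be appended to the chain as a plain function, but whether duplicates should be removed or the model re-queried depends on the application.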

The actual prompt looks like the following.

{
  "prompts": [
    "Human: 서초역 맛집에 대해서 5개를 나열.\nYour response should be a list of comma separated values, eg: `foo, bar, baz` or `foo,bar,baz`"
  ]
}

StructuredOutputParser

StructuredOutputParser can be used when you want several fields returned. The Pydantic and JSON parsers are more powerful, but StructuredOutputParser is simpler and can be useful with less capable models.

import langchain
from dotenv import load_dotenv
from langchain.output_parsers import (
    StructuredOutputParser,
    ResponseSchema,
)
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

load_dotenv()

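response_schemas = [
    # description: "the answer to the user's question"
    ResponseSchema(name="answer", description="사용자의 질문에 대한 답변"),
    ResponseSchema(
        name="source",
        # description: "must cite the source used to answer the user's question"
        description="사용자의 질문에 답하기 위해 사용된 출처를 표시해야 한다.",
    ),
]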
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)

format_instructions = output_parser.get_format_instructions()
prompt = PromptTemplate(
    template="answer the users question as best as possible.\n{format_instructions}\n{question}",
    input_variables=["question"],
    partial_variables={"format_instructions": format_instructions},
)

llm = ChatOpenAI()

chain = prompt | llm | output_parser

# Query text: "How did South Korea do at the 2002 Korea-Japan World Cup?"
question = "2002년 한일 월드컵에서 대한민국의 성적은?"
result = chain.invoke({"question": question})
print(result)

Output

{'answer': '2002년 한일 월드컵에서 대한민국은 4강에 진출하였습니다.', 'source': 'https://ko.wikipedia.org/wiki/2002%EB%85%84_FIFA_%EC%9B%94%EB%93%9C%EC%BB%B5'}

The prompt looks like the following.

{
  "prompts": [
    "Human: answer the users question as best as possible.\nThe output should be a markdown code snippet formatted in the following schema, including the leading and trailing \"```json\" and \"```\":\n\n```json\n{\n\t\"answer\": string  // 사용자의 질문에 대한 답변\n\t\"source\": string  // 사용자의 질문에 답하기 위해 사용된 출처를 표시해야 한다.\n}\n```\n2002년 한일 월드컵에서 대한민국의 성적은?"
  ]
}
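
Because these instructions ask for a markdown snippet wrapped in json fences, the parsing side has to locate that snippet before decoding it. Here is a rough standard-library sketch of that step, assuming the reply contains exactly such a fenced snippet. parse_fenced_json is a hypothetical helper and the reply text is invented for illustration; it is not LangChain's implementation.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the listing readable


def parse_fenced_json(text: str, expected_keys: list[str]) -> dict:
    """Extract the first json-fenced snippet and check it has the expected keys."""
    pattern = re.escape(FENCE) + r"json\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        raise ValueError("no json code snippet found in the reply")
    data = json.loads(match.group(1))
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data


# Hypothetical model reply, written by hand for this sketch.
reply = (
    "Here is the answer:\n"
    + FENCE + "json\n"
    + '{"answer": "Semi-finals", "source": "https://example.com"}\n'
    + FENCE
)
parsed = parse_fenced_json(reply, ["answer", "source"])
print(parsed["answer"], parsed["source"])
```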

JsonOutputParser

JsonOutputParser lets you specify an arbitrary JSON schema and asks the LLM to produce output that conforms to it. The result comes back as a plain dict.

import langchain
from dotenv import load_dotenv
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import (
    JsonOutputParser,
)
from langchain_openai import ChatOpenAI
# BaseModel and Field must come from the same pydantic version.
from pydantic import BaseModel, Field

load_dotenv()


class Movie(BaseModel):
    running_time: str = Field(description="상영시간에 해당")  # running time
    manager_name: str = Field(description="감독 이름")  # director's name
    attendance: str = Field(description="총 관객 수에 해당")  # total attendance
    release_date: str = Field(description="개봉일에 해당")  # release date
    country: str = Field(description="영화를 만든 국가에 해당")  # country of production


parser = JsonOutputParser(pydantic_object=Movie)

template = """
Answer the user query.
{format_instructions}

{query}
"""

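# Query text: "Tell me the running time, director, attendance, release date,
# and country of production of the film 'Silmido'."
query = "영화 '실미도'의 상영시간, 감독, 관객 수, 개봉일, 만든 국가에 대해 알려주세요"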


prompt = PromptTemplate(
    template=template,
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

model = ChatOpenAI()
chain = prompt | model | parser

result = chain.invoke({"query": query})

print(result)

Output

{
  "running_time": "130 minutes",
  "manager_name": "Kang Woo-suk",
  "attendance": "4.7 million",
  "release_date": "March 13, 2003",
  "country": "South Korea"
}

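The prompt looks like the following. The schema in this dump has no description entries, which is most likely the symptom of combining a pydantic v1 Field with a pydantic v2 BaseModel; when both come from the same package, each property should also carry its description.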

{
  "prompts": [
    "Human: \nAnswer the user query.\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\nthe object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\"foo\": [\"bar\", \"baz\"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{\"properties\": {\"running_time\": {\"title\": \"Running Time\", \"type\": \"string\"}, \"manager_name\": {\"title\": \"Manager Name\", \"type\": \"string\"}, \"attendance\": {\"title\": \"Attendance\", \"type\": \"string\"}, \"release_date\": {\"title\": \"Release Date\", \"type\": \"string\"}, \"country\": {\"title\": \"Country\", \"type\": \"string\"}}}\n```\n\n영화 '실미도'의 상영시간, 감독, 관객 수, 개봉일, 만든 국가에 대해 알려주세요"
  ]
}