[LangChain v1.0] Models

Source: https://docs.langchain.com/oss/python/langchain/models

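LLMs (large language models) are powerful AI tools that can interpret and generate text much as humans do. They are versatile enough to write content, translate languages, summarize, and answer questions without task-specific training for each.

Beyond text generation, many models also support:

Tool calling - calling external tools (such as database queries or API calls) and using the results in responses.
Structured output - constraining the model's response to follow a defined format.
Multimodality - processing and returning non-text data such as images, audio, and video.
Reasoning - performing multi-step reasoning to reach a conclusion.

Models are the reasoning engine of agents. They drive the agent's decision-making process: which tools to call, how to interpret the results, and when to give a final answer.

The quality and capabilities of the model you choose directly affect your agent's reliability and performance. Different models excel at different tasks: some are better at following complex instructions, others at structured reasoning, and some support larger context windows for handling more information.

LangChain's standard model interface gives you access to many provider integrations, so you can easily experiment with and switch between models to find the best fit for your use case.

For provider-specific integration information and capabilities, see the provider's chat model page.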

Basic usage

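Models can be used in two ways:

With agents - the model can be specified dynamically when creating an agent.
Standalone - the model can be called directly (outside the agent loop) for tasks such as text generation, classification, or extraction, with no agent framework required.

The same model interface works in both contexts, so you can start simple and scale up to more complex agent-based workflows as needed.

Initialize a model

The easiest way to get started with a standalone model in LangChain is to use init_chat_model to initialize one from a chat model provider of your choice (example below):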

OpenAI

👉 Read the OpenAI chat model integration docs

pip install -U "langchain[openai]"

init_chat_model

import os
from langchain.chat_models import init_chat_model

os.environ["OPENAI_API_KEY"] = "sk-..."

model = init_chat_model("gpt-4.1")

response = model.invoke("Why do parrots talk?")

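See init_chat_model for more detail, including how to pass model parameters.

Key methods

Invoke - the model takes messages as input and outputs messages after generating a complete response.

Stream - invokes the model, but streams the output as it is generated in real time.

Batch - sends multiple requests to a model in a batch for more efficient processing.

In addition to chat models, LangChain supports adjacent technologies such as embedding models and vector stores. See the integrations page for details.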

Parameters

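A chat model takes parameters that configure its behavior. The full set of supported parameters varies by model and provider, but the standard ones include:

model

string required

The name or identifier of the specific model you want to use with a provider.

api_key

string

The key required to authenticate with the model's provider. It is usually issued when you sign up for access to the model, and is often supplied by setting an environment variable.

temperature

number

Controls the randomness of the model's output. Higher values make responses more creative; lower values make them more deterministic.

timeout

number

The maximum time (in seconds) to wait for a response from the model before canceling the request.

max_tokens

number

Limits the total number of tokens in the response, effectively controlling how long the output can be.

max_retries

number

The maximum number of times the system will resend a request that fails because of issues such as network timeouts or rate limits.

With init_chat_model, pass these parameters as inline **kwargs: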

Initialize using model parameters

model = init_chat_model(
    "claude-sonnet-4-5-20250929",
    # Kwargs passed to the model:
    temperature=0.7,
    timeout=30,
    max_tokens=1000,
)

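Each chat model integration may have additional parameters that control provider-specific functionality. For example, ChatOpenAI has use_responses_api, which determines whether to use the OpenAI Responses API or the Completions API.

To find all the parameters a given chat model supports, head to the chat model integrations page.

Invocation

A chat model must be invoked to generate output. There are three primary invocation methods, each suited to different use cases.

Invoke

The most straightforward way to call a model is invoke() with a single message or a list of messages.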

Single message

response = model.invoke("Why do parrots have colorful feathers?")
print(response)

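You can provide a list of messages to the model to represent conversation history. Each message has a role that the model uses to identify who sent it in the conversation. See the messages guide for more detail on roles, types, and content.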

Dictionary format

from langchain.messages import HumanMessage, AIMessage, SystemMessage

conversation = [
    {"role": "system", "content": "You are a helpful assistant that translates English to French."},
    {"role": "user", "content": "Translate: I love programming."},
    {"role": "assistant", "content": "J'adore la programmation."},
    {"role": "user", "content": "Translate: I love building applications."}
]

response = model.invoke(conversation)
print(response)
# AIMessage("J'adore créer des applications.")

Message objects

from langchain_core.messages import HumanMessage, AIMessage, SystemMessage

conversation = [
    SystemMessage("You are a helpful assistant that translates English to French."),
    HumanMessage("Translate: I love programming."),
    AIMessage("J'adore la programmation."),
    HumanMessage("Translate: I love building applications.")
]

response = model.invoke(conversation)
print(response)
# AIMessage("J'adore créer des applications.")

Stream

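Most models can stream their output content while it is being generated. Displaying output progressively significantly improves the user experience, particularly for longer responses.

Calling stream() returns an iterator that yields output chunks as they are produced. You can use a loop to process each chunk in real time: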

Basic text streaming

for chunk in model.stream("Why do parrots have colorful feathers?"):
    print(chunk.text, end="|", flush=True)

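Unlike invoke(), which returns a single AIMessage after the model has finished generating its full response, stream() returns multiple AIMessageChunk objects, each containing a portion of the output text. Importantly, the chunks in a stream are designed to be gathered into a full message through summation: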

Construct an AIMessage

full = None  # None | AIMessageChunk
for chunk in model.stream("What color is the sky?"):
    full = chunk if full is None else full + chunk
    print(full.text)  # Print the text accumulated so far

# The
# The sky
# The sky is
# The sky is typically
# The sky is typically blue
# ...

print(full.content_blocks)
# [{"type": "text", "text": "The sky is typically blue..."}]

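The resulting message can be treated the same as a message generated with invoke(): for example, it can be aggregated into a message history and passed back to the model as conversational context.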
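To make the summation idea concrete without calling a model, here is a minimal standard-library sketch. TextChunk is a made-up toy class, not LangChain's AIMessageChunk; it only assumes that adding two chunks concatenates their text.

```python
from dataclasses import dataclass


@dataclass
class TextChunk:
    """Toy stand-in for a streamed message chunk (hypothetical class)."""
    text: str

    def __add__(self, other: "TextChunk") -> "TextChunk":
        # Adding two chunks yields a new chunk holding the combined text.
        return TextChunk(self.text + other.text)


# Hypothetical stream of chunks, in arrival order.
stream = [TextChunk("The"), TextChunk(" sky"), TextChunk(" is"), TextChunk(" blue")]

full = None
snapshots = []  # text accumulated after each chunk
for chunk in stream:
    full = chunk if full is None else full + chunk
    snapshots.append(full.text)

print(snapshots)
```

The same accumulate-with-`+` loop is what the example above applies to real chunks.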

Streaming only works if every step in the program knows how to process a stream of chunks. An application that must hold the entire output in memory before it can process it, for instance, is not streaming-capable.

Advanced streaming topics

"Auto-streaming" chat models

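LangChain simplifies streaming from chat models by automatically enabling streaming mode in certain cases, even when you do not call the streaming methods explicitly. This is particularly useful when you use the non-streaming invoke method but still want to stream the entire application, including intermediate results from the chat model.

In LangGraph agents, for example, you can call model.invoke() within a node, and LangChain automatically delegates to streaming if the graph is running in streaming mode.

How it works

When you invoke() a chat model, LangChain automatically switches to an internal streaming mode if it detects that you are trying to stream the overall application. As far as the code calling invoke is concerned, the result is the same. While the chat model is being streamed, however, LangChain takes care of firing on_llm_new_token events in its callback system.

These callback events allow LangGraph stream() and astream_events() to surface the chat model's output in real time.

Streaming events

LangChain chat models can also stream semantic events using astream_events().

This simplifies filtering by event type and other metadata, and aggregates the full message in the background. See below for an example.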

async for event in model.astream_events("Hello"):
    if event["event"] == "on_chat_model_start":
        print(f"Input: {event['data']['input']}")
    elif event["event"] == "on_chat_model_stream":
        print(f"Token: {event['data']['chunk'].text}")
    elif event["event"] == "on_chat_model_end":
        print(f"Full message: {event['data']['output'].text}")
    else:
        pass

Input: Hello
Token: Hi
Token:  there
Token: !
Token:  How
Token:  can
Token:  I
...
Full message: Hi there! How can I help today?

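See the astream_events() reference for event types and other details.

Batch

Batching a collection of independent requests to a model can significantly improve performance and reduce costs, because the processing can be done in parallel: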

Batch

responses = model.batch([
    "Why do parrots have colorful feathers?",
    "How do airplanes fly?",
    "What is quantum computing?"
])

for response in responses:
    print(response)

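This section describes the chat model method batch(), which parallelizes model calls on the client side.

It is distinct from the batch APIs offered by inference providers such as OpenAI or Anthropic.

By default, batch() returns only the final output for the entire batch. To receive the output for each input as soon as it finishes generating, stream the results with batch_as_completed():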

Yield batch responses upon completion

for response in model.batch_as_completed([
    "Why do parrots have colorful feathers?",
    "How do airplanes fly?",
    "What is quantum computing?"
]):
    print(response)

When using batch_as_completed(), results may arrive out of order. Each one includes the input index, which you can use to reconstruct the original order if needed.
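Restoring the input order from those indices takes only a few lines. The sketch below uses hard-coded data rather than a real model, and it assumes each yielded item is an (index, output) pair; the `completed` list is a made-up stand-in for what the iterator might produce.

```python
# Hypothetical stand-in for items yielded by batch_as_completed():
# (input_index, output) pairs in completion order, not input order.
completed = [
    (2, "answer about quantum computing"),
    (0, "answer about parrots"),
    (1, "answer about airplanes"),
]

# Place each output back in the slot of the input that produced it.
ordered = [None] * len(completed)
for index, output in completed:
    ordered[index] = output

print(ordered)
```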

When processing a large number of inputs with batch() or batch_as_completed(), you may want to control the maximum number of parallel calls. Do this by setting the max_concurrency attribute in the RunnableConfig dictionary.

Batch with max concurrency

model.batch(
    list_of_inputs,
    config={
        'max_concurrency': 5,  # Limit to 5 parallel calls
    }
)

See the RunnableConfig reference for the full list of supported attributes.

For more details on batching, see the reference.
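As a rough illustration of what a concurrency cap means in practice, the following standard-library sketch runs 20 simulated calls through a thread pool limited to 5 workers and records the peak number in flight. It is not how LangChain implements max_concurrency; fake_model_call and the stats dictionary are hypothetical names used only for this demonstration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 5  # same cap as in the max_concurrency example

lock = threading.Lock()
stats = {"active": 0, "peak": 0}  # calls in flight now / highest seen


def fake_model_call(prompt: str) -> str:
    """Hypothetical stand-in for a network-bound model call."""
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)  # simulate latency
    with lock:
        stats["active"] -= 1
    return f"response to {prompt}"


inputs = [f"question {i}" for i in range(20)]

# The pool never runs more than MAX_CONCURRENCY calls at once;
# map() returns the results in input order.
with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
    results = list(pool.map(fake_model_call, inputs))

print(len(results), "results, peak concurrency:", stats["peak"])
```

A lower cap trades throughput for gentler load on the provider, which matters when rate limits are tight.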

Tool calling

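Models can request to call tools that perform tasks such as fetching data from a database, searching the web, or running code. A tool is a pairing of:

A schema, including the tool's name, a description, and/or argument definitions (often a JSON schema)
A function or coroutine to execute

You may hear the term "function calling". We use it interchangeably with "tool calling".

To make the tools you have defined available to a model, you must bind them with bind_tools(). In subsequent invocations, the model can choose to call any of the bound tools as needed.

Some model providers offer built-in tools that can be enabled through model or invocation parameters (e.g. ChatOpenAI, ChatAnthropic). Check the respective provider reference for details.

See the tools guide for details and other options for creating tools.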

Binding user tools

from langchain.tools import tool

@tool
def get_weather(location: str) -> str:
    """Get the weather at a location."""
    return f"It's sunny in {location}."

model_with_tools = model.bind_tools([get_weather])

response = model_with_tools.invoke("What's the weather like in Boston?")

for tool_call in response.tool_calls:
    # View tool calls made by the model
    print(f"Tool: {tool_call['name']}")
    print(f"Args: {tool_call['args']}")

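When you bind user-defined tools, the model's response includes a request to execute a tool. If you are using the model separately from an agent, it is up to you to perform the requested action and return the result to the model for use in subsequent reasoning. If you are using an agent, the agent loop handles the tool execution loop for you.

Below, we show some common ways to use tool calling.

Tool execution loop

When a model returns tool calls, you need to execute the tools and pass the results back to the model. This creates a conversation loop in which the model can use the tool results to generate its final response. LangChain includes agent abstractions that handle this orchestration for you.

Here is a simple example of how to do this: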

Tool execution loop

# Bind (potentially multiple) tools to the model
model_with_tools = model.bind_tools([get_weather])

# Step 1: Model generates tool calls
messages = [{"role": "user", "content": "What's the weather in Boston?"}]
ai_msg = model_with_tools.invoke(messages)
messages.append(ai_msg)

# Step 2: Execute tools and collect results
for tool_call in ai_msg.tool_calls:
    # Execute the tool with the generated arguments
    tool_result = get_weather.invoke(tool_call)
    messages.append(tool_result)

# Step 3: Pass results back to model for final response
final_response = model_with_tools.invoke(messages)
print(final_response.text)
# "The current weather in Boston is 72°F and sunny."

Each ToolMessage returned by a tool includes a tool_call_id that matches the original tool call, which helps the model correlate results with requests.
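The id matching can be seen in isolation with plain dictionaries. This is a sketch under stated assumptions: the tool calls are hard-coded rather than produced by a model, the TOOLS registry is a hypothetical helper, and the results are plain dicts rather than ToolMessage objects.

```python
def get_weather(location: str) -> str:
    """Plain-Python version of the tool used above."""
    return f"It's sunny in {location}."


# Hypothetical registry mapping tool names to callables.
TOOLS = {"get_weather": get_weather}

# Hypothetical tool calls, shaped like the entries of `ai_msg.tool_calls`.
tool_calls = [
    {"name": "get_weather", "args": {"location": "Boston"}, "id": "call_1"},
    {"name": "get_weather", "args": {"location": "Tokyo"}, "id": "call_2"},
]

tool_messages = []
for call in tool_calls:
    # Look the tool up by name and run it with the generated arguments.
    result = TOOLS[call["name"]](**call["args"])
    # Echo the call's id so the result can be matched to its request.
    tool_messages.append(
        {"role": "tool", "tool_call_id": call["id"], "content": result}
    )

for message in tool_messages:
    print(message)
```

Dispatching through a name-to-function mapping keeps the loop unchanged as more tools are bound.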

Forcing tool calls

By default, the model is free to choose which bound tool to use based on the user's input. You can, however, force the choice by requiring the model to use either a particular tool or any tool from a given list:

Force use of any tool

model_with_tools = model.bind_tools([tool_1], tool_choice="any")

Parallel tool calls

Many models support calling multiple tools in parallel when appropriate. This lets the model gather information from different sources at the same time.

Parallel tool calls

model_with_tools = model.bind_tools([get_weather])

response = model_with_tools.invoke("What's the weather in Boston and Tokyo?")

# The model may generate multiple tool calls
print(response.tool_calls)
# [
#   {'name': 'get_weather', 'args': {'location': 'Boston'}, 'id': 'call_1'},
#   {'name': 'get_weather', 'args': {'location': 'Tokyo'}, 'id': 'call_2'},
# ]

# Execute all tools (can be done in parallel with async)
results = []
for tool_call in response.tool_calls:
    if tool_call['name'] == 'get_weather':
        result = get_weather.invoke(tool_call)
        ...
    results.append(result)

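The model decides when parallel execution is appropriate based on the independence of the requested operations.

Most models that support tool calling enable parallel tool calls by default. Some (including OpenAI and Anthropic) let you disable this feature. To do so, set parallel_tool_calls=False: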

model.bind_tools([get_weather], parallel_tool_calls=False)

Streaming tool calls

When streaming responses, tool calls are built up progressively through ToolCallChunk. This lets you see tool calls as they are being generated rather than waiting for the complete response.

Streaming tool calls

for chunk in model_with_tools.stream("What's the weather in Boston and Tokyo?"):
    # Tool call chunks arrive progressively
    for tool_chunk in chunk.tool_call_chunks:
        if name := tool_chunk.get("name"):
            print(f"Tool: {name}")
        if id_ := tool_chunk.get("id"):
            print(f"ID: {id_}")
        if args := tool_chunk.get("args"):
            print(f"Args: {args}")

# Output:
# Tool: get_weather
# ID: call_SvMlU1TVIZugrFLckFE2ceRE
# Args: {"lo
# Args: catio
# Args: n": "B
# Args: osto
# Args: n"}
# Tool: get_weather
# ID: call_QMZdy6qInx13oWKE7KhuhOLR
# Args: {"lo
# Args: catio
# Args: n": "T
# Args: okyo
# Args: "}

You can accumulate chunks to build complete tool calls:

Accumulate tool calls

gathered = None
for chunk in model_with_tools.stream("What's the weather in Boston?"):
    gathered = chunk if gathered is None else gathered + chunk

print(gathered.tool_calls)
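What that accumulation amounts to can be sketched by hand: concatenate the argument fragments that belong to the same call, then parse the joined string as JSON. The chunk dictionaries below are hypothetical, modeled on the streamed output shown earlier, and the sketch assumes each fragment carries an index identifying its call.

```python
import json


def fragment(index, args, name=None, id_=None):
    """Build a hypothetical tool-call chunk as a plain dict."""
    return {"index": index, "name": name, "id": id_, "args": args}


# Only the first chunk of each call carries the tool name and id.
chunks = [
    fragment(0, "", name="get_weather", id_="call_1"),
    fragment(0, '{"lo'), fragment(0, "catio"), fragment(0, 'n": "B'),
    fragment(0, "osto"), fragment(0, 'n"}'),
    fragment(1, "", name="get_weather", id_="call_2"),
    fragment(1, '{"lo'), fragment(1, "catio"), fragment(1, 'n": "T'),
    fragment(1, "okyo"), fragment(1, '"}'),
]

calls = {}
for chunk in chunks:
    call = calls.setdefault(chunk["index"], {"name": None, "id": None, "args": ""})
    call["name"] = call["name"] or chunk["name"]
    call["id"] = call["id"] or chunk["id"]
    call["args"] += chunk["args"]  # join the partial JSON text

# Parse the joined argument strings only once each call is complete.
tool_calls = [
    {"name": c["name"], "id": c["id"], "args": json.loads(c["args"])}
    for _, c in sorted(calls.items())
]
print(tool_calls)
```

The argument text is valid JSON only after the last fragment arrives, which is why parsing is deferred to the end.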

Structured outputs

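Models can be asked to provide their response in a format that matches a given schema. This is useful for ensuring the output can be parsed easily and used in subsequent processing. LangChain supports multiple schema types and methods for enforcing structured output.

Pydantic models provide the richest feature set, with field validation, descriptions, and nested structures.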

from pydantic import BaseModel, Field

class Movie(BaseModel):
    """A movie with details."""
    title: str = Field(..., description="The title of the movie")
    year: int = Field(..., description="The year the movie was released")
    director: str = Field(..., description="The director of the movie")
    rating: float = Field(..., description="The movie's rating out of 10")

model_with_structure = model.with_structured_output(Movie)

response = model_with_structure.invoke("Provide details about the movie Inception")
print(response)
# Movie(title="Inception", year=2010, director="Christopher Nolan", rating=8.8)

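Key considerations for structured output:

Method parameter: some providers support different methods ('json_schema', 'function_calling', 'json_mode')

'json_schema' typically refers to dedicated structured output features offered by a provider

'function_calling' derives structured output by forcing a tool call that follows the given schema

'json_mode' is a precursor to 'json_schema' offered by some providers: it generates valid JSON, but the schema must be described in the prompt

Include raw: use include_raw=True to get both the parsed output and the raw AIMessage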
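Keeping the raw output next to the parsed one is mainly useful when parsing fails. The standard-library sketch below imitates that idea with a dataclass and a JSON string; it is not LangChain's implementation. The key names raw, parsed, and parsing_error mirror my understanding of what include_raw=True returns and should be treated as an assumption to check against the reference.

```python
import json
from dataclasses import dataclass


@dataclass
class Movie:
    title: str
    year: int
    director: str
    rating: float


def parse_with_raw(raw_text: str) -> dict:
    """Return the raw text alongside the parsed object or the parsing error."""
    try:
        parsed = Movie(**json.loads(raw_text))
        error = None
    except (json.JSONDecodeError, TypeError) as exc:
        # Invalid JSON, or JSON that does not match the Movie fields.
        parsed, error = None, exc
    return {"raw": raw_text, "parsed": parsed, "parsing_error": error}


good = parse_with_raw(
    '{"title": "Inception", "year": 2010, '
    '"director": "Christopher Nolan", "rating": 8.8}'
)
bad = parse_with_raw('{"title": "Inception"')  # truncated JSON

print(good["parsed"])
print(type(bad["parsing_error"]).__name__)
```

With the raw text still available, a caller can log it or retry instead of losing the response to an exception.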

Supported models

LangChain supports models from a wide range of providers. See the chat model integrations page for the full list and details on each integration.

Advanced topics

Multimodal

Some models can process and return non-textual data such as images, audio, and video. You can pass non-textual data to a model by providing content blocks.

All LangChain chat models with underlying multimodal capabilities support:

Data in the cross-provider standard format (see the messages guide)

OpenAI chat completions format

See the multimodal section of the messages guide for details.
Some models can return multimodal data as part of their response. If invoked to do so, the resulting AIMessage will have content blocks with multimodal types.

response = model.invoke("Create a picture of a cat")
print(response.content_blocks)

# [
#     {"type": "text", "text": "Here's a picture of a cat"},
#     {"type": "image", "base64": "...", "mime_type": "image/jpeg"},
# ]

Reasoning

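Newer models can perform multi-step reasoning to arrive at a conclusion. This involves breaking complex problems down into smaller, more manageable steps.
If the underlying model supports it, you can surface this reasoning process to better understand how the model arrived at its final answer.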

Stream reasoning output

for chunk in model.stream("Why do parrots have colorful feathers?"):
    reasoning_steps = [r for r in chunk.content_blocks if r["type"] == "reasoning"]
    print(reasoning_steps if reasoning_steps else chunk.text)

Complete reasoning output

response = model.invoke("Why do parrots have colorful feathers?")
reasoning_steps = [b for b in response.content_blocks if b["type"] == "reasoning"]
print(" ".join(step["reasoning"] for step in reasoning_steps))

Depending on the model, you can sometimes specify the level of effort it should put into reasoning.
You can also request that the model turn off reasoning entirely.
This may take the form of:

Categorical tiers of reasoning, e.g. 'low' or 'high'
An integer token budget

See the integrations page or reference for your chat model for details.

Local models

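LangChain supports running models locally on your own hardware. This is useful when data privacy is critical, when you want to invoke a custom model, or when you want to avoid the costs of a cloud-based model.
Ollama is one of the easiest ways to run models locally. See the integrations page for the full list of local integrations.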

Prompt caching

Many providers offer prompt caching features to reduce latency and cost when the same tokens are processed repeatedly. These features can be implicit or explicit.

Implicit prompt caching: the provider automatically passes on cost savings when a request hits a cache.
Examples: OpenAI and Gemini (Gemini 2.5 and above).
Explicit caching: you manually indicate cache points for greater control or to guarantee cost savings.
Examples: ChatOpenAI (via prompt_cache_key), Anthropic's AnthropicPromptCachingMiddleware and cache_control options, AWS Bedrock, Gemini.

Prompt caching is often only engaged above a minimum input token threshold. See the provider pages for the exact conditions and figures.

Cache usage is reflected in the usage metadata of the model response.

Server-side tool use

Some providers support server-side tool-calling loops: the model can interact with web search, code interpreters, and other tools, and analyze the results, within a single conversational turn.
When a model invokes a tool server-side, the content of the response message includes content representing the invocation and the result of the tool. Accessing the response's content blocks returns the server-side tool calls and results in a provider-agnostic format:

Invoke with server-side tool use

from langchain.chat_models import init_chat_model

model = init_chat_model("gpt-4.1-mini")

tool = {"type": "web_search"}
model_with_tools = model.bind_tools([tool])

response = model_with_tools.invoke("What was a positive news story from today?")
response.content_blocks

Result

[
    {
        "type": "server_tool_call",
        "name": "web_search",
        "args": {
            "query": "positive news stories today",
            "type": "search"
        },
        "id": "ws_abc123"
    },
    {
        "type": "server_tool_result",
        "tool_call_id": "ws_abc123",
        "status": "success"
    },
    {
        "type": "text",
        "text": "Here are some positive news stories from today...",
        "annotations": [
            {
                "end_index": 410,
                "start_index": 337,
                "title": "article title",
                "type": "citation",
                "url": "..."
            }
        ]
    }
]

This represents a single conversational turn; there are no associated ToolMessage objects that need to be passed in, as there are in client-side tool calling.
See the integration page for your provider for the available tools and usage details.
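Since calls and results arrive as separate content blocks, a common follow-up step is pairing them by id. The sketch below works on a hard-coded list shaped like the result shown above (annotations omitted), so it runs without a provider; the summary structure is an illustrative choice, not part of any API.

```python
# Content blocks shaped like the example result above (annotations omitted).
content_blocks = [
    {
        "type": "server_tool_call",
        "name": "web_search",
        "args": {"query": "positive news stories today", "type": "search"},
        "id": "ws_abc123",
    },
    {"type": "server_tool_result", "tool_call_id": "ws_abc123", "status": "success"},
    {"type": "text", "text": "Here are some positive news stories from today..."},
]

# Index results by the id of the call they answer.
results = {
    block["tool_call_id"]: block
    for block in content_blocks
    if block["type"] == "server_tool_result"
}

# One (tool name, query, status) tuple per server-side call.
summary = [
    (block["name"], block["args"]["query"], results.get(block["id"], {}).get("status"))
    for block in content_blocks
    if block["type"] == "server_tool_call"
]

# The user-facing answer is the concatenation of the text blocks.
answer = "".join(block["text"] for block in content_blocks if block["type"] == "text")

print(summary)
print(answer)
```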

Rate limiting

Many chat model providers limit the number of invocations that can be made in a given time period. If you hit a rate limit, you will typically receive a rate limit error response from the provider and need to wait before making more requests.

To help manage rate limits, chat model integrations accept a rate_limiter parameter at initialization, which controls the rate at which requests are made.

Initialize and use a rate limiter

LangChain comes with an (optional) built-in InMemoryRateLimiter.
This limiter is thread-safe and can be shared by multiple threads in the same process.

from langchain_core.rate_limiters import InMemoryRateLimiter

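rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.1,  # 1 request every 10 seconds
    check_every_n_seconds=0.1,  # Check every 100 ms whether a request is allowed
    max_bucket_size=10,  # Maximum burst size
)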

model = init_chat_model(
    model="gpt-5",
    rate_limiter=rate_limiter
)

The provided rate limiter can only limit the number of requests per unit of time.
It will not help if you also need to limit based on the size of the requests (e.g. tokens or bytes).
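One common way to build a request-count limiter like this is a token bucket: tokens accrue at a fixed rate up to a cap, and each request spends one. The class below is a simplified, single-threaded illustration of that technique, not LangChain's implementation. It borrows the parameter names from the example above, and timestamps are passed in explicitly (an assumption made here) so the behavior is deterministic.

```python
class TokenBucket:
    """Simplified token bucket: one token is spent per request."""

    def __init__(self, requests_per_second: float, max_bucket_size: float):
        self.rate = requests_per_second
        self.capacity = max_bucket_size
        self.tokens = 0.0   # the bucket starts empty
        self.last = None    # time of the previous refill

    def try_acquire(self, now: float) -> bool:
        """Refill for the elapsed time, then spend a token if one is available."""
        if self.last is not None:
            elapsed = now - self.last
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(requests_per_second=0.1, max_bucket_size=10)

# At t=0 the bucket is empty, at t=5 only half a token has accrued,
# at t=10 one full token is available, and it is spent immediately.
outcomes = [bucket.try_acquire(now=t) for t in (0.0, 5.0, 10.0, 10.0)]
print(outcomes)
```

Because only requests are counted, a single very large prompt costs the same one token as a tiny one, which is the limitation noted above.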

Base URL or proxy

For many chat model integrations, you can configure the base URL for API requests. This lets you use model providers that expose OpenAI-compatible APIs, or route requests through a proxy server.

Base URL

Many model providers offer OpenAI-compatible APIs (e.g. Together AI, vLLM).
To use one of these providers, specify the appropriate base_url parameter when calling init_chat_model:

model = init_chat_model(
    model="MODEL_NAME",
    model_provider="openai",
    base_url="BASE_URL",
    api_key="YOUR_API_KEY",
)

When instantiating a chat model class directly, the parameter name may vary by provider. Check the respective provider's reference for details.

Proxy configuration

For deployments that require an HTTP proxy, some model integrations support proxy configuration.

from langchain_openai import ChatOpenAI

model = ChatOpenAI(
    model="gpt-4o",
    openai_proxy="http://proxy.example.com:8080"
)

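Proxy support varies by integration.
Check the reference for your model provider for proxy configuration options.

Log probabilities

Certain models can be configured to return token-level log probabilities, which represent the likelihood of a given token, by setting the logprobs parameter when initializing the model.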

model = init_chat_model(
    model="gpt-4o",
    model_provider="openai"
).bind(logprobs=True)

response = model.invoke("Why do parrots talk?")
print(response.response_metadata["logprobs"])

Token usage

A number of model providers return token usage information as part of the invocation response. When available, this information is included on the AIMessage objects produced by the corresponding model.
See the messages guide for details.

Some provider APIs, notably the OpenAI and Azure OpenAI chat completions APIs, require you to opt in to receive token usage data in streaming contexts.
See the streaming usage metadata section of the integration guide for details.

You can track aggregate token counts across models in an application using either a callback or a context manager, as shown below.

Callback handler

from langchain.chat_models import init_chat_model
from langchain_core.callbacks import UsageMetadataCallbackHandler

model_1 = init_chat_model(model="gpt-4o-mini")
model_2 = init_chat_model(model="claude-haiku-4-5-20251001")

callback = UsageMetadataCallbackHandler()
result_1 = model_1.invoke("Hello", config={"callbacks": [callback]})
result_2 = model_2.invoke("Hello", config={"callbacks": [callback]})
callback.usage_metadata

Context manager

from langchain.chat_models import init_chat_model
from langchain_core.callbacks import get_usage_metadata_callback

model_1 = init_chat_model(model="gpt-4o-mini")
model_2 = init_chat_model(model="claude-haiku-4-5-20251001")

with get_usage_metadata_callback() as cb:
    model_1.invoke("Hello")
    model_2.invoke("Hello")
    print(cb.usage_metadata)

{
    'gpt-4o-mini-2024-07-18': {
        'input_tokens': 8,
        'output_tokens': 10,
        'total_tokens': 18,
        'input_token_details': {'audio': 0, 'cache_read': 0},
        'output_token_details': {'audio': 0, 'reasoning': 0}
    },
    'claude-haiku-4-5-20251001': {
        'input_tokens': 8,
        'output_tokens': 21,
        'total_tokens': 29,
        'input_token_details': {'cache_read': 0, 'cache_creation': 0}
    }
}
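Conceptually, the aggregation is a per-model running sum. The following standard-library sketch shows that idea on hard-coded records; the third record and the grand_total figure are hypothetical additions for illustration, and the code is not the handler's actual implementation.

```python
from collections import defaultdict

# Hypothetical (model_name, usage) records, shaped like the counts above.
records = [
    ("gpt-4o-mini-2024-07-18", {"input_tokens": 8, "output_tokens": 10, "total_tokens": 18}),
    ("claude-haiku-4-5-20251001", {"input_tokens": 8, "output_tokens": 21, "total_tokens": 29}),
    ("gpt-4o-mini-2024-07-18", {"input_tokens": 12, "output_tokens": 30, "total_tokens": 42}),
]

# Running totals keyed by model name.
totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0})
for model_name, usage in records:
    for key, count in usage.items():
        totals[model_name][key] += count

# Overall token count across every model.
grand_total = sum(usage["total_tokens"] for usage in totals.values())

print(dict(totals))
print(grand_total)
```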

Invocation config

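When invoking a model, you can pass additional configuration through the config parameter using a RunnableConfig dictionary. This gives you run-time control over execution behavior, callbacks, and metadata tracking.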

Invocation with config

response = model.invoke(
    "Tell me a joke",
    config={
        "run_name": "joke_generation",      # Custom name for this run
        "tags": ["humor", "demo"],          # Tags for categorization
        "metadata": {"user_id": "123"},     # Custom metadata
        "callbacks": [my_callback_handler], # Callback handlers
    }
)

These configuration values are particularly useful when:

Debugging with LangSmith tracing
Implementing custom logging or monitoring
Controlling resource usage in production
Tracking invocations across complex pipelines

Configurable models

You can also create a runtime-configurable model by specifying configurable_fields.
If you do not specify a model value, then model and model_provider are configurable by default.

from langchain.chat_models import init_chat_model

configurable_model = init_chat_model(temperature=0)

configurable_model.invoke(
    "what's your name",
    config={"configurable": {"model": "gpt-5-nano"}},  # Run with GPT-5-Nano
)
configurable_model.invoke(
    "what's your name",
    config={"configurable": {"model": "claude-sonnet-4-5-20250929"}},  # Run with Claude
)

Configurable model with default values

You can create a configurable model with default model values, specify which parameters are configurable, and add a prefix to the configurable parameters:

first_model = init_chat_model(
    model="gpt-4.1-mini",
    temperature=0,
    configurable_fields=("model", "model_provider", "temperature", "max_tokens"),
    config_prefix="first",  # Useful when you have a chain with multiple models
)

first_model.invoke("what's your name")

first_model.invoke(
    "what's your name",
    config={
        "configurable": {
            "first_model": "claude-sonnet-4-5-20250929",
            "first_temperature": 0.5,
            "first_max_tokens": 100,
        }
    },
)
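The prefix convention in this example (first_model, first_temperature, and so on) can be pictured as a simple key mapping. The helper below is a hypothetical illustration of that naming scheme only; resolve_overrides is not a LangChain function, and the second_temperature key is a made-up entry showing that other prefixes are ignored.

```python
def resolve_overrides(configurable: dict, prefix: str, allowed: tuple) -> dict:
    """Pick out keys of the form '<prefix>_<field>' for the allowed fields."""
    marker = f"{prefix}_"
    overrides = {}
    for key, value in configurable.items():
        field = key[len(marker):]
        if key.startswith(marker) and field in allowed:
            overrides[field] = value
    return overrides


defaults = {"model": "gpt-4.1-mini", "temperature": 0}
allowed = ("model", "model_provider", "temperature", "max_tokens")
configurable = {
    "first_model": "claude-sonnet-4-5-20250929",
    "first_temperature": 0.5,
    "first_max_tokens": 100,
    "second_temperature": 0.9,  # belongs to a different model; ignored
}

# Per-call overrides take precedence over the defaults.
effective = {**defaults, **resolve_overrides(configurable, "first", allowed)}
print(effective)
```

Distinct prefixes let several configurable models share one config dictionary without their settings colliding.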

Using a configurable model declaratively

You can call declarative operations such as bind_tools, with_structured_output, and with_configurable on a configurable model, and chain it in the same way as a regularly instantiated chat model object.

from pydantic import BaseModel, Field


class GetWeather(BaseModel):
    """Get the current weather in a given location"""

    location: str = Field(..., description="The city and state, e.g. San Francisco, CA")


class GetPopulation(BaseModel):
    """Get the current population in a given location"""

    location: str = Field(..., description="The city and state, e.g. San Francisco, CA")


model = init_chat_model(temperature=0)
model_with_tools = model.bind_tools([GetWeather, GetPopulation])

model_with_tools.invoke(
    "what's bigger in 2024 LA or NYC", config={"configurable": {"model": "gpt-4.1-mini"}}
).tool_calls

[
    {
        'name': 'GetPopulation',
        'args': {'location': 'Los Angeles, CA'},
        'id': 'call_Ga9m8FAArIyEjItHmztPYA22',
        'type': 'tool_call'
    },
    {
        'name': 'GetPopulation',
        'args': {'location': 'New York, NY'},
        'id': 'call_jh2dEvBaAHRaw5JUDthOs7rt',
        'type': 'tool_call'
    }
]

model_with_tools.invoke(
    "what's bigger in 2024 LA or NYC",
    config={"configurable": {"model": "claude-sonnet-4-5-20250929"}},
).tool_calls

[
    {
        'name': 'GetPopulation',
        'args': {'location': 'Los Angeles, CA'},
        'id': 'toolu_01JMufPf4F4t2zLj7miFeqXp',
        'type': 'tool_call'
    },
    {
        'name': 'GetPopulation',
        'args': {'location': 'New York City, NY'},
        'id': 'toolu_01RQBHcE8kEEbYTuuS8WqY1u',
        'type': 'tool_call'
    }
]
