LLM Evaluation Methods: G-Eval | PureMax77 Dev Note

In the age of generative AI, how do we judge whether a model did its job well?

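Natural language generation (NLG) models for chatbots, text summarization, article writing, and similar tasks have improved remarkably. But how should the quality of the generated text be evaluated? Traditional metrics such as BLEU and ROUGE mainly measure n-gram overlap with a reference text, so they tend to miss important qualities such as contextual naturalness, creativity, and logical flow. In practice, these metrics often correlate poorly with human judgments. Having people rate every output, on the other hand, takes too much time and money.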
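To make the overlap problem concrete, here is a minimal sketch of a unigram-recall score in the spirit of ROUGE-1. It is a simplification, not the official ROUGE implementation, and the helper name `unigram_recall` and the sample sentences are made up for illustration.

```python
from collections import Counter


def unigram_recall(reference: str, candidate: str) -> float:
    """Fraction of reference words that also appear in the candidate (clipped counts)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)


reference = "the council approved the budget"
paraphrase = "lawmakers passed a spending plan"  # same meaning, different words
scrambled = "budget the approved council the"    # same words, broken order

print(unigram_recall(reference, paraphrase))  # 0.0
print(unigram_recall(reference, scrambled))   # 1.0
```

The faithful paraphrase scores 0.0 while the scrambled sentence scores 1.0, which is the kind of mismatch with human judgment described above.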

Against this backdrop, a Microsoft research team proposed G-Eval, an evaluation framework that uses a strong large language model (LLM), GPT-4 in particular, as the evaluator.

What is G-Eval?

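G-Eval is a method proposed in the paper "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment" (Liu et al., 2023). It is a framework that uses an LLM's language understanding and reasoning abilities to assess the quality of output produced by NLG systems. The core idea is to state the evaluation criteria explicitly to the LLM and to guide it through a chain of thoughts (Chain-of-Thoughts, CoT) during evaluation, so that its judgments resemble the way a human would evaluate.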

How G-Eval works: Auto CoT and Form-Filling

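G-Eval does more than ask an LLM "How is this text?". The following elements are central.

Clear instructions (Task Introduction & Evaluation Criteria):
- The LLM is told what kind of text it is evaluating (e.g., "evaluate a summary of a news article") and which criteria to use (e.g., "coherence", "accuracy", "fluency").
- Each criterion comes with a detailed description and a score scale (e.g., 1-5).
Automatic chain-of-thoughts generation (Auto Chain-of-Thoughts, Auto CoT):
- This is one of the most interesting parts of G-Eval. The LLM itself generates the steps an evaluator should think through for the given criteria.
- For example, when evaluating 'coherence', the LLM writes concrete evaluation steps on its own, such as: "1. Read the source text and identify its main points. 2. Read the summary. 3. Check whether the summary connects logically with the content of the source. 4. Assess whether the sentences flow naturally. 5. Assign a score on a 1-5 scale according to the criteria."
- This improves the consistency and reliability of the evaluation and saves people from designing evaluation steps by hand for every task.
Structured output (Form-Filling):
- The LLM is prompted to write its score into a predefined form, which makes the result easy to parse and analyze automatically. The form in the paper asks for scores only; the example later in this post also asks for a short rationale.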
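The three elements above can be sketched without any API call. The following is a minimal illustration, assuming a prompt layout of task introduction, criteria, evaluation steps, inputs, and an empty form. The helper names `build_prompt` and `parse_form`, the criteria wording, and the hand-written `fake_response` are all hypothetical stand-ins, not the paper's exact prompts.

```python
import re


def build_prompt(task_intro, criteria, steps, source, output):
    """Assemble task introduction, criteria, evaluation steps, and an empty form."""
    criteria_block = "\n".join(f"{name} (1-5) - {desc}" for name, desc in criteria.items())
    form_block = "\n".join(f"- {name}:" for name in criteria)
    return (
        f"{task_intro}\n\n"
        f"Evaluation Criteria:\n{criteria_block}\n\n"
        f"Evaluation Steps:\n{steps}\n\n"
        f"Source Text:\n{source}\n\n"
        f"Summary:\n{output}\n\n"
        f"Evaluation Form (scores ONLY):\n{form_block}"
    )


def parse_form(response, criteria):
    """Read one 1-5 score per criterion from a filled-in form; None if missing."""
    scores = {}
    for name in criteria:
        match = re.search(rf"{re.escape(name)}:\s*([1-5])\b", response)
        scores[name] = int(match.group(1)) if match else None
    return scores


criteria = {
    "Coherence": "Is the summary well organized and logically structured?",
    "Fluency": "Is the summary grammatical and easy to read?",
}
prompt = build_prompt(
    "Evaluate a summary of a news article.",
    criteria,
    "1. Read the source text.\n2. Read the summary.\n3. Score each criterion.",
    "The city council approved the new budget after a lengthy debate.",
    "City council passed the budget.",
)
print(prompt)

# A hand-written stand-in for what a model might send back.
fake_response = "- Coherence: 4\n- Fluency: 5"
print(parse_form(fake_response, criteria))  # {'Coherence': 4, 'Fluency': 5}
```

Keeping the form to one line per criterion is what lets a single regular expression per criterion recover the scores; a missing criterion comes back as `None` instead of raising.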

How well G-Eval works: high correlation with human judgments

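According to the paper, G-Eval (especially with GPT-4) correlated much more strongly with human evaluations than the earlier automatic evaluation methods it was compared against. On text summarization, its Spearman correlation with human ratings reached 0.514, well above previous work.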

This suggests that an LLM can go beyond surface word overlap and pick up on what human evaluators care about, such as semantic similarity, logical soundness, and contextual fit.
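For readers unfamiliar with the metric, Spearman correlation measures how well two sets of scores agree on ranking. The following is a minimal pure-Python sketch, assuming average ranks for ties. The helper names and the score lists are made up for illustration and are unrelated to the paper's data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side has no variation
    return cov / (var_x * var_y) ** 0.5


human_scores = [1, 2, 3, 4, 5]   # made-up human ratings
metric_scores = [2, 1, 4, 3, 5]  # made-up automatic scores
print(round(spearman(human_scores, metric_scores), 3))  # 0.8
```

A value of 1.0 means the automatic scores rank the outputs exactly as the human raters do, and values near 0 mean the rankings are unrelated.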

Implications and caveats of G-Eval

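G-Eval matters for NLG evaluation in the following ways.

More accurate automatic evaluation: it provides automatic scores that are closer to human evaluation.
Flexibility and extensibility: it can be applied to new tasks or evaluation criteria fairly easily, just by changing the prompt.
Efficiency: it has the potential to save time and cost compared with human evaluation.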

There are caveats, of course. The results depend on the capability and potential biases of the LLM used as the evaluator (e.g., GPT-4), and they can change with the prompt design.
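One rough way to gauge the prompt sensitivity mentioned above is to score the same output under several prompt wordings and look at the spread. This is a sketch of that idea, not a procedure from the paper. `score_spread` is a hypothetical helper, and `fake_judge` is a stand-in for a real LLM call so the example runs offline.

```python
from statistics import mean, pstdev


def score_spread(score_fn, prompt_variants, summary):
    """Score one summary under several prompt wordings and report the spread."""
    scores = [score_fn(prompt, summary) for prompt in prompt_variants]
    return {"scores": scores, "mean": mean(scores), "stdev": pstdev(scores)}


def fake_judge(prompt, summary):
    """Stand-in for a real LLM call (hypothetical): stricter wording scores lower."""
    return 3 if "strict" in prompt.lower() else 4


variants = [
    "Rate the coherence of the summary from 1 to 5.",
    "Be strict. Rate the coherence of the summary from 1 to 5.",
    "On a 1-5 scale, how coherent is this summary?",
]
result = score_spread(fake_judge, variants, "City council passed the new budget.")
print(result["scores"])  # [4, 3, 4]
```

A large standard deviation across wordings is a hint that the reported score reflects the prompt as much as the text being evaluated.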

G-Eval Python example code

import os
import re
import sys

import openai  # requires openai>=1.0 (module-level openai.chat.completions interface)

# --- OpenAI API setup ---
# Read the API key from the OPENAI_API_KEY environment variable
# instead of hard-coding it in the source file.
openai.api_key = os.getenv("OPENAI_API_KEY")

if not openai.api_key:
    sys.exit("No API key found. Set the OPENAI_API_KEY environment variable.")

# --- OpenAI API call helpers (no error handling) ---

def generate_cot_with_openai(task_introduction, evaluation_criteria, model="gpt-3.5-turbo"):
    """Call the OpenAI API to generate the CoT (Evaluation Steps). No error handling."""
    print(f"--- OpenAI API call ({model}): requesting CoT generation. ---")
    cot_prompt = f"""{task_introduction}

{evaluation_criteria}

Based on the task introduction and evaluation criteria, generate detailed step-by-step evaluation steps (Chain-of-Thoughts) that an evaluator should follow. Output ONLY the evaluation steps, starting with '1.'.

Evaluation Steps:"""

    # API call (assumed to succeed)
    response = openai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are an expert evaluator designer."},
            {"role": "user", "content": cot_prompt}
        ],
        temperature=0.2,
        max_tokens=200
    )
    # message.content can be None, so fall back to an empty string before stripping
    generated_cot = (response.choices[0].message.content or "").strip()
    print(f"\nCoT generated by OpenAI ({model}):\n{generated_cot}\n------------------------------------")
    return generated_cot


def evaluate_with_openai(full_prompt, model="gpt-3.5-turbo"):
    """Send the full prompt to the OpenAI API to run the evaluation. No error handling."""
    print(f"\n--- OpenAI API call ({model}): requesting evaluation. ---")
    # print(f"Full prompt (LLM input):\n{full_prompt}\n")  # uncomment to inspect the prompt

    # API call (assumed to succeed)
    response = openai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are an expert evaluator. Follow instructions precisely and output ONLY in the specified format."},
            {"role": "user", "content": full_prompt}
        ],
        temperature=0.0,
        max_tokens=300
    )
    # message.content can be None, so fall back to an empty string before stripping
    llm_output = (response.choices[0].message.content or "").strip()
    print(f"OpenAI ({model}) response:\n{llm_output}\n------------------------------------")
    return llm_output

# --- G-Eval evaluation logic (no error handling) ---

def g_eval_summarization_coherence(source_text, summary, model="gpt-3.5-turbo"):
    """Evaluate the coherence of a summary with G-Eval (uses OpenAI, no error handling)."""

    task_introduction = "Evaluate a summary based on a news article for its Coherence."
    evaluation_criteria = """Evaluation Criteria:
Coherence (1-5) - How well-organized and logically structured is the summary? Does it flow smoothly? (1=Poor, 5=Excellent)"""

    # Step 1: generate the CoT (assumed to succeed)
    evaluation_steps = generate_cot_with_openai(task_introduction, evaluation_criteria, model=model)

    # Step 2: build the full prompt
    # The form spells out both fields so that the response matches the parser below.
    full_prompt = f"""{task_introduction}

{evaluation_criteria}

Evaluation Steps:
{evaluation_steps}

Source Text:
'''{source_text}'''

Summary to Evaluate:
'''{summary}'''

Fill in the evaluation form below. Output ONLY the completed form.

Evaluation Form:
- Coherence: <integer score from 1 to 5>
Reasoning: <one or two sentences>"""

    # Step 3: run the evaluation (assumed to succeed)
    llm_response = evaluate_with_openai(full_prompt, model=model)

    # Step 4: parse the LLM response
    # Take the last "Coherence: N" in the response; only 1-5 counts as a valid score.
    score_matches = re.findall(r"Coherence:\s*([1-5])", llm_response)
    score = int(score_matches[-1]) if score_matches else None

    # Extract everything after "Reasoning:"; fall back to a placeholder if it is missing.
    reasoning_match = re.search(r"Reasoning:\s*(.*)", llm_response, re.DOTALL | re.IGNORECASE)
    reasoning = reasoning_match.group(1).strip() if reasoning_match else "Reasoning not found"

    return score, reasoning

# --- Example run ---
article = """The city council approved the new budget for the fiscal year after a lengthy debate. Key allocations include increased funding for public transport infrastructure and resources for community green spaces. Opposition members raised concerns about the sustainability of the proposed tax adjustments."""
summary_to_evaluate = """City council passed the new budget, focusing on public transport and parks. Some members questioned the tax plan."""

selected_model = "gpt-3.5-turbo"  # or "gpt-4-turbo-preview", etc.

print(f"\n--- Starting G-Eval Example (Simplified) using OpenAI API ({selected_model}) ---")
coherence_score, coherence_reasoning = g_eval_summarization_coherence(article, summary_to_evaluate, model=selected_model)

# Print the result
print("\n--- Final evaluation result ---")
print("Criterion: Coherence")
print(f"Score (1-5): {coherence_score}")
print(f"Reasoning: {coherence_reasoning}")

Example output (varies with the OpenAI API response)

(The result of the API calls can differ from run to run, and the quality also depends on the model chosen, e.g. gpt-3.5-turbo or gpt-4.)

--- Starting G-Eval Example (Simplified) using OpenAI API (gpt-3.5-turbo) ---
--- OpenAI API call (gpt-3.5-turbo): requesting CoT generation. ---

CoT generated by OpenAI (gpt-3.5-turbo):
1. Read the Source Text to understand the main points and context.
2. Read the Summary to Evaluate.
3. Compare the summary to the source text, checking if the summary logically follows the information presented in the source.
4. Assess the flow between sentences in the summary. Are the transitions smooth and logical?
5. Based on the definition of Coherence (well-organized, logical structure, smooth flow), assign a score from 1 (Poor) to 5 (Excellent).
------------------------------------

--- OpenAI API call (gpt-3.5-turbo): requesting evaluation. ---
OpenAI (gpt-3.5-turbo) response:
- Coherence: 4
Reasoning: The summary correctly identifies the key decisions made by the city council (budget approval, focus areas) and the opposition's concern. The sentences are concise and follow a logical sequence, making it easy to understand the outcome of the council meeting.
------------------------------------

--- Final evaluation result ---
Criterion: Coherence
Score (1-5): 4
Reasoning: The summary correctly identifies the key decisions made by the city council (budget approval, focus areas) and the opposition's concern. The sentences are concise and follow a logical sequence, making it easy to understand the outcome of the council meeting.
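The example code deliberately leaves out error handling. As a minimal sketch of one way to add it, the wrapper below retries a failing call with exponential backoff. `call_with_retry` and `flaky_eval` are hypothetical names, and catching the broad `Exception` is an assumption made for brevity; real code would catch the specific error classes raised by the API client.

```python
import time


def call_with_retry(fn, *args, attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn, retrying with exponential backoff if it raises an exception."""
    for attempt in range(attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky_eval(prompt):
    """Hypothetical stand-in for an API call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "- Coherence: 4"


delays = []  # record the waits instead of actually sleeping
print(call_with_retry(flaky_eval, "prompt text", sleep=delays.append))  # - Coherence: 4
print(delays)  # [1.0, 2.0]
```

In the example above, the evaluation call could be wrapped as `call_with_retry(evaluate_with_openai, full_prompt, model=model)`.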
