Entity Extraction

Definition: the task of automatically identifying **meaningful objects (entities)** in unstructured text and classifying each one by type, for example Person, Company, Product, or Location.

Basic Concepts

What Is an Entity?

Text: "Alice works as an engineer at Google."

Entities:
  - Alice (type: Person)
  - Google (type: Company)
  - engineer (type: Job Title)

Goal of Extraction

Unstructured text → structured data → graph nodes

Input:
  "Alice works as an engineer at Google. Bob is the CEO of Google."

Output:
  [
    {"entity": "Alice", "type": "Person", "role": "engineer"},
    {"entity": "Google", "type": "Company", "industry": "technology"},
    {"entity": "Bob", "type": "Person", "role": "CEO"}
  ]
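The last step of that flow, from structured records to graph nodes, could look roughly like the sketch below. The `Node` class, the `to_nodes` helper, and the `Type:Name` ID scheme are assumptions made for illustration; they do not come from any particular graph library.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node built from one extracted entity (hypothetical shape)."""
    node_id: str
    label: str  # the entity type, e.g. "Person"
    properties: dict = field(default_factory=dict)


def to_nodes(entities):
    """Turn extracted entity dicts into graph nodes keyed by type and name."""
    nodes = []
    for item in entities:
        # Everything except "entity" and "type" is kept as a node property.
        props = {k: v for k, v in item.items() if k not in ("entity", "type")}
        node_id = f"{item['type']}:{item['entity']}"
        nodes.append(Node(node_id=node_id, label=item["type"], properties=props))
    return nodes


extracted = [
    {"entity": "Alice", "type": "Person", "role": "engineer"},
    {"entity": "Google", "type": "Company", "industry": "technology"},
    {"entity": "Bob", "type": "Person", "role": "CEO"},
]

nodes = to_nodes(extracted)
for node in nodes:
    print(node.node_id, node.properties)
```

Prefixing the ID with the type is one simple way to keep a person and a company with the same name apart; a real graph store may have its own ID rules.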

Extraction Methods

1. Rule-based

Approach: regular expressions and pattern matching

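import re

patterns = {
    # Local part, "@", domain, then a TLD of two or more letters
    "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "phone": r"\b\d{3}-\d{4}-\d{4}\b",  # e.g. 010-1234-5678
    "date": r"\b\d{4}-\d{2}-\d{2}\b",   # ISO-style date, e.g. 2026-05-04
}

text = "Alice's email: alice@google.com, date: 2026-05-04"

# Apply every pattern to the text and print the matches per entity type
for entity_type, pattern in patterns.items():
    matches = re.findall(pattern, text)
    print(f"{entity_type}: {matches}")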

Pros: ✅ fast, ✅ deterministic
Cons: ❌ rigid rules, ❌ only works for entities whose surface form varies little (emails, phone numbers, dates)
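Downstream steps usually need to know where each match sits in the text, not just what matched. The following is a minimal sketch of that idea using `re.finditer`; the function name `extract_with_spans` and the output keys are assumptions chosen for this example.

```python
import re

# Patterns of the same kind as above, compiled once for reuse.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def extract_with_spans(text):
    """Return rule-based matches with their type and character offsets."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append({
                "entity": match.group(),
                "type": entity_type,
                "start": match.start(),
                "end": match.end(),
            })
    # Sort by position so results follow the order of the text.
    return sorted(found, key=lambda item: item["start"])


sample = "Contact alice@google.com before 2026-05-04."
spans = extract_with_spans(sample)
for item in spans:
    print(item)
```

Keeping the offsets makes it possible to highlight matches or to link an entity back to the sentence it came from.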

2. Statistical (NER)

Approach: train or reuse a Named Entity Recognition (NER) model

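from transformers import pipeline

# dslim/bert-base-NER is an English NER checkpoint; pick a multilingual
# checkpoint for other languages.
# aggregation_strategy="simple" merges word pieces into whole entities.
nlp = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = "Alice works at Google."
entities = nlp(text)

# Illustrative result, abridged (actual scores vary):
# [
#   {"entity_group": "PER", "word": "Alice", "score": 0.999},
#   {"entity_group": "ORG", "word": "Google", "score": 0.998}
# ]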

Pros: ✅ learns patterns automatically, ✅ makes use of labeled training data
Cons: ❌ requires training data, ❌ performance varies by domain
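NER models commonly label each token with B-/I-/O tags (as in `B-PER` or `B-ORG`), and those token-level tags still have to be grouped into whole entities. The sketch below shows one simple way to do that grouping by hand; `merge_bio_tags` is a hypothetical helper, and it assumes tokens and tags arrive as two parallel lists.

```python
def merge_bio_tags(tokens, tags):
    """Group tokens labeled B-X / I-X into (text, type) entity spans."""
    spans = []
    current = None  # (list of tokens, entity type) for the span being built

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span.
            if current:
                spans.append(current)
            current = ([token], tag[2:])
        elif tag.startswith("I-") and current and current[1] == tag[2:]:
            # An I- tag continues the open span of the same type.
            current[0].append(token)
        else:
            # "O" or a stray I- tag closes any open span.
            if current:
                spans.append(current)
            current = None

    if current:
        spans.append(current)
    return [(" ".join(words), entity_type) for words, entity_type in spans]


tokens = ["Alice", "works", "at", "Google", "DeepMind", "."]
tags = ["B-PER", "O", "O", "B-ORG", "I-ORG", "O"]
merged_spans = merge_bio_tags(tokens, tags)
print(merged_spans)
```

Library pipelines often offer this grouping as a built-in option, so a hand-written version like this is mainly useful when working from raw tag sequences.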

3. LLM-based (Large Language Models)

Approach: instruct the model to extract entities through prompt engineering

import json

from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

text = """
Alice works as an engineer at Google.
Bob is the CEO of Google.
They have worked together since 2020.
"""

prompt = f"""
Find entities of the following types in the text below:
- Person
- Company
- Date

Return only JSON in this format:
[{{"entity": "...", "type": "...", "context": "..."}}]

Text: {text}
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)

# json.loads raises json.JSONDecodeError if the reply is not pure JSON
result = json.loads(response.choices[0].message.content)
print(result)

Output (example; actual model replies may differ):

[
  {"entity": "Alice", "type": "Person", "context": "engineer"},
  {"entity": "Google", "type": "Company", "context": "employer"},
  {"entity": "Bob", "type": "Person", "context": "CEO"},
  {"entity": "2020", "type": "Date", "context": "start date"}
]

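Pros: ✅ high accuracy, ✅ understands complex context, ✅ easy to adapt to new domains
Cons: ❌ API cost, ❌ latency, ❌ specialized domains still need tailored prompts or examples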
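Model replies are not guaranteed to be clean JSON: they may be wrapped in a Markdown code fence or contain entity types that were never requested. The sketch below shows one defensive way to parse such a reply. `parse_entity_reply` and `ALLOWED_TYPES` are names invented for this example, and the sample reply is hand-written rather than real model output.

```python
import json
import re

ALLOWED_TYPES = {"Person", "Company", "Date"}
FENCE = "`" * 3  # Markdown code-fence marker
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_entity_reply(reply):
    """Parse a model reply into a list of entity dicts, dropping bad items."""
    # Use the contents of a code fence if there is one, else the whole reply.
    match = FENCED_BLOCK.search(reply)
    payload = match.group(1) if match else reply
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return []  # the caller can retry or log the raw reply
    if not isinstance(data, list):
        return []
    # Keep only well-formed items whose type was actually requested.
    return [
        item for item in data
        if isinstance(item, dict)
        and isinstance(item.get("entity"), str)
        and isinstance(item.get("type"), str)
        and item["type"] in ALLOWED_TYPES
    ]


# Hand-written stand-in for a model reply, including one unrequested type.
reply = (
    FENCE + "json\n"
    '[{"entity": "Alice", "type": "Person", "context": "engineer"},'
    ' {"entity": "Mars", "type": "Planet", "context": "none"}]\n'
    + FENCE
)
parsed = parse_entity_reply(reply)
print(parsed)
```

Returning an empty list on failure keeps the pipeline running; whether to retry, log, or raise instead is a design choice that depends on how costly a missed document is.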

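Comparison of the Three Methods

Method	Accuracy	Speed	Cost	Complexity	Domain adaptation
Rule-based	Low	Fast	Free	Low	Hard
Statistical	Medium-high	Medium	Low	Medium	Medium
LLM-based	High	Slow	High	High	Easy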

Entity Extraction in GraphRAG

Ontology Definition

EntityTypes:
  - name: Person
    attributes: [name, age, job_title, email]
  - name: Company
    attributes: [name, industry, founded_year]
  - name: Product
    attributes: [name, category, price]
  - name: Location
    attributes: [name, country, coordinates]
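An ontology like this is only useful if extracted entities are checked against it. The following sketch mirrors the four types above as a plain Python dictionary and reports anything that falls outside them. `ONTOLOGY` and `validate_entity` are illustrative names, and the entity shape (`entity`, `type`, `attributes`) is an assumption.

```python
# The ontology above, mirrored as type -> allowed attribute names.
ONTOLOGY = {
    "Person": {"name", "age", "job_title", "email"},
    "Company": {"name", "industry", "founded_year"},
    "Product": {"name", "category", "price"},
    "Location": {"name", "country", "coordinates"},
}


def validate_entity(entity):
    """Return a list of problems; an empty list means the entity conforms."""
    problems = []
    allowed = ONTOLOGY.get(entity.get("type"))
    if allowed is None:
        problems.append(f"unknown type: {entity.get('type')!r}")
        return problems
    # Flag attribute names that the ontology does not define for this type.
    unknown = sorted(set(entity.get("attributes", {})) - allowed)
    if unknown:
        problems.append(f"unknown attributes for {entity['type']}: {unknown}")
    return problems


ok = {"entity": "Alice", "type": "Person", "attributes": {"job_title": "engineer"}}
bad = {"entity": "Alice", "type": "Person", "attributes": {"salary": 1}}
print(validate_entity(ok))
print(validate_entity(bad))
```

Entities that fail validation can be dropped, sent back to the model for correction, or queued for review, depending on how strict the graph needs to be.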

Prompt Design

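import yaml  # PyYAML


def create_extraction_prompt(text, entity_types, attributes):
    """Build an extraction prompt from the ontology's types and attributes."""
    return f"""
    Extract entities from the text below.

    Entity types: {', '.join(entity_types)}

    Attributes for each type:
    {yaml.dump(attributes)}

    Text:
    {text}

    Return JSON that follows this schema:
    {{
      "entities": [
        {{
          "entity": "entity name",
          "type": "type",
          "attributes": {{}},
          "confidence": <number between 0.0 and 1.0>
        }}
      ]
    }}
    """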

Practical Pipeline

Multi-Pass Extraction

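def extract_entities_multi_pass(text, entity_types):
    """
    Pass 1: extract only high-confidence entities
    Pass 2: extract entities that were missed, using the remaining context
    Pass 3: enrich attributes

    llm_extract, find_missing_context, and extract_attributes are helpers
    that you supply.
    """

    # Pass 1: strict threshold
    entities = llm_extract(text, entity_types, confidence_threshold=0.9)

    # Pass 2: looser threshold on the parts of the text not yet covered
    missing_context = find_missing_context(text, entities)
    if missing_context:
        more_entities = llm_extract(missing_context, entity_types, confidence_threshold=0.7)
        entities.extend(more_entities)

    # Pass 3: attribute enrichment
    for entity in entities:
        entity['attributes'] = extract_attributes(entity, text)

    return entities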
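To make the first two passes concrete without calling a real model, the sketch below swaps the LLM for a fixed lookup table. Everything here is an assumption for illustration: `multi_pass`, `fake_extract`, and the scores in `FAKE_SCORES` are made up, the "missing context" heuristic is simply "sentences with no entity found yet", and the attribute-enrichment pass is left out.

```python
def multi_pass(text, extract, high=0.9, low=0.7):
    """Run a strict pass, then a looser pass over sentences with no hits."""
    # Pass 1: keep only high-confidence entities from the whole text.
    entities = [e for e in extract(text) if e["confidence"] >= high]
    found = {e["entity"] for e in entities}

    # Pass 2: revisit only sentences where nothing has been found yet.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    for sentence in sentences:
        if any(name in sentence for name in found):
            continue
        for e in extract(sentence):
            if e["confidence"] >= low and e["entity"] not in found:
                entities.append(e)
                found.add(e["entity"])
    return entities


# Stand-in for an LLM call: a fixed lookup of names and made-up scores.
FAKE_SCORES = {"Alice": 0.95, "Google": 0.97, "Bob": 0.75, "maybe": 0.3}


def fake_extract(text):
    return [
        {"entity": name, "confidence": score}
        for name, score in FAKE_SCORES.items()
        if name in text
    ]


doc = "Alice works at Google. Bob is the CEO. It is maybe raining."
result = multi_pass(doc, fake_extract)
print([e["entity"] for e in result])
```

The sentence-level heuristic is deliberately crude: a sentence that already contains one known entity is skipped even if it hides a second one, so a real pipeline would need a finer notion of uncovered context.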

Deduplication and Merging

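def normalize(name):
    """Trim and case-fold a name so that "Alice" and "alice" share a key."""
    return name.strip().casefold()


def merge_duplicate_entities(entities):
    """Merge multiple mentions of the same entity."""
    merged = {}

    for entity in entities:
        key = normalize(entity['entity'])  # "Alice" == "alice"
        attributes = entity.get('attributes', {})

        if key in merged:
            # Merge attributes; later mentions overwrite earlier values
            merged[key]['attributes'].update(attributes)
        else:
            # Copy so the input dicts are not mutated
            merged[key] = {**entity, 'attributes': dict(attributes)}

    return list(merged.values())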
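Merging on the name alone can conflate different things that share a name, such as a company and a product. One possible variant, sketched below, keys on the name and the type together. This is an alternative offered for illustration rather than the approach described above, and `merge_by_name_and_type` is a hypothetical name.

```python
def merge_by_name_and_type(entities):
    """Merge mentions that share a case-folded name and the same type."""
    merged = {}
    for entity in entities:
        key = (entity["entity"].strip().casefold(), entity["type"])
        attributes = entity.get("attributes", {})
        if key not in merged:
            # Copy so the caller's dicts are not mutated.
            merged[key] = {**entity, "attributes": dict(attributes)}
        else:
            merged[key]["attributes"].update(attributes)
    return list(merged.values())


mentions = [
    {"entity": "Apple", "type": "Company", "attributes": {"industry": "technology"}},
    {"entity": "apple", "type": "Company", "attributes": {"founded_year": 1976}},
    {"entity": "apple", "type": "Product", "attributes": {"category": "fruit"}},
]
merged = merge_by_name_and_type(mentions)
for entity in merged:
    print(entity["entity"], entity["type"], entity["attributes"])
```

The first spelling seen is kept as the display name; aliases such as abbreviations or translated names would need a separate alias table, which this sketch does not attempt.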

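Performance Metrics

Metric	Description	Formula
Precision	Share of extracted entities that are correct	TP / (TP + FP)
Recall	Share of existing entities that were found	TP / (TP + FN)
F1-Score	Harmonic mean of precision and recall	2 × (P × R) / (P + R)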

# Gold standard (ground truth)
gold_standard = {"Alice", "Google", "2020"}

# Extraction output
predicted = {"Alice", "Google", "2020", "Bob"}  # one extra entity

tp = len(gold_standard & predicted)                 # true positives: 3
precision = tp / len(predicted)                     # 3 / 4 = 0.75
recall = tp / len(gold_standard)                    # 3 / 3 = 1.0
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.857
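The same arithmetic can be wrapped in a reusable function. The sketch below scores (name, type) pairs, so an entity with the right name but the wrong type counts as an error, and it guards against division by zero. The function name `evaluate` and the choice to score pairs are assumptions made for this example.

```python
def evaluate(gold, predicted):
    """Compute precision, recall, and F1 over sets of extracted entities."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # true positives: items present in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = {("Alice", "Person"), ("Google", "Company"), ("2020", "Date")}
pred = {
    ("Alice", "Person"),
    ("Google", "Person"),  # right name, wrong type: counts as an error
    ("2020", "Date"),
    ("Bob", "Person"),     # not in the gold set
}

p, r, f1 = evaluate(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Here two of the four predictions are correct and two of the three gold entities are found, so precision is 2/4, recall is 2/3, and F1 is 4/7.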

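Optimization Tips

Prompt optimization: include in-context examples
Ontology refinement: define entity types precisely
Multi-pass extraction: raise confidence with several passes
Domain data: improve performance by fine-tuning
Post-processing: deduplication and attribute enrichment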
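As an illustration of the first tip, the sketch below assembles a prompt that shows one worked example before the real input. The example sentence, the `FEW_SHOT_EXAMPLES` list, and the `build_few_shot_prompt` helper are all invented for this sketch; real examples should come from the target domain.

```python
import json

# Invented demonstration pair; replace with examples from your own domain.
FEW_SHOT_EXAMPLES = [
    {
        "text": "Carol joined Acme in 2019.",
        "entities": [
            {"entity": "Carol", "type": "Person"},
            {"entity": "Acme", "type": "Company"},
            {"entity": "2019", "type": "Date"},
        ],
    },
]


def build_few_shot_prompt(text, examples=FEW_SHOT_EXAMPLES):
    """Assemble a prompt that shows worked examples before the real input."""
    parts = ["Extract Person, Company, and Date entities as a JSON list."]
    for example in examples:
        parts.append(f"Text: {example['text']}")
        parts.append(f"Entities: {json.dumps(example['entities'])}")
    # The real input comes last, with the answer slot left open.
    parts.append(f"Text: {text}")
    parts.append("Entities:")
    return "\n".join(parts)


few_shot_prompt = build_few_shot_prompt("Alice works at Google.")
print(few_shot_prompt)
```

Ending the prompt with an open `Entities:` slot nudges the model to answer in the same format as the demonstrations.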
