Relationship Extraction

Definition: the task of automatically identifying **meaningful relationships between entities** extracted from text. Examples: WORKS_AT, MANAGES, OWNS.

Basic Concepts

What Is a Relationship?

Text: "Alice works as an engineer at Google. Bob manages Alice."

Entities:
  - Alice (Person)
  - Google (Company)
  - Bob (Person)

Relationships:
  - Alice --WORKS_AT--> Google
  - Bob --MANAGES--> Alice

Goal of Extraction

Relationships stated in unstructured text → structured relationship triples

Input:
  Entities: [Alice, Bob, Google]
  Text: "Alice works as an engineer at Google. Bob manages Alice."

Output:
  [
    {"source": "Alice", "target": "Google", "relation": "WORKS_AT", "attributes": {"role": "engineer"}},
    {"source": "Bob", "target": "Alice", "relation": "MANAGES", "attributes": {"title": "manager"}}
  ]
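
The output records above can be modelled with a small data structure. The following is a minimal sketch, not a prescribed schema: the `Relationship` class and its `as_triple` method are hypothetical names introduced here for illustration.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Relationship:
    """One (source, relation, target) triple plus optional attributes."""
    source: str
    target: str
    relation: str
    attributes: dict = field(default_factory=dict)

    def as_triple(self) -> tuple:
        # Conventional subject-predicate-object ordering
        return (self.source, self.relation, self.target)


rels = [
    Relationship("Alice", "Google", "WORKS_AT", {"role": "engineer"}),
    Relationship("Bob", "Alice", "MANAGES", {"title": "manager"}),
]

# Convert to plain dicts with the same keys as the JSON output shown above
records = [asdict(r) for r in rels]
```

Keeping the attributes in a separate dictionary leaves the triple itself usable as a deduplication key.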

Extraction Approaches

1. Rule-based

Approach: grammatical patterns and dependency parsing

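import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "Alice works at Google."
doc = nlp(text)

for token in doc:
    # Find the subject (nsubj) of the verb "work"
    if token.dep_ == "nsubj" and token.head.lemma_ == "work":
        source = token.text  # Alice
        # Follow verb -> preposition "at" -> its object to find the target
        for prep in token.head.children:
            if prep.dep_ == "prep" and prep.lower_ == "at":
                for obj in prep.children:
                    if obj.dep_ == "pobj":
                        target = obj.text  # Google
                        print(f"{source} --WORKS_AT--> {target}")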

Pros: ✅ fast, ✅ transparent
Cons: ❌ rigid patterns, ❌ depends on grammar
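
The trade-off of rule-based extraction can be seen without a parser at all. The sketch below uses plain regular expressions as the "rules"; the patterns and the `extract_by_rules` helper are hypothetical and only cover the example sentences in this note.

```python
import re

# A capitalised word stands in for an entity mention (a simplifying assumption)
NAME = r"[A-Z]\w+"

# Each rule pairs a surface pattern with the relation type it signals
PATTERNS = [
    (re.compile(rf"(?P<source>{NAME}) works (?:as an? \w+ )?at (?P<target>{NAME})"), "WORKS_AT"),
    (re.compile(rf"(?P<source>{NAME}) manages (?P<target>{NAME})"), "MANAGES"),
]


def extract_by_rules(text):
    """Return (source, relation, target) triples for every rule that matches."""
    found = []
    for pattern, relation in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.group("source"), relation, match.group("target")))
    return found


triples = extract_by_rules("Alice works as an engineer at Google. Bob manages Alice.")
# A paraphrase that no rule anticipates yields nothing
missed = extract_by_rules("Alice is employed by Google.")
```

The second call returns an empty list, which is the rigidity noted above: every new phrasing needs its own rule.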

2. Statistical (Relation Classification)

Approach: a pre-trained relation classification model

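import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Checkpoint name as given in this note; verify that it exists on the
# Hugging Face Hub or substitute your own fine-tuned relation classifier
model_name = "microsoft/deberta-base-finetuned-relation-extraction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Alice works at Google"
entity1 = "Alice"
entity2 = "Google"

# The tokenizer takes at most two segments (text, text_pair), so the entity
# pair goes into the second one. The exact input format must match the one
# used when the model was fine-tuned.
pair = f"{entity1} {tokenizer.sep_token} {entity2}"
inputs = tokenizer(text, pair, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(-1).item()

# Read the label from the model config instead of hard-coding the class order
relation = model.config.id2label[predicted_class]
print(f"{entity1} --{relation}--> {entity2}")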

Pros: ✅ learned automatically, ✅ strong performance
Cons: ❌ requires training data, ❌ performance varies by domain
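
A relation classifier has to be told which two entities it is classifying. One common convention is to wrap each mention in marker tokens before tokenization. The sketch below assumes that convention; the `[E1]`/`[E2]` markers and the `mark_entities` helper are illustrative and only help if the model was fine-tuned with the same markers.

```python
def mark_entities(text, entity1, entity2):
    """Wrap the first mention of each entity in [E1]/[E2] marker tokens."""
    start1 = text.find(entity1)
    start2 = text.find(entity2)
    if start1 == -1 or start2 == -1:
        raise ValueError("both entities must appear in the text")

    # Insert from right to left so earlier offsets stay valid
    spans = sorted([(start1, entity1, "E1"), (start2, entity2, "E2")], reverse=True)
    for start, mention, tag in spans:
        end = start + len(mention)
        text = f"{text[:start]}[{tag}] {mention} [/{tag}]{text[end:]}"
    return text


marked = mark_entities("Alice works at Google", "Alice", "Google")
swapped = mark_entities("Bob manages Alice", "Alice", "Bob")
```

The sketch does not handle overlapping or repeated mentions. A real pipeline would take character offsets from the entity extraction step instead of searching for the string.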

3. LLM-based (Large Language Models)

Approach: extract the relationship types specified in a prompt

import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

entities = [
    {"entity": "Alice", "type": "Person"},
    {"entity": "Google", "type": "Company"},
    {"entity": "Bob", "type": "Person"}
]

text = """
Alice works as an engineer at Google.
Bob manages Alice.
Alice is Bob's direct report.
"""

prompt = f"""
Find the relationships between the following entities:

Entities: {[e['entity'] for e in entities]}

Relationship types:
- WORKS_AT: who works where
- MANAGES: who manages whom
- REPORTS_TO: who reports to whom

Text:
{text}

Return the result as JSON:
[
  {{
    "source": "entity1",
    "target": "entity2",
    "relation": "relationship type",
    "confidence": 0.0~1.0,
    "evidence": "supporting sentence"
  }}
]
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)

# Assumes the reply is bare JSON; json.loads raises if the model adds extra text
relationships = json.loads(response.choices[0].message.content)

Output:

[
  {
    "source": "Alice",
    "target": "Google",
    "relation": "WORKS_AT",
    "confidence": 0.95,
    "evidence": "Alice works as an engineer at Google"
  },
  {
    "source": "Bob",
    "target": "Alice",
    "relation": "MANAGES",
    "confidence": 0.93,
    "evidence": "Bob manages Alice"
  },
  {
    "source": "Alice",
    "target": "Bob",
    "relation": "REPORTS_TO",
    "confidence": 0.92,
    "evidence": "Alice is Bob's direct report"
  }
]

Pros: ✅ high accuracy, ✅ context understanding, ✅ flexible
Cons: ❌ API cost, ❌ latency
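
LLM replies are not guaranteed to be bare, well-formed JSON, so it is worth parsing them defensively. The following is a minimal sketch, assuming the reply may be wrapped in a Markdown code fence; `parse_relationships` and the two constants are hypothetical names, not part of any library.

```python
import json

ALLOWED_RELATIONS = {"WORKS_AT", "MANAGES", "REPORTS_TO"}
REQUIRED_KEYS = {"source", "target", "relation"}
FENCE = "`" * 3  # Markdown code-fence delimiter


def parse_relationships(raw):
    """Parse an LLM reply into a list of well-formed relationship dicts."""
    cleaned = raw.strip()
    # Drop a surrounding code fence (and its language tag) if the model added one
    if cleaned.startswith(FENCE):
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []

    valid = []
    for item in data:
        # Keep only objects that carry every required key
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            continue
        # Reject relation types that the prompt did not ask for
        relation = item["relation"]
        if not isinstance(relation, str) or relation not in ALLOWED_RELATIONS:
            continue
        valid.append(item)
    return valid


reply = (
    FENCE + "json\n"
    '[{"source": "Alice", "target": "Google", "relation": "WORKS_AT", "confidence": 0.95},'
    ' {"source": "Alice", "relation": "LIKES"}]\n' + FENCE
)
parsed = parse_relationships(reply)
```

Here the second object is dropped because it has no `target`, and an unparseable reply yields an empty list instead of an exception.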

Comparison of the Three Approaches

Approach	Accuracy	Speed	Cost	Adaptability
Rule-based	Low	Fast	Free	Low
Statistical	Medium-high	Medium	Low	Medium
LLM-based	High	Slow	High	High

Relationship Extraction in GraphRAG

Ontology Definition

RelationshipTypes:
  - name: WORKS_AT
    from: [Person]
    to: [Company]
    properties: [since, role, department]

  - name: MANAGES
    from: [Person]
    to: [Person]
    properties: [since, title]

  - name: OWNS
    from: [Person, Company]
    to: [Company, Product]
    properties: [stake, start_date]

  - name: MANUFACTURES
    from: [Company]
    to: [Product]
    properties: [start_year, quantity]
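
In code, the ontology above could be held as a lookup table keyed by relationship name. This is a sketch under assumptions: the `RelationType` class, its `allows` method, and the `ONTOLOGY` dictionary are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationType:
    name: str
    source_types: frozenset  # entity types allowed on the "from" side
    target_types: frozenset  # entity types allowed on the "to" side
    properties: tuple = ()

    def allows(self, source_type: str, target_type: str) -> bool:
        """True if a relationship between these entity types fits the definition."""
        return source_type in self.source_types and target_type in self.target_types


def _rt(name, sources, targets, properties):
    return RelationType(name, frozenset(sources), frozenset(targets), tuple(properties))


ONTOLOGY = {
    rt.name: rt
    for rt in (
        _rt("WORKS_AT", ["Person"], ["Company"], ["since", "role", "department"]),
        _rt("MANAGES", ["Person"], ["Person"], ["since", "title"]),
        _rt("OWNS", ["Person", "Company"], ["Company", "Product"], ["stake", "start_date"]),
        _rt("MANUFACTURES", ["Company"], ["Product"], ["start_year", "quantity"]),
    )
}
```

With this shape, `ONTOLOGY["MANAGES"].allows("Person", "Company")` is `False`, which is the kind of check that rejects a candidate such as "Alice MANAGES Google".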

Multi-stage Extraction Strategy

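def extract_relationships_with_validation(text, entities, rel_types):
    """
    Stage 1: extract relationship candidates
    Stage 2: validate entity types against the relationship definition
    Stage 3: filter by confidence
    """

    # Stage 1: extraction (llm_extract_relationships is assumed to be defined elsewhere)
    candidates = llm_extract_relationships(text, entities, rel_types)

    # Stage 2: type-matching validation
    validated = []
    for rel in candidates:
        # find_entity is assumed to return None when no entity matches
        source_entity = find_entity(entities, rel['source'])
        target_entity = find_entity(entities, rel['target'])
        rel_type_def = rel_types.get(rel['relation'])

        # Skip candidates that name an unknown entity or relationship type
        if source_entity is None or target_entity is None or rel_type_def is None:
            continue

        # Check that the entity types match the relationship definition
        if (source_entity['type'] in rel_type_def['from'] and
            target_entity['type'] in rel_type_def['to']):
            validated.append(rel)

    # Stage 3: confidence filtering
    high_confidence = [r for r in validated if r['confidence'] > 0.8]

    return high_confidence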
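
To see what each stage rejects, the pipeline can be run on canned candidates in place of a real LLM call. The sketch below is illustrative only: `find_entity`, `fake_llm_extract`, and `validate` are stand-ins written for this example, and the candidate list is made up to exercise each filter.

```python
REL_TYPES = {
    "WORKS_AT": {"from": ["Person"], "to": ["Company"]},
    "MANAGES": {"from": ["Person"], "to": ["Person"]},
}

ENTITIES = [
    {"entity": "Alice", "type": "Person"},
    {"entity": "Bob", "type": "Person"},
    {"entity": "Google", "type": "Company"},
]


def find_entity(entities, name):
    """Return the entity record with this name, or None if there is none."""
    return next((e for e in entities if e["entity"] == name), None)


def fake_llm_extract(text, entities, rel_types):
    """Stand-in for the LLM call: returns fixed candidates for the demo."""
    return [
        {"source": "Alice", "target": "Google", "relation": "WORKS_AT", "confidence": 0.95},
        {"source": "Google", "target": "Alice", "relation": "MANAGES", "confidence": 0.90},  # wrong types
        {"source": "Bob", "target": "Alice", "relation": "MANAGES", "confidence": 0.60},  # low confidence
        {"source": "Carol", "target": "Google", "relation": "WORKS_AT", "confidence": 0.99},  # unknown entity
    ]


def validate(candidates, entities, rel_types, threshold=0.8):
    """Apply the type-matching and confidence stages to a candidate list."""
    kept = []
    for rel in candidates:
        src = find_entity(entities, rel["source"])
        tgt = find_entity(entities, rel["target"])
        rel_def = rel_types.get(rel["relation"])
        if src is None or tgt is None or rel_def is None:
            continue  # unknown entity or relationship type
        if src["type"] not in rel_def["from"] or tgt["type"] not in rel_def["to"]:
            continue  # entity types do not fit the definition
        if rel["confidence"] > threshold:
            kept.append(rel)
    return kept


result = validate(fake_llm_extract("", ENTITIES, REL_TYPES), ENTITIES, REL_TYPES)
```

Only the Alice WORKS_AT Google candidate survives. The other three fail on entity types, confidence, and entity lookup respectively.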

Relationship Attribute Extraction

Attribute Definitions per Relationship

relationship_attributes = {
    "WORKS_AT": {
        "since": "when the person started working there",
        "role": "job title or role",
        "department": "department"
    },
    "MANAGES": {
        "since": "when the person started managing",
        "title": "rank or position"
    }
}

Attribute Extraction Prompt

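def extract_relationship_attributes(text, source, target, rel_type, expected_attrs):
    prompt = f"""
    Extract the attributes of the {rel_type} relationship from the following sentence:

    Subject: {source}
    Object: {target}
    Relationship: {rel_type}

    Attributes to extract: {expected_attrs}

    Sentence:
    {text}

    Return the result as JSON:
    {{
      "attributes": {{...}}
    }}
    """
    # llm is assumed to be an already-configured client exposing invoke(prompt)
    return llm.invoke(prompt)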
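
The model may return attributes that were never requested, or empty placeholders for attributes the sentence does not mention. A small post-processing step can keep only the expected, non-empty ones. This is a sketch: `clean_attributes` is a hypothetical helper, and it assumes the reply is a JSON string in the shape requested by the prompt.

```python
import json


def clean_attributes(raw_reply, expected_attrs):
    """Keep only expected, non-empty attributes from the model's JSON reply."""
    try:
        payload = json.loads(raw_reply)
    except json.JSONDecodeError:
        return {}
    attrs = payload.get("attributes", {}) if isinstance(payload, dict) else {}
    if not isinstance(attrs, dict):
        return {}
    return {
        key: value
        for key, value in attrs.items()
        if key in expected_attrs and value not in (None, "")
    }


reply = '{"attributes": {"role": "engineer", "since": null, "salary": "unknown"}}'
attrs = clean_attributes(reply, {"since", "role", "department"})
```

In this example only `role` survives: `since` is null and `salary` is not among the expected attributes for WORKS_AT.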

Deduplication and Merging

Merging Multiple Mentions of the Same Relationship

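def merge_duplicate_relationships(relationships):
    """Merge multiple mentions of the same relationship into one record."""
    merged = {}

    for rel in relationships:
        key = (rel['source'], rel['target'], rel['relation'])

        if key not in merged:
            # Copy so the caller's input is not mutated
            merged[key] = {**rel, 'attributes': dict(rel.get('attributes', {}))}
            continue

        kept = merged[key]
        attrs = kept['attributes']
        if rel.get('confidence', 0) > kept.get('confidence', 0):
            # The higher-confidence mention wins, also on attribute conflicts
            attrs = {**attrs, **rel.get('attributes', {})}
            kept = dict(rel)
        else:
            # Only fill in attributes the kept mention lacks
            attrs = {**rel.get('attributes', {}), **attrs}
        kept['attributes'] = attrs
        merged[key] = kept

    return list(merged.values())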

Performance Evaluation

Worked Example

Gold standard (ground truth):
  [
    {"source": "Alice", "target": "Google", "relation": "WORKS_AT"},
    {"source": "Bob", "target": "Alice", "relation": "MANAGES"}
  ]

Extraction result:
  [
    {"source": "Alice", "target": "Google", "relation": "WORKS_AT"},  ✓ TP
    {"source": "Bob", "target": "Alice", "relation": "MANAGES"},      ✓ TP
    {"source": "Alice", "target": "Bob", "relation": "REPORTS_TO"}    ✗ FP (inverse relationship, absent from the gold standard)
  ]

Precision: 2 / 3 = 0.667
Recall: 2 / 2 = 1.0
F1-Score: 2 × (0.667 × 1.0) / (0.667 + 1.0) = 0.8
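
The same figures can be computed by comparing triples as sets. This is a minimal sketch with exact matching on (source, relation, target); the `prf1` function name is an assumption made for illustration.

```python
def prf1(gold, predicted):
    """Return (precision, recall, f1) for exact-match relationship triples."""
    gold_set = {(r["source"], r["relation"], r["target"]) for r in gold}
    pred_set = {(r["source"], r["relation"], r["target"]) for r in predicted}

    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [
    {"source": "Alice", "target": "Google", "relation": "WORKS_AT"},
    {"source": "Bob", "target": "Alice", "relation": "MANAGES"},
]
predicted = gold + [{"source": "Alice", "target": "Bob", "relation": "REPORTS_TO"}]

precision, recall, f1 = prf1(gold, predicted)
```

Exact matching is strict: a prediction that is semantically correct but expressed as the inverse relation still counts as a false positive, as in the example above.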

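Optimization Tips

Clarify the ontology: define each relationship type precisely
Validate entities: make sure entities are accurate before extracting relationships
Context length: include enough surrounding context
Handle directionality: MANAGES ≠ REPORTS_TO
Reciprocal relationships: deduplicate pairs such as OWNS ↔ OWNED_BY
Multi-stage validation: type matching and confidence filtering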
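
The directionality and reciprocal-relationship tips can be handled together by rewriting each inverse relation into one canonical direction before deduplicating. The sketch below assumes that REPORTS_TO is treated as the inverse of MANAGES and OWNED_BY as the inverse of OWNS; the `INVERSE_OF` map and both helpers are hypothetical.

```python
# Assumed inverse pairs: each key is rewritten into its canonical counterpart
INVERSE_OF = {"REPORTS_TO": "MANAGES", "OWNED_BY": "OWNS"}


def canonicalize(rel):
    """Flip an inverse relation so that it points in the canonical direction."""
    canonical = INVERSE_OF.get(rel["relation"])
    if canonical is None:
        return dict(rel)
    return {**rel, "source": rel["target"], "target": rel["source"], "relation": canonical}


def dedupe_reciprocal(rels):
    """Canonicalize every relationship, then keep the first mention of each triple."""
    seen = {}
    for rel in map(canonicalize, rels):
        key = (rel["source"], rel["relation"], rel["target"])
        seen.setdefault(key, rel)
    return list(seen.values())


rels = [
    {"source": "Bob", "target": "Alice", "relation": "MANAGES"},
    {"source": "Alice", "target": "Bob", "relation": "REPORTS_TO"},
]
unique = dedupe_reciprocal(rels)
```

Whether two relation types are true inverses is an ontology decision. If REPORTS_TO carries meaning that MANAGES does not, the pair should stay separate.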
