Definition

Research Infrastructure for AI is the technical foundation of data and systems, built and organized so that AI agents can participate effectively in scientific research.

The Current Reality: Labs Are Not Ready

Typical Lab Data Setup ❌

Ask a researcher "Show me your database" → what you get:
├─ Folder of Excel files
│  ├─ randomly merged cells
│  ├─ inconsistent column names
│  ├─ special characters in filenames (#, $, %, etc.)
│  ├─ dates in 10 different formats
│  └─ handwritten notes in margins
├─ PDF papers with embedded images
│  └─ Hard for AI to extract data reliably
├─ PowerPoint presentations with figures
│  └─ Data not in machine-readable format
└─ Scattered lab notebooks (physical & digital)
   └─ Inconsistent recording standards

Why This Fails for AI ❌

AI reads these formats and sees:
├─ "What is this merged cell? Data? Header?"
├─ "Column 'exp_data' vs 'ExperimentData' vs 'ED' — are they the same?"
├─ "SpecialChar$#%.txt — valid filename?"
├─ "Date format: MM/DD/YY or DD/MM/YY?"
└─ → Complete incomprehension

Human (reading same data):
├─ Knows from context what things mean
├─ Understands implied structure
├─ Can infer missing information
└─ → Natural comprehension
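
The date question above is easy to reproduce. The following is a minimal sketch using only the Python standard library; the sample string `03/04/25` is a made-up cell value, not data from any real lab.

```python
from datetime import datetime

raw = "03/04/25"  # hypothetical cell value copied from a spreadsheet

# The same string is valid under both conventions, so a parser cannot
# tell which one the author intended without outside context.
as_month_first = datetime.strptime(raw, "%m/%d/%y").date()
as_day_first = datetime.strptime(raw, "%d/%m/%y").date()

print(as_month_first.isoformat())  # 2025-03-04
print(as_day_first.isoformat())    # 2025-04-03
```

Both parses succeed without error, which is the real problem: nothing signals that a choice was made.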

The Metaphor: Learning a New Language

Imagine your brilliant new colleague speaks **Slovak only**.

To work with this colleague:
├─ You must learn Slovak (or hire a translator)
├─ Your team must learn Slovak
├─ All meetings must be in Slovak
├─ Documents must be in Slovak
└─ Massive infrastructure overhaul required

Similarly, to work with AI:
├─ Your lab must speak "AI-readable language"
├─ All data must be in machine-readable format
├─ Databases must be AI-comprehensible
├─ Huge investment in standardization
└─ → This is what [[wiki/concepts/Research-Infrastructure-for-AI]] means

What AI-Ready Infrastructure Looks Like

1. Structured Data Format ✅

Instead of: Excel file with merged cells
Use: 
├─ CSV/JSON with strict schema
├─ Clear column definitions
├─ Consistent data types (INT, FLOAT, DATE, STRING)
├─ No merged cells or special formatting
└─ Machine-parseable (every field can be read without guessing)
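
One way to picture a "strict schema" is a loader that refuses anything it cannot type. This is a sketch under stated assumptions: the column names in `SCHEMA` and the helper `load_strict` are hypothetical, chosen only for illustration.

```python
import csv
import io
from datetime import date

# Hypothetical schema: column name -> converter that raises on bad input.
SCHEMA = {
    "sample_id": str,
    "temperature_c": float,
    "replicate": int,
    "measured_on": date.fromisoformat,
}

def load_strict(text):
    """Parse CSV text, rejecting unexpected headers or unparseable values."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != list(SCHEMA):
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            rows.append({name: convert(row[name]) for name, convert in SCHEMA.items()})
        except (ValueError, TypeError) as exc:
            raise ValueError(f"line {line_no}: {exc}") from exc
    return rows

good = "sample_id,temperature_c,replicate,measured_on\nS1,21.5,1,2025-05-04\n"
rows = load_strict(good)
```

A row whose date is written as `05/04/25` would be rejected with a line number instead of being silently misread.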

2. Standardized Naming Conventions ✅

Instead of: "exp_data.xlsx", "ExperimentData.csv", "ED#2025.txt"
Use:
├─ Consistent naming: "experiment_data_2025-05-04.csv"
├─ Clear semantic meaning
├─ Parseable date formats (ISO 8601: YYYY-MM-DD)
├─ No special characters (only: letters, numbers, _, -)
└─ Machine + Human readable
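
A naming convention is only useful if it can be checked automatically. The sketch below assumes one possible convention (lowercase words joined by `_`, then a hyphenated `YYYY-MM-DD` date, then `.csv` or `.json`); the pattern `NAME_RE` and the helper `parse_name` are hypothetical.

```python
import re
from datetime import date

# Assumed convention: label words joined by "_", then "_YYYY-MM-DD", then an extension.
NAME_RE = re.compile(r"([a-z0-9]+(?:_[a-z0-9]+)*)_(\d{4}-\d{2}-\d{2})\.(csv|json)")

def parse_name(filename):
    """Return (label, date) if the filename follows the convention, else None."""
    match = NAME_RE.fullmatch(filename)
    if match is None:
        return None
    try:
        return match.group(1), date.fromisoformat(match.group(2))
    except ValueError:  # digits in the right shape but not a real date, e.g. month 13
        return None

print(parse_name("experiment_data_2025-05-04.csv"))
print(parse_name("ED#2025.txt"))
```

The first call yields the label and a real `date` object; the second yields `None`, so the file can be flagged at upload time.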

3. Centralized Database ✅

Instead of: Scattered Excel files across shared drive
Use:
├─ Centralized Database (PostgreSQL, MongoDB, etc)
├─ Single source of truth
├─ Access control & audit logs
├─ Backup & recovery
├─ API for programmatic access (AI can query directly)
└─ Real-time data synchronization
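
To illustrate what "AI can query directly" might look like, here is a sketch that uses SQLite from the Python standard library as a stand-in for a server database such as PostgreSQL. The table layout, the sample rows, and the helper `mean_value` are assumptions made up for this example.

```python
import sqlite3

# SQLite (standard library) stands in for a server database such as PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE measurement (
        id            INTEGER PRIMARY KEY,
        experiment_id TEXT NOT NULL,
        quantity      TEXT NOT NULL,
        value         REAL NOT NULL,
        unit          TEXT NOT NULL,
        measured_at   TEXT NOT NULL  -- ISO 8601 timestamp
    )
    """
)
conn.executemany(
    "INSERT INTO measurement (experiment_id, quantity, value, unit, measured_at) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("EXP-001", "temperature", 21.5, "degC", "2025-05-04T09:00:00"),
        ("EXP-001", "temperature", 22.1, "degC", "2025-05-04T10:00:00"),
        ("EXP-002", "temperature", 19.8, "degC", "2025-05-05T09:00:00"),
    ],
)

def mean_value(experiment_id, quantity):
    """Parameterized query that any program or agent could issue."""
    row = conn.execute(
        "SELECT AVG(value) FROM measurement WHERE experiment_id = ? AND quantity = ?",
        (experiment_id, quantity),
    ).fetchone()
    return row[0]  # None when no rows match

print(mean_value("EXP-001", "temperature"))
```

Because every row is forced to carry a unit and a timestamp, the query needs no guesswork about what a number means.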

4. Ontology-Based Schema ✅

Beyond just tables:
├─ Define concepts: Experiment, Measurement, Parameter
├─ Define relationships: Experiment → Measurement → Parameter
├─ Enforce consistency: All measurements must have units, timestamps, source
├─ Enable reasoning: AI can infer implications
└─ [[wiki/concepts/Ontology]] structures make data AI-comprehensible
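
The concepts and relationships listed above can be sketched as plain Python dataclasses. This is an illustration, not a full ontology language: the field names are hypothetical, and the consistency rule is enforced with a simple runtime check.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass(frozen=True)
class Parameter:
    name: str
    unit: str

@dataclass(frozen=True)
class Measurement:
    parameter: Parameter
    value: float
    timestamp: datetime  # required argument, so it cannot be omitted
    source: str          # instrument or person that produced the value

    def __post_init__(self):
        # Enforce the rule that every measurement carries a unit and a source.
        if not self.parameter.unit.strip():
            raise ValueError("measurement parameter must have a unit")
        if not self.source.strip():
            raise ValueError("measurement must name its source")

@dataclass
class Experiment:
    experiment_id: str
    measurements: list = field(default_factory=list)

    def add(self, measurement):
        self.measurements.append(measurement)

    def parameters(self):
        """Follow Experiment -> Measurement -> Parameter and return parameter names."""
        return sorted({m.parameter.name for m in self.measurements})

temperature = Parameter("temperature", "degC")
exp = Experiment("EXP-001")
exp.add(Measurement(temperature, 21.5, datetime(2025, 5, 4, 9, 0), "thermocouple-A"))
print(exp.parameters())  # ['temperature']
```

Constructing a `Measurement` with an empty unit raises `ValueError`, so incomplete records never enter the dataset.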

5. Machine-Readable Literature ✅

Instead of: PDF papers, PowerPoint presentations
Use:
├─ Semantic markup (RDF, JSON-LD)
├─ Structured metadata
├─ Linked references (each citation machine-readable)
├─ Extracted figures as data (not images)
└─ AI can analyze a large corpus (on the order of 10,000 papers) in minutes rather than months
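
As a sketch of what structured metadata with linked references could look like, the function below builds a JSON-LD record using the schema.org `ScholarlyArticle` type. The helper `article_jsonld`, the title, the author name, and the DOIs are made-up placeholders, not real publications.

```python
import json

def article_jsonld(title, authors, published, doi, cited_dois):
    """Build a schema.org ScholarlyArticle record as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "ScholarlyArticle",
        "name": title,
        "author": [{"@type": "Person", "name": a} for a in authors],
        "datePublished": published,
        "identifier": f"https://doi.org/{doi}",
        # Each citation is a resolvable identifier instead of free text.
        "citation": [{"@id": f"https://doi.org/{d}"} for d in cited_dois],
    }

record = article_jsonld(
    title="Placeholder title",
    authors=["A. Researcher"],
    published="2025-05-04",
    doi="10.0000/example.1",          # placeholder DOI
    cited_dois=["10.0000/example.2"],  # placeholder DOI
)
print(json.dumps(record, indent=2))
```

A program can follow the `citation` identifiers directly, which is not possible with a reference list embedded in a PDF.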

Implementation Challenges

1. Cost & Effort

"This is a massive, expensive undertaking"

Reality:
├─ Database infrastructure: $100K - $1M+ (setup & maintenance)
├─ Staff training: months to years
├─ Data migration: substantial effort
├─ Ongoing standardization: continuous cost
└─ → Major investment required

2. Legacy Data

Existing research data:
├─ Decades of inconsistent formats
├─ Missing metadata
├─ Ambiguous formats
├─ Manual extraction & cleaning required
└─ → One-time massive effort to migrate
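
Part of that cleaning effort is mapping legacy headers onto one canonical name. The sketch below assumes a hand-built alias table; whether `exp_data`, `ExperimentData`, and `ED` really mean the same thing is a decision a researcher has to make, and the code only records it. `ALIASES` and `canonical_column` are hypothetical names.

```python
import re

# Hypothetical alias table built by hand during migration: legacy header -> canonical name.
ALIASES = {
    "exp_data": "experiment_data",
    "experimentdata": "experiment_data",
    "ed": "experiment_data",
}

def canonical_column(header):
    """Map a legacy header to its canonical name, or raise if it is unknown."""
    key = re.sub(r"[^a-z0-9_]", "", header.strip().lower())
    if key in ALIASES:
        return ALIASES[key]
    # Unknown headers are surfaced for manual review instead of being guessed.
    raise KeyError(f"unmapped legacy column: {header!r}")

for legacy in ["exp_data", "ExperimentData", "ED"]:
    print(legacy, "->", canonical_column(legacy))
```

Raising on unmapped headers keeps the manual work visible: every ambiguous column has to be resolved once and written into the table.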

3. Researcher Resistance

Challenges:
├─ Researchers want flexibility (AI needs rigidity)
├─ "Why do I need to follow this format?"
├─ Learning new systems takes time
├─ Feels like bureaucracy to scientists
└─ → Cultural change management required

Why It’s Worth It

Benefits Realization

Once the infrastructure is ready:
├─ AI processes data instantly (not days)
├─ No data re-entry or manual processing
├─ Consistent data quality across team
├─ Reproducible research
├─ [[Automated Scientist]] becomes possible
└─ → Return on investment enormous

Research Impact

Before:
├─ 1 Postdoc → 1 paper/2 years
└─ Limited by human capacity

After:
├─ 1 Postdoc + AI → 10 papers/year
├─ [[wiki/concepts/Research-Automation-Pipeline]] fully activated
├─ [[wiki/concepts/Human-AI-Research-Partnership]] realized
└─ Exponential discovery acceleration

Practical Roadmap

Phase 1: Assessment (Months 1-2)

├─ Audit current data landscape
├─ Identify bottlenecks
├─ Define AI-readiness requirements
└─ Cost-benefit analysis
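
The audit step can start with a simple inventory of a shared drive. This is a rough sketch, assuming the data lives in an ordinary directory tree; the function `audit` and the allowed-character pattern `SAFE_NAME` are hypothetical and would need adjusting to a lab's own rules.

```python
from collections import Counter
from pathlib import Path
import re

SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")  # letters, digits, "_", "-", "."

def audit(root):
    """Count files by extension and list names containing other characters."""
    by_extension = Counter()
    unsafe_names = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        by_extension[path.suffix.lower() or "(none)"] += 1
        if not SAFE_NAME.fullmatch(path.name):
            unsafe_names.append(path.name)
    return by_extension, unsafe_names
```

Calling `audit("/path/to/shared/drive")` returns a count per file type and a list of filenames that break the naming rules, which gives a first estimate of migration effort.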

Phase 2: Pilot (Months 3-6)

├─ Select 1-2 research areas
├─ Build database + schema
├─ Migrate sample data
├─ Test with AI tools
└─ Iterate and improve

Phase 3: Full Implementation (Months 7-18)

├─ Scale infrastructure
├─ Migrate all legacy data
├─ Train entire team
├─ Establish governance
└─ Monitor and optimize

Phase 4: Integration (Month 19+)

├─ Deploy [[Automated Scientist]]
├─ Enable [[wiki/concepts/Human-AI-Research-Partnership]]
├─ Continuous improvement
└─ Competitive advantage established

Success Indicators

Infrastructure is "AI-ready" when:
├─ [ ] All experimental data in structured format
├─ [ ] Zero manual data entry for AI processing
├─ [ ] AI can query any dataset directly
├─ [ ] Researcher describes experiment once, system captures all metadata
├─ [ ] Literature seamlessly integrated with experimental data
├─ [ ] Reproducibility fully automated
└─ → [[Automated Scientist]] can operate independently

References