name: Yumi Park
location: Seoul, Korea
education: M.S. Statistics, Dongguk University
focus: RAG Systems · Video Retrieval · Data Pipelines
background:
- Designed and built RAG systems for enterprise document search
- Built Video RAG prototype for gov't R&D project (1st round pass)
- ML modeling at EdTech company (Woongjin ThinkBig AI Labs)
- Taught generative AI to non-developers at Samsung C&T AI Academy
- Wrote technical deep-dives on VQGAN, Transformer, and BERT source code (byumm315.tistory.com)
Hybrid retrieval pipeline combining BM25 + InternVideo2 + ColBERT. Selected for government R&D funding (1st round pass).
InternVideo2 FAISS ColBERT BM25 Runway DINOv2 C2PA Gradio
- Indexed 7,010 MSR-VTT videos with hybrid search: BM25 · Dense · WRRF fusion · ColBERT · ITM reranking
- MSR-VTT 1k-A benchmark: ITC dense alone R@1 3.5% → R@1 44.4% after full ITM (paper: 51.9%; −10.6%p gap attributed to unresolved ITC collapse)
- Scene Graph → 2-path routing (USE_AS_IS / TRANSFORM) PD workstation
- DINOv2 transition scoring + DreamColour 3D LUT color grading + C2PA ES256 provenance signing
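A minimal sketch of the fusion step named above, assuming WRRF stands for weighted reciprocal rank fusion with the commonly used constant k = 60. The function name, weights, and video IDs are hypothetical and not taken from the project.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked ID lists with score(d) = sum_i w_i / (k + rank_i(d))."""
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights[name]
        # Ranks start at 1, so the top hit of each retriever contributes w / (k + 1).
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from a sparse and a dense retriever.
rankings = {
    "bm25": ["video12", "video7", "video3"],
    "dense": ["video7", "video3", "video99"],
}
weights = {"bm25": 0.4, "dense": 0.6}  # illustrative weights only
fused = weighted_rrf(rankings, weights)
```

Because the fusion works on ranks rather than raw scores, BM25 and dense similarity do not need to be normalized to a common scale before they are combined.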
AI chatbot over 9 construction regulation PDFs · Samsung C&T AI Academy project · Team lead
Python FAISS BM25 bge-reranker GPT-4o-mini Streamlit
- FAISS + BM25 hybrid retrieval, with bge-reranker reranking to improve precision
- 7-type query classification via GPT-4o-mini with per-type response strategy branching
- Designed step-by-step implementation notebooks for non-developer teammates
Image synthesis PoC to address manufacturing defect data scarcity
VQGAN MaskGIT PyTorch HuggingFace Gradio
- v1 (taming VQGAN + custom MaskGIT) → v2 (LlamaGen VQGAN + Halton-MaskGIT): architecture migration driven by codebook incompatibility. Halton-MaskGIT (ICLR 2025) is built for LlamaGen VQGAN and cannot be combined with taming VQGAN, and pretrained weights were available only for LlamaGen, which made the switch the practical choice
- Merged 3 datasets including NEU-DET; 2,659 images → 8× augmentation
- VQGAN fine-tuning: Edge IoU +10.6%, PSNR +3.1%, SSIM +0.73%
- MaskGIT training loss 6.77 (target ~4.0) → convergence failure, attributed to the structural limitation of a 69M-param model trained on 21K images
- Deployed Gradio inpainting demo on HuggingFace Spaces
Hybrid search engine for construction standard terminology
Python FAISS ColBERT OpenAI API
- Weighted fusion: OpenAI Embedding semantic similarity (60%) + ColBERT token-level MaxSim (40%)
- Achieved accurate standard term retrieval from colloquial queries
- Circuit Breaker + Rate Limiting for API failure resilience
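A minimal sketch of the 60/40 weighted fusion described above, in pure Python. It assumes the ColBERT MaxSim sum is averaged over query tokens so that it sits on the same scale as the cosine similarity; that normalization, the function names, and the toy vectors are assumptions, not details from the project.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def maxsim(query_tokens, doc_tokens):
    """ColBERT-style late interaction: sum over query tokens of the best doc-token match."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


def fused_score(semantic_sim, query_tokens, doc_tokens, w_sem=0.6, w_col=0.4):
    # Assumption: average MaxSim over query tokens to keep it in [-1, 1].
    colbert = maxsim(query_tokens, doc_tokens) / len(query_tokens)
    return w_sem * semantic_sim + w_col * colbert


# Toy 2-d token embeddings; real embeddings would come from the models.
q_tokens = [[1.0, 0.0], [0.0, 1.0]]
d_tokens = [[1.0, 0.0], [1.0, 1.0]]
score = fused_score(0.8, q_tokens, d_tokens)
```

The sentence-level embedding captures overall meaning, while the token-level MaxSim term rewards candidates that match individual query words, which is one plausible reason to mix the two for colloquial-to-standard term lookup.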
High-concurrency ticketing system · Bootcamp final project · Team lead
Spring Boot Redis MySQL React Docker AWS
- Designed and implemented AI review summarization feature (Together AI → OpenAI migration)
- Refactored seat management from section-based to grade-based architecture
- Implemented dynamic seat layout rendering based on venue scale (small / medium / large)
| Period | Role | Key Work |
|---|---|---|
| 2025.08–11 | AI Instructor, Samsung C&T AI Academy (Elice) | Generative AI curriculum & RAG chatbot training for construction industry professionals |
| 2023.09–2024.12 | Research Team, Woongjin ThinkBig AI Labs | CatBoost difficulty prediction model for exam items (R²=0.57), ALP system reverse analysis |


