📝 Paper · [Paper Review] Defending Against Prompt Injection with a Few Defensive Tokens · Aug 8, 2025 · Paper, Safety / Alignment
📝 Paper · [Paper Review] PoisonBench: Assessing LM Vulnerability to Poisoned Preference Data · Jul 22, 2025 · Paper, Safety / Alignment, Benchmark
📝 Paper · [Paper Review] Cheating Automatic LLM Benchmarks: NULL Models Achieve High Win Rates · Jul 15, 2025 · Paper, Safety / Alignment, Benchmark
📝 Paper · [Paper Review] DarkBench: Benchmarking Dark Patterns in LLMs · Jul 9, 2025 · Paper, Safety / Alignment, Benchmark
📝 Paper · [Paper Review] Safety Alignment Should Be Made More Than Just a Few Tokens Deep · May 27, 2025 · Safety / Alignment, Paper
📝 Paper · [Paper Review] Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models · Mar 22, 2025 · Paper, ICLR2025