About Kumo
KUMO is a novel benchmark designed to systematically evaluate the complex reasoning capabilities of Large Language Models (LLMs) through procedurally generated reasoning games. Explore the limits of LLM reasoning and track model performance on our interactive leaderboard.
Illustrative Overview
Kumo consists of over 5,000 tasks across 100 diverse environments, spanning domains from the arts and astronomy to the social sciences and medicine.
Hugging Face Leaderboard
Our full leaderboard is available on Hugging Face.
Please refer to our GitHub repository for more details.
Citation
@article{lin2025generative,
  title={Generative Evaluation of Complex Reasoning in Large Language Models},
  author={Lin, Haowei and Wang, Xiangyu and Yan, Ruilin and Huang, Baizhou and Ye, Haotian and Zhu, Jianhua and Wang, Zihao and Zou, James and Ma, Jianzhu and Liang, Yitao},
  journal={arXiv preprint arXiv:2504.02810},
  year={2025}
}