KUMO: Generative Evaluation of Complex Reasoning in Large Language Models

Paper | Code | Data | Leaderboard

About KUMO

KUMO is a novel benchmark designed to systematically evaluate the complex reasoning capabilities of Large Language Models (LLMs) through procedurally generated reasoning games. Explore the limits of LLM reasoning and track model performance on our interactive leaderboard.

Illustrative Overview

(Figure: teaser.png — an illustrative overview of KUMO)

KUMO consists of over 5,000 tasks across 100 diverse environments, ranging from the arts and astronomy to the social sciences and medicine.

Hugging Face Leaderboard

Our full leaderboard is available on Hugging Face. Please refer to our GitHub repository for more details.
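If you want to experiment with the tasks locally, the sketch below shows one possible way to pull them from the Hugging Face Hub using the datasets library. The repository ID and the field access shown here are illustrative assumptions, not the official loading instructions; please follow the GitHub repo for the authoritative setup.

# Minimal sketch of loading KUMO tasks from the Hugging Face Hub.
# NOTE: "KUMO-team/KUMO" is a hypothetical repository ID used for
# illustration only; see the official GitHub repo for the real one.
from datasets import load_dataset

tasks = load_dataset("KUMO-team/KUMO", split="test")
print(len(tasks))   # number of tasks in this split
print(tasks[0])     # inspect a single procedurally generated task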

Citation

@article{lin2025generative,
    title={Generative Evaluation of Complex Reasoning in Large Language Models},
    author={Lin, Haowei and Wang, Xiangyu and Yan, Ruilin and Huang, Baizhou and Ye, Haotian and Zhu, Jianhua and Wang, Zihao and Zou, James and Ma, Jianzhu and Liang, Yitao},
    journal={arXiv preprint arXiv:2504.02810},
    year={2025}
}

Contact Us

Have questions about KUMO? Please contact us at yitaol@pku.edu.cn or open an issue on GitHub. For potential collaborations, please also reach out to yitaol@pku.edu.cn.