Monday, December 23, 2024
Show HN: A registry of agent benchmarks (including many OSS agent trajectories) https://ift.tt/PTai2Uz
If you're interested in exploring what today's LLM-based agent systems actually do to solve benchmarks such as SWE-bench or WebArena, our team created a small leaderboard that lets you view many public and OSS agent results, including the full runtime traces (the step-by-step reasoning behind the scenes). Looking at traces is quite interesting, as they reveal a lot about the inner workings and shortcomings of current agent systems; see https://ift.tt/4tQS25i... for an example trace. https://ift.tt/oMhUgRP December 23, 2024 at 05:57AM