I have been building a "realistic" but simplified 2D universe simulator since GPT-3.5 was first introduced. It has slowly grown into a 1000-line Python file that uses Pygame. I think it represents a good mix of what might appear in training data as code examples and physics understanding. It's a zero-player game, so aside from occasionally "playing" it, I mainly use it to test new models. Can the model handle 1000 lines of complex physics code? Can the model make improvements to areas that I already know are poorly optimized or incorrect?