Our project focuses on developing an AI agent capable of defeating a Cuphead boss using deep reinforcement learning. Our method uses a two-stage machine learning approach: a computer vision component based on YOLO (You Only Look Once) for real-time object detection and game state understanding, followed by a deep Q-network (DQN) for action selection. We manually start the game and load the level; the agent takes over from there. The agent processes raw gameplay frames to identify critical elements such as the player character, boss, projectiles, and health indicators, then uses this information to decide how to move and dodge. Our agent has made significant progress, reaching the third phase of the boss fight after 3,500 training runs, and shows promising potential for achieving a complete victory with further optimization.
Approach
Our solution combines state-of-the-art computer vision with reinforcement learning, implemented through a dual-stage pipeline:
Stage 1: Computer Vision (State Recognition)
We use the YOLO (You Only Look Once) object detection model for real-time game state understanding. The model processes game frames to detect and classify the elements listed below. To train the YOLO model, we manually labeled the training data, annotating the character, bosses, projectiles, and health indicators in captured frames (a minimal detection sketch follows the list):
Player character position
Boss position and state
Projectiles and hazards
Health indicators
Progress indicators
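A minimal sketch of this stage is shown below; the weights filename, label names, and vector layout are illustrative rather than exact.

```python
# Sketch: run a custom-trained YOLO model on a captured frame and flatten
# the detections into a fixed-length, normalized state vector for the DQN.
# The weights path and label names are placeholders for illustration.
import numpy as np
from ultralytics import YOLO

model = YOLO("cuphead_yolo.pt")  # placeholder name for the hand-labeled weights

def frame_to_state(frame, screen_w, screen_h, max_projectiles=5):
    results = model(frame, verbose=False)[0]
    positions = {"player": np.zeros(2), "boss": np.zeros(2)}
    projectiles = []
    for box in results.boxes:
        name = results.names[int(box.cls)]            # label assigned during annotation
        x, y, _, _ = box.xywh[0].tolist()             # box center in pixels
        pos = np.array([x / screen_w, y / screen_h])  # normalize to [0, 1]
        if name in positions:
            positions[name] = pos
        elif name == "projectile":
            projectiles.append(pos)
    # Pad/trim to a fixed number of projectile slots so the vector length is constant.
    projectiles = (projectiles + [np.zeros(2)] * max_projectiles)[:max_projectiles]
    return np.concatenate([positions["player"], positions["boss"], *projectiles])
```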
The detection results are normalized and vectorized into a state representation suitable for the reinforcement learning agent. The DQN's per-step reward is then built from the following components (sketched in code after the list):
Optimal position maintenance: +0.1 * (1 - distance_from_optimal/screen_width)
Edge penalty: -0.1 when too close to screen edges
Projectile avoidance:
Dynamic reward based on distance increase from projectiles
Scaled by phase: 0.1 * sqrt(current_phase)
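A sketch of how these components combine into a single per-step reward; the weights and square-root phase scaling match the list above, while the edge margin and the function signature are illustrative.

```python
import math

def step_reward(player_x, optimal_x, screen_width, projectile_dists_prev,
                projectile_dists_now, current_phase, edge_margin=0.05):
    """Illustrative per-step reward combining the components listed above."""
    reward = 0.0
    # Optimal position maintenance: closer to the optimal x position is better.
    dist_from_optimal = abs(player_x - optimal_x)
    reward += 0.1 * (1 - dist_from_optimal / screen_width)
    # Edge penalty when the player hugs the screen border (margin is an assumption).
    if player_x < edge_margin * screen_width or player_x > (1 - edge_margin) * screen_width:
        reward -= 0.1
    # Projectile avoidance: reward increasing distance to projectiles,
    # scaled by phase difficulty.
    for prev, now in zip(projectile_dists_prev, projectile_dists_now):
        if now > prev:
            reward += 0.1 * math.sqrt(current_phase)
    return reward
```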
Evaluation
Quantitative Metrics
Training Progress
Current achievement: Reached Phase 3 in 3,500 runs
Baseline comparison: Random agent (<25% boss health depletion)
Current performance: ~60% boss health depletion
Average survival time: Increasing trend (graph to be added)
Health Management
Starting health: 4 points
Average health at phase transitions:
Phase 1 → 2: 1.7 health points
Phase 2 → 3: 1.2 health points
Phase Progression
Average time to reach Phase 2: 85 seconds
Average time to reach Phase 3: 170 seconds
Success rate reaching Phase 2: 60%
Success rate reaching Phase 3: 10%
Reward Progress
Reward results over time across the different models trained:
Average total reward improves gradually from roughly -60 to -50
This is because the reward function is not yet optimized: we are not rewarding the agent for surviving longer or for taking more optimal actions
This graph plots average total rewards over segments of 25 episodes. An episode is a single run of the boss battle. Episode rewards were not initially recorded, though model checkpoints were periodically saved. To approximate the full training curve, a separate program later captured rewards by loading the model from earlier checkpoints.
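A sketch of that checkpoint-replay approach; the checkpoint file layout and the run_episode helper (which plays one boss attempt and returns its total reward) are placeholders, and the checkpoints are assumed to be saved state dicts.

```python
import glob
import numpy as np
import torch

SEGMENT = 25  # average rewards over 25-episode segments, as in the plot

def rebuild_reward_curve(checkpoint_dir, make_agent, run_episode, episodes_per_ckpt=SEGMENT):
    """Approximate the training curve by replaying saved model checkpoints."""
    segment_means = []
    for path in sorted(glob.glob(f"{checkpoint_dir}/*.pt")):
        agent = make_agent()
        agent.load_state_dict(torch.load(path))  # assumes checkpoints are state dicts
        agent.eval()
        rewards = [run_episode(agent) for _ in range(episodes_per_ckpt)]
        segment_means.append(np.mean(rewards))
    return segment_means
```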
Qualitative Analysis
Behavioral Improvements
Developed consistent dodging patterns for common projectiles
Learned to maintain optimal attack position
Shows adaptation to different boss phases
Demonstrates emergent strategies for health preservation
Learning Challenges
Initial difficulty with edge case projectile patterns
Occasional suboptimal positioning in Phase 3
Room for improvement in phase transition strategies
Visualization
(Screenshots and performance graphs to be added showing:)
Object detection overlay
Action probability distributions
Reward accumulation over time
Phase progression success rates
Remaining Goals and Challenges
Our prototype, while showing promising results in reaching the third phase of the boss fight, still has several limitations and challenges we aim to address in the remainder of the quarter:
Reward System Optimization
The current reward system, while functional, needs significant refinement. Our agent’s average reward of -60 to -50 indicates that the reward structure isn’t effectively encouraging optimal behavior. We plan to (see the sketch after this list):
Implement a more nuanced phase-based reward system that better reflects the increasing difficulty
Add specific rewards for successful dodge patterns
Develop a more sophisticated positioning reward that accounts for both attack opportunities and safety
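One possible shape for these refinements is sketched below; every coefficient is a placeholder to be tuned, not a value we have committed to.

```python
def refined_reward(base_reward, current_phase, dodged_projectile, survival_time_s):
    """Illustrative refinement: scale by phase, reward dodges and survival.
    All coefficients are placeholders subject to tuning."""
    reward = base_reward * (1 + 0.25 * (current_phase - 1))  # harder phases pay more
    if dodged_projectile:
        reward += 0.5                      # explicit bonus for a successful dodge
    reward += 0.01 * survival_time_s       # small ongoing bonus for staying alive
    return reward
```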
Evaluation Depth
While we have basic metrics, we need more comprehensive evaluation to truly understand our agent’s performance:
Implement detailed tracking of action distributions per phase (a small sketch follows this list)
Compare performance against human players of varying skill levels
Analyze failure cases to identify patterns in unsuccessful runs
Create visualization tools for real-time decision-making process
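A small sketch of the per-phase action tracking from the first item above; the class and method names are illustrative.

```python
from collections import Counter, defaultdict

class ActionTracker:
    """Record which actions the agent takes in each boss phase."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def record(self, phase, action):
        self.counts[phase][action] += 1

    def distribution(self, phase):
        total = sum(self.counts[phase].values())
        return {a: c / total for a, c in self.counts[phase].items()} if total else {}
```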
Technical Challenges
Several technical hurdles remain:
Frame Processing Speed: Our current YOLO implementation occasionally causes frame drops, which can affect the agent’s performance. We’re investigating optimization techniques and may need to simplify our object detection model.
State Space Complexity: The current state representation might be too simplified for the complex patterns in Phase 3. We’re considering expanding the state space to include historical data for better pattern recognition.
Action Timing: The fixed 0.1 s action delay may be suboptimal in certain scenarios, since the agent cannot chain actions together and therefore cannot react quickly enough to the game. We plan to experiment with dynamic action timing based on the game state.
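To make the last two challenges concrete, the sketch below shows the fixed-delay control loop extended with a short history of past states; the key bindings, history length, and helper functions (grab_frame, frame_to_state, agent.select_action) are assumptions for illustration.

```python
import time
from collections import deque
import numpy as np
import pydirectinput

ACTIONS = ["left", "right", "z", "x"]   # illustrative key bindings
HISTORY = 4                             # number of past states to stack (assumption)
state_history = deque(maxlen=HISTORY)

def step(agent, grab_frame, frame_to_state):
    """One control step: capture, stack recent states, act, then wait 0.1 s."""
    state_history.append(frame_to_state(grab_frame()))
    if len(state_history) < HISTORY:
        return                                    # wait until the buffer is full
    stacked = np.concatenate(state_history)       # history provides pattern context
    action = ACTIONS[agent.select_action(stacked)]
    pydirectinput.press(action)                   # send the keypress to the game
    time.sleep(0.1)                               # current fixed action delay
```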
Anticipated Obstacles
Computational Resources: Training with an expanded state space and more sophisticated reward system will require significantly more computational resources. We may need to optimize our code or seek additional GPU resources.
Overfitting Concerns: As we fine-tune the reward system, there’s a risk of overfitting to specific boss patterns. We’ll need to ensure our agent maintains adaptability.
Time Constraints: Implementing and testing all planned improvements within the quarter will be challenging. We’ve prioritized our goals and will focus on the most impactful changes first.
Resources Used
Development Tools and Libraries
PyTorch: Primary framework for implementing the DQN
Ultralytics YOLO: Object detection and tracking
OpenCV (cv2): Image processing and frame capture
MSS: Screen capture functionality
PyDirectInput: Game control interface
NumPy: Numerical computations and array operations
PyGetWindow: Window management for game interaction
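For reference, a minimal sketch of how the capture-related pieces fit together; the window title and color conversion are illustrative.

```python
import cv2
import mss
import numpy as np
import pygetwindow as gw

def grab_game_frame(title="Cuphead"):
    """Capture the game window as a BGR image ready for YOLO inference."""
    win = gw.getWindowsWithTitle(title)[0]          # locate the game window
    region = {"left": win.left, "top": win.top,
              "width": win.width, "height": win.height}
    with mss.mss() as sct:
        raw = np.array(sct.grab(region))            # BGRA screenshot of the window
    return cv2.cvtColor(raw, cv2.COLOR_BGRA2BGR)    # drop the alpha channel
```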