Weekly WaveShine #2

Analytics And Hogwarts Legacy, Mobile Game Design, and Balancing Metagame Systems

Hello, and thanks for stopping by for this week’s entry in my Weekly WaveShine series! In this week’s entry you can look forward to:

  1. An analytics plan for checking for issues with progression in Hogwarts Legacy as part of a game design exercise
  2. A slightly better formatted game design doc for a mobile game I made a prototype for
  3. My reading and a summary of my notes on metagame systems from Game Balance
Want more or to figure out what the heck a WaveShine is? Check out last week’s blog entry or my most recent entry!

Analytics and Hogwarts Legacy

As part of a game design exercise from Game Balance by Ian Schreiber and Brenda Romero, I have built an analytics plan that defines some key questions regarding the balance within Hogwarts Legacy, a popular AAA game that was released recently. The plan then breaks down what metrics are needed to answer these questions and what statistical techniques would be employed to gain said insights.

Disclaimer: I strongly oppose the rather horrible things about the trans and LGBTQIA+ community said by the original author of this game’s IP. That being said, it was difficult to turn down the chance to experience one of my favorite childhood stories coming to life. While I believe it is an interesting subject of study from a design perspective, my choice to do so does not indicate any support or approval of discrimination in any form.

Hogwarts Legacy

One has to seriously question a bunch of teenagers running around with a spell that outright kills people.

The Problem

  • Scenario: Your team has acquired the right to make a remake of a popular game of your choice.
  • Challenge: Create an analytics plan to help direct your team’s development efforts to improve the balance and overall player experience.
  • Limitations: This is not a sequel. The general mechanics, plot, and systems must remain in place. The changes are generally limited to updating audio, graphics, platform support, and tweaks to game balance.

Deliverables

A complete response to this exercise includes the following:

  1. Define the balance questions to be addressed.
  2. Define the metrics that will be used in the analysis and the sources of this data. Assume that actually getting this data is logistically possible.
  3. Perform a statistical analysis that uses these data to provide insight into the balance questions above.

What Questions Need Answers?

I personally feel the game is rather mediocre when separated from the Harry Potter IP. My gut intuition is that although the game itself will sell well, it will not be a definitive title for this generation of consoles and that most players will not play the game to completion. As a result, the game will not be remembered as a masterfully designed game and it misses a lot of potential for monetization. I think that this is due to a combination of factors, and that one of the most prominent of these is the overall imbalance of progression and narrative pacing within the game.

I believe that these balance issues are widespread and impact a wide number of systems. As such, the overall objective of this analysis is to identify which of these systems have the greatest effect on overall player engagement. The questions I believe that analytics can help provide insights into are:

  1. How likely are players to complete the game? Is there a pattern to main story quest (MSQ) % completion? – In other words, I want to figure out if there is actually an issue or if I’m just shouting at windmills.
  2. Is progression with spells a significant determinant of player engagement? – My observation is that with the exception of the Unforgivable Curses, the Player obtains spells very quickly before progression within this system stops. Learning new spells was a notable increase in player power/ability, and I want to know if reaching this gap in progression caused players to become bored.
  3. Is progression with gear a significant determinant of player engagement? – Around the time that spell progression drops off, Players unlock gear crafting as a new form of progression. This new system really only gives players the ability to apply passives to their gear, and generally, selecting a statistically superior form of gear is preferable to a particular passive ability. I want to know if this system sufficiently carries player interest through to later segments of the game.
  4. Is progression with abilities a significant determinant of player engagement? – Progression with abilities is noticeably weighted toward the early game and the abilities themselves are largely passive additions to existing abilities the Player already possesses. With the exception of a few abilities, they don’t really fundamentally increase the number of options the Player has. They just make certain options a little stronger. I want to know whether gaining new abilities is enough to keep Players engaged with the game.
  5. Are these trends affected by the primary motivations of players within the game? – I think that different players have different motivations for playing and the effects of various systems on their patterns of play might be affected by their player group. Depending on the business objectives of the development efforts, it may be preferable to tailor our development efforts to favor particular player groups.

What Metrics Do We Need?

The metrics I’m particularly interested in are measurements of key gameplay events/data based on player save files. I prefer these data over additional player surveys because I believe that players who are invested enough to answer a survey about the game will likely exhibit different play patterns than the average player. I also prefer gameplay data over additional playtesting because, based on the constraints of the question, this data can be assumed to already be available, and the size of the data set will help offset statistical errors related to sampling biases that may arise in concentrated playtests.

From player save data, I’d like to gather the following on a per player basis:

  1. The raw number of hours played – While not a perfect indicator of player engagement, it is a good starting point and can be used to identify key player groups.
  2. Overall % completion – This statistic is visibly tracked in game, and can help gauge player engagement and gameplay motivations, especially for completionists.
  3. MSQ % completion – Another statistic that is visibly tracked in game. Likely a better way to track average player engagement since most players probably aren’t completionists.
  4. % of time spent in combat – Not a visibly tracked statistic in the game, but one that I assume was collected based on the original definition of this problem.
  5. The amount of elapsed play time (Δt) from the last time a spell was acquired to the last time a player played the game – To be used in determining the effect spell progression had on player engagement. While not a definitive measure, it can be used to determine the strength of this claim and whether the system merits additional investigation.
  6. Δt from the last gear score progression to the last play time – To assess whether gear score was a meaningful form of progression for players. In my opinion, this felt like a pretty meaningless system. Just equip whatever has higher stats and transmog its appearance into whatever you want.
  7. Δt from the last gear crafting to the last play time – Gear crafting also seemed low impact given the fact that a statistically superior gear was likely to drop at any given moment.
  8. Δt from the last ability acquisition to the last play time – Again, I felt that ability acquisition wasn’t a particularly strong driver for player engagement, and that analyses based on this metric will help determine whether or not this system is in need of attention.
I think that these metrics will provide a good starting point for further analyses.

What Analyses Should We Perform?

Alright, this is where the questionable math begins. I’m not a statistics major, I’m an engineering major, which means I think I know how to do statistics, but probably don’t. 

Analysis #1: Z-test on Overall MSQ % Completion

So the first thing I want to do is perform hypothesis testing on Players’ overall completion of the MSQ. I believe that the mean percentage of this value can indicate how engaged players are with our Hogwarts Legacy.

For those that don’t know, hypothesis testing is a form of statistical analysis that is used to determine whether a sample of data provides enough evidence to reject a default assumption (the null hypothesis) in favor of an alternative. For more info, you can read here. The hypotheses we want to test are:

  • Null Hypothesis: µ = 0.5. In English, the null hypothesis is that players are just as likely to complete the game as they are not.
  • Alternative Hypothesis: µ ≠ 0.5. In English, the alternative hypothesis is that players are either more likely or less likely to finish our game.
The goal of this test would be to ideally reject the null hypothesis. If the Z-score winds up being statistically significant and positive, then we can assume with 97.5% confidence (a 2.5% tail on each side of a two-sided test at the 5% significance level) that players on average finish more than 50% of the game. If it’s negative, we would assume the opposite. I think that if players aren’t finishing even 50% of the game on average, it’s a sign that the game might not be engaging overall. Adjustments to balance and progression won’t necessarily prop up a bad underlying gameplay experience or narrative, but they can definitely help.
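
For readers who want to see the mechanics, here is a minimal sketch of the two-sided Z-test described above, using only Python's standard library. The sample values are made up for illustration and are not real player data. Using the sample standard deviation in place of the population value is an assumption that is only reasonable for large samples.

```python
import math
from statistics import NormalDist, mean, stdev


def one_sample_z_test(samples, mu0=0.5):
    """Two-sided one-sample z-test of the sample mean against mu0.

    Returns (z, p_value). Assumes the sample is large enough that the
    sample standard deviation is a fair stand-in for the population one.
    """
    n = len(samples)
    standard_error = stdev(samples) / math.sqrt(n)
    z = (mean(samples) - mu0) / standard_error
    # Two-sided p-value: probability mass in both tails beyond |z|.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical MSQ completion fractions (one value per player save file).
msq_completion = [0.10, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.20, 0.15, 0.30]

z, p_value = one_sample_z_test(msq_completion, mu0=0.5)
if p_value < 0.05:
    direction = "more" if z > 0 else "less"
    print(f"Reject the null: players finish {direction} than 50% on average.")
else:
    print("Cannot reject the null hypothesis at the 5% level.")
```

A real analysis would run this on thousands of save files rather than ten values; the tiny list here only keeps the example readable.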

Visualization of a Two-Sided Z Test

Credit for image here.

Analysis #2: Player Group Binning

The next analysis I’d like to perform is an attempt to separate the players into “bins” based on their primary motivations for playing the game as indicated by their save data. This information is particularly difficult to gather as both a direct survey and user play testing would be influenced by response bias. I would say this subsequent analysis also is not statistically rigorous, but is at least a somewhat structured approach to approximating these player groups.

The procedure I propose is to set, for each of a series of key metrics, a critical value (in the graph to the right, 1.65 standard deviations above the mean) corresponding to some top % of players (in the graph to the right, the top 5% of players). This assumes that the data follow a normal distribution. This critical value would act as the threshold that a player would need to meet to qualify for categorization into a given bin. Players will only belong to one bin to simplify the analysis.

Visualization of a Right-Sided Z Test

Credit for image here. This analysis is concerned with the interval marked in red.

The bins, metrics, and order I’d define them in is as follows:

  1. Completionists – The metric they’d be binned by would be % total completion of the game. 
  2. Combat-Driven – The metric they’d be binned on would be # of enemies defeated / MSQ% completion, taken from the set of players that are not completionists. This is because completionists are likely to enter combat and generally perform a lot of side activities per MSQ% completion compared to the average player, but might not be directly motivated by their love of the combat system.
  3. Other – This group would be binned based on the number of hours played / MSQ% completion. This group represents players that are playing for reasons that aren’t related to the main quest like in-game photography, base building, RPing, exploration, and other non-gameplay related motivations. Completionists and combat-driven players would likely score highly on this metric, so this bin is formed after to reduce the chances of mis-categorization.
  4. Narrative-Driven – They would be binned based on % MSQ completion. This is because completionists would also have a high MSQ% completion rate, so filtering them out first will help isolate players more motivated by narrative. Also, combat-driven and other-driven players may have completed a lot of the story line to unlock the full extent of the systems they’re interested in, but not because they particularly enjoyed the story.
  5. Casuals – These are the players that don’t belong to any of the other bins.
These bins could then be used to assemble player profiles and potentially observe common trends and behaviors within these approximate user groups. Additionally, these bins can be individually applied to any of the other analyses in this analytics plan to check if there are trends that are being produced due to the presence or absence of these player groups.
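
The ordered binning procedure above could be sketched roughly as follows. This is only an illustration: the dictionary keys (`id`, `total_completion`, `enemies_defeated`, `hours_played`, `msq_completion`) are hypothetical field names, the threshold is recomputed on whichever players are still unassigned at each step, and the small floor on MSQ completion is an added assumption to avoid dividing by zero.

```python
from statistics import mean, stdev

Z_CRITICAL = 1.65  # roughly the top 5% under a normal distribution


def top_group(players, metric):
    """Return players whose metric exceeds mean + Z_CRITICAL * stdev."""
    if len(players) < 2:
        return []
    values = [metric(p) for p in players]
    threshold = mean(values) + Z_CRITICAL * stdev(values)
    return [p for p, v in zip(players, values) if v > threshold]


def msq(p):
    # Assumption: floor MSQ completion at 1% so ratios stay finite.
    return max(p["msq_completion"], 0.01)


# Bins are evaluated in this order; each player lands in at most one.
ORDERED_BINS = [
    ("Completionist", lambda p: p["total_completion"]),
    ("Combat-Driven", lambda p: p["enemies_defeated"] / msq(p)),
    ("Other", lambda p: p["hours_played"] / msq(p)),
    ("Narrative-Driven", lambda p: p["msq_completion"]),
]


def assign_bins(players):
    """Map each player id to exactly one bin name."""
    bins = {}
    remaining = list(players)
    for name, metric in ORDERED_BINS:
        selected_ids = {p["id"] for p in top_group(remaining, metric)}
        for player_id in selected_ids:
            bins[player_id] = name
        # Players already binned are removed before the next metric.
        remaining = [p for p in remaining if p["id"] not in selected_ids]
    for p in remaining:
        bins[p["id"]] = "Casual"
    return bins
```

Calling `assign_bins` on a list of per-player dictionaries returns a mapping from player id to bin name, which can then be used to filter the data for any of the other analyses.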

Analysis #3: Spell Progression Trends

The next thing I’d want to do is create a scatterplot of the normalized values of the amount of time that has passed from when the player last acquired a spell to the last time they played the game vs MSQ% completion to see if there is any trend or relationship between these variables. After controlling for outliers, possibly by using the bins defined above, what I’d specifically be looking for is a negative relationship between the two variables. That is to say, the more standard deviations above the mean a player’s elapsed time is, the more standard deviations below the mean the player’s MSQ% completion is. If this general trend holds, then it would indicate that progression with spells is likely an important factor in player engagement, and that further analysis of how key progression events within the spell system affect average MSQ % completion or even absolute play time is warranted.

Again, this would support my original qualitative analysis of the game, which is that progression with spells is one of the strongest forms of engagement when it comes to systems within the game and that, because this progression occurs too quickly, it negatively impacts overall player engagement as the game proceeds.
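
As a rough sketch of how the normalization and trend check might look in practice, the snippet below converts both variables to z-scores and summarizes the scatterplot with a Pearson correlation coefficient. Using Pearson's r as the trend measure is an assumption added here (the plan above only calls for a scatterplot), and the numbers are invented purely for illustration.

```python
from statistics import mean, stdev


def z_scores(values):
    """Express each value as standard deviations from the sample mean."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def pearson_r(xs, ys):
    """Pearson correlation computed from the z-scores of both variables."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


# Hypothetical per-player data: hours played after the last spell unlock,
# and the fraction of the main story quest completed.
hours_since_last_spell = [2.0, 5.0, 8.0, 12.0, 20.0, 30.0]
msq_completion = [0.95, 0.80, 0.70, 0.50, 0.35, 0.20]

r = pearson_r(hours_since_last_spell, msq_completion)
print(f"Pearson r = {r:.2f}")  # a value near -1 suggests a strong negative trend
```

The same two helpers would work unchanged for the gear and ability metrics; only the input lists differ.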

Analysis #4: Gear Progression Trends

The next analysis I’d like to perform is on gear progression. Using a similar method to the section above, I’d like to generate a scatterplot of the normalized values for the amount of time since the player last increased their gear score to when they last played the game as compared to the normalized values of MSQ % completion. I’d again suggest controlling for outliers and potentially applying binning to the samples before generating the scatterplots. I would also perform the same analysis but instead of the amount of time since improving gear score, I’d like to look at the amount of time since the player intentionally used the crafting system. For both of these scatter plots, I’d expect for there to be no general trend. This would support the theory that there is probably not a very strong correlation between player interactions with the gear system and their actual engagement with the game. My rationale behind this is that generally there isn’t a reason to equip anything other than the statistically most powerful piece of gear you have in your inventory and that the drop rate of gear is so high that players don’t have much incentive to engage with the crafting system.

If this were the case, this would be grounds for further analysis of the gear system to see if there are some adjustments to drop rate or overall gear score progression that could be altered to make this a more meaningful system for players. 

Analysis #5: Ability Progression Trends

Finally, I’d like to take a look at the ability system. Again, we’d generate a scatterplot of the normalized values for the amount of time since the player last unlocked a new ability to when they last played the game as compared to the normalized values of MSQ % completion. I would again expect no statistically significant trend to be present. My rationale behind this is that while abilities seem exciting on the surface, the abilities themselves do not produce any meaningful changes in the way the player plays the game. They simply adjust how strong a given option is, but don’t actually increase the depth of decision-making within the game. Therefore, I’d expect in the long term that players would realize the lack of depth of this system, and unlocking additional abilities would cease to be a driver of player engagement.

Again, we’re using this higher-level test to determine whether or not there is value in continuing to investigate the ability system and opportunities for rebalancing.

Conclusion

As a result of performing these analyses, we’d have a clearer picture of whether or not there is an issue with player engagement, what the overall demographics of our player base look like, and which systems have room to improve. Furthermore, we’d be able to tell if certain systems are more successful at engaging specific player groups. With this information in hand, we’d be able to make an informed decision about what the business and design priorities of the effort are and what systems will be best to change in order to meet these objectives. Further, more specific analytics may be employed or the team could proceed directly to playtesting and iteration depending on timelines and business needs.

Mobile Design and Prototyping Plan

Unfortunately, I haven’t been able to spend time in Unreal working on my side project because I was performing a technical test for a potential employer. While I can’t elaborate on the specific employer or the details of the test, the problem itself was sufficiently open-ended such that I can share my analysis and the plan for the prototype I ultimately wound up implementing as well as my general process and rationale behind my decisions.

Defining the Problem Space

As always, I started this problem by first defining the problem space. In this scenario, the assumptions I made as I defined the problem space are as follows:

  • Context: The company, at least from what I could research, was between titles, and so I assumed that the prototype I’d been asked to create was part of a project that was in pre-production, where the organization is trying to determine if a business case can be made for developing the game. Based on the company’s portfolio of games, it was clear to me that they focused almost exclusively on mobile games. When it comes to mobile games, the context that these games tend to be played in is virtually anywhere, at almost any time. Typically these play sessions are short and occur between other activities in the Player’s life.
  • Target Audience: Based on the typical business objectives of mobile games, I assumed that the target audience would be broad in order to maximize the number of potentially monetizing players.  
  • Constraints: An obvious constraint based on the context above is that the game would need to be a mobile game. Mobile games bring with them a number of constraints including limitations on the control scheme, performance limitations of the platform, saturation of the target market, and need for friction driving monetization in order to be business sustainable.
Using these assumptions, I determined the following qualities to be present in the final gameplay experience of this design:
  1. The gameplay loop of the game should be short. While overall progression and the game itself may be long if not infinite, the primary gameplay loop should be structured so that Players can play and forget about the game as fits into their life schedules.
  2. Gameplay should be simple to learn, and difficult to master. Because the target audience is broad, we want the gameplay to be sufficiently simple that even younger players can engage with the game and enjoy the gameplay, but to possess the complexity needed to maintain the attention of adults. That being said, this complexity should be generally related to optimizing play but not explicitly needed to enjoy the actual gameplay within the game. This makes it hard for players to tell whether the friction to monetize comes from the game progressively ramping them out or from their own sub-optimal play.
  3. Gameplay and the overall design should support monetization. As has been pointed out several times, monetization is critical within the mobile games market. The design should support the development of systems that are complex enough to provide the necessary means to adjust balance and increase friction to monetize and meet Key Performance Indicators (KPIs).
  4. Gameplay should be distinctive relative to the rest of the market. Finally, the mobile games market is highly saturated. In such a scenario, the two options available are to attempt to offer a similar product that directly competes with another existing product with some minor improvements in an attempt to take their user base or to offer a more novel product that provides an experience that is not already present on the market. The first approach is generally very difficult to do, and ultimately requires a significant prior investment in order to yield a product that is capable of out-performing already existing games. As such, it would be more realistic for a smaller company to focus on creating a more novel experience.

The Solution

The overall design I suggested exploring given this problem space was a rhythm-based tower-defense game where Players can place gacha characters that can be tapped on to produce greater damage depending on each character’s unique rhythms. Depending on budget constraints, characters could make individual contributions to a common musical theme by replacing the “instrument” used to play each individual part of the sound track by simply passing each instrument through an audio filter or a set of samples that is unique to each character. Particularly rare characters could potentially change the song played itself, helping to drive the value of making additional gacha rolls.

Strengths

Here are what I believe are some of the strengths of this design separated out into the key elements of a design:

  1. Gameplay
    1. Relies on a gacha/tower-defense model that has proven to be both engaging for players and highly amenable to monetization.
    2. Rhythm-based gameplay is (to my knowledge) unique or at very least uncommon among existing competitors.
    3. This design was somewhat inspired by the Piano Tiles game. Its popularity despite its simplicity seems to lie in the high chance of entering a flow state when making repetitive, yet rhythmic movements.
    4. Gameplay footage will likely be fairly unique and action-packed looking, potentially helping with marketability.
  2. Tech
    1. Overall, the tower-defense style gameplay still seems to fall within the technical capabilities of the company.
    2. The design is conducive to using simplified graphics during actual gameplay, reducing the impact of the platform’s performance constraints.
  3. Aesthetics
    1. A higher focus on audio might be a means to differentiate a game in a market that is already visually highly saturated.
    2. The musical theme doesn’t necessarily preclude either a fantasy or more modern/sci-fi theme, but can be made visually distinctive by introducing musical motifs and instruments to visual designs.
  4. Story
    1. This design doesn’t place any particular restrictions on the narrative elements of the game. In general, story is one of the least emphasized elements in the mobile gaming space anyways.

Weaknesses

Here are some of the challenges/risks that I personally see in this design:

  1. Gameplay
    1. Rhythm-based tower defense may still not be distinctive or enjoyable enough to attract monetizing players away from games they’re currently playing.
    2. Unclear whether this rhythm emphasis would even make the overall experience of a TD game more enjoyable.
    3. May be challenging to visualize exactly what players need to do while working with the limited screen space offered by the mobile platform.
  2. Tech
    1. A very tight demand on precision of timing. Ensuring that the game is visually accurate and consistent with the demands of the gameplay could be challenging given technical constraints of the mobile platform.
    2. The engine may not support multiple musical voices that would add impact to this design.
    3. Also, will need to determine whether to perform audio mixing while the player plays the game, or to pre-mix the audio and download the assets as part of the application before loading and playing the appropriate audio lines.
  3. Aesthetic
    1. The musically driven theming of this game might not fit into all contexts where players tend to play mobile games. Sometimes, players can’t or don’t want their game to make any noise which may be an issue if the game’s main aesthetic draw is the music.
    2. Visually driving home the musical theme may require additional assets beyond what the company already owns/has on hand.
  4. Story
    1. This design doesn’t really seem to have any particular narrative driver. Again, likely not a critical issue, but also would be nice to have.

Prototyping and Next Steps

Based on the limitations faced by this particular design, I personally feel it would be necessary to quickly create a few simple prototypes to see if the actual task of rhythmically touching particular units on the screen would be enjoyable. The first prototype (which is what I implemented) is intended to help simulate this task. The prototype I implemented has simplified features that reflect what the overall task of a rhythm based TD game might feel like. The features are as follows:

  1. 2 characters play attacking animations in time with the prototype’s sound track. One character is red, the other is blue.
  2. Along the bottom of the screen can be seen what is effectively a musical measure. As the notes in blue and red approach the right side of the measure, the Player can tap 1 for blue and 2 for red to increase the damage that character does during that particular attack animation.
  3. There is a modifiable fudge-factor of about 0.05 seconds on either side of each note’s timing. If Players press the appropriate button within this timing window, the corresponding character will produce a visual effect and do more damage during that particular attack animation.
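
To make the timing window concrete, here is a small sketch of how a press might be matched against the upcoming notes. It is not the prototype's actual code: the function names, the note representation as `(time, lane)` pairs, and the damage numbers are all hypothetical, and only the roughly 0.05 second window comes from the feature list above.

```python
TIMING_WINDOW_SECONDS = 0.05  # fudge-factor on either side of the note


def find_hit(press_time, lane, notes, window=TIMING_WINDOW_SECONDS):
    """Return the time of the closest note in this lane within the window.

    notes is a list of (note_time_seconds, lane) pairs; returns None on a miss.
    """
    candidates = [
        note_time
        for note_time, note_lane in notes
        if note_lane == lane and abs(note_time - press_time) <= window
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda t: abs(t - press_time))


def attack_damage(base_damage, was_hit, bonus_multiplier=2.0):
    """Hypothetical damage rule: on-beat presses multiply the base damage."""
    return base_damage * bonus_multiplier if was_hit else base_damage


# Example: blue notes on beats 1.0 and 2.0, a red note on 1.5.
notes = [(1.0, "blue"), (1.5, "red"), (2.0, "blue")]
hit = find_hit(press_time=1.03, lane="blue", notes=notes)
print(attack_damage(10, hit is not None))
```

Keeping the window as a parameter makes it easy to loosen or tighten the timing during playtests without touching the matching logic.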

The hope is that this helps internal playtesters envision what it would look/play like with more units. Units could be organized into categories/classes to group them under a single “note” on the musical staff in order to keep complexity lower. Additionally, there are a number of alternative ways to visualize the rhythm the Player will need to press, but I figured it would be most important to first confirm whether the task of rhythmically pressing a few buttons/spots on the screen would be enjoyable or not.

In terms of next steps, subsequent prototypes should likely experiment with better ways to visualize oncoming rhythms, control schemes that are actually implemented and tested on touch screens, and possibly the technical capacity for playing all of the different musical sources present in this design. It would also be good to take a closer look at the current state of the mobile market and consider the desired visual theme. Getting some rough concept art as well as some general estimates of feasibility from artists would also be helpful.

What I’m Reading: Balancing Metagame Systems

Between the technical test I mentioned above and completely re-writing my entire resume, I wasn’t able to finish Chapter 14: Metagame Systems from Game Balance by Ian Schreiber and Brenda Romero. That being said, so far, the chapter has covered the following:

  • Defining the difference between ranking and rating systems
  • Describing typical purposes of ranking/rating systems
  • Listing a few examples of rating systems along with their strengths and drawbacks, including:
    • Harkness rating – Used to maintain ratings between tournaments based on average scores of attendees. The system doesn’t factor in the ratings of individual opponents and fails if unusually low or high rated players are present at the tournament.
    • Elo rating – Used to maintain ratings between either single events or tournaments. The system factors in the individual skill/rating of each opponent, and how quickly ratings change can be manually adjusted by whatever authority is maintaining the ratings. The system is only valid for 2 player 1 v 1 games.
    • Glicko rating – A variation on Elo rating where the coefficient for determining the rate at which ratings change decreases as the player plays more games and the system becomes more certain of the player’s level of play. This rating system can approach accurate values rather quickly, but without external limits placed on the coefficient, highly active players may struggle to quickly change their rating, which may discourage active play.
    • TrueSkill – A rating system created by Microsoft where a player’s rating and uncertainty are both tracked. Opponents with high uncertainty will yield smaller changes to the player’s rating. Matches between players with low uncertainty (especially upsets) will produce greater changes in rating. Can handle 1 v. 1 matches as well as free-for-all and team based matches. Also has flexibility in handling team-based rating adjustments and weighting, allowing for other metrics reflecting level of individual player performance beyond simple win/loss rate to be factored in.
    • Masterpoints – A rating system where players are awarded points for wins, fewer points for draws, and no points for losses. The idea is for players to always be accumulating an increasing number of points. This rating system encourages more active play, but makes it difficult to distinguish between active players who are mediocre vs. highly-skilled players that don’t play as much. It also does not factor in relative differences between players when adjusting ratings. Rating keepers may rely on additional external modifications, like creating different categories of Masterpoints, to distinguish between high-level and low-level competitions.
    • Online Rating Systems – Online games often employ or extend one or more of the above rating systems to implement matchmaking. In general, these systems try to create matches between players of nearly equal rating (and hopefully variance), but grow the range of permissible matches as time within the matchmaking queue elapses.
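
As an illustration of the Elo and online matchmaking entries above, here is a sketch using the commonly published Elo formulas: an expected score from a logistic curve on a 400-point scale, and a K-factor that controls how fast ratings move. The matchmaking helper is a hypothetical example of widening the allowed rating gap with queue time. Its parameter names and default values are assumptions made for illustration and are not taken from the book or from any particular game.

```python
def expected_score(rating_a, rating_b):
    """Expected score (0 to 1) for player A against player B under Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return the new (rating_a, rating_b) after one 1 v 1 game.

    score_a is 1.0 for a win by A, 0.5 for a draw, and 0.0 for a loss.
    k is the adjustable factor that sets how quickly ratings change;
    32 is just an illustrative value.
    """
    expected_a = expected_score(rating_a, rating_b)
    expected_b = 1.0 - expected_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b


def can_match(rating_a, rating_b, wait_seconds,
              base_gap=50.0, growth_per_second=5.0):
    """Hypothetical rule: the allowed rating gap widens with queue time."""
    allowed_gap = base_gap + growth_per_second * wait_seconds
    return abs(rating_a - rating_b) <= allowed_gap


# Two evenly rated players: the winner gains exactly what the loser drops.
print(elo_update(1500, 1500, score_a=1.0))
```

An upset win against a much higher-rated opponent moves both ratings further than a win between equals, which is the behavior the Elo entry describes as factoring in each opponent's rating.
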
My main takeaway from all of this was that there are a number of different rating techniques one can employ, and what’s most important is to establish what the intended player experience is when dealing with your rating system before selecting an implementation and making any necessary modifications to achieve that experience.