Data Science and Clash Royale¶

CMSC320 Tutorial by Rohan Mathur

Introduction¶

Clash Royale is a real-time strategy mobile game created by Supercell. It combines a collectible card game, a tower defense game, and a multiplayer online battle arena. The goal is to destroy your opponent's towers while protecting your own. To do this, you build a deck of 8 cards, which you unlock through chests. Whoever takes the most towers wins, and taking the opponent's king tower wins the game immediately. On Kaggle, I found a dataset of many Clash Royale matches with detailed data recorded about each one. After looking through the data, I thought it would be interesting to analyze it and see if I could find any hidden Clash Royale trends. In this tutorial, I will walk you through my data science pipeline, including data collection/management, data exploration, hypothesis testing, and some introductory machine learning methods.

The Data¶

The data I obtained from Kaggle came as 3 CSV files. The link to the page is https://www.kaggle.com/datasets/bwandowando/clash-royale-season-18-dec-0320-dataset. The CSV with the match data ("BattlesStaging_01012021_WL_tagged.csv") is the one I will use for the majority of the analysis. The relevant columns are:
winner.card1.id / loser.card1.id -> the id of each card in the deck (these columns go up to card8)
winner.card1.level / loser.card1.level -> the level of each card (also up to card8)
winner.startingTrophies / loser.startingTrophies -> the number of trophies each player has before the match
winner.totalcard.level / loser.totalcard.level -> the combined level of every card in the deck
winner.legendary.count / loser.legendary.count -> the number of legendary cards in the deck
winner.elixir.average / loser.elixir.average -> the average elixir cost of the deck
winner.cards.list / loser.cards.list -> a list of every card in the deck
There are more columns with interesting information as well, but the ones above are the ones used for the analysis.
The wincons dataset and the CardMasterList dataset are supplementary data used to translate some of the information in the main dataset's columns. Cards are given by id number rather than name, so I use the CardMasterList to decipher them. The wincons dataset is used for the data exploration related to Clash Royale win conditions.
Below you can see the wincons dataset, a list of columns in the main dataset, and part of the main dataset.

In [ ]:
import pandas as pd
df = pd.read_csv("BattlesStaging_01012021_WL_tagged.csv", nrows = 10000)
wincons = pd.read_csv("Wincons.csv")
indexes = []
# getting rid of outdated win condition information
indexes.append(wincons[wincons['card_name'] == 'Baby Dragon'].index[0])
indexes.append(wincons[wincons['card_name'] == 'Prince'].index[0])
indexes.append(wincons[wincons['card_name'] == 'Giant Skeleton'].index[0])
indexes.append(wincons[wincons['card_name'] == 'Mega Knight'].index[0])
wincons.drop(indexes, inplace=True)
cards = pd.read_csv("CardMasterListSeason18_12082020.csv")
display(wincons.head())
# drop the columns that we don't need 
to_drop = ['battleTime' , 'winner.tag', 'winner.clan.tag', 'winner.clan.badgeId', 'loser.tag', 
'loser.clan.tag', 'loser.clan.badgeId', 'tournamentTag']
df.drop(to_drop, axis=1, inplace=True)
print(df.columns)
display(df)
id card_id card_name
0 1 26000056 Skeleton Barrel
1 2 27000002 Mortar
2 3 26000024 Royal Giant
3 4 26000067 Elixir Golem
4 5 26000021 Hog Rider
Index(['Unnamed: 0', 'arena.id', 'gameMode.id', 'average.startingTrophies',
       'winner.startingTrophies', 'winner.trophyChange', 'winner.crowns',
       'winner.kingTowerHitPoints', 'winner.princessTowersHitPoints',
       'loser.startingTrophies', 'loser.trophyChange', 'loser.crowns',
       'loser.kingTowerHitPoints', 'loser.princessTowersHitPoints',
       'winner.card1.id', 'winner.card1.level', 'winner.card2.id',
       'winner.card2.level', 'winner.card3.id', 'winner.card3.level',
       'winner.card4.id', 'winner.card4.level', 'winner.card5.id',
       'winner.card5.level', 'winner.card6.id', 'winner.card6.level',
       'winner.card7.id', 'winner.card7.level', 'winner.card8.id',
       'winner.card8.level', 'winner.cards.list', 'winner.totalcard.level',
       'winner.troop.count', 'winner.structure.count', 'winner.spell.count',
       'winner.common.count', 'winner.rare.count', 'winner.epic.count',
       'winner.legendary.count', 'winner.elixir.average', 'loser.card1.id',
       'loser.card1.level', 'loser.card2.id', 'loser.card2.level',
       'loser.card3.id', 'loser.card3.level', 'loser.card4.id',
       'loser.card4.level', 'loser.card5.id', 'loser.card5.level',
       'loser.card6.id', 'loser.card6.level', 'loser.card7.id',
       'loser.card7.level', 'loser.card8.id', 'loser.card8.level',
       'loser.cards.list', 'loser.totalcard.level', 'loser.troop.count',
       'loser.structure.count', 'loser.spell.count', 'loser.common.count',
       'loser.rare.count', 'loser.epic.count', 'loser.legendary.count',
       'loser.elixir.average'],
      dtype='object')
Unnamed: 0 arena.id gameMode.id average.startingTrophies winner.startingTrophies winner.trophyChange winner.crowns winner.kingTowerHitPoints winner.princessTowersHitPoints loser.startingTrophies ... loser.cards.list loser.totalcard.level loser.troop.count loser.structure.count loser.spell.count loser.common.count loser.rare.count loser.epic.count loser.legendary.count loser.elixir.average
0 0 54000050.0 72000006.0 5363.0 5372.0 28.0 2.0 4145.0 [1484] 5354.0 ... [26000000, 26000015, 26000023, 27000004, 28000... 104 3 1 4 1 1 4 2 3.500
1 1 54000050.0 72000006.0 5407.0 5409.0 29.0 1.0 5304.0 [579, 3082] 5405.0 ... [26000023, 26000027, 26000037, 26000046, 26000... 104 6 1 1 0 1 2 5 4.250
2 2 54000050.0 72000006.0 5741.0 5749.0 28.0 2.0 5762.0 [2080, 2099] 5733.0 ... [26000022, 26000027, 26000028, 26000041, 26000... 104 7 0 1 4 2 1 1 4.125
3 3 54000050.0 72000006.0 4307.0 4316.0 28.0 2.0 4392.0 [1322] 4298.0 ... [26000012, 26000027, 26000031, 26000033, 26000... 80 6 1 1 2 1 2 3 3.750
4 4 54000050.0 72000006.0 5776.5 5783.0 28.0 3.0 5832.0 [3668, 3668] 5770.0 ... [26000010, 26000011, 26000021, 26000037, 26000... 104 5 1 2 2 4 0 2 3.250
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 9995 54000050.0 72000006.0 4936.0 4915.0 33.0 1.0 5832.0 [3242, 1063] 4957.0 ... [26000017, 26000024, 26000025, 26000055, 26000... 103 5 1 2 4 1 2 1 4.500
9996 9996 54000050.0 72000006.0 4540.0 4543.0 29.0 2.0 3436.0 [2014] 4537.0 ... [26000000, 26000001, 26000016, 26000021, 26000... 93 5 0 3 4 2 1 1 3.125
9997 9997 54000050.0 72000006.0 4441.0 4442.0 29.0 3.0 4824.0 [3052, 2734] 4440.0 ... [26000005, 26000011, 26000015, 26000017, 26000... 91 7 0 1 2 4 1 1 3.750
9998 9998 54000050.0 72000006.0 5663.0 5684.0 26.0 1.0 5832.0 [711, 2363] 5642.0 ... [26000004, 26000012, 26000015, 26000036, 26000... 104 6 0 2 2 2 4 0 3.875
9999 9999 54000050.0 72000006.0 5290.0 5304.0 27.0 1.0 5832.0 [1131, 2632] 5276.0 ... [26000027, 26000042, 26000049, 26000054, 26000... 102 5 1 2 2 2 2 2 4.000

10000 rows × 66 columns

Data Exploration¶

With the main dataset, there are a couple of things I want to look into. First, I want to examine the distribution of cards among both the winners and the losers. With that, we can see whether certain cards are used more often than others, and whether winners use some cards more often than losers do. I also want to look into the frequency of win conditions. A win condition in Clash Royale is the centerpiece of your deck: the card that is your means of taking your opponent's towers and winning the game. I want to see which win conditions are the most common and how their frequency differs between winners and losers. Below this, you can see the distribution of cards on the winners' side.

In [ ]:
import matplotlib.pyplot as plt
# make a list containing all of the cards in all of the winners' decks
winnerCards = []
for i in range(1, 9):
    winnerCards.extend(df[f'winner.card{i}.id'].tolist())
print(len(winnerCards))
winnerCardsSeries = pd.Series(winnerCards)
print(winnerCardsSeries.unique())
print(winnerCardsSeries.value_counts())
# plot a frequency distribution of the cards
winnerCardsSeries.value_counts().plot.bar(figsize=(20,8), color='cornflowerblue')
plt.title('Card Distribution among Winners', fontsize=20)
plt.xlabel('Cards', fontsize=14)
plt.ylabel('Number of Decks which Include a Card', fontsize=14)
plt.show()
80000
[26000008 26000056 26000044 28000004 28000011 26000041 26000061 26000055
 26000004 26000023 28000015 26000009 26000015 26000047 27000003 26000032
 26000017 26000037 26000034 28000001 26000005 26000018 26000049 26000059
 26000001 26000010 27000006 28000002 26000028 26000029 27000002 26000000
 26000040 28000010 26000024 28000003 26000054 28000008 28000007 26000011
 26000080 26000085 28000012 26000035 26000003 26000019 26000043 28000009
 26000046 26000042 26000026 26000012 26000006 26000014 26000027 26000063
 26000020 27000010 26000064 26000030 26000051 26000062 28000018 26000053
 28000000 26000021 26000038 26000045 26000060 26000068 28000017 28000006
 26000048 27000004 26000039 26000067 26000016 26000007 27000008 26000052
 26000084 26000033 26000022 26000083 28000016 28000005 28000014 26000025
 27000001 26000036 26000057 26000058 26000002 26000013 27000012 26000031
 26000050 28000013 27000000 27000007 27000005 27000009]
28000008    3133
28000011    2998
28000000    2502
28000001    2153
26000000    2038
            ... 
28000018     129
26000002     105
27000005      93
26000085      83
26000028      80
Length: 102, dtype: int64

From the value counts command and the graph above, we can see the 5 most common cards on the winners' side are 28000008 (zap), 28000011 (the log), 28000000 (fireball), 28000001 (arrows), and 26000000 (knight). The top 2 are quite close, the third is a little further down, and the 4th and 5th are close to each other but further down still. This is interesting because the top 4 most common cards are all spell cards. Perhaps spells are more common because they are versatile and can fit into many decks. The fifth most common card is the knight, a very generic troop that can be used in many situations, both to attack and to defend. The trend among these cards seems to be that versatility is the most important quality.
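Raw id labels make charts like this hard to read. A small id-to-name lookup built from the card master list can relabel a value_counts result directly; below is a minimal sketch with a hand-made slice of the mapping (the card_id/card_name column names mirror the wincons preview above and are an assumption for the real CardMasterList file):

```python
import pandas as pd

# Hypothetical slice of the card master list; real column names may differ.
cards = pd.DataFrame({
    "card_id": [28000008, 28000011, 28000000, 28000001, 26000000],
    "card_name": ["Zap", "The Log", "Fireball", "Arrows", "Knight"],
})

# Build an id -> name lookup so value_counts index labels become readable.
id_to_name = dict(zip(cards["card_id"], cards["card_name"]))

# Relabel a small toy frequency count with card names.
counts = pd.Series([28000008, 28000011, 28000008]).value_counts()
readable = counts.rename(index=id_to_name)
print(readable)
```

With the real CardMasterList dataframe, the same `dict(zip(...))` line would produce a full lookup in one pass.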

Below this, we will look at the winners' win conditions.

In [ ]:
import numpy as np
# Make a list of every win condition card in the winner decks
wincon_ids = set(wincons['card_id'])
winner_wincons = [card for card in winnerCards if card in wincon_ids]

print("Number of wincons present in the winner decks: " + str(len(winner_wincons)))
winner_wincons_series = pd.Series(winner_wincons)
# plot the frequency distribution of the win conditions
winner_wincons_series.value_counts().plot.bar(figsize=(20,8), color='cornflowerblue')
plt.title('Win Condition Card Distribution among Winners', fontsize=20)
plt.xlabel('Win Condition Cards', fontsize=14)
plt.ylabel('Number of Decks which Include a Card', fontsize=14)
plt.show()
Number of wincons present in the winner decks: 12261

From the graph above, we can see that the most common win conditions on the winners' side are 26000021 (Hog Rider), 26000006 (Balloon), 28000004 (Goblin Barrel), 26000032 (Miner), and 26000009 (Golem). Hog rider and balloon have similar frequencies, and after that the distribution trails off. These win conditions do not have much in common beyond some basic similarities. Hog rider is a rush attacker: it is fast and strikes towers directly. Balloon is an air troop that also goes straight for towers and does high damage, but it is slower. Goblin barrel and miner are both chip-damage cards; they can be sent anywhere on the map, so they slowly but surely wear down the enemy towers. Golem is a high-health tank that goes directly for the tower but is very slow.

In [ ]:
loserCards = []
# make a list containing all of the cards in all of the losers' decks
for i in range(1, 9):
    loserCards.extend(df[f'loser.card{i}.id'].tolist())
print(len(loserCards))
loserCardsSeries = pd.Series(loserCards)
print(loserCardsSeries.unique())
print(loserCardsSeries.value_counts())
# plot the distribution of cards
loserCardsSeries.value_counts().plot.bar(figsize=(20,8), color='maroon')
plt.title('Card Distribution among Losers', fontsize=20)
plt.xlabel('Cards', fontsize=14)
plt.ylabel('Number of Decks which Include a Card', fontsize=14)
plt.show()
80000
[27000004 26000037 26000046 26000027 26000021 26000009 26000000 26000004
 26000042 28000011 26000052 26000012 28000009 28000000 26000049 28000003
 26000030 26000036 26000026 27000003 26000055 26000017 26000007 26000010
 26000032 26000059 26000018 26000056 28000001 26000033 26000011 26000041
 26000013 26000022 28000014 26000016 28000008 26000031 26000020 27000010
 28000002 26000051 26000006 26000005 26000043 26000029 28000004 26000014
 26000083 26000058 26000044 28000012 26000003 26000035 26000008 26000024
 26000047 26000039 28000015 26000064 26000054 28000007 26000001 28000010
 26000048 26000057 27000012 26000019 26000060 26000053 26000015 26000040
 27000009 27000008 26000038 28000017 28000013 27000002 26000023 26000084
 28000005 26000067 26000085 26000061 26000062 26000063 28000006 26000034
 28000016 27000007 26000045 26000028 27000005 26000002 26000025 27000006
 26000080 27000001 26000050 27000000 26000068 28000018]
28000008    2955
28000011    2882
28000000    2534
26000011    2360
28000001    2190
            ... 
26000053     106
26000060     105
28000018     102
26000085      87
26000028      82
Length: 102, dtype: int64

From the above graph and value counts, we can see the five most common cards on the losers' side are 28000008 (zap), 28000011 (the log), 28000000 (fireball), 26000011 (valkyrie), and 28000001 (arrows). This top 5 is extremely similar to the winners' top 5; in fact, the graphs themselves have very similar distributions, and the two sides share 4 of the 5 cards. The differences are that the losers have valkyrie in 4th place, which pushes arrows down to 5th and knocks knight out of their top 5 entirely. Valkyrie is a very strong card because it has a lot of health and does area-of-effect damage, so it can hit multiple things at once. From this data alone it is difficult to explain the association between losing and valkyrie, and more analysis would need to be done. I would guess the correlation has less to do with the valkyrie itself than with how a player can become dependent on certain cards.
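One way to push this winner-versus-loser comparison further is to look at each card's share of all card slots on each side and rank the gaps. The sketch below uses three toy counts loosely based on the printed value counts above (the third id stands in for valkyrie); the real analysis would feed in the full `value_counts()` results:

```python
import pandas as pd

# Toy counts standing in for winnerCardsSeries/loserCardsSeries.value_counts().
winner_counts = pd.Series({28000008: 3133, 26000000: 2038, 26000011: 1900})
loser_counts = pd.Series({28000008: 2955, 26000000: 1800, 26000011: 2360})

# Fraction of each side's card slots that each card occupies.
winner_share = winner_counts / winner_counts.sum()
loser_share = loser_counts / loser_counts.sum()

# Positive gap -> relatively more common among winners; negative -> among losers.
gap = (winner_share - loser_share).sort_values(ascending=False)
print(gap)
```

Ranking the full gap series would show at a glance which cards lean toward the winning or the losing side.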

In [ ]:
loser_wincons = []
# Make a list of every win condition card in the loser decks
for i in range(0, len(loserCards)):
    if loserCards[i] in wincons['card_id'].to_list():
        loser_wincons.append(loserCards[i])

print("Number of wincons present in the loser decks: " + str(len(loser_wincons)))
loser_wincons_series = pd.Series(loser_wincons) 
# plot the frequency distribution of the win conditions
loser_wincons_series.value_counts().plot.bar(figsize=(20,8), color='maroon')
plt.title('Win Condition Card Distribution among Losers', fontsize=20)
plt.xlabel('Win Condition Cards', fontsize=14)
plt.ylabel('Number of Decks which Include a Card', fontsize=14)
plt.show
Number of wincons present in the loser decks: 11924

From the above graph, we can see that the five most common win conditions on the losers' side are 26000021 (hog rider), 26000006 (balloon), 28000004 (goblin barrel), 26000032 (miner), and 26000009 (golem). Once again, the losers' distribution looks extremely similar to the winners' distribution; in fact, the top 5 cards are exactly the same. With so little difference in the win condition distribution between winners and losers, it could mean that the cards in a deck matter less than a player's skill. It could also mean that certain cards are so oppressive that everyone is forced to play them just to have a chance at winning. Either way, the graph data shows that there is a "meta" in Clash Royale. A "meta" in a gaming sense means the most effective tactics available, and players often have to conform to the meta to have a chance at winning. Based on the graphs, you could conclude that a hog rider deck is the meta, as it is popular among both winners and losers. However, the number of players using balloon is also very high, so no single strategy dominates. That is good, because having multiple viable strategies leads to more creativity and more fun.

In [ ]:
# get the top 10 most common decks for winners
top_decks_winner = df['winner.cards.list'].value_counts()[0:10]
print(top_decks_winner)
# Plot the decks
ax = top_decks_winner.plot.bar(figsize=(20,8), color='cornflowerblue')
ax.set_xticklabels(['Deck 1', 'Deck 2', 'Deck 3', 'Deck 4', 'Deck 5', 'Deck 6', 'Deck 7', 'Deck 8', 'Deck 9', 'Deck 10'])
plt.title('Top 10 Decks among Winners', fontsize=20)
plt.xlabel('Decks', fontsize=14)
plt.ylabel('Frequency of a Deck', fontsize=14)
plt.show()
[26000000, 26000026, 26000030, 26000041, 27000003, 28000003, 28000004, 28000011]    175
[26000000, 26000001, 26000010, 26000030, 27000006, 27000008, 28000000, 28000011]    129
[26000004, 26000036, 26000042, 26000046, 26000050, 26000062, 28000008, 28000009]    129
[26000010, 26000014, 26000021, 26000030, 26000038, 27000000, 28000000, 28000011]    129
[26000006, 26000008, 26000029, 26000032, 26000037, 26000080, 28000001, 28000008]    107
[26000000, 26000010, 26000023, 27000006, 27000008, 28000003, 28000011, 28000012]     82
[26000000, 26000019, 26000032, 26000049, 26000058, 27000004, 28000000, 28000011]     63
[26000000, 26000015, 26000023, 27000004, 28000009, 28000010, 28000012, 28000015]     60
[26000009, 26000015, 26000023, 26000027, 26000048, 28000007, 28000012, 28000015]     59
[26000003, 26000016, 26000027, 26000032, 26000039, 26000042, 28000000, 28000008]     58
Name: winner.cards.list, dtype: int64

In the above graph, we can see the frequency distribution of the top 10 most common winner decks. I will briefly discuss the top two. The first is 26000000 (knight), 26000026 (princess), 26000030 (ice spirit), 26000041 (goblin gang), 27000003 (inferno tower), 28000003 (rocket), 28000004 (goblin barrel), and 28000011 (the log). This appears to be a version of the "log bait" deck. A log bait deck's win condition is the goblin barrel, one of the top five win conditions from above, and it also runs the log and the knight, two of the top five most common cards. A player using this strategy consistently uses their cards to chip away at the enemy tower's health while defending enemy pushes. Perhaps it is the most common because it is easy to learn and very spam-based. The next few decks are tied, so I arbitrarily chose this one to look at: 26000000 (knight), 26000001 (archers), 26000010 (skeletons), 26000030 (ice spirit), 27000006 (tesla), 27000008 (X-bow), 28000000 (fireball), 28000011 (the log). This deck has three of our top five most common cards (knight, the log, and fireball), and its win condition is X-bow, which is not in our top five win conditions. X-bow is a card that can attack the enemy tower directly while staying out of range of its automatic defenses; if the enemy does not distract it and deal with it quickly, it can do a lot of damage. The rest of the cards have very low elixir costs, so you can play them quickly and "cycle" back to the X-bow to play it again. It also seems easy to use and relatively spam-based. It makes sense that the popular decks are ones that are easy to use and can easily overwhelm an opponent.
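Since value_counts works here, the cards.list column must store each deck as a string like "[26000000, ...]", so inspecting a deck by name takes two steps: parse the string back into a list of ids, then translate the ids. A hedged sketch, with a hypothetical id-to-name mapping hard-coded from the discussion above:

```python
import ast

# A deck as stored in the cards.list column: a stringified list of ids.
deck_str = "[26000000, 26000026, 26000030, 26000041]"
# literal_eval safely recovers the actual Python list from the string.
deck_ids = ast.literal_eval(deck_str)

# Hypothetical id -> name mapping (names taken from the deck discussed above);
# the real mapping would come from the CardMasterList file.
id_to_name = {26000000: "Knight", 26000026: "Princess",
              26000030: "Ice Spirit", 26000041: "Goblin Gang"}
deck_names = [id_to_name.get(i, str(i)) for i in deck_ids]
print(deck_names)
```

Mapping this over the whole column would let the top-deck tables above be printed with card names instead of ids.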

In [ ]:
# get the top 10 most common decks for losers 
top_decks_loser = df['loser.cards.list'].value_counts()[0:10]
print(top_decks_loser)
# Plot the decks
ax = top_decks_loser.plot.bar(figsize=(20,8), color='maroon')
ax.set_xticklabels(['Deck 1', 'Deck 2', 'Deck 3', 'Deck 4', 'Deck 5', 'Deck 6', 'Deck 7', 'Deck 8', 'Deck 9', 'Deck 10'])
plt.title('Top 10 Decks among Losers', fontsize=20)
plt.xlabel('Decks', fontsize=14)
plt.ylabel('Frequency of a Deck', fontsize=14)
plt.show()
[26000000, 26000026, 26000030, 26000041, 27000003, 28000003, 28000004, 28000011]    159
[26000010, 26000014, 26000021, 26000030, 26000038, 27000000, 28000000, 28000011]    135
[26000000, 26000001, 26000010, 26000030, 27000006, 27000008, 28000000, 28000011]     78
[26000000, 26000010, 26000023, 27000006, 27000008, 28000003, 28000011, 28000012]     66
[26000004, 26000036, 26000042, 26000046, 26000050, 26000062, 28000008, 28000009]     62
[26000006, 26000008, 26000029, 26000032, 26000037, 26000080, 28000001, 28000008]     55
[26000000, 26000019, 26000032, 26000049, 26000058, 27000004, 28000000, 28000011]     52
[26000000, 26000015, 26000023, 27000004, 28000009, 28000010, 28000012, 28000015]     51
[26000000, 26000026, 26000030, 26000041, 27000006, 28000003, 28000004, 28000011]     49
[26000009, 26000015, 26000035, 26000039, 26000048, 28000007, 28000012, 28000015]     49
Name: loser.cards.list, dtype: int64

In the above graph, we can see the frequency distribution of the top 10 most common loser decks. I will briefly discuss the top two. The first is 26000000 (knight), 26000026 (princess), 26000030 (ice spirit), 26000041 (goblin gang), 27000003 (inferno tower), 28000003 (rocket), 28000004 (goblin barrel), and 28000011 (the log). Interestingly enough, this is the exact same log bait deck that we saw on the winners' side, likely because of the deck's sheer prevalence: if a deck is very common, it will naturally accumulate many wins and many losses. If it had many wins but few losses, that would indicate it is probably the most effective strategy in the game; with many of both, it looks strong and popular but not overpowered. The second deck is 26000010 (skeletons), 26000014 (musketeer), 26000021 (hog rider), 26000030 (ice spirit), 26000038 (ice golem), 27000000 (cannon), 28000000 (fireball), 28000011 (the log). This deck is also tied for second on the winners' side, though it was not the one I discussed there. It has two of our top 5 loser cards (the log and fireball) and the most popular loser win condition, hog rider. The deck is called hog rider cycle and is based on overwhelming your opponent by sending the hog rider as often as possible to continually attack. It is also relatively easy to learn and spam-based, which seems to be a pattern among the top decks. Overall, looking at the winners' and losers' top decks makes it easy to see which decks are the most popular. Since the same decks appear on both sides, they seem balanced rather than overpowered or too hard to deal with. While it is inevitable for certain strategies to be the most popular, it is good for the game not to have unbeatable ones.
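The "balanced, not overpowered" reading could be checked numerically: aligning the winner and loser deck counts gives a rough per-deck win rate. A sketch with toy deck labels and counts taken from the two tables above (a rate near 0.5 supports the balance argument):

```python
import pandas as pd

# Toy counts standing in for the winner and loser value_counts() results.
wins = pd.Series({"log bait": 175, "hog cycle": 129, "xbow cycle": 129})
losses = pd.Series({"log bait": 159, "hog cycle": 135, "xbow cycle": 78})

# Align the two series on deck label and estimate a per-deck win rate.
games = wins.add(losses, fill_value=0)
win_rate = (wins / games).sort_values(ascending=False)
print(win_rate.round(3))
```

With the full value_counts results (keyed by the cards.list strings) the same three lines would rank every common deck by its observed win rate, though rates from only 10,000 sampled matches should be read cautiously.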

Hypothesis Testing¶

There are a couple of statistics I want to check for differences between winners and losers. To do this, I am going to perform several two-sample t-tests on columns of our dataframe. This can help us see if, and how, certain factors contribute to wins and losses.
Learn about t-tests:
https://www.geeksforgeeks.org/how-to-conduct-a-two-sample-t-test-in-python/
https://towardsdatascience.com/t-test-and-hypothesis-testing-explained-simply-1cff6358633e

In [ ]:
import scipy.stats as stats
# 2 sample t tests
# Elixir Average
print("Elixir Average:")
print(df['winner.elixir.average'].mean())
print(df['loser.elixir.average'].mean())
print(stats.ttest_ind(df['winner.elixir.average'], df['loser.elixir.average']))
# Spell Counts
print("Spell Counts:")
print(df['winner.spell.count'].mean())
print(df['loser.spell.count'].mean())
print(stats.ttest_ind(df['winner.spell.count'], df['loser.spell.count']))
# Legendary Counts
print("Legendary Counts:")
print(df['winner.legendary.count'].mean())
print(df['loser.legendary.count'].mean())
print(stats.ttest_ind(df['winner.legendary.count'], df['loser.legendary.count']))
# Total Card Levels
print("Total Card Levels:")
print(df['winner.totalcard.level'].mean())
print(df['loser.totalcard.level'].mean())
print(stats.ttest_ind(df['winner.totalcard.level'], df['loser.totalcard.level']))
# Starting Trophies
print("Starting Trophies")
print(df['winner.startingTrophies'].mean())
print(df['loser.startingTrophies'].mean())
print(stats.ttest_ind(df['winner.startingTrophies'], df['loser.startingTrophies']))
Elixir Average:
3.761382142857143
3.796941071428571
Ttest_indResult(statistic=-4.949226778999882, pvalue=7.511465939739547e-07)
Spell Counts:
2.1064
2.0718
Ttest_indResult(statistic=2.9472044651245857, pvalue=0.0032103129941737107)
Legendary Counts:
1.6639
1.584
Ttest_indResult(statistic=4.994397046861575, pvalue=5.951686925696776e-07)
Total Card Levels:
99.4396
98.7528
Ttest_indResult(statistic=4.562144040400342, pvalue=5.093467257694242e-06)
Starting Trophies
5221.8336
5220.9394
Ttest_indResult(statistic=0.08536764645481916, pvalue=0.9319699682156413)
Results¶

The elixir averages, spell counts, legendary card counts, and total card levels all have p-values below 0.05, so in each case we can reject the null hypothesis and conclude that the winner and loser population means differ. It seems that winner decks typically have a lower average elixir cost, more spell cards, more legendary cards, and higher total card levels. Higher card levels and more legendaries make sense, since both directly help a player win. The lower elixir average and extra spell cards are more interesting, as their link to winning is less obvious. Perhaps spells are more versatile than other cards, and a lower elixir average gives a player more options and a faster cycle.
The starting-trophies t-test has a p-value above 0.05, so we fail to reject the null hypothesis and cannot conclude that the winner and loser population means differ with respect to starting trophies. This makes sense because the game tries to match players of equal skill by pairing those with similar trophy counts, which could be taken as evidence that the matchmaking system is fair. On the other hand, we saw that total card levels differ between winners and losers, so despite similar trophy counts, one player may have much higher card levels than the other, which could make a match unfair. One could argue that a skill difference can overcome this gap; more analysis of the matchmaking system would be needed before concluding whether it is fair.
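One caveat worth adding: with 10,000 rows per group, even tiny mean differences produce small p-values, so a standardized effect size helps judge whether the significant gaps above actually matter. A minimal Cohen's d sketch (cohens_d is a hand-rolled helper, and the samples are simulated stand-ins, not the real columns):

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using a pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

# Simulated groups with a small mean shift: large n makes the t-test
# "significant" while the effect size stays small.
rng = np.random.default_rng(0)
winners = rng.normal(3.76, 0.5, 10000)
losers = rng.normal(3.80, 0.5, 10000)
print(round(cohens_d(winners, losers), 3))
```

Applied to the real winner/loser columns, a |d| well under 0.2 would suggest the differences, while real, are practically small.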

Classification¶

I thought it would be an interesting task to see if I could use a machine learning classifier to predict the number of crowns the winner takes in a game, based on the cards and card levels in both players' decks. You gain a crown for every enemy tower you destroy: 1, 2, or 3 per game. The classifiers I tried were K Nearest Neighbors and Random Forest. I used grid search to find good hyperparameters for each model and 10-fold cross-validation to estimate accuracy.
Learn about Random Forests: https://www.datacamp.com/tutorial/random-forests-classifier-python
Learn about K Nearest Neighbors: https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn

In [ ]:
# Classification Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score,KFold
from sklearn import metrics
from numpy import mean
from numpy import absolute
import pandas as pd
import scipy.stats as stats

# Features
x_set = df[['winner.card1.id', 'winner.card1.level', 'winner.card2.id', 'winner.card2.level',
'winner.card3.id', 'winner.card3.level', 'winner.card4.id', 'winner.card4.level',
'winner.card5.id', 'winner.card5.level', 'winner.card6.id', 'winner.card6.level',
'winner.card7.id', 'winner.card7.level', 'winner.card8.id', 'winner.card8.level',
'loser.card1.id', 'loser.card1.level', 'loser.card2.id', 'loser.card2.level',
'loser.card3.id', 'loser.card3.level', 'loser.card4.id', 'loser.card4.level',
'loser.card5.id', 'loser.card5.level', 'loser.card6.id', 'loser.card6.level',
'loser.card7.id', 'loser.card7.level', 'loser.card8.id', 'loser.card8.level',]]
# Labels
y_set = df['winner.crowns']
X_train, X_test, y_train, y_test = train_test_split(x_set, y_set, test_size=0.2, random_state=1)

# K Nearest Neighbors

knn = KNeighborsClassifier()
params = {
    'n_neighbors': [3,5,7,9,11,13,15,17,21,23,25]
}
# gridsearch
clf2 = GridSearchCV(
    estimator=knn,
    param_grid=params,
    cv=5,
    n_jobs=5,
    verbose=1
)
# fitting
clf2.fit(X_train, y_train)
print(clf2.best_params_)
neighbors = clf2.best_params_["n_neighbors"]
knn2 = KNeighborsClassifier(n_neighbors=neighbors)
knn2.fit(X_train, y_train)
y_pred = knn2.predict(X_test)
# Holdout predictions
print("KNN Holdout Accuracy:",metrics.accuracy_score(y_test, y_pred))
# Cross Validation
kf=KFold(n_splits=10)
score2=cross_val_score(knn2, x_set, y_set,cv=kf)
error_scores2 = cross_val_score(knn2, x_set, y_set, scoring='neg_mean_absolute_error',cv=kf)
print("KNN Cross Validation Scores are {}".format(score2))
print("KNN Average Cross Validation score: {}".format(score2.mean()))
print("KNN Mean Absolute Error: {}".format(mean(absolute(error_scores2))))

# Random Forest

forest = RandomForestClassifier()
params = {'n_estimators': [5, 10, 25, 50, 100, 150, 200], 'max_depth': [5, 10, 20, 40, 80]}

#grid search
clf = GridSearchCV(
    estimator=forest,
    param_grid=params,
    cv=5,
    n_jobs=5,
    verbose=1
)
# fitting
clf.fit(X_train, y_train)
print(clf.best_params_)
n_val = clf.best_params_["n_estimators"]
depth = clf.best_params_["max_depth"]
forest2 = RandomForestClassifier(n_estimators=n_val, max_depth = depth)
forest2.fit(X_train, y_train)
y_pred=forest2.predict(X_test)
# Holdout predictions
print("RF Holdout Accuracy:",metrics.accuracy_score(y_test, y_pred))
# Cross Validation
kf=KFold(n_splits=10)
score=cross_val_score(forest2,x_set, y_set,cv=kf)
error_scores = cross_val_score(forest2,x_set, y_set, scoring='neg_mean_absolute_error',cv=kf)
print("RF Cross Validation Scores are {}".format(score))
print("RF Average Cross Validation score: {}".format(score.mean()))
print("RF Mean Absolute Error: {}".format(mean(absolute(error_scores))))

# 2 Sample t test
stats.ttest_ind(score2, score)
Fitting 5 folds for each of 11 candidates, totalling 55 fits
{'n_neighbors': 25}
KNN Holdout Accuracy: 0.547
KNN Cross Validation Scores are [0.54  0.556 0.539 0.542 0.509 0.521 0.517 0.486 0.506 0.511]
KNN Average Cross Validation score: 0.5227000000000002
KNN Mean Absolute Error: 0.6943
Fitting 5 folds for each of 35 candidates, totalling 175 fits
{'max_depth': 20, 'n_estimators': 200}
RF Holdout Accuracy: 0.5675
RF Cross Validation Scores are [0.573 0.601 0.57  0.564 0.549 0.569 0.548 0.543 0.541 0.531]
RF Average Cross Validation score: 0.5589
RF Mean Absolute Error: 0.6454000000000001
Out[ ]:
Ttest_indResult(statistic=-3.8827833633961824, pvalue=0.0010904617094874593)
Classifier Results¶

Unfortunately, our classifiers were not able to reliably predict the number of crowns the winner would get given only the decks and card levels of the winner and loser. The K Nearest Neighbors model achieved around 52% cross-validation accuracy and the Random Forest model around 56%. The t-test performed on their cross-validation scores produced a p-value below 0.05, so we can reject the null hypothesis and say the Random Forest's average accuracy was better than KNN's. In the future, this could be revisited with different classifiers or hyperparameters, with different labels, or with additional features.
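To judge how weak "around 52-56%" really is, it helps to compare against the majority-class baseline: the accuracy of a trivial model that always predicts the most common crown count. A sketch with toy labels standing in for the real winner.crowns column:

```python
import pandas as pd

# Toy crown labels standing in for df['winner.crowns'].
y = pd.Series([1, 1, 2, 1, 3, 2, 1, 1, 3, 1])

# The majority-class rate is the accuracy floor any real model must beat.
baseline = y.value_counts(normalize=True).max()
print(baseline)
```

Running the same two lines on the real label column would show how much (if at all) the KNN and Random Forest accuracies exceed always guessing the modal crown count.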

Conclusion¶

In conclusion, by going through the data science pipeline, we were able to gain some insights into Clash Royale. We learned about the most popular cards, win conditions, and decks for both winners and losers. We inferred that the game is reasonably balanced, since the common decks and cards appear among both winners and losers. We performed t-tests on several columns to see how certain factors can affect a game, and we attempted to build a classifier to predict how many crowns a winner would take. All in all, we went through the data science lifecycle while digging a little deeper into Clash Royale.