Part 1. Predict Dota 2 match winner by the first 5 minutes of the game. Gradient Boosting.
Bài đăng này đã không được cập nhật trong 7 năm
Introduction
Dota 2 is a computer game in the MOBA (Multiplayer Online Battle Arena) genre. It is played by two teams, called Radiant and Dire which consist of five players each. The main goal of the game is to destroy other team's “Ancient", located at the opposite corners of the map. Matches are generated from a queue, taking into account the level of the game all players, known as MMR (Match Making Rank).
Each of the players choose one hero to play with from a pool of 112 heroes in the drafting stage of the game (sometimes called 'picks').
Description of the match
1. Players select "heroes"
Each of the players choose one hero to play with from a pool of 112 heroes in the drafting stage of the game (sometimes called 'picks'). Each hero has a set of features that define his role in the team and playstyle. Among these features there are his basic attribute (Strength, Agility or Intelligence) and unique set of 4 (or for some heroes even more) skills.
2. Main part
Players can receive gold and experience for killing other people's heroes or other units. The accumulated experience influences the level of the hero, which in turn makes it possible to improve abilities. For accumulated gold, players buy items that improve the characteristics of heroes or give them new abilities.
After death, the hero goes to the "tavern" and is reborn only after some time, so the team loses the player for a while, but the player can prematurely redeem the hero from the tavern for a certain amount of gold.
During the game, teams develop their heroes, defend their part of the field and attack the enemy.
3. End of the game
The game ends when one of the teams destroys a certain number of "towers" of the enemy and destroys the throne
For the first 5 minutes of the game, predict which of the teams will win: Radiant or Dire?
The training set consists of matches, for which all of the ingame events (like kills, item purchase etc.) as well as match outcome are know. You are given only the first 5 minutes of each match and you need to predict the likelihood of Radiant victory.
The data for this contest is based on the archive of public games by OpenDota.com (formerly YASP).
6 rows × 102 columns
start_time | lobby_type | r1_hero | r1_level | r1_xp | r1_gold | r1_lh | r1_kills | r1_deaths | r1_items | r2_hero | r2_level | r2_xp | r2_gold | r2_lh | r2_kills | r2_deaths | r2_items | r3_hero | r3_level | r3_xp | r3_gold | r3_lh | r3_kills | r3_deaths | r3_items | r4_hero | r4_level | r4_xp | r4_gold | r4_lh | r4_kills | r4_deaths | r4_items | r5_hero | r5_level | r5_xp | r5_gold | r5_lh | r5_kills | r5_deaths | r5_items | d1_hero | d1_level | d1_xp | d1_gold | d1_lh | d1_kills | d1_deaths | d1_items | d2_hero | d2_level | d2_xp | d2_gold | d2_lh | d2_kills | d2_deaths | d2_items | d3_hero | d3_level | d3_xp | d3_gold | d3_lh | d3_kills | d3_deaths | d3_items | d4_hero | d4_level | d4_xp | d4_gold | d4_lh | d4_kills | d4_deaths | d4_items | d5_hero | d5_level | d5_xp | d5_gold | d5_lh | d5_kills | d5_deaths | d5_items | first_blood_time | first_blood_team | first_blood_player1 | first_blood_player2 | radiant_bottle_time | radiant_courier_time | radiant_flying_courier_time | radiant_tpscroll_count | radiant_boots_count | radiant_ward_observer_count | radiant_ward_sentry_count | radiant_first_ward_time | dire_bottle_time | dire_courier_time | dire_flying_courier_time | dire_tpscroll_count | dire_boots_count | dire_ward_observer_count | dire_ward_sentry_count | dire_first_ward_time | duration | radiant_win | tower_status_radiant | tower_status_dire | barracks_status_radiant | barracks_status_dire |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1430198770 | 7 | 11 | 5 | 2098 | 1489 | 20 | 0 | 0 | 7 | 67 | 3 | 842 | 991 | 10 | 0 | 0 | 4 | 29 | 5 | 1909 | 1143 | 10 | 0 | 0 | 8 | 20 | 3 | 757 | 741 | 6 | 0 | 0 | 7 | 105 | 3 | 732 | 658 | 4 | 0 | 1 | 11 | 4 | 3 | 1058 | 996 | 12 | 0 | 0 | 6 | 42 | 4 | 1085 | 986 | 12 | 0 | 0 | 4 | 21 | 5 | 2052 | 1536 | 23 | 0 | 0 | 6 | 37 | 3 | 742 | 500 | 2 | 0 | 0 | 8 | 84 | 3 | 958 | 1003 | 3 | 1 | 0 | 9 | 7.0 | 1.0 | 9.0 | 134.0 | -80.0 | 244.0 | 2 | 2 | 2 | 0 | 35.0 | 103.0 | -84.0 | 221.0 | 3 | 4 | 2 | 2 | -52.0 | 2874 | 1 | 1796 | 0 | 51 | 0 | |
1430220345 | 0 | 42 | 4 | 1188 | 1033 | 9 | 0 | 1 | 12 | 49 | 4 | 1596 | 993 | 10 | 0 | 1 | 7 | 67 | 4 | 1506 | 1502 | 18 | 1 | 0 | 7 | 37 | 3 | 669 | 631 | 7 | 0 | 0 | 7 | 26 | 2 | 415 | 539 | 1 | 0 | 0 | 5 | 39 | 5 | 1960 | 1384 | 16 | 0 | 0 | 8 | 88 | 3 | 640 | 566 | 1 | 0 | 1 | 5 | 79 | 3 | 720 | 1350 | 2 | 2 | 0 | 12 | 7 | 2 | 440 | 583 | 0 | 0 | 0 | 7 | 12 | 4 | 1470 | 1622 | 24 | 0 | 0 | 9 | 54.0 | 1.0 | 7.0 | 173.0 | -80.0 | 2 | 0 | 2 | 0 | -20.0 | 149.0 | -84.0 | 195.0 | 5 | 4 | 3 | 1 | -5.0 | 2463 | 1 | 1974 | 0 | 63 | 1 | ||
1430227081 | 7 | 33 | 4 | 1319 | 1270 | 22 | 0 | 0 | 12 | 98 | 3 | 1314 | 775 | 6 | 0 | 0 | 6 | 20 | 3 | 1297 | 909 | 0 | 1 | 0 | 6 | 27 | 5 | 2360 | 2096 | 26 | 1 | 1 | 6 | 4 | 3 | 1395 | 1627 | 27 | 0 | 0 | 9 | 22 | 5 | 2305 | 2028 | 19 | 1 | 1 | 10 | 66 | 3 | 1024 | 959 | 19 | 0 | 1 | 10 | 86 | 3 | 755 | 620 | 3 | 0 | 0 | 8 | 29 | 4 | 1319 | 667 | 4 | 0 | 0 | 7 | 80 | 3 | 1350 | 1512 | 25 | 0 | 0 | 7 | 224.0 | 0.0 | 3.0 | 63.0 | -82.0 | 2 | 5 | 2 | 1 | -39.0 | 45.0 | -77.0 | 221.0 | 3 | 4 | 3 | 1 | 13.0 | 2130 | 0 | 0 | 1830 | 0 | 63 | ||
1430263531 | 1 | 29 | 4 | 1779 | 1056 | 14 | 0 | 0 | 5 | 30 | 2 | 539 | 539 | 1 | 0 | 0 | 6 | 75 | 5 | 2037 | 1139 | 15 | 0 | 0 | 6 | 37 | 2 | 591 | 499 | 0 | 0 | 0 | 6 | 41 | 3 | 712 | 1075 | 12 | 0 | 0 | 6 | 96 | 5 | 1878 | 1174 | 17 | 0 | 0 | 6 | 48 | 3 | 732 | 1468 | 22 | 0 | 0 | 10 | 15 | 4 | 1681 | 1051 | 11 | 0 | 0 | 7 | 102 | 2 | 674 | 537 | 1 | 0 | 0 | 7 | 20 | 2 | 510 | 499 | 0 | 0 | 0 | 7 | 208.0 | -75.0 | 0 | 3 | 2 | 0 | -30.0 | 124.0 | -80.0 | 184.0 | 0 | 4 | 2 | 0 | 27.0 | 1459 | 0 | 1920 | 2047 | 50 | 63 | |||||
1430282290 | 7 | 13 | 4 | 1431 | 1090 | 8 | 1 | 0 | 8 | 27 | 2 | 629 | 552 | 0 | 0 | 1 | 7 | 30 | 3 | 884 | 927 | 0 | 1 | 0 | 8 | 72 | 3 | 925 | 1439 | 16 | 1 | 0 | 11 | 93 | 4 | 1482 | 880 | 7 | 0 | 0 | 8 | 26 | 3 | 704 | 586 | 1 | 0 | 2 | 9 | 69 | 3 | 1169 | 1665 | 20 | 1 | 0 | 7 | 22 | 3 | 1055 | 638 | 1 | 0 | 0 | 9 | 25 | 5 | 1815 | 1275 | 18 | 0 | 0 | 8 | 8 | 4 | 1119 | 904 | 6 | 0 | 1 | 7 | -21.0 | 1.0 | 6.0 | 166.0 | -81.0 | 181.0 | 1 | 4 | 2 | 0 | 46.0 | 182.0 | -80.0 | 225.0 | 6 | 3 | 3 | 0 | -16.0 | 2449 | 0 | 4 | 1974 | 3 | 63 | |
1430284186 | 1 | 11 | 5 | 1961 | 1461 | 19 | 0 | 1 | 6 | 20 | 2 | 441 | 686 | 4 | 0 | 0 | 5 | 28 | 4 | 1874 | 1438 | 22 | 0 | 0 | 4 | 25 | 2 | 528 | 800 | 1 | 1 | 0 | 9 | 65 | 3 | 799 | 785 | 6 | 0 | 1 | 6 | 55 | 3 | 847 | 785 | 7 | 0 | 1 | 7 | 52 | 2 | 455 | 967 | 2 | 1 | 0 | 11 | 3 | 2 | 279 | 916 | 0 | 1 | 0 | 10 | 73 | 5 | 2065 | 2565 | 26 | 0 | 0 | 13 | 48 | 5 | 2029 | 1781 | 29 | 0 | 0 | 8 | 78.0 | 1.0 | 7.0 | 35.0 | -85.0 | 182.0 | 5 | 4 | 2 | 1 | -27.0 | 2.0 | -86.0 | 212.0 | 4 | 4 | 4 | 0 | -43.0 | 1453 | 0 | 512 | 2038 | 0 | 63 |
Description of features
match_id
: match id in Datasetstart_time
: match start time (unixtime)lobby_type
: type of room with gamers- dataset for each gamers (team of the Radiant has prefix
rN
, Dire —dN
):r1_hero
: hero gamerr1_level
: max level of herouse (for first 5 minutes)r1_xp
: max expiriencer1_gold
: max goldr1_lh
: count of dead unitsr1_kills
: count of killed heroesr1_deaths
: count of heroes deathr1_items
: count of purchased items
- features of event "first blood"
first_blood_time
: game time of first bloodfirst_blood_team
: first blood team (0 — Radiant, 1 — Dire)first_blood_player1
: Player1 involved in an event "first blood"first_blood_player2
: Player2 involved in an event "first blood"
- features for each team (prefix
radiant_
anddire_
)radiant_bottle_time
: time of acquisition of item "bottle"radiant_courier_time
: time of acquisition of item "courier"radiant_flying_courier_time
: time of acquisition of item "flying_courier"radiant_tpscroll_count
: count of items "tpscroll" for first 5 minutesradiant_boots_count
: count of items "boots"radiant_ward_observer_count
: count of items "ward_observer"radiant_ward_sentry_count
: count of items "ward_sentry"radiant_first_ward_time
: The time of installation by the command of the first "observer", i.e. An object that allows you to see part of the playing field
- Match result
duration
: durationradiant_win
: if win the Radiant command then 1, else 0- Status of towers and barracks by the end of the match
tower_status_radiant
tower_status_dire
barracks_status_radiant
barracks_status_dire
Quality metric
As a quality metric, we will use the area under the ROC curve (AUC-ROC). Note that AUC-ROC is a quality metric for the algorithm that issues the first-class membership estimates. To do this, you need to get predictions using the function predict_proba. It returns two columns: the first contains ratings for belonging to the zero class, the second to the first class. We need values from the second column:
Pred = clf.predict_proba (X_test) [:, 1]
Gradien Boosting as method of possible solution
We will use method of Gradient Boosting. It is not very demanding on data, restores non-linear dependencies, and works well on many data sets, which determines its popularity. It is quite reasonable to try it first. May be next article we will try method of logistic regression.
Process of solution
- Read features from file
import pandas
X_train = pandas.read_csv('features.csv', index_col='match_id')
Remove features about end of match, such as duration, radiant_win, tower_status_radiant, tower_status_dire, barracks_status_radiant, barracks_status_dire
X_test = X_train.drop("duration",1).drop("radiant_win",1).drop("tower_status_radiant",1).drop("tower_status_dire",1).drop("barracks_status_radiant",1).drop("barracks_status_dire",1)
- Check the selection for skipped passes with the count () function, which for each column shows the number of filled values.
# find a max value
maxCount = 0
C = X_test.count()
for x in range(0, len(C)):
if C[x] > maxCount:
maxCount = C[x]
# OUT: maxCount = 97230
# features with skipped data
for idx,item in enumerate(C):
if item < maxCount:
print( C.keys()[idx],': ', maxCount-item)
# OUT:
first_blood_time : 19553
first_blood_team : 19553
first_blood_player1 : 19553
first_blood_player2 : 43987
radiant_bottle_time : 15691
radiant_courier_time : 692
radiant_flying_courier_time : 27479
radiant_first_ward_time : 1836
dire_bottle_time : 16143
dire_courier_time : 676
dire_flying_courier_time : 26098
dire_first_ward_time : 1826
Missing data is used event "first blood". If "first blood" have't been happened for 5 minutes, then values set a skipped value. For example, skipped data in first_blood_time and first_blood_team means nobody has been died for first 5 minutes.
# Replace skipped data to 0
X_test = X_test.fillna(0)
# Target value
y_test = features['radiant_win']
We will forget that there are categorical attributes in the sample, and we will try to teach a gradient boosting over trees on the existing matrix "objects-signs". Lock the split generator for cross-validation in 5 blocks (KFold), do not forget to shuffle the sample (shuffle = True), because the data in the table is sorted by time, and without undue effects you may encounter unwanted effects when evaluating the quality.
Evaluate the quality of GradientBoostingClassifier using this cross-validation, try different number of trees (10, 20, 30, 40) and measure elapsed time.
from sklearn.ensemble import GradientBoostingClassifier
import time
import datetime
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
# cross-validation
kf = KFold(n_splits=5, shuffle=True)
X = X_test.as_matrix()
y = y_test.as_matrix()
for ne in [10,20,30,40]:
start_time = datetime.datetime.now()
# Gradient Boosting
grd = GradientBoostingClassifier(n_estimators=ne, verbose=False, random_state=241)
# teach
grd.fit(X, y)
minQuality = 1;
for train_index, test_index in kf.split(X):
X_train, X_cross_test = X[train_index], X[test_index]
y_train, y_cross_test = y[train_index], y[test_index]
# predict
pred1 = grd.predict_proba(X_train)[:, 1]
# find quality
auc1 = roc_auc_score(y_train, pred1)
if auc1 < minQuality:
minQuality = auc1
print('estimators : ', ne)
print("minQuality : ", minQuality)
print('Time elapsed : ', datetime.datetime.now() - start_time)
# OUT:
estimators : 10
minQuality : 0.671158427379
Time elapsed : 0:00:07.285712
estimators : 20
minQuality : 0.690562120847
Time elapsed : 0:00:13.752695
estimators : 30
minQuality : 0.699039946513
Time elapsed : 0:00:20.728346
estimators : 40
minQuality : 0.704060457972
Time elapsed : 0:00:26.331095
Is the optimum achieved at the tested values of the parameter n_estimators, or is the quality likely to continue to grow with its further increase?
if we take 60 or 80 trees, quality will increase insignificantly.
estimators : 60; min quality : 0.7128636129
estimators : 70; min quality : 0.715895358026
estimators : 80; min quality : 0.718241911848
Conclusions
We can predict the outcome of a match with an accuracy of 71% using the method of gradient boosting. Next time I will use logistic regression for prediction and will check fineal model by new data. It is interesting to compare result of another method.
Links:
- The competition: Dota 2: Win Probability Prediction
- Repository with solution code Github.com
- Datasets
All rights reserved