Part 1. Predict Dota 2 match winner by the first 5 minutes of the game. Gradient Boosting.

Introduction

Dota 2 is a computer game in the MOBA (Multiplayer Online Battle Arena) genre. It is played by two teams, called Radiant and Dire which consist of five players each. The main goal of the game is to destroy other team's “Ancient", located at the opposite corners of the map. Matches are generated from a queue, taking into account the level of the game all players, known as MMR (Match Making Rank).

Each of the players choose one hero to play with from a pool of 112 heroes in the drafting stage of the game (sometimes called 'picks').

Description of the match

1. Players select "heroes"

Each of the players choose one hero to play with from a pool of 112 heroes in the drafting stage of the game (sometimes called 'picks'). Each hero has a set of features that define his role in the team and playstyle. Among these features there are his basic attribute (Strength, Agility or Intelligence) and unique set of 4 (or for some heroes even more) skills.

2. Main part

Players can receive gold and experience for killing other people's heroes or other units. The accumulated experience influences the level of the hero, which in turn makes it possible to improve abilities. For accumulated gold, players buy items that improve the characteristics of heroes or give them new abilities.

After death, the hero goes to the "tavern" and is reborn only after some time, so the team loses the player for a while, but the player can prematurely redeem the hero from the tavern for a certain amount of gold.

During the game, teams develop their heroes, defend their part of the field and attack the enemy.

3. End of the game

The game ends when one of the teams destroys a certain number of "towers" of the enemy and destroys the throne

For the first 5 minutes of the game, predict which of the teams will win: Radiant or Dire?

The training set consists of matches, for which all of the ingame events (like kills, item purchase etc.) as well as match outcome are know. You are given only the first 5 minutes of each match and you need to predict the likelihood of Radiant victory.

The data for this contest is based on the archive of public games by OpenDota.com (formerly YASP).

6 rows × 102 columns

start_time lobby_type r1_hero r1_level r1_xp r1_gold r1_lh r1_kills r1_deaths r1_items r2_hero r2_level r2_xp r2_gold r2_lh r2_kills r2_deaths r2_items r3_hero r3_level r3_xp r3_gold r3_lh r3_kills r3_deaths r3_items r4_hero r4_level r4_xp r4_gold r4_lh r4_kills r4_deaths r4_items r5_hero r5_level r5_xp r5_gold r5_lh r5_kills r5_deaths r5_items d1_hero d1_level d1_xp d1_gold d1_lh d1_kills d1_deaths d1_items d2_hero d2_level d2_xp d2_gold d2_lh d2_kills d2_deaths d2_items d3_hero d3_level d3_xp d3_gold d3_lh d3_kills d3_deaths d3_items d4_hero d4_level d4_xp d4_gold d4_lh d4_kills d4_deaths d4_items d5_hero d5_level d5_xp d5_gold d5_lh d5_kills d5_deaths d5_items first_blood_time first_blood_team first_blood_player1 first_blood_player2 radiant_bottle_time radiant_courier_time radiant_flying_courier_time radiant_tpscroll_count radiant_boots_count radiant_ward_observer_count radiant_ward_sentry_count radiant_first_ward_time dire_bottle_time dire_courier_time dire_flying_courier_time dire_tpscroll_count dire_boots_count dire_ward_observer_count dire_ward_sentry_count dire_first_ward_time duration radiant_win tower_status_radiant tower_status_dire barracks_status_radiant barracks_status_dire
1430198770 7 11 5 2098 1489 20 0 0 7 67 3 842 991 10 0 0 4 29 5 1909 1143 10 0 0 8 20 3 757 741 6 0 0 7 105 3 732 658 4 0 1 11 4 3 1058 996 12 0 0 6 42 4 1085 986 12 0 0 4 21 5 2052 1536 23 0 0 6 37 3 742 500 2 0 0 8 84 3 958 1003 3 1 0 9 7.0 1.0 9.0 134.0 -80.0 244.0 2 2 2 0 35.0 103.0 -84.0 221.0 3 4 2 2 -52.0 2874 1 1796 0 51 0
1430220345 0 42 4 1188 1033 9 0 1 12 49 4 1596 993 10 0 1 7 67 4 1506 1502 18 1 0 7 37 3 669 631 7 0 0 7 26 2 415 539 1 0 0 5 39 5 1960 1384 16 0 0 8 88 3 640 566 1 0 1 5 79 3 720 1350 2 2 0 12 7 2 440 583 0 0 0 7 12 4 1470 1622 24 0 0 9 54.0 1.0 7.0 173.0 -80.0 2 0 2 0 -20.0 149.0 -84.0 195.0 5 4 3 1 -5.0 2463 1 1974 0 63 1
1430227081 7 33 4 1319 1270 22 0 0 12 98 3 1314 775 6 0 0 6 20 3 1297 909 0 1 0 6 27 5 2360 2096 26 1 1 6 4 3 1395 1627 27 0 0 9 22 5 2305 2028 19 1 1 10 66 3 1024 959 19 0 1 10 86 3 755 620 3 0 0 8 29 4 1319 667 4 0 0 7 80 3 1350 1512 25 0 0 7 224.0 0.0 3.0 63.0 -82.0 2 5 2 1 -39.0 45.0 -77.0 221.0 3 4 3 1 13.0 2130 0 0 1830 0 63
1430263531 1 29 4 1779 1056 14 0 0 5 30 2 539 539 1 0 0 6 75 5 2037 1139 15 0 0 6 37 2 591 499 0 0 0 6 41 3 712 1075 12 0 0 6 96 5 1878 1174 17 0 0 6 48 3 732 1468 22 0 0 10 15 4 1681 1051 11 0 0 7 102 2 674 537 1 0 0 7 20 2 510 499 0 0 0 7 208.0 -75.0 0 3 2 0 -30.0 124.0 -80.0 184.0 0 4 2 0 27.0 1459 0 1920 2047 50 63
1430282290 7 13 4 1431 1090 8 1 0 8 27 2 629 552 0 0 1 7 30 3 884 927 0 1 0 8 72 3 925 1439 16 1 0 11 93 4 1482 880 7 0 0 8 26 3 704 586 1 0 2 9 69 3 1169 1665 20 1 0 7 22 3 1055 638 1 0 0 9 25 5 1815 1275 18 0 0 8 8 4 1119 904 6 0 1 7 -21.0 1.0 6.0 166.0 -81.0 181.0 1 4 2 0 46.0 182.0 -80.0 225.0 6 3 3 0 -16.0 2449 0 4 1974 3 63
1430284186 1 11 5 1961 1461 19 0 1 6 20 2 441 686 4 0 0 5 28 4 1874 1438 22 0 0 4 25 2 528 800 1 1 0 9 65 3 799 785 6 0 1 6 55 3 847 785 7 0 1 7 52 2 455 967 2 1 0 11 3 2 279 916 0 1 0 10 73 5 2065 2565 26 0 0 13 48 5 2029 1781 29 0 0 8 78.0 1.0 7.0 35.0 -85.0 182.0 5 4 2 1 -27.0 2.0 -86.0 212.0 4 4 4 0 -43.0 1453 0 512 2038 0 63

Description of features

  • match_id: match id in Dataset
  • start_time: match start time (unixtime)
  • lobby_type: type of room with gamers
  • dataset for each gamers (team of the Radiant has prefix rN, Dire — dN):
    • r1_hero: hero gamer
    • r1_level: max level of herouse (for first 5 minutes)
    • r1_xp: max expirience
    • r1_gold: max gold
    • r1_lh: count of dead units
    • r1_kills: count of killed heroes
    • r1_deaths: count of heroes death
    • r1_items: count of purchased items
  • features of event "first blood"
    • first_blood_time: game time of first blood
    • first_blood_team: first blood team (0 — Radiant, 1 — Dire)
    • first_blood_player1: Player1 involved in an event "first blood"
    • first_blood_player2: Player2 involved in an event "first blood"
  • features for each team (prefix radiant_ and dire_)
    • radiant_bottle_time: time of acquisition of item "bottle"
    • radiant_courier_time: time of acquisition of item "courier"
    • radiant_flying_courier_time: time of acquisition of item "flying_courier"
    • radiant_tpscroll_count: count of items "tpscroll" for first 5 minutes
    • radiant_boots_count: count of items "boots"
    • radiant_ward_observer_count: count of items "ward_observer"
    • radiant_ward_sentry_count: count of items "ward_sentry"
    • radiant_first_ward_time: The time of installation by the command of the first "observer", i.e. An object that allows you to see part of the playing field
  • Match result
    • duration: duration
    • radiant_win: if win the Radiant command then 1, else 0
    • Status of towers and barracks by the end of the match
      • tower_status_radiant
      • tower_status_dire
      • barracks_status_radiant
      • barracks_status_dire

Quality metric

As a quality metric, we will use the area under the ROC curve (AUC-ROC). Note that AUC-ROC is a quality metric for the algorithm that issues the first-class membership estimates. To do this, you need to get predictions using the function predict_proba. It returns two columns: the first contains ratings for belonging to the zero class, the second to the first class. We need values from the second column:

Pred = clf.predict_proba (X_test) [:, 1]

Gradien Boosting as method of possible solution

We will use method of Gradient Boosting. It is not very demanding on data, restores non-linear dependencies, and works well on many data sets, which determines its popularity. It is quite reasonable to try it first. May be next article we will try method of logistic regression.

Process of solution

  1. Read features from file
import pandas
X_train = pandas.read_csv('features.csv', index_col='match_id')

Remove features about end of match, such as duration, radiant_win, tower_status_radiant, tower_status_dire, barracks_status_radiant, barracks_status_dire

X_test = X_train.drop("duration",1).drop("radiant_win",1).drop("tower_status_radiant",1).drop("tower_status_dire",1).drop("barracks_status_radiant",1).drop("barracks_status_dire",1)

  1. Check the selection for skipped passes with the count () function, which for each column shows the number of filled values.
# find a max value
maxCount = 0
C = X_test.count()
for x in range(0, len(C)):
    if C[x] > maxCount:
        maxCount = C[x]
# OUT: maxCount = 97230

# features with skipped data
for idx,item in enumerate(C):
    if item < maxCount:
        print( C.keys()[idx],': ', maxCount-item)
# OUT:
first_blood_time :  19553
first_blood_team :  19553
first_blood_player1 :  19553
first_blood_player2 :  43987
radiant_bottle_time :  15691
radiant_courier_time :  692
radiant_flying_courier_time :  27479
radiant_first_ward_time :  1836
dire_bottle_time :  16143
dire_courier_time :  676
dire_flying_courier_time :  26098
dire_first_ward_time :  1826

Missing data is used event "first blood". If "first blood" have't been happened for 5 minutes, then values set a skipped value. For example, skipped data in first_blood_time and first_blood_team means nobody has been died for first 5 minutes.

# Replace skipped data to 0
X_test = X_test.fillna(0)

# Target value
y_test = features['radiant_win']

We will forget that there are categorical attributes in the sample, and we will try to teach a gradient boosting over trees on the existing matrix "objects-signs". Lock the split generator for cross-validation in 5 blocks (KFold), do not forget to shuffle the sample (shuffle = True), because the data in the table is sorted by time, and without undue effects you may encounter unwanted effects when evaluating the quality.

Evaluate the quality of GradientBoostingClassifier using this cross-validation, try different number of trees (10, 20, 30, 40) and measure elapsed time.

from sklearn.ensemble import GradientBoostingClassifier
import time
import datetime
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# cross-validation
kf = KFold(n_splits=5, shuffle=True)

X = X_test.as_matrix()
y = y_test.as_matrix()

for ne in [10,20,30,40]:
    start_time = datetime.datetime.now()
    # Gradient Boosting
    grd = GradientBoostingClassifier(n_estimators=ne, verbose=False, random_state=241)
    # teach
    grd.fit(X, y)
    minQuality = 1;
    for train_index, test_index in kf.split(X):
        X_train, X_cross_test = X[train_index], X[test_index]
        y_train, y_cross_test = y[train_index], y[test_index]
        # predict
        pred1 = grd.predict_proba(X_train)[:, 1]
        # find quality
        auc1 = roc_auc_score(y_train, pred1)
        
        if auc1 < minQuality:
            minQuality = auc1
    print('estimators : ', ne)
    print("minQuality : ", minQuality)
    print('Time elapsed : ', datetime.datetime.now() - start_time)
# OUT:
estimators :  10
minQuality :  0.671158427379
Time elapsed :  0:00:07.285712
estimators :  20
minQuality :  0.690562120847
Time elapsed :  0:00:13.752695
estimators :  30
minQuality :  0.699039946513
Time elapsed :  0:00:20.728346
estimators :  40
minQuality :  0.704060457972
Time elapsed :  0:00:26.331095

Is the optimum achieved at the tested values of the parameter n_estimators, or is the quality likely to continue to grow with its further increase?

if we take 60 or 80 trees, quality will increase insignificantly.

estimators :  60;  min quality :  0.7128636129
estimators :  70; min quality :  0.715895358026
estimators :  80; min quality :  0.718241911848

Conclusions

We can predict the outcome of a match with an accuracy of 71% using the method of gradient boosting. Next time I will use logistic regression for prediction and will check fineal model by new data. It is interesting to compare result of another method.


Links:

  1. The competition: Dota 2: Win Probability Prediction
  2. Repository with solution code Github.com
  3. Datasets