Đã đăng vào thg 4 20, 2017 4:00 CH 6 phút đọc

1.8K

Part 1. Predict Dota 2 match winner by the first 5 minutes of the game. Gradient Boosting.

Bài đăng này đã không được cập nhật trong 9 năm

Introduction

Dota 2 is a computer game in the MOBA (Multiplayer Online Battle Arena) genre. It is played by two teams, called Radiant and Dire which consist of five players each. The main goal of the game is to destroy other team's “Ancient", located at the opposite corners of the map. Matches are generated from a queue, taking into account the level of the game all players, known as MMR (Match Making Rank).

Each of the players choose one hero to play with from a pool of 112 heroes in the drafting stage of the game (sometimes called 'picks').

Description of the match

1. Players select "heroes"

Each of the players choose one hero to play with from a pool of 112 heroes in the drafting stage of the game (sometimes called 'picks'). Each hero has a set of features that define his role in the team and playstyle. Among these features there are his basic attribute (Strength, Agility or Intelligence) and unique set of 4 (or for some heroes even more) skills.

2. Main part

Players can receive gold and experience for killing other people's heroes or other units. The accumulated experience influences the level of the hero, which in turn makes it possible to improve abilities. For accumulated gold, players buy items that improve the characteristics of heroes or give them new abilities.

After death, the hero goes to the "tavern" and is reborn only after some time, so the team loses the player for a while, but the player can prematurely redeem the hero from the tavern for a certain amount of gold.

During the game, teams develop their heroes, defend their part of the field and attack the enemy.

3. End of the game

The game ends when one of the teams destroys a certain number of "towers" of the enemy and destroys the throne

For the first 5 minutes of the game, predict which of the teams will win: Radiant or Dire?

The training set consists of matches, for which all of the ingame events (like kills, item purchase etc.) as well as match outcome are know. You are given only the first 5 minutes of each match and you need to predict the likelihood of Radiant victory.

The data for this contest is based on the archive of public games by OpenDota.com (formerly YASP).

6 rows × 102 columns

start_time	lobby_type	r1_hero	r1_level	r1_xp	r1_gold	r1_lh	r1_kills	r1_deaths	r1_items	r2_hero	r2_level	r2_xp	r2_gold	r2_lh	r2_deaths	r2_items	r3_hero	r3_level	r3_xp	r3_gold	r3_lh	r3_kills	r3_items	r4_hero	r4_level	r4_xp	r4_gold	r4_lh	r4_kills	r4_deaths	r4_items	r5_hero	r5_level	r5_xp	r5_gold	r5_lh	r5_deaths	r5_items	d1_hero	d1_level	d1_xp	d1_gold	d1_lh	d1_kills	d1_deaths	d1_items	d2_hero	d2_level	d2_xp	d2_gold	d2_lh	d2_kills	d2_deaths	d2_items	d3_hero	d3_level	d3_xp	d3_gold	d3_lh	d3_kills	d3_items	d4_hero	d4_level	d4_xp	d4_gold	d4_lh	d4_items	d5_hero	d5_level	d5_xp	d5_gold	d5_lh	d5_kills	d5_deaths	d5_items	first_blood_time	first_blood_team	first_blood_player1	radiant_bottle_time	radiant_courier_time	radiant_flying_courier_time	radiant_tpscroll_count	radiant_boots_count	radiant_ward_observer_count	radiant_ward_sentry_count	radiant_first_ward_time	dire_bottle_time	dire_courier_time	dire_flying_courier_time	dire_tpscroll_count	dire_boots_count	dire_ward_observer_count	dire_ward_sentry_count	dire_first_ward_time	duration	radiant_win	tower_status_radiant	tower_status_dire	barracks_status_radiant	barracks_status_dire
1430198770	7	11	5	2098	1489	20	0	0	7	67	3	842	991	10	0	4	29	5	1909	1143	10	0	8	20	3	757	741	6	0	0	7	105	3	732	658	4	1	11	4	3	1058	996	12	0	0	6	42	4	1085	986	12	0	0	4	21	5	2052	1536	23	0	6	37	3	742	500	2	8	84	3	958	1003	3	1	0	9	7.0	1.0	9.0	134.0	-80.0	244.0	2	2	2	0	35.0	103.0	-84.0	221.0	3	4	2	2	-52.0	2874	1	1796	0	51	0
1430220345	0	42	4	1188	1033	9	0	1	12	49	4	1596	993	10	1	7	67	4	1506	1502	18	1	7	37	3	669	631	7	0	0	7	26	2	415	539	1	0	5	39	5	1960	1384	16	0	0	8	88	3	640	566	1	0	1	5	79	3	720	1350	2	2	12	7	2	440	583	0	7	12	4	1470	1622	24	0	0	9	54.0	1.0	7.0	173.0	-80.0		2	0	2	0	-20.0	149.0	-84.0	195.0	5	4	3	1	-5.0	2463	1	1974	0	63	1
1430227081	7	33	4	1319	1270	22	0	0	12	98	3	1314	775	6	0	6	20	3	1297	909	0	1	6	27	5	2360	2096	26	1	1	6	4	3	1395	1627	27	0	9	22	5	2305	2028	19	1	1	10	66	3	1024	959	19	0	1	10	86	3	755	620	3	0	8	29	4	1319	667	4	7	80	3	1350	1512	25	0	0	7	224.0	0.0	3.0	63.0	-82.0		2	5	2	1	-39.0	45.0	-77.0	221.0	3	4	3	1	13.0	2130	0	0	1830	0	63
1430263531	1	29	4	1779	1056	14	0	0	5	30	2	539	539	1	0	6	75	5	2037	1139	15	0	6	37	2	591	499	0	0	0	6	41	3	712	1075	12	0	6	96	5	1878	1174	17	0	0	6	48	3	732	1468	22	0	0	10	15	4	1681	1051	11	0	7	102	2	674	537	1	7	20	2	510	499	0	0	0	7				208.0	-75.0		0	3	2	0	-30.0	124.0	-80.0	184.0	0	4	2	0	27.0	1459	0	1920	2047	50	63
1430282290	7	13	4	1431	1090	8	1	0	8	27	2	629	552	0	1	7	30	3	884	927	0	1	8	72	3	925	1439	16	1	0	11	93	4	1482	880	7	0	8	26	3	704	586	1	0	2	9	69	3	1169	1665	20	1	0	7	22	3	1055	638	1	0	9	25	5	1815	1275	18	8	8	4	1119	904	6	0	1	7	-21.0	1.0	6.0	166.0	-81.0	181.0	1	4	2	0	46.0	182.0	-80.0	225.0	6	3	3	0	-16.0	2449	0	4	1974	3	63
1430284186	1	11	5	1961	1461	19	0	1	6	20	2	441	686	4	0	5	28	4	1874	1438	22	0	4	25	2	528	800	1	1	0	9	65	3	799	785	6	1	6	55	3	847	785	7	0	1	7	52	2	455	967	2	1	0	11	3	2	279	916	0	1	10	73	5	2065	2565	26	13	48	5	2029	1781	29	0	0	8	78.0	1.0	7.0	35.0	-85.0	182.0	5	4	2	1	-27.0	2.0	-86.0	212.0	4	4	4	0	-43.0	1453	0	512	2038	0	63

Description of features

match_id: match id in Dataset
start_time: match start time (unixtime)
lobby_type: type of room with gamers
dataset for each gamers (team of the Radiant has prefix rN, Dire — dN):
- r1_hero: hero gamer
- r1_level: max level of herouse (for first 5 minutes)
- r1_xp: max expirience
- r1_gold: max gold
- r1_lh: count of dead units
- r1_kills: count of killed heroes
- r1_deaths: count of heroes death
- r1_items: count of purchased items
features of event "first blood"
- first_blood_time: game time of first blood
- first_blood_team: first blood team (0 — Radiant, 1 — Dire)
- first_blood_player1: Player1 involved in an event "first blood"
- first_blood_player2: Player2 involved in an event "first blood"
features for each team (prefix radiant_ and dire_)
- radiant_bottle_time: time of acquisition of item "bottle"
- radiant_courier_time: time of acquisition of item "courier"
- radiant_flying_courier_time: time of acquisition of item "flying_courier"
- radiant_tpscroll_count: count of items "tpscroll" for first 5 minutes
- radiant_boots_count: count of items "boots"
- radiant_ward_observer_count: count of items "ward_observer"
- radiant_ward_sentry_count: count of items "ward_sentry"
- radiant_first_ward_time: The time of installation by the command of the first "observer", i.e. An object that allows you to see part of the playing field
Match result
- duration: duration
- radiant_win: if win the Radiant command then 1, else 0
- Status of towers and barracks by the end of the match
  - tower_status_radiant
  - tower_status_dire
  - barracks_status_radiant
  - barracks_status_dire

Quality metric

As a quality metric, we will use the area under the ROC curve (AUC-ROC). Note that AUC-ROC is a quality metric for the algorithm that issues the first-class membership estimates. To do this, you need to get predictions using the function predict_proba. It returns two columns: the first contains ratings for belonging to the zero class, the second to the first class. We need values from the second column:

Pred = clf.predict_proba (X_test) [:, 1]

Gradien Boosting as method of possible solution

We will use method of Gradient Boosting. It is not very demanding on data, restores non-linear dependencies, and works well on many data sets, which determines its popularity. It is quite reasonable to try it first. May be next article we will try method of logistic regression.

Process of solution

Read features from file

import pandas
X_train = pandas.read_csv('features.csv', index_col='match_id')

Remove features about end of match, such as duration, radiant_win, tower_status_radiant, tower_status_dire, barracks_status_radiant, barracks_status_dire

X_test = X_train.drop("duration",1).drop("radiant_win",1).drop("tower_status_radiant",1).drop("tower_status_dire",1).drop("barracks_status_radiant",1).drop("barracks_status_dire",1)

Check the selection for skipped passes with the count () function, which for each column shows the number of filled values.

# find a max value
maxCount = 0
C = X_test.count()
for x in range(0, len(C)):
    if C[x] > maxCount:
        maxCount = C[x]
# OUT: maxCount = 97230

# features with skipped data
for idx,item in enumerate(C):
    if item < maxCount:
        print( C.keys()[idx],': ', maxCount-item)

# OUT:
first_blood_time :  19553
first_blood_team :  19553
first_blood_player1 :  19553
first_blood_player2 :  43987
radiant_bottle_time :  15691
radiant_courier_time :  692
radiant_flying_courier_time :  27479
radiant_first_ward_time :  1836
dire_bottle_time :  16143
dire_courier_time :  676
dire_flying_courier_time :  26098
dire_first_ward_time :  1826

Missing data is used event "first blood". If "first blood" have't been happened for 5 minutes, then values set a skipped value. For example, skipped data in first_blood_time and first_blood_team means nobody has been died for first 5 minutes.

# Replace skipped data to 0
X_test = X_test.fillna(0)

# Target value
y_test = features['radiant_win']

We will forget that there are categorical attributes in the sample, and we will try to teach a gradient boosting over trees on the existing matrix "objects-signs". Lock the split generator for cross-validation in 5 blocks (KFold), do not forget to shuffle the sample (shuffle = True), because the data in the table is sorted by time, and without undue effects you may encounter unwanted effects when evaluating the quality.

Evaluate the quality of GradientBoostingClassifier using this cross-validation, try different number of trees (10, 20, 30, 40) and measure elapsed time.

from sklearn.ensemble import GradientBoostingClassifier
import time
import datetime
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# cross-validation
kf = KFold(n_splits=5, shuffle=True)

X = X_test.as_matrix()
y = y_test.as_matrix()

for ne in [10,20,30,40]:
    start_time = datetime.datetime.now()
    # Gradient Boosting
    grd = GradientBoostingClassifier(n_estimators=ne, verbose=False, random_state=241)
    # teach
    grd.fit(X, y)
    minQuality = 1;
    for train_index, test_index in kf.split(X):
        X_train, X_cross_test = X[train_index], X[test_index]
        y_train, y_cross_test = y[train_index], y[test_index]
        # predict
        pred1 = grd.predict_proba(X_train)[:, 1]
        # find quality
        auc1 = roc_auc_score(y_train, pred1)
        
        if auc1 < minQuality:
            minQuality = auc1
    print('estimators : ', ne)
    print("minQuality : ", minQuality)
    print('Time elapsed : ', datetime.datetime.now() - start_time)

# OUT:
estimators :  10
minQuality :  0.671158427379
Time elapsed :  0:00:07.285712
estimators :  20
minQuality :  0.690562120847
Time elapsed :  0:00:13.752695
estimators :  30
minQuality :  0.699039946513
Time elapsed :  0:00:20.728346
estimators :  40
minQuality :  0.704060457972
Time elapsed :  0:00:26.331095

Is the optimum achieved at the tested values of the parameter n_estimators, or is the quality likely to continue to grow with its further increase?

if we take 60 or 80 trees, quality will increase insignificantly.

estimators :  60;  min quality :  0.7128636129
estimators :  70; min quality :  0.715895358026
estimators :  80; min quality :  0.718241911848

Conclusions

We can predict the outcome of a match with an accuracy of 71% using the method of gradient boosting. Next time I will use logistic regression for prediction and will check fineal model by new data. It is interesting to compare result of another method.

Links:

The competition: Dota 2: Win Probability Prediction
Repository with solution code Github.com
Datasets

Machine Learning Prediction Learning