Predict independent values with text data using linear regression

The theme of Machine Learning is quite popular and sought after now. Machine Learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled. As a result, machine learning is widely used in computer science and other fields.

Data everywhere

This article is intended for those who are just beginning to learn the methods and approaches to solve problems. We consider one of the simplest methods, it is the method of linear regression for text data and prediction independent features


Get prediction employees salary, based on the job description.

Data set with job descriptions and respective annual salaries are presented in file salary-train.csv

FullDescription LocationNormalized ContractTime SalaryNormalized
0 International Sales Manager London ****k ****... London permanent 33000
1 An ideal opportunity for an individual that ha... London permanent 50000
2 Online Content and Brand Manager// Luxury Reta... South East London permanent 40000
3 A great local marketleader is seeking a perman... Dereham permanent 22500
4 Registered Nurse / RGN Nursing Home for Young... Sutton Coldfield NaN 20355
5 Sales and Marketing Assistant will provide adm... Crawley NaN 22500
6 Vacancy Ladieswear fashion Area Manager / Regi... UK permanent 32000

We have to predict salary for the two examples

FullDescription LocationNormalized ContractTime SalaryNormalized
0 We currently have a vacancy for an HR Project ... Milton Keynes contract NaN
1 A Web developer opportunity has arisen with an... Manchester permanent NaN

Method of linear regression

Linear methods are well suited for working with sparse data, for example, when working with texts. This can be explained by a high rate of training and a small number of parameters, making possible to avoid retraining. Linear regression is one of the basic and most simple machine learning methods. The method is used to restore the relationship between independent and dependent variables.


Python and packages such as Pandas, SciPy, and Scikit-Learn, which realizes many algorithms machine learning.

	import pandas
	from sklearn.feature_extraction.text import TfidfVectorizer
	from sklearn.feature_extraction import DictVectorizer
	from scipy.sparse import hstack
	from sklearn.linear_model import Ridge


Load the data set with the job description and relevant annual salary from the file

	train = pandas.read_csv('salary-train.csv')

Data are presented in DataFrame It allows us comfortable work with data. In total, the training set of 240 thousand records.

Data Preparation

Before you start training, you first need to prepare the data because classifiers only work with numerical values.

Text comfortably encoded using "bag of words". It is formed as features of how many unique words in the text, and the meaning of each feature is the number of occurrences in the document corresponding to the word. It is clear that the total number of different words in a Text Set can reach tens of thousands, and it is only a small part of them will occur in a particular text. We can encode texts smarter. It doesn't record the number of occurrences of the word in the text, but TF-IDF. This figure, which is the product of two numbers: TF (term frequency) and IDF (inverse document frequency). The first is the ratio of the number of occurrences of a word in a document to the total length of the document. The second value depends on that word in the document how many samples. The more of these documents, the smaller the IDF. Thus, TF-IDF value will be high for those words that occur many times herein and in other rare. In Scikit-Learn sklearn.feature_extraction.text.TfidfVectorizer implemented in the class. Conversion training set needs to be done with the help of fit_transform function, test set with using a transform.


# For beginning, transform train['FullDescription'] to lowercase using text.lower()

# Then replace everything except the letters and numbers in the spaces.
# it will facilitate the further division of the text into words.
train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex = True)

# Convert a collection of raw documents to a matrix of TF-IDF features with TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=5)
X_tfidf = vectorizer.fit_transform(train['FullDescription']) 


# Replace NaN in LocationNormalized и ContractTime rows to special string 'nan'. 
train['LocationNormalized'].fillna('nan', inplace=True)
train['ContractTime'].fillna('nan', inplace=True)

Transform Data:

LocationNormalized и ContractTime features have string type, and we are not able to work with it directly. We used DictVectorizer to do binary one-hot (aka one-of-K) coding. It means one boolean-valued feature is constructed for each of the possible string values that the feature can take on.

enc = DictVectorizer()
X_train_categ = enc.fit_transform(train[['LocationNormalized', 'ContractTime']].to_dict('records'))

# Take a sequence of arrays and stack them horizontally to make a single array. 
# Rebuild arrays divided by scipy.sparse.hstack. 
# Note that matrices are sparse. 
# In numerical analysis, a sparse matrix is a matrix in which most of the elements are zero. 

X = hstack([X_tfidf,X_train_categ])

Okay, we have got a matrix for fit.

Ridge Regression

Ridge Regression is a remedial measure taken to alleviate multicollinearity amongst regression predictor variables in a model. Often predictor variables used in a regression are highly correlated. When they are, the regression coefficient of any one variable depend on which other predictor variables are included in the model, and which ones are left out.

Train Ridge regression with parameters alpha=1 and random_state=241.

# Classifier: 
clf = Ridge(alpha=1.0, random_state=241)

# The target value (algorithm has to predict) is SalaryNormalized
y = train['SalaryNormalized']

# train model on data, y) 


Get forecast for two examples from file salary-test-mini.csv. We have to do Data Preparation for the test Dataset and predict new

	rslt = clf.predict(X_test)

Answer has 2 numbers [ 56555.61500155 , 37188.32442618]. This is solution of task.


FullDescription LocationNormalized ContractTime SalaryNormalized
0 We currently have a vacancy for an HR Project ... Milton Keynes contract 56556
1 A Web developer opportunity has arisen with an... Manchester permanent 37188

There are easy way to predict variables.


  1. Source code GitHub
  2. How to Make a Prediction - Intro to Deep Learning #1
  3. Scikit-Learn
  4. Scipy