Microsoft Data Science Azure Machine Learning Workshop
Lab Setup and Instruction Guide
During this first Microsoft Data Science meetup, hosted by Infi and with guest speaker Jeroen ter Heerdt from Microsoft, we also organized a workshop covering the basics of machine learning on the Azure platform.
In this lab, as part of the Microsoft Data Science meetup community, you will learn how to build a Human Activity Classifier with Azure Machine Learning. This classifier predicts somebody’s activity class (sitting, standing up, standing, sitting down, walking) based on data from wearable sensors. The point of this lab is to introduce you to the basics of creating and deploying a machine learning model in Azure ML; it is not intended to be a deep dive into model design, validation and improvement.
This lab environment contains the following tasks:
- Set up your Azure ML environment
- Get the data
- Build your model
- Publish your model
What You’ll Need
To perform the tasks, you will need the following:
- A Windows, Linux, or Mac OS X computer
- A web browser and an Internet connection
1. Set up your Azure ML environment
There are several options to start with Azure ML: https://azure.microsoft.com/en-us/services/machine-learning/
If you don’t have an Azure account already, we recommend the Free Workspace option. For this option you need a Microsoft account. If you don’t have one already, you can sign up for one at https://signup.live.com/.
Get the data
This classifier predicts somebody’s activity class (sitting, standing up, standing, sitting down, walking). It is based on the Human Activity Recognition dataset. Human Activity Recognition (HAR) is an active research area, results of which have the potential to benefit the development of assistive technologies in order to support care of the elderly, the chronically ill and people with special needs. Activity recognition can be used to provide information about patients’ routines to support the development of e-health systems. Two approaches are commonly used for HAR: image processing and use of wearable sensors. In this case we will use information generated by wearable sensors (Ugulino et al., 2012).
Understand the data source
In this lab we use the Human Activity Recognition Data from its source: http://groupware.les.inf.puc-rio.br/har#ixzz2PyRdbAfA. More info can also be found on the UCI repository. You can download the data from http://groupware.les.inf.puc-rio.br/static/har/dataset-har-PUC-Rio-ugulino.zip and extract the downloaded zip file to a convenient folder on your local computer.
The data was collected during 8 hours of activities, 2 hours with each of 4 healthy adults (2 men and 2 women). Each of them wore 4 LilyPad Arduino accelerometers, positioned on the waist, left thigh, right ankle, and right upper arm respectively. This resulted in a dataset with 165634 rows and 19 columns:
- user (text)
- gender (text)
- age (integer)
- how_tall_in_meters (real)
- weight (int)
- body_mass_index (real)
- x1 (type int, value of the axis ‘x’ of the 1st accelerometer, mounted on waist)
- y1 (type int, value of the axis ‘y’ of the 1st accelerometer, mounted on waist)
- z1 (type int, value of the axis ‘z’ of the 1st accelerometer, mounted on waist)
- x2 (type int, value of the axis ‘x’ of the 2nd accelerometer, mounted on the left thigh)
- y2 (type int, value of the axis ‘y’ of the 2nd accelerometer, mounted on the left thigh)
- z2 (type int, value of the axis ‘z’ of the 2nd accelerometer, mounted on the left thigh)
- x3 (type int, value of the axis ‘x’ of the 3rd accelerometer, mounted on the right ankle)
- y3 (type int, value of the axis ‘y’ of the 3rd accelerometer, mounted on the right ankle)
- z3 (type int, value of the axis ‘z’ of the 3rd accelerometer, mounted on the right ankle)
- x4 (type int, value of the axis ‘x’ of the 4th accelerometer, mounted on the right upper-arm)
- y4 (type int, value of the axis ‘y’ of the 4th accelerometer, mounted on the right upper-arm)
- z4 (type int, value of the axis ‘z’ of the 4th accelerometer, mounted on the right upper-arm)
- class (text, ‘sitting-down’, ‘standing-up’, ‘standing’, ‘walking’, and ‘sitting’)
2. Build the Human Activity Classifier
Prepare the data
Before you can use it to train a classification model you must prepare and upload the data:
- Azure ML works with comma-separated files. The original data file uses ‘;’ as separator and is therefore not suitable for uploading as-is. We first have to open the downloaded csv file and convert it to a csv file with ‘,’ as separator. Make sure you also use ‘.’ as the decimal mark.
If you have trouble creating such a file, you can start with this starting experiment: https://gallery.cortanaintelligence.com/Experiment/Human-Activity-Classifier-Step-1-Load-data. This will open a window where you have to sign in to Azure ML.
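If you prefer to script the conversion, the following Python sketch shows one possible way to do it with the standard library only. It assumes that decimal numbers in the source file use a comma (for example 1,62); the sample row and the function names are made up for illustration.

```python
import csv
import io
import re

# A cell such as "1,62" or "-0,5": digits, one decimal comma, digits.
DECIMAL_COMMA = re.compile(r"^-?\d+,\d+$")


def fix_decimal(cell):
    """Turn a decimal comma into a decimal point; leave other cells alone."""
    return cell.replace(",", ".") if DECIMAL_COMMA.match(cell) else cell


def convert_semicolon_csv(src, dst):
    """Copy rows from a ';'-separated source to a ','-separated target."""
    reader = csv.reader(src, delimiter=";")
    writer = csv.writer(dst, delimiter=",", lineterminator="\n")
    for row in reader:
        writer.writerow([fix_decimal(cell) for cell in row])


# Illustrative two-line sample (hypothetical values, not the real file).
sample = (
    "user;gender;age;how_tall_in_meters;weight;body_mass_index\n"
    "user_a;Woman;46;1,62;75;28,6\n"
)
target = io.StringIO()
convert_semicolon_csv(io.StringIO(sample), target)
converted = target.getvalue()
print(converted)
```

For the real file you would pass two file objects opened with newline="" instead of the in-memory buffers; the file names are up to you.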
- Open a browser and browse to https://studio.azureml.net. Then sign in using the Microsoft account associated with your Azure ML workspace.
- Create a new blank experiment by clicking on the + NEW button in the lower-left corner of your browser window, and select EXPERIMENT, and subsequently BLANK EXPERIMENT. You can change the generated name into Human Activity Classifier.
- Upload the csv file to Azure ML and name it HAR dataset. To do this, click on the + NEW button in the lower-left corner of your browser window, and select DATASET, and subsequently FROM LOCAL FILE.
- In the Human Activity Classifier experiment, go to My Datasets under Saved datasets, drag the HAR dataset onto the canvas, and click on RUN (menu below). Wait until the experiment has finished running before you continue with the next step.
- To visualize the output of the dataset, right-click on the output port of the data module and select Visualize.
Now you can review the data it contains. Note that the dataset contains the following variables:
- user (string)
- gender (string)
- age (numeric)
- how_tall_in_meters (numeric)
- weight (numeric)
- body_mass_index (numeric)
- x1 (numeric)
- y1 (numeric)
- z1 (numeric)
- x2 (numeric)
- y2 (numeric)
- z2 (numeric)
- x3 (numeric)
- y3 (numeric)
- z3 (numeric)
- x4 (numeric)
- y4 (numeric)
- z4 (string) !!!
- class (string)
- Oops, something went wrong: ‘z4’ has been processed as a ‘string’ instead of an ‘integer’. You can change this with a few lines of R in the Execute R Script module. Drag the Execute R Script module onto the canvas, connect the dataset to its first input port, and use this code to convert ‘z4’ to a numeric:
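# Map 1-based optional input ports to variables
df <- maml.mapInputPort(1) # class: data.frame
# Convert via character so that factor codes are not mistaken for values;
# entries that are not valid numbers become NA
df$z4 <- as.numeric(as.character(df$z4))
# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("df");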
You might ask yourself: why not use the Edit Metadata module for that? Well, if we try that, we get an error saying that Azure cannot convert specific strings to an integer.
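The difference between the two approaches can be illustrated with a small Python sketch: a strict conversion fails on the first malformed value, while a lenient conversion turns it into a missing value, similar to what as.numeric does in R. The column values below are hypothetical and the function names are chosen for this example only.

```python
def to_number_strict(text):
    """Raise ValueError for any value that is not a valid number."""
    return float(text)


def to_number_lenient(text):
    """Return None (a missing value) for values that cannot be parsed."""
    try:
        return float(text)
    except ValueError:
        return None


# Hypothetical 'z4' column with one malformed entry.
z4_raw = ["-150", "-161", "not-a-number", "-158"]

z4 = [to_number_lenient(value) for value in z4_raw]
missing = sum(value is None for value in z4)
print(z4, missing)
```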
- After converting ‘z4’ to a numeric, we have to check whether any values are missing. To do so, click on the Result Dataset1 (left) output of the Execute R Script module and select Visualize. Although the UCI repository states that there are no missing values, we find that the ‘z4’ column has 1 missing value.
- We will delete this row with the Clean Missing Data module. Set the properties as follows:
- Columns to be cleaned: all
- Minimum missing value ratio: 0
- Maximum missing value ratio: 1
- Cleaning mode: entire row
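With these settings, every row that has at least one missing value is dropped. As a rough local equivalent, assuming rows are represented as dictionaries and missing values as None (both choices are assumptions for this sketch, and the values are made up):

```python
def drop_incomplete_rows(rows):
    """Keep only the rows in which no value is missing (None)."""
    return [
        row for row in rows
        if all(value is not None for value in row.values())
    ]


# Hypothetical rows; the second one has a missing 'z4'.
rows = [
    {"z3": -92.0, "z4": -150.0, "class": "sitting"},
    {"z3": -94.0, "z4": None, "class": "standing"},
    {"z3": -90.0, "z4": -158.0, "class": "walking"},
]
clean = drop_incomplete_rows(rows)
print(len(rows), "->", len(clean))
```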
- After cleaning the data, we can inspect the data. We start with some descriptive statistics using the Summarize Data module.
- We can also inspect the correlation between the numeric columns. First select these columns with the Select Columns in Dataset module: drag this module onto the canvas and connect the output port of the Clean Missing Data module to its input port. Then launch the column selector, choose WITH RULES, begin with NO COLUMNS, and select Include, column types, Numeric.
- Now we can add the Compute Linear Correlation module to calculate the (Pearson’s) correlation. Observe that there is a strong correlation between height (how_tall_in_meters), weight (weight) and BMI (body_mass_index). This is not surprising, as BMI is calculated from height and weight.
- Because of this, we will remove ‘body_mass_index’ using a Select Columns in Dataset module. Here we also exclude ‘user’, as we don’t need this identifier later on in our model. Select the Select Columns in Dataset module, and in the Properties pane launch the column selector. Then use the column selector to exclude the columns ‘user’ and ‘body_mass_index’.
You can use the WITH RULES page of the column selector to accomplish this.
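The correlation check and the column removal can also be reproduced outside Azure ML Studio. The sketch below computes Pearson’s correlation by hand and then drops the two columns. The height and weight values are hypothetical, and BMI is derived as weight divided by height squared, which is the standard definition rather than something taken from the dataset documentation.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def exclude_columns(row, names):
    """Return a copy of the row without the named columns."""
    return {key: value for key, value in row.items() if key not in names}


# Hypothetical participants.
heights = [1.62, 1.58, 1.71, 1.67]
weights = [75.0, 55.0, 83.0, 67.0]
bmis = [w / h ** 2 for w, h in zip(weights, heights)]

print("weight vs. BMI:", pearson(weights, bmis))

row = {"user": "user_a", "weight": 75.0, "body_mass_index": bmis[0]}
print(exclude_columns(row, {"user", "body_mass_index"}))
```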
- Now we make gender a categorical variable by adding an Edit Metadata module to the experiment and connecting the Select Columns in Dataset output to its input. Set the properties of the Edit Metadata module as follows:
- Column: gender
- Data type: Unchanged
- Categorical: Make categorical
- Fields: Features
- New column names: Leave blank
- We will apply a similar transformation to our dependent variable ‘class’: we make it a categorical variable and define it as our label. Add a second Edit Metadata module to the experiment, and connect the output of the first Edit Metadata module to its input. Set the properties of the second Edit Metadata module as follows:
- Column: class
- Data type: Unchanged
- Categorical: Make categorical
- Fields: Label
- New column names: Leave blank
- When the experiment has finished running, visualize the output of the Edit Metadata module and verify that:
- The columns you specified have been removed.
- All numeric columns now have a Feature Type of Numeric Feature.
- All string columns now have a Feature Type of Categorical Feature.
Create and Evaluate a Classification Model
Now that you have prepared the data, you will construct and evaluate a classification model. The goal of this model is to identify a human activity and to find out if somebody is ‘sitting-down’, ‘standing-up’, ‘standing’, ‘walking’, or ‘sitting’.
- We are now ready to split the data into separate training and test datasets. We will train the model with the training dataset, and test the model with the test dataset. To do so, add a Split Data module to the Human Activity Classifier experiment, and connect the output of the last Edit Metadata module to the input of the Split Data module. Set the properties of the Split Data module as follows:
- Splitting mode: Split Rows
- Fraction of rows in the first output dataset: 0.7
- Randomized split: Checked
- Random seed: 123
- Stratified split: False
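To make the effect of these settings concrete, here is a minimal Python sketch of a seeded, randomized 70/30 split. It only illustrates the idea: Python’s random generator differs from the one in Azure ML, so the same seed will not reproduce the same split as the Split Data module.

```python
import random


def split_rows(rows, fraction=0.7, seed=123):
    """Shuffle a copy of the rows and cut it at the given fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * fraction))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for ten data rows
train, test = split_rows(rows)
print(len(train), len(test))
```

Because the seed is fixed, running the split again yields the same training and test sets, which makes results comparable between runs.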
- Add a Train Model module to the experiment, and connect the Results dataset1 (left) output of the Split Data module to the Dataset (right) input of the Train Model module. In the Properties pane for the Train Model module, use the column selector to select the class column. This sets the label column that the classification model will be trained to predict.
- Add a Multiclass Decision Forest module to the experiment, and connect the output of the Multiclass Decision Forest module to the Untrained model (left) input of the Train Model module. This specifies that the classification model will be trained using the multiclass decision forest algorithm.
- Set the properties of the Multiclass Decision Forest module as follows:
- Resampling method: Bagging
- Create trainer mode: Single Parameter
- Number of decision trees: 8
- Maximum depth of decision trees: 32
- Number of random splits per node: 128
- Minimum number of samples per leaf: 1
- Allow unknown categorical levels: Checked
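For readers unfamiliar with bagging, the sketch below shows the general idea behind the first and third setting: each of the 8 trees is trained on its own bootstrap sample (rows drawn with replacement), and the predictions of the trees are combined. This is not the Azure ML implementation. A simple majority vote is used here as a simplification, and the eight predictions are made up.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw as many rows as the input has, with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the label predicted by most trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(123)
rows = list(range(100))  # stand-in for the training rows

# One bootstrap sample per decision tree.
samples = [bootstrap_sample(rows, rng) for _ in range(8)]

# Hypothetical predictions of the 8 trees for a single record.
tree_predictions = [
    "walking", "walking", "standing", "walking",
    "walking", "standing", "sitting", "walking",
]
print(majority_vote(tree_predictions))
```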
- Add a Score Model module to the experiment. Then connect the output of the Train Model module to the Trained model (left) input of the Score Model module, and connect the Results dataset2 (right) output of the Split Data module to the Dataset (right) input of the Score Model module.
- On the Properties pane for the Score Model module, ensure that the Append score columns to output checkbox is selected.
- Add an Evaluate Model module to the experiment, and connect the output of the Score Model module to the Scored dataset (left) input of the Evaluate Model module.
- Verify that your experiment resembles the figure below, then save and run the experiment.
- When the experiment has finished running, visualize the output of the Score Model module, and compare the predicted values in the Scored Labels column with the actual values from the test data set in the class column.
- Visualize the output of the Evaluate Model module, and review the results (shown below). We see the score per class. Then review the Overall Accuracy figure for the model, which should be around 0.994. This indicates that the classifier model is correct 99% of the time, which is a good figure for an initial model, keeping in mind the original distribution of the classification (see below).
Detailed accuracy from the original paper:
Correctly classified instances: 164662 (99.4144 %)
Incorrectly classified instances: 970 (0.5856 %)
Root mean squared error: 0.0463
Relative absolute error: 0.7938 %
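The percentages reported in the paper follow directly from the two instance counts. The same calculation applies to the Overall Accuracy figure of your own model: the number of correct predictions divided by the total number of predictions.

```python
def accuracy(correct, incorrect):
    """Share of correctly classified instances, between 0 and 1."""
    return correct / (correct + incorrect)


correct, incorrect = 164662, 970  # counts from the original paper

share_correct = 100 * accuracy(correct, incorrect)
share_incorrect = 100 - share_correct
print(round(share_correct, 4), round(share_incorrect, 4))
```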
3. Publish your Human Activity Classifier
Publish the Model as a Web Service
- Make sure you have saved and run the experiment. With the Human Activity Classifier experiment open, click the SET UP WEB SERVICE icon at the bottom of the Azure ML Studio page and click Predictive Web Service [Recommended]. A new Predictive Experiment tab will be automatically created.
- Verify that, with a bit of rearranging, the Predictive Experiment resembles this figure:
- We can now remove the variables we don’t need for prediction. Besides ‘user’ and ‘body_mass_index’, we can now also remove ‘class’, because that is what we want as output from the model. To do so, drag the Select Columns in Dataset module up, add ‘class’ to the columns to be removed, and connect its input to the original dataset and its output to the Execute R Script module.
- We will also make sure to send a numeric value for ‘z4’, so we can move the Web service input and connect it directly to the Edit Metadata module where we make ‘gender’ categorical.
- For this experiment, we will also make sure to send complete records, so we can remove the Clean Missing Data module.
- Delete the connection between the Score Model module and the Web service output module.
- Add a Select Columns in Dataset module to the experiment, and connect the output of the Score Model module to its input. Then connect the output of the Select Columns in Dataset module to the input of the Web service output module.
- Select the Select Columns in Dataset module, and use the column selector to select only the Scored Labels column. This ensures that when the web service is called, only the predicted value is returned.
- Ensure that the predictive experiment now looks like the following, and then save and run the predictive experiment:
- When the experiment has finished running, visualize the output of the last Select Columns in Dataset module and verify that only the Scored Labels column is returned.
Deploy and Use the Web Service
- In the Human Activity Classifier [Predictive Exp.] experiment, click the Deploy Web Service icon at the bottom of the Azure ML Studio window.
- Wait a few seconds for the dashboard page to appear, and note the API key and the Request/Response link. You will use these to connect to the web service from a client application.
- You have several options to connect to the web service. To test it, you can click on New Web Services Experience (preview). This will open a new browser tab.
- Here you have the option to test your model (Test endpoint option under BASICS):
- When clicking on Test endpoint, you have the option to enable the usage of sample data, which will generate a sample record to test your model with:
- After enabling this sample data, you will see the generated sample data:
- The final step is to press the Test Request-Response button: what kind of activity is this woman doing according to your model?
- Another option is to click on the blue TEST button.
- This will open a pop-up window, where you can fill out some test values:
- The last option is to open an Excel file, which will automatically create sample data. Opening this file will add the Azure Machine Learning add-in to the workbook. If that doesn’t work, or you don’t have Excel on your laptop, you could follow the next steps to make a workbook online:
- Open a new browser tab.
- In the new browser tab, navigate to https://office.live.com/start/Excel.aspx. If prompted, sign in with your Microsoft account (use the same credentials you use to access Azure ML).
- In Excel Online, create a new blank workbook.
- On the Insert tab, click Office Add-ins. Then in the Office Add-ins dialog box, select Store, search for Azure Machine Learning, and add the Azure Machine Learning add-in as shown below:
- After the add-in is installed, in the Azure Machine Learning pane on the right of the Excel workbook, click Add Web Service. Boxes for the URL and API key of the web service will appear.
- On the browser tab containing the dashboard page for your Azure ML web service, right-click the Request/Response link you noted earlier and copy the web service URL to the clipboard. Then return to the browser tab containing the Excel Online workbook and paste the URL into the URL box.
- On the browser tab containing the dashboard page for your Azure ML web service, click the Copy button for the API key you noted earlier to copy the key to the clipboard. Then return to the browser tab containing the Excel Online workbook and paste it into the API key box.
- Verify that the Azure Machine Learning pane in your workbook now resembles this, and click Add:
- After the web service has been added, the Azure Machine Learning pane opens on step 2, Predict. Here you have the option to generate sample data by clicking on Use sample data. This enters some sample input values in the worksheet.
- Select the cells containing the input data (cells A1 to P6), and in the Azure Machine Learning pane, click the button to select the input range and confirm that it is 'Sheet1'!A1:P6.
- Ensure that the My data has headers box is checked.
- In the Output box type Q1, and ensure the Include headers box is checked.
- Click the Predict button, and after a few seconds, view the predicted label in cell Q2.
- Change some values of row 2 and click Predict. Then view the updated label that is predicted by the web service.
- Try changing a few of the input variables and predicting the human activity class. You can add multiple rows to the input range and try various combinations at once.
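The API key and Request/Response URL from the dashboard can also be used from your own code. The Python sketch below only builds the HTTP request; the actual call is left commented out. The body layout (an "Inputs" object with "ColumnNames" and "Values" under the default input name "input1") and the Bearer authorization header are assumptions based on the classic request/response format, so check the API help page linked from your dashboard for the exact schema. The URL, the key and the input values are placeholders.

```python
import json
import urllib.request

# The 16 input columns that remain after removing user, body_mass_index and class.
COLUMNS = [
    "gender", "age", "how_tall_in_meters", "weight",
    "x1", "y1", "z1", "x2", "y2", "z2",
    "x3", "y3", "z3", "x4", "y4", "z4",
]


def build_request(url, api_key, rows):
    """Build a POST request for the web service (assumed body layout)."""
    body = {
        "Inputs": {"input1": {"ColumnNames": COLUMNS, "Values": rows}},
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
    )


# Placeholder URL and key; the record holds hypothetical values.
record = ["Woman", "46", "1.62", "75", "-3", "92", "-63", "-23",
          "18", "-19", "5", "104", "-92", "-150", "-103", "-147"]
request = build_request(
    "https://example.invalid/execute?api-version=2.0",
    "YOUR_API_KEY",
    [record],
)

# Uncomment after filling in your own URL and API key:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```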
By completing this lab, you have prepared your environment and data, and built and deployed your own Azure ML model. We hope you enjoyed this introductory lab and that you will build many more machine learning solutions!