This is a static view; to run the notebook, log in to the Data Lab environment.
Assignment
- Load the file `Accidents.csv` and divide the data into three tables according to the values of the `Accident_Severity` attribute (the first table contains only the `Fatal` values, the second `Serious`, and the third `Slight`). (2p)
In [ ]:
# YOUR CODE HERE
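A minimal sketch of one way to do this, using a tiny hand-built frame in place of the real `Accidents.csv` (in Data Lab you would load it with `pd.read_csv`); the string labels `Fatal`/`Serious`/`Slight` are an assumption — in the raw dataset severity may be stored as numeric codes:

```python
import pandas as pd

# Toy stand-in for the real file; replace with:
# accidents = pd.read_csv("Accidents.csv")
accidents = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3", "A4"],
    "Accident_Severity": ["Fatal", "Serious", "Slight", "Slight"],
})

# Boolean masks split the frame into one table per severity level
accidents_fatal = accidents[accidents["Accident_Severity"] == "Fatal"]
accidents_serious = accidents[accidents["Accident_Severity"] == "Serious"]
accidents_slight = accidents[accidents["Accident_Severity"] == "Slight"]
```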
- From the table with the `Slight` value, randomly select 10% of the examples using the `sample` method. The following example demonstrates the usage of this method. (2p)
In [ ]:
# `frac` specifies the fraction of examples to select (0.1 = 10%); `random_state` initializes the random
# number generator so that the same selection can be replicated
sample_data = accidents_slight.sample(frac=0.1, random_state=1234)
In [ ]:
# YOUR CODE HERE
- Combine all three tables into a modified `Accidents` table, which will contain 10% of the `Slight` examples and all of the `Fatal` and `Serious` severity examples. After merging you should have 45,021 examples. (2p)
In [ ]:
# YOUR CODE HERE
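One way to stack the three tables back together is `pd.concat`; a minimal sketch with toy stand-ins for the three severity tables (row counts here are illustrative, not the real 45,021):

```python
import pandas as pd

# Toy stand-ins for the three tables produced in the previous steps
accidents_fatal = pd.DataFrame({"Accident_Severity": ["Fatal"] * 2})
accidents_serious = pd.DataFrame({"Accident_Severity": ["Serious"] * 3})
sample_data = pd.DataFrame({"Accident_Severity": ["Slight"] * 4})  # the 10% Slight sample

# Stack the tables row-wise; ignore_index renumbers the resulting rows
accidents = pd.concat([accidents_fatal, accidents_serious, sample_data],
                      ignore_index=True)
```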
- Join the modified `Accidents` table with the `Vehicles` table on the `Accident_Index` key so that the resulting table contains only the vehicles from accidents in the modified `Accidents` table. After merging, you should get a reduced training set with fewer examples to use for further data analysis. Since we have reduced the number of less severe examples, we have increased the weight of the more severe ones. (2p)
In [ ]:
# YOUR CODE HERE
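An inner join on `Accident_Index` keeps exactly the vehicles whose accident appears in the reduced table; a sketch on toy frames (column values are made up):

```python
import pandas as pd

accidents = pd.DataFrame({
    "Accident_Index": ["A1", "A2"],
    "Accident_Severity": ["Fatal", "Slight"],
})
vehicles = pd.DataFrame({
    "Accident_Index": ["A1", "A1", "A2", "A3"],
    "Vehicle_Type": [9, 11, 9, 19],
})

# how="inner" drops vehicle rows (here A3) whose accident is not in the reduced table
data = accidents.merge(vehicles, on="Accident_Index", how="inner")
```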
- Select only the following attributes for further analysis:
  `Day_of_Week`, `1st_Road_Class`, `Road_Type`, `Light_Conditions`, `Weather_Conditions`, `Road_Surface_Conditions`, `Urban_or_Rural_Area`, `Vehicle_Type`, `Sex_of_Driver`, `Age_of_Driver`, `Engine_Capacity_(CC)`, `Age_of_Vehicle`, `Accident_Severity`
We perform this selection in order to remove redundant attributes from the dataset, e.g. those describing geolocation, as well as attributes that cannot be used for prediction (they are not known before the accident itself occurs). (2p)
In [ ]:
# YOUR CODE HERE
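Selecting a fixed list of columns is a plain pandas indexing operation; a sketch on a toy frame with one extra (hypothetical) column to drop:

```python
import pandas as pd

columns = [
    "Day_of_Week", "1st_Road_Class", "Road_Type", "Light_Conditions",
    "Weather_Conditions", "Road_Surface_Conditions", "Urban_or_Rural_Area",
    "Vehicle_Type", "Sex_of_Driver", "Age_of_Driver", "Engine_Capacity_(CC)",
    "Age_of_Vehicle", "Accident_Severity",
]
# Toy frame: the listed attributes plus a geolocation column we want to discard
data = pd.DataFrame({c: [0] for c in columns + ["Longitude"]})

data = data[columns]  # keep only the listed attributes, in this order
```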
- Count the number of missing values for individual attributes. Fill in the missing values appropriately (note: missing values are marked with -1). (4p)
In [ ]:
# YOUR CODE HERE
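A sketch of one possible approach on toy data: convert the `-1` markers to `NaN`, count the gaps per attribute, and fill numeric attributes with the column median (the median is just one reasonable choice — a mode or a per-group fill may suit some attributes better):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Age_of_Driver": [30, -1, 45],
    "Age_of_Vehicle": [5, 7, -1],
})

data = data.replace(-1, np.nan)       # -1 marks a missing value in this dataset
missing_counts = data.isna().sum()    # missing values per attribute

# One simple imputation strategy: fill numeric attributes with the column median
data = data.fillna(data.median())
```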
- Using contingency tables, show the dependencies between the following attributes and the `Accident_Severity` target attribute:
  `Day_of_Week`, `Sex_of_Driver`, `Age_of_Driver` (this attribute needs to be discretized appropriately)
Use one of the visualizations in the seaborn library to graphically display these relationships. (5p)
In [ ]:
# YOUR CODE HERE
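A sketch on toy data: `pd.crosstab` builds the contingency table, `pd.cut` discretizes the driver age (the bin edges here are arbitrary), and `sns.heatmap` is one seaborn visualization that displays the table graphically:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import pandas as pd
import seaborn as sns

data = pd.DataFrame({
    "Day_of_Week": [1, 2, 2, 3, 1, 2],
    "Age_of_Driver": [22, 30, 41, 55, 19, 33],
    "Accident_Severity": ["Fatal", "Slight", "Slight", "Serious", "Slight", "Fatal"],
})

# Discretize age into bands before cross-tabulating (bin edges are illustrative)
data["Age_band"] = pd.cut(data["Age_of_Driver"], bins=[0, 25, 45, 65, 120])

ct = pd.crosstab(data["Day_of_Week"], data["Accident_Severity"])
ax = sns.heatmap(ct, annot=True, fmt="d")  # graphical view of the contingency table
```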
- Create a dataset in which you replace all nominal attributes with numeric or binary ones. (3p)
In [ ]:
# YOUR CODE HERE
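One common way to do this replacement is one-hot encoding with `pd.get_dummies`; a sketch on a toy frame (the category values are made up):

```python
import pandas as pd

data = pd.DataFrame({
    "Road_Type": ["Single carriageway", "Roundabout", "Single carriageway"],
    "Sex_of_Driver": ["Male", "Female", "Male"],
    "Age_of_Driver": [30, 45, 22],
})

# One-hot encode the nominal attributes; numeric columns pass through unchanged
encoded = pd.get_dummies(data, columns=["Road_Type", "Sex_of_Driver"])
```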
- Divide the data into training and test sets in the ratio 70/30. Use the `Accident_Severity` attribute as the target attribute. (2p)
In [ ]:
# YOUR CODE HERE
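A sketch using scikit-learn's `train_test_split` on a toy frame; `stratify=y` is an optional extra that keeps the class proportions equal in both splits:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "Age_of_Driver": range(10),
    "Accident_Severity": [0, 1] * 5,
})
X = data.drop(columns=["Accident_Severity"])
y = data["Accident_Severity"]

# 70/30 split; stratify keeps the severity distribution the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1234, stratify=y)
```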
- Use the `SelectKBest` function with `mutual_info_classif` to calculate the importance of the individual attributes for prediction on the training set. Try to use this information in the preprocessing of the data for some of the models. (3p)
In [ ]:
# YOUR CODE HERE
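A sketch on synthetic data (`make_classification` stands in for the training set): `SelectKBest` scores each feature with the mutual-information scorer and keeps the `k` best ones; `selector.scores_` holds the per-attribute importances:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the training set: 100 rows, 5 features
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_new = selector.fit_transform(X, y)   # keeps only the 3 highest-scoring features
scores = selector.scores_              # mutual information per attribute
```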
- Train different classification models for the prediction of the `Accident_Severity` attribute. Train the following models with default parameters:
- k-nearest neighbors
- Decision trees
- Random forests
Evaluate the individual models with 10-fold cross-validation using the accuracy metric.
Note: for each model, choose a suitable preprocessing method (a possible modification of the preprocessing is in step 8). (6p)
In [ ]:
# YOUR CODE HERE
- Compare the trained models also using the ROC curve on the test set. Identify the model that gives the best results with default parameters. In this step, try to tune that model by finding the best-fitting parameters using `GridSearchCV`. Find and report the best combination of parameters. (4p)
In [ ]:
# YOUR CODE HERE
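A sketch on synthetic binary data (the real `Accident_Severity` has three classes, which needs one ROC curve per class): `GridSearchCV` searches a small, illustrative parameter grid for a random forest, and `roc_curve` is computed from the tuned model's test-set probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Illustrative grid; a real search would cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
best_params = grid.best_params_  # the best combination found

# ROC curve of the tuned model on the test set (binary case shown here)
probs = grid.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
roc_auc = auc(fpr, tpr)
```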
- Train the model with the best parameters on the entire training set and test it on the test set. Evaluate the model using the `accuracy`, `precision`, and `recall` metrics, and compute its confusion matrix. (3p)
In [ ]:
# YOUR CODE HERE
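A sketch of the final evaluation on synthetic data; the random forest with `n_estimators=100` merely stands in for whatever model and parameters the grid search selected, and `average="weighted"` is one way to aggregate precision/recall when the target has more than two classes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Stand-in for the tuned model from the previous step
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, average="weighted")
rec = recall_score(y_test, y_pred, average="weighted")
cm = confusion_matrix(y_test, y_pred)  # rows: true classes, columns: predicted
```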