This is a static view; to run the notebook, log in to the Data Lab environment.
Assignment
- Load the file `Accidents.csv` and divide the data into three tables according to the values of the `Accident_Severity` attribute (the first table contains only the `Fatal` values, the second `Serious`, and the third `Slight`). (2p)
In [ ]:
# YOUR CODE HERE
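A minimal sketch of one way to do this, using a tiny hand-built frame in place of the real `Accidents.csv` (in Data Lab you would load it with `pd.read_csv`); the string labels `Fatal`/`Serious`/`Slight` are an assumption — in the raw dataset severity may be stored as numeric codes:

```python
import pandas as pd

# Toy stand-in for the real file; replace with:
# accidents = pd.read_csv("Accidents.csv")
accidents = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3", "A4"],
    "Accident_Severity": ["Fatal", "Serious", "Slight", "Slight"],
})

# Boolean masks split the frame into one table per severity level
accidents_fatal = accidents[accidents["Accident_Severity"] == "Fatal"]
accidents_serious = accidents[accidents["Accident_Severity"] == "Serious"]
accidents_slight = accidents[accidents["Accident_Severity"] == "Slight"]
```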
- From the table with the `Slight` value, randomly select 10% of the examples using the `sample` method. The following example demonstrates the usage of this method. (2p)
In [ ]:
# `frac` specifies the fraction of examples to select (0.1 = 10%); `random_state` initializes the random
# number generator so that the same selection can be replicated
sample_data = accidents_slight.sample(frac=0.1, random_state=1234)
In [ ]:
# YOUR CODE HERE
- Combine all three tables into a modified `Accidents` table, which will contain 10% of the `Slight` examples and all of the `Fatal` and `Serious` severity examples. After merging you should have 45,021 examples. (2p)
In [ ]:
# YOUR CODE HERE
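One way to stack the three tables back together is `pd.concat`; a minimal sketch with toy stand-ins for the three severity tables (row counts here are illustrative, not the real 45,021):

```python
import pandas as pd

# Toy stand-ins for the three tables produced in the previous steps
accidents_fatal = pd.DataFrame({"Accident_Severity": ["Fatal"] * 2})
accidents_serious = pd.DataFrame({"Accident_Severity": ["Serious"] * 3})
sample_data = pd.DataFrame({"Accident_Severity": ["Slight"] * 4})  # the 10% Slight sample

# Stack the tables row-wise; ignore_index renumbers the resulting rows
accidents = pd.concat([accidents_fatal, accidents_serious, sample_data],
                      ignore_index=True)
```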
- Join the modified `Accidents` table with the `Vehicles` table on the `Accident_Index` key so that the resulting table contains only the vehicles from accidents in the modified `Accidents` table. After merging, you should get a reduced training set with fewer examples to use for further data analysis. Since we have reduced the number of less severe examples, we have increased the weight of the more severe ones. (2p)
In [ ]:
# YOUR CODE HERE
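An inner join on `Accident_Index` keeps exactly the vehicles whose accident appears in the reduced table; a sketch on toy frames (column values are made up):

```python
import pandas as pd

accidents = pd.DataFrame({
    "Accident_Index": ["A1", "A2"],
    "Accident_Severity": ["Fatal", "Slight"],
})
vehicles = pd.DataFrame({
    "Accident_Index": ["A1", "A1", "A2", "A3"],
    "Vehicle_Type": [9, 11, 9, 19],
})

# how="inner" drops vehicle rows (here A3) whose accident is not in the reduced table
data = accidents.merge(vehicles, on="Accident_Index", how="inner")
```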
- Select only the following attributes for further analysis:
  `Day_of_Week`, `1st_Road_Class`, `Road_Type`, `Light_Conditions`, `Weather_Conditions`, `Road_Surface_Conditions`, `Urban_or_Rural_Area`, `Vehicle_Type`, `Sex_of_Driver`, `Age_of_Driver`, `Engine_Capacity_(CC)`, `Age_of_Vehicle`, `Accident_Severity`
We perform this selection in order to remove redundant attributes from the dataset, e.g. those describing geolocation, as well as attributes that cannot be used for prediction (they are not known before the accident itself occurs). (2p)
In [ ]:
# YOUR CODE HERE
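Selecting a fixed list of columns is a plain pandas indexing operation; a sketch on a toy frame with one extra (hypothetical) column to drop:

```python
import pandas as pd

columns = [
    "Day_of_Week", "1st_Road_Class", "Road_Type", "Light_Conditions",
    "Weather_Conditions", "Road_Surface_Conditions", "Urban_or_Rural_Area",
    "Vehicle_Type", "Sex_of_Driver", "Age_of_Driver", "Engine_Capacity_(CC)",
    "Age_of_Vehicle", "Accident_Severity",
]
# Toy frame: the listed attributes plus a geolocation column we want to discard
data = pd.DataFrame({c: [0] for c in columns + ["Longitude"]})

data = data[columns]  # keep only the listed attributes, in this order
```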
- Count the number of missing values for individual attributes. Fill in the missing values appropriately (note: missing values are marked with -1). (4p)
In [ ]:
# YOUR CODE HERE
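A sketch of one possible approach on toy data: convert the `-1` markers to `NaN`, count the gaps per attribute, and fill numeric attributes with the column median (the median is just one reasonable choice — a mode or a per-group fill may suit some attributes better):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Age_of_Driver": [30, -1, 45],
    "Age_of_Vehicle": [5, 7, -1],
})

data = data.replace(-1, np.nan)       # -1 marks a missing value in this dataset
missing_counts = data.isna().sum()    # missing values per attribute

# One simple imputation strategy: fill numeric attributes with the column median
data = data.fillna(data.median())
```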
- Using contingency tables, show the dependencies between the following attributes and the `Accident_Severity` target attribute:
  `Day_of_Week`, `Sex_of_Driver`, `Age_of_Driver` (this attribute needs to be discretized appropriately)
Use one of the visualizations in the seaborn library to graphically display these relationships. (5p)
In [ ]:
# YOUR CODE HERE
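A sketch on toy data: `pd.crosstab` builds the contingency table, `pd.cut` discretizes the driver age (the bin edges here are arbitrary), and `sns.heatmap` is one seaborn visualization that displays the table graphically:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import pandas as pd
import seaborn as sns

data = pd.DataFrame({
    "Day_of_Week": [1, 2, 2, 3, 1, 2],
    "Age_of_Driver": [22, 30, 41, 55, 19, 33],
    "Accident_Severity": ["Fatal", "Slight", "Slight", "Serious", "Slight", "Fatal"],
})

# Discretize age into bands before cross-tabulating (bin edges are illustrative)
data["Age_band"] = pd.cut(data["Age_of_Driver"], bins=[0, 25, 45, 65, 120])

ct = pd.crosstab(data["Day_of_Week"], data["Accident_Severity"])
ax = sns.heatmap(ct, annot=True, fmt="d")  # graphical view of the contingency table
```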
- Create a dataset in which you replace all nominal attributes with numeric or binary ones. (3p)
In [ ]:
# YOUR CODE HERE
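One common way to do this replacement is one-hot encoding with `pd.get_dummies`; a sketch on a toy frame (the category values are made up):

```python
import pandas as pd

data = pd.DataFrame({
    "Road_Type": ["Single carriageway", "Roundabout", "Single carriageway"],
    "Sex_of_Driver": ["Male", "Female", "Male"],
    "Age_of_Driver": [30, 45, 22],
})

# One-hot encode the nominal attributes; numeric columns pass through unchanged
encoded = pd.get_dummies(data, columns=["Road_Type", "Sex_of_Driver"])
```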
- Divide the data into training and test sets in the ratio 70/30. Use the `Accident_Severity` attribute as the target attribute. (2p)
In [ ]:
# YOUR CODE HERE
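A sketch using scikit-learn's `train_test_split` on a toy frame; `stratify=y` is an optional extra that keeps the class proportions equal in both splits:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "Age_of_Driver": range(10),
    "Accident_Severity": [0, 1] * 5,
})
X = data.drop(columns=["Accident_Severity"])
y = data["Accident_Severity"]

# 70/30 split; stratify keeps the severity distribution the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1234, stratify=y)
```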
- Use the `SelectKBest` function with `mutual_info_classif` to calculate the importance of the individual attributes for prediction on the training set. Try to use this information in the preprocessing of the data for some of the models. (3p)
In [ ]:
# YOUR CODE HERE
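A sketch on synthetic data (`make_classification` stands in for the training set): `SelectKBest` scores each feature with the mutual-information scorer and keeps the `k` best ones; `selector.scores_` holds the per-attribute importances:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the training set: 100 rows, 5 features
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_new = selector.fit_transform(X, y)   # keeps only the 3 highest-scoring features
scores = selector.scores_              # mutual information per attribute
```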
- Train different classification models for the prediction of the `Accident_Severity` attribute. Train the following models with default parameters:
- k-nearest neighbors
- Decision trees
- Random forests
Evaluate the individual models with 10-fold cross-validation using the accuracy metric.
Note: for each model, choose a suitable preprocessing method (a possible modification of the preprocessing is in step 8). (6p)
In [ ]:
# YOUR CODE HERE
- Compare the trained models also using the ROC curve on the test set. Identify the model that gives the best results with default parameters. In this step, try to tune that model by finding the best-fitting parameters using `GridSearchCV`. Find and report the best combination of parameters. (4p)
In [ ]:
# YOUR CODE HERE
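A sketch on synthetic binary data (the real `Accident_Severity` has three classes, which needs one ROC curve per class): `GridSearchCV` searches a small, illustrative parameter grid for a random forest, and `roc_curve` is computed from the tuned model's test-set probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Illustrative grid; a real search would cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
best_params = grid.best_params_  # the best combination found

# ROC curve of the tuned model on the test set (binary case shown here)
probs = grid.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
roc_auc = auc(fpr, tpr)
```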
- Train the model with the best parameters on the entire training set and test it on the test set. Evaluate the model using the `accuracy`, `precision`, and `recall` metrics, and compute its confusion matrix. (3p)
In [ ]:
# YOUR CODE HERE
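A sketch of the final evaluation on synthetic data; the random forest with `n_estimators=100` merely stands in for whatever model and parameters the grid search selected, and `average="weighted"` is one way to aggregate precision/recall when the target has more than two classes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Stand-in for the tuned model from the previous step
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, average="weighted")
rec = recall_score(y_test, y_pred, average="weighted")
cm = confusion_matrix(y_test, y_pred)  # rows: true classes, columns: predicted
```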