Exam 2

Exam rules

This exam takes place during regular class time on Thursday 11th December 2025 from 09:00 to 10:45 in classroom 2.05.

This is an open exam: you may consult any resources during the exam, on condition that you observe good citation practice. You must cite every source you use, such as documentation, Stack Overflow or blog posts, by providing a link to its URL. Failure to cite your sources may be considered plagiarism.

If you use Large Language Models, such as ChatGPT, you must link to your entire dialogue or provide a copy of it in your repo.

During the exam you may not communicate with any other person.

Submission

You should make a git repository to hold your work and push it to the AUC Forgejo server in the same way as you have done for each of your weekly assignments.

Part of your grade will be for good git discipline. Each commit should be a single, logical unit of work. Commit often, commit small. Use clear and descriptive commit messages.

Push frequently to avoid losing work in the event of a problem with your laptop.

Your grade will be based on the work you push to the Forgejo server before the end of the exam.

Instructions

Data set

The salary data set

Salary data was collected from the US Census Bureau database in 1994 with a view to predicting whether a person makes over $50k a year based on demographic variables.

The input variables are:

age
  the person's age
workclass
  the person's working class
education
  the highest level of education completed
years-of-education
  the number of years of education followed
marital-status
  whether the person is married, single, divorced, etc.
occupation
  one of a number of categories of employment
race
  the person's race
sex
  the person's sex
hours-per-week
  the number of hours per week worked
native-country
  the person's country of origin

The output variable is >50k, indicating whether the person earns more than $50k per year.
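
A first look at a table like this usually checks for missing values and for the balance of the output variable. The following is a minimal sketch, assuming pandas is available. The rows and category labels are invented solely so that the sketch runs, and the helper name `summarise` is hypothetical. The real file name and the way it encodes missing values are not stated here, so check both before loading the actual data.

```python
import pandas as pd

# Invented rows for illustration only; they are NOT taken from the real data set.
sample = pd.DataFrame(
    {
        "age": [39, 50, 38, 53, 28, 37],
        "workclass": ["government", "private", None, "private", "private", "private"],
        "hours-per-week": [40, 13, 40, 40, 40, 60],
        ">50k": [0, 1, 0, 0, 0, 1],
    }
)


def summarise(df: pd.DataFrame, target: str) -> dict:
    """Return missing-value counts per column and the share of each target class."""
    missing = df.isna().sum().to_dict()
    balance = df[target].value_counts(normalize=True).to_dict()
    return {"missing": missing, "balance": balance}


summary = summarise(sample, ">50k")
print(summary["missing"])
print(summary["balance"])
```

A strongly unbalanced output variable is worth noting early, since it affects which performance metric is informative.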

Task

Your task is to develop a random forest model from the data in order to predict whether a person will earn more than $50k per year, and to write a report on your findings.

Your report should use data-driven arguments to convince the reader of the soundness and validity of your model and of the decisions you took along the way. It should also describe the quality of the model and any limitations it may have.

You may choose to follow some or all of the following steps.

  1. Analyse the data set, identifying characteristics that might be useful or indeed problematic for modelling.
  2. Clean the data where necessary. In particular, where you find missing data, make an attempt to impute the missing values.
  3. Examine each feature of the data set, making observations about it which might have a bearing on your modelling.
  4. Select a performance metric and motivate your choice.
  5. Make explicit any assumptions you make about the data.
  6. Select and apply transformations of features to make them more amenable to modelling.
  7. Split the data into training and test sets.
  8. Analyse the performance of your model.
  9. Fine-tune the model by optimizing its hyperparameters.
  10. Identify which variables are most important in predicting the outcome.
  11. Deliver a final model together with its performance metric.
  12. If you have ideas for improving the model but do not have the time to carry them out, write them in a section on future work.

Author: Breanndán Ó Nualláin <o@uva.nl>

Date: 2025-12-11 Thu 08:56