Assignment Task
Important note about libraries
For this assessment, you are free to use any standard Python libraries, as well the libraries we have covered in subject contents. In fact, you must use pandas appropriately to fulfill the requirements of this assessment. You may, if it allows you to write more efficient or effective code, use additional libraries, provided these libraries are included in the standard Anaconda installation. You may not use any libraries that need to be installed separately (e.g., via conda or pip).
Detailed instructions
Your program will allow users to load a DataFrame from a CSV file, clean the data in various ways, display statistics, and create visualisations.
When the program runs, the user will see an introductory message (you are welcome to determine this as you see you fit, but make sure to include your name). For example:
- Welcome to The DataFrame Statistician!
- Programmed by Ada Lovelace
After the welcome message, the user will be presented with the following menu:
Please choose from the following options:
- Load data from a file
- View data
- Clean data
- Analyse data
- Visualise data
- Save data to a file
- Quit
Option 7 will exit the program; every other option will do some task and then display the menu again until the user chooses 7 from this menu.
If the user enters anything other than a value between 1 and 7, display an appropriate error message (e.g., Invalid selection!), then get the user to enter another choice.
Menu option 1 – load data from a file
When the user chooses option 1, they will be asked for a filename to load, which is expected to be in the same directory as the program (no need for path information). Your program should use the exact filename as stated. Do not append .csv or any other extension – although the contents of the file will be expected to be CSV, a CSV file could be stored under any extension, or no extension.
Your program should be able to handle any file in a format like the following:
day,min_temp,max_temp,rainfall,humidity 1,11,23,3,55
1,11,23,3,55
2,13,25,0,60
3,9,19,17,80
4,9,18,36,85
5,,,,50
6,12,22,,60
7,13,23,0,65
So, the first row should be the names of the columns, and the following rows should consist of the data. Your program should not be hard coded to deal with the example weather data above; it should work with any CSV file where all the column values are numeric and it can be loaded as a DataFrame. Your program should work for any number of rows or columns.
There are two problems your program may encounter here.
- the file does not exist or cannot be opened
- pandas cannot interpret the data as a DataFrame
In both of these cases your program should display an appropriate error message (e.g., “File not found”, “Unable to load data”) then return control to the main menu.
Your program only needs to handle one DataFrame in the system at a time. If a DataFrame was previously loaded, it should be replaced.
After the file loads successfully, the program should display the names of the columns, and ask the user if they want to set any of the columns as an index. Valid input in this case will consist of either one of the column names, or the blank string (user just presses `Enter`). If the input is not valid, loop until the user enters a valid column name or blank.
The program should then set the DataFrame’s index to the selected column or skip this if the user entered the blank string.
Menu option 2 – View data
This option simply prints the DataFrame to the screen. In the following example, day was set as the index when the DataFrame was loaded.
day |
min_temp | max_temp | rainfall | humidity |
1 | 11.0 | 23.0 | 3.0 | 55 |
1 | 11.0 | 23.0 | 3.0 | 55 |
2 | 13.0 | 25.0 | 0.0 | 60 |
3 | 9.0 | 19.0 | 17.0 | 80 |
3 | 9.0 | 19.0 | 17.0 | 80 |
4 | 9.0 | 18.0 | 36.0 | 85 |
5 | NaN | NaN | NaN | 50 |
6 | 12.0 | 22.0 | NaN | 60 |
7 | 13.0 | 23.0 | 0.0 | 65 |
Menu option 3 – Clean data
This option will enter a submenu offering various cleaning operations.
Cleaning data:
- Drop rows with missing values
- Fill missing values
- Drop duplicate rows
- Drop column
- Rename column
- Finish cleaning
Menu option 4 – Analyse data
For each of the columns in the DataFrame, produce a report like the one below. Make sure to use pandas functions as appropriate.
humidity | |
number of values (n): | 7 |
minimum: | 50.00 |
maximum: | 85.00 |
mean: | 65.00 |
median: | 60.00 |
standard deviation: | 12.91 |
std. err. of mean: | 4.88 |
Display each statistic to two decimal places (except for number of values, which is always a whole number). After displaying the statistics reports, finish by displaying a table of correlations like the one below (hint: you don’t have to write your own code to compute correlations, search the pandas docs).
min_temp | max_temp | rainfall | humidity | |
min_temp | 1.000000 | 0.916131 | -0.795016 | -0.845247 |
max_temp | 0.916131 | 1.000000 | -0.882108 | -0.920701 |
rainfall | -0.795016 | -0.882108 | 1.000000 | 0.882754 |
humidity | -0.845247 | -0.920701 | 0.882754 | 1.000000 |
Menu option 5 – Visualise data
In this case, ask the user:
- If they want a bar graph, line graph, or boxplot (repeat until they give a valid selection)
- Whether they want to use subplots
- For a title (skip if they leave it blank)
- For an x-axis label (skip if they leave it blank)
- For a y-axis label (skip if they leave it blank)
Menu option 6 – Save data to a file
Ask the user for a filename, including file extension (e.g., data.csv). Use the exact filename given including the extension – if the user wants to save with no extension or a non-standard one, let them do so.
f the user enters blank, cancel the saving operation. Otherwise, save the file in CSV format. If the DataFrame has a named index, save the file with the index. Otherwise, don’t save with the index.
Sample Output
It should be clear which parts below are user input (not printed, but entered by the user). The output below is not intended to be exhaustive/complete, but you can discern how the program should run fairly clearly from what is demonstrated here. E.g., you can see that all
invalid inputs are handled and that unless blank entries are meaningful (e.g., quitting the menu option or skipping an entry), invalid inputs lead to a repeat of the input.
Challenge Tasks
Quality control
Plastyco manufactures sheets of plastic and has a process that scans for defects in 100×100 grid locations. Sheets with defects can still be salvaged so long as the following conditions are met:
- There are no more than 8 defects overall
- There are no more than 4 defects in any 3×3 square
The image below shows 4 defects in a 3×3 square. A sheet with this set of defects could pass quality control.
Combining CSVs
Suppose we have a folder containing a large number of CSV files. Each CSV should have a column called key that should be used as the index. If the file does not have a key column, it should be skipped.
Most of the files will have a similar set of columns, with some variations. We need to combine all the data from each file into a single DataFrame.
Write a function combine_csv_files(directory, output_filename).
- directory is the folder location where the files can be found
- output_filename is where the combined DataFrame should be saved
The function should check through directory for all files with a .csv extension. Each of these files should be loaded as a DataFrame with key as its index. If there are .csv files that cannot be loaded as a DataFrame, these should be skipped. Do not allow the function to crash because there is a .csv file with an invalid format. If a .csv file does not contain a key column, it should also be skipped.
The DataFrames should then be concatenated vertically into a single DataFrame. E.g., if we had a 10-row frame and a 20-row frame these would produce a 30-row frame.
Reorder the columns in alphabetical order. If any column has more than 50% nulls, it should be dropped. In the case of unmatched columns (in some CSVs but not others), leave values with nulls until/unless these columns get dropped after all the frames have been concatenated. Reorder the rows on their index, then save the DataFrame as a CSV file to the given output_filename – include the header, and the index (key).
Note: assume both file paths, directory and output_filename, are relative to the working directory of the code. Do not use os.chdir or any other method to change the current working directory. Your code does not have to handle issues with the directory or output_filename – if they are not valid, it is fine for the function to crash.
You have been provided with sample input in the form of a folder of CSV files (zipped for convenience), csvdir.zip, and task2_out.csv which is the sample output your code should produce from these input files. As part of the challenge of this task, you will need to consider create your own test cases to make sure your function meets all of the requirements above.
Creating simple HTML pages
Suppose we want to create some very basic static HTML pages from data in CSV files.
Assume the CSV files are in a folder called htmlplots. In this folder is a file called index.txt
that lists the CSV files that should be turned into HTML pages. Write a function called make_html_files()
For each of these files create an HTML file (.html) that contains:
- A header with the name of the dataset (CSV filename with .csv removed)
- A link to the previous HTML file (unless this is the first)
- A link to the next HTML file (unless this is the last)
- The data in table format (pandas has a function that will help)
- A PNG image of a line plot of the data (matplotlib can save plots to a file)
All the HTML files and corresponding PNG images should be saved in the htmlplots folder. The pages should be linked in the order the CSV files appear in index.txt.
An HTML file is just a text file with particular content. Use the HTML files in the sample output as a starting point, but you may ‘upgrade’ your HTML to something more advanced as long as it provides the same required functionality.