The main requirement of Project One is to analyze a structured data set by applying machine learning techniques covered in this course.
Student Name:
Data Source: The data was sourced from Kaggle. It was available as a single data set, so no files had to be combined. The data set contains 753 observations (rows) and 18 variables (columns). The data does not have to remain confidential.
Python Code:
Describe the dataset: The dataset contains information on the work and household characteristics of married individuals in the United States and originated from a 1976 panel study of income. It contains a mix of integer, float, and object columns. Its dimensions are (753, 18), and no imputation was required because there were no missing values.
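The checks described above (dimensions, column types, and missing values) can be sketched with pandas. This is a minimal illustration, not the code used for the project: the rows below are made-up stand-ins, and the column names other than educw are assumptions.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data set has 753 rows and 18 columns.
df = pd.DataFrame({
    "work": ["yes", "no", "yes", "no"],          # text column (assumed name)
    "educw": [12, 10, 16, 12],                   # integer column
    "income": [16310.0, 21800.0, 21040.0, 7300.0],  # float column (assumed name)
})

print(df.shape)    # (rows, columns)
print(df.dtypes)   # data type of each column

# Count missing values per column; imputation is only needed if any exist.
missing_per_column = df.isna().sum()
print(missing_per_column)
needs_imputation = bool(missing_per_column.sum() > 0)
print("Imputation needed:", needs_imputation)
```

On the real file, the same calls would be applied to the DataFrame returned by `pd.read_csv`.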
Apply two types of machine learning techniques to analyze the data: Decision tree classification and k-means clustering.
Specify analytic questions for each technique: For k-means clustering, the question is whether education and experience can be used to group similar individuals together; with these clusters, additional analysis could be performed to understand differences and similarities in income or in the presence of children in the household. For decision tree classification, the question is which factors best differentiate whether the individual works.
Elaborate results from both statistical and practical perspectives: For the k-means clustering analysis, four final clusters were formed on the basis of the wife’s education (in years) and the wife’s experience (in years). The elbow method suggested roughly three to five clusters, and four gave the most even allocation of entries across clusters. The clusters are: individuals with fewer years of education and fewer years of experience; individuals with mid-level education and fewer years of experience; individuals with mid to high levels of experience and generally mid-level education; and individuals with high education and low to mid levels of experience.
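The elbow method and the four-cluster fit can be sketched with scikit-learn as follows. This is an illustrative sketch under stated assumptions rather than the project's actual code: the education and experience values are randomly generated stand-ins, and the range of k values and the `n_init` and `random_state` settings are choices made only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the wife's education and experience, in years.
education = rng.integers(5, 18, size=200)
experience = rng.integers(0, 40, size=200)
X = np.column_stack([education, experience]).astype(float)

# Elbow method: record the within-cluster sum of squares (inertia) for each k
# and look for the k after which the decrease levels off.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    print(k, round(model.inertia_, 1))

# Fit the chosen number of clusters and check how entries are allocated.
final = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = final.labels_
cluster_sizes = np.bincount(labels, minlength=4)
print("Entries per cluster:", cluster_sizes)
print("Cluster centers (education, experience):")
print(final.cluster_centers_)
```

Comparing `cluster_sizes` across candidate values of k is one way to judge how evenly the entries are allocated.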
For the decision tree classification, the goal was to gain insight into the factors that may influence whether or not the wife in the household works. The first split of this decision tree is on the presence of a child who is less than six years old. If there is no such child in the household, the next split is on whether the wife is 46.5 years old or younger; if true, the wife’s years of education are used, and where the wife has more than 10.5 years of education, she is predicted to work. Although multiple variables were provided as inputs for this analysis, only the child attributes (child6 and child618), the age of the wife, and the education of the wife were used to determine the value of work (0/no or 1/yes). The accuracy of this model was roughly 66%, so some caution should be used when interpreting the results.
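A decision tree of this kind can be fitted and scored with scikit-learn along the following lines. This is a hedged sketch, not the project's code: the data is synthetic, the name agew for the wife's age is an assumption, and the depth limit and 70/30 split are illustrative choices. The accuracy it prints therefore says nothing about the roughly 66% reported above.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300
# Hypothetical stand-in features; "agew" is an assumed column name.
child6 = rng.integers(0, 3, size=n)
child618 = rng.integers(0, 4, size=n)
agew = rng.integers(30, 61, size=n)
educw = rng.integers(5, 18, size=n)
feature_names = ["child6", "child618", "agew", "educw"]
X = np.column_stack([child6, child618, agew, educw])

# Synthetic 0/1 work label with 20% of labels flipped as noise.
y = ((child6 == 0) & (educw > 10)).astype(int)
flip = rng.random(n) < 0.2
y = np.where(flip, 1 - y, y)

# Hold out 30% of the rows to estimate accuracy on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
accuracy = accuracy_score(y_test, tree.predict(X_test))
print("Test accuracy:", round(accuracy, 3))

# An importance of zero means the tree never split on that input.
importances = dict(zip(feature_names, tree.feature_importances_))
print(importances)
```

Inspecting `feature_importances_` is one way to confirm which of the supplied inputs the fitted tree actually used.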
Summary: A dataset containing over 750 observations across 18 variables was obtained that provides insights into the work and household characteristics of married couples in 1976. This dataset was available as one set, contained a mix of variable types, and did not have any missing values. Upon reviewing this dataset, I was interested in exploring relationships between the working status of the wife and her educational background. I initially set out to uncover the relationship between the wife’s educational status (educw) and her parents’ educational status (educwm for the wife’s mother, educwf for the wife’s father), but was unable to find any direct, statistically supported correlation. As a result, I adjusted my exploration to uncover insights related to the wife’s education, work experience, and work status (working or not working), and how the presence of children in the household might impact the wife’s working status.
For this analysis, both k-means clustering and decision tree classification were executed. With k-means clustering, I sought to uncover similar groups based on the wife’s years of education and years of work experience. Four clusters worked well: the distribution of entries was fairly even without becoming so granular that insights were degraded. This resulted in the following clusters: individuals with fewer years of education and fewer years of experience; individuals with mid-level education and fewer years of experience; individuals with mid to high levels of experience and generally mid-level education; and individuals with high education and low to mid levels of experience. These clusters could be used for additional analysis, for example, analyzing income patterns within each cluster.
The second analysis performed was decision tree classification. Most of the variables supplied to this classification resulted in a decision node, but two did not: the education of the wife’s mother and the education of the wife’s father. I was slightly surprised to find that, again, the education of the wife’s mother and father does not appear to have a meaningful impact on, or predictive value for, the wife’s working status.
The initial decision node of this tree tests whether the number of children less than 6 years old is at most 0.5 (child6 <= 0.5), which for a whole-number count means there is no child less than 6 years old. If true, the next decision node is based on the wife’s age, namely whether she is 46.5 years old or younger. If false, meaning at least one child less than 6 years old is present, the next decision node is based on the wife’s years of education, namely whether she has 14.5 years of education or fewer. Continuing down this path, if the wife has more than 14.5 years of education, meaning this node is false, the tree then looks again at the number of children less than 6 years old. If that node is true, the final leaf node indicates that the wife does in fact work.
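The reason a count variable such as child6 is split at 0.5 can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the project's tree: it uses six made-up rows with a single feature, and it relies on scikit-learn placing thresholds midway between adjacent observed values.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows: number of children under six, and a 0/1 work label.
X = [[0], [0], [0], [1], [1], [2]]
y = [1, 1, 1, 0, 0, 0]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Print the fitted rules as text; the "<=" branch is the "true" branch.
rules = export_text(tree, feature_names=["child6"])
print(rules)

# The root threshold sits halfway between the counts 0 and 1.
root_threshold = tree.tree_.threshold[0]
print("Root threshold:", root_threshold)
```

Because child6 only takes whole-number values, the true branch of `child6 <= 0.5` contains exactly the households with no child under six, and the false branch contains those with one or more.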