Is there any relationship between zodiac signs and one's life? Some people believe that individuals of the same sign share similar characteristics. This project tries to predict a user's sign from the other information in their profile.
1. Exploratory Data Analysis
The dataset in this project comes from a single file, profiles.csv, containing 59,946 records. There are 31 columns, of which only three are numeric features: age, height, and income. The rest are categorical features and open-ended text. The columns are as follows:
- age: numeric variable of users' ages
- body_type: nominal categorical variable of users' body shapes, such as average, thin, a little extra, etc.
- diet: categorical variable of the kind of food users eat, such as anything, vegetarian, vegan, halal, etc.
- drinks: categorical variable of how often a user drinks
- drugs: categorical variable of how often a user takes drugs
- education: categorical variable of a user's highest education
- essay0 - essay9: these ten columns contain a user's short answers to the prompts below
- essay0: My self summary
- essay1: What I'm doing with my life
- essay2: I'm really good at
- essay3: The first thing people usually notice about me
- essay4: Favorite books, movies, shows, music, and food
- essay5: The six things I could never do without
- essay6: I spend a lot of time thinking about
- essay7: On a typical Friday night I am
- essay8: The most private thing I am willing to admit
- essay9: You should message me if…
- ethnicity: categorical variable of a user's ethnicity
- height: numeric variable of users' heights in inches
- income: numeric variable of users' incomes
- job: categorical variable of a user's job
- last_online: date and time when a user was last online
- location: categorical variable of the area where a user lives
- offspring: categorical variable of whether a user has kids and their thoughts about having (more) kids
- orientation: categorical variable of a user's sexual orientation, e.g. straight, gay, bisexual
- pets: categorical variable of whether a user has pets and how they feel about dogs and cats
- religion: categorical variable of a user's religion
- sex: categorical variable of a user's sex: 'm' for male, 'f' for female
- sign: categorical variable of a user's zodiac sign, such as aries, taurus, gemini
- smokes: categorical variable of how often a user smokes
- speaks: free-text answer listing the languages a user speaks
- status: categorical variable of a user's relationship status, such as single, available, married
Missing Values
To check the completeness of data in each column, a visualization was created with the Python library missingno.
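A minimal sketch of how that figure and the null counts shown further below can be reproduced, assuming the msno.matrix completeness plot was used (the dataframe name profiles is illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt
import missingno as msno

# Load the dataset (59,946 records, 31 columns)
profiles = pd.read_csv('profiles.csv')

# Visualize which cells are present or missing in each column
msno.matrix(profiles)
plt.show()

# Count the missing values per column
print(profiles.isnull().sum())
```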

The column with the greatest number of null values is offspring (around 59% missing). Meanwhile, 41% and 34% of the records in the diet and religion columns are missing. The number of missing values in each column is shown below:
age 0
body_type 5296
diet 24395
drinks 2985
drugs 14080
education 6628
essay0 5488
essay1 7572
essay2 9638
essay3 11476
essay4 10537
essay5 10850
essay6 13771
essay7 12451
essay8 19225
essay9 12603
ethnicity 5680
height 3
income 0
job 8198
last_online 0
location 0
offspring 35561
orientation 0
pets 19921
religion 20226
sex 0
sign 11056
smokes 5512
speaks 50
status 0
dtype: int64
1.1 Numeric Features
Exploring the numeric data further, only three columns are numeric: age, height, and income. OKCupid members' ages range from 18 to 110 years, and half of them are between 26 and 37. The distribution of age is shown below. There are some anomalies: height should not be as low as 1 inch, and around 80% of users have an income of -1, presumably a placeholder for unreported income.
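These figures can be verified with a quick summary of the numeric columns (a sketch continuing with the profiles dataframe from above; the 20-inch cutoff is an arbitrary illustration):

```python
# Quartiles: half of the users are between 26 and 37 years old
print(profiles[['age', 'height', 'income']].describe())

# The anomalies: implausibly short heights and the -1 income placeholder
print((profiles['height'] < 20).sum())    # users with height under 20 inches
print((profiles['income'] == -1).mean())  # fraction with income == -1 (~0.8)
```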
1.2 Categorical Features
Some categorical features have very detailed choices, e.g. religion and sign. In the religion column, the data includes both the religion and the level of seriousness about it. For example, people who report agnosticism may be very serious about it, not too serious about it, or laughing about it. Thus, the columns religion_cleaned and sign_cleaned were created, keeping only the religion or sign itself.
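A minimal sketch of this cleaning step, assuming the seriousness qualifier always follows the religion or sign name, so that the first word is the category itself:

```python
# Keep only the first word, e.g. 'agnosticism and very serious about it'
# becomes 'agnosticism'; NaN values stay NaN
profiles['religion_cleaned'] = profiles['religion'].str.split().str[0]
profiles['sign_cleaned'] = profiles['sign'].str.split().str[0]
```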
In this project, the features chosen to predict the variable sign_cleaned include body_type, diet, drinks, drugs, education, job, offspring, pets, religion_cleaned, and smokes. Each feature's distinct values, excluding NaN, are listed below.
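A sketch of how these values can be enumerated (output follows):

```python
features = ['body_type', 'diet', 'drinks', 'drugs', 'education', 'job',
            'offspring', 'pets', 'religion_cleaned', 'smokes']

# Print each feature's distinct values, excluding NaN
for col in features + ['sign_cleaned']:
    values = profiles[col].dropna().unique()
    print(f'{col}, {len(values)} distinct values: {values}')
```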
body_type, 12 distinct values: ['a little extra' 'average' 'thin' 'athletic' 'fit' 'skinny' 'curvy'
'full figured' 'jacked' 'rather not say' 'used up' 'overweight']
diet, 18 distinct values: ['strictly anything' 'mostly other' 'anything' 'vegetarian'
'mostly anything' 'mostly vegetarian' 'strictly vegan'
'strictly vegetarian' 'mostly vegan' 'strictly other' 'mostly halal'
'other' 'vegan' 'mostly kosher' 'strictly halal' 'halal'
'strictly kosher' 'kosher']
drinks, 6 distinct values: ['socially' 'often' 'not at all' 'rarely' 'very often' 'desperately']
drugs, 3 distinct values: ['never' 'sometimes' 'often']
education, 32 distinct values: ['working on college/university' 'working on space camp'
'graduated from masters program' 'graduated from college/university'
'working on two-year college' 'graduated from high school'
'working on masters program' 'graduated from space camp'
'college/university' 'dropped out of space camp'
'graduated from ph.d program' 'graduated from law school'
'working on ph.d program' 'two-year college'
'graduated from two-year college' 'working on med school'
'dropped out of college/university' 'space camp'
'graduated from med school' 'dropped out of high school'
'working on high school' 'masters program' 'dropped out of ph.d program'
'dropped out of two-year college' 'dropped out of med school'
'high school' 'working on law school' 'law school'
'dropped out of masters program' 'ph.d program'
'dropped out of law school' 'med school']
job, 21 distinct values: ['transportation' 'hospitality / travel' 'student'
'artistic / musical / writer' 'computer / hardware / software'
'banking / financial / real estate' 'entertainment / media'
'sales / marketing / biz dev' 'other' 'medicine / health'
'science / tech / engineering' 'executive / management'
'education / academia' 'clerical / administrative'
'construction / craftsmanship' 'rather not say' 'political / government'
'law / legal services' 'unemployed' 'military' 'retired']
offspring, 15 distinct values: ['doesn’t have kids, but might want them'
'doesn’t want kids' 'doesn’t have kids, but wants them'
'doesn’t have kids' 'wants kids' 'has a kid' 'has kids'
'doesn’t have kids, and doesn’t want any'
'has kids, but doesn’t want more'
'has a kid, but doesn’t want more' 'has a kid, and wants more'
'has kids, and might want more' 'might want kids'
'has a kid, and might want more' 'has kids, and wants more']
pets, 15 distinct values: ['likes dogs and likes cats' 'has cats' 'likes cats'
'has dogs and likes cats' 'likes dogs and has cats'
'likes dogs and dislikes cats' 'has dogs' 'has dogs and dislikes cats'
'likes dogs' 'has dogs and has cats' 'dislikes dogs and has cats'
'dislikes dogs and dislikes cats' 'dislikes cats'
'dislikes dogs and likes cats' 'dislikes dogs']
religion_cleaned, 9 distinct values: ['agnosticism' 'atheism' 'christianity' 'other' 'catholicism'
'buddhism' 'judaism' 'hinduism' 'islam']
sign_cleaned, 12 distinct values: ['gemini' 'cancer' 'pisces' 'aquarius' 'taurus' 'virgo' 'sagittarius'
'leo' 'aries' 'libra' 'scorpio' 'capricorn']
smokes, 5 distinct values: ['sometimes' 'no' 'when drinking' 'yes' 'trying to quit']
2. Machine Learning Model Training
Four machine learning models were trained for this multi-class classification task:
- K-Nearest Neighbors
- Random Forest
- Support Vector Machine
- Multinomial Naive Bayes
The first three models use the same data preparation; only the Naive Bayes model uses the essay columns as features. The dataset used for training and testing contained many null values; after dropping them, the dataframe has 7,404 rows in total. The remaining data was divided into training and test sets at an 80:20 ratio.
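A minimal sketch of this preparation, assuming the categorical features are one-hot encoded; the name X_prepared matches the cross-validation calls below:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

features = ['body_type', 'diet', 'drinks', 'drugs', 'education', 'job',
            'offspring', 'pets', 'religion_cleaned', 'smokes']

# Drop rows with missing values in the chosen columns (7,404 rows remain)
data = profiles[features + ['sign_cleaned']].dropna()

# One-hot encode the categorical features
X = pd.get_dummies(data[features])
y = data['sign_cleaned']

# 80:20 train/test split; X_prepared is the encoded training set
X_prepared, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```

Encoding before splitting keeps the train and test sets on the same set of dummy columns.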
2.1 K-Nearest Neighbors

The K-Nearest Neighbors classifier was first trained with the default k value of 5. The predicted classes differ considerably from the true training labels; the average F1-score is around 33%.
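A minimal sketch of that baseline classifier, assuming scikit-learn defaults (X_prepared and y_train come from the preparation above):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Default number of neighbors: k = 5
KNN_classifier = KNeighborsClassifier(n_neighbors=5)
KNN_classifier.fit(X_prepared, y_train)
```

Five-fold cross-validation was then performed: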
print(cross_val_score(KNN_classifier, X_prepared, y_train, scoring='f1_macro', cv=5))
[0.08563802 0.08084741 0.08077205 0.06490556 0.08670129]
The scores show that this KNN classifier does not work well on the validation folds. The average F1-score there is about 8%, no better than random guessing (1/12 ≈ 8.3% for twelve classes).
Moreover, training the model with different k values does no better: iterating over k values in a for-loop, the scores got worse as k increased.
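A sketch of that sweep; the exact range of k values tried is an assumption:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Mean cross-validated F1 for a range of k; scores drop as k grows
for k in range(1, 31, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_prepared, y_train,
                             scoring='f1_macro', cv=5)
    print(k, scores.mean())
```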

2.2 Random Forest

It is clear that the model overfits the training data, as it achieves a perfect score on it. Re-evaluated with cross-validation, the model works no better than random guessing: the average F1-score is around 8%.
This may be because some hyperparameters were inappropriate. For example, the trees within the classifier were grown too deep, with depths ranging from 44 to 74. The function GridSearchCV was used to tune several hyperparameters: max_depth, max_features, min_samples_split, n_estimators, and bootstrap.
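A minimal sketch of the search; the parameter grids shown are illustrative assumptions, not the exact values tried:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [5, 10, 20, None],
    'max_features': ['sqrt', 'log2'],
    'min_samples_split': [2, 10, 50],
    'bootstrap': [True, False],
}

grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid, scoring='f1_macro', cv=5)
grid_search.fit(X_prepared, y_train)
print(grid_search.best_params_, grid_search.best_score_)
```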
Even though the hyperparameters were varied across the grid, the best score from the parameters tried was still around 8%. The Random Forest model is not good enough for sign prediction.
2.3 Support Vector Machine

Similar to the KNN model, the F1-scores on the training data and under cross-validation were 38% and 8%, respectively.
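A sketch of the equivalent evaluation, assuming scikit-learn's SVC with default settings:

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

svm_classifier = SVC()  # default RBF kernel, assumed here
svm_classifier.fit(X_prepared, y_train)

scores = cross_val_score(svm_classifier, X_prepared, y_train,
                         scoring='f1_macro', cv=5)
print(scores.mean())  # around 0.08
```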
2.4 Multinomial Naive Bayes
For Multinomial Naive Bayes, only the essay columns were used for training and prediction. The preprocessing method is Term Frequency-Inverse Document Frequency (TF-IDF). The dataset for training and testing includes 26,117 samples in total.
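A minimal sketch of this pipeline; how the ten essays were combined into one document per user, and the filtering that yields the final sample, are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Concatenate the ten essay columns into a single text per user
essay_cols = [f'essay{i}' for i in range(10)]
essays = profiles[essay_cols].fillna('').agg(' '.join, axis=1)

# Keep users with a known sign and at least some essay text
mask = profiles['sign_cleaned'].notna() & (essays.str.strip() != '')
X_text, y_sign = essays[mask], profiles.loc[mask, 'sign_cleaned']

X_tr, X_te, y_tr, y_te = train_test_split(X_text, y_sign,
                                          test_size=0.2, random_state=42)

# TF-IDF features feeding a Multinomial Naive Bayes classifier
nb_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb_model.fit(X_tr, y_tr)
print(classification_report(y_te, nb_model.predict(X_te)))
```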
From the classification report, 'Cancer' and 'Gemini' seemed to be predicted reasonably well, but there are some classes the model is very bad at predicting. In particular, under cross-validation the F1-score was worse than that of the other models.
3. Conclusion
None of the four machine learning models can predict users' signs well. On the training data, K-Nearest Neighbors, Random Forest, SVM, and Multinomial Naive Bayes had F1-scores of 33%, 100%, 38%, and 31%, respectively, but those numbers reflect overfitting. When tested on validation data via cross-validation, none of the models do better than random guessing, which is around 8% for twelve classes.
Future Work
We can investigate further whether 'Cancer' and 'Gemini' can really be predicted well. Moreover, since the models depend entirely on the features provided, improving prediction performance means deciding which additional features should be collected. Perhaps consult some astrologers!