Is there any relationship between zodiac signs and one's life? Some people believe that individuals of the same sign share similar characteristics. This project tries to predict a user's sign from the other information in their profile.
1. Exploratory Data Analysis
The dataset in this project comes from a single file, profiles.csv, containing 59,946 records. There are 31 columns, of which only three are numeric features: age, height, and income. The rest are categorical features and open-ended text. The columns are as follows:
- age: numeric variable of users' ages
- body_type: nominal categorical variable of users' body shapes, such as average, thin, a little extra, etc.
- diet: categorical variable of the kind of food users eat, such as anything, vegetarian, vegan, halal, etc.
- drinks: categorical variable of how often a user drinks
- drugs: categorical variable of how often a user takes drugs
- education: categorical variable of a user's highest education
- essay0 - essay9: these ten columns contain a user's short answers to the prompts below
- essay0: My self summary
- essay1: What I'm doing with my life
- essay2: I'm really good at
- essay3: The first thing people usually notice about me
- essay4: Favorite books, movies, shows, music, and food
- essay5: The six things I could never do without
- essay6: I spend a lot of time thinking about
- essay7: On a typical Friday night I am
- essay8: The most private thing I am willing to admit
- essay9: You should message me if…
- ethnicity: categorical variable of a user's ethnicity
- height: numeric variable of users' heights in inches
- income: numeric variable of users' incomes
- job: categorical variable of a user's job
- last_online: date and time when a user was last online
- location: categorical variable of the area where a user lives
- offspring: categorical variable of whether a user has kids and their thoughts about having (more) kids
- orientation: categorical variable of a user's sexual orientation, e.g. straight, gay, bisexual
- pets: categorical variable of whether a user has pets and how they feel about dogs and cats
- religion: categorical variable of a user's religion
- sex: categorical variable of a user's sex: 'm' for male, 'f' for female
- sign: categorical variable of a user's zodiac sign, such as aries, taurus, gemini
- smokes: categorical variable of how often a user smokes
- speaks: free-text answer listing the languages a user speaks
- status: categorical variable of a user's relationship status, such as single, available, married
Missing Values
To check the completeness of data in each column, a visualization was created with the Python library missingno.
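A minimal sketch of how that figure and the null counts shown further below can be reproduced, assuming the msno.matrix completeness plot was used (the dataframe name profiles is illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt
import missingno as msno

# Load the dataset (59,946 records, 31 columns)
profiles = pd.read_csv('profiles.csv')

# Visualize which cells are present or missing in each column
msno.matrix(profiles)
plt.show()

# Count the missing values per column
print(profiles.isnull().sum())
```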

The column with the greatest number of null values is offspring (around 59% missing). Meanwhile, 41% and 34% of the records in the diet and religion columns are missing. The number of missing values in each column is shown below:
age 0
body_type 5296
diet 24395
drinks 2985
drugs 14080
education 6628
essay0 5488
essay1 7572
essay2 9638
essay3 11476
essay4 10537
essay5 10850
essay6 13771
essay7 12451
essay8 19225
essay9 12603
ethnicity 5680
height 3
income 0
job 8198
last_online 0
location 0
offspring 35561
orientation 0
pets 19921
religion 20226
sex 0
sign 11056
smokes 5512
speaks 50
status 0
dtype: int64
1.1 Numeric Features
Exploring the numeric data further, only three columns are numeric: age, height, and income. OKCupid members' ages range from 18 to 110 years, and half of them are between 26 and 37. The distribution of age is shown below. There are some anomalies: height should not be as low as 1 inch, and around 80% of users have an income of -1, presumably a placeholder for unreported income.
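These figures can be verified with a quick summary of the numeric columns (a sketch continuing with the profiles dataframe from above; the 20-inch cutoff is an arbitrary illustration):

```python
# Quartiles: half of the users are between 26 and 37 years old
print(profiles[['age', 'height', 'income']].describe())

# The anomalies: implausibly short heights and the -1 income placeholder
print((profiles['height'] < 20).sum())    # users with height under 20 inches
print((profiles['income'] == -1).mean())  # fraction with income == -1 (~0.8)
```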
1.2 Categorical Features
Some categorical features have very detailed choices, e.g. religion and sign. In the religion column, the data includes both the religion and the level of seriousness about it. For example, people who report agnosticism may be very serious about it, not too serious about it, or laughing about it. Thus, the columns religion_cleaned and sign_cleaned were created, keeping only the religion or sign itself.
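A minimal sketch of this cleaning step, assuming the seriousness qualifier always follows the religion or sign name, so that the first word is the category itself:

```python
# Keep only the first word, e.g. 'agnosticism and very serious about it'
# becomes 'agnosticism'; NaN values stay NaN
profiles['religion_cleaned'] = profiles['religion'].str.split().str[0]
profiles['sign_cleaned'] = profiles['sign'].str.split().str[0]
```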
In this project, the features chosen to predict the variable sign_cleaned include body_type, diet, drinks, drugs, education, job, offspring, pets, religion_cleaned, and smokes. Each feature's distinct values, excluding NaN, are listed below.
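A sketch of how these values can be enumerated (output follows):

```python
features = ['body_type', 'diet', 'drinks', 'drugs', 'education', 'job',
            'offspring', 'pets', 'religion_cleaned', 'smokes']

# Print each feature's distinct values, excluding NaN
for col in features + ['sign_cleaned']:
    values = profiles[col].dropna().unique()
    print(f'{col}, {len(values)} distinct values: {values}')
```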
body_type, 12 distinct values: ['a little extra' 'average' 'thin' 'athletic' 'fit' 'skinny' 'curvy'
'full figured' 'jacked' 'rather not say' 'used up' 'overweight']
diet, 18 distinct values: ['strictly anything' 'mostly other' 'anything' 'vegetarian'
'mostly anything' 'mostly vegetarian' 'strictly vegan'
'strictly vegetarian' 'mostly vegan' 'strictly other' 'mostly halal'
'other' 'vegan' 'mostly kosher' 'strictly halal' 'halal'
'strictly kosher' 'kosher']
drinks, 6 distinct values: ['socially' 'often' 'not at all' 'rarely' 'very often' 'desperately']
drugs, 3 distinct values: ['never' 'sometimes' 'often']
education, 32 distinct values: ['working on college/university' 'working on space camp'
'graduated from masters program' 'graduated from college/university'
'working on two-year college' 'graduated from high school'
'working on masters program' 'graduated from space camp'
'college/university' 'dropped out of space camp'
'graduated from ph.d program' 'graduated from law school'
'working on ph.d program' 'two-year college'
'graduated from two-year college' 'working on med school'
'dropped out of college/university' 'space camp'
'graduated from med school' 'dropped out of high school'
'working on high school' 'masters program' 'dropped out of ph.d program'
'dropped out of two-year college' 'dropped out of med school'
'high school' 'working on law school' 'law school'
'dropped out of masters program' 'ph.d program'
'dropped out of law school' 'med school']
job, 21 distinct values: ['transportation' 'hospitality / travel' 'student'
'artistic / musical / writer' 'computer / hardware / software'
'banking / financial / real estate' 'entertainment / media'
'sales / marketing / biz dev' 'other' 'medicine / health'
'science / tech / engineering' 'executive / management'
'education / academia' 'clerical / administrative'
'construction / craftsmanship' 'rather not say' 'political / government'
'law / legal services' 'unemployed' 'military' 'retired']
offspring, 15 distinct values: ['doesn’t have kids, but might want them'
'doesn’t want kids' 'doesn’t have kids, but wants them'
'doesn’t have kids' 'wants kids' 'has a kid' 'has kids'
'doesn’t have kids, and doesn’t want any'
'has kids, but doesn’t want more'
'has a kid, but doesn’t want more' 'has a kid, and wants more'
'has kids, and might want more' 'might want kids'
'has a kid, and might want more' 'has kids, and wants more']
pets, 15 distinct values: ['likes dogs and likes cats' 'has cats' 'likes cats'
'has dogs and likes cats' 'likes dogs and has cats'
'likes dogs and dislikes cats' 'has dogs' 'has dogs and dislikes cats'
'likes dogs' 'has dogs and has cats' 'dislikes dogs and has cats'
'dislikes dogs and dislikes cats' 'dislikes cats'
'dislikes dogs and likes cats' 'dislikes dogs']
religion_cleaned, 9 distinct values: ['agnosticism' 'atheism' 'christianity' 'other' 'catholicism'
'buddhism' 'judaism' 'hinduism' 'islam']
sign_cleaned, 12 distinct values: ['gemini' 'cancer' 'pisces' 'aquarius' 'taurus' 'virgo' 'sagittarius'
'leo' 'aries' 'libra' 'scorpio' 'capricorn']
smokes, 5 distinct values: ['sometimes' 'no' 'when drinking' 'yes' 'trying to quit']
2. Machine Learning Model Training
Four machine learning models were trained for this multi-class classification task:
- K-Nearest Neighbors
- Random Forest
- Support Vector Machine
- Multinomial Naive Bayes
The first three models use the same data preparation; only the Naive Bayes model uses the essay columns as features. The dataset used for training and testing contained many null values; after dropping them, the dataframe has 7,404 rows in total. The remaining data was divided into training and test sets at an 80:20 ratio.
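A minimal sketch of this preparation, assuming the categorical features are one-hot encoded; the name X_prepared matches the cross-validation calls below:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

features = ['body_type', 'diet', 'drinks', 'drugs', 'education', 'job',
            'offspring', 'pets', 'religion_cleaned', 'smokes']

# Drop rows with missing values in the chosen columns (7,404 rows remain)
data = profiles[features + ['sign_cleaned']].dropna()

# One-hot encode the categorical features
X = pd.get_dummies(data[features])
y = data['sign_cleaned']

# 80:20 train/test split; X_prepared is the encoded training set
X_prepared, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```

Encoding before splitting keeps the train and test sets on the same set of dummy columns.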
2.1 K-Nearest Neighbors

The K-Nearest Neighbors classifier was first trained with the default k value of 5. The predicted classes differ considerably from the true training labels; the average F1-score is around 33%.
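A minimal sketch of that baseline classifier, assuming scikit-learn defaults (X_prepared and y_train come from the preparation above):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Default number of neighbors: k = 5
KNN_classifier = KNeighborsClassifier(n_neighbors=5)
KNN_classifier.fit(X_prepared, y_train)
```

Five-fold cross-validation was then performed: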
print(cross_val_score(KNN_classifier, X_prepared, y_train, scoring='f1_macro', cv=5))
[0.08563802 0.08084741 0.08077205 0.06490556 0.08670129]
The scores show that this KNN classifier does not work well on the validation folds. The average F1-score there is about 8%, no better than random guessing (1/12 ≈ 8.3% for twelve classes).
Moreover, training the model with different k values does no better: iterating over k values in a for-loop, the scores got worse as k increased.
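A sketch of that sweep; the exact range of k values tried is an assumption:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Mean cross-validated F1 for a range of k; scores drop as k grows
for k in range(1, 31, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_prepared, y_train,
                             scoring='f1_macro', cv=5)
    print(k, scores.mean())
```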

2.2 Random Forest

It is clear that the model overfits the training data, as it achieves a perfect score on it. Re-evaluated with cross-validation, the model works no better than random guessing: the average F1-score is around 8%.
This may be because some hyperparameters were inappropriate. For example, the trees within the classifier were grown too deep, with depths ranging from 44 to 74. The function GridSearchCV was used to tune several hyperparameters: max_depth, max_features, min_samples_split, n_estimators, and bootstrap.
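A minimal sketch of the search; the parameter grids shown are illustrative assumptions, not the exact values tried:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [5, 10, 20, None],
    'max_features': ['sqrt', 'log2'],
    'min_samples_split': [2, 10, 50],
    'bootstrap': [True, False],
}

grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid, scoring='f1_macro', cv=5)
grid_search.fit(X_prepared, y_train)
print(grid_search.best_params_, grid_search.best_score_)
```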
Even though the hyperparameters were varied across the grid, the best score from the parameters tried was still around 8%. The Random Forest model is not good enough for sign prediction.
2.3 Support Vector Machine

Similar to the KNN model, the F1-scores on the training data and under cross-validation were 38% and 8%, respectively.
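A sketch of the equivalent evaluation, assuming scikit-learn's SVC with default settings:

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

svm_classifier = SVC()  # default RBF kernel, assumed here
svm_classifier.fit(X_prepared, y_train)

scores = cross_val_score(svm_classifier, X_prepared, y_train,
                         scoring='f1_macro', cv=5)
print(scores.mean())  # around 0.08
```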
2.4 Multinomial Naive Bayes
For Multinomial Naive Bayes, only the essay columns were used for training and prediction. The preprocessing method is Term Frequency-Inverse Document Frequency (TF-IDF). The dataset for training and testing includes 26,117 samples in total.
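A minimal sketch of this pipeline; how the ten essays were combined into one document per user, and the filtering that yields the final sample, are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Concatenate the ten essay columns into a single text per user
essay_cols = [f'essay{i}' for i in range(10)]
essays = profiles[essay_cols].fillna('').agg(' '.join, axis=1)

# Keep users with a known sign and at least some essay text
mask = profiles['sign_cleaned'].notna() & (essays.str.strip() != '')
X_text, y_sign = essays[mask], profiles.loc[mask, 'sign_cleaned']

X_tr, X_te, y_tr, y_te = train_test_split(X_text, y_sign,
                                          test_size=0.2, random_state=42)

# TF-IDF features feeding a Multinomial Naive Bayes classifier
nb_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb_model.fit(X_tr, y_tr)
print(classification_report(y_te, nb_model.predict(X_te)))
```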
From the classification report, 'Cancer' and 'Gemini' seemed to be predicted reasonably well, but there are some classes the model is very bad at predicting. In particular, under cross-validation the F1-score was worse than that of the other models.
3. Conclusion
None of the four machine learning models can predict users' signs well. On the training data, K-Nearest Neighbors, Random Forest, SVM, and Multinomial Naive Bayes had F1-scores of 33%, 100%, 38%, and 31%, respectively, but those numbers reflect overfitting. When tested on validation data via cross-validation, none of the models do better than random guessing, which is around 8% for twelve classes.
Future Work
We can investigate further whether 'Cancer' and 'Gemini' can really be predicted well. Moreover, since the models depend entirely on the features provided, improving prediction performance means deciding which additional features should be collected. Perhaps consult some astrologers!