Data Manipulation and Feature Engineering with Pandas: The “No For Loops” Guide

Mohamed Gaber
13 min read · Dec 7, 2020

This post is part 3 of the series about building a recommendation system for CareerVillage.com. You can find the outline of the series here.

Quick Summary

CareerVillage is a website similar to Quora and Stack Overflow. It lets students post questions about anything related to their future careers, and volunteers with expertise in various fields answer them. Currently, the website lets professionals follow specific topics (programming, finance, engineering, Python, sports, etc.). When students post a question, they tag it with the related topics, and once the question is posted, an email notification is sent to every professional following one of those topics. The website is looking for a more efficient way to recommend questions. The goal of the challenge is to develop a method that recommends each question to the professionals who are most likely to answer it, and to answer it fast.

Introduction

In this post, I’m going to do all the data manipulation and cleaning needed to get the data into a final form that can be used to build a machine learning model. I will then use the data to train a model that predicts the probability that a question will be answered if it is sent by email to a specific professional.

This challenge is a great chance to practice the power of pandas and NumPy for data manipulation and preparation. The company provided 15 different datasets that are all connected through IDs but contain different information. You can find the datasets here. The data covers the professionals, the students, the emails that were sent to the professionals, the questions contained in those emails, the answers that were provided, and the tags (topics) of the questions. Our main goal is a model that takes as input information about the professional (his interests, response rate, etc.) and information about the question (its topics, date, etc.) and outputs the probability that this individual will answer the question. We can then use that model to decide to whom we should send the questions: given a set of questions, we find the professionals with the highest probability (and potentially the fastest time) to answer them. From that description, we can imagine the data we need to build such a model: exactly two datasets, a professionals_dataset and a questions_dataset. The professionals_dataset should contain the following information for each professional as a row:

  • The professional’s ID.
  • The response rate (the number of questions he answered over the total number of questions sent to him).
  • The average time taken to answer the questions.
  • The total number of questions answered.
  • The tags he is following (tags are topics that people can follow on the website, e.g. “physics”, “coding”, “education”, etc.). These show the interests and expertise of the individual. All tags are saved in a list for each professional (this means we will have a column of lists in that dataset).
  • All the tags of the questions that this individual answered previously. We expect that he might have answered questions with topics that are not indicated in the topics that he follows. We can use this data to measure the similarity of the previously answered questions with the new question.
  • The average score of his answers (people can vote for answers, and each vote raises the answer’s score; we take the average over all the answers provided by one professional).
  • The number of groups the professional is following (besides the topics, the professionals can follow groups that talk about specific topics). We expect that this number might be an indication of how active the professional is.
  • The number of schools the professional is following. Similar to the groups, there are schools that the professional can follow. This can also be an indication of active individuals.
  • The ratio of the number of questions he answered to the total number of emails sent to him. Whenever a new question gets posted, the individuals who follow at least one topic of the question get an email notification. Some individuals are too active and they answer more questions than those sent to them. So this ratio should be an indication of how active people are (the more the better).

The second dataset we need to make is the questions dataset. This should contain this information in each row:

  • The ID of the email (we will track the questions through the emails in which they were sent). The email ID should be unique since each email is sent to one specific professional, but the same question might have been sent to several professionals.
  • The professional ID (the one who received the email).
  • The date at which the email was sent.
  • The ID of the question which was attached to the email.
  • The ID of the answer that the professional added to that question (only if the professional answered the question, NA if not).
  • The date of the answer (NA if no answer).
  • The body of the answer.
  • The time the professional took to answer that question.
  • The tags (topics) of the question.
  • Finally, our target variable: 1 if the question was answered by that professional and 0 otherwise.

Having the two datasets above, we can train a machine learning model to learn the patterns and find what makes professionals answer a question. We can then use that model to choose the professionals who are most likely to answer a given question. So let’s start building those datasets from what we have. The most important thing to keep in mind here is that the dataset is really large and things could get computationally expensive quickly. To avoid that, we will stick to vectorized pandas and NumPy operations only and avoid the use of any for loops.

Data Preparation and Manipulation

First, let’s start by getting the emails that were sent immediately after a question got posted. Looking at the emails dataset:

We see there are daily and weekly newsletters, but we don’t want those. We want only the question emails labeled “email_notification_immediate”.
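Throughout this post, I’ll include small sketches of how each step might look in code. These are illustrative: the file and column names (emails.csv, email_id, frequency_level, and so on) are simplified assumptions, not necessarily the exact names in the Kaggle files. Here is the filter for the immediate-notification emails:

```python
import pandas as pd

# Load the emails table (file and column names are assumed here)
emails = pd.read_csv("emails.csv")

# Keep only the one-question-per-email immediate notifications
immediate_emails = emails[
    emails["frequency_level"] == "email_notification_immediate"
]
```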

Now, we need to find the questions that were sent inside these emails. That information can be found in the “matches” dataset:

To get the questions, we can use the pandas “isin”:
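Continuing from the snippet above (the matches table is assumed to link email IDs to question IDs), isin() gives us a boolean mask without any explicit looping:

```python
matches = pd.read_csv("matches.csv")

# Keep only the matches that belong to the immediate emails
immediate_matches = matches[
    matches["email_id"].isin(immediate_emails["email_id"])
]
```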

Now we can merge these together to get this nice dataset:
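Roughly, a left merge on the email ID attaches the recipient and send date to every (email, question) pair:

```python
# One row per question sent, with the recipient professional attached
emails_questions = immediate_matches.merge(
    immediate_emails, on="email_id", how="left"
)
```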

Now, let’s see if each of these professionals added an answer to the question or not, and get those answers to our dataset.

To do that, we can use the “answers” dataset:

and then match that dataset with the one we have on “professional_id”:
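A sketch of that match. Merging on the question ID as well (column names assumed) ensures we only pick up answers that the recipient gave to the specific question that was emailed to them; a left merge keeps unanswered questions as rows of NaNs:

```python
answers = pd.read_csv("answers.csv")

full_data = emails_questions.merge(
    answers, on=["professional_id", "question_id"], how="left"
)
```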

Now we have that big dataset:

Note that some professionals provided more than one answer to the same question, so the email ID will be repeated once per answer.

Now that we have this dataset, we can make our target variable: 1 if the question was answered and 0 if it wasn’t:
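Since the merge above left NaNs wherever no answer exists, the target falls out of a single vectorized check (answer_id is the assumed name of the answer’s ID column):

```python
# 1 if an answer row was matched to this email, 0 otherwise
full_data["answered"] = full_data["answer_id"].notna().astype(int)
```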

Now, since one of the features that we need is the average time taken by each professional, let’s process the time to get that variable. First, let’s turn the email sent date into a DateTime value instead of a string:

Ok, now the date of the sent email looks good:

Let’s do the same for the date at which the answer was added:

Now, we can subtract these two to get the time taken by each professional to answer the question:
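The three steps together, assuming the date columns are named email_date and answer_date:

```python
full_data["email_date"] = pd.to_datetime(full_data["email_date"])
full_data["answer_date"] = pd.to_datetime(full_data["answer_date"])

# Subtracting two datetime columns yields a timedelta column
full_data["answer_time"] = (
    full_data["answer_date"] - full_data["email_date"]
)
```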

If we now examine the time taken, we notice something very interesting:

That’s a negative time! This means that these people answered the question BEFORE the email notification was sent to them. Let’s take a look at one of them:

The person replied to the question 6 minutes before the email was sent to him. This means one of three things: this person is really active, the email was sent with a lag, or whoever collected this data mislogged it for these few individuals. In any case, it’s better to replace these times with 0, which indicates that these people were so active that they replied to the question right away (I expect this will push them forward in the model’s predictions).
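Clipping at zero handles all of these rows at once, something like:

```python
# Negative times become zero: "answered immediately"
full_data["answer_time"] = full_data["answer_time"].clip(
    lower=pd.Timedelta(0)
)
```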

Now, the other variable we need is the tags (topics) of each question. The tags dataset gives the id and the name of the tags:

We can match that to the tag id of each question to get the tag names:
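A sketch of that lookup. Since a question can carry several tags, it is convenient to aggregate them into one list per question before merging (tag_questions is the assumed name of the question-to-tag link table):

```python
tags = pd.read_csv("tags.csv")
tag_questions = pd.read_csv("tag_questions.csv")

# Map tag ids to names, then collect one list of tag names per question
question_tags = (
    tag_questions.merge(tags, on="tag_id", how="left")
    .groupby("question_id")["tag_name"]
    .agg(list)
    .reset_index()
    .rename(columns={"tag_name": "question_tags"})
)

full_data = full_data.merge(question_tags, on="question_id", how="left")
```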

And finally, we have the dataset that contains the full information about all the questions

full_data.columns

Now, we can get to the second and final dataset we need: the professionals' dataset. Let’s start with the ids given to us:
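Something like this, keeping just the ID column as the backbone of the new dataset (professionals.csv and its column name are assumptions):

```python
professionals = pd.read_csv("professionals.csv")
professionals_dataset = professionals[["professional_id"]].copy()
```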

Now, let’s get the response rate column. This is the fraction of emailed questions that the professional answered after the email was sent to him:
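Since the target is 0/1 per emailed question, the per-professional mean of that column is exactly the response rate, and Series.map() aligns the grouped result back onto the professionals’ rows:

```python
response_rate = full_data.groupby("professional_id")["answered"].mean()

professionals_dataset["response_rate"] = professionals_dataset[
    "professional_id"
].map(response_rate)
```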

Now, we will get the average time that each professional took to answer the question. We get that by grouping the dataset with respect to the professional ID and then taking the average of the grouped “answer_times”:
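For example (pandas averages timedelta columns directly, skipping the NaT rows of unanswered questions):

```python
avg_time = full_data.groupby("professional_id")["answer_time"].mean()

professionals_dataset["avg_answer_time"] = professionals_dataset[
    "professional_id"
].map(avg_time)
```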

Now, let’s get the total number of questions that each professional answered:
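The same pattern with a sum instead of a mean:

```python
total_answered = full_data.groupby("professional_id")["answered"].sum()

professionals_dataset["total_answered"] = professionals_dataset[
    "professional_id"
].map(total_answered)
```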

We can now get all the tags that each professional is following. We will collect them into one list per professional, making a column of lists:
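A sketch, assuming a tag_users link table between professionals and the tags they follow:

```python
tag_users = pd.read_csv("tag_users.csv")

# One list of followed tag names per professional
followed_tags = (
    tag_users.merge(tags, on="tag_id", how="left")
    .groupby("professional_id")["tag_name"]
    .agg(list)
)

professionals_dataset["followed_tags"] = professionals_dataset[
    "professional_id"
].map(followed_tags)
```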

Similarly, we get the tags of all the questions that each professional answered before:
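Here explode() turns each question’s tag list into one row per tag, so we can pool the unique tags over all the questions a professional answered:

```python
prev_tags = (
    full_data[full_data["answered"] == 1]
    .explode("question_tags")
    .groupby("professional_id")["question_tags"]
    .agg(lambda s: list(s.dropna().unique()))
)

professionals_dataset["prev_question_tags"] = professionals_dataset[
    "professional_id"
].map(prev_tags)
```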

As we mentioned earlier, when a professional posts an answer, it gets a score from votes. We suspect that the scores they receive might affect their probability of answering new questions (getting high scores is encouraging). So we will also compute that feature:
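Assuming a separate answer_scores table keyed by the answer ID:

```python
answer_scores = pd.read_csv("answer_scores.csv")

avg_score = (
    full_data.merge(answer_scores, on="answer_id", how="left")
    .groupby("professional_id")["score"]
    .mean()
)

professionals_dataset["avg_answer_score"] = professionals_dataset[
    "professional_id"
].map(avg_score)
```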

We can also get the number of groups and schools that each professional is following as we explained above:
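Counting rows per professional in the two membership tables (table and column names assumed):

```python
groups = pd.read_csv("group_memberships.csv")
schools = pd.read_csv("school_memberships.csv")

professionals_dataset["n_groups"] = professionals_dataset[
    "professional_id"
].map(groups.groupby("professional_id").size())

professionals_dataset["n_schools"] = professionals_dataset[
    "professional_id"
].map(schools.groupby("professional_id").size())
```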

Finally, we get the answers-to-emails ratio:
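Dividing the total answers a professional ever posted (from the full answers table, so unsolicited answers count too) by the number of emails they received:

```python
n_answers = answers.groupby("professional_id").size()
n_emails = immediate_emails.groupby("professional_id").size()

professionals_dataset["answers_per_email"] = (
    professionals_dataset["professional_id"].map(n_answers)
    / professionals_dataset["professional_id"].map(n_emails)
)
```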

Now we are finally done with data preparation. We have the two datasets we need to build the model:

full_data.columns
professionals_dataset.columns

Machine Learning Model

Now, we are ready to move on to building the machine learning model that can find the patterns in that data and help us choose the best professionals to send questions to. The first step is to merge the two datasets above so that each row has information about both the professional and the question that was sent to him. We need metrics that relate the question to the professional. We thought of four interesting metrics to use:

  • Intersection of topics: We calculate the intersection between the topics that the professional follows and the topics of the question. E.g. if the professional follows [“education”, “jobs”, “physics”] and the question’s topics are [“college”, “work”], then the intersection here is 2, since “education” is relevant to “college” and “jobs” is relevant to “work”. So the algorithm basically counts the number of relevant combinations across the two lists. To learn more about how this works, read the NLP part of this series here. We expect that higher intersection values make it more likely for the professional to answer the question.
  • Intersection over Union: This is the same intersection from above but it’s divided over the total number of unique topics in the two lists. This should give a score between 0 and 1. 1 would be the case when the question’s topics and the professional’s topics are the same. We also expect the higher this value the more likely the professional to answer the question.
  • Intersection over Interest: Here we divide the intersection by the number of topics that the professional is following. This number is also between 0 and 1, but it reflects how many of the topics the professional follows appear in the question being asked; the more of them, the more likely he is to answer. This differs from the metric above because the question might contain extra topics that he does not follow (while still including all of those that he does).
  • Previous Intersection: this is the intersection between the question’s topic and the topics of the questions that the professional answered before.

Now, let’s get the new dataset that contains these metrics:
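Below is a simplified sketch of these metrics. Two caveats: the real intersection counts semantically related tags (see the NLP post), while this version counts exact matches only; and because the tag columns hold Python lists, apply(axis=1) is hard to avoid here, making this the one step where the “no for loops” rule bends:

```python
def as_set(x):
    # List columns hold NaN for professionals/questions with no tags
    return set(x) if isinstance(x, list) else set()

def topic_metrics(row):
    prof = as_set(row["followed_tags"])
    prev = as_set(row["prev_question_tags"])
    quest = as_set(row["question_tags"])
    inter = len(prof & quest)
    union = len(prof | quest)
    return pd.Series({
        "intersection": inter,
        "intersection_over_union": inter / union if union else 0.0,
        "intersection_over_interest": inter / len(prof) if prof else 0.0,
        "previous_intersection": len(prev & quest),
    })

merged = full_data.merge(
    professionals_dataset, on="professional_id", how="left"
)
model_data = pd.concat(
    [merged, merged.apply(topic_metrics, axis=1)], axis=1
)
```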

Some questions don’t have topics tagged at all. We label those as “Na” since it might be the case that some professionals like to answer the untagged questions:
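For instance:

```python
# Untagged questions get the single placeholder tag "Na"
model_data["question_tags"] = model_data["question_tags"].apply(
    lambda t: t if isinstance(t, list) else ["Na"]
)
```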

Now, we can remove the unneeded features like the IDs to make the data suitable for the model:
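Something like the following; the exact list depends on your pipeline, and the per-row answer columns must go too, since they would leak the target:

```python
model_data = model_data.drop(columns=[
    "email_id", "professional_id", "question_id", "answer_id",
    "answer_body", "email_date", "answer_date", "answer_time",
    "question_tags", "followed_tags", "prev_question_tags",
])
```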

We still have three columns with a few NAs. It makes sense to fill the missing number of schools and groups with 0 since it just means the person is not following any of them:

And we can take a look at the average time taken:

So it seems that we have a few outliers, since some people take a much longer time to answer questions. We are therefore better off filling the NAs with the median:
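A sketch of both fills; the timedelta feature is converted to seconds first so it works as a numeric model input:

```python
model_data["avg_answer_time"] = model_data[
    "avg_answer_time"
].dt.total_seconds()

# Skewed distribution: the median is more robust than the mean
model_data["avg_answer_time"] = model_data["avg_answer_time"].fillna(
    model_data["avg_answer_time"].median()
)

# Missing membership counts simply mean "follows none"
model_data[["n_groups", "n_schools"]] = model_data[
    ["n_groups", "n_schools"]
].fillna(0)
```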

Looking at the target variable:

We see a huge imbalance in the dataset. Around 300,000 questions were not replied to after the email was sent and only 5000 questions were replied to. To fix that imbalance, we can do undersampling of the data:
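A simple way to undersample is to downsample the majority class to the size of the minority class:

```python
answered = model_data[model_data["answered"] == 1]
unanswered = model_data[model_data["answered"] == 0].sample(
    n=len(answered), random_state=42
)

# Concatenate and shuffle
balanced = pd.concat([answered, unanswered]).sample(
    frac=1, random_state=42
)
```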

Now it looks much better. We are finally ready to build the machine learning model. Let’s start with simple logistic regression and interpret the results:
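A minimal version with scikit-learn; the original hyperparameters are not shown, so take these as defaults:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Guard against any leftover NAs in the feature columns
X = balanced.drop(columns=["answered"]).fillna(0)
y = balanced["answered"]

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(x_train, y_train)
print(log_reg.score(x_test, y_test))  # ~0.85 reported in the post
```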

Model Interpretation

The logistic regression obtained a score of 85%. This is actually very good for this dataset. To understand how the model reached that performance, let’s interpret the model with SHAP:

shap.summary_plot(shap_values, x_test)
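For completeness, here is one way the SHAP values behind that call could be computed; a LinearExplainer matches the logistic regression used above (the post does not show this setup, so treat it as an assumption):

```python
import shap

explainer = shap.LinearExplainer(log_reg, x_train)
shap_values = explainer.shap_values(x_test)
shap.summary_plot(shap_values, x_test)
```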

These are very interesting results. The response rate has the highest effect on the probability that a professional answers the question. This was expected, since a higher response rate means a more active individual. The second most significant feature is the total number of emails that were sent to the professional. It seems that the more questions sent, the less likely the professional was to answer! This is surprising. I think what happens is that too many emails lead people to treat them as spam and stop reading them. Now we know it was a good idea for the company to improve its recommendation system rather than spamming people.

The individual explanation plot above looks closely at one of the professionals. An email count of 66 is low (some people received more than 2000 emails), so it pushes the probability of him answering the question up. The intersection over union and the previous-questions intersection are also high, which pushes the probability up. The response rate, however, is low, which is why the overall probability is low.

Now that we understand what happens behind the model, let’s try to improve the score using a more advanced model: a sequential neural network.
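The post doesn’t show the exact architecture, so here is one plausible small network of the kind described, as a Keras sketch:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu",
                       input_shape=(x_train.shape[1],)),
    keras.layers.Dense(32, activation="relu"),
    # Sigmoid output: probability that the question gets answered
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=20, batch_size=256,
          validation_data=(x_test, y_test))
```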

The neural network was able to achieve a score of 88%, a 3-percentage-point improvement over the logistic regression. We think this score is sufficient for this specific task, since we just need the model to tell us who to send the emails to.

Now, whenever a new question is posted, a list of professionals who follow at least one of its topics will be generated. The model will then estimate the probability that each professional would answer if we sent them an email. Those with a probability higher than a given threshold will be chosen to receive the emails. We perform a simulation to do exactly that and measure the performance of the model. You can read it here.

Summary

In this post, I did the data preparation for the CareerVillage challenge. I conducted all the needed data analysis and cleaning, and did feature engineering to extract informative features such as each professional’s response rate and average time to answer. I used these features to train two models: a logistic regression model and a neural network. The neural network achieved an accuracy of 88%, outperforming the logistic regression by 3 percentage points. In the next post, we will conduct a more informative assessment of the model by implementing an event-driven simulation that randomly generates questions and tests how well the model finds the best people to answer them. Read the post here.

