2 Replies Latest reply on May 24, 2018 11:16 PM by Intel Corporation

    Stackoverflow implicit feedback recommendation system


      I am trying to build a user-item recommendation engine based on the questions that Stackoverflow users have marked as favourites.


      The objective:

      To build a webpage / IDE plugin where the user receives his top N recommended questions based on:

           - his previous favourite votes on Stackoverflow

           - the programming language he is currently using (this will be a filter using the question tag, ex. only #java questions)
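
      A minimal sketch of how the tag filter could sit on top of any scoring model, assuming the model yields one score per question. The function name `top_n_for_tag`, its parameters and the toy data are hypothetical, chosen only for illustration:

```python
from typing import Dict, List, Set


def top_n_for_tag(scores: Dict[int, float],
                  question_tags: Dict[int, Set[str]],
                  already_favourited: Set[int],
                  tag: str,
                  n: int) -> List[int]:
    """Return the n best-scoring question ids that carry `tag`."""
    candidates = [
        (score, qid)
        for qid, score in scores.items()
        # Keep only questions with the requested tag that the user has not favourited yet.
        if tag in question_tags.get(qid, set()) and qid not in already_favourited
    ]
    # Highest score first; ties broken by the smaller question id for determinism.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [qid for _, qid in candidates[:n]]


# Toy data, made up for illustration only.
scores = {101: 0.9, 102: 0.8, 103: 0.7, 104: 0.6}
question_tags = {101: {"python"}, 102: {"java"}, 103: {"java", "spring"}, 104: {"java"}}

recommended = top_n_for_tag(scores, question_tags,
                            already_favourited={102}, tag="java", n=2)
print(recommended)
```

      Filtering after scoring keeps the model itself independent of the language the user is currently working in; whether that is preferable to training one model per tag is an open design choice.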


      The input data:

      I am using the Stackexchange data dump which can be found here: stackexchange directory listing; from there I've extracted the data that I thought would be useful:


           Votes table (each User - Question pair represents a favourite vote for the question from the user):

           UserId - QuestionId


           Tags table:

           QuestionId - TagId
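
      For illustration, the two pair tables could be streamed into in-memory lookups roughly as below. This is only a sketch: it assumes each table has been exported to CSV with a header row using the column names above, and the helper `load_pairs` and the inline sample rows are made up for the example:

```python
import csv
import io
from collections import defaultdict


def load_pairs(lines, key_col, value_col):
    """Stream (key, value) rows one at a time and group the values by key."""
    grouped = defaultdict(set)
    for row in csv.DictReader(lines):
        grouped[int(row[key_col])].add(int(row[value_col]))
    return dict(grouped)


# Stand-ins for the real files; with real data, pass open("votes.csv", newline="") instead
# (the file name is an assumption).
votes_csv = io.StringIO("UserId,QuestionId\n1,10\n1,11\n2,10\n3,12\n")
tags_csv = io.StringIO("QuestionId,TagId\n10,7\n11,7\n11,8\n12,9\n")

user_favourites = load_pairs(votes_csv, "UserId", "QuestionId")
question_tags = load_pairs(tags_csv, "QuestionId", "TagId")

# Inverted index (question -> users who favourited it), useful for item-item methods.
question_fans = defaultdict(set)
for user, questions in user_favourites.items():
    for question in questions:
        question_fans[question].add(user)

print(user_favourites)
print(dict(question_fans))
```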


      I also have a lot of details about each user/question which would make sense in a content-based approach. The only content I have used so far is the question tags.


      Problems/Properties of the data:

      - the data consists of implicit feedback -> a user either marked a question as favourite or he didn't (binary problem 0/1)

      - the data set is quite large, so training and evaluating a model takes a lot of time (the votes CSV file is a few GB)


      Progress so far:

      So far I've tried a few different approaches, most of them are some sort of collaborative filtering:


      - the first thing I tried was using cosine similarity to get top N question-question recommendations, just to test whether the results are better than random

      - then I tried Spark's Alternating Least Squares (ALS) matrix factorisation model, but the results were also mediocre, presumably because my data is implicit feedback and the ALS formulation I used is built for explicit ratings

      - I've also tried another MF model with a Bayesian Personalised Ranking (BPR) loss function, which is better suited to implicit data. The library I used here is LightFM and the evaluation metric is ROC AUC: https://www.kaggle.com/iancuv/lightfm-demo?scriptVersionId=3670161
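
      To make the first baseline and the evaluation metric concrete, here is a small pure-Python sketch, not the code from the notebook. With binary favourites, the cosine similarity of two questions reduces to the overlap of their sets of fans, and a per-user ROC AUC is the fraction of (held-out, other) question pairs that the scores rank in the right order. All function names and the toy data are assumptions made for illustration:

```python
import math


def item_cosine(fans_a, fans_b):
    """Cosine similarity of two binary item vectors, each given as a set of user ids."""
    if not fans_a or not fans_b:
        return 0.0
    return len(fans_a & fans_b) / math.sqrt(len(fans_a) * len(fans_b))


def score_candidates(user_history, question_fans):
    """Score each unseen question by its summed similarity to the user's favourites."""
    scores = {}
    for candidate, fans in question_fans.items():
        if candidate in user_history:
            continue
        scores[candidate] = sum(
            item_cosine(fans, question_fans.get(q, set())) for q in user_history
        )
    return scores


def user_auc(scores, held_out):
    """Share of (held-out, other) pairs ranked correctly; ties count as half."""
    positives = [s for q, s in scores.items() if q in held_out]
    negatives = [s for q, s in scores.items() if q not in held_out]
    if not positives or not negatives:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy training data: question id -> users who favourited it (made up).
question_fans = {10: {1, 2, 3}, 11: {2}, 12: {3, 4}, 13: {4}}

# User 1 favourited question 10 in training; question 11 is held out for testing.
scores = score_candidates({10}, question_fans)
ranking = sorted(scores, key=lambda q: -scores[q])
auc = user_auc(scores, held_out={11})
print(ranking, auc)
```

      Averaging `user_auc` over all test users gives a single number that can be compared against 0.5, the value expected from random scores.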



      Open questions / suggestions:

      Do you have any suggestions for other approaches I should try?

      How would you approach this problem?

      What preprocessing of the data makes sense to achieve better results?

      Are any of the techniques mentioned above a good choice for this problem?

      Would a purely content-based approach make sense?

      If yes, how can I improve the results?



      I should also mention (you probably figured it out) that I'm a CS student, new to the AI/machine learning field. The only applications I've built in the past involved either simple regression or classification, nothing as complicated as implicit feedback recommendation systems. I know the problem/questions I've mentioned above are very specific, but any help is very much appreciated.



      Useful links:

      Welcome to LightFM’s documentation! — LightFM 1.14 documentation

      Welcome to Spark Python API Docs! — PySpark master documentation

      Alternating Least Squares – Data Science Made Simpler

      https://arxiv.org/pdf/1205.2618.pdf - Bayesian Personalised Ranking MF