Problem Statement

As a data scientist for the marketing department at Reddit, I need to find the most predictive keywords and/or phrases to accurately classify the dating advice and relationship advice subreddit pages, so we can use them to determine which advertisements should populate each page. Because this is a classification problem, we'll use Logistic Regression and Bayes models. Misclassifications in this case will be fairly harmless, so I will use the accuracy score and a baseline of 63.3% to rate success. Using TfidfVectorizer, I'll get the feature importance to determine which words have the highest predictive power for the target variable. If successful, this model could be used to target other pages that have a similar frequency of the same words and phrases.

Data Collection

See the relationship-advice-scrape and dating-advice-scrape notebooks for this part.

After turning all of the scrapes into DataFrames, I saved them as CSVs, which can be found in the dataset folder of this repo.

Data Cleaning and EDA

  • dropped rows with a null selftext column because those rows are useless to me.
  • combined the title and selftext columns into one new column, all_text
  • examined distributions of word counts for the title and selftext columns per post and compared the two subreddit pages.
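The cleaning steps above can be sketched with pandas roughly as follows. The rows, the subreddit column, and the two word-count column names are hypothetical; only title, selftext, and all_text come from the description.

```python
import pandas as pd

# Hypothetical rows; the real data lives in the CSVs in the dataset folder.
df = pd.DataFrame({
    "title": ["First date nerves", "Link only post", "Moving in together"],
    "selftext": ["Any tips for a first date?", None,
                 "We have been dating for two years."],
    "subreddit": ["dating_advice", "dating_advice", "relationship_advice"],
})

# Drop rows whose selftext is null.
df = df.dropna(subset=["selftext"]).reset_index(drop=True)

# Combine title and selftext into one all_text column.
df["all_text"] = df["title"] + " " + df["selftext"]

# Word counts per post for each column, then compare the two subreddits.
df["title_word_count"] = df["title"].str.split().str.len()
df["selftext_word_count"] = df["selftext"].str.split().str.len()
summary = df.groupby("subreddit")[
    ["title_word_count", "selftext_word_count"]
].mean()
print(summary)
```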

Preprocessing and Modeling

Found the baseline accuracy score of 0.633, which means that if I always pick the value that occurs most often, I will be right 63.3% of the time.
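As a small illustration of that arithmetic, the baseline is just the share of the majority class. The class counts below are invented to land near 0.633; the real post counts, and which subreddit is the majority, are not stated here.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Hypothetical class balance chosen only to illustrate the calculation.
labels = ["relationship_advice"] * 19 + ["dating_advice"] * 11
print(round(baseline_accuracy(labels), 3))
```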

First attempt: logistic regression model with default CountVectorizer parameters. Train score: 99 | test: 75 | cross val: 74

Second attempt: tried CountVectorizer with stemmer preprocessing on the first set of scrapes; pretty bad score with high variance. Train: 99% | test: 72%

  • tried to decrease max features and the score got even worse
  • tried lemmatizer preprocessing instead and the test score went up to 74%

Simply increasing the data and stratifying y in my train/test split increased my cvec test score to 81 and cross val to 80. Adding 2 parameters to my CountVectorizer helped a lot: a min_df of 3 and an ngram_range of (1,2) increased my test score to 83.2 and cross val to 82.3. However, these scores disappeared.
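A rough sketch of that setup, assuming scikit-learn: a stratified train/test split followed by a CountVectorizer with min_df=3 and ngram_range=(1, 2) feeding a logistic regression. The toy corpus, test_size, random_state, and cv=3 are assumptions for illustration, so the scores it prints say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the scraped posts (hypothetical data).
dating = ["first date ideas", "first date outfit", "first date nerves",
          "asking for a first date", "first date went well",
          "bad first date"] * 2
relationship = ["my boyfriend lied", "my boyfriend moved out",
                "my boyfriend and his mom", "talk to my boyfriend",
                "my boyfriend forgot", "trust my boyfriend"] * 2
X = dating + relationship
y = [0] * len(dating) + [1] * len(relationship)

# stratify=y keeps the class balance the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

pipe = make_pipeline(
    # Keep unigrams and bigrams that appear in at least 3 training posts.
    CountVectorizer(min_df=3, ngram_range=(1, 2)),
    LogisticRegression(),
)
pipe.fit(X_train, y_train)

train_score = pipe.score(X_train, y_train)
test_score = pipe.score(X_test, y_test)
cv_score = cross_val_score(pipe, X_train, y_train, cv=3).mean()
print(train_score, test_score, cv_score)
```

Comparing the train score against the test and cross-validation scores is what exposes the variance problem described above: a large gap suggests overfitting.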

I think Tfidf worked best to reduce my overfitting (a variance issue) because:

  • I customized the stop words to take out only the ones that were really too frequent to be predictive. This was a success; however, with more time I probably could have tweaked them a bit more to increase all scores.
  • Looking at both single words and words in groups of two (bigrams) was the best param that gridsearch suggested; however, all of my top most predictive words ended up being unigrams.
  • My initial list of features had plenty of gibberish words and typos. Setting the minimum number of times a word was required to show up to 2 helped get rid of these.
  • Gridsearch also suggested a 90% max df rate, which helped to get rid of oversaturated words as well.
  • Finally, setting max features to 5000 cut my columns down to about a quarter of what they were, to focus only on the most commonly used words of what was left.
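A hedged sketch of how such a search could look with scikit-learn's GridSearchCV over a TfidfVectorizer and logistic regression pipeline. The grid values echo the ones mentioned above (min_df of 2, max_df of 0.9, max_features of 5000, unigrams vs. unigrams plus bigrams), but the actual grid, the custom stop-word list, and the corpus are not given, so everything below is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical custom stop words: terms common to both subreddits.
custom_stop_words = ["just", "like", "really"]

# Toy corpus standing in for the scraped posts (hypothetical data).
dating = ["just need first date ideas", "first date outfit help",
          "really nervous before first date", "first date went well",
          "like her after first date", "bad first date story"] * 2
relationship = ["my boyfriend just lied", "my boyfriend moved out",
                "really miss my boyfriend", "talk to my boyfriend",
                "my boyfriend forgot again", "like my boyfriend less"] * 2
X = dating + relationship
y = [0] * len(dating) + [1] * len(relationship)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words=custom_stop_words)),
    ("logreg", LogisticRegression()),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. uni + bigrams
    "tfidf__min_df": [2, 3],                 # drop very rare terms and typos
    "tfidf__max_df": [0.9, 1.0],             # drop oversaturated terms
    "tfidf__max_features": [5000],           # cap the number of columns
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
vocabulary = search.best_estimator_.named_steps["tfidf"].vocabulary_
```

On a corpus this small every setting ties, so the chosen parameters are not meaningful; the point is only the shape of the pipeline and grid.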

Conclusion and Recommendations

Even though I would like to have higher train and test scores, I was able to successfully lower the variance, and there are definitely a few words that have high predictive power, so I think the model is ready to launch a test. If advertising engagement increases, the same keywords could be used to find other potentially lucrative pages. I found it interesting that taking out the overly used words helped with overfitting but brought the accuracy score down. I think there is probably still room to play around with the parameters of the Tfidf Vectorizer to see if different stop words produce a different result.


Used Reddit’s API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice and Relationship Advice, and trained a binary classification model to predict which subreddit a given post originated from.
