hammadr.com

Introduction

People search for things every day, whether it's recipes or debugging code. Search engines drive today's world, but they often fall short when finding good movies. EZ-Movies aims to create a search engine specializing in finding the optimal movie based on user queries. We will compare the performance of two models on different queries and use the best model for our movie recommendation system. Our system also includes filters for genre, release year, and rating to optimize the user experience.

Description

EZ-Movies is a user-friendly way to find movies relevant to a user's query. The user interface features a search bar where users describe the kind of movie they want to watch. It returns a list of movies most relevant to the user's query, including important information such as the movie's description, rating, genre, and more. The project is divided into three main parts:

Data Collection and Pre-Processing

We used an IMDb dataset from Kaggle that provides information about thousands of movies, including title, genre, context, release year, actors, directors, and more. We pre-processed the dataset by removing stop-words, punctuation, numbers, and other irrelevant information. We also performed stemming to group similar words together.
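
The cleaning steps described above can be sketched in a few lines of plain Python. This is only an illustration: the stop-word list is a tiny hypothetical one, `crude_stem` is a naive suffix stripper standing in for a real stemmer such as Porter, and the function names are not taken from the project.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "his", "her", "with"}

# Crude suffix stripper standing in for a real stemmer (e.g. Porter).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation and digits, remove stop-words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("In 1995, a detective hunts the killers haunting his city."))
```

Applying the same function to both the movie descriptions and the incoming query keeps the two vocabularies aligned.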

Comparing Two Models

We compared the results of two models:

A vector-space model using Word2Vec.
A Probabilistic Text Retrieval model utilizing Jelinek-Mercer smoothing.

The better model was selected for our front-end system.

The Interactive System

This system lets users enter queries on an interactive site and returns the movies most relevant to each query, ranked by the better of the two models. Users can also filter search results by genre, release year, and rating.

Evaluation

Comparing Two Models

Our text retrieval pipeline featured two searching capabilities: Word2Vec and Probabilistic Text Retrieval.

For the Word2Vec model, we cleaned the movie descriptions in the IMDb dataset, trained Word2Vec embeddings on them, and compared every embedded query word with every embedded word in each description. The 10 highest-scoring movies were returned along with their relevance scores.

from gensim.models import Word2Vec
 
# Training Word2Vec model
cleaned_descr = [...]  # preprocessed and cleaned descriptions
w2v_model = Word2Vec(cleaned_descr, vector_size=300, window=2, sg=1, min_count=1)
 
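# Example query; keep only the words the model has vectors for
query = "scary murder mystery"
query_words = [w for w in query.split() if w in w2v_model.wv]
scores = []

for descr in cleaned_descr:
    # Sum the similarity of every (query word, description word) pair,
    # then normalize by the description length
    pairs = [(q, w) for q in query_words for w in descr if w in w2v_model.wv]
    score = sum(w2v_model.wv.similarity(q, w) for q, w in pairs) / max(len(descr), 1)
    scores.append(score)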
 
top_10_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:10]
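
To make the scoring step concrete: gensim's `wv.similarity` returns the cosine similarity between two word vectors. The sketch below reproduces that measure and the per-description averaging in plain Python. The three-dimensional `toy_vectors` are invented for illustration (the model above uses 300 dimensions), and `description_score` is a hypothetical helper name.

```python
import math

# Hypothetical 3-dimensional vectors; real Word2Vec vectors here would have 300 dimensions.
toy_vectors = {
    "scary": [0.9, 0.1, 0.0],
    "horror": [0.8, 0.2, 0.1],
    "wedding": [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def description_score(query_words, descr_words, vectors):
    """Sum pairwise similarities, divided by the description length."""
    total = sum(
        cosine(vectors[q], vectors[w])
        for q in query_words if q in vectors
        for w in descr_words if w in vectors
    )
    return total / len(descr_words) if descr_words else 0.0
```

With these toy vectors, a description containing "horror" scores higher for the query word "scary" than one containing "wedding", which is the behavior the ranking relies on.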

For the Probabilistic Text Retrieval model, we built an inverted index using Hadoop and applied Jelinek-Mercer smoothing with a lambda value of 0.4 to return the top 10 movies based on different queries.
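
For reference, Jelinek-Mercer smoothing in its usual textbook form interpolates a document's maximum-likelihood estimate with a collection-wide language model. The notation here is an assumption rather than the project's own: c(w, d) is the count of word w in document d, |d| is the document length, p(w | C) is the collection language model, and lambda weights the collection model, which is consistent with the (1 - lambda) / lambda factor in the scoring code.

```latex
p_\lambda(w \mid d) = (1-\lambda)\,\frac{c(w,d)}{|d|} + \lambda\, p(w \mid C)

\mathrm{score}(q,d) = \sum_{w \in q \cap d} c(w,q)\,
  \log\!\left(1 + \frac{1-\lambda}{\lambda}\cdot\frac{c(w,d)}{|d|\,p(w \mid C)}\right)
```

With lambda = 0.4 the factor (1 - lambda) / lambda equals 1.5, and the second expression ranks documents in the same order as the smoothed query likelihood.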

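import numpy as np

# Calculate document scores using Jelinek-Mercer smoothing
lambda_val = 0.4
query_vec = query_vectors[0]  # one query at a time; assumed to hold term counts aligned with vocab_lm
scores = []

for doc_vec, doc_len in zip(document_vectors, document_lens):
    # Weight each term by its query count so that only query terms contribute
    score = sum(q_count * np.log(1 + ((1 - lambda_val) / lambda_val) * (doc_count / (doc_len * vocab_prob)))
                for q_count, doc_count, vocab_prob in zip(query_vec, doc_vec, vocab_lm)
                if q_count > 0 and vocab_prob > 0)
    scores.append(score)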
 
top_10_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:10]
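
The snippet above takes `document_vectors`, `document_lens`, and `vocab_lm` as given. The project built its index with Hadoop; the single-process sketch below is only an illustration of how such statistics might be derived from tokenized descriptions and then used for scoring. `build_statistics`, `jm_score`, and the two toy documents are hypothetical, and every document is assumed to be non-empty.

```python
import math
from collections import Counter


def build_statistics(docs):
    """Return (vocab, document_vectors, document_lens, vocab_lm) for tokenized docs."""
    vocab = sorted({word for doc in docs for word in doc})
    document_vectors = []
    for doc in docs:
        counts = Counter(doc)
        document_vectors.append([counts[word] for word in vocab])
    document_lens = [len(doc) for doc in docs]
    total = sum(document_lens)
    # Collection language model: each word's share of all tokens.
    vocab_lm = [sum(vec[i] for vec in document_vectors) / total for i in range(len(vocab))]
    return vocab, document_vectors, document_lens, vocab_lm


def jm_score(query, vocab, doc_vec, doc_len, vocab_lm, lambda_val=0.4):
    """Jelinek-Mercer score of one document; only query terms contribute."""
    query_counts = Counter(query)
    ratio = (1 - lambda_val) / lambda_val
    return sum(
        query_counts[word] * math.log(1 + ratio * doc_count / (doc_len * prob))
        for word, doc_count, prob in zip(vocab, doc_vec, vocab_lm)
        if query_counts[word] > 0
    )


# Hypothetical, already-preprocessed descriptions.
docs = [["murder", "mystery", "detective"], ["wedding", "comedy", "family"]]
vocab, document_vectors, document_lens, vocab_lm = build_statistics(docs)
scores = [
    jm_score(["murder", "mystery"], vocab, vec, length, vocab_lm)
    for vec, length in zip(document_vectors, document_lens)
]
```

Here the first document contains both query terms and receives a positive score, while the second contains neither and scores zero.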

The Interactive System

After implementing both models, we compared the results. Our interface allows users to run queries and see results. Filters for genre, release year, and rating help refine search results.
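
The filtering step can be pictured as a pass over the ranked results that keeps only the movies matching whatever filters the user set. The sketch below is an assumption about how this might look, not the site's actual code: the field names, the `apply_filters` helper, and the placeholder records are all invented for illustration.

```python
def apply_filters(movies, genre=None, min_year=None, max_year=None, min_rating=None):
    """Keep movies that satisfy every filter the user actually set."""
    kept = []
    for movie in movies:
        if genre is not None and movie["genre"].lower() != genre.lower():
            continue
        if min_year is not None and movie["year"] < min_year:
            continue
        if max_year is not None and movie["year"] > max_year:
            continue
        if min_rating is not None and movie["rating"] < min_rating:
            continue
        kept.append(movie)
    return kept


# Made-up placeholder records, listed in the order the ranker might return them.
ranked = [
    {"title": "Example A", "genre": "Horror", "year": 1999, "rating": 7.1},
    {"title": "Example B", "genre": "Comedy", "year": 2010, "rating": 6.4},
    {"title": "Example C", "genre": "Horror", "year": 2015, "rating": 8.0},
]
filtered = apply_filters(ranked, genre="horror", min_year=2000)
```

Because the function preserves input order, filtering after ranking leaves the relevance ordering of the surviving movies unchanged.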

Discussion

Interpreting the Results

We compared a Word2Vec model with a Probabilistic Text Retrieval model. The Probabilistic Text Retrieval model yielded better results: it builds a language model over all of the text data and scores documents directly against the query, whereas the Word2Vec model was limited to words for which a vector representation was available.

Limitations and Future Improvements

Some top results didn't match our initial expectations because every word in the query is treated independently, so the meaning of the query as a whole is lost. Future work could explore more advanced neural network models that interpret the query as a whole.