Amazon Reviews Analysis: Parsing, Sentiment Analysis, and Clustering

Project Overview

This project is a data science and natural language processing (NLP) pipeline that extracts, processes, and analyzes Amazon reviews for specific products. The aim is to provide insights into customer sentiments, product strengths, and weaknesses, and make product comparisons more straightforward. This system uses a combination of web scraping, NLP techniques, and machine learning models to parse, cluster, and evaluate review sentiments with detailed statistical analysis.

Technical Workflow

The analysis pipeline for Amazon reviews involves several key stages:

  1. Data Collection: Using Selenium to automate browser actions, the system scrapes review data for specified products. Each review includes metadata such as review ID, rating, date, and review text.
  2. Text Preprocessing: The reviews are cleaned and tokenized by removing unnecessary characters, single characters, and extra spaces. Stop words are removed, and lemmatization is applied to reduce words to their root forms.
  3. Sentiment Analysis: Using the VADER sentiment analysis tool, the system computes compound sentiment scores for each review. Reviews are classified as positive, negative, or neutral based on predefined thresholds for compound scores.
  4. Clustering and Topic Modeling: K-means clustering and Principal Component Analysis (PCA) are used to identify distinct clusters of reviews based on term frequency-inverse document frequency (TF-IDF) vectors, allowing for exploration of different review themes.
  5. OpenAI GPT Integration: Summaries of clusters and major sentiment trends are generated using OpenAI's GPT-3/4 API. This provides an easily interpretable output, summarizing the key positive and negative points for each product.

Data Collection

Using the Selenium WebDriver, the script automates browser actions to scrape product reviews from Amazon. Each review's metadata is extracted, including:

Text Preprocessing

The raw review texts are processed using Python’s re and nltk libraries for NLP preprocessing. Steps include:

This ensures that the text is ready for analysis and minimizes noise.

Sentiment Analysis

Using the VADER sentiment analyzer, each review is scored for positive, neutral, and negative sentiment components. A compound score is computed, which classifies reviews as:

Additionally, the compound score is used to gauge overall sentiment intensity and polarity for further analysis.

Clustering and Topic Modeling

TF-IDF vectors are generated for each review, and K-means clustering is applied to group reviews based on thematic similarities. Principal Component Analysis (PCA) is then used for dimensionality reduction and visual representation.

Steps:

GPT Integration for Summarization

Using OpenAI’s GPT-3/4 API, the system summarizes each cluster of reviews, extracting top pros and cons. This step generates a readable summary of key themes in customer feedback, which helps in decision-making and understanding consumer preferences.

Results and Visualizations

The final outputs include:

Case Study: Hiccapop Product Reviews

The Hiccapop travel booster seat and related baby products were analyzed using the sentiment analysis and clustering techniques applied in this project. Key differences emerged between Hiccapop and a competitor’s product reviews, providing insight into customer preferences and areas for improvement.

Positive Sentiment Summary for Hiccapop

Hiccapop products received positive feedback for their:

Hiccapop Product

Hiccapop Product

Comparative Review Highlights

Compared to a similar product, Hiccapop was praised for its:

While both products received positive reviews, Hiccapop’s compact design, ease of use, and family-friendly features resonated particularly well with parents looking for reliable travel solutions for young children.

Conclusion

This project successfully demonstrates the use of advanced data science techniques for e-commerce analysis. The combination of NLP, sentiment analysis, clustering, and GPT-based summarization provides a comprehensive tool for analyzing customer sentiment on Amazon. The insights gained from this project can assist businesses in identifying product strengths and weaknesses, tailoring marketing strategies, and enhancing product features based on customer feedback.