Tripscraper

app
nlp
streamlit
An open-source NLP playground application
Published

July 10, 2023

source code


The application

The application uses a variety of NLP techniques, including sentiment analysis, word frequency analysis, word trends over time, and aspect-based sentiment analysis.

The Playground section is a sandbox where users can experiment with the NLP models that are used by the tool. This section allows users to try out different features of the models, to see how they work, and to learn more about how NLP can be used to analyze TripAdvisor reviews.

We use the selenium package to scrape reviews from a TripAdvisor hotel page, nltk to tokenize and pre-process the text data, and Hugging Face’s transformers package for pre-trained models.

Background

Text mining is the process of extracting information and insights from text. It can be used to identify patterns, trends, and sentiment in large amounts of text data.

In this article we’ll explore how text mining can serve as an evaluation tool that helps hotel owners understand their customers’ feelings and thoughts.

For the sake of clarity, we’ll focus on two of the many techniques used in the application:

  • Bigrams analysis
  • Word trends analysis

Bigrams analysis

A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for \(n=2\).

Wikipedia
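To make the definition concrete, here is a minimal sketch of counting word bigrams with the Python standard library only. The function name `count_bigrams`, the naive regex tokenizer, and the sample sentences are assumptions made for illustration; the application itself tokenizes with nltk.

```python
import re
from collections import Counter


def count_bigrams(reviews):
    """Count adjacent word pairs across a list of review strings."""
    counts = Counter()
    for text in reviews:
        # Naive tokenizer (an assumption): lowercase runs of letters/apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        # Pair each token with its successor; pairs never span two reviews.
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Made-up sentences, for illustration only.
sample_reviews = [
    "The hotel staff was friendly",
    "Friendly staff and a clean room",
    "The hotel staff was helpful",
]

bigram_counts = count_bigrams(sample_reviews)
print(bigram_counts.most_common(3))
```

Counting per review, rather than over one concatenated string, avoids creating artificial bigrams between the last word of one review and the first word of the next.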

When we talk about bigrams we may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few at a time.

As one common visualization, we can arrange the words into a network, or “graph”. Here we’ll be referring to a “graph” not in the sense of a visualization, but as a combination of connected nodes. Each connection (edge) in such a graph is described by three variables:

  • from: the node an edge is coming from
  • to: the node an edge is going towards
  • weight: a numeric value associated with each edge
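The three components above can be sketched as plain Python data before any graph library gets involved. The counts below are invented for illustration, and the variable names `edges` and `adjacency` are my own, not taken from the application.

```python
from collections import defaultdict

# Hypothetical bigram counts, invented for illustration.
bigram_counts = {
    ("hotel", "staff"): 12,
    ("staff", "friendly"): 9,
    ("staff", "helpful"): 7,
}

# One (from, to, weight) triple per bigram.
edges = [(first, second, count)
         for (first, second), count in bigram_counts.items()]

# The node set is every word that appears at either end of an edge.
nodes = {word for first, second, _ in edges for word in (first, second)}

# Adjacency mapping: node -> {neighbour: weight}.
adjacency = defaultdict(dict)
for source, target, weight in edges:
    adjacency[source][target] = weight

print(sorted(nodes))
print(dict(adjacency))
```

In this representation the first word of each bigram plays the role of "from", the second word is "to", and the bigram frequency is the "weight".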

Note that this kind of graph is a visualization of a Markov chain, a common model in text processing. In a Markov chain, each choice of word depends only on the previous word. So in this case the chain starting from “hotel” would suggest “staff” and then “friendly” or “helpful” as next words, by following each word to the most common words that follow it.
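The idea of following each word to its most common successor can be sketched as a short greedy walk. This is an illustration under stated assumptions: the counts are made up, `most_likely_chain` is a hypothetical helper, and a full Markov chain would sample the next word in proportion to its count rather than always taking the maximum.

```python
def most_likely_chain(bigram_counts, start, max_steps=3):
    """Follow the most frequent successor from `start`, step by step."""
    # successors[word] maps each following word to its bigram count.
    successors = {}
    for (first, second), count in bigram_counts.items():
        successors.setdefault(first, {})[second] = count

    chain = [start]
    current = start
    for _ in range(max_steps):
        options = successors.get(current)
        if not options:
            break  # no observed successor for this word
        current = max(options, key=options.get)
        if current in chain:
            break  # stop instead of looping forever on a cycle
        chain.append(current)
    return chain


# Made-up counts, for illustration only.
toy_counts = {
    ("hotel", "staff"): 12,
    ("hotel", "room"): 8,
    ("staff", "friendly"): 9,
    ("staff", "helpful"): 7,
}

chain = most_likely_chain(toy_counts, "hotel")
print(" -> ".join(chain))
```

The next word is chosen by looking only at the current word, which is exactly the Markov property described above.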

Code
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt


def words_network_graph(dataset,
                        tuple_col,
                        raw,
                        frequency):

    # read data
    data = pd.read_csv(dataset)

    # tupler: turn a "word1 word2" string from the csv file
    # into a (word1, word2) tuple
    def tupler(w):
        first, second = w.split(' ')[:2]
        return (first, second)
    data[tuple_col] = data[raw].apply(tupler)

    # keep only bigrams seen at least `frequency` times
    data = data[data['frequency'] >= frequency]

    # create dictionary with counts: {(word1, word2): frequency}
    d = dict(zip(data[tuple_col], data['frequency']))

    # network graph
    G = nx.Graph()

    # edges connections
    for k, v in d.items():
        G.add_edge(k[0], k[1], weight=(v*30))

    # nodes position
    pos = nx.spring_layout(G, k=2)

    # edges weight, scaled down to a readable line width
    weights = [w*0.0060 for w in nx.get_edge_attributes(G, 'weight').values()]

    # plot
    blue_munsell = '#0085A1'
    eerie_black = '#242728'

    fig, ax = plt.subplots(figsize=(7, 5))
    fig.set_facecolor(eerie_black)
    ax.set_axis_off()

    nx.draw_networkx(G, pos,
                     width=weights,
                     edge_color='white',
                     node_color=blue_munsell,
                     with_labels=False,
                     ax=ax,
                     node_size=50)

    # nudge: shift label positions so they do not overlap the nodes
    def nudge(pos, x_shift, y_shift):
        return {n: (x + x_shift, y + y_shift) for n, (x, y) in pos.items()}
    pos_nodes = nudge(pos, 0.01, 0.1)
    nx.draw_networkx_labels(G,
                            pos=pos_nodes,
                            ax=ax,
                            font_color='white',
                            font_size=7)
Code
words_network_graph('reviews_bigrams.csv',
                    'bigram',
                    'bigrams',
                    10)

Figure 1: Bigrams network graph

Conclusions

As an NLP exercise, this article aims to show the potential of text mining for business purposes and how it can be a powerful tool for gathering insights on a product or service.

You can find the source code at the top of this article.