
Building a Chatbot with TensorFlow and Keras

Sophie Turol


Digital assistants built with machine learning are gaining momentum. At TensorBeat 2017, one of the sessions covered how to deliver an answer bot with Keras and TensorFlow, which tools can help to address the common issues, and shared tips on training a model and improving prediction results.

 

The challenges

Avkash Chauhan, Vice President at H2O.ai, outlined the major challenges of developing an answer bot using Keras on top of TensorFlow:

  • Finding proper tags
  • Finding and removing not-safe-for-work (NSFW) words
  • Identifying sentiment in the question (positive or negative)
  • Setting priority to find the answer (low, medium, high, and critical)
  • Figuring out the gender of a questioner
  • Rating a question
  • Removing question duplicates


 

Addressing the issues

To find proper tags, one may employ word embeddings (word2vec). To find and remove not-safe-for-work words, a brute-force search combined with the NLTK stop-word list can be used.
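The two techniques can be sketched in a few lines. In practice, the NLTK stop-word list and a pre-trained word2vec model would be loaded; the tiny word sets and three-dimensional vectors below are made-up stand-ins so the example stays self-contained.

```python
import math

# Toy stand-ins: a real pipeline would load NLTK's stop words and
# pre-trained word2vec vectors; these small sets are assumptions.
STOP_WORDS = {"the", "a", "is", "to", "how", "do", "i", "in"}
EMBEDDINGS = {                       # hypothetical 3-d word vectors
    "python":  [0.9, 0.1, 0.0],
    "pandas":  [0.8, 0.2, 0.1],
    "cooking": [0.0, 0.1, 0.9],
}

def clean(question):
    """Lower-case, tokenize, and drop stop words (the brute-force filter)."""
    return [w for w in question.lower().split() if w not in STOP_WORDS]

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

tokens = clean("How do I merge the frames in pandas")
```

In the embedding space, related words end up close together: here, "python" is far more similar to "pandas" than to "cooking", which is exactly what makes embedding-based tag matching work.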

Sentiment in the question can be identified by applying binomial classification (two classes) with tree-based algorithms (gradient boosting, random forest, or distributed random forest) or with a neural network. Using the same technique, one can also figure out whether a questioner is male or female.
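As a sketch of the tree-based route, the example below trains a gradient-boosting classifier on a tiny, made-up bag-of-words data set; real training data and features would of course be much larger, and a neural network could stand in the same place.

```python
# Binomial (two-class) sentiment classification with a tree-based model.
# The four labeled "questions" are fabricated for illustration only.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer

questions = [
    "great answer thanks this works",
    "love this clear and helpful",
    "awful broken does not work",
    "terrible confusing waste of time",
]
labels = np.array([1, 1, 0, 0])          # 1 = positive, 0 = negative

vec = CountVectorizer()                  # bag-of-words features
X = vec.fit_transform(questions).toarray()

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X, labels)

pred = clf.predict(vec.transform(["this is great and helpful"]).toarray())
```

The same setup with a sigmoid-output Keras network gives the neural-network variant of the classifier.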

The above-mentioned algorithms coupled with multinomial classification (four classes) can help to set priority while looking for an answer. Question rating is determined in the same fashion, but with N classes, as many as needed, for instance, five classes for a 1–5 star rating.

Finding the best available answers involves three major steps:

  1. Look for the tags and keywords through clustering and reduction
  2. Create tag and keywords weights for each question
  3. Match tags and keywords with their weights to find top probabilities
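Steps 2 and 3 boil down to weighted set matching. The sketch below scores candidate answers against a question's tag weights and ranks them; the weights would normally come out of the clustering/reduction step, so the numbers and answer names here are purely illustrative.

```python
# Hypothetical tag weights produced for one question (step 2).
question_tags = {"python": 0.6, "pandas": 0.3, "merge": 0.1}

# Hypothetical candidate answers, each described by its tag set.
answers = {
    "answer-1": {"python", "pandas"},
    "answer-2": {"java", "spring"},
    "answer-3": {"python", "merge", "pandas"},
}

def score(tags, weights):
    """Sum the weights of the question tags that an answer shares (step 3)."""
    return sum(weights[t] for t in tags if t in weights)

# Rank answers by total matched weight, best first.
ranked = sorted(answers, key=lambda a: score(answers[a], question_tags),
                reverse=True)
```

The top of `ranked` is the answer with the highest matched probability mass; in a real system the scores would be normalized into probabilities.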


Moving on to the issue of duplicated questions, Avkash cited the competition initiated by Quora, whose goal was to predict which of the provided question pairs have the same meaning.

Then, Avkash demonstrated how to classify sentences to identify rating and sentiment using the following data sets:

  • Real data available at StackOverflow, Community, Quora, etc.
  • Experimental data: 41 million reviews in 1–5 star category available at Yelp
  • Twitter sentiment (through searching or mining)

 

Training a model in the cloud

The technologies used in the course of training a model are as follows:

  • Keras on top of TensorFlow
  • the NLTK stop words
  • the GloVe algorithm (pre-trained word2vec data sets, 400K words)
  • Sentiment: make-sentiment-model.py and PositiveNegative.ipynb
  • Rating: make-5star-model.py and 5StarReviews.ipynb
  • Prediction: PredictNow.py

Apart from being fast, Keras was chosen as it supports both convolutional and recurrent neural networks, as well as their combination.
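A minimal sketch of such a combined convolutional + recurrent architecture in Keras is shown below; the vocabulary size, sequence length, and layer sizes are placeholder assumptions, not values from the talk.

```python
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(200,))                         # padded word ids
x = layers.Embedding(input_dim=10000, output_dim=100)(inputs)
x = layers.Conv1D(64, kernel_size=5, activation="relu")(x)  # local n-gram features
x = layers.MaxPooling1D(pool_size=4)(x)
x = layers.LSTM(64)(x)                                      # long-range word order
outputs = layers.Dense(1, activation="sigmoid")(x)          # e.g., a sentiment head

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```

The convolutional block extracts local phrase features cheaply, while the LSTM on top captures longer-range dependencies, which is the combination Keras makes easy to express.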


After the data preparation step, one has to create a data collection and remove stop words. Then, one tokenizes the collection and pads it to a uniform length to deliver a final data set, which comprises sentences of shape (sentences_per_record, length) and labels of shape (labels_per_record, length).
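Keras ships `Tokenizer` and `pad_sequences` helpers for this step; the plain-Python sketch below reproduces their core logic on two toy sentences to show what they do.

```python
# Toy cleaned collection (stop words already removed).
sentences = [
    ["merge", "frames", "pandas"],
    ["install", "pandas"],
]

# 1. Build a word index; id 0 is reserved for padding.
word_index = {}
for sent in sentences:
    for w in sent:
        word_index.setdefault(w, len(word_index) + 1)

# 2. Integer-encode each sentence, then 3. pad/truncate to a fixed length.
def encode(sent, length=4):
    ids = [word_index[w] for w in sent][:length]
    return ids + [0] * (length - len(ids))

data = [encode(s) for s in sentences]
# data now has the uniform (sentences_per_record, length) shape.
```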

After splitting the data set into training and validation parts, pre-trained word vectors are loaded to match words from the collection and create an embedding matrix. Once the embedding layer is delivered and configured, training of the model is launched.
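Building the embedding matrix amounts to copying each known word's pre-trained vector into the row given by its index; unknown words stay zero. The tiny in-memory "GloVe" dictionary below is a stand-in for the real 400K-word vector file.

```python
import numpy as np

glove = {                                # pretend pre-trained vectors, dim = 3
    "pandas": np.array([0.1, 0.2, 0.3]),
    "merge":  np.array([0.4, 0.5, 0.6]),
}
word_index = {"merge": 1, "pandas": 2, "unknownword": 3}
dim = 3

# Row i holds the vector for the word with index i; row 0 is the pad token.
embedding_matrix = np.zeros((len(word_index) + 1, dim))
for word, i in word_index.items():
    if word in glove:
        embedding_matrix[i] = glove[word]
```

The resulting matrix then initializes a frozen Keras `Embedding` layer (non-trainable weights), so the network starts from the pre-trained vectors instead of random ones.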


If the model keeps returning the same prediction, there are a few remedies:

  • Retrain a model.
  • Rebalance a data set by either upsampling a less frequent class or downsampling a more frequent one.
  • Adjust class weights by setting a higher class weight for a less frequent class. Thus, the network will pay more attention to the less frequent class during training.
  • Increase the time of training so that the network concentrates on less frequent classes.
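The class-weight remedy can be sketched in a few lines: give the rarer class a proportionally larger weight so the loss penalizes its mistakes more. The label counts below are made up; Keras accepts such a dict via `model.fit(..., class_weight=...)`.

```python
from collections import Counter

labels = [0] * 900 + [1] * 100          # heavily imbalanced toy labels
counts = Counter(labels)
total = len(labels)

# "Balanced" weighting: total / (n_classes * count_of_class), so the
# rare class 1 gets a 9x larger weight than the frequent class 0.
class_weight = {c: total / (len(counts) * n) for c, n in counts.items()}
```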

To enhance data processing, Avkash suggested using such models as doc2vec, sequence-to-sequence models, and lda2vec.

You can find the source code of the answer bot in Avkash’s GitHub repo.

 

Want details? Watch the video!

 

 


About the speaker

Avkash Chauhan is Vice President at H2O.ai. His responsibilities encompass working with global enterprise customers to bring their machine and deep learning requirements to the engineering team and to make sure these requirements are met in the delivered products. Prior to that, Avkash ran his own Big Data analytics for DevOps startup, which was acquired 28 months after launch. You can also check out his GitHub profile.


To stay tuned with the latest updates, subscribe to our blog or follow @altoros.
