Chatbots have surged in popularity with ChatGPT and other AI chatbots. Building my own chatbot is not easy because of the limited amount of data I have. However, using all the WhatsApp chats I had, I was able to create a decent chatbot that can even speak different languages, such as Catalan. Take a look at a conversation with my own chatbot.
The performance of a chatbot depends on the quality of the data and the algorithm. Thanks to the massive amount of data that Google has and its highly skilled professionals, Google AI can hold remarkably deep conversations.
However, when creating my own chatbot, which is based solely on a .txt file containing a WhatsApp conversation, I could see that the algorithm works, and I was able to have a basic level of conversation.
Next, let's see how the algorithm works:
1) Get all the distinct words in the message sent by the user.
2) Read every line of the WhatsApp conversation:
2.1) Count the words shared between the message and each line.
2.2) Save the lines that share the most words with the message.
2.3) Get the answer to each saved line.
(Answer = next line of the conversation)
2.4) Return the most repeated answer.
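The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the full script: the function name `reply` and the sample `chat` list are hypothetical, and the sketch assumes lowercase whole-word matching over every line, whereas the script further down works on a pandas DataFrame and only looks at every second turn.

```python
from collections import Counter


def reply(message, lines):
    """Return the most frequent follow-up to the lines that best match `message`."""
    # Step 1: distinct words of the user's message
    words = set(message.lower().split())
    best_score = 0
    candidates = []
    # Step 2: read every line that has a following line (its "answer")
    for i in range(len(lines) - 1):
        # Step 2.1: count the words shared with this line
        score = len(words & set(lines[i].lower().split()))
        # Step 2.2: a new best score discards the previously saved answers
        if score > best_score:
            best_score = score
            candidates = []
        # Step 2.3: the answer is the next line of the conversation
        if score == best_score and score > 0:
            candidates.append(lines[i + 1])
    if not candidates:
        return None
    # Step 2.4: return the most repeated answer
    return Counter(candidates).most_common(1)[0][0]


# Hypothetical sample conversation, one message per line
chat = [
    "hello how are you",
    "fine thanks",
    "are you coming tonight",
    "yes see you there",
    "hello are you home",
    "fine thanks",
]
print(reply("how are you", chat))
```

Here only the first line shares all three words with the message, so the sketch answers with the line that follows it. Returning `None` when nothing matches is a design choice of this sketch, not something the original script does.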
Below is a video of a conversation with the chatbot, along with the link to the GitHub repository, where you can load the “.txt” export of your own WhatsApp conversations and “talk to yourself”.
# Import
# pandas is the only library the script below actually uses
import pandas as pd
# Read the data:
txtdata = pd.read_table(".txt", on_bad_lines='skip') # --> Enter the path to your .txt file here
# Data Preparation:
## Separate the txt file into different columns
# Column with all the information:
column_name = txtdata.columns[0]
# Separate the information into different columns
txtdata["Date"] = txtdata[column_name].str.split(" ").str.get(0).str.title()
txtdata["Hour"] = txtdata[column_name].str.split(" ").str.get(1).str.title()
# Note: this keeps only the first word of the sender's name
txtdata["Name"] = txtdata[column_name].str.split(" ").str.get(2).str.title()
# Split on the first three colons only, so a colon inside the message does not cut it short
txtdata["Content"] = txtdata[column_name].str.split(":", n=3).str.get(3)
# Save the columns:
# Columns to keep:
columns_save = ['Date', 'Hour', 'Name', 'Content']
# New df (a copy, so later assignments do not write to a slice of txtdata):
df = txtdata[columns_save].copy()
# Clean the data:
# Delete the "[" and "]" brackets and the ":"
df["Date"] = df["Date"].str.split("[").str.get(1)
df["Hour"] = df["Hour"].str.split("]").str.get(0)
df["Name"] = df["Name"].str.split(":").str.get(0)
## Delete the NaN values
# Keep only the rows that have a date
df = df[df["Date"].notna()]
# Drop the messages that contain the "omitted" placeholder for images or audio
# (rows with missing content are dropped as well)
df = df[~df["Content"].str.contains("omitted", na=True)]
# Merge
# Create helper columns to merge the conversations:
# Work on an independent copy so the assignments below modify df itself
df = df.copy()
df["Change"] = 0
df["Count"] = 0
# Column positions, so each value can be set with a single .iloc call
change_col = df.columns.get_loc("Change")
count_col = df.columns.get_loc("Count")
# Loop that detects when the writer changes and numbers the turns.
# The original chained assignment (df["Change"].iloc[i] = 1) raised a pandas warning.
coun = 0
for i in range(1, len(df)):
    if df["Name"].iloc[i] == df["Name"].iloc[i - 1]:
        # Same writer as the previous row: stay in the current turn
        df.iloc[i, change_col] = 1
    else:
        # The writer changed: start a new turn
        df.iloc[i, change_col] = 0
        coun = coun + 1
    df.iloc[i, count_col] = coun
# Find the most frequent value in a list
def most_frequent(List):
    return max(set(List), key=List.count)
# Group the conversations:
# Group the messages of the same writer until the writer changes, keeping the first message of each turn
df_gr = df.groupby(df["Count"]).first()
df_gr = df_gr[:-1]
# Chat Bot:
while True:
    # Read our message and get its distinct words
    msg = input("Me:")
    list_msg = list(set(msg.split()))
    # Save the possible answers
    Posiibles_respostes = []
    # Read all the messages to detect the ones that are similar to our msg
    max_fun = df_gr["Name"].count()
    max_fun = max_fun - 1
    i = 0
    max_count = 0
    while i < max_fun:
        try:
            # Get the content
            content = df_gr["Content"].iloc[i]
            # Counter of the words of our msg that appear in the content
            count_simitud = 0
            # Loop over all the words in our msg
            for paraula in list_msg:
                # If the word appears in the content (substring match), add 1 to the counter
                if paraula in content:
                    count_simitud = count_simitud + 1
            # If this is a new maximum, delete all the other saved answers
            if count_simitud > max_count:
                Posiibles_respostes = []
            # If the count is greater than or equal to the maximum, save the maximum, its position, and the possible answer:
            if count_simitud >= max_count:
                max_count = count_simitud
                num_max = i
                resposta_provisional = num_max + 1
                # List with the possible answers:
                resposta = df_gr["Content"].iloc[resposta_provisional]
                Posiibles_respostes.append(resposta)
        except Exception:
            # Skip turns whose content is missing or cannot be compared
            pass
        # Add 2 because we want to read the content of the same person
        i = i + 2
    # Get the most frequent answer:
    resposta_def = most_frequent(Posiibles_respostes)
    resposta_def = resposta_def.replace("baby", "")
    # Print the answer of the Bot
    print("Bot:", resposta_def)
    print("")
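
A note on the data preparation step: splitting on spaces keeps only the first word of the sender's name. As a possible alternative, a regular expression can capture the four fields in one pass. The sketch below is only an illustration and assumes each line looks like "[DD/MM/YY, HH:MM:SS] Name: message", which is the layout the splitting code above implies. The names `LINE_RE` and `parse_line` are hypothetical, and export formats differ between devices, so the pattern may need adjusting.

```python
import re

# Assumed line layout: "[DD/MM/YY, HH:MM:SS] Name: message"
LINE_RE = re.compile(
    r"^\[(?P<date>[^,\]]+),\s*(?P<hour>[^\]]+)\]\s"
    r"(?P<name>[^:]+):\s(?P<content>.*)$"
)


def parse_line(line):
    """Return a dict with date, hour, name and content, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


# Hypothetical sample line
row = parse_line("[12/03/23, 18:05:41] Anna Maria: See you at 20:30")
print(row)
```

With this pattern the full name "Anna Maria" and the complete message "See you at 20:30" are kept, while lines that do not match the assumed layout (for example, the continuation of a multi-line message) return `None` and can be skipped.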