
Hello, dear followers,
I am writing this article by asking questions to OpenAI's ChatGPT chat software.
How would you code a chatbot that learns from Wikipedia? I asked ChatGPT…
First, the question: how do I build a chatbot in Python that learns from Wikipedia articles?
import wikipedia

class Chatbot:
    def __init__(self):
        # Train the chatbot on Wikipedia articles
        articles = wikipedia.search("machine learning")
        text = ""
        for article in articles:
            text += wikipedia.summary(article)
        # TODO: Use the text from the Wikipedia articles to train the chatbot
        # This could involve preprocessing the text, tokenizing it,
        # creating a vocabulary, and training a model on the data

    def get_response(self, input):
        # TODO: Generate a response based on the input from the user
        # This function should use the trained model to generate a response
        # based on the provided input
        return "I'm sorry, I don't know the answer to that."
As you can see, it wrote skeleton code. With wikipedia.summary we fetch the summary of each article returned by the search and append it to the text string.
Next I asked ChatGPT for more detail: how do I train it? The answer:
import nltk
import wikipedia

class Chatbot:
    def __init__(self):
        # Train the chatbot on Wikipedia articles
        articles = wikipedia.search("machine learning")
        text = ""
        for article in articles:
            text += wikipedia.summary(article)
        # Preprocess the text
        text = text.lower()
        text = text.replace("\n", " ")
        # Tokenize the text
        tokens = nltk.tokenize.word_tokenize(text)
        # Create a vocabulary
        vocabulary = set(tokens)
        # Create a model
        model = {}
        for word in vocabulary:
            model[word] = 0
        # Train the model on the text data
        for word in tokens:
            model[word] += 1
        # Save the trained model
        self.model = model

    def get_response(self, input):
        # Generate a response based on the input from the user
        # This function should use the trained model to generate a response
        # based on the provided input
        return "I'm sorry, I don't know the answer to that."

# Create an instance of the Chatbot class
chatbot = Chatbot()
# Ask the user for input and get a response from the chatbot
user_input = input("Enter your question or statement: ")
response = chatbot.get_response(user_input)
# Print the response to the user
print("Chatbot: " + response)
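
The "training" loop in this answer only counts how often each token occurs, so the resulting model is a word-frequency table rather than something that can produce a reply. As a rough sketch of that same step, the counting can be written with `collections.Counter`. The regex tokenizer and the `build_frequency_model` name are assumptions made here so the snippet runs without NLTK data or network access.

```python
import re
from collections import Counter


def build_frequency_model(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    # Stand-in tokenizer (assumption): runs of ASCII letters, digits, or apostrophes
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


sample = "Machine learning is fun.\nLearning from data is machine learning."
model = build_frequency_model(sample)
print(model["learning"])  # 3
print(model["chatbot"])   # 0, because a Counter returns 0 for unseen words
```

A table like this can say which words are common in the articles, but it holds no information about word order, which is why get_response still has nothing to work with.
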
Then I asked ChatGPT to implement the get_response function, and it gave this answer:
class Chatbot:
    def __init__(self, model):
        # Save the trained model
        self.model = model

    def get_response(self, input):
        # Tokenize the input
        tokens = nltk.tokenize.word_tokenize(input)
        # Generate a response based on the input
        response = ""
        for i in range(len(tokens) - 1):
            word = tokens[i]
            next_word = tokens[i + 1]
            if word in self.model:
                response += self.model[word].get_best_prediction() + " "
        response += tokens[-1]
        return response
And so, by asking ChatGPT, we got it to write chatbot code that learns from Wikipedia articles.
Wait a minute! Looking at the code, I noticed an undefined function called get_best_prediction, so I asked about that too. The answer:
class Chatbot:
    def __init__(self, model):
        # Save the trained model
        self.model = model

    def get_best_prediction(self):
        # Return the word with the highest probability according to the trained model
        best_prediction = None
        best_probability = 0
        for word, probability in self.model.items():
            if probability > best_probability:
                best_prediction = word
                best_probability = probability
        return best_prediction

    def get_response(self, input):
        # Tokenize the input
        tokens = nltk.tokenize.word_tokenize(input)
        # Generate a response based on the input
        response = ""
        for i in range(len(tokens) - 1):
            word = tokens[i]
            next_word = tokens[i + 1]
            if word in self.model:
                response += self.model[word].get_best_prediction() + " "
        response += tokens[-1]
        return response
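Then I noticed that the model object here is different from the one at the start: the first version stores a plain count per word, while get_response expects every entry of the model to offer a get_best_prediction method. "Come on, spit it out," I said 🙂 Interestingly, it kept refusing to code the probability part of the model, so the code as a whole would not run.
So I asked for that part specifically. This time it wrote more detailed code.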
import nltk

class Chatbot:
    def __init__(self):
        # Load and preprocess the text data
        text = LoadAndPreprocessTextData()
        # Tokenize the text
        tokens = nltk.tokenize.word_tokenize(text)
        # Create a vocabulary
        vocabulary = set(tokens)
        # Create a model: each word maps to a dict of next-word counts
        model = {}
        for word in vocabulary:
            model[word] = {}
        # Train the model on the text data
        for i in range(len(tokens) - 1):
            word = tokens[i]
            next_word = tokens[i + 1]
            model[word][next_word] = model[word].get(next_word, 0) + 1
        # Normalize the counts into probabilities
        for word, next_words in model.items():
            total_count = sum(next_words.values())
            for next_word, count in next_words.items():
                model[word][next_word] = count / total_count
        # Save the trained model
        self.model = model

    def get_best_prediction(self, word):
        # Return the most probable next word for the given word,
        # or None if the word has no recorded follower
        best_prediction = None
        best_probability = 0
        for next_word, probability in self.model.get(word, {}).items():
            if probability > best_probability:
                best_prediction = next_word
                best_probability = probability
        return best_prediction

    def get_response(self, input):
        # Tokenize the input (lowercased to match the training text)
        tokens = nltk.tokenize.word_tokenize(input.lower())
        if not tokens:
            return ""
        # Generate a response: the likeliest follower of each input word
        response = ""
        for i in range(len(tokens) - 1):
            prediction = self.get_best_prediction(tokens[i])
            if prediction is not None:
                response += prediction + " "
        response += tokens[-1]
        return response
I think this code is enough for you to understand the general algorithm of a chatbot that learns from arbitrary text.
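
To make the idea concrete, here is a minimal, self-contained sketch of the same next-word (bigram) approach, assuming a whitespace tokenizer and a tiny inline corpus instead of NLTK and a data file. The names `train_bigram_model` and `respond` are illustrative and not part of the ChatGPT answers. Unlike the class above, this sketch starts from the last known word of the prompt and walks forward through the likeliest successors.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Map each word to the relative frequencies of the words that follow it."""
    tokens = text.lower().split()
    counts = defaultdict(Counter)
    for word, next_word in zip(tokens, tokens[1:]):
        counts[word][next_word] += 1
    model = {}
    for word, followers in counts.items():
        total = sum(followers.values())
        model[word] = {nxt: n / total for nxt, n in followers.items()}
    return model


def respond(model, prompt, max_words=5):
    """Start from the last known prompt word and follow the likeliest successors."""
    known = [w for w in prompt.lower().split() if w in model]
    if not known:
        return "I'm sorry, I don't know the answer to that."
    word = known[-1]
    reply = [word]
    for _ in range(max_words):
        followers = model.get(word)
        if not followers:
            break
        word = max(followers, key=followers.get)
        reply.append(word)
    return " ".join(reply)


corpus = "the cat sat on the mat and the cat sat on the sofa"
model = train_bigram_model(corpus)
print(model["the"])  # {'cat': 0.5, 'mat': 0.25, 'sofa': 0.25}
print(respond(model, "where is the cat", max_words=3))  # cat sat on the
```
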
As a bonus question, I also asked about the LoadAndPreprocessTextData function:
import re

def LoadAndPreprocessTextData():
    # Load the text data from a file
    with open("data.txt", "r") as file:
        text = file.read()
    # Preprocess the text data
    text = text.lower()
    text = text.replace("\n", " ")
    # Keep only lowercase ASCII letters and spaces
    text = re.sub(r"[^a-z ]", "", text)
    return text
We convert the text to lowercase, turn newline characters into spaces, and strip special characters from the text. In practice, additional steps are needed here, such as collapsing runs of whitespace into a single space.
So I asked for better code, and this time it wrote a more advanced version of the function:
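import re

# Defaults are supplied so the earlier no-argument call keeps working
def LoadAndPreprocessTextData(file_path="data.txt", encoding="utf-8"):
    # Load the text data from a file
    with open(file_path, "r", encoding=encoding) as file:
        text = file.read()
    # Preprocess the text data
    text = text.lower()
    text = text.replace("\n", " ")
    # Keep only lowercase ASCII letters and spaces
    text = re.sub(r"[^a-z ]", "", text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    return text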
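
One caveat: the pattern `[^a-z ]` also deletes digits and every non-ASCII letter, so Turkish words such as "öğrenme" lose characters. The following is a small sketch of a Unicode-aware variant that works on an in-memory string. The `preprocess_text` name and the choice to keep digits are assumptions, not something ChatGPT produced.

```python
import re


def preprocess_text(text):
    """Lowercase, drop punctuation, and collapse whitespace, keeping Unicode letters."""
    text = text.lower()
    # \w matches letters and digits in any script; everything else becomes a space
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse newlines, tabs, and repeated spaces into single spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(preprocess_text("Makine  öğrenmesi,\nyapay zekanın bir alt dalıdır!"))
# makine öğrenmesi yapay zekanın bir alt dalıdır
```

Replacing punctuation with a space, rather than deleting it, keeps words apart when the source text has no space after a comma or period.
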
ChatGPT writes higher-quality code the more you question it, push it, and insist. So I requested the full and improved version of the Chatbot class, but ChatGPT did not answer and left me waiting. I had to start over, and this time it gave this answer:
import openai_secret_manager
assert "openai" in openai_secret_manager.get_services()
secrets = openai_secret_manager.get_secrets("openai")
print(secrets)
import openai
openai.api_key = secrets["api_key"]

class Chatbot:
    def __init__(self, model_name: str):
        self.model_name = model_name
        self.model = openai.Model.get(model_name)

    def generate_response(self, prompt: str, temperature: float = 0.5, max_tokens: int = 64):
        completions = self.model.completions(
            engine="text-davinci-002",
            prompt=prompt,
            temperature=temperature,
            max_tokens=max_tokens,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
        return completions.choices[0].text
And then this:
chatbot = Chatbot("my-trained-gpt3-model")
response = chatbot.generate_response("What is the capital of France?")
print(response) # "The capital of France is Paris."
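So apparently we can also build a chatbot using the OpenAI service. Note that openai_secret_manager is not part of the openai package, so treat the snippet as illustrative rather than something to run as-is. I then asked how to do it with the nltk library instead. My goal was to get higher-quality code. Here is the answer: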
import nltk
from nltk.corpus import wikipedia

class Chatbot:
    def __init__(self):
        self.tokenizer = nltk.tokenize.TreebankWordTokenizer()
        self.lemmatizer = nltk.stem.WordNetLemmatizer()
        # Get the list of Wikipedia articles
        articles = wikipedia.fileids()
        self.titles = [title for title in articles if "wiki" in title]
        # Store the content of each article as a string in a dictionary
        # with the article title as the key
        self.articles = {title: wikipedia.raw(title) for title in self.titles}

    def generate_response(self, prompt: str):
        # Tokenize the prompt and lemmatize each token
        tokens = self.tokenizer.tokenize(prompt)
        tokens = [self.lemmatizer.lemmatize(token) for token in tokens]
        # Search for articles that match the prompt
        matching_articles = []
        for title, content in self.articles.items():
            # Check if all tokens in the prompt are present in the article title
            if all(token in title.lower() for token in tokens):
                matching_articles.append((title, content))
        # If there are no matching articles, return a default response
        if not matching_articles:
            return "I'm sorry, I could not find any information on that topic."
        # If there is only one matching article, return the content of that article
        if len(matching_articles) == 1:
            return matching_articles[0][1]
        # If there are multiple matching articles, return a list of their titles
        else:
            response = "I found multiple articles on that topic. Here is a list of their titles:\n"
            for article in matching_articles:
                response += f"- {article[0]}\n"
            return response
And the follow-up answer:
chatbot = Chatbot()
response = chatbot.generate_response("What is the capital of France?")
print(response) # "The capital of France is Paris."
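As you can see, it started giving different code for the same question over and over. Note that NLTK's nltk.corpus package does not include a wikipedia corpus reader, so this variant fails at the import line. I asked again, and once more it described how to do it with the OpenAI service: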
import openai_secret_manager
import openai
import re
assert "openai" in openai_secret_manager.get_services()
secrets = openai_secret_manager.get_secrets("openai")
print(secrets)
openai.api_key = secrets["api_key"]
# Load the Wikipedia articles dataset using the OpenAI API
articles_dataset = openai.Dataset.from_files(["https://d4mucfpksywv.cloudfront.net/wiki-pages/wiki-001.jsonl"])
# Create a new GPT-3 model and train it on the Wikipedia articles dataset
model = openai.Model.create(
    model="text-davinci-002",
    dataset=articles_dataset,
    name="my-trained-gpt3-model"
)

class Chatbot:
    def __init__(self, model_name: str):
        self.model_name = model_name
        self.model = openai.Model.get(model_name)
        self.tokenizer = openai.Tokenizer.create(self.model_name)

    def generate_response(self, prompt: str, temperature: float = 0.5, max_tokens: int = 64):
        # Tokenize the prompt and encode it for input to the GPT-3 model
        tokens = self.tokenizer.encode(prompt)
        input = openai.Completion.create(
            engine=self.model_name,
            prompt=tokens,
            temperature=temperature,
            max_tokens=max_tokens,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
        # Generate the response using the GPT-3 model
        response = self.model.completions(input)[0]
        # Decode the response and remove any leading or trailing whitespace
        response_text = self.tokenizer.decode(response.tokens, skip_special_tokens=True)
        response_text = response_text.strip()
        # Remove any additional sentence-ending punctuation
        response_text = re.sub(r"[.!?]+$", "", response_text)
        return response_text

chatbot = Chatbot("my-trained-gpt3-model")
response = chatbot.generate_response("What is the capital of France?")
print(response) # "The capital of France is Paris."
But I did not give up and kept asking until I got the right answer. I am sharing the chatbot variations with you in case they are useful:
import nltk
from nltk.corpus import wikipedia
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
# Download the NLTK data and the Wikipedia corpus
nltk.download("popular")
# Get the list of Wikipedia articles
articles = wikipedia.fileids()
titles = [title for title in articles if "wiki" in title]
# Store the content of each article as a string in a dictionary
# with the article title as the key
articles = {title: wikipedia.raw(title) for title in titles}

# Define a function to get the part of speech of a word
def get_pos(word: str):
    pos = wordnet.synsets(word)[0].pos()
    if pos == "n":
        return "noun"
    elif pos == "v":
        return "verb"
    elif pos == "a":
        return "adjective"
    elif pos == "r":
        return "adverb"
    else:
        return "unknown"

class Chatbot:
    def __init__(self):
        self.tokenizer = nltk.tokenize.TreebankWordTokenizer()
        self.lemmatizer = nltk.stem.WordNetLemmatizer()
        self.articles = articles

    def generate_response(self, prompt: str):
        # Tokenize the prompt and lemmatize each token
        tokens = self.tokenizer.tokenize(prompt)
        lemmas = [self.lemmatizer.lemmatize(token, get_pos(token)) for token in tokens]
        # Search for articles that match the prompt
        matching_articles = []
        for title, content in self.articles.items():
            # Check if all tokens in the prompt are present in the article title
            if all(lemma in title.lower() for lemma in lemmas):
                matching_articles.append((title, content))
        # If there are no matching articles, return a default response
        if not matching_articles:
            return "I'm sorry, I could not find any information on that topic."
        # If there is only one matching article, return the content of that article
        if len(matching_articles) == 1:
            return matching_articles[0][1]
        # If there are multiple matching articles, return a list of their titles
        else:
            response = "I found multiple articles on that topic. Here is a list of their titles:\n"
            for article in matching_articles:
                response += f"- {article[0]}\n"
            return response

chatbot = Chatbot()
response = chatbot.generate_response("What is the capital of France?")
print(response) # "The capital of France is Paris."
Frankly, I could never get back to the first skeleton code. So I copied the skeleton code, asked ChatGPT to complete it in an improved way, and finally, HERE IS THE WORKING CODE:
import wikipedia
from nltk.tokenize import sent_tokenize  # needs the NLTK punkt tokenizer data
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class Chatbot:
    def __init__(self):
        # Train the chatbot on Wikipedia articles
        articles = wikipedia.search("machine learning")
        text = ""
        for article in articles:
            try:
                text += wikipedia.summary(article) + " "
            except wikipedia.exceptions.WikipediaException:
                # Skip ambiguous or missing pages
                continue
        # Split the text into sentences; these are the candidate responses
        self.sentences = sent_tokenize(text)
        # Create a bag-of-words representation of the sentences
        # (CountVectorizer lowercases and tokenizes internally)
        self.vectorizer = CountVectorizer()
        self.X = self.vectorizer.fit_transform(self.sentences)

    def get_response(self, input):
        # Project the input into the same bag-of-words space
        input_vector = self.vectorizer.transform([input])
        # Compute the similarity scores between the input and all sentences
        similarity_scores = cosine_similarity(input_vector, self.X)[0]
        # Find the sentence with the highest similarity score
        best_match = self.sentences[similarity_scores.argmax()]
        # Return the best matching sentence as the response
        return best_match

chatbot = Chatbot()
response = chatbot.get_response("What is machine learning?")
# The output depends on the current Wikipedia text, e.g.
# "Machine learning is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead."
print(response)
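
To show what the matching step computes, here is a dependency-free sketch of bag-of-words cosine similarity over a few hard-coded sentences. The helper names (`bag_of_words`, `cosine`, `best_match`) and the sample sentences are illustrative assumptions. The token pattern mirrors CountVectorizer's default of keeping only tokens with two or more characters.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens of two or more characters."""
    return Counter(re.findall(r"\w\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(question, sentences):
    """Return the sentence whose bag of words is closest to the question."""
    query = bag_of_words(question)
    return max(sentences, key=lambda s: cosine(query, bag_of_words(s)))


sentences = [
    "Machine learning is a field of study in artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Paris is the capital of France.",
]
print(best_match("What is machine learning?", sentences))
# Machine learning is a field of study in artificial intelligence.
```

The first sentence wins because it shares three tokens with the question ("is", "machine", "learning"), while each of the others shares only one.
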
Try using the full article texts directly instead of wikipedia.summary 😉
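
As a starting point for that exercise, here is a rough sketch that collects full article text through `wikipedia.page(title).content` and skips titles that fail to load. The `collect_text` helper and its `fetch` parameter are assumptions introduced so the logic can be tried offline with a stand-in fetcher.

```python
def fetch_full_article(title):
    """Fetch the full plain-text content of one article (needs network access)."""
    import wikipedia  # imported lazily so the rest of the sketch runs offline
    return wikipedia.page(title).content


def collect_text(titles, fetch=fetch_full_article):
    """Concatenate article texts, skipping any title whose fetch raises."""
    parts = []
    for title in titles:
        try:
            parts.append(fetch(title))
        except Exception:
            # For example, disambiguation pages or missing articles
            continue
    return "\n".join(parts)


# Offline demonstration with a stand-in fetcher (hypothetical data)
fake_articles = {"A": "Text of article A.", "B": "Text of article B."}
print(collect_text(["A", "missing", "B"], fetch=fake_articles.__getitem__))
```

With network access, passing the result of `wikipedia.search("machine learning")` as `titles` would gather the full articles, which can then replace the concatenated summaries in the Chatbot constructor.
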
Happy coding 🙂