From Words to Numbers: The Secret Behind AI Language Models (with code example)

AndReda Mind
3 min read · Oct 16, 2024


Large language models (LLMs) are machine learning models designed to process and generate human language. You’ve likely heard of models like GPT-3, one of the most widely known examples. These models are used in a field called Natural Language Processing (NLP).

How do they work? Instead of actually “understanding” language the way humans do, these models work with something called word embeddings. Think of embeddings as a way to convert words into numbers. Machines can’t deal with words directly, so each word in a sentence is represented as numbers (in practice, a list of numbers called a vector). The position of each word in the sentence also matters, so the model keeps track of that too.
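
As a toy illustration (the vocabulary and the numbers here are made up, not taken from any real model), this is what “words as numbers plus positions” looks like in code:

# Hypothetical word-to-number table; real models use vocabularies of tens of thousands of tokens
vocab = {"dog": 0, "bites": 1, "man": 2}

sentence_a = "dog bites man".split()
sentence_b = "man bites dog".split()

ids_a = [vocab[word] for word in sentence_a]   # [0, 1, 2]
ids_b = [vocab[word] for word in sentence_b]   # [2, 1, 0]

# Same words, same numbers, but different positions:
# the model sees two different sequences with two different meanings.
print(ids_a, ids_b)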

By comparing these numbers (the embeddings), the model can measure how similar sentences are and find patterns. It can say, “This sentence is similar to that one,” but it doesn’t really “understand” either of them. The same idea underpins text-to-image models, which align sentence embeddings with image embeddings so they can generate an image that matches a text prompt.
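
One common way to compare two embeddings is cosine similarity. A minimal sketch with made-up three-dimensional vectors (real sentence embeddings have hundreds of dimensions):

import numpy as np

# Hypothetical embeddings for two sentences
sentence_1 = np.array([0.9, 0.1, 0.3])
sentence_2 = np.array([0.8, 0.2, 0.4])

# Cosine similarity: a value close to 1.0 means the vectors point the same way
similarity = np.dot(sentence_1, sentence_2) / (
    np.linalg.norm(sentence_1) * np.linalg.norm(sentence_2)
)
print(round(similarity, 3))  # a similarity score, not "understanding"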

Now, let’s talk about how these embeddings are created. The model is trained on huge amounts of text. It learns which sentences and words are similar and generates embeddings that reflect these relationships. When you give it a new sentence, it creates a set of numbers (embedding) that represents it, and compares it with other known embeddings to make predictions or generate responses.
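
Here is a sketch of that lookup step, assuming the embeddings already exist. The sentences and vectors below are invented for illustration; in practice they would come from a model like the BERT example later in this post.

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical embeddings for sentences the system already knows about
known = {
    "How do I bake bread?":         np.array([0.1, 0.9, 0.2]),
    "What gives life meaning?":     np.array([0.8, 0.1, 0.6]),
    "Which planet is the hottest?": np.array([0.3, 0.4, 0.9]),
}

# Embedding for a new, unseen sentence (also made up)
new_embedding = np.array([0.7, 0.2, 0.5])

# The "prediction" here is simply the closest known embedding
best_match = max(known, key=lambda s: cosine(known[s], new_embedding))
print(best_match)  # -> What gives life meaning?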

In short, large language models use numbers to represent words and sentences. They compare these numbers to find patterns and generate new language or even images, without truly understanding the content like a human does.

Here’s a simple Python example using the BERT model to get embeddings for a sentence:

from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sentence to get embedding for
sentence = "What is your most profound life insight?"

# Tokenize the sentence and convert it to a tensor
inputs = tokenizer(sentence, return_tensors='pt')

# Get the model's output (no gradients needed, since we are only running inference)
with torch.no_grad():
    outputs = model(**inputs)

# The last hidden state holds one embedding vector per token in the sentence
embedding = outputs.last_hidden_state

# Print the embedding (for simplicity, just show the shape here)
print(embedding.shape)



The output:

torch.Size([1, 10, 768])

The Python code provided uses a pre-trained BERT model from the Hugging Face library to generate embeddings for a sentence. Here’s a breakdown:

1. Import necessary libraries:

  • BertTokenizer: Splits text into tokens and converts them to numeric ids that BERT can work with.
  • BertModel: The pre-trained BERT model that processes these tokens.
  • torch: Used to handle tensors, which are multi-dimensional arrays (needed for working with neural networks).

2. Load the pre-trained BERT model and tokenizer:

  • BertTokenizer.from_pretrained('bert-base-uncased'): Loads the tokenizer for the uncased BERT model, meaning it ignores case, so “Hello” and “hello” are treated the same (the quick check below shows this in practice).
  • BertModel.from_pretrained('bert-base-uncased'): Loads the BERT model that has already been trained on large amounts of text.
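
A quick check of what “uncased” means in practice, reusing the tokenizer loaded above:

# Both spellings are lowercased first, so they map to identical token ids
print(tokenizer("Hello there")["input_ids"])
print(tokenizer("hello there")["input_ids"])
# The two printed lists are exactly the same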

3. Input sentence:

“What is your most profound life insight?”

  • This is the sentence for which we want to generate an embedding.

4. Tokenize the sentence:

  • tokenizer(sentence): The tokenizer splits the sentence into tokens (smaller pieces, often whole words or subwords) and converts them to numeric ids.
  • return_tensors='pt': Returns those ids as a PyTorch tensor, which the model expects (the short peek below shows what this produces).
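
To see what this step actually produced, you can continue from the code above and print the tokens and their ids:

# The word pieces the sentence was split into (without the special tokens)
print(tokenizer.tokenize(sentence))

# The matching ids as a PyTorch tensor, with [CLS] added at the start
# and [SEP] at the end: ten ids in total for this sentence
print(inputs["input_ids"])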

5. Get the model’s output:

  • model(**inputs): Feeds the tokenized sentence into the BERT model to process it.
  • The model returns outputs, which include the embeddings (hidden states) for each token.

6. Extract the embedding:

  • outputs.last_hidden_state: The output of BERT’s final layer, containing one contextual embedding per token in the sentence. Together, these embeddings capture the meaning of the sentence (a single sentence vector is usually obtained by pooling them, as sketched below).
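
Note that last_hidden_state gives one vector per token rather than one vector for the whole sentence. A common (but not the only) way to get a single sentence vector is to average the token embeddings; a minimal sketch continuing from the code above:

# Average the per-token embeddings into one 768-dimensional sentence vector
sentence_vector = outputs.last_hidden_state.mean(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])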

7. Print the shape of the embedding:

  • The shape of the embedding tells us how many tokens were processed and how many numbers (dimensions) represent each one.

For example, the output shape might look like (1, 10, 768):

  • 1: The batch size (number of sentences, in this case, just 1 sentence).
  • 10: Number of tokens in the sentence, including the special [CLS] and [SEP] tokens BERT adds.
  • 768: The size of the embedding for each token (BERT uses 768-dimensional vectors).
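
You can read those dimensions straight off the tensor, continuing from the code above:

first_token_vector = embedding[0, 0]   # batch 0, token 0 (the [CLS] token)
print(first_token_vector.shape)        # torch.Size([768])
print(embedding.shape[1])              # 10 tokens in this sentence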

So this code turns the input sentence into a set of embeddings, the numerical representation that the model, and anything built on top of it, can actually work with.
