GenAI – Fixed Size Chunking



Table Of Contents:

  1. What Is Fixed Size Chunking?
  2. When To Use Fixed Size Chunking?
  3. Advantages Of Fixed Size Chunking.
  4. Disadvantages Of Fixed Size Chunking.
  5. Examples Of Fixed Size Chunking.

(1) What Is Fixed Size Chunking?

(2) When To Use Fixed Size Chunking?

(3) Advantages Of Fixed Size Chunking.

(4) Disadvantages Of Fixed Size Chunking.

(5) Examples Of Fixed Size Chunking.

Example – 1: Document Embedding for Semantic Search

  • Use case: Splitting a long article into 512-token chunks for embedding in a vector database.
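
from transformers import GPT2Tokenizer

# Load the GPT-2 tokenizer (its vocabulary files are downloaded on first use)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Your very long document goes here..."
# Encode the whole document into a flat list of token IDs
tokens = tokenizer.encode(text)

# Slice the token IDs into consecutive chunks of at most 512 tokens;
# the last chunk is shorter when len(tokens) is not a multiple of 512
chunk_size = 512
chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# Convert each chunk of token IDs back to text
text_chunks = [tokenizer.decode(chunk) for chunk in chunks]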
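
To make the slicing step easier to inspect without downloading a tokenizer, the sketch below applies the same list comprehension to stand-in token IDs. The helper name `chunk_tokens` and the 1,200 fake IDs are assumptions made for illustration; in practice the list would come from `tokenizer.encode(text)`.

```python
def chunk_tokens(token_ids, chunk_size):
    """Split a list of token IDs into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]


# Stand-in for tokenizer.encode(text): 1,200 fake token IDs (hypothetical data).
token_ids = list(range(1200))

chunks = chunk_tokens(token_ids, 512)
chunk_lengths = [len(c) for c in chunks]
print(chunk_lengths)  # [512, 512, 176]
```

The final chunk holds only the 176 leftover tokens, so downstream code should not assume every chunk has exactly `chunk_size` entries.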

Example – 2: Preprocessing PDF or Book for LLM Input

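  • Example: A 20,000-character textbook is split into 1,000-character chunks (20 chunks in total) before being passed to GPT for summarization.

# `text` holds the full extracted document as a single string
chunk_size = 1000
# Slice the string into consecutive pieces of at most 1,000 characters
chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]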
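
The arithmetic in this example can be checked with a small self-contained sketch. The function name `chunk_text` and the placeholder string of 20,000 repeated characters are assumptions standing in for real extracted PDF or book text.

```python
import math


def chunk_text(text, chunk_size):
    """Split a string into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Placeholder for a 20,000-character document (hypothetical content).
text = "a" * 20_000

chunks = chunk_text(text, 1000)
expected_count = math.ceil(len(text) / 1000)

print(len(chunks))               # 20
print("".join(chunks) == text)   # True: no characters are lost or duplicated
```

Because the cut points depend only on character position, a chunk boundary can fall in the middle of a word or sentence.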

Example – 3: Chunking a Transcript

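  • Use case: A meeting transcript is split into fixed-size segments of 30 lines each for summarization.

# `transcript` holds the full meeting transcript as a single string
lines = transcript.split("\n")
chunk_size = 30
# Each chunk is a list of up to 30 consecutive lines
chunks = [lines[i:i+chunk_size] for i in range(0, len(lines), chunk_size)]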
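
Since a summarizer usually expects text rather than lists of lines, one possible follow-up is to join each group of lines back into a string. The sketch below assumes a hypothetical helper `chunk_lines` and a made-up 65-line transcript.

```python
def chunk_lines(transcript, lines_per_chunk):
    """Split a transcript into text segments of at most lines_per_chunk lines."""
    lines = transcript.splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]


# Made-up transcript with 65 lines alternating between two speakers.
transcript = "\n".join(f"Speaker {n % 2 + 1}: line {n}" for n in range(65))

segments = chunk_lines(transcript, 30)
lines_per_segment = [s.count("\n") + 1 for s in segments]
print(lines_per_segment)  # [30, 30, 5]
```

Each entry in `segments` can then be passed to the summarization step on its own.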
