GenAI – Fixed Size Chunking
Table Of Contents:
- What Is Fixed Size Chunking?
- When To Use Fixed Size Chunking?
- Advantages Of Fixed Size Chunking
- Disadvantages Of Fixed Size Chunking
- Examples Of Fixed Size Chunking
(1) What Is Fixed Size Chunking?
(2) When To Use Fixed Size Chunking?
(3) Advantages Of Fixed Size Chunking
(4) Disadvantages Of Fixed Size Chunking
(5) Examples Of Fixed Size Chunking
Example – 1: Document Embedding for Semantic Search
- Use case: Splitting a long article into 512-token chunks for embedding in a vector database.
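from transformers import GPT2Tokenizer

# Load the GPT-2 tokenizer (the vocabulary is downloaded on first use)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
text = "Your very long document goes here..."
# Encode the whole document into a flat list of token IDs.
# For long inputs the tokenizer may warn that the sequence exceeds the model's
# maximum length; the tokens are only sliced here, not passed to the model.
tokens = tokenizer.encode(text)
chunk_size = 512  # tokens per chunk
# Slice into consecutive, non-overlapping chunks; the last one may be shorter
chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
# Convert back to text chunks
text_chunks = [tokenizer.decode(chunk) for chunk in chunks]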
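The slicing step above does not depend on the tokenizer itself. A minimal sketch, using a plain list of integers as stand-in token IDs (the helper name `chunk_tokens` and the 1,300-token length are assumptions for illustration, not part of the original example), shows how the chunk sizes come out when the total is not a multiple of 512:

```python
def chunk_tokens(tokens, chunk_size):
    """Split a list of token IDs into consecutive, non-overlapping chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Stand-in for tokenizer.encode(text): 1,300 fake token IDs
tokens = list(range(1300))
chunks = chunk_tokens(tokens, 512)

# Two full chunks plus a shorter final chunk holding the remainder
chunk_lengths = [len(c) for c in chunks]
print(chunk_lengths)  # [512, 512, 276]
```

Only the last chunk can be shorter than `chunk_size`, and concatenating the chunks in order reproduces the original token list.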
Example – 2: Preprocessing PDF or Book for LLM Input
- Use case: A 20,000-character textbook is split into 1,000-character chunks before being fed to GPT for summarization.
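text = "Your extracted PDF or book text goes here..."
chunk_size = 1000  # characters per chunk
# Slice the string into consecutive 1,000-character pieces; the last one may be shorter
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]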
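As a quick check of the arithmetic in this example, the sketch below builds a synthetic 20,000-character string (an assumption standing in for the textbook text) and confirms that 1,000-character slicing yields 20 chunks. The 20,500-character case is a hypothetical addition that shows what happens when the length is not an exact multiple of the chunk size.

```python
import math

chunk_size = 1000

# Synthetic stand-in for the textbook: exactly 20,000 characters
text = "a" * 20_000
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
num_chunks = len(chunks)
print(num_chunks)  # 20

# Hypothetical 20,500-character input: the final chunk holds the 500-character remainder
longer_text = "a" * 20_500
longer_chunks = [
    longer_text[i:i + chunk_size] for i in range(0, len(longer_text), chunk_size)
]
print(len(longer_chunks), len(longer_chunks[-1]))  # 21 500

# In general the chunk count is ceil(len(text) / chunk_size)
expected_count = math.ceil(len(longer_text) / chunk_size)
```

Character-based slicing of this kind can cut a word or sentence in half at a chunk boundary, since it looks only at positions and not at content.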
Example – 3: Chunking a Transcript
- Use case: A meeting transcript is split into fixed-size segments of 30 lines each for summarization.
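transcript = "Your meeting transcript goes here..."
# Break the transcript into individual lines (splitlines also handles "\r\n" endings)
lines = transcript.splitlines()
chunk_size = 30  # lines per chunk
# Group the lines into consecutive 30-line segments; each chunk is a list of lines
chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]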
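Because the chunks above are lists of lines, a summarizer that expects a single string per segment needs them rejoined. A minimal sketch, assuming a synthetic 75-line transcript (the speaker labels and line count are made up for illustration):

```python
# Synthetic stand-in for a real transcript: 75 lines
transcript = "\n".join(f"Speaker {i % 3 + 1}: line {i + 1}" for i in range(75))

lines = transcript.splitlines()
chunk_size = 30
chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

# Rejoin each group of lines into one text segment
segments = ["\n".join(chunk) for chunk in chunks]

# Two full 30-line segments plus a final 15-line segment
segment_sizes = [len(chunk) for chunk in chunks]
print(segment_sizes)  # [30, 30, 15]
```

Each entry in `segments` can then be passed to a summarization step on its own; joining the segments with newlines gives back the original transcript.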

