GenAI – Fixed Size Chunking

(1) What Is Fixed Size Chunking?

(2) When To Use Fixed Size Chunking?

(3) Advantages Of Fixed Size Chunking

(4) Disadvantages Of Fixed Size Chunking

(5) Examples Of Fixed Size Chunking

Example – 1: Document Embedding for Semantic Search

Use case: Splitting a long article into 512-token chunks for embedding in a vector database.

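from transformers import GPT2Tokenizer

# Load the GPT-2 tokenizer (downloads the vocabulary on first use)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Your very long document goes here..."
# Encode the whole document into a flat list of token IDs.
# For long inputs the tokenizer may warn that the sequence exceeds the
# model's maximum length; that is harmless here because we only chunk the IDs.
tokens = tokenizer.encode(text)

# Slice the token list into consecutive chunks of at most 512 tokens;
# the last chunk holds whatever remains and may be shorter.
chunk_size = 512
chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# Convert back to text chunks
text_chunks = [tokenizer.decode(chunk) for chunk in chunks]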
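The slicing pattern above does not depend on the tokenizer, so it can be wrapped in a small reusable helper. The following is a minimal sketch, not part of the original example: `chunk_tokens` is a hypothetical name, and a whitespace split stands in for `tokenizer.encode` so the snippet runs without downloading a model.

```python
def chunk_tokens(tokens, chunk_size):
    """Split a list of tokens into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Assumption for illustration: whitespace-separated words play the role of tokens.
sample_tokens = "one two three four five six seven".split()
sample_chunks = chunk_tokens(sample_tokens, 3)

print(sample_chunks)
# [['one', 'two', 'three'], ['four', 'five', 'six'], ['seven']]
```

Note that the final chunk is shorter than the others whenever the token count is not a multiple of the chunk size.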

Example – 2: Preprocessing PDF or Book for LLM Input

Use case: A 20,000-character textbook is split into 1000-character chunks before being passed to GPT for summarization.

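text = "Your very long document goes here..."  # placeholder, as in Example 1

chunk_size = 1000  # characters per chunk
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]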
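As a quick check of the arithmetic, the number of chunks is the text length divided by the chunk size, rounded up. The sketch below uses a synthetic 20,000-character string as a stand-in for the textbook (an assumption for illustration), and `chunk_text` is a hypothetical helper name.

```python
import math


def chunk_text(text, chunk_size):
    """Split a string into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Synthetic placeholder: 20,000 characters, evenly divisible by 1000.
book = "x" * 20_000
chunks = chunk_text(book, 1000)
print(len(chunks), math.ceil(len(book) / 1000))  # 20 20

# When the length is not a multiple of the chunk size, the last chunk is shorter.
longer = "x" * 20_500
tail_chunks = chunk_text(longer, 1000)
print(len(tail_chunks), len(tail_chunks[-1]))  # 21 500
```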

Example – 3: Chunking a Transcript

Use case: A meeting transcript is split into fixed-size segments of 30 lines each for summarization.

transcript = "Your meeting transcript goes here..."  # placeholder
lines = transcript.split("\n")
chunk_size = 30  # lines per chunk
chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
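Each chunk produced this way is a list of lines, so it usually needs to be joined back into a single string before it is sent for summarization. A minimal sketch, assuming a synthetic 75-line transcript generated purely for illustration (the speaker labels and the `segments` name are hypothetical):

```python
# Synthetic placeholder transcript: 75 lines alternating between two speakers.
transcript = "\n".join(f"Speaker {i % 2 + 1}: line {i}" for i in range(1, 76))

lines = transcript.split("\n")
chunk_size = 30  # lines per segment

# Group the lines into fixed-size segments and join each one back into text.
segments = [
    "\n".join(lines[i:i + chunk_size])
    for i in range(0, len(lines), chunk_size)
]

# Report how many lines each segment contains.
line_counts = [segment.count("\n") + 1 for segment in segments]
print(line_counts)  # [30, 30, 15]
```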
