GenAI – Semantic Chunking
Table Of Contents:
- What Is Semantic Chunking?
- When To Use Semantic Chunking?
- How Semantic Chunking Works
- Tools & Libraries For Semantic Chunking
- Advantages Of Semantic Chunking
- Disadvantages Of Semantic Chunking
- Techniques Available For Semantic Chunking
- Examples Of Semantic Chunking
(1) What Is Semantic Chunking?
(2) When To Use Semantic Chunking?
(3) How Semantic Chunking Works
(4) Tools & Libraries For Semantic Chunking
(5) Advantages Of Semantic Chunking
(6) Disadvantages Of Semantic Chunking
(7) Techniques Available For Semantic Chunking
(8) Examples Of Semantic Chunking
Example 1: Using LangChain’s RecursiveCharacterTextSplitter
- This splits text recursively on structural boundaries (paragraph breaks, newlines, sentence-ending punctuation), which usually align with semantic units.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """Large Language Models like GPT need chunking when processing long documents.
Splitting content semantically improves understanding.
For example, splitting at paragraphs makes the context more coherent."""

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,    # maximum chunk size, measured in characters by default
    chunk_overlap=20,  # characters shared between consecutive chunks
    separators=["\n\n", "\n", ".", "!", "?", " ", ""]  # tried in order
)
chunks = text_splitter.split_text(text)
print(chunks)
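- The separators are tried in order, from most to least semantic: paragraph breaks first, then lines, then sentence-ending punctuation, with spaces and single characters as a last resort, so each chunk stays as coherent as the chunk_size budget allows. If downstream code expects LangChain Document objects rather than raw strings, the same splitter also exposes create_documents; a minimal sketch, reusing the text defined above:

# Wrap each chunk in a LangChain Document (useful for vector-store ingestion)
docs = text_splitter.create_documents([text])
for doc in docs:
    print(doc.page_content)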
Example 2: Semantic Similarity-based Chunking Using Sentence Transformers
- This example detects topic shifts using cosine similarity between sentence embeddings.
- If the cosine similarity between consecutive sentences is above 0.7, they are grouped into the same chunk; otherwise a new chunk is started.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

text = """Machine learning is used to build intelligent systems.
Supervised learning is a type of machine learning.
Natural language processing is a popular ML domain.
Transformers are widely used in NLP.
Stock markets use time-series forecasting.
Cryptography is used for security."""

sentences = text.split('\n')  # one sentence per line in this example
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

# Set similarity threshold to define topic change (0.7 is moderate)
threshold = 0.7
chunks = []
chunk = [sentences[0]]
for i in range(1, len(sentences)):
    similarity = cosine_similarity([embeddings[i]], [embeddings[i - 1]])[0][0]
    if similarity > threshold:
        chunk.append(sentences[i])  # same topic: extend the current chunk
    else:
        chunks.append(" ".join(chunk))  # topic shift: close the chunk
        chunk = [sentences[i]]
chunks.append(" ".join(chunk))
print(chunks)
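- A fixed 0.7 threshold rarely fits every corpus. One common refinement (the idea behind LangChain's SemanticChunker) derives the breakpoint from the distribution of consecutive-sentence similarities instead. A minimal sketch, reusing the sentences and embeddings from above, with the 25th percentile as an assumed cutoff:

import numpy as np

# Similarity between each consecutive pair of sentences
sims = [cosine_similarity([embeddings[i]], [embeddings[i - 1]])[0][0]
        for i in range(1, len(sentences))]

# Break wherever similarity falls below the 25th percentile (assumed cutoff)
cutoff = np.percentile(sims, 25)
chunks, chunk = [], [sentences[0]]
for i, sim in enumerate(sims, start=1):
    if sim < cutoff:
        chunks.append(" ".join(chunk))
        chunk = []
    chunk.append(sentences[i])
chunks.append(" ".join(chunk))
print(chunks)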
Example 3: Using spaCy for Sentence-based Chunking
- spaCy can be used to split a document into sentences semantically.
import spacy
nlp = spacy.load("en_core_web_sm")
text = """GPT-4 is a powerful language model. It is used in chatbots, summarization, and translation.
Traditional models required more training data. Now, few-shot learning is sufficient."""
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]
print(sentences)
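- Sentence splitting alone does not give chunks; a common follow-up is to pack consecutive sentences into chunks under a size budget. A minimal sketch, reusing sentences from above (the 200-character budget is an arbitrary assumption):

max_chars = 200  # assumed budget per chunk
chunks, current = [], ""
for sent in sentences:
    # Start a new chunk once adding the next sentence would exceed the budget
    if current and len(current) + len(sent) + 1 > max_chars:
        chunks.append(current)
        current = sent
    else:
        current = (current + " " + sent).strip()
if current:
    chunks.append(current)
print(chunks)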
Example 4: Using Unstructured.io for Document-aware Semantic Chunking
- Unstructured.io parses a document into semantically meaningful elements (titles, paragraphs, tables) rather than raw text spans.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename="sample.pdf")

# Extract semantic chunks (e.g., paragraphs, tables)
for element in elements:
    print(element.text)
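- unstructured also ships chunking helpers that group parsed elements into section-level chunks; a minimal sketch, assuming your installed version exposes chunk_by_title (API details vary between releases):

from unstructured.chunking.title import chunk_by_title

# Group elements under their nearest title, capping chunk size (assumed setting)
chunks = chunk_by_title(elements, max_characters=1000)
for chunk in chunks:
    print(chunk.text)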