GenAI – Semantic Chunking

Table Of Contents:

  1. What Is Semantic Chunking?
  2. When To Use Semantic Chunking?
  3. How Semantic Chunking Works
  4. Tools & Libraries For Semantic Chunking
  5. Advantages Of Semantic Chunking
  6. Disadvantages Of Semantic Chunking
  7. Techniques Available For Semantic Chunking
  8. Examples Of Semantic Chunking

(1) What Is Semantic Chunking?

(2) When To Use Semantic Chunking?

(3) How Semantic Chunking Works

(4) Tools & Libraries For Semantic Chunking

(5) Advantages Of Semantic Chunking

(6) Disadvantages Of Semantic Chunking

(7) Techniques Available For Semantic Chunking

(8) Examples Of Semantic Chunking

Example 1: Using LangChain’s RecursiveCharacterTextSplitter
  • This splits text at natural boundaries such as paragraphs and sentences, trying each separator in priority order.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """Large Language Models like GPT need chunking when processing long documents. 
Splitting content semantically improves understanding. 
For example, splitting at paragraphs makes the context more coherent."""

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,                # maximum chunk size in characters (not tokens)
    chunk_overlap=20,
    separators=["\n\n", "\n", ".", "!", "?", " ", ""]
)

chunks = text_splitter.split_text(text)
print(chunks)
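  • As a quick sanity check (not part of the original example), the loop below prints each chunk with its character length, which makes the effect of chunk_size and chunk_overlap visible:

# Inspect the resulting chunks and their sizes
for i, chunk in enumerate(chunks):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk!r}")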
Example 2: Semantic Similarity-based Chunking Using Sentence Transformers
  • This example detects topic shifts using cosine similarity between sentence embeddings.
  • If the cosine similarity between consecutive sentences exceeds the threshold (0.7 here), they are grouped into the same chunk; otherwise a new chunk is started.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

text = """Machine learning is used to build intelligent systems.
Supervised learning is a type of machine learning.
Natural language processing is a popular ML domain.
Transformers are widely used in NLP.
Stock markets use time-series forecasting.
Cryptography is used for security."""

sentences = text.split('\n')  # one sentence per line in this toy example

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

# Set similarity threshold to define topic change (0.7 is moderate)
threshold = 0.7
chunks = []
chunk = [sentences[0]]

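# Walk the sentences in order; a similarity drop below the threshold
# marks a topic shift and starts a new chunk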
for i in range(1, len(sentences)):
    similarity = cosine_similarity([embeddings[i]], [embeddings[i-1]])[0][0]
    if similarity > threshold:
        chunk.append(sentences[i])
    else:
        chunks.append(" ".join(chunk))
        chunk = [sentences[i]]

chunks.append(" ".join(chunk))
print(chunks)
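  • The same embedding-similarity idea is packaged in LangChain's experimental SemanticChunker. The sketch below is illustrative and assumes the langchain_experimental and langchain_community packages are installed (it reuses the all-MiniLM-L6-v2 model from above):

from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings

# Split wherever the embedding distance between adjacent sentences
# crosses a percentile-based breakpoint
splitter = SemanticChunker(
    HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"),
    breakpoint_threshold_type="percentile",
)
for doc in splitter.create_documents([text]):
    print(doc.page_content)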
Example 3: Using spaCy for Sentence-based Chunking
  • spaCy can be used to split a document into sentences semantically.
import spacy

nlp = spacy.load("en_core_web_sm")
text = """GPT-4 is a powerful language model. It is used in chatbots, summarization, and translation. 
Traditional models required more training data. Now, few-shot learning is sufficient."""

doc = nlp(text)
sentences = [sent.text for sent in doc.sents]
print(sentences)
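  • Sentence splitting alone often yields chunks that are too small for retrieval, so a common follow-up step is to pack consecutive sentences into a size budget. A minimal sketch (the 200-character budget is an arbitrary value chosen for illustration):

# Pack consecutive spaCy sentences into chunks of at most ~200 characters
max_chars = 200
chunks, current = [], ""
for sent in sentences:
    if current and len(current) + len(sent) + 1 > max_chars:
        chunks.append(current)
        current = sent
    else:
        current = f"{current} {sent}".strip()
if current:
    chunks.append(current)
print(chunks)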
Example 4: Using Unstructured.io for Document-aware Semantic Chunking
  • Unstructured.io parses a document into structure-aware elements (titles, paragraphs, tables), which can be used directly as semantic chunks.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename="sample.pdf")

# Extract semantic chunks (e.g., paragraphs, tables)
for element in elements:
    print(element.text)
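
  • unstructured also ships chunking helpers that group the parsed elements by section; a minimal sketch using chunk_by_title (available in recent releases of the library):

from unstructured.chunking.title import chunk_by_title

# Group elements under their nearest preceding title into section-level chunks
chunks = chunk_by_title(elements)
for chunk in chunks:
    print(chunk.text)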
