Tag: GenAI – Custom RAG Tokenizer.


  • GenAI – Custom RAG Tokenizer.

    GenAI – Custom RAG Tokenizer. Table Of Content: Custom RAG Coadding.  (1) Python Code import logging import copy import datrie import math import os import re import string import sys from hanziconv import HanziConv from nltk import word_tokenize from nltk.stem import PorterStemmer, WordNetLemmatizer from api.utils.file_utils import get_project_base_directory class RagTokenizer: def key_(self, line): return str(line.lower().encode("utf-8"))[2:-1] def rkey_(self, line): return str(("DD" + (line[::-1].lower())).encode("utf-8"))[2:-1] def loadDict_(self, fnm): logging.info(f"[HUQIE]:Build trie from {fnm}") try: of = open(fnm, "r", encoding='utf-8') while True: line = of.readline() if not line: break line = re.sub(r"[rn]+", "", line) line = re.split(r"[ t]", line) k = self.key_(line[0]) F = int(math.log(float(line[1]) /

    Read More