how to split chunks without cutting the sentences? #11
-
How can I split the chunks at the periods that end sentences? I don't want half a sentence in each chunk. Sometimes a sentence gets split so that its first half ends up in one chunk and its second half in the next. Basically, I want each chunk to stay under N tokens and to end at the last sentence boundary that fits within that limit.
Replies: 1 comment 2 replies
-
I see - in that case you may want to use the `CharacterSplitter` with the separator set to the end-of-sentence character, like this:

```swift
let splitter = CharacterSplitter(withSeparator: ".")
let (chunks, _) = splitter.split(text: "First sentence. Second Sentence", chunkSize: 5)
```

In this case, `chunkSize` refers to whole sentences, so it's much smaller than it would be for tokens. If you need more custom functionality than that, you can try implementing your own splitter; it just needs to conform to this protocol:

```swift
public protocol TextSplitterProtocol {
    /// Splits the input text into a tuple of chunks and optionally token ids.
    ///
    /// - Parameters:
    ///   - text: The input text to be chunked.
    ///   - chunkSize: The number of tokens per chunk.
    ///   - overlapSize: The number of overlapping tokens between consecutive chunks.
    /// - Returns: A tuple containing an array of chunked text and an optional array of token ids.
    func split(text: String, chunkSize: Int, overlapSize: Int) -> ([String], [[String]]?)
}
```
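For the behaviour asked about in the question (chunks under N tokens that never cut a sentence), a custom splitter could look roughly like the sketch below. This is only an illustration under stated assumptions, not part of the library: `SentencePackingSplitter` is a hypothetical name, tokens are approximated by whitespace-separated words rather than a real tokenizer, sentences are detected naively at every `.`, and `overlapSize` is ignored. The protocol is redeclared locally only so the snippet compiles on its own.

```swift
import Foundation

// Local copy of the protocol quoted above, so this sketch is self-contained.
// In a real project you would import the library instead.
protocol TextSplitterProtocol {
    func split(text: String, chunkSize: Int, overlapSize: Int) -> ([String], [[String]]?)
}

/// Hypothetical splitter that packs whole sentences into chunks of at most
/// `chunkSize` tokens.
struct SentencePackingSplitter: TextSplitterProtocol {
    // Assumption: whitespace-separated words stand in for tokens.
    private func tokenCount(_ s: String) -> Int {
        s.split(whereSeparator: { $0.isWhitespace }).count
    }

    func split(text: String, chunkSize: Int, overlapSize: Int) -> ([String], [[String]]?) {
        // Naive sentence detection: a sentence ends at every ".".
        // This also splits abbreviations and decimals such as "e.g." or "3.14".
        var sentences: [String] = []
        var current = ""
        for ch in text {
            current.append(ch)
            if ch == "." {
                sentences.append(current.trimmingCharacters(in: .whitespacesAndNewlines))
                current = ""
            }
        }
        let tail = current.trimmingCharacters(in: .whitespacesAndNewlines)
        if !tail.isEmpty { sentences.append(tail) }

        // Greedy packing: start a new chunk when the next sentence would
        // push the current one past chunkSize. A single sentence longer than
        // chunkSize is kept whole as its own (oversized) chunk.
        var chunks: [String] = []
        var chunk: [String] = []
        var count = 0
        for sentence in sentences {
            let n = tokenCount(sentence)
            if !chunk.isEmpty && count + n > chunkSize {
                chunks.append(chunk.joined(separator: " "))
                chunk = []
                count = 0
            }
            chunk.append(sentence)
            count += n
        }
        if !chunk.isEmpty { chunks.append(chunk.joined(separator: " ")) }

        // overlapSize is ignored in this sketch, and no token ids are returned.
        return (chunks, nil)
    }
}

let demoText = "One two three. Four five. Six seven eight nine."
let (packed, _) = SentencePackingSplitter().split(text: demoText, chunkSize: 5, overlapSize: 0)
print(packed)  // ["One two three. Four five.", "Six seven eight nine."]
```

The first two sentences total exactly 5 words, so they share a chunk; the third would exceed the limit and starts a new one. For real token limits you would swap `tokenCount` for the tokenizer your embedding model uses.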