How to split text by tokens
This guide assumes familiarity with the following concepts:
Language models have a token limit that you should not exceed. When you split your text into chunks, it is therefore a good idea to measure chunk size in tokens. Many tokenizers exist, so count tokens with the same tokenizer that the language model uses.
js-tiktoken
js-tiktoken is a JavaScript version of the BPE tokenizer created by OpenAI.
We can use js-tiktoken to estimate the number of tokens used. It is tuned to OpenAI models.
- How the text is split: by character passed in.
- How the chunk size is measured: by the js-tiktoken tokenizer.
You can use the TokenTextSplitter like this:
import { TokenTextSplitter } from "@langchain/textsplitters";
import * as fs from "node:fs";

// Load an example document (readFileSync is synchronous, so no await is needed)
const rawData = fs.readFileSync(
  "../../../../examples/state_of_the_union.txt"
);
const stateOfTheUnion = rawData.toString();

// Split into chunks of at most 10 tokens, with no overlap between chunks
const textSplitter = new TokenTextSplitter({
  chunkSize: 10,
  chunkOverlap: 0,
});

const texts = await textSplitter.splitText(stateOfTheUnion);

// Print the first chunk
console.log(texts[0]);
Madam Speaker, Madam Vice President, our
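To make the roles of chunkSize and chunkOverlap more concrete, here is a minimal sketch of token-window chunking. It is an illustration under stated assumptions, not necessarily how TokenTextSplitter is implemented: toyTokenize and chunkByTokens are hypothetical helpers, and the whitespace-based tokenizer is only a stand-in for a real BPE tokenizer such as js-tiktoken, so the token boundaries differ from what the library would produce.

```typescript
// Hypothetical stand-in tokenizer: each token is a word plus any whitespace
// that precedes it. A real BPE tokenizer produces different tokens.
function toyTokenize(text: string): string[] {
  return text.match(/\s*\S+/g) ?? [];
}

// Hypothetical helper: slide a window of `chunkSize` tokens over the input,
// advancing by (chunkSize - chunkOverlap) tokens each step.
function chunkByTokens(
  tokens: string[],
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(""));
    // Stop once the current window has reached the end of the input.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}

const tokens = toyTokenize("Madam Speaker, Madam Vice President, our First Lady");
const chunks = chunkByTokens(tokens, 3, 1);
console.log(chunks);
// [
//   "Madam Speaker, Madam",
//   " Madam Vice President,",
//   " President, our First",
//   " First Lady"
// ]
```

Each chunk holds at most three toy tokens, and with an overlap of one the last token of a chunk is repeated as the first token of the next.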
Note: Some written languages (e.g. Chinese and Japanese) have characters which encode to 2 or more tokens. Using the TokenTextSplitter directly can split the tokens for a character between two chunks, causing malformed Unicode characters.
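The effect described in the note can be reproduced without any tokenizer. The sketch below is an analogy rather than the splitter's actual code path: raw UTF-8 bytes stand in for byte-level BPE tokens, and the cut position is chosen by hand to land inside a character. It uses only the standard TextEncoder and TextDecoder globals.

```typescript
const encoder = new TextEncoder();
// The default (non-fatal) decoder replaces invalid byte sequences with U+FFFD.
const decoder = new TextDecoder("utf-8");

const original = "日本語"; // 3 characters, each 3 bytes in UTF-8
const bytes = encoder.encode(original);

// Assumption for illustration: a chunk boundary falls after byte 4,
// which is in the middle of the second character.
const left = decoder.decode(bytes.slice(0, 4));
const right = decoder.decode(bytes.slice(4));

console.log(bytes.length); // 9
console.log(left); // "日" followed by the replacement character U+FFFD
console.log(right); // replacement characters followed by "語"
console.log(left + right === original); // false
```

Decoding each piece on its own loses the character that straddles the boundary, which is why chunks cut this way can contain malformed characters.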
Next steps
You've now learned a method for splitting text based on token count.
Next, check out the full tutorial on retrieval-augmented generation.