Splitting strings on grapheme cluster, word, and sentence boundaries.
unicode-segmentation provides iterators for splitting text according to the rules of Unicode Standard Annex #29.
It handles the complexity of segmenting Unicode text,
where a single user-perceived character may be composed of multiple code points.
The crate’s primary trait is UnicodeSegmentation,
which provides methods for segmenting strings by grapheme clusters, words, and sentences.
Grapheme clusters represent what users think of as single characters,
which is essential for correctly counting characters,
truncating strings, or implementing text editors.
Note that while Unicode segmentation is a crucial algorithm, it is rarely the right tool for most software — it is mostly used by GUI toolkits for laying out text, or by software that needs to understand the human concepts of “words” and “sentences”.
For modern background on Unicode units, see Let’s Stop Ascribing Meaning to Code Points by Manish Goregaokar.
§Examples
Count user-perceived characters correctly:
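use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // The family emoji is four code points joined by U+200D (zero-width joiner),
    // written as escapes here so the joiners stay visible.
    let text = "Hello 👨\u{200D}👩\u{200D}👧\u{200D}👦!";
    // Wrong: counting bytes
    assert_eq!(text.len(), 32);
    // Wrong: counting code points
    assert_eq!(text.chars().count(), 14);
    // Correct: counting extended grapheme clusters
    assert_eq!(text.graphemes(true).count(), 8);
}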
Split text into words:
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let text = "Hello, world! How are you?";
    let words: Vec<&str> = text.unicode_words().collect();
    assert_eq!(words, vec!["Hello", "world", "How", "are", "you"]);
}
Structs§
- GraphemeCursor - Cursor-based segmenter for grapheme clusters.
- GraphemeIndices - External iterator for grapheme clusters and byte offsets.
- Graphemes - External iterator for a string’s grapheme clusters.
- USentenceBoundIndices - External iterator for sentence boundaries and byte offsets.
- USentenceBounds - External iterator for a string’s sentence boundaries.
- UWordBoundIndices - External iterator for word boundaries and byte offsets.
- UWordBounds - External iterator for a string’s word boundaries.
- UnicodeSentences - An iterator over the substrings of a string which, after splitting the string on sentence boundaries, contain any characters with the Alphabetic property, or with General_Category=Number.
- UnicodeWordIndices - An iterator over the substrings of a string which, after splitting the string on word boundaries, contain any characters with the Alphabetic property, or with General_Category=Number. This iterator also provides the byte offsets for each substring.
- UnicodeWords - An iterator over the substrings of a string which, after splitting the string on word boundaries, contain any characters with the Alphabetic property, or with General_Category=Number.
Enums§
- GraphemeIncomplete - An error return indicating that not enough content was available in the provided chunk to satisfy the query, and that more content must be provided.
Constants§
- UNICODE_VERSION - The version of Unicode that this version of unicode-segmentation is based on.
Traits§
- UnicodeSegmentation - Methods for segmenting strings according to Unicode Standard Annex #29.