Module unicode_segmentation


Splitting strings on grapheme cluster, word, and sentence boundaries.


unicode-segmentation provides iterators for splitting text according to Unicode Standard Annex #29 rules. It handles the complexities of measuring Unicode text for display, where a single user-perceived character may be composed of multiple codepoints.

The crate’s primary trait is UnicodeSegmentation, which provides methods for segmenting strings by grapheme clusters, words, and sentences. Grapheme clusters represent what users think of as single characters, which is essential for correctly counting characters, truncating strings, or implementing text editors.

Note that while Unicode segmentation is a crucial algorithm, most software does not need it. It is mostly used by GUI toolkits for laying out text, or by software that needs to understand the human concepts of “words” and “sentences”.

For modern background on Unicode units, see “Let’s Stop Ascribing Meaning to Code Points” by Manish Goregaokar.

§Examples

Count user-perceived characters correctly:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let text = "Hello 👨‍👩‍👧‍👦!";

    // Wrong: counting bytes
    assert_eq!(text.len(), 32);

    // Wrong: counting codepoints
    assert_eq!(text.chars().count(), 14);

    // Correct: counting grapheme clusters
    assert_eq!(text.graphemes(true).count(), 8);
}

Split text into words:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let text = "Hello, world! How are you?";

    let words: Vec<&str> = text.unicode_words().collect();
    assert_eq!(words, vec!["Hello", "world", "How", "are", "you"]);
}

Structs§

GraphemeCursor
Cursor-based segmenter for grapheme clusters.
GraphemeIndices
External iterator for grapheme clusters and byte offsets.
Graphemes
External iterator for a string’s grapheme clusters.
USentenceBoundIndices
External iterator for sentence boundaries and byte offsets.
USentenceBounds
External iterator for a string’s sentence boundaries.
UWordBoundIndices
External iterator for word boundaries and byte offsets.
UWordBounds
External iterator for a string’s word boundaries.
UnicodeSentences
An iterator over the substrings of a string which, after splitting the string on sentence boundaries, contain any characters with the Alphabetic property, or with General_Category=Number.
UnicodeWordIndices
An iterator over the substrings of a string which, after splitting the string on word boundaries, contain any characters with the Alphabetic property, or with General_Category=Number. This iterator also provides the byte offsets for each substring.
UnicodeWords
An iterator over the substrings of a string which, after splitting the string on word boundaries, contain any characters with the Alphabetic property, or with General_Category=Number.

Enums§

GraphemeIncomplete
An error return indicating that not enough content was available in the provided chunk to satisfy the query, and that more content must be provided.

Constants§

UNICODE_VERSION
The version of Unicode that this version of unicode-segmentation is based on.

Traits§

UnicodeSegmentation
Methods for segmenting strings according to Unicode Standard Annex #29.