Functions for splitting strings
splitByChar
Splits a string into substrings separated by a specified character. Uses a constant string separator
which consists of exactly one character.
Returns an array of selected substrings. Empty substrings may be selected if the separator occurs at the beginning or end of the string, or if there are multiple consecutive separators.
Syntax
Arguments
separator
— The separator must be a single-byte character. String.s
— The string to split. String.max_substrings
— An optionalInt64
defaulting to 0. Ifmax_substrings
> 0, the returned array will contain at mostmax_substrings
substrings, otherwise the function will return as many substrings as possible.
Returned value(s)
Empty substrings may be selected when:
- A separator occurs at the beginning or end of the string;
- There are multiple consecutive separators;
- The original string
s
is empty.
The behavior of parameter max_substrings
changed starting with ClickHouse v22.11. In versions older than that, max_substrings > 0
meant that max_substring
-many splits were performed and that the remainder of the string was returned as the final element of the list.
For example,
- in v22.10:
SELECT splitByChar('=', 'a=b=c=d', 2);
returned['a','b','c=d']
- in v22.11:
SELECT splitByChar('=', 'a=b=c=d', 2);
returned['a','b']
A behavior similar to ClickHouse pre-v22.11 can be achieved by setting
splitby_max_substrings_includes_remaining_string
SELECT splitByChar('=', 'a=b=c=d', 2) SETTINGS splitby_max_substrings_includes_remaining_string = 1 -- ['a', 'b=c=d']
Example
Result:
splitByString
Splits a string into substrings separated by a string. It uses a constant string separator
of multiple characters as the separator. If the string separator
is empty, it will split the string s
into an array of single characters.
Syntax
Arguments
separator
— The separator. String.s
— The string to split. String.max_substrings
— An optionalInt64
defaulting to 0. Whenmax_substrings
> 0, the returned substrings will be no more thanmax_substrings
, otherwise the function will return as many substrings as possible.
Returned value(s)
Empty substrings may be selected when:
- A non-empty separator occurs at the beginning or end of the string;
- There are multiple consecutive non-empty separators;
- The original string
s
is empty while the separator is not empty.
Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings
> 0.
Example
Result:
Result:
splitByRegexp
Splits a string into substrings separated by a regular expression. It uses a regular expression string regexp
as the separator. If the regexp
is empty, it will split the string s
into an array of single characters. If no match is found for this regular expression, the string s
won't be split.
Syntax
Arguments
-
regexp
— Regular expression. Constant. String or FixedString. -
s
— The string to split. String. -
max_substrings
— An optionalInt64
defaulting to 0. Whenmax_substrings
> 0, the returned substrings will be no more thanmax_substrings
, otherwise the function will return as many substrings as possible. Returned value(s) -
An array of selected substrings. Array(String). Empty substrings may be selected when:
-
A non-empty regular expression match occurs at the beginning or end of the string;
-
There are multiple consecutive non-empty regular expression matches;
-
The original string
s
is empty while the regular expression is not empty.
Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings
> 0.
Example
Result:
Result:
splitByWhitespace
Splits a string into substrings separated by whitespace characters. Returns an array of selected substrings.
Syntax
Arguments
-
s
— The string to split. String. -
max_substrings
— An optionalInt64
defaulting to 0. Whenmax_substrings
> 0, the returned substrings will be no more thanmax_substrings
, otherwise the function will return as many substrings as possible. Returned value(s)
Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings
> 0.
Example
Result:
splitByNonAlpha
Splits a string into substrings separated by whitespace and punctuation characters. Returns an array of selected substrings.
Syntax
Arguments
-
s
— The string to split. String. -
max_substrings
— An optionalInt64
defaulting to 0. Whenmax_substrings
> 0, the returned substrings will be no more thanmax_substrings
, otherwise the function will return as many substrings as possible. Returned value(s)
Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings
> 0.
Example
arrayStringConcat
Concatenates string representations of values listed in the array with the separator. separator
is an optional parameter: a constant string, set to an empty string by default.
Returns the string.
Syntax
Example
Result:
alphaTokens
Selects substrings of consecutive bytes from the ranges a-z and A-Z.Returns an array of substrings.
Syntax
Alias: splitByAlpha
Arguments
s
— The string to split. String.max_substrings
— An optionalInt64
defaulting to 0. Whenmax_substrings
> 0, the returned substrings will be no more thanmax_substrings
, otherwise the function will return as many substrings as possible.
Returned value(s)
Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings
> 0.
Example
extractAllGroups
Extracts all groups from non-overlapping substrings matched by a regular expression.
Syntax
Arguments
text
— String or FixedString.regexp
— Regular expression. Constant. String or FixedString.
Returned values
- If the function finds at least one matching group, it returns
Array(Array(String))
column, clustered by group_id (1 to N, where N is number of capturing groups inregexp
). If there is no matching group, it returns an empty array. Array.
Example
Result:
ngrams
Splits a UTF-8 string into n-grams of ngramsize
symbols.
Syntax
Arguments
string
— String. String or FixedString.ngramsize
— The size of an n-gram. UInt.
Returned values
Example
Result:
tokens
Splits a string into tokens using the given tokenizer. The default tokenizer uses non-alphanumeric ASCII characters as separators.
Arguments
value
— The input string. String or FixedString.tokenizer
— The tokenizer to use. Valid arguments aredefault
,ngram
,split
, andno_op
. Optional, if not set explicitly, defaults todefault
. const Stringngrams
— Only relevant if argumenttokenizer
isngram
: An optional parameter which defines the length of the ngrams. If not set explicitly, defaults to3
. UInt8.separators
— Only relevant if argumenttokenizer
issplit
: An optional parameter which defines the separator strings. If not set explicitly, defaults to[' ']
. Array(String).
In case of the split
tokenizer: if the tokens do not form a prefix code, you likely want that the matching prefers longer separators first.
To do so, pass the separators in order of descending length.
For example, with separators = ['%21', '%']
string %21abc
would be tokenized as ['abc']
, whereas separators = ['%', '%21']
would tokenize to ['21ac']
(which is likely not what you wanted).
Returned value
- The resulting array of tokens from input string. Array.
Example
Using the default settings:
Result:
Using the ngram tokenizer with ngram length 3:
Result:
alphaTokens
Introduced in: v1.1
Selects substrings of consecutive bytes from the ranges a-z
and A-Z
and returns an array of the selected substrings.
Syntax
Arguments
s
— The string to split.String
max_substrings
— Optional. Whenmax_substrings > 0
, the number of returned substrings will be no more thanmax_substrings
, otherwise the function will return as many substrings as possible.Int64
Returned value
Returns an array of selected substrings of s
. Array(String)
Examples
Usage example
arrayStringConcat
Introduced in: v1.1
Concatenates string representations of values listed in the array with the provided separator, which is an optional parameter set to an empty string by default.
Syntax
Arguments
arr
— The array to concatenate.Array(T)
separator
— Optional. Separator string. By default an empty string.const String
Returned value
Returns the concatenated string. String
Examples
Usage example
extractAllGroupsHorizontal
Introduced in: v20.5
Matches all groups of a string using the provided regular expression and returns an array of arrays, where each array contains all captures from the same capturing group, organized by group number.
Syntax
Arguments
s
— Input string to extract from.String
orFixedString
regexp
— Regular expression to match by.const String
orconst FixedString
Returned value
Returns an array of arrays, where each inner array contains all captures from one capturing group across all matches. The first inner array contains all captures from group 1, the second from group 2, etc. If no matches are found, returns an empty array. Array(Array(String))
Examples
Usage example
extractAllGroupsVertical
Introduced in: v20.5
Matches all groups of a string using a regular expression and returns an array of arrays, where each array includes matching fragments from every group, grouped in order of appearance in the input string.
Syntax
Arguments
s
— Input string to extract from.String
orFixedString
regexp
— Regular expression to match by.const String
orconst FixedString
Returned value
Returns an array of arrays, where each inner array contains the captured groups from one match. Each match produces an array with elements corresponding to the capturing groups in the regular expression (group 1, group 2, etc.). If no matches are found, returns an empty array. Array(Array(String))
Examples
Usage example
extractGroups
Introduced in: v20.5
Extracts all groups from non-overlapping substrings matched by a regular expression.
Syntax
Arguments
s
— Input string to extract from.String
orFixedString
regexp
— Regular expression. Constant.const String
orconst FixedString
Returned value
If the function finds at least one matching group, it returns Array(Array(String)) column, clustered by group_id (1
to N
, where N
is number of capturing groups in regexp). If there is no matching group, it returns an empty array. Array(Array(String))
Examples
Usage example
ngrams
Introduced in: v21.11
Splits a UTF-8 string into n-grams of ngramsize
symbols.
Syntax
Arguments
s
— Input string.String
orFixedString
ngram_size
— The size of an n-gram.const UInt8/16/32/64
Returned value
Returns an array with n-grams. Array(String)
Examples
Usage example
splitByChar
Introduced in: v1.1
Splits a string separated by a specified constant string separator
of exactly one character into an array of substrings.
Empty substrings may be selected if the separator occurs at the beginning or end of the string, or if there are multiple consecutive separators.
Setting splitby_max_substrings_includes_remaining_string
(default: 0
) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0
.
Empty substrings may be selected when:
- A separator occurs at the beginning or end of the string
- There are multiple consecutive separators
- The original string
s
is empty
Syntax
Arguments
separator
— The separator must be a single-byte character.String
s
— The string to split.String
max_substrings
— Optional. Ifmax_substrings > 0
, the returned array will contain at mostmax_substrings
substrings, otherwise the function will return as many substrings as possible. The default value is0
.Int64
Returned value
Returns an array of selected substrings. Array(String)
Examples
Usage example
splitByNonAlpha
Introduced in: v21.9
Splits a string separated by whitespace and punctuation characters into an array of substrings.
Setting splitby_max_substrings_includes_remaining_string
(default: 0
) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0
.
Syntax
Arguments
s
— The string to split.String
max_substrings
— Optional. Whenmax_substrings > 0
, the returned substrings will be no more thanmax_substrings
, otherwise the function will return as many substrings as possible. Default value:0
.Int64
Returned value
Returns an array of selected substrings of s
. Array(String)
Examples
Usage example
splitByRegexp
Introduced in: v21.6
Splits a string which is separated by the provided regular expression into an array of substrings. If the provided regular expression is empty, it will split the string into an array of single characters. If no match is found for the regular expression, the string won't be split.
Empty substrings may be selected when:
- a non-empty regular expression match occurs at the beginning or end of the string
- there are multiple consecutive non-empty regular expression matches
- the original string string is empty while the regular expression is not empty.
Setting splitby_max_substrings_includes_remaining_string
(default: 0
) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0
.
Syntax
Arguments
regexp
— Regular expression. Constant.String
orFixedString
s
— The string to split.String
max_substrings
— Optional. Whenmax_substrings > 0
, the returned substrings will be no more thanmax_substrings
, otherwise the function will return as many substrings as possible. Default value:0
.Int64
Returned value
Returns an array of the selected substrings of s
. Array(String)
Examples
Usage example
Empty regexp
splitByString
Introduced in: v1.1
Splits a string with a constant separator
consisting of multiple characters into an array of substrings.
If the string separator
is empty, it will split the string s
into an array of single characters.
Empty substrings may be selected when:
- A non-empty separator occurs at the beginning or end of the string
- There are multiple consecutive non-empty separators
- The original string
s
is empty while the separator is not empty
Setting splitby_max_substrings_includes_remaining_string
(default: 0
) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0
.
Syntax
Arguments
separator
— The separator.String
s
— The string to split.String
max_substrings
— Optional. Whenmax_substrings > 0
, the returned substrings will be no more thanmax_substrings
, otherwise the function will return as many substrings as possible. Default value:0
.Int64
Returned value
Returns an array of selected substrings of s
Array(String)
Examples
Usage example
Empty separator
splitByWhitespace
Introduced in: v21.9
Splits a string which is separated by whitespace characters into an array of substrings.
Setting splitby_max_substrings_includes_remaining_string
(default: 0
) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0
.
Syntax
Arguments
s
— The string to split.String
max_substrings
— Optional. Whenmax_substrings > 0
, the returned substrings will be no more thanmax_substrings
, otherwise the function will return as many substrings as possible. Default value:0
.Int64
Returned value
Returns an array of the selected substrings of s
. Array(String)
Examples
Usage example
tokens
Introduced in: v21.11
Splits a string into tokens using the given tokenizer. The default tokenizer uses non-alphanumeric ASCII characters as separators.
In case of the split
tokenizer, if the tokens do not form a prefix code, you likely want that the matching prefers longer separators first.
To do so, pass the separators in order of descending length.
For example, with separators = ['%21', '%']
string %21abc
would be tokenized as ['abc']
, whereas separators = ['%', '%21']
would tokenize to ['21ac']
(which is likely not what you wanted).
Syntax
Arguments
value
— The input string.String
orFixedString
tokenizer
— The tokenizer to use. Valid arguments aredefault
,ngram
,split
, andno_op
. Optional, if not set explicitly, defaults todefault
.const String
ngrams
— Only relevant if argumenttokenizer
isngram
: An optional parameter which defines the length of the ngrams. If not set explicitly, defaults to3
.const UInt8
separators
— Only relevant if argumenttokenizer
issplit
: An optional parameter which defines the separator strings. If not set explicitly, defaults to[' ']
.const Array(String)
Returned value
Returns the resulting array of tokens from input string. Array
Examples
Default tokenizer
Ngram tokenizer