Proposal Details
The strings package contains the function Split, which splits a string wherever the separator string occurs. Only one separator string can be specified.
There are use cases where one wants to split on any of a collection of characters. Often FieldsFunc is recommended for this. However, FieldsFunc has a bug in that it skips leading and trailing separators instead of returning empty strings for them. This behaviour cannot be fixed, just documented.
In order to make it possible to split a string on any of several characters there should be functions analogous to IndexAny, namely SplitAny and CountAny.
SplitAny would have the signature func SplitAny(s, chars string) []string, while CountAny would be func CountAny(s, chars string) int.
SplitAny splits the string on any character in chars, while CountAny returns how many times any of the characters in chars appears in the supplied string.
There could be another function SplitAnyN with the signature func SplitAnyN(s, chars string, n int) []string that limits the splitting to a maximum of n strings.
I attach the file split_any.zip, in which SplitAny and CountAny have been implemented as an example.
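The attached archive is not reproduced here. As a rough sketch of what such functions could look like on top of strings.IndexAny: the semantics of n are assumed to mirror strings.SplitN, and an empty chars is assumed to mean "no separators". Both are assumptions of this sketch, not part of the proposal text.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// CountAny returns the number of Unicode code points in s that
// also occur in chars.
func CountAny(s, chars string) int {
	n := 0
	for {
		i := strings.IndexAny(s, chars)
		if i < 0 {
			return n
		}
		n++
		// Skip the matched code point, which may be wider than one byte.
		_, size := utf8.DecodeRuneInString(s[i:])
		s = s[i+size:]
	}
}

// SplitAnyN splits s at every code point contained in chars.
// Assumed to behave like strings.SplitN with respect to n:
// n == 0 yields nil, n < 0 yields all substrings, and n > 0 yields
// at most n substrings, the last one being the unsplit remainder.
func SplitAnyN(s, chars string, n int) []string {
	if n == 0 {
		return nil
	}
	var parts []string
	for n < 0 || len(parts) < n-1 {
		i := strings.IndexAny(s, chars)
		if i < 0 {
			break
		}
		parts = append(parts, s[:i])
		_, size := utf8.DecodeRuneInString(s[i:])
		s = s[i+size:]
	}
	return append(parts, s)
}

// SplitAny is SplitAnyN without a limit.
func SplitAny(s, chars string) []string {
	return SplitAnyN(s, chars, -1)
}

func main() {
	source := ":something,to:split-"
	fmt.Printf("%q\n", SplitAny(source, ":,-.;"))     // ["" "something" "to" "split" ""]
	fmt.Println(CountAny(source, ":,-.;"))            // 4
	fmt.Printf("%q\n", SplitAnyN(source, "o,t.;", 2)) // [":s" "mething,to:split-"]
}
```

The substrings share memory with the input, so the only allocation in this sketch is the result slice itself.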
Comment From: ianlancetaylor
You mention FieldFunc in the description but I assume you mean FieldsFunc (with an "s").
Comment From: ianlancetaylor
Can you add a comment with an example or two showing the exact behavior difference? Thanks.
Comment From: xformerfhs
Yes, you are right. I meant FieldsFunc. Sorry for the typo.
Here are some examples:
source := ":something,to:split-"
parts := strings.Split(source, ":")
// parts is [ "" "something,to" "split-" ]
separators := ":,-.;"
parts = strings.SplitAny(source, separators)
// parts is [ "" "something" "to" "split" "" ]
count := strings.CountAny(source, separators)
// count is 4 (for the 4 found characters ':', ',', ':' and '-')
separators = "o,t.;"
parts = strings.SplitAny(source, separators)
// parts is [ ":s" "me" "hing" "" "" ":spli" "-" ]
parts = strings.SplitAnyN(source, separators, 2)
// parts is [ ":s" "mething,to:split-" ]
I hope this helps to clarify the proposal. I will be glad to provide any more information that is deemed necessary.
Comment From: jub0bs
@xformerfhs I'm wary of adding more Split* functions that return a slice (as opposed to an iterator) in the standard library. In my experience, such functions tend to be misused (e.g. for splitting untrusted data); see https://nvd.nist.gov/vuln/detail/CVE-2025-22868, for instance.
Comment From: xformerfhs
Hi, @jub0bs, thanks for your comment.
I see that you have reported a security vulnerability that was caused by using strings.Split without checking, limiting or cleaning what is going to be split. It was fixed by using strings.Count and only splitting after that returns the correct number of fields.
I agree that using Split and the like is dangerous when the programmer does not check the string to split. Splitting definitely has security implications.
However, a strings.SplitAny function is missing. There ought to be a way of splitting a string on multiple different characters, not only on one separator.
What are the possible alternatives?
Function | Impact |
---|---|
SplitAny | Programmers have to be warned that they ought to check whether the string to split has the correct format, count the fields with CountAny, or remove unwanted characters. This should be documented. |
SplitAnyN | This is much safer. If handled correctly, there is no vulnerability. However, setting n to a negative number will effectively turn SplitAnyN into SplitAny. |
SplitAnySeq | This is the safest form, but can make the program more cumbersome and less readable. |
I think of my use case: The user specifies two encodings, one for the input file and one for the output file, as a flag, e.g. -encodings win1252:utf8. The separator may be either a colon (:) or a comma (,). When I use SplitAnyN this would look like this:
...
if len(encodingsFlagValue) < minEncodingLen || len(encodingsFlagValue) > maxEncodingLen {
	return errors.New("invalid length of encodings")
}
encodings := strings.SplitAnyN(encodingsFlagValue, ":,", 3)
if len(encodings) > 2 {
	return errors.New("invalid number of encodings")
}
var inputEncoding string
var outputEncoding string
inputEncoding = encodings[0]
if len(encodings) == 1 {
	outputEncoding = inputEncoding
} else {
	outputEncoding = encodings[1]
}
...
This is simple and straightforward.
Now the same with an iterator:
...
var inputEncoding string
var outputEncoding string
var haveInputEncoding bool
for encoding := range strings.SplitAnySeq(encodingsFlagValue, ":,") {
	if !haveInputEncoding {
		inputEncoding = encoding
		haveInputEncoding = true
	} else {
		outputEncoding = encoding
		break
	}
}
if len(outputEncoding) == 0 {
	outputEncoding = inputEncoding
}
...
This is much less readable and understandable.
So, I think SplitAnyN is a sensible way to go, with the warning that one must not use an n that is less than 1 and that one has to check for an appropriate length.
Even SplitAny would be a way to go, with the clear warning that it may cause a security vulnerability if the source is not checked and that SplitAnyN and SplitAnySeq are better alternatives.
Comment From: as
The alternative is to normalize the separators into one separator and then call the split function.
source := ":something,to:split-"
source = strings.ReplaceAll(source, ":", ",")
source = strings.ReplaceAll(source, "-", ",")
fmt.Printf("%q\n", strings.Split(source, ","))
Comment From: xformerfhs
> The alternative is to normalize the separators into one separator and then call the split function.
While this yields the correct result, it has three disadvantages:
- It allocates memory for two additional strings. Allocations are slow.
- It copies the string twice, resulting in additional CPU overhead.
- It is cumbersome, hard to read and does not convey what is meant. Someone reading this would have to figure out why all this replacing takes place. This makes it harder to understand the meaning of the code.
Using strings.SplitAny(source, ":,-") is short, simple and understandable at first glance. No unnecessary memory allocations, no unnecessary copying.