I'm trying to write a generic function that operates on either string | []byte. However, I'm unable to do so because the implementation needs to call either utf8.DecodeRune or utf8.DecodeRuneInString, and I can't express that choice as a simple expression without using a type switch.
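For illustration, the type-switch workaround mentioned above might look roughly like the following. This is only a sketch: decodeRune is a hypothetical helper name, not part of unicode/utf8, and it uses the exact constraint from the proposal (no ~), so named types are not covered.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeRune is a hypothetical helper (not in the standard library).
// It dispatches on the dynamic type of p because there is no single
// utf8 function that accepts both string and []byte.
func decodeRune[Bytes []byte | string](p Bytes) (rune, int) {
	switch v := any(p).(type) {
	case []byte:
		return utf8.DecodeRune(v)
	case string:
		return utf8.DecodeRuneInString(v)
	}
	// Not reachable for the constraint above, but the compiler
	// requires a terminating statement.
	panic("unreachable")
}

func main() {
	r, size := decodeRune("héllo")
	fmt.Printf("%c %d\n", r, size)
	r, size = decodeRune([]byte("éa"))
	fmt.Printf("%c %d\n", r, size)
}
```

Every generic helper built this way has to repeat the switch, which is the boilerplate the proposed generic functions would remove.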
I propose we add generic versions of:
func DecodeRune[Bytes []byte | string](p Bytes) (r rune, size int)
func DecodeLastRune[Bytes []byte | string](p Bytes) (r rune, size int)
func FullRune[Bytes []byte | string](p Bytes) bool
func RuneCount[Bytes []byte | string](p Bytes) int
func Valid[Bytes []byte | string](p Bytes) bool
It is unclear what the name should be since the simpler names are already taken by the non-generic variants.
Perhaps we should have a v2 variant of utf8 that operates on either type.
Comment From: bcmills
Given the number of existing functions that operate on []byte, and the fact that functions that accept a []byte generally cannot assume they can modify it (because that can in general cause data races), I wonder if it would be more productive to have a single generic function (perhaps in the bytes package?) like:
package bytes
func Of[Bytes ~[]byte | ~string](p Bytes) []byte
It seems to me that that would handle all of the cases where a function accepts a slice-or-string and returns only data parsed from it, including all of the utf8 functions you mentioned above. (However, it would not handle cases like TrimPrefix where the return type of the function depends on the argument type.)
bytes.Of could be implemented today using reflect, although if it were in the standard library it would be permissible for the compiler to recognize and optimize it in a different way.
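As a rough sketch of what a reflect-based bytes.Of could look like, assuming Go 1.20 or later for unsafe.StringData: bytesOf below is a hypothetical stand-in, not an existing function, and it is only one possible approach. It avoids copying the underlying data, though boxing the argument for reflect may still carry some overhead. The returned slice must be treated as read-only, since for a string argument it aliases memory that Go assumes is immutable.

```go
package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

// bytesOf is a hypothetical stand-in for the proposed bytes.Of.
// The result aliases p's data and must never be written to.
func bytesOf[Bytes ~[]byte | ~string](p Bytes) []byte {
	v := reflect.ValueOf(p)
	if v.Kind() == reflect.String {
		s := v.String() // works for named string types too
		if len(s) == 0 {
			return nil
		}
		// Assumption: callers honor the read-only contract.
		return unsafe.Slice(unsafe.StringData(s), len(s))
	}
	// Value.Bytes accepts any type whose underlying type is []byte.
	return v.Bytes()
}

// name is an illustrative named string type.
type name string

func main() {
	b := bytesOf(name("héllo"))
	fmt.Println(len(b), string(b))
}
```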
Comment From: aarzilli
What's the difference between bytes.Of and a cast to []byte?
Comment From: bcmills
@aarzilli, []byte(s) for a string s makes a mutable copy. bytes.Of would not.
(Compare my unsafeslice.OfString, although I'm becoming increasingly convinced that that operation should not actually be considered “unsafe”...)
Comment From: aarzilli
What would happen if someone did try to mutate the return value of bytes.Of? The cast already doesn't make copies sometimes IIRC, wouldn't it be better to improve that?
Comment From: dsnet
@bcmills Being able to mutate a string feels pretty unsafe to me.
The motivation for this proposal comes from the fact that we can't efficiently call a non-mutating string-based function with a []byte argument without allocation (or vice versa). To some degree, this is a generalization of the type of problem in #42429. Alternatively, if the compiler allowed implicit unsafe conversions of []byte to string (or string to []byte) when calling pure (non-mutating) functions, that would obviate the need for this in many applications.
The only reason I'm trying to write a generic function that operates on both string and []byte is performance.
Comment From: DeedleFake
You can almost write a function yourself to convert to a string regardless of which type you started with, using the unreleased Go 1.20, but if you don't want to use a type switch then it still depends on implementation details that you're not supposed to rely on:
// toString returns a string with the same backing data as v. It is
// not safe to hold the returned string if the original was a []byte
// and might be modified.
func toString[T ~string | ~[]byte](v T) string {
	// This assumes that the first word of both the string and slice
	// header structs is a pointer to the data. Unfortunately, there's
	// no way to just get the data from either in an opaque way because
	// indexes into strings aren't addressable, for good reason.
	return unsafe.String(*(**byte)(unsafe.Pointer(&v)), len(v))
}
Comment From: dsnet
I just filed #57072 as a compiler optimization to somewhat obviate the need for this.
Comment From: gopherbot
Change https://go.dev/cl/469556 mentions this issue: encoding/json: unify encodeState.string and encodeState.stringBytes
Comment From: rsc
Sounds like this is on hold for better compiler optimizations before we can even consider whether this is a good API.
Comment From: dsnet
#20881 is another compiler optimization that would address the need for this.
If that were fixed, then we could always do:
utf8.DecodeRune([]byte(in))
regardless of whether in was already a []byte or a string.
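A minimal sketch of that pattern inside a generic function follows. Go's conversion rules already permit []byte(in) when in is a type parameter constrained to ~[]byte | ~string, so this compiles today. Whether the conversion copies a string argument depends on the compiler optimizations discussed here. decodeFirst is a hypothetical name used only for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeFirst is a hypothetical helper. The conversion is valid for
// every type in the constraint's type set; for string arguments it
// may allocate and copy unless the compiler can elide the copy.
func decodeFirst[Bytes ~[]byte | ~string](in Bytes) (rune, int) {
	return utf8.DecodeRune([]byte(in))
}

func main() {
	r, size := decodeFirst("世界")
	fmt.Printf("%c %d\n", r, size)
	r, size = decodeFirst([]byte("go"))
	fmt.Printf("%c %d\n", r, size)
}
```

Unlike a type switch on any(in), this form also accepts named types whose underlying type is string or []byte.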
Comment From: rsc
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group
Comment From: rsc
Re utf8.DecodeRune([]byte(in)), sure but it would be even nicer to write utf8.DecodeRune(in) and not worry about whether the conversion allocates.
Comment From: rsc
Placed on hold. — rsc for the proposal review group
Comment From: dsnet
Given that "math/rand/v2" set the precedent for v2 packages, one could now imagine a "unicode/utf8/v2" package that has a similar API, but with generic versions:
package utf8 // unicode/utf8/v2
const RuneError, RuneSelf, MaxRune, UTFMax = ...
func RuneLen(rune) int
func RuneStart(byte) bool
func ValidRune(rune) bool
func AppendRune([]byte, rune) []byte
func EncodeRune([]byte, rune) int
func DecodeRune[Bytes ~[]byte | ~string](b Bytes) (rune, int)
func DecodeLastRune[Bytes ~[]byte | ~string](b Bytes) (rune, int)
func FullRune[Bytes ~[]byte | ~string](b Bytes) bool
func RuneCount[Bytes ~[]byte | ~string](b Bytes) int
The first 4 constants are identical to today.
The first 5 functions are identical to today.
The last 4 are generic versions of the ones that exist today and are almost always a drop-in replacement.
The XxxString equivalents of the latter 4 have been removed.
Overall, the functionality of the package has been stable and there have been relatively few proposals or changes to utf8. The latest are:
* Added AppendRune in Go 1.18
* Added ValidRune in Go 1.1
Most of the open issues about utf8 are related to performance, which I believe is an orthogonal issue.
Comment From: thepudds
In https://github.com/golang/go/issues/56948#issuecomment-1471000614 above, @dsnet observed:
https://github.com/golang/go/issues/20881 is also another compiler optimization that would address the need for this. If that was fixed, then we could always do:
utf8.DecodeRune([]byte(in))
FWIW, #20881 and #2205 are now closed. (#2205 introduced zero-copy string->[]byte conversions when the compiler can prove it is safe to do so.)
It sounds like this issue might be more about API, including https://github.com/golang/go/issues/56948#issuecomment-1505670973, but wanted to at least note here that the other issues had been closed. (Sorry, mostly a quick drive-by comment.)