Code like the following is unfortunately very common:
r, n := rune(b[0]), 1
if r >= utf8.RuneSelf {
r, n = utf8.DecodeRune(b)
}
and we have several examples in the standard library itself: * fmt/format.go:347 * fmt/print.go:1101 * regexp/regexp.go:418 * regexp/regexp.go:465 * bytes/bytes.go:482 * bytes/bytes.go:560 * bytes/bytes.go:781 * bytes/bytes.go:836 * bufio/bufio.go:295 * strings/strings.go:741 * strconv/quote.go:43 * strconv/quote.go:266
All of these cases ultimate come down to the caller manually performing inlined logic for the ASCII case. We should make it such that DecodeRune
and DecodeLastRune
are inlineable for the ASCII case. Unfortunately my attempts to do so exceed the inline budget by ever so much:
// cannot inline DecodeRune: function too complex: cost 87 exceeds budget 80
func DecodeRune(p []byte) (r rune, size int) {
if len(p) > 0 && p[0] < RuneSelf {
return rune(p[0]), 1
}
return decodeRune(p)
}
// cannot inline DecodeLastRune: function too complex: cost 93 exceeds budget 80
func DecodeLastRune(p []byte) (r rune, size int) {
if len(p) > 0 && p[len(p)-1] < UTFMax {
return rune(p[len(p)-1]), 1
}
return decodeLastRune(p)
}
We should find a way to make these inlinable whether by special casing them, clever tricks to make the cost lower, by increasing the inline budget, etc.
\cc @CAFxX @josharian @mdempsky @martisch
Comment From: CAFxX
FWIW, I gave it a shot before: https://go-review.googlesource.com/c/go/+/332770
Unfortunately yes, I really can not see a way to make it work without messing with the inliner (either by changing the heuristics, or by adding special cases) or otherwise adding some kind of annotation.
Comment From: renthraysk
Seem to remember commenting on a similar issue, that couldn't do the same with binary.Uvarint().
31666
But now with for loop inlining since added to go, Uvarint() can be made to (replace range use) inline as whole without needing a separate func.
Comment From: zigo101
The inline cost of the following implementation is 83:
func DecodeRune(p []byte) (r rune, size int) {
if len(p) > 0 && p[0] < RuneSelf {
r, size = rune(p[0]), 1
} else {
r, size = decodeRune(p)
}
return
}
Comment From: zigo101
And this one is 81:
func DecodeRune(p []byte) (r rune, size int) {
if len(p) > 0 && p[0] < RuneSelf {
return rune(p[0]), 1
}
r, size = decodeRune(p)
return
}
Comment From: zigo101
Maybe the inline cost calculation should consider whether or not bound checks are eliminated.
Here, the bound checks for p[0]
are eliminated, so its cost should be some lower.
BTW, why the inline costs of function calls are so large?
Comment From: mengzhuo
The ExtraCall budget is 57 plus two arguments we had at least 64 budget, maybe we can force inline by asm code. However I've tried modified the ExtraCall budget and it appears only improve ASCII related DecodeRune and regression on slow path.
https://perf.golang.org/search?q=upload:20210922.6
name old time/op new time/op delta
RuneCountTenASCIIChars 9.27ns ± 1% 9.27ns ± 1% ~ (p=0.151 n=9+9)
RuneCountTenJapaneseChars 37.8ns ± 2% 38.2ns ± 1% +1.09% (p=0.022 n=10+8)
RuneCountInStringTenASCIIChars 9.26ns ± 0% 9.26ns ± 0% ~ (p=0.333 n=10+10)
RuneCountInStringTenJapaneseChars 39.3ns ± 1% 38.1ns ± 2% -3.07% (p=0.000 n=10+10)
EncodeASCIIRune 1.71ns ± 0% 1.71ns ± 0% ~ (all equal)
EncodeJapaneseRune 2.86ns ± 0% 2.83ns ± 0% -1.12% (p=0.000 n=10+9)
AppendASCIIRune 0.26ns ± 0% 0.26ns ± 0% -2.44% (p=0.000 n=7+10)
AppendJapaneseRune 3.09ns ± 0% 3.08ns ± 0% -0.02% (p=0.029 n=9+9)
DecodeASCIIRune 2.06ns ± 0% 0.26ns ± 0% -87.53% (p=0.000 n=8+8)
DecodeJapaneseRune 3.60ns ± 0% 3.73ns ± 0% +3.52% (p=0.000 n=10+8)
FullRune/ASCII 1.03ns ± 0% 1.18ns ± 3% +14.28% (p=0.000 n=10+10)
FullRune/Incomplete 2.18ns ± 2% 2.14ns ± 1% -2.00% (p=0.000 n=10+10)
FullRune/Japanese 1.03ns ± 0% 1.18ns ± 3% +15.03% (p=0.000 n=9+10)
Comment From: dsnet
However I've tried modified the ExtraCall budget and it appears only improve ASCII related DecodeRune and regression on slow path.
Isn't that expected? With inlining, we generally want to get the common case to be inlined (i.e., ASCII), and perform the function call for the less common case (i.e., multibyte Unicode).
Comment From: gopherbot
Change https://golang.org/cl/354769 mentions this issue: cmd/compile: zero cost of byte to rune conversion
Comment From: gopherbot
Change https://golang.org/cl/354770 mentions this issue: unicode/utf8: make DecodeRune inlined
Comment From: gopherbot
Change https://golang.org/cl/356769 mentions this issue: unicode/utf8: default ascii fast path for DecodeRune
Comment From: jub0bs
Building on @TapirLiu's latest attempt, the following implementation of DecodeRuneInString
is inlinable:
func DecodeRuneInString(s string) (r rune, size int) {
// Inlineable fast path for ASCII characters.
if s != "" && s[0] < RuneSelf {
return rune(s[0]), 1
} else {
r, size = decodeRuneInString(s)
}
return
}
$ go build -gcflags '-m=2' ./unicode/utf8 2>&1 | grep 'can inline DecodeRuneInString'
unicode/utf8/utf8.go:213:6: can inline DecodeRuneInString with cost 80 as [...]
Sadly, there's no equivalent trick for DecodeRune
. 🤷
Comment From: jub0bs
Happy days! After a lot of trial and error, I was able to get DecodeRune
under the inlining budget:
func DecodeRune(p []byte) (r rune, size int) {
for _, b := range p {
if b < RuneSelf {
return rune(b), 1
}
break
}
r, size = decodeRune(p)
return
}
$ go build -gcflags '-m=2' ./unicode/utf8 2>&1 | grep 'can inline DecodeRune with'
unicode/utf8/utf8.go:157:6: can inline DecodeRune with cost 79 as: [...]
The implementation is so weird that it likely deserves a clarifying implementation comment (perhaps linking to this issue). I'll open a first PR to make both DecodeRune{,InString}
inlineable; if/when it gets merged, I'll follow up with a PR to get rid of all instances of manual inlining of those functions in the standard library.
Comment From: gopherbot
Change https://go.dev/cl/699675 mentions this issue: unicode/utf8: make DecodeRune{,InString} inlineable
Comment From: magical
@jub0bs Looks like your latest DecodeRune solution would work for DecodeRuneInString too.
func DecodeRuneInString(s string) (r rune, size int) {
for _, b := range []byte(s) {
if b < RuneSelf {
return rune(b), 1
}
break
}
r, size = decodeRuneInString(s)
return
}
./decoderune.go:23:6: can inline DecodeRuneInString with cost 80 as: func(string) (rune, int) { for loop; r, size = rune(.autotmp_4), .autotmp_5; return }
./decoderune.go:35:21: s does not escape
./decoderune.go:23:25: s does not escape
./decoderune.go:24:27: ([]byte)(s) does not escape
./decoderune.go:24:27: zero-copy string->[]byte conversion
Comment From: jub0bs
@magical That works too. However, a simpler (less weird) implementation of DecodeRuneInString
also results in an inlining cost of 80; see https://github.com/golang/go/issues/48195#issuecomment-3224273708.
Perhaps there is an argument to keep DecodeRuneInString
's implementation in sync with DecodeRune
's, or perhaps the simplest possible implementation is preferrable. We'll see what the reviewers say.
Comment From: CAFxX
FWIW, I think the CL should also remove the manual fast paths peppered around the standard library, and ensure we don't regress there either (my old CL, that is likely stale now, did this).
Comment From: jub0bs
@CAFxX Ok, I've removed all of the manual inlining that you addressed in your abandoned CL. I've also removed two instances (in bytes/iter.go and strings/iter.go) that had since crept in.