Proposal Details
Background
The current implementations of the base32 and base64 codecs are not ideal for cryptographic use cases. The linked document describes practical timing side-channels against general-purpose RFC 4648 codecs when they are used to encode or decode cryptographic secrets. You can verify that the current implementation uses a table look-up in both directions, which introduces the risk of cache-timing attacks.
Specialized implementations of these codecs exist. @Sc00bz wrote a constant-time implementation of base64 in C, which I used as a basis for both base32 and base64 in PHP.
Proposed Change
New functions in encoding/base32 and encoding/base64:
EncodeToStringConstTime as an alternative to EncodeToString
DecodeStringConstTime as an alternative to DecodeString
These functions could then be used in other packages that handle cryptographic secrets, and auditors could be absolutely sure that timing leaks are not present.
(I'm happy to provide a patch for this, if accepted.)
Comment From: mateusz834
CC @golang/security
Comment From: tob-scott-a
Just in case there's any confusion: I do not think this qualifies as a security vulnerability (or I would have reported it as one, rather than filing an issue with the intent to follow up with a patch).
Comment From: Jorropo
👍 However, do we want new functions? It would be simpler for everyone if we made the current functions constant-time. If you could send your patch already, that would be nice; I would like to benchmark it.
Comment From: tob-scott-a
A constant-time alternative will certainly be slower than the general-purpose functions: it replaces a table look-up with many arithmetic operations.
In the context of handling cryptographic secrets, this trade-off is worth it. However, making all use-cases for a base64 codec constant-time would slow most Go programs down unnecessarily.
It is for this reason that my opening offer was to implement new functions rather than change the current behavior.
Comment From: Jorropo
A single map lookup is far from cheap. I don't want to risk guessing the performance impact; a benchmark would really help 🙂. Also, in Go we often do not mind trading a few % of performance if it means making the API easier to use.
Comment From: mateusz834
Also, in Go we often do not mind trading a few % of performance if it means making the API easier to use.
We can always do this in reverse: make the current API constant-time, but add unsafe variants (if needed) for speed. Then folks would actually use the constant-time APIs; otherwise we might see low usage of such (new) APIs.
Comment From: renthraysk
Doesn't the fact that the stdlib has configurable alphabets for base64/base32 rule out replacing them?
Comment From: tob-scott-a
@renthraysk It's not actually a total show-stopper (although it does make the patch a bit trickier to write). See #73909 for a first pass at a patch that adheres to the suggestions provided in this discussion so far.
Comment From: gopherbot
Change https://go.dev/cl/676917 mentions this issue: encoding/base64: add constant-time behavior, enabled by default
Comment From: rolandshoemaker
I don't think we'd want to turn on constant-time behavior by default for all uses of encoding/{base32,base64}. I suspect the vast majority of their usage is for entirely benign inputs. That said, we probably would want to use it by default for encoding/pem, where the contents are much more likely to be sensitive (of course, we should probably still add some way to disable this, for people who, e.g., need to parse large numbers of certificates).
I haven't thought through this too deeply, but based on my understanding of how people typically do constant-time base64, could we add a generic decoder/encoder that generates the required bit-shift offsets based on the alphabet, rather than needing to implement per-alphabet encoders and decoders? We'd then just need to add a WithConstantTime (or something) method to Encoding.
Comment From: rolandshoemaker
I'm not sure this is actually correct (threw it together very quickly), but something along the lines of this https://go.dev/play/p/4toQNnegoOh?
Comment From: tob-scott-a
I'm not sure this is actually correct (threw it together very quickly), but something along the lines of this https://go.dev/play/p/4toQNnegoOh?
The problem with that sort of check is that the RFC 4648 alphabets for base64 are not contiguous, which requires arithmetic like https://go-review.googlesource.com/c/go/+/676917/1/src/encoding/base64/mapping.go to implement without branches.
I don't think we'd want to turn on constant-time behavior by default for all uses of encoding/{base32,base64}. I suspect the vast majority of their usage is for entirely benign inputs.
I'm of the same opinion here.
In this case, these lines can be omitted and the corresponding logic moved into other places (e.g., encoding/pem) as needed.
Comment From: rolandshoemaker
The problem with that sort of check is that the RFC 4648 alphabets for base64 are not contiguous, which requires arithmetic like https://go-review.googlesource.com/c/go/+/676917/1/src/encoding/base64/mapping.go to implement without branches.
Can this arithmetic not be precomputed for the discontiguous ranges in the alphabet such that there don't need to be branches? Or is there something obvious I'm missing?
Comment From: tob-scott-a
Sorry, I misread the code. It looks like your approach will work:
Compare https://go.dev/play/p/EEz9V8fgsCC with https://go.dev/play/p/VHwr1LZpst9
EDIT: Looks like there's a bit of a bug in the logic for handling ret and valid. This fixes it: https://go.dev/play/p/vQeJ3QP2VWf
Comment From: Sc00bz
I wrote this a while ago. I think it's easier to read, and you can edit the constants for different character sets (although these decode functions have an optimization for a single character range, which is obvious to change). Also, if you want Crockford's base32 decode, or case-insensitive hex or base32 decode, there are a few modifications, like adding ch |= 'A' ^ 'a' // convert to lowercase (wait, unless EBCDIC... see footer) after any non-letter ranges. Then for Crockford's base32 decode, check for o, i, and l for the values 0, 1, and 1.
Note base64StandardDecodeCharFaster should save 4 operations on x86 vs base64StandardDecodeChar. Due to instruction latency and throughput it should be the same speed, just with slightly smaller code output.
Also, for custom character sets there will be a point where scanning the character set is faster. I don't know where that point is, and I assume it's higher than for common character sets (in ASCII). Basically, unless the character set is randomized "for security", doing ranges will leak info about the character set; but cases like that don't really matter, because they will always find a way to break their scheme.
func base64StandardDecodeChar(ch int) int {
	const LO1 = int('A') - 1
	const HI1 = int('Z') + 1
	const LO2 = int('a') - 1
	const HI2 = int('z') + 1
	const LO3 = int('0') - 1
	const HI3 = int('9') + 1
	const LO4 = int('+') - 1
	const HI4 = int('+') + 1
	const LO5 = int('/') - 1
	const HI5 = int('/') + 1
	const COUNT1 = 0
	const COUNT2 = COUNT1 + HI1 - LO1 - 1
	const COUNT3 = COUNT2 + HI2 - LO2 - 1
	const COUNT4 = COUNT3 + HI3 - LO3 - 1
	const COUNT5 = COUNT4 + HI4 - LO4 - 1
	value := -1
	// if ( LO < ch && ch < HI ) { value += COUNT + ch - LO }
	value += (((LO1 - ch) & (ch - HI1)) >> 8) & (COUNT1 + ch - LO1)
	value += (((LO2 - ch) & (ch - HI2)) >> 8) & (COUNT2 + ch - LO2)
	value += (((LO3 - ch) & (ch - HI3)) >> 8) & (COUNT3 + ch - LO3)
	value += (((LO4 - ch) & (ch - HI4)) >> 8) & (COUNT4 + 1) // ch - LO4 == 1
	value += (((LO5 - ch) & (ch - HI5)) >> 8) & (COUNT5 + 1) // ch - LO5 == 1
	return value
}
func base64StandardDecodeCharFaster(ch int) int {
	const LO1 = int('A') - 1
	const HI1 = int('Z') + 1
	const LO2 = int('a') - 1
	const HI2 = int('z') + 1
	const LO3 = int('0') - 1
	const HI3 = int('9') + 1
	const CHAR4 = int('+')
	const CHAR5 = int('/')
	const COUNT1 = 0
	const COUNT2 = COUNT1 + HI1 - LO1 - 1
	const COUNT3 = COUNT2 + HI2 - LO2 - 1
	const COUNT4 = COUNT3 + HI3 - LO3 - 1
	const COUNT5 = COUNT4 + 1
	var borrow uint
	value := -1
	// if ( LO < ch && ch < HI ) { value += ch - LO + COUNT }
	value += (((LO1 - ch) & (ch - HI1)) >> 8) & (ch - LO1 + COUNT1)
	value += (((LO2 - ch) & (ch - HI2)) >> 8) & (ch - LO2 + COUNT2)
	value += (((LO3 - ch) & (ch - HI3)) >> 8) & (ch - LO3 + COUNT3)
	// if ( ch == CHAR ) { value += COUNT + 1 }
	_, borrow = bits.Sub(uint(ch^CHAR4), 1, 0)
	value += (COUNT4 + 1) & -int(borrow)
	_, borrow = bits.Sub(uint(ch^CHAR5), 1, 0)
	value += (COUNT5 + 1) & -int(borrow)
	return value
}
func base64StandardEncodeChar(src int) byte {
	const LO1 = int('A')
	const HI1 = int('Z')
	const LO2 = int('a')
	const HI2 = int('z')
	const LO3 = int('0')
	const HI3 = int('9')
	const LO4 = int('+')
	const HI4 = int('+')
	const LO5 = int('/')
	src += LO1
	// if ( HI < src ) { src += LO_NEXT - HI - 1 }
	src += ((HI1 - src) >> 8) & (LO2 - HI1 - 1)
	src += ((HI2 - src) >> 8) & (LO3 - HI2 - 1)
	src += ((HI3 - src) >> 8) & (LO4 - HI3 - 1)
	src += ((HI4 - src) >> 8) & (LO5 - HI4 - 1)
	return byte(src)
}
This all assumes that it's in ASCII. If go can compile to IBM mainframes then strings are likely in EBCDIC. Which will break things.
Comment From: doggedOwl
However, do we want new functions? It would be simpler for everyone if we made the current functions constant-time. If you could send your patch already, that would be nice; I would like to benchmark it.
No, actually, it would not be simpler for everyone. base64 encoding is used extensively in various formats (JSON, XML, etc.) without any need to be cryptographically secure. It would be a big regression if all those cases got slowed down for no good reason.
Comment From: rolandshoemaker
Ignoring the actual implementation details for a moment, I think a reasonable proposal here is to make the following change to encoding/base{32,64}:
// WithConstantTime creates a new encoding identical to enc except that the
// Encode/Decode operations happen in constant time. This is useful when the
// inputs to these operations are sensitive, and measuring the timing of the
// operations could leak information about them. Note that Encode/Decode
// operations will still leak the length of the inputs.
func (enc Encoding) WithConstantTime() *Encoding
and the following changes to the existing encoding/pem functions:
// ...
// Encode uses constant-time base64 encoding, and will only leak the length of
// b.Bytes. To restore the old variable-time behavior, use the
// GODEBUG=vartimepem=0 environment variable.
func Encode(out io.Writer, b *Block) error
// ...
// EncodeToMemory uses constant-time base64 encoding, and will only leak the
// length of b.Bytes. To restore the old variable-time behavior, use the
// GODEBUG=vartimepem=0 environment variable.
func EncodeToMemory(b *Block) []byte
// ...
// Decode uses constant-time base64 decoding, and will only leak the length of
// p.Bytes. To restore the old variable-time behavior, use the
// GODEBUG=vartimepem=0 environment variable.
func Decode(data []byte) (p *Block, rest []byte)
and add these new functions to encoding/pem:
// VariableTimeEncode provides the same functionality as Encode, but uses
// variable-time base64 encoding. VariableTimeEncode should only be used when
// encoding non-sensitive values, or when the timing of VariableTimeEncode is
// not exposed.
func VariableTimeEncode(out io.Writer, b *Block) error
// VariableTimeEncodeToMemory provides the same functionality as EncodeToMemory,
// but uses variable-time base64 encoding. VariableTimeEncodeToMemory should
// only be used when encoding non-sensitive values, or when the timing of
// VariableTimeEncodeToMemory is not exposed.
func VariableTimeEncodeToMemory(b *Block) []byte
// VariableTimeDecode provides the same functionality as Decode, but uses
// variable-time base64 decoding. VariableTimeDecode should only be used when
// decoding non-sensitive values, or when the timing of VariableTimeDecode is
// not exposed.
func VariableTimeDecode(data []byte) (p *Block, rest []byte)
Another option would be to add a new field to Block that lets users set variable-time behavior for encoding, but we'd still need a VariableTimeDecode function. I think I prefer adding the new functions, since it provides better symmetry in the API.
Comment From: tob-scott-a
@rolandshoemaker That does seem like a better user experience than the API changes in my hasty patch.
Comment From: neild
What's the use case for variable-time encoding in encoding/pem?
encoding/base{32,64} can't be constant-time by default (performance regression, most users don't need constant-time encoding). If encoding/pem.{Encode,Decode} become constant-time by default, then we're saying that most programs either need constant-time encoding or don't care about the performance penalty. That seems plausible to me. Are there programs that use PEM, don't need constant-time encoding, and care about the (hopefully small) performance cost of constant-time base64 encoding?
Comment From: tob-scott-a
Are there programs that use PEM, don't need constant-time encoding, and care about the (hopefully small) performance cost of constant-time base64 encoding?
Consider the use-case of a Key and/or Certificate Transparency system that accepts PEM-encoded public keys (which are not secret) or certificates (which in turn only contain public information) in order to validate them before committing them to a Merkle tree.
In such a hypothetical system, slowing down the base64 decoding by default would add a performance bottleneck with no real upside for the system in question.
This is the sort of consideration I had in mind, and why I opened with not enabling it by default.
Comment From: neild
That's an example of a system that doesn't care about constant-time encoding. Is PEM encoding/decoding going to be in the critical performance path in a system like that, though? How much of a cost are we talking about here?
If we're going to double the API surface of the encoding/pem package and require all future users to spend time deciding whether they need constant-time encoding or not, then I think we should quantify what the actual performance difference is.
Comment From: rolandshoemaker
I think we need benchmarks to make reasonable decisions here about the trade-offs, but I will note that encoding/pem is currently only three functions. Doubling the API surface is perhaps not ideal, but concretely that comes down to adding three new functions, which isn't particularly horrific.
The point about requiring users to decide is valid, but I think the combination of naming and reasonable documentation makes it less scary. Additionally, since the existing Encode/Decode will become "safe", it's unlikely people will unintentionally begin using the "unsafe" permutations unless they are explicitly looking for them for performance reasons.
Of course all of this is moot if the benchmarks show that the performance degradation is relatively low, hence why we need numbers before we can make a decision.
Comment From: aclements
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — aclements for the proposal review group
Comment From: aclements
I think we need a benchmark comparison between variable-time and constant-time base64 and base32 encoding to make any decisions here. If the regression is small, then I'd favor fewer configuration knobs in the API.
Comment From: tob-scott-a
@aclements Good point. I recall @Jorropo was planning to run their own benchmarks (based on this comment). If something changes and I need to write and share a small program (and the data collected from my own machines), I'm happy to do so.
Comment From: rolandshoemaker
Benchmark based on https://go.dev/cl/676917:
% benchstat vt-base64 ct-base64
goos: darwin
goarch: arm64
pkg: encoding/base64
cpu: Apple M1 Pro
│ vt-base64 │ ct-base64 │
│ sec/op │ sec/op vs base │
EncodeToString/2-10 13.35n ± 1% 17.94n ± 1% +34.47% (p=0.000 n=10)
EncodeToString/4-10 14.59n ± 1% 22.40n ± 0% +53.58% (p=0.000 n=10)
EncodeToString/8-10 18.58n ± 1% 31.44n ± 1% +69.19% (p=0.000 n=10)
EncodeToString/64-10 71.42n ± 1% 192.20n ± 1% +169.13% (p=0.000 n=10)
EncodeToString/8192-10 5.907µ ± 1% 19.849µ ± 1% +236.02% (p=0.000 n=10)
DecodeString/2-10 19.81n ± 4% 26.03n ± 0% +31.37% (p=0.000 n=10)
DecodeString/4-10 19.28n ± 2% 32.28n ± 1% +67.43% (p=0.000 n=10)
DecodeString/8-10 22.80n ± 1% 44.97n ± 1% +97.21% (p=0.000 n=10)
DecodeString/64-10 54.78n ± 1% 231.20n ± 0% +322.05% (p=0.000 n=10)
DecodeString/8192-10 4.181µ ± 2% 25.227µ ± 1% +503.43% (p=0.000 n=10)
NewEncoding-10 145.7n ± 0% 289.3n ± 1% +98.56% (p=0.000 n=10)
geomean 75.33n 168.3n +123.44%
│ vt-base64 │ ct-base64 │
│ B/s │ B/s vs base │
EncodeToString/2-10 142.9Mi ± 1% 106.3Mi ± 1% -25.61% (p=0.000 n=10)
EncodeToString/4-10 261.6Mi ± 1% 170.3Mi ± 0% -34.89% (p=0.000 n=10)
EncodeToString/8-10 410.6Mi ± 1% 242.7Mi ± 1% -40.90% (p=0.000 n=10)
EncodeToString/64-10 854.6Mi ± 1% 317.5Mi ± 1% -62.85% (p=0.000 n=10)
EncodeToString/8192-10 1322.5Mi ± 1% 393.6Mi ± 1% -70.24% (p=0.000 n=10)
DecodeString/2-10 192.6Mi ± 4% 146.6Mi ± 0% -23.90% (p=0.000 n=10)
DecodeString/4-10 395.7Mi ± 2% 236.3Mi ± 1% -40.28% (p=0.000 n=10)
DecodeString/8-10 502.0Mi ± 1% 254.5Mi ± 1% -49.29% (p=0.000 n=10)
DecodeString/64-10 1532.1Mi ± 1% 363.0Mi ± 0% -76.31% (p=0.000 n=10)
DecodeString/8192-10 2492.0Mi ± 2% 413.0Mi ± 1% -83.43% (p=0.000 n=10)
NewEncoding-10 1675.4Mi ± 0% 843.8Mi ± 1% -49.63% (p=0.000 n=10)
geomean 608.6Mi 272.4Mi -55.25%
Does not look great. I don't think we could reasonably accept slowing down non-sensitive base64 usages this much by default.
I didn't check the perf of @Sc00bz's impl, nor did I do much investigation into whether there are easy perf optimizations still available for https://go.dev/cl/676917.
Comment From: aclements
Thanks for running those. I'm not surprised that adding an indirect function call for every byte, as https://go.dev/cl/676917 does, slows things down dramatically. I think we'd have to find a different way to do this.
Just as an experiment, what happens if you inline that logic for a particular alphabet? (It might work to just use a direct function call, but be sure to check that the compiler actually inlines it if you try that.)
Comment From: rolandshoemaker
Inlining the calls improves the performance, but not massively:
% benchstat vt-base64 ct-inlined-base64
goos: darwin
goarch: arm64
pkg: encoding/base64
cpu: Apple M1 Pro
│ vt-base64 │ ct-inlined-base64 │
│ sec/op │ sec/op vs base │
EncodeToString/2-10 13.35n ± 1% 14.72n ± 1% +10.34% (p=0.000 n=10)
EncodeToString/4-10 14.59n ± 1% 19.08n ± 3% +30.82% (p=0.000 n=10)
EncodeToString/8-10 18.58n ± 1% 26.43n ± 1% +42.25% (p=0.000 n=10)
EncodeToString/64-10 71.42n ± 1% 153.00n ± 2% +114.24% (p=0.000 n=10)
EncodeToString/8192-10 5.907µ ± 1% 16.689µ ± 4% +182.52% (p=0.000 n=10)
DecodeString/2-10 19.81n ± 4% 23.62n ± 1% +19.23% (p=0.000 n=10)
DecodeString/4-10 19.28n ± 2% 29.69n ± 1% +53.97% (p=0.000 n=10)
DecodeString/8-10 22.80n ± 1% 41.86n ± 2% +83.60% (p=0.000 n=10)
DecodeString/64-10 54.78n ± 1% 213.35n ± 0% +289.47% (p=0.000 n=10)
DecodeString/8192-10 4.181µ ± 2% 24.479µ ± 1% +485.55% (p=0.000 n=10)
NewEncoding-10 145.7n ± 0% 297.7n ± 2% +104.32% (p=0.000 n=10)
geomean 75.33n 150.0n +99.13%
│ vt-base64 │ ct-inlined-base64 │
│ B/s │ B/s vs base │
EncodeToString/2-10 142.9Mi ± 1% 129.5Mi ± 1% -9.37% (p=0.000 n=10)
EncodeToString/4-10 261.6Mi ± 1% 199.9Mi ± 3% -23.56% (p=0.000 n=10)
EncodeToString/8-10 410.6Mi ± 1% 288.7Mi ± 1% -29.70% (p=0.000 n=10)
EncodeToString/64-10 854.6Mi ± 1% 398.9Mi ± 2% -53.33% (p=0.000 n=10)
EncodeToString/8192-10 1322.5Mi ± 1% 468.2Mi ± 4% -64.60% (p=0.000 n=10)
DecodeString/2-10 192.6Mi ± 4% 161.5Mi ± 1% -16.15% (p=0.000 n=10)
DecodeString/4-10 395.7Mi ± 2% 257.0Mi ± 1% -35.06% (p=0.000 n=10)
DecodeString/8-10 502.0Mi ± 1% 273.4Mi ± 2% -45.54% (p=0.000 n=10)
DecodeString/64-10 1532.1Mi ± 1% 393.3Mi ± 0% -74.33% (p=0.000 n=10)
DecodeString/8192-10 2492.0Mi ± 2% 425.6Mi ± 1% -82.92% (p=0.000 n=10)
NewEncoding-10 1675.4Mi ± 0% 820.0Mi ± 2% -51.06% (p=0.000 n=10)
geomean 608.6Mi 305.6Mi -49.78%
Because we are operating on 8-byte chunks at a time, I do wonder if SIMD (or another multi-op packing technique) could provide a significantly more performant implementation.
Comment From: randall77
For encoding at least, you only need a 64-byte lookup table. That can all fit on a single cache line. I would think if you properly aligned the lookup table it would be constant time as far as cache timing attacks go.
Comment From: randall77
Decoding is not too hard to fit into a 64-byte table also. See CL 687396.
Comment From: rolandshoemaker
@randall77 I quite like the idea of using the fact that the LUT fits in a single cache line in theory, but in practice I'm not entirely comfortable with the concept of relying on CPU caching semantics for constant-time properties. (I'm not sure I've ever seen a case where somebody has exploited a mismatch between documented behavior and implemented behavior here, but it wouldn't entirely shock me.) If anything, the last four years or so have made me less willing to rely on CPU designers not to silently break it.
Comment From: Sc00bz
Here's base64 encode/decode, 8 characters at a time, SIMD-style within standard 64-bit integers.
TL;DR: I'm doing 8 7-bit math operations in a 64-bit integer. There is a separator bit per lane to prevent negative numbers from bleeding over; the separator bit is also used to build the 8 7-bit conditional masks.
// Decode usage:
encoded := binary.BigEndian.Uint64(in)
data := base64StandardDecodeBlock(encoded)
out[0] = byte(data >> 40)
out[1] = byte(data >> 32)
out[2] = byte(data >> 24)
out[3] = byte(data >> 16)
out[4] = byte(data >> 8)
out[5] = byte(data >> 0)
// Encode usage:
data := (uint64(in[0]) << 40) |
	(uint64(in[1]) << 32) |
	(uint64(in[2]) << 24) |
	(uint64(in[3]) << 16) |
	(uint64(in[4]) << 8) |
	(uint64(in[5]) << 0)
encoded := base64StandardEncodeBlock(data)
binary.BigEndian.PutUint64(out, encoded)
func base64StandardDecodeBlock(src uint64) uint64 {
	const SEPARATORS uint64 = 0x8080808080808080
	const LO1 int64 = int64('A')
	const HI1 int64 = int64('Z')
	const LO2 int64 = int64('a')
	const HI2 int64 = int64('z')
	const LO3 int64 = int64('0')
	const HI3 int64 = int64('9')
	const CHAR4 int64 = int64('+')
	const CHAR5 int64 = int64('/')
	const FULL_LO1 uint64 = uint64(LO1<<56 | LO1<<48 | LO1<<40 | LO1<<32 | LO1<<24 | LO1<<16 | LO1<<8 | LO1)
	const FULL_HI1 uint64 = uint64(HI1<<56|HI1<<48|HI1<<40|HI1<<32|HI1<<24|HI1<<16|HI1<<8|HI1) | SEPARATORS
	const FULL_LO2 uint64 = uint64(LO2<<56 | LO2<<48 | LO2<<40 | LO2<<32 | LO2<<24 | LO2<<16 | LO2<<8 | LO2)
	const FULL_HI2 uint64 = uint64(HI2<<56|HI2<<48|HI2<<40|HI2<<32|HI2<<24|HI2<<16|HI2<<8|HI2) | SEPARATORS
	const FULL_LO3 uint64 = uint64(LO3<<56 | LO3<<48 | LO3<<40 | LO3<<32 | LO3<<24 | LO3<<16 | LO3<<8 | LO3)
	const FULL_HI3 uint64 = uint64(HI3<<56|HI3<<48|HI3<<40|HI3<<32|HI3<<24|HI3<<16|HI3<<8|HI3) | SEPARATORS
	const FULL_CHAR4 uint64 = uint64(CHAR4<<56 | CHAR4<<48 | CHAR4<<40 | CHAR4<<32 | CHAR4<<24 | CHAR4<<16 | CHAR4<<8 | CHAR4)
	const FULL_CHAR5 uint64 = uint64(CHAR5<<56 | CHAR5<<48 | CHAR5<<40 | CHAR5<<32 | CHAR5<<24 | CHAR5<<16 | CHAR5<<8 | CHAR5)
	const COUNT1 int64 = 0
	const COUNT2 int64 = COUNT1 + HI1 - LO1 + 1
	const COUNT3 int64 = COUNT2 + HI2 - LO2 + 1
	const COUNT4 int64 = COUNT3 + HI3 - LO3 + 1
	const COUNT5 int64 = COUNT4 + 1
	const FULL_COUNT4 uint64 = uint64(COUNT4<<56 | COUNT4<<48 | COUNT4<<40 | COUNT4<<32 | COUNT4<<24 | COUNT4<<16 | COUNT4<<8 | COUNT4)
	const FULL_COUNT5 uint64 = uint64(COUNT5<<56 | COUNT5<<48 | COUNT5<<40 | COUNT5<<32 | COUNT5<<24 | COUNT5<<16 | COUNT5<<8 | COUNT5)
	const DIFF1 uint64 = uint64((LO1 - COUNT1))
	const DIFF2 uint64 = uint64((LO2 - COUNT2))
	const DIFF3 uint64 = uint64(-(LO3 - COUNT3)) // Note the sign
	const FULL_DIFF1 uint64 = DIFF1<<56 | DIFF1<<48 | DIFF1<<40 | DIFF1<<32 | DIFF1<<24 | DIFF1<<16 | DIFF1<<8 | DIFF1
	const FULL_DIFF2 uint64 = DIFF2<<56 | DIFF2<<48 | DIFF2<<40 | DIFF2<<32 | DIFF2<<24 | DIFF2<<16 | DIFF2<<8 | DIFF2
	const FULL_DIFF3 uint64 = DIFF3<<56 | DIFF3<<48 | DIFF3<<40 | DIFF3<<32 | DIFF3<<24 | DIFF3<<16 | DIFF3<<8 | DIFF3
	const FULL_ONE uint64 = 0x0101010101010101
	var value uint64
	var mask uint64
	src2 := src | SEPARATORS
	err := src & SEPARATORS
	// value += src >= LO1 && HI1 >= src ? src - DIFF1 : 0
	mask = (src2 - FULL_LO1) & (FULL_HI1 - src) & SEPARATORS
	mask = mask - (mask >> 7)
	value += (src & mask) - (FULL_DIFF1 & mask)
	err |= mask
	// value += src >= LO2 && HI2 >= src ? src - DIFF2 : 0
	mask = (src2 - FULL_LO2) & (FULL_HI2 - src) & SEPARATORS
	mask = mask - (mask >> 7)
	value += (src & mask) - (FULL_DIFF2 & mask)
	err |= mask
	// value += src >= LO3 && HI3 >= src ? src + DIFF3 : 0 // Sign of DIFF3 was flipped
	mask = (src2 - FULL_LO3) & (FULL_HI3 - src) & SEPARATORS
	mask = mask - (mask >> 7)
	value += (src & mask) + (FULL_DIFF3 & mask) // Sign of DIFF3 was flipped
	err |= mask
	// value += src == CHAR4 ? COUNT4 : 0
	mask = ((src2^FULL_CHAR4)-FULL_ONE)&SEPARATORS ^ SEPARATORS
	mask = mask - (mask >> 7)
	value += FULL_COUNT4 & mask
	err |= mask
	// value += src == CHAR5 ? COUNT5 : 0
	mask = ((src2^FULL_CHAR5)-FULL_ONE)&SEPARATORS ^ SEPARATORS
	mask = mask - (mask >> 7)
	value += FULL_COUNT5 & mask
	err |= mask
	value = ((value & 0x3f00000000000000) >> (56 - 7*6)) |
		((value & 0x003f000000000000) >> (48 - 6*6)) |
		((value & 0x00003f0000000000) >> (40 - 5*6)) |
		((value & 0x0000003f00000000) >> (32 - 4*6)) |
		((value & 0x000000003f000000) >> (24 - 3*6)) |
		((value & 0x00000000003f0000) >> (16 - 2*6)) |
		((value & 0x0000000000003f00) >> (8 - 1*6)) |
		((value & 0x000000000000003f) >> (0 - 0*6))
	// err = err == 0x7f7f7f7f7f7f7f7f ? 0 : -1
	_, err = bits.Sub64(err^0x7f7f7f7f7f7f7f7f, 1, 0)
	err = ^(0 - err)
	return value | err // returns -1 on error
}
func base64StandardEncodeBlock(src uint64) uint64 {
	const SEPARATORS uint64 = 0x8080808080808080
	const LO1 int64 = int64('A')
	const HI1 int64 = int64('Z')
	const LO2 int64 = int64('a')
	const HI2 int64 = int64('z')
	const LO3 int64 = int64('0')
	const HI3 int64 = int64('9')
	const LO4 int64 = int64('+')
	const HI4 int64 = int64('+')
	const LO5 int64 = int64('/')
	const FULL_LO1 uint64 = uint64(LO1<<56 | LO1<<48 | LO1<<40 | LO1<<32 | LO1<<24 | LO1<<16 | LO1<<8 | LO1)
	const DIFF1 uint64 = uint64((LO2 - HI1 - 1) & 0x7f)
	const DIFF2 uint64 = uint64((LO3 - HI2 - 1) & 0x7f)
	const DIFF3 uint64 = uint64(-(LO4-HI3-1) & 0x7f) // avoid overflow with 0x71
	const DIFF4 uint64 = uint64((LO5 - HI4 - 1) & 0x7f)
	const FULL_DIFF1 uint64 = DIFF1<<56 | DIFF1<<48 | DIFF1<<40 | DIFF1<<32 | DIFF1<<24 | DIFF1<<16 | DIFF1<<8 | DIFF1
	const FULL_DIFF2 uint64 = DIFF2<<56 | DIFF2<<48 | DIFF2<<40 | DIFF2<<32 | DIFF2<<24 | DIFF2<<16 | DIFF2<<8 | DIFF2
	const FULL_DIFF3 uint64 = DIFF3<<56 | DIFF3<<48 | DIFF3<<40 | DIFF3<<32 | DIFF3<<24 | DIFF3<<16 | DIFF3<<8 | DIFF3
	const FULL_DIFF4 uint64 = DIFF4<<56 | DIFF4<<48 | DIFF4<<40 | DIFF4<<32 | DIFF4<<24 | DIFF4<<16 | DIFF4<<8 | DIFF4
	// The overflow:
	//   LO1 + DIFF1 + DIFF2 - 0x0f + DIFF4 + 63 = 0xaf
	//   LO1 + DIFF1 + DIFF2 + 0x71 + DIFF4 + 63 = 0x12f (overflow)
	const COUNT1 uint64 = uint64(HI1-LO1) + 1
	const COUNT2 uint64 = uint64(HI2-LO2) + 1 + COUNT1
	const COUNT3 uint64 = uint64(HI3-LO3) + 1 + COUNT2
	const COUNT4 uint64 = uint64(HI4-LO4) + 1 + COUNT3
	const FULL_COUNT1 uint64 = COUNT1<<56 | COUNT1<<48 | COUNT1<<40 | COUNT1<<32 | COUNT1<<24 | COUNT1<<16 | COUNT1<<8 | COUNT1
	const FULL_COUNT2 uint64 = COUNT2<<56 | COUNT2<<48 | COUNT2<<40 | COUNT2<<32 | COUNT2<<24 | COUNT2<<16 | COUNT2<<8 | COUNT2
	const FULL_COUNT3 uint64 = COUNT3<<56 | COUNT3<<48 | COUNT3<<40 | COUNT3<<32 | COUNT3<<24 | COUNT3<<16 | COUNT3<<8 | COUNT3
	const FULL_COUNT4 uint64 = COUNT4<<56 | COUNT4<<48 | COUNT4<<40 | COUNT4<<32 | COUNT4<<24 | COUNT4<<16 | COUNT4<<8 | COUNT4
	src = ((src << (56 - 7*6)) & 0x3f00000000000000) |
		((src << (48 - 6*6)) & 0x003f000000000000) |
		((src << (40 - 5*6)) & 0x00003f0000000000) |
		((src << (32 - 4*6)) & 0x0000003f00000000) |
		((src << (24 - 3*6)) & 0x000000003f000000) |
		((src << (16 - 2*6)) & 0x00000000003f0000) |
		((src << (8 - 1*6)) & 0x0000000000003f00) |
		((src << (0 - 0*6)) & 0x000000000000003f)
	src2 := src | SEPARATORS
	diff := FULL_LO1
	var mask uint64
	// if ch >= COUNT { diff += DIFF }
	mask = (src2 - FULL_COUNT1) & SEPARATORS
	diff += FULL_DIFF1 & (mask - (mask >> 7))
	mask = (src2 - FULL_COUNT2) & SEPARATORS
	diff += FULL_DIFF2 & (mask - (mask >> 7))
	mask = (src2 - FULL_COUNT3) & SEPARATORS
	diff -= FULL_DIFF3 & (mask - (mask >> 7)) // overflow avoided
	mask = (src2 - FULL_COUNT4) & SEPARATORS
	diff += FULL_DIFF4 & (mask - (mask >> 7))
	return (src + diff) & 0x7f7f7f7f7f7f7f7f
}
I didn't test all the possible encodes/decodes (2**48), but I tested all 3-byte values (2**24) left-shifted by 0, 6, 12, 18, and 24.
SSE2/AVX2/NEON will be faster. SSE2, AVX2, and NEON have compares for bytes equal and signed greater-than: _mm_cmpgt_epi8/_mm256_cmpgt_epi8/vcgtq_s8 (PCMPGTB/VPCMPGTB/CMGT) and _mm_cmpeq_epi8/_mm256_cmpeq_epi8/vceqq_s8 (PCMPEQB/VPCMPEQB/CMEQ). Thus a simple conditional add is 3 SIMD instructions (compare, and, add), but adding actual SIMD seems like overkill.
I'll benchmark this tomorrow, if someone else hasn't.
The base32 version is fairly straightforward. For decode, you'll get an error on the DIFFs whose sign you need to flip. For encode, you need to manually check for overflows and flip the appropriate signs.
I'll submit a pull request assuming these are faster. Oh, I just realized I need to make a change to save time on partial blocks: the input to the encode function should be expanded before calling it, and similarly for the output of the decode function.
Comment From: Sc00bz
It's slower. I'll try a few things, but I think this will be slower without SIMD. I'm pretty sure this won't change much, but note the benchmark is testing encode and decode of zero bytes, so it's always hitting the same elements: encode[0] and decodeMap['A'].
Comment From: rolandshoemaker
If we can get a just as fast (likely faster) CT SIMD assembly implementation, then I think the question about whether to add a CT implementation alongside the VT one somewhat goes out of the window (if it's just faster, why not always use it). That said, if it's significantly more complex to write an assembly implementation that is agnostic to alphabets, I think the question still remains.
Comment From: apparentlymart
Would it be acceptable to say that constant-time is guaranteed only for the standard alphabet, while other alphabets might use a more general implementation that is not constant-time, or does that seem too subtle for a security-related feature? 🤔
(I'm asking the above question primarily from an API design standpoint. I acknowledge that it also implies still maintaining two implementations rather than only one, but that was already true for the compromise of exposing constant-time and not-constant-time as explicitly separate functions.)
Comment From: rolandshoemaker
I think regardless of the underlying implementation we'll want to retain an API such as the WithConstantTime suggested in https://github.com/golang/go/issues/73901#issuecomment-2919845169. If we decide we can only enable the CT implementation for standard alphabets, then that can return an error for non-standard alphabets.
If we can come up with a fast SIMD CT implementation that can easily be extended to arbitrary alphabets that would be nice, but I don't think it's a sticking point. Really we only want this for standard alphabets (for use in encoding/pem), and if we have to carry a slower variable-time implementation for non-standard alphabets, that's probably fine.
Comment From: aclements
FWIW, the fast AVX-512 algorithms for base64 encode and decode in https://arxiv.org/pdf/1910.05109 work with arbitrary alphabets and should be straightforward to extend to base32. However, at their core is a table lookup that only AVX-512 can pull off. I haven't thought much about how to do this with smaller vectors, but will note that SIMD by its nature is generally good at turning branchy code into branchless code.
Do people use alphabets that differ in anything other than the last two elements?
Comment From: magical
Yes, there are certain password hashing schemes that use variant alphabets: most notably bcrypt, which puts the punctuation characters ./ at the start of the alphabet instead of at the end.
./ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
https://cs.opensource.google/go/x/crypto/+/refs/tags/v0.40.0:bcrypt/base64.go;l=9
Comment From: Sc00bz
You need AVX-512's VBMI for _mm512_permutex2var_epi8 (VPERMI2B). It's also in AVX10.1, the first AVX10 version. (If you're not familiar: AVX10 is a consolidation of the AVX-512 instruction subsets, with AVX10/256 limiting register width to 256 bits, and a promise that future versions will include all instructions from previous versions.)
AMD's Zen 4 and Zen 5 have it. Intel's Cannon Lake was the first to have it; at least Ice Lake (client), Ice Lake Xeon, and Sapphire Rapids have it according to the Intel Intrinsics Guide. Ah, Wikipedia has a compatibility table.
It might be better to target SSSE3*, AVX, and AVX2's _mm_shuffle_epi8/_mm256_shuffle_epi8 (PSHUFB/VPSHUFB). It will need 2 (base32) or 4 (base64) calls to encode and 16 calls to decode, because it is limited to a 16-byte lookup table. Newer Intel CPUs have a throughput of 2 of these instructions per clock, which is nice.
NEON (ARM64) has vqtbl4q_s8 (TBL), which can do a 64-byte lookup table (across 4 16-byte registers). It's also easier than PSHUFB/VPSHUFB because anything out of bounds is set to 0, instead of "0 only if the high bit is set". For ARMv7/ARM32 there's vtbl4_u8 (VTBL), which can do a 32-byte lookup table (across 4 8-byte registers).
* Maybe not SSSE3, since it's old and Intel-only.