Proposal Details

Background

The current implementations of the base32 and base64 codecs are not ideal for cryptographic use cases. The linked document describes practical timing side-channels against general-purpose RFC 4648 codecs when they are used to encode or decode cryptographic secrets. You can verify that the current implementation uses a table look-up in both directions, which introduces the risk of cache-timing attacks.
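
To illustrate the concern, a minimal sketch (not the actual stdlib code) contrasting the two approaches for a single character:

// Variable time: which cache line gets touched depends on the secret
// input byte, which is what a cache-timing attacker measures.
var decodeMap [256]int8 // -1 for invalid bytes, 0..63 otherwise

func decodeVariableTime(ch byte) int8 {
	return decodeMap[ch]
}

// Constant time, for the 'A'..'Z' range only (a full codec repeats this
// per range): mask is -1 exactly when 'A' <= ch <= 'Z', so there is no
// secret-dependent branch or memory access. Assumes 0 <= ch <= 255.
func decodeUpperConstTime(ch int) int {
	v := -1
	mask := ((int('A') - 1 - ch) & (ch - int('Z') - 1)) >> 8
	v += mask & (ch - int('A') + 1)
	return v
}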

Specialized implementations of these codecs exist. @Sc00bz wrote a constant-time implementation of base64 in C, which I used as a basis for both base32 and base64 in PHP.

Proposed Change

New functions in encoding/base32 and encoding/base64:

  • EncodeToStringConstTime as an alternative to EncodeToString
  • DecodeStringConstTime as an alternative to DecodeString

These functions could then be used in other packages that handle cryptographic secrets, and auditors could be sure that no timing leaks are present.
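
Hypothetical usage, if accepted as proposed (these names do not exist in the standard library today):

// Decode a secret without secret-dependent table lookups or branches:
seed, err := base64.StdEncoding.DecodeStringConstTime(encodedSeed)
if err != nil {
	// handle malformed input
}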

(I'm happy to provide a patch for this, if accepted.)

Comment From: mateusz834

CC @golang/security

Comment From: tob-scott-a

Just in case there's any confusion: I do not think this qualifies as a security vulnerability (or I would have reported it as one, rather than filing an issue with the intent to follow up with a patch).

Comment From: Jorropo

👍 However, do we want new functions? It would be simpler for everyone if we made the current functions constant-time. If you could send your patch already, that would be nice; I would like to benchmark it.

Comment From: tob-scott-a

A constant-time alternative will certainly be slower than the general-purpose functions: it replaces a map look-up with many operations.

In the context of handling cryptographic secrets, this trade-off is worth it. However, making all use-cases for a base64 codec constant-time would slow most Go programs down unnecessarily.

It is for this reason that my opening offer was to implement new functions rather than change the current behavior.

Comment From: Jorropo

A single map lookup is far from cheap. I don't want to risk guessing the performance impact; a benchmark would really help 🙂. Also, in Go we often don't mind trading a few percent of performance if it makes the API easier to use.

Comment From: mateusz834

Also, in Go we often don't mind trading a few percent of performance if it makes the API easier to use.

We could always do this in reverse: make the current API constant-time, but add unsafe variants (if needed) for speed. Then folks would actually use the constant-time APIs; otherwise we might see low usage of such (new) APIs.

Comment From: renthraysk

Doesn't the fact that the stdlib has configurable alphabets for base64/32 rule out replacing them?

Comment From: tob-scott-a

@renthraysk It's not actually a total show-stopper (although it does make the patch a bit trickier to write). See #73909 for a first pass at a patch that adheres to the suggestions provided in this discussion so far.

Comment From: gopherbot

Change https://go.dev/cl/676917 mentions this issue: encoding/base64: add constant-time behavior, enabled by default

Comment From: rolandshoemaker

I don't think we'd want to turn on constant-time behavior by default for all uses of encoding/{base32,base64}. I suspect the vast majority of their usage is for entirely benign inputs. That said, we probably would want to use it by default for encoding/pem, where the contents are much more likely to be sensitive (of course we should probably still add some way to disable this, for people who, e.g., need to parse large numbers of certificates).

I haven't thought through this too deeply, but based on my understanding of how people typically do constant-time base64, could we add a generic decoder/encoder that generates the required bit-shift offsets from the alphabet, rather than needing to implement per-alphabet encoders and decoders? We'd then just need to add a WithConstantTime (or something) method to Encoding.
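
As a sketch of that idea (hypothetical names; the mask arithmetic matches what is discussed later in this thread): the constructor could precompute the contiguous runs of the alphabet once, and the per-character decode would iterate over those runs, whose count depends only on the public alphabet:

type ctRange struct {
	lo, hi int // inclusive character bounds of a contiguous run
	base   int // decoded value of lo
}

// buildRanges derives the contiguous runs of an arbitrary alphabet once,
// at Encoding construction time.
func buildRanges(alphabet string) []ctRange {
	var rs []ctRange
	for i := 0; i < len(alphabet); i++ {
		ch := int(alphabet[i])
		if n := len(rs); n > 0 && rs[n-1].hi+1 == ch {
			rs[n-1].hi = ch // extend the current run
			continue
		}
		rs = append(rs, ctRange{lo: ch, hi: ch, base: i})
	}
	return rs
}

// decodeChar is branchless in ch; the loop count depends only on the
// (public) alphabet, never on the input byte. Assumes 0 <= ch <= 255.
func decodeChar(rs []ctRange, ch int) int {
	v := -1
	for _, r := range rs {
		mask := ((r.lo - 1 - ch) & (ch - r.hi - 1)) >> 8
		v += mask & (ch - r.lo + 1 + r.base)
	}
	return v
}

For the standard alphabet this yields five runs ('A'-'Z', 'a'-'z', '0'-'9', '+', '/'), so the cost is bounded by the number of runs rather than by a hand-written per-alphabet decoder.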

Comment From: rolandshoemaker

I'm not sure this is actually correct (threw it together very quickly), but something along the lines of this https://go.dev/play/p/4toQNnegoOh?

Comment From: tob-scott-a

I'm not sure this is actually correct (threw it together very quickly), but something along the lines of this https://go.dev/play/p/4toQNnegoOh?

The problem with that sort of check is that the RFC 4648 alphabets for base64 are not contiguous, which requires arithmetic like https://go-review.googlesource.com/c/go/+/676917/1/src/encoding/base64/mapping.go to implement without branches.

I don't think we'd want to turn on constant-time behavior by default for all uses of encoding/{base32,base64}. I suspect the vast majority of their usage is for entirely benign inputs.

I'm of the same opinion here.

In this case, these lines can be omitted and the corresponding logic moved into other places (e.g., encoding/pem) as needed.

Comment From: rolandshoemaker

The problem with that sort of check is that the RFC 4648 alphabets for base64 are not contiguous, which requires arithmetic like https://go-review.googlesource.com/c/go/+/676917/1/src/encoding/base64/mapping.go to implement without branches.

Can this arithmetic not be precomputed for the discontiguous ranges in the alphabet such that there don't need to be branches? Or is there something obvious I'm missing?

Comment From: tob-scott-a

Sorry, I misread the code. It looks like your approach will work:

Compare https://go.dev/play/p/EEz9V8fgsCC with https://go.dev/play/p/VHwr1LZpst9

EDIT: Looks like there's a bit of a bug with the logic for handling ret and valid. This fixes it: https://go.dev/play/p/vQeJ3QP2VWf

Comment From: Sc00bz

I wrote this a while ago. I think it's easier to read, and you can edit the constants for different character sets. (Although these decode functions have an optimization for a single character range, which is obvious to change.) Oh, also: if you want Crockford's base32 decode, or case-insensitive hex or base32 decode, there are a few modifications, like adding ch |= 'A' ^ 'a' // convert to lowercase (wait, unless EBCDIC... see footer) after any non-letter ranges. Then for Crockford's base32 decode, check for o, i, and l for the values 0, 1, and 1. (A sketch of these tweaks appears after the code below.)

Note that base64StandardDecodeCharFaster should save 4 operations on x86 vs base64StandardDecodeChar. Due to instruction latency and throughput it should be the same speed, just with slightly smaller code output.

Also, for custom character sets there will be a point where scanning the character set is faster, but I don't know where that point is, and I assume it's higher than for common character sets (in ASCII). Basically, unless the character set is randomized "for security", doing ranges will leak info about the character set; but cases like that don't really matter, because such schemes will always find a way to break themselves.

func base64StandardDecodeChar(ch int) int {
    const LO1 = int('A') - 1
    const HI1 = int('Z') + 1
    const LO2 = int('a') - 1
    const HI2 = int('z') + 1
    const LO3 = int('0') - 1
    const HI3 = int('9') + 1
    const LO4 = int('+') - 1
    const HI4 = int('+') + 1
    const LO5 = int('/') - 1
    const HI5 = int('/') + 1
    const COUNT1 = 0
    const COUNT2 = COUNT1 + HI1 - LO1 - 1
    const COUNT3 = COUNT2 + HI2 - LO2 - 1
    const COUNT4 = COUNT3 + HI3 - LO3 - 1
    const COUNT5 = COUNT4 + HI4 - LO4 - 1

    value := -1

    //    if (  LO  < ch  && ch < HI        )   { value += COUNT + ch - LO }
    value += (((LO1 - ch) & (ch - HI1)) >> 8) & (COUNT1 + ch - LO1)
    value += (((LO2 - ch) & (ch - HI2)) >> 8) & (COUNT2 + ch - LO2)
    value += (((LO3 - ch) & (ch - HI3)) >> 8) & (COUNT3 + ch - LO3)
    value += (((LO4 - ch) & (ch - HI4)) >> 8) & (COUNT4 + 1) // ch - LO4 == 1
    value += (((LO5 - ch) & (ch - HI5)) >> 8) & (COUNT5 + 1) // ch - LO5 == 1

    return value
}

func base64StandardDecodeCharFaster(ch int) int { // uses bits.Sub from math/bits
    const LO1 = int('A') - 1
    const HI1 = int('Z') + 1
    const LO2 = int('a') - 1
    const HI2 = int('z') + 1
    const LO3 = int('0') - 1
    const HI3 = int('9') + 1
    const CHAR4 = int('+')
    const CHAR5 = int('/')
    const COUNT1 = 0
    const COUNT2 = COUNT1 + HI1 - LO1 - 1
    const COUNT3 = COUNT2 + HI2 - LO2 - 1
    const COUNT4 = COUNT3 + HI3 - LO3 - 1
    const COUNT5 = COUNT4 + 1

    var borrow uint

    value := -1

    //    if (  LO  < ch  && ch < HI        )   { value += ch - LO + COUNT }
    value += (((LO1 - ch) & (ch - HI1)) >> 8) & (ch - LO1 + COUNT1)
    value += (((LO2 - ch) & (ch - HI2)) >> 8) & (ch - LO2 + COUNT2)
    value += (((LO3 - ch) & (ch - HI3)) >> 8) & (ch - LO3 + COUNT3)

    // if (                   ch == CHAR      ) { value +=  COUNT + 1 }
    _, borrow = bits.Sub(uint(ch ^ CHAR4), 1, 0); value += (COUNT4 + 1) & -int(borrow)
    _, borrow = bits.Sub(uint(ch ^ CHAR5), 1, 0); value += (COUNT5 + 1) & -int(borrow)

    return value
}

func base64StandardEncodeChar(src int) byte {
    const LO1 = int('A')
    const HI1 = int('Z')
    const LO2 = int('a')
    const HI2 = int('z')
    const LO3 = int('0')
    const HI3 = int('9')
    const LO4 = int('+')
    const HI4 = int('+')
    const LO5 = int('/')

    src += LO1

    //  if ( HI  < src      )   { src += LO_NEXT - HI - 1 }
    src += ((HI1 - src) >> 8) & (LO2 - HI1 - 1)
    src += ((HI2 - src) >> 8) & (LO3 - HI2 - 1)
    src += ((HI3 - src) >> 8) & (LO4 - HI3 - 1)
    src += ((HI4 - src) >> 8) & (LO5 - HI4 - 1)

    return byte(src)
}

This all assumes ASCII. If Go can compile to IBM mainframes, then strings are likely in EBCDIC, which will break things.
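
Here is a sketch of those Crockford-style tweaks (assuming ASCII, as above; an illustration, not vetted code):

// Crockford base32 alphabet 0123456789ABCDEFGHJKMNPQRSTVWXYZ, decoded
// case-insensitively, with the aliases o -> 0 and i, l -> 1.
// Returns -1 for invalid input such as 'u'. Assumes 0 <= ch <= 255.
func crockfordBase32DecodeChar(ch int) int {
	// rng contributes ch - lo + 1 + base when lo <= ch <= hi, else 0.
	rng := func(ch, lo, hi, base int) int {
		mask := ((lo - 1 - ch) & (ch - hi - 1)) >> 8
		return mask & (ch - lo + 1 + base)
	}
	value := -1
	value += rng(ch, '0', '9', 0)
	ch |= 'A' ^ 'a' // convert to lowercase after any non-letter ranges
	value += rng(ch, 'a', 'h', 10)
	value += rng(ch, 'j', 'k', 18)
	value += rng(ch, 'm', 'n', 20)
	value += rng(ch, 'p', 't', 22)
	value += rng(ch, 'v', 'z', 27)
	value += rng(ch, 'o', 'o', 0) // alias: o -> 0
	value += rng(ch, 'i', 'i', 1) // alias: i -> 1
	value += rng(ch, 'l', 'l', 1) // alias: l -> 1
	return value
}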

Comment From: doggedOwl

However, do we want new functions? It would be simpler for everyone if we made the current functions constant-time. If you could send your patch already, that would be nice; I would like to benchmark it.

No, actually, it would not be simpler for everyone. Base64 encoding is used extensively in various formats (JSON, XML, etc.) without any need to be cryptographically secure. It would be a big regression if all those cases got slowed down for no good reason.

Comment From: rolandshoemaker

Ignoring the actual implementation details for a moment, I think a reasonable proposal here is to make the following change to encoding/base{32,64}:

// WithConstantTime creates a new encoding identical to enc except that the
// Encode/Decode operations happen in constant time. This is useful when the
// inputs to these operations are sensitive, and measuring the timing of the
// operations could leak information about them.  Note that Encode/Decode
// operations will still leak the length of the inputs.
func (enc Encoding) WithConstantTime() *Encoding

and the following changes to the existing encoding/pem functions:

// ...
// Encode uses constant-time base64 encoding, and will only leak the length of
// b.Bytes. To restore the old variable-time behavior, use the
// GODEBUG=vartimepem=0 environment variable.
func Encode(out io.Writer, b *Block) error

// ...
// EncodeToMemory uses constant-time base64 encoding, and will only leak the
// length of b.Bytes. To restore the old variable-time behavior, use the
// GODEBUG=vartimepem=0 environment variable.
func EncodeToMemory(b *Block) []byte

// ...
// Decode uses constant-time base64 decoding, and will only leak the length of
// p.Bytes. To restore the old variable-time behavior, use the
// GODEBUG=vartimepem=0 environment variable.
func Decode(data []byte) (p *Block, rest []byte)

and add these new functions to encoding/pem:

// VariableTimeEncode provides the same functionality as Encode, but uses
// variable-time base64 encoding. VariableTimeEncode should only be used when
// encoding non-sensitive values, or when the timing of VariableTimeEncode is
// not exposed.
func VariableTimeEncode(out io.Writer, b *Block) error

// VariableTimeEncodeToMemory provides the same functionality as EncodeToMemory,
// but uses variable-time base64 encoding. VariableTimeEncodeToMemory should
// only be used when encoding non-sensitive values, or when the timing of
// VariableTimeEncodeToMemory is not exposed.
func VariableTimeEncodeToMemory(b *Block) []byte

// VariableTimeDecode provides the same functionality as Decode, but uses
// variable-time base64 decoding. VariableTimeDecode should only be used when
// decoding non-sensitive values, or when the timing of VariableTimeDecode is
// not exposed.
func VariableTimeDecode(data []byte) (p *Block, rest []byte)

Another option would be to add a new field to Block that lets you set variable-time behavior for encoding, but we'd still need a VariableTimeDecode function. I think I prefer adding the new functions, since it provides better symmetry in the API.
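
To make the ergonomics concrete, hypothetical usage of this proposal (nothing below exists yet):

// encoding/base64: opt in to constant-time behavior explicitly.
ctEnc := base64.StdEncoding.WithConstantTime()
key, err := ctEnc.DecodeString(encodedKey)

// encoding/pem: constant-time by default, variable-time as an explicit
// opt-out for bulk public data such as certificate bundles.
keyBlock, rest := pem.Decode(privateKeyPEM)           // constant-time
certBlock, more := pem.VariableTimeDecode(certBundle) // variable-time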

Comment From: tob-scott-a

@rolandshoemaker That does seem like a better user experience than the API changes in my hasty patch.

Comment From: neild

What's the use case for variable-time encoding in encoding/pem?

encoding/base{32,64} can't be constant-time by default (performance regression, most users don't need constant-time encoding). If encoding/pem.{Encode,Decode} become constant-time by default, then we're saying that most programs either need constant-time encoding or don't care about the performance penalty. That seems plausible to me. Are there programs that use PEM, don't need constant-time encoding, and care about the (hopefully small) performance cost of constant-time base64 encoding?

Comment From: tob-scott-a

Are there programs that use PEM, don't need constant-time encoding, and care about the (hopefully small) performance cost of constant-time base64 encoding?

Consider the use-case of a Key and/or Certificate Transparency system that accepts PEM-encoded public keys (which are not secret) or certificates (which in turn only contain public information) in order to validate them before committing them to a Merkle tree.

In such a hypothetical system, slowing down the base64 decoding by default would add a performance bottleneck with no real upside for the system in question.

This is the sort of consideration I had in mind, and why I opened with not enabling it by default.

Comment From: neild

That's an example of a system that doesn't care about constant-time encoding. Is PEM encoding/decoding going to be in the critical performance path in a system like that, though? How much of a cost are we talking about here?

If we're going to double the API surface of the encoding/pem package and require all future users to spend time deciding whether they need constant-time encoding or not, then I think we should quantify what the actual performance difference is.

Comment From: rolandshoemaker

I think we need benchmarks to make reasonable decisions here about the trade-offs, but I will note that encoding/pem currently exposes only three functions. Doubling the API surface is perhaps not ideal, but concretely that comes down to adding three new functions, which isn't particularly horrific.

The point about requiring users to decide is valid, but I think the combination of naming and reasonable documentation makes it less scary. Additionally, since the existing Encode/Decode would become "safe", it's unlikely people will unintentionally begin using the "unsafe" variants unless they are explicitly looking for them for performance reasons.

Of course, all of this is moot if the benchmarks show that the performance degradation is relatively low, hence the need for numbers before we can make a decision.

Comment From: aclements

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — aclements for the proposal review group

Comment From: aclements

I think we need a benchmark comparison between variable-time and constant-time base64 and base32 encoding to make any decisions here. If the regression is small, then I'd favor fewer configuration knobs in the API.

Comment From: tob-scott-a

@aclements Good point. I recall @Jorropo was planning to run their own benchmarks (based on this comment). If something changes and I need to write and share a small program (and the data collected from my own machines), I'm happy to do so.

Comment From: rolandshoemaker

Benchmark based on https://go.dev/cl/676917:

% benchstat vt-base64 ct-base64
goos: darwin
goarch: arm64
pkg: encoding/base64
cpu: Apple M1 Pro
                       │  vt-base64  │               ct-base64               │
                       │   sec/op    │    sec/op     vs base                 │
EncodeToString/2-10      13.35n ± 1%    17.94n ± 1%   +34.47% (p=0.000 n=10)
EncodeToString/4-10      14.59n ± 1%    22.40n ± 0%   +53.58% (p=0.000 n=10)
EncodeToString/8-10      18.58n ± 1%    31.44n ± 1%   +69.19% (p=0.000 n=10)
EncodeToString/64-10     71.42n ± 1%   192.20n ± 1%  +169.13% (p=0.000 n=10)
EncodeToString/8192-10   5.907µ ± 1%   19.849µ ± 1%  +236.02% (p=0.000 n=10)
DecodeString/2-10        19.81n ± 4%    26.03n ± 0%   +31.37% (p=0.000 n=10)
DecodeString/4-10        19.28n ± 2%    32.28n ± 1%   +67.43% (p=0.000 n=10)
DecodeString/8-10        22.80n ± 1%    44.97n ± 1%   +97.21% (p=0.000 n=10)
DecodeString/64-10       54.78n ± 1%   231.20n ± 0%  +322.05% (p=0.000 n=10)
DecodeString/8192-10     4.181µ ± 2%   25.227µ ± 1%  +503.43% (p=0.000 n=10)
NewEncoding-10           145.7n ± 0%    289.3n ± 1%   +98.56% (p=0.000 n=10)
geomean                  75.33n         168.3n       +123.44%

                       │   vt-base64   │              ct-base64               │
                       │      B/s      │     B/s       vs base                │
EncodeToString/2-10       142.9Mi ± 1%   106.3Mi ± 1%  -25.61% (p=0.000 n=10)
EncodeToString/4-10       261.6Mi ± 1%   170.3Mi ± 0%  -34.89% (p=0.000 n=10)
EncodeToString/8-10       410.6Mi ± 1%   242.7Mi ± 1%  -40.90% (p=0.000 n=10)
EncodeToString/64-10      854.6Mi ± 1%   317.5Mi ± 1%  -62.85% (p=0.000 n=10)
EncodeToString/8192-10   1322.5Mi ± 1%   393.6Mi ± 1%  -70.24% (p=0.000 n=10)
DecodeString/2-10         192.6Mi ± 4%   146.6Mi ± 0%  -23.90% (p=0.000 n=10)
DecodeString/4-10         395.7Mi ± 2%   236.3Mi ± 1%  -40.28% (p=0.000 n=10)
DecodeString/8-10         502.0Mi ± 1%   254.5Mi ± 1%  -49.29% (p=0.000 n=10)
DecodeString/64-10       1532.1Mi ± 1%   363.0Mi ± 0%  -76.31% (p=0.000 n=10)
DecodeString/8192-10     2492.0Mi ± 2%   413.0Mi ± 1%  -83.43% (p=0.000 n=10)
NewEncoding-10           1675.4Mi ± 0%   843.8Mi ± 1%  -49.63% (p=0.000 n=10)
geomean                   608.6Mi        272.4Mi       -55.25%

Does not look great. I don't think we could reasonably accept slowing down non-sensitive base64 usages this much by default.

I didn't check the perf of @Sc00bz's impl, nor did I do much investigation into whether there are easy perf optimizations still available for https://go.dev/cl/676917.

Comment From: aclements

Thanks for running those. I'm not surprised that adding an indirect function call for every byte, as https://go.dev/cl/676917 does, slows things down dramatically. I think we'd have to find a different way to do this.

Just as an experiment, what happens if you inline that logic for a particular alphabet? (It might work to just use a direct function call, but be sure to check that the compiler actually inlines it if you try that.)

Comment From: rolandshoemaker

Inlining the calls improves the performance, but not massively:

% benchstat vt-base64 ct-inlined-base64
goos: darwin
goarch: arm64
pkg: encoding/base64
cpu: Apple M1 Pro
                       │  vt-base64  │           ct-inlined-base64           │
                       │   sec/op    │    sec/op     vs base                 │
EncodeToString/2-10      13.35n ± 1%    14.72n ± 1%   +10.34% (p=0.000 n=10)
EncodeToString/4-10      14.59n ± 1%    19.08n ± 3%   +30.82% (p=0.000 n=10)
EncodeToString/8-10      18.58n ± 1%    26.43n ± 1%   +42.25% (p=0.000 n=10)
EncodeToString/64-10     71.42n ± 1%   153.00n ± 2%  +114.24% (p=0.000 n=10)
EncodeToString/8192-10   5.907µ ± 1%   16.689µ ± 4%  +182.52% (p=0.000 n=10)
DecodeString/2-10        19.81n ± 4%    23.62n ± 1%   +19.23% (p=0.000 n=10)
DecodeString/4-10        19.28n ± 2%    29.69n ± 1%   +53.97% (p=0.000 n=10)
DecodeString/8-10        22.80n ± 1%    41.86n ± 2%   +83.60% (p=0.000 n=10)
DecodeString/64-10       54.78n ± 1%   213.35n ± 0%  +289.47% (p=0.000 n=10)
DecodeString/8192-10     4.181µ ± 2%   24.479µ ± 1%  +485.55% (p=0.000 n=10)
NewEncoding-10           145.7n ± 0%    297.7n ± 2%  +104.32% (p=0.000 n=10)
geomean                  75.33n         150.0n        +99.13%

                       │   vt-base64   │          ct-inlined-base64           │
                       │      B/s      │     B/s       vs base                │
EncodeToString/2-10       142.9Mi ± 1%   129.5Mi ± 1%   -9.37% (p=0.000 n=10)
EncodeToString/4-10       261.6Mi ± 1%   199.9Mi ± 3%  -23.56% (p=0.000 n=10)
EncodeToString/8-10       410.6Mi ± 1%   288.7Mi ± 1%  -29.70% (p=0.000 n=10)
EncodeToString/64-10      854.6Mi ± 1%   398.9Mi ± 2%  -53.33% (p=0.000 n=10)
EncodeToString/8192-10   1322.5Mi ± 1%   468.2Mi ± 4%  -64.60% (p=0.000 n=10)
DecodeString/2-10         192.6Mi ± 4%   161.5Mi ± 1%  -16.15% (p=0.000 n=10)
DecodeString/4-10         395.7Mi ± 2%   257.0Mi ± 1%  -35.06% (p=0.000 n=10)
DecodeString/8-10         502.0Mi ± 1%   273.4Mi ± 2%  -45.54% (p=0.000 n=10)
DecodeString/64-10       1532.1Mi ± 1%   393.3Mi ± 0%  -74.33% (p=0.000 n=10)
DecodeString/8192-10     2492.0Mi ± 2%   425.6Mi ± 1%  -82.92% (p=0.000 n=10)
NewEncoding-10           1675.4Mi ± 0%   820.0Mi ± 2%  -51.06% (p=0.000 n=10)
geomean                   608.6Mi        305.6Mi       -49.78%

Because we are operating on 8-byte chunks at a time, I do wonder if SIMD (or another multi-op packing technique) could provide a significantly more performant implementation.

Comment From: randall77

For encoding at least, you only need a 64-byte lookup table. That can all fit on a single cache line. I would think that if you properly aligned the lookup table, it would be constant-time as far as cache-timing attacks go.
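
A sketch of the alignment half of that idea (illustrative only; it uses package unsafe, assumes 64-byte cache lines, and over-allocates because Go's allocator doesn't promise 64-byte alignment):

// alignedAlphabet copies a 64-character alphabet into a buffer whose
// first byte sits on a 64-byte boundary, so every encode lookup touches
// the same cache line regardless of the secret index.
func alignedAlphabet(alphabet string) []byte {
	buf := make([]byte, 64+63)
	off := 0
	if r := uintptr(unsafe.Pointer(&buf[0])) % 64; r != 0 {
		off = 64 - int(r)
	}
	t := buf[off : off+64]
	copy(t, alphabet) // alphabet must be exactly 64 bytes
	return t
}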

Comment From: randall77

Decoding is not too hard to fit into a 64-byte table either. See CL 687396.

Comment From: rolandshoemaker

@randall77 I quite like the idea of exploiting the fact that the LUT fits in a single cache line in theory, but in practice I'm not entirely comfortable with relying on CPU caching semantics for constant-time properties. (I'm not sure I've ever seen a case where somebody has exploited a mismatch between described behavior and implemented behavior here, but it wouldn't entirely shock me.) If anything, the last four years or so have made me less willing to rely on CPU designers not to silently break it.

Comment From: Sc00bz

Here's base64 encode/decode, 8 characters at a time, done SIMD-style with standard 64-bit integers.

TL;DR: I'm doing eight 7-bit math operations inside a 64-bit integer. There is a separator bit between lanes to prevent negative numbers from bleeding over, and the separator bit is also used to build the eight 7-bit conditional masks.
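
For a single lane, the separator-bit trick looks like this (a sketch for the 'A'..'Z' range; the functions below do the same across all eight lanes at once):

// laneMask returns 0x7f when 'A' <= ch <= 'Z' and 0 otherwise, using the
// 0x80 separator bit as a borrow flag instead of a branch.
// ch is a single 7-bit character value.
func laneMask(ch uint64) uint64 {
	const lo, hi = uint64('A'), uint64('Z')
	src2 := ch | 0x80                               // pre-set the separator bit
	mask := (src2 - lo) & ((hi | 0x80) - ch) & 0x80 // 0x80 iff lo <= ch <= hi
	return mask - (mask >> 7)                       // smear 0x80 into 0x7f
}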

// Decode usage (in/out are []byte; uses encoding/binary and math/bits):
encoded := binary.BigEndian.Uint64(in)
data := base64StandardDecodeBlock(encoded)
out[0] = byte(data >> 40)
out[1] = byte(data >> 32)
out[2] = byte(data >> 24)
out[3] = byte(data >> 16)
out[4] = byte(data >>  8)
out[5] = byte(data >>  0)

// Encode usage:
data :=
    (uint64(in[0]) << 40) |
    (uint64(in[1]) << 32) |
    (uint64(in[2]) << 24) |
    (uint64(in[3]) << 16) |
    (uint64(in[4]) <<  8) |
    (uint64(in[5]) <<  0)
encoded := base64StandardEncodeBlock(data)
binary.BigEndian.PutUint64(out, encoded)

func base64StandardDecodeBlock(src uint64) uint64 {
    const SEPARATORS uint64 = 0x8080808080808080

    const LO1         int64 = int64('A')
    const HI1         int64 = int64('Z')
    const LO2         int64 = int64('a')
    const HI2         int64 = int64('z')
    const LO3         int64 = int64('0')
    const HI3         int64 = int64('9')
    const CHAR4       int64 = int64('+')
    const CHAR5       int64 = int64('/')
    const FULL_LO1   uint64 = uint64(LO1 << 56 | LO1 << 48 | LO1 << 40 | LO1 << 32 | LO1 << 24 | LO1 << 16 | LO1 << 8 | LO1)
    const FULL_HI1   uint64 = uint64(HI1 << 56 | HI1 << 48 | HI1 << 40 | HI1 << 32 | HI1 << 24 | HI1 << 16 | HI1 << 8 | HI1) | SEPARATORS
    const FULL_LO2   uint64 = uint64(LO2 << 56 | LO2 << 48 | LO2 << 40 | LO2 << 32 | LO2 << 24 | LO2 << 16 | LO2 << 8 | LO2)
    const FULL_HI2   uint64 = uint64(HI2 << 56 | HI2 << 48 | HI2 << 40 | HI2 << 32 | HI2 << 24 | HI2 << 16 | HI2 << 8 | HI2) | SEPARATORS
    const FULL_LO3   uint64 = uint64(LO3 << 56 | LO3 << 48 | LO3 << 40 | LO3 << 32 | LO3 << 24 | LO3 << 16 | LO3 << 8 | LO3)
    const FULL_HI3   uint64 = uint64(HI3 << 56 | HI3 << 48 | HI3 << 40 | HI3 << 32 | HI3 << 24 | HI3 << 16 | HI3 << 8 | HI3) | SEPARATORS
    const FULL_CHAR4 uint64 = uint64(CHAR4 << 56 | CHAR4 << 48 | CHAR4 << 40 | CHAR4 << 32 | CHAR4 << 24 | CHAR4 << 16 | CHAR4 << 8 | CHAR4)
    const FULL_CHAR5 uint64 = uint64(CHAR5 << 56 | CHAR5 << 48 | CHAR5 << 40 | CHAR5 << 32 | CHAR5 << 24 | CHAR5 << 16 | CHAR5 << 8 | CHAR5)

    const COUNT1       int64 = 0
    const COUNT2       int64 = COUNT1 + HI1 - LO1 + 1
    const COUNT3       int64 = COUNT2 + HI2 - LO2 + 1
    const COUNT4       int64 = COUNT3 + HI3 - LO3 + 1
    const COUNT5       int64 = COUNT4 + 1
    const FULL_COUNT4 uint64 = uint64(COUNT4 << 56 | COUNT4 << 48 | COUNT4 << 40 | COUNT4 << 32 | COUNT4 << 24 | COUNT4 << 16 | COUNT4 << 8 | COUNT4)
    const FULL_COUNT5 uint64 = uint64(COUNT5 << 56 | COUNT5 << 48 | COUNT5 << 40 | COUNT5 << 32 | COUNT5 << 24 | COUNT5 << 16 | COUNT5 << 8 | COUNT5)

    const DIFF1      uint64 = uint64( (LO1 - COUNT1))
    const DIFF2      uint64 = uint64( (LO2 - COUNT2))
    const DIFF3      uint64 = uint64(-(LO3 - COUNT3)) // Note the sign
    const FULL_DIFF1 uint64 = DIFF1 << 56 | DIFF1 << 48 | DIFF1 << 40 | DIFF1 << 32 | DIFF1 << 24 | DIFF1 << 16 | DIFF1 << 8 | DIFF1
    const FULL_DIFF2 uint64 = DIFF2 << 56 | DIFF2 << 48 | DIFF2 << 40 | DIFF2 << 32 | DIFF2 << 24 | DIFF2 << 16 | DIFF2 << 8 | DIFF2
    const FULL_DIFF3 uint64 = DIFF3 << 56 | DIFF3 << 48 | DIFF3 << 40 | DIFF3 << 32 | DIFF3 << 24 | DIFF3 << 16 | DIFF3 << 8 | DIFF3

    const FULL_ONE uint64 = 0x0101010101010101

    var value uint64
    var mask uint64

    src2 := src | SEPARATORS
    err := src & SEPARATORS

    // value += src >= LO1 && HI1 >= src ? src - DIFF1 : 0
    mask = (src2 - FULL_LO1) & (FULL_HI1 - src) & SEPARATORS
    mask = mask - (mask >> 7)
    value += (src & mask) - (FULL_DIFF1 & mask)
    err |= mask

    // value += src >= LO2 && HI2 >= src ? src - DIFF2 : 0
    mask = (src2 - FULL_LO2) & (FULL_HI2 - src) & SEPARATORS
    mask = mask - (mask >> 7)
    value += (src & mask) - (FULL_DIFF2 & mask)
    err |= mask

    // value += src >= LO3 && HI3 >= src ? src + DIFF3 : 0 // Sign of DIFF3 was flipped
    mask = (src2 - FULL_LO3) & (FULL_HI3 - src) & SEPARATORS
    mask = mask - (mask >> 7)
    value += (src & mask) + (FULL_DIFF3 & mask) // Sign of DIFF3 was flipped
    err |= mask

    // value += src == CHAR4 ? COUNT4 : 0
    mask = ((src2 ^ FULL_CHAR4) - FULL_ONE) & SEPARATORS ^ SEPARATORS
    mask = mask - (mask >> 7)
    value += FULL_COUNT4 & mask
    err |= mask

    // value += src == CHAR5 ? COUNT5 : 0
    mask = ((src2 ^ FULL_CHAR5) - FULL_ONE) & SEPARATORS ^ SEPARATORS
    mask = mask - (mask >> 7)
    value += FULL_COUNT5 & mask
    err |= mask

    value = 
        ((value & 0x3f00000000000000) >> (56 - 7 * 6)) |
        ((value & 0x003f000000000000) >> (48 - 6 * 6)) |
        ((value & 0x00003f0000000000) >> (40 - 5 * 6)) |
        ((value & 0x0000003f00000000) >> (32 - 4 * 6)) |
        ((value & 0x000000003f000000) >> (24 - 3 * 6)) |
        ((value & 0x00000000003f0000) >> (16 - 2 * 6)) |
        ((value & 0x0000000000003f00) >> ( 8 - 1 * 6)) |
        ((value & 0x000000000000003f) >> ( 0 - 0 * 6))

    // err = err == 0x7f7f7f7f7f7f7f7f ? 0 : -1
    _, err = bits.Sub64(err ^ 0x7f7f7f7f7f7f7f7f, 1, 0)
    err = ^(0 - err)

    return value | err // returns -1 on error
}

func base64StandardEncodeBlock(src uint64) uint64 {
    const SEPARATORS uint64 = 0x8080808080808080

    const LO1       int64 = int64('A')
    const HI1       int64 = int64('Z')
    const LO2       int64 = int64('a')
    const HI2       int64 = int64('z')
    const LO3       int64 = int64('0')
    const HI3       int64 = int64('9')
    const LO4       int64 = int64('+')
    const HI4       int64 = int64('+')
    const LO5       int64 = int64('/')
    const FULL_LO1 uint64 = uint64(LO1 << 56 | LO1 << 48 | LO1 << 40 | LO1 << 32 | LO1 << 24 | LO1 << 16 | LO1 << 8 | LO1)

    const DIFF1      uint64 = uint64( (LO2 - HI1 - 1) & 0x7f)
    const DIFF2      uint64 = uint64( (LO3 - HI2 - 1) & 0x7f)
    const DIFF3      uint64 = uint64(-(LO4 - HI3 - 1) & 0x7f) // avoid overflow with 0x71
    const DIFF4      uint64 = uint64( (LO5 - HI4 - 1) & 0x7f)
    const FULL_DIFF1 uint64 = DIFF1 << 56 | DIFF1 << 48 | DIFF1 << 40 | DIFF1 << 32 | DIFF1 << 24 | DIFF1 << 16 | DIFF1 << 8 | DIFF1
    const FULL_DIFF2 uint64 = DIFF2 << 56 | DIFF2 << 48 | DIFF2 << 40 | DIFF2 << 32 | DIFF2 << 24 | DIFF2 << 16 | DIFF2 << 8 | DIFF2
    const FULL_DIFF3 uint64 = DIFF3 << 56 | DIFF3 << 48 | DIFF3 << 40 | DIFF3 << 32 | DIFF3 << 24 | DIFF3 << 16 | DIFF3 << 8 | DIFF3
    const FULL_DIFF4 uint64 = DIFF4 << 56 | DIFF4 << 48 | DIFF4 << 40 | DIFF4 << 32 | DIFF4 << 24 | DIFF4 << 16 | DIFF4 << 8 | DIFF4
    // The overflow:
    // LO1 + DIFF1 + DIFF2 - 0x0f + DIFF4 + 63 = 0xaf
    // LO1 + DIFF1 + DIFF2 + 0x71 + DIFF4 + 63 = 0x12f (overflow)

    const COUNT1      uint64 = uint64(HI1 - LO1) + 1
    const COUNT2      uint64 = uint64(HI2 - LO2) + 1 + COUNT1
    const COUNT3      uint64 = uint64(HI3 - LO3) + 1 + COUNT2
    const COUNT4      uint64 = uint64(HI4 - LO4) + 1 + COUNT3
    const FULL_COUNT1 uint64 = COUNT1 << 56 | COUNT1 << 48 | COUNT1 << 40 | COUNT1 << 32 | COUNT1 << 24 | COUNT1 << 16 | COUNT1 << 8 | COUNT1
    const FULL_COUNT2 uint64 = COUNT2 << 56 | COUNT2 << 48 | COUNT2 << 40 | COUNT2 << 32 | COUNT2 << 24 | COUNT2 << 16 | COUNT2 << 8 | COUNT2
    const FULL_COUNT3 uint64 = COUNT3 << 56 | COUNT3 << 48 | COUNT3 << 40 | COUNT3 << 32 | COUNT3 << 24 | COUNT3 << 16 | COUNT3 << 8 | COUNT3
    const FULL_COUNT4 uint64 = COUNT4 << 56 | COUNT4 << 48 | COUNT4 << 40 | COUNT4 << 32 | COUNT4 << 24 | COUNT4 << 16 | COUNT4 << 8 | COUNT4

    src =
        ((src << (56 - 7 * 6)) & 0x3f00000000000000) |
        ((src << (48 - 6 * 6)) & 0x003f000000000000) |
        ((src << (40 - 5 * 6)) & 0x00003f0000000000) |
        ((src << (32 - 4 * 6)) & 0x0000003f00000000) |
        ((src << (24 - 3 * 6)) & 0x000000003f000000) |
        ((src << (16 - 2 * 6)) & 0x00000000003f0000) |
        ((src << ( 8 - 1 * 6)) & 0x0000000000003f00) |
        ((src << ( 0 - 0 * 6)) & 0x000000000000003f)
    src2 := src | SEPARATORS
    diff := FULL_LO1

    var mask uint64
    // if   ch  >= COUNT                    { diff += DIFF }
    mask = (src2 - FULL_COUNT1) & SEPARATORS; diff += FULL_DIFF1 & (mask - (mask >> 7))
    mask = (src2 - FULL_COUNT2) & SEPARATORS; diff += FULL_DIFF2 & (mask - (mask >> 7))
    mask = (src2 - FULL_COUNT3) & SEPARATORS; diff -= FULL_DIFF3 & (mask - (mask >> 7)) // overflow avoided
    mask = (src2 - FULL_COUNT4) & SEPARATORS; diff += FULL_DIFF4 & (mask - (mask >> 7))

    return (src + diff) & 0x7f7f7f7f7f7f7f7f
}

I didn't test all the possible encode/decodes (2**48) but I tested all 3 byte values (2**24) left shifted by 0, 6, 12, 18, and 24.

SSE2/AVX2/NEON will be faster. SSE2, AVX2, and NEON have compares for bytes signed greater-than and equal: _mm_cmpgt_epi8/_mm256_cmpgt_epi8/vcgtq_s8 (PCMPGTB/VPCMPGTB/CMGT) and _mm_cmpeq_epi8/_mm256_cmpeq_epi8/vceqq_s8 (PCMPEQB/VPCMPEQB/CMEQ). Thus a simple conditional add is 3 SIMD instructions (compare, and, add), but adding actual SIMD seems like overkill.

I'll benchmark this tomorrow, if someone else hasn't.

The base32 version is fairly straightforward. For decode, you'll get an error on the DIFFs where you need to flip the sign. For encode, you need to manually check for overflows and flip the appropriate signs.

I'll submit a pull request assuming these are faster. Oh, I just realized I need to make a change to save time on partial blocks: the input to the encode function should be expanded before calling, and similarly for the output of the decode function.

Comment From: Sc00bz

It's slower. I'll try a few things, but I think this will be slower without SIMD. I'm pretty sure this won't change much, but note that the benchmark tests encode and decode of zero bytes, so it's always hitting the same table elements: encode[0] and decodeMap['A'].

Comment From: rolandshoemaker

If we can get a CT SIMD assembly implementation that is just as fast (likely faster), then I think the question of whether to add a CT implementation alongside the VT one somewhat goes out the window (if it's just faster, why not always use it). That said, if it's significantly more complex to write an assembly implementation that is agnostic to alphabets, I think the question still remains.

Comment From: apparentlymart

Would it be acceptable to say that constant-time is guaranteed only for the standard alphabet, while other alphabets might use a more general implementation that is not constant-time, or does that seem too subtle for a security-related feature? 🤔

(I'm asking the above question primarily from an API design standpoint. I acknowledge that it also implies still maintaining two implementations rather than only one, but that was already true for the compromise of exposing constant-time and not-constant-time as explicitly separate functions.)

Comment From: rolandshoemaker

I think regardless of the underlying implementation we'll want to retain an API such as the WithConstantTime method suggested in https://github.com/golang/go/issues/73901#issuecomment-2919845169. If we decide we can only enable the CT implementation for standard alphabets, then that can return an error for non-standard alphabets.

If we can come up with a fast SIMD CT implementation that can easily be extended to arbitrary alphabets that would be nice, but I don't think it's a sticking point. Really we only want this for standard alphabets (for use in encoding/pem), and if we have to carry a slower variable-time implementation for non-standard alphabets, that's probably fine.

Comment From: aclements

FWIW, the fast AVX-512 algorithms for base64 encode and decode in https://arxiv.org/pdf/1910.05109 work with arbitrary alphabets and should be straightforward to extend to base32. However, at their core is a table lookup that only AVX-512 can pull off. I haven't thought much about how to do this with smaller vectors, but will note that SIMD by its nature is generally good at turning branchy code into branchless code.

Do people use alphabets that differ in anything other than the last two elements?

Comment From: magical

Yes, there are certain password hashing schemes that use variant alphabets: most notably bcrypt, which puts the punctuation characters ./ at the start of the alphabet instead of at the end.

./ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789

https://cs.opensource.google/go/x/crypto/+/refs/tags/v0.40.0:bcrypt/base64.go;l=9

Comment From: Sc00bz

You need AVX-512's VBMI for _mm512_permutex2var_epi8 (VPERMI2B). It's also in AVX10.1, which is the first AVX10 version. (If you're not familiar, AVX10 is a consolidation of the AVX-512 instruction subsets. There's also AVX10/256, which limits register width to 256 bits, plus a promise that future versions will include all instructions from previous versions.)

AMD's Zen 4 and Zen 5 have it. Intel's Cannon Lake was the first to have it; at least Ice Lake (client), Ice Lake Xeon, and Sapphire Rapids have it according to the Intel Intrinsics Guide (see also Wikipedia's compatibility table).

It might be better to target SSSE3*, AVX, and AVX2's _mm_shuffle_epi8/_mm256_shuffle_epi8 (PSHUFB/VPSHUFB). That will need 2 (base32) or 4 (base64) calls to encode and 16 calls to decode, because it is limited to a 16-byte lookup table. Newer Intel CPUs have a throughput of 2 of these instructions per clock, which is nice.

NEON (ARM64) has vqtbl4q_s8 (TBL), which can do a 64-byte lookup table (across four 16-byte registers). It's also easier than PSHUFB/VPSHUFB because anything out of bounds is set to 0, rather than only when the high bit is set. For ARMv7/ARM32 there's vtbl4_u8 (VTBL), which can do a 32-byte lookup table (across four 8-byte registers).

* Maybe not SSSE3 since it's old and Intel only.