Background: Issue https://github.com/golang/go/issues/69420 proposes a types.Hash function that provides a hash operator for types.Type values, consistent with the equivalence relation for types defined by types.Identical. The only real question in that proposal is: what should Go's future convention be for the signature of a custom hash function?
The naive answer is func(T) uint64, where T stands for types.Type. However, hash tables with only a single hash function are vulnerable to a "hash flooding" denial-of-service attack, in which an adversary provides many inputs that all hash to the same value, causing the table to degenerate into a linked list. The defense is for each process (or particular hash table) to select a different hash function at random so that the values cannot be predicted. The hash/maphash package embodies this approach: a random Seed value determines the hash functions (maphash.String and maphash.Bytes), which depend on the seed's opaque random value. (maphash does not provide a Seed-dependent way to hash integers.)
So, perhaps custom hash functions should accept a maphash.Seed parameter. And perhaps Seed should also provide methods for hashing integers.
Proposal: In the hash package, define function types HashFunc and EqFunc that document the conventions for Go hash tables.
Here is the minimal answer, without Seeds:
package hash
// HashFunc defines a hash function for values of type T.
// A conformant HashFunc and EqFunc pair (hash, eq) must observe the invariant
// that eq(x, y) => hash(x) == hash(y).
type HashFunc[T any] func(T) uint64
// EqFunc defines an equivalence relation for values of type T.
type EqFunc[T any] func(x, y T) bool
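Purely for illustration (not part of the proposal), here is what a conformant (hash, eq) pair could look like for case-insensitive string keys, assuming the HashFunc and EqFunc types above. The invariant holds because the hash is computed from the same lowercased form that the equality compares; note that the seed is necessarily fixed per process and captured by the closure, which is part of what the rest of this thread debates.

import (
    "hash/maphash"
    "strings"
)

var seed = maphash.MakeSeed() // one random seed per process

// eqFold treats keys as equal when they match after lowercasing.
var eqFold EqFunc[string] = func(x, y string) bool {
    return strings.ToLower(x) == strings.ToLower(y)
}

// hashFold hashes the lowercased key, so eqFold(x, y) implies
// hashFold(x) == hashFold(y).
var hashFold HashFunc[string] = func(s string) uint64 {
    return maphash.String(seed, strings.ToLower(s))
}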
@ianlancetaylor @griesemer
Comment From: gabyhelp
Related Issues
- proposal: go/types: add Hash function #69420
- proposal: hash: export a built-in hash function for comparable values #21195 (closed)
- proposal: container/unordered: a generic hash table with custom hash function and equivalence relation #69559
Comment From: rsc
(The minimal answer without Seeds is definitely the wrong one, as you explained before giving it.)
Comment From: ianlancetaylor
One approach is to define
type Hasher[T any] interface {
Hash(T) uint64
Equal(T, T) bool
}
A generic hash table with keys of type T will take a value of type Hasher[T], and invoke the methods of that value to compute hash values and to compare elements for equality.
Then we can have
// MakeHasher returns a Hasher[T] that invokes the hash and equal functions.
// Note that this does not use an explicit seed.
func MakeHasher[T any](hash func(T) uint64, equal func(T, T) bool) Hasher[T] { ... }
// MakeSeededHasher returns a Hasher[T] that uses a random seed.
func MakeSeededHasher[T any](hash func(T, uint64) uint64, equal func(T, T) bool) Hasher[T] {
return &seededHasher[T]{hash: hash, equal: equal, seed: randUint64()}
}
type seededHasher[T any] struct {
hash func(T, uint64) uint64
equal func(T, T) bool
seed uint64
}
func (sh *seededHasher[T]) Hash(v T) uint64 {
return sh.hash(v, sh.seed)
}
func (sh *seededHasher[T]) Equal(a, b T) bool {
return sh.equal(a, b)
}
func (sh *seededHasher[T]) SetSeed(seed uint64) {
sh.seed = seed
}
// MakeMapHashHasher returns a Hasher[T] that uses [maphash.Comparable].
func MakeMapHashHasher[T comparable](seed maphash.Seed) Hasher[T] {
return mapHashHasher[T]{seed}
}
type mapHashHasher[T comparable] struct {
seed maphash.Seed
}
func (mhh mapHashHasher[T]) Hash(v T) uint64 {
return maphash.Comparable(mhh.seed, v)
}
func (mhh mapHashHasher[T]) Equal(a, b T) bool {
return a == b
}
Comment From: adonovan
(The minimal answer without Seeds is definitely the wrong one, as you explained before giving it.)
Indeed, I was hoping to confirm Cunningham's law.
func (sh *seededHasher[T]) Hash(v T) uint64 { return sh.hash(v, sh.seed) }
This approach doesn't support a per-table seed.
That said, I still don't really understand how even a per-table seed provides a defense against flooding unless it is passed through the hash function itself. If the attacker knows the hash function and can provide all the keys in the table, then it's not enough to transform the hashes at the end, since they will all be transformed in the same way, whatever that is, preserving collisions. It seems to me that the seed would need to be threaded through the hash function so that whether two keys' hashes collide is impossible to predict from the sequence of elements that are incorporated into the hash.
Comment From: randall77
This approach doesn't support a per-table seed.
I don't think that's super important. A per-process seed is probably good enough.
That said, I still don't really understand how even a per-table seed provides a defense against flooding unless it is passed through the hash function itself.
Yes, it must be passed through the hash function itself. Ian's proposed code does that.
Comment From: ianlancetaylor
I believe that MakeSeededHasher does support a per-table seed, because you can pass a call to MakeSeededHasher to the hash table's Make function. Each hash table will then be using a different seed.
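A hedged sketch of that usage, assuming a hypothetical table constructor Make that accepts a Hasher[K]; because each call to MakeSeededHasher draws its own random seed, the two tables below end up with different seeds. The string hash here is a toy for illustration only.

func example() {
    hash := func(s string, seed uint64) uint64 {
        // Toy seeded hash; a real implementation would use a
        // high-quality mixer.
        h := seed
        for i := 0; i < len(s); i++ {
            h = (h ^ uint64(s[i])) * 1099511628211
        }
        return h
    }
    eq := func(a, b string) bool { return a == b }

    // Make is a hypothetical hash-table constructor taking a Hasher[K].
    m1 := Make[string, int](MakeSeededHasher(hash, eq)) // draws seed #1
    m2 := Make[string, int](MakeSeededHasher(hash, eq)) // draws seed #2, independently
    _, _ = m1, m2
}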
Comment From: ianlancetaylor
That is, I was thinking in terms of passing a Hasher[T] as a value when making a hash table. But there is another approach, which is to make Hasher[T] a type parameter of the hash table. That approach has the advantage that it permits inlining of the hash and equality methods. That approach still requires that the hash table include a value of type Hasher[T]. And we can arrange for that value to initialize itself with a seed when first called.
Comment From: aclements
TL;DR: Of the options I present below, option 3 feels like the best balance to me.
Note that a lot of this discussion is.. uh.. rehashing #69559.
I worry that @ianlancetaylor 's API proposed in https://github.com/golang/go/issues/70471#issuecomment-2489681099 (or one where a hash table is parameterized by the hasher type) makes it easier to write a non-seeded hash function than a seeded one. I'd much rather have an API that makes it an obvious mistake to not seed the hash function.
Also, I think it's important to compare what impact different API options would have on a custom hash table type. Otherwise we're operating in a vacuum.
Setting aside for a moment the type of the seed, option 1 is an obvious extension of @adonovan 's API:
type HashFunc[T any] func(Seed, T) uint64
This leads to a hash table API like:
type EqFunc[T any] func(T, T) bool
type HashTable[T any] struct {
seed Seed
hash HashFunc[T]
eq EqFunc[T]
...
}
func NewHashTable[T any](hash HashFunc[T], eq EqFunc[T]) *HashTable[T]
@mateusz834 proposed this here.
Option 1 requires the hash table to store three words for its hasher, and calls to the hasher and equal are indirect.
Option 2 is to do this same transformation, but to @ianlancetaylor 's Hasher interface:
type Hasher[T any] interface {
Hash(Seed, T) uint64
Equal(T, T) bool
}
// Example hash table
type HashTable[T any, Hash Hasher[T]] struct {
hash Hash // Probably zero-sized
seed Seed
...
}
Option 2 most likely stores just the seed in the hash table, which is minimal, and calls to the hasher and equal functions will be direct. However, the type implementing Hasher will almost always be an empty struct, so we're in the land of pure type metaprogramming here, which isn't wrong but sure feels weird. This definitely suffers from the lack of type-type inference.
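To make the metaprogramming point concrete, a hedged sketch of an option-2 hasher, assuming Seed is maphash.Seed for the sake of the example; the hasher type carries no data at all and is named only as a type argument.

// StringHasher is stateless; only its type matters.
type StringHasher struct{}

func (StringHasher) Hash(seed maphash.Seed, s string) uint64 { return maphash.String(seed, s) }
func (StringHasher) Equal(a, b string) bool                  { return a == b }

// No hasher value is ever constructed explicitly; the table holds a
// zero-sized StringHasher field plus the seed it generated itself.
var t HashTable[string, StringHasher]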
Option 3 is to make the Hasher store the seed:
type Hasher[T any] interface {
Hash(T) uint64
Equal(T, T) bool
Reset(Seed)
}
// Example hash table
type HashTable[T any, Hash Hasher[T]] struct {
hash Hash // Probably just struct { Seed }
...
}
func NewHashTable[T any, Hash Hasher[T]]() *HashTable[T, Hash] {
ht := &HashTable[T, Hash]{}
seed := MakeSeed()
ht.hash.Reset(seed)
...
}
This is similar to a proposal by @Merovius.
Option 3, like option 2, stores just the seed in the hash table, and calls to the hasher are direct, but suffers from the lack of type-type inference. However, it avoids pure metaprogramming. It feels slightly weird that the type implementing Hasher is mutable, but that's not too unusual.
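A hedged sketch of a concrete option-3 hasher, again assuming Seed is maphash.Seed; the seed installed by Reset is the hasher's only state, and because Reset needs a pointer receiver, the table would be instantiated with the pointer type (for example HashTable[string, *SeededStringHasher]).

type SeededStringHasher struct {
    seed maphash.Seed
}

func (h *SeededStringHasher) Reset(seed maphash.Seed) { h.seed = seed }
func (h *SeededStringHasher) Hash(s string) uint64    { return maphash.String(h.seed, s) }
func (h *SeededStringHasher) Equal(a, b string) bool  { return a == b }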
Option 4 avoids mutability by introducing a hash constructor function:
type Hasher[T any] interface {
Hash(T) uint64
Equal(T, T) bool
}
type NewHasher[T any, Hash Hasher[T]] func(Seed) Hash
// Example hash table
type HashTable[T any, Hash Hasher[T]] struct{
h Hash
...
}
func NewHashTable[T any, Hash Hasher[T]](newHasher NewHasher[T, Hash]) *HashTable[T, Hash] {
seed := MakeSeed()
h := newHasher(seed)
return &HashTable[T, Hash]{h: h, ...}
}
This is pretty similar to option 3, but it feels like it's getting heavy on the mechanism.
And perhaps Seed should also provide methods for hashing integers.
As for the Seed type, this is an interesting idea, though in practice I think a lot of hash functions would either want to ultimately pass the Seed on to maphash (so it's just opaque) or will want to transform the seed into a uint64 or some bytes or something they can use directly. You could do that with the interface by passing, e.g., 0, 1, etc, but that feels not very obvious.
Alternatively, the Seed could provide a few different ways to get a concrete seed value. E.g., it could have a Uint64 method to get it as a uint64, and a Bytes method to populate an arbitrarily large []byte. These methods would be idempotent: pure functions of the Seed's hidden value. And there would be no way to construct a Seed from any of these representations.
If we want to support hash functions that derive something from Seed and use that instead of using Seed opaquely, I think we need to have something like Reset from option 3 or NewHasher from option 4 so it has a chance to derive what it needs to derive.
Comment From: Merovius
I don't want to touch maphash.Seed. I believe maphash.Seed is fine as it is. If we want to enable a hash function to work with different seed types, I think the straightforward way to do that would be to make Seed a type parameter on the concrete implementation, not to change maphash.Seed to be transformable into a bunch of different types. Then maphash.Seed can be one possible type argument, uint64 can be another, and if a hash function needs more bits, it can use []byte or struct { Lo, Hi uint64 } or whatever.
I think the advantage of @ianlancetaylor's design is that it makes the type of seed opaque to the hash table. I think that is a good thing. It gives maximum flexibility, while staying simple in usage. If we want to make it maximally simple to use safely, we could do something like
type Hasher[T any] interface{
Hash(T) uint64
Equal(T, T) bool
}
// exported type, to optionally avoid indirection.
type MapHashHasher[T comparable] struct {
Seed maphash.Seed
}
func (h *MapHashHasher[T]) Reset() { h.Seed = maphash.MakeSeed() }
func (h MapHashHasher[T]) Hash(v T) uint64 { return maphash.Comparable(h.Seed, v) }
func (h MapHashHasher[T]) Equal(a, b T) bool { return a == b }
func MakeMapHashHasher[T comparable]() MapHashHasher[T] { return MapHashHasher[T]{maphash.MakeSeed()} }
type FuncHasher[T, Seed any] struct {
hash func(Seed, T) uint64
equal func(T, T) bool
seed Seed
}
func MakeFuncHasher[T, Seed any](hash func(Seed, T) uint64, equal func(T, T) bool, seed Seed) FuncHasher[T, Seed] {
return FuncHasher[T, Seed]{hash, equal, seed}
}
func (h FuncHasher[T, Seed]) Hash(v T) uint64 { return h.hash(h.seed, v) }
func (h FuncHasher[T, Seed]) Equal(a, b T) bool { return h.equal(a, b) }
To be fair, this still makes it easy to construct a badly seeded hasher for composite types. There are two ways to deal with that:
- Not put Seed into MakeFuncHasher at all and instead just make it a fixed uint64. That way, we can choose it at random and leave it up to third parties to provide implementations with other seed types. After all, the fact that the seed type is opaque is exactly the point.
- Require that Seed has a Reset(*rand.Rand) method. But… blegh.
With this, a hashtable API can decide itself whether it wants to parameterize over Hasher[T] (avoiding an indirection) or just take a Hasher[T] value in the constructor (avoiding extra type parameters).
I think ultimately, the Hasher interface not knowing about Seed is the maximally general interface. The type of the seed is, ultimately, up to the implementation, so it shouldn't appear in the interface. And I think it's easy to see that we can add whatever set of concrete implementations we want. Or perhaps don't provide any in the beginning, leaving that up to the community. Or put the base interface into the standard library (so we can use it in our APIs) but then put a bunch of concrete, experimental implementations into x/exp and see which of them get adopted to ultimately get promoted into the stdlib.
Comment From: rsc
Option 4 seems like the right answer for a hash table with custom equality: it needs Hash(T) and Equal(T, T), and the underlying Hash should absolutely be seeded, but you can just instantiate each hash table with a pre-seeded Hasher.
It seems like the interface belongs more in the hash-table package than it does in package hash, which is not really about hash tables. It would be more like sort.Interface defining the interface it needs.
So maybe we have the first part of the hash table package: the interface it will require for processing type T.
Comment From: rsc
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group
Comment From: aclements
@Merovius , I worry that entirely taking any seed out of Hasher like in your proposed
type Hasher[T any] interface{
Hash(T) uint64
Equal(T, T) bool
}
makes it far too tempting for people to write bad hash functions that either aren't seeded at all, or don't use a per-instance hash. I'd much rather make it hard for people to ignore seeding.
Not to change maphash.Seed to be transformable into a bunch of different types.
I had another thought on this. What if we forced hashers to take a maphash.Seed, but simply didn't provide any new methods to transform it into other things? That would force most hashers to ultimately dispatch to our high-quality hash implementations in maphash. My sense is that what people really want here is not to define wildly different hash functions, but just to customize a bit which fields, and how deep in a structure, equality and hashing go.
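A hedged sketch of that pattern under the Hash(Seed, T) shape discussed above: the hasher chooses which fields participate (here a cache field is deliberately skipped), while all of the actual mixing is delegated to maphash.

type User struct {
    Name  string
    Email string
    cache map[string]any // excluded from hashing and equality
}

type UserHasher struct{}

func (UserHasher) Hash(seed maphash.Seed, u User) uint64 {
    var mh maphash.Hash
    mh.SetSeed(seed)
    mh.WriteString(u.Name)
    mh.WriteString(u.Email)
    return mh.Sum64()
}

func (UserHasher) Equal(a, b User) bool {
    return a.Name == b.Name && a.Email == b.Email
}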
Comment From: aclements
It seems like the interface belongs more in the hash-table package than it does in package hash, which is not really about hash tables.
I'm not sure I understand this argument. Hashes are useful for lots of things, including other implementations of hash tables and data types that are totally unlike hash tables. To me, this seems philosophically more aligned with maphash (give me a hash and I'll do something with it) than a concrete container/unordered type (here's how you provide a hash for this particular implementation of this particular data type).
Comment From: adonovan
That would force most hashers to ultimately dispatch to our high-quality hash implementations in maphash.
maphash.Hash doesn't support integers and floats, only strings. Perhaps it should support all the elementary types, both for efficiency, and to avoid mistakes coercing a number-shaped peg into a string-shaped hole.
Comment From: mateusz834
@adonovan #54670 ?
// WriteComparable adds x to the data hashed by h.
func WriteComparable[T comparable](h *Hash, x T)
Comment From: adonovan
Thanks, I had forgotten about that.
Comment From: Merovius
@aclements
I had another thought on this. What if we forced hashers to take a maphash.Seed, but simply didn't provide any new methods to transform it into other things?
If anything, it should probably take a *maphash.Hash then. Especially if we are talking about the possibility of needing to hash recursively.
Comment From: prattmic
@Merovius , I worry that entirely taking any seed out of Hasher like in your proposed
type Hasher[T any] interface {
Hash(T) uint64
Equal(T, T) bool
}
makes it far too tempting for people to write bad hash functions that either aren't seeded at all, or don't use a per-instance hash. I'd much rather make it hard for people to ignore seeding.
To add to this, in the absence of clear documentation, I think it is a bit unclear where this interface should be implemented. i.e., I think some people would mistakenly implement this directly on the type you want to hash:
type Foo struct { ... }
func (f Foo) Hash(f2 Foo) {}
func (f Foo) Equal(f1, f2 Foo) {}
Hopefully the extra argument (why does Hash have an argument and receiver?) would clue people in. Still, if someone did this, then they would almost certainly not have a per-instance seed as that doesn't really make sense in this context at all.
Comment From: aclements
If anything, it should probably take a *maphash.Hash then. Especially if we are talking about the possibility of needing to hash recursively.
That's a great point about recursive hashing. To reiterate what I believe @Merovius is getting at: hashes are often built up recursively, mirroring the type hierarchy. If custom hashers take a maphash.Seed, the custom hasher is going to have to create a maphash.Hash to use that seed. That alone isn't a problem, but if that hasher has to call another custom hasher, the only thing it can pass is the same Seed that was passed to it. The child hasher will have to create another maphash.Hash, with the same seed, and ultimately return the Hash.Sum64() value from it, which the parent hasher will then have to mix into its Hash. That's really wordy, so here's a code example:
type T1 struct { a string; b T2 }
type T2 struct { c string }
type H1 struct {} // T1's hasher
func (H1) Hash(seed maphash.Seed, val T1) uint64 {
var mh maphash.Hash // Turn seed into a maphash
mh.SetSeed(seed)
mh.WriteString(val.a)
// We pass the same seed because we have no other choice,
// and then fold the uint64 into our maphash.
maphash.WriteComparable(&mh, H2{}.Hash(seed, val.b))
return mh.Sum64()
}
type H2 struct {} // T2's hasher
func (H2) Hash(seed maphash.Seed, val T2) uint64 {
var mh maphash.Hash // Turn seed into a maphash
mh.SetSeed(seed)
mh.WriteString(val.c)
return mh.Sum64() // Turn the maphash into a uint64
}
It would be better to avoid all of this back and forth and just pass a *maphash.Hash
, which can easily be passed down to child hashers.
@aclements option 2b
Modifying my "option 2" from above to take a maphash.Hash
looks like:
type Hasher[T any] interface {
Hash(*maphash.Hash, T)
Equal(T, T) bool
}
// Example hash table
type HashTable[T any, Hash Hasher[T]] struct {
hash Hash // Probably zero-sized
seed maphash.Seed
...
}
With this, the equivalent of my above code example looks like:
type T1 struct { a string; b T2 }
type T2 struct { c string }
type H1 struct {} // T1's hasher
func (H1) Hash(mh *maphash.Hash, val T1) {
mh.WriteString(val.a)
H2{}.Hash(mh, val.b)
}
type H2 struct {} // T2's hasher
func (H2) Hash(mh *maphash.Hash, val T2) {
mh.WriteString(val.c)
}
This seems much nicer to me!
Unfortunately, today, this would force the maphash.Hash passed to Hasher.Hash to escape, because we don't stencil the methods of HashTable based on the concrete type of Hash. Ideally, each get/put would declare a local maphash.Hash to pass into the Hash method, but today this would mean an allocation on every get/put. Alternatively, HashTable itself could store a maphash.Hash instead of a Seed, but that's 152 bytes (compared to 8 for Seed). Or you use a sync.Pool, which is also ugly. This is all a form of #48849.
I don't believe this is a fundamental limitation. I spent a while banging out potential solutions with @cherrymui, and we came up with a few options around either including escape information in the GC shape used for stenciling, or doing a runtime form of "conditional escape" by putting escape information into the generic dictionary. None of these solutions are easy, but they don't seem super hard, and fixing this would lift a lot of boats.
I am loathe to warp basic interfaces like this to the current limitations on our tools, especially when we see a path to fixing those limitations. Though I also recognize that it's frustrating to block one thing on another.
@aclements option 3b
I don't think passing the *maphash.Hash makes sense with the other approaches I proposed above.
type Hasher[T any] interface {
Hash(T) uint64
Equal(T, T) bool
Reset(*maphash.Hash)
}
Here, Reset would have to store the maphash.Hash in the Hasher, which would probably cause it to escape at a much deeper level. It also means a Hasher cannot possibly be thread-safe.
@aclements option 4b
type Hasher[T any] interface {
Hash(T) uint64
Equal(T, T) bool
}
type NewHasher[T any, Hash Hasher[T]] func(*maphash.Hash) Hash
func NewHashTable[T any, Hash Hasher[T]](newHasher NewHasher[T, Hash]) *HashTable[T, Hash] {
var mh maphash.Hash
mh.SetSeed(maphash.MakeSeed())
h := newHasher(&mh)
return &HashTable[T, Hash]{h: h, ...}
}
There are some variations on this, but no matter what, the 152-byte maphash.Hash has to live as long as the HashTable. And like option 3, a Hasher could not possibly be thread-safe.
@Merovius parameterized Seed option
I think @Merovius 's parameterized Seed runs into a lot of similar problems. There, it's up to the Hasher to store either a maphash.Seed or a maphash.Hash or something else entirely. But because the seed state is in the Hasher itself, if you have a hierarchy of hashers, the parent Hasher is going to have to store the Hashers for all of its children, recursively. That makes the size of a Hasher implementation proportional to the total size of the tree of types under it.
If you implement a hash table on this, all of this seed state has to persist as long as the hash table itself. Even if the Hasher stores just a maphash.Seed, each call to a Hash method, including recursive ones, still has to create its own maphash.Hash from that.
I'm not positive, but I think @Merovius 's solution does avoid the escape problem. If a hash table type is parameterized over the Hasher type, then we will stencil that type definition. I think we might still consider the method calls to escape their receiver, but that would just force a one-time heap allocation of the whole hash table object, not a per-get/put heap allocation.
Comment From: adonovan
I like option 2b; I don't mind waiting for the compiler to catch up before we can do it. This approach will elevate the rather obscure *maphash.Hash type to a role of great prominence across our APIs, but perhaps that's fine. Will we need to add Write methods to it for all the basic types?
I agree that 3b (Reset) introduces state where it needn't exist.
How immovable an obstacle is the 152-byte size of the maphash? Is a buffer smaller than 128 bytes much less efficient? Or is the point that rthash in two chunks delivers a different result from calling it in one chunk?
Comment From: aclements
How immovable an obstacle is the 152-byte size of the maphash? Is a buffer smaller than 128 bytes much less efficient? Or is the point that rthash in two chunks delivers a different result from calling it in one chunk?
I believe the smallest possible buffer is sizeof uintptr. That would still leave us with 32 bytes, and would certainly be much less efficient, though I didn't see a benchmark specifically for this.
Comment From: aclements
If we have an interface like option 2b, maphash
could also provide a trivial default implementation that uses the built-in comparable hashing:
type Default[T comparable] struct {}
func (Default[T]) Hash(h *Hash, v T) {
WriteComparable(h, v)
}
func (Default[T]) Equal(a, b T) bool {
return a == b
}
Comment From: aclements
@adonovan is going to try writing a CL showing how go/types would use this interface to prove it out.
Comment From: gopherbot
Change https://go.dev/cl/657297 mentions this issue: go/types: Hasher, a hash function for Types
Comment From: gopherbot
Change https://go.dev/cl/657296 mentions this issue: hash/maphash: add Hasher interface and Default implementation
Comment From: adonovan
See https://go.dev/cl/657296 for maphash.Hasher and https://go.dev/cl/657297 for types.Hasher.
The implementation was largely straightforward; most of the complexity is intrinsic to the problem, as witnessed by the x/tools/go/types/typeutil.Hash function from which this code was derived. However:
- Hashing an unordered set (such as Interface.Methods) is a little trickier. You need to either (a) sort the elements, which requires allocation, or (b) create a sub-Hash for each element of the set, combine all the sub-Hashes' Sum64 values using XOR, and then write that value into the main Hash (sketched below). A non-obvious detail is the need to set the Seed of the sub-Hash to that of the parent Hash.
- Writing bytes into the Hash is nice and easy: h.WriteByte(b). Writing other integers is more of a mouthful: maphash.WriteComparable(h, i). The lack of generic methods or lambdas means you can't write a short helper method or function. I used WriteByte(byte(i)) frequently, and then justified this as an optimization, since in nearly all cases in go/types the values need no more than 8 bits, and a hash function is free to be lossy.
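A hedged sketch of approach (b) for hashing an unordered collection, using only the existing maphash API; the essential detail is that each sub-Hash must be given the parent's Seed so that the combined value remains a pure function of that seed.

// hashUnorderedStrings folds an unordered collection into mh.
// XOR is commutative, so element order cannot affect the result.
func hashUnorderedStrings(mh *maphash.Hash, elems []string) {
    var acc uint64
    for _, e := range elems {
        var sub maphash.Hash
        sub.SetSeed(mh.Seed()) // non-obvious but essential: reuse the parent's seed
        sub.WriteString(e)
        acc ^= sub.Sum64()
    }
    maphash.WriteComparable(mh, acc)
}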
Comment From: aclements
A non-obvious detail is the need to set the Seed of the sub-Hash to that of the parent Hash.
We can give a documentation example of "how to hash a set".
We just added Clone to the cryptographic hashes. We could do the same for maphash.Hash. It's maybe slightly more expensive than just cloning the seed, but it seems more general.
Writing bytes into the Hash is nice and easy: h.WriteByte(b). Writing other integers is more of a mouthful: maphash.WriteComparable(h, i).
You could write a local generic function with a short name that does maphash.WriteComparable, or we could add more WriteXXX methods to maphash.Hash. That can't cover named types, but it could cover all built-in numeric types.
These both seem like quality of life improvements we could make down the road, so I see this as evidence that this approach is pretty solid.
Comment From: aclements
Have all remaining concerns about this proposal been addressed?
The proposal is to add the following API definition to package maphash:
// A Hasher is a type that implements hashing and equality for type T.
//
// A Hasher must be stateless. Hence, typically, a Hasher will be an empty struct.
type Hasher[T any] interface {
// Hash updates hash to reflect the contents of value.
//
// If two values are [Equal], they must also Hash the same.
// Specifically, if Equal(a, b) is true, then Hash(h, a) and Hash(h, b)
// must write identical streams to h.
Hash(hash *Hash, value T)
Equal(T, T) bool
}
As an example, some simple hashers could look like:
type Strings []string
type StringsHasher struct{} // Hasher for Strings
func (StringsHasher) Hash(mh *maphash.Hash, val Strings) {
for _, s := range val {
mh.WriteString(s)
}
}
func (StringsHasher) Equal(a, b Strings) bool {
return slices.Equal(a, b)
}
type Thing struct {
a Strings
b int
}
type ThingHasher struct {} // Hasher for Thing
func (ThingHasher) Hash(mh *maphash.Hash, val Thing) {
StringsHasher{}.Hash(mh, val.a)
maphash.WriteComparable(mh, val.b)
}
func (ThingHasher) Equal(x, y Thing) bool {
return x.b == y.b && StringsHasher{}.Equal(x.a, y.a)
}
A simple custom hash-based set could use this interface as follows:
// Example hash-based set
type Set[T any, Hash Hasher[T]] struct {
hash Hash // Probably zero-sized
seed maphash.Seed
data []T
...
}
func (s *Set[T, Hash]) Has(val T) bool {
var mh maphash.Hash
mh.SetSeed(s.seed)
s.hash.Hash(&mh, val)
offset := mh.Sum64()
for base := range s.data {
i := (uint64(base) + offset) % uint64(len(s.data))
// ... break if this slot is empty ...
if s.hash.Equal(val, s.data[i]) {
return true
}
}
return false
}
var s Set[Thing, ThingHasher]
Comment From: aclements
One option we didn't really explore was that types could provide their own Hash and Equal:
type Hashable[T any] interface {
Hash(hash *maphash.Hash)
Equal(T) bool
}
This would be used like:
type Set[T Hashable[T]] struct {
seed maphash.Seed
data []T
// ...
}
And types would implement this like:
type Strings struct {
xs []string
}
func (s Strings) Hash(hash *maphash.Hash) {
for _, s := range s.xs {
hash.WriteString(s)
}
}
func (s Strings) Equal(t Strings) bool {
return slices.Equal(s.xs, t.xs)
}
var stringSet Set[Strings]
Full example here.
This is strictly less flexible than the Hasher
interface because you can't plug in different types of hashing/equality, or supply hashing/equality for a type you don't control. Many other languages have exactly this constraint, but they also had user-written hashing and equality from the beginning, making it an expectation that types would provide hashing and equality where it makes sense.
It would be easy to adapt a Hashable
type to Hasher
:
type HashableHasher[T Hashable[T]] struct {}
func (HashableHasher[T]) Hash(hash *maphash.Hash, value T) {
value.Hash(hash)
}
func (HashableHasher[T]) Equal(a T, b T) bool {
return a.Equal(b)
}
However, if containers are all based around Hasher
, it's kind of annoying to use: Set[Strings, HashableHasher[Strings]]
. (Too bad we can't specify defaults for type parameters. 😄)
Given the inflexibility of Hashable
, I think just going with Hasher
is the right approach, but I wanted to put this out there.
Comment From: adonovan
Specifying the relation outside the type is also consistent with Go's approach to sorting.
Comment From: aclements
Based on the discussion above, this proposal seems like a likely accept. — aclements for the proposal review group
The proposal is to add the following API definition to package maphash:
// A Hasher is a type that implements hashing and equality for type T.
//
// A Hasher must be stateless. Hence, typically, a Hasher will be an empty struct.
type Hasher[T any] interface {
// Hash updates hash to reflect the contents of value.
//
// If two values are [Equal], they must also Hash the same.
// Specifically, if Equal(a, b) is true, then Hash(h, a) and Hash(h, b)
// must write identical streams to h.
Hash(hash *maphash.Hash, value T)
Equal(T, T) bool
}
As an example, some simple hashers could look like:
type Strings []string
type StringsHasher struct{} // Hasher for Strings
func (StringsHasher) Hash(mh *maphash.Hash, val Strings) {
for _, s := range val {
mh.WriteString(s)
}
}
func (StringsHasher) Equal(a, b Strings) bool {
return slices.Equal(a, b)
}
type Thing struct {
ss Strings
i int
}
type ThingHasher struct{} // Hasher for Thing
func (ThingHasher) Hash(mh *maphash.Hash, val Thing) {
StringsHasher{}.Hash(mh, val.ss)
maphash.WriteComparable(mh, val.i)
}
func (ThingHasher) Equal(a, b Thing) bool {
if a.i != b.i {
return false
}
return StringsHasher{}.Equal(a.ss, b.ss)
}
A simple custom hash-based set could use this interface as follows:
// Example hash-based set
type Set[T any, Hash Hasher[T]] struct {
hash Hash // Probably zero-sized
seed maphash.Seed
data []T
// ...
}
func (s *Set[T, Hash]) Has(val T) bool {
var mh maphash.Hash
mh.SetSeed(s.seed)
s.hash.Hash(&mh, val)
offset := mh.Sum64()
for base := range s.data {
i := (uint64(base) + offset) % uint64(len(s.data))
// ... break if this slot is empty ...
if s.hash.Equal(val, s.data[i]) {
return true
}
}
return false
}
var s Set[Thing, ThingHasher]
Comment From: mvdan
@aclements minor typo in the interface godoc - "write identical streams to h" - the parameter is hash, not h.
Comment From: aclements
Hmm. I think I did mean h, as in the argument to Hash(h, a) and Hash(h, b), not the parameter hash. But I think either way is fine if that seems confusing.
Comment From: mateusz834
I am wondering whether Hasher[T any] should instead be Hasher[T, T2 any], just as the slices package does for, say, EqualFunc.
type Hasher[T,T2 any] interface {
Hash(hash *maphash.Hash, value T)
Equal(T, T2) bool
}
There is a concept of Contexts in hash maps that shines in Zig (https://ziglang.org/documentation/0.14.0/std/#std.hash_map.AutoHashMap). In short, you can use a different Context (in Go terms, that would be the same thing as a Hasher) in Get/Delete operations. So basically you have a Map with Hasher[T, T2] and you can change the Hasher to, say, Hasher[T3, T2] for Get/Delete. This might sound weird at first glance, but it allows for crazy usages of hash maps; for example, map[uint16]struct{} could be used for DNS name compression (see https://github.com/mateusz834/zig-dns-compression/blob/235006c3686f985a16179fd112acb78ca62f90a9/src/main.zig#L81).
Zig folks somehow managed to use map[struct{}]struct{} with this. (https://github.com/ziglang/zig/blob/6e8493daa3c4c4dc2d6430d5379b058f6b65f297/src/InternPool.zig#L1599)
To be clear, I do not feel like this kind of stuff (with adapted hash maps) fits Go (or at least the std).
But having a Hasher[T, T2 any], say for a Set, sounds like an interesting idea, where you could insert with an X type but query based on Y.
I am still unsure about this, but leaving this here.
Comment From: adonovan
There is a concept of Contexts in hash maps that shines in Zig (https://ziglang.org/documentation/0.14.0/std/#std.hash_map.AutoHashMap). In short, you can use a different Context (in Go terms, that would be the same thing as a Hasher) in Get/Delete operations. So basically you have a Map with Hasher[T, T2] and you can change the Hasher to, say, Hasher[T3, T2] for Get/Delete. This might sound weird at first glance, but it allows for crazy usages of hash maps; for example, map[uint16]struct{} could be used for DNS name compression (see https://github.com/mateusz834/zig-dns-compression/blob/235006c3686f985a16179fd112acb78ca62f90a9/src/main.zig#L81).
Zig folks somehow managed to use map[struct{}]struct{} with this. (https://github.com/ziglang/zig/blob/6e8493daa3c4c4dc2d6430d5379b058f6b65f297/src/InternPool.zig#L1599)
To be clear, I do not feel like this kind of stuff (with adapted hash maps) fits Go (or at least the std).
But having a Hasher[T, T2 any], say for a Set, sounds like an interesting idea, where you could insert with an X type but query based on Y.
Interesting. Zig's hash table apparently allows the client to provide distinct hash/equal operator pairs for use by the "insert" and "get" operations. It's not clear to me whether this is a necessity of Zig's type system (const vs nonconst access to the map data), or a performance hack, perhaps allowing the client to choose a different representation for the key as it is represented within the map. Presumably it also allows you to exploit the fact that key has a bounded lifetime in Get but may escape in Insert, which is often a bottleneck in Go; maphash.Comparable causes its pointer argument to escape.
This all seems excessively complex to me.
Comment From: mateusz834
It's not clear to me whether this is a necessity of Zig's type system (const vs nonconst access to the map data), or a performance hack.
It's a performance hack (I think you can call it a Data Oriented Design approach, ziglang/zig#8619)
This all seems excessively complex to me.
True, it is really hard to reason about what is happening with such uses of hash maps.
Comment From: aclements
No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — aclements for the proposal review group
The proposal is to add the following API definition to package maphash:
// A Hasher is a type that implements hashing and equality for type T.
//
// A Hasher must be stateless. Hence, typically, a Hasher will be an empty struct.
type Hasher[T any] interface {
// Hash updates hash to reflect the contents of value.
//
// If two values are [Equal], they must also Hash the same.
// Specifically, if Equal(a, b) is true, then Hash(h, a) and Hash(h, b)
// must write identical streams to h.
Hash(hash *maphash.Hash, value T)
Equal(T, T) bool
}
As an example, some simple hashers could look like:
type Strings []string
type StringsHasher struct{} // Hasher for Strings
func (StringsHasher) Hash(mh *maphash.Hash, val Strings) {
for _, s := range val {
mh.WriteString(s)
}
}
func (StringsHasher) Equal(a, b Strings) bool {
return slices.Equal(a, b)
}
type Thing struct {
ss Strings
i int
}
type ThingHasher struct{} // Hasher for Thing
func (ThingHasher) Hash(mh *maphash.Hash, val Thing) {
StringsHasher{}.Hash(mh, val.ss)
maphash.WriteComparable(mh, val.i)
}
func (ThingHasher) Equal(a, b Thing) bool {
if a.i != b.i {
return false
}
return StringsHasher{}.Equal(a.ss, b.ss)
}
A simple custom hash-based set could use this interface as follows:
// Example hash-based set
type Set[T any, Hash Hasher[T]] struct {
hash Hash // Probably zero-sized
seed maphash.Seed
data []T
// ...
}
func (s *Set[T, Hash]) Has(val T) bool {
var mh maphash.Hash
mh.SetSeed(s.seed)
s.hash.Hash(&mh, val)
offset := mh.Sum64()
for base := range s.data {
i := (uint64(base) + offset) % uint64(len(s.data))
// ... break if this slot is empty ...
if s.hash.Equal(val, s.data[i]) {
return true
}
}
return false
}
var s Set[Thing, ThingHasher]
Comment From: aclements
Over on #69559, @mateusz834 pointed out that we didn't specify whether the zero value of a Hasher must be usable, which leads to ambiguity when designing APIs around it because it's not clear if the caller must provide a Hasher value or if the type alone is always sufficient.
I think we should specify that the zero value of a Hasher must be usable. Given that Hashers must be stateless and this will almost always (possibly always always) be an empty struct, I think this is a reasonable requirement. Requiring this means that APIs can work solely with the type of a Hasher (e.g., as a type parameter) without having to provide a mechanism for setting the Hasher's value.
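A hedged sketch of the consequence, reusing the Set example from the accepted proposal above: with a usable zero value, a constructor needs only the Hasher's type, and there is no value for a caller to get wrong (NewSet is hypothetical).

// The Hasher is supplied purely as a type argument; its zero value is used.
func NewSet[T any, H Hasher[T]]() *Set[T, H] {
    return &Set[T, H]{
        // The hash field is left as the zero value of H (typically an
        // empty struct), which is ready to use; only the seed needs
        // initializing.
        seed: maphash.MakeSeed(),
    }
}

var things = NewSet[Thing, ThingHasher]()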
Comment From: adonovan
There are essentially three choices for a hash table:
- Obtain the hash function and equivalence relation (hereafter "Hash") from the values, as in java.lang.Object. This is what Go's built-in maps do, but only for a non-extensible subset of types.
- Obtain the Hash from the table, as a dynamic value stored in a field. This is what the current draft of hash.Map proposes. However, it requires dynamic calls through the interface value and, more importantly, defeats escape analysis, so that every Get(k) operation causes k to escape.
- Obtain the Hash from the type of Map[K,V,H], as in your comments above. This eliminates the dynamic call and permits the key of Get(k) to be stack allocated.
The downsides of option 3, however, are that:
(a) it requires that the Hash function be not merely stateless but a unitary type (struct{});
(b) it entails a different stencil of the generic function for every usage, increasing executable size and i-cache working set; and
(c) it requires the Hash type to be mentioned even when it is irrelevant, for example, in function parameters.
All three of these are, I think, instances of what @ianlancetaylor calls "programming with types". I am reminded of C++'s STL, in which the type of string is std::basic_string<element_type, char_traits, allocator_type>. At least in C++ it is possible, indeed normal, to omit the secondary type parameters because sensible defaults can be inferred; but that is not the case here. The type of hash.Map[K,V,H] carries around an undesirable third parameter that must be explicitly stated. If you assign it to a hypothetical abstract Map[K, V] interface and refer to it through this type thereafter, then you can do away with the third parameter, but that of course reintroduces both the dynamic call and the escape of Get(k) that we were trying to eliminate.
This raises a potential solution: define, along with Hash and hash.Map[K,V,H], an abstract Map[K,V] interface, so that most users will refer to hash.Map via the abstract type and pay the additional costs, but motivated performance-sensitive clients can use Map[K,V,H] directly. I'm not particularly enthusiastic about it though: we should be driven by actual use cases, and my primary motivation is where K=types.Type and the Type has already escaped. If others have different needs, they should speak up.
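A hedged sketch of that potential solution; all names are hypothetical, and types.TypeHasher stands for whatever type would implement the Hasher for types.Type.

// Abstract view of a map that erases the Hasher type parameter.
type Map[K, V any] interface {
    Get(K) (V, bool)
    Set(K, V)
    Len() int
}

// Performance-sensitive code names the concrete three-parameter type...
var fast *hash.Map[types.Type, ast.Expr, types.TypeHasher]

// ...while everything else traffics in the two-parameter interface,
// accepting the dynamic calls and escapes discussed above.
func report(m Map[types.Type, ast.Expr]) int { return m.Len() }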
Comment From: mateusz834
The downsides of option 3, however, are that: (a) it requires that the Hash function be not merely stateless but a unitary type (struct{});
Why? You can still use the H type param. I mentioned that here: https://github.com/golang/go/issues/69559#issuecomment-2810355972
Comment From: mateusz834
This raises a potential solution: define, along with Hash and hash.Map[K,V,H], an abstract Map[K,V] interface, so that most users will refer to hash.Map via the abstract type and pay the additional costs, but motivated performance-sensitive clients can use Map[K,V,H] directly
I wonder whether a type-alias could help here somehow:
type MapStatic[K, V any, H Hasher[K]] struct{}
type Map[K, V any] = MapStatic[K, V, Hasher[K]] // uses interface type
Comment From: adonovan
Why? You can still use the H type param.
You're right; sorry.
I wonder whether a type-alias could help
It doesn't solve the escape problem: the alias is just a short way of writing MapStatic[K, V, Hasher[K]], whose implementation has H equal to an interface type, so Get would require dynamic calls and conservative heap allocation.
Of course, you can always define a specific alias for one particular hasher of interest. In my case this is types.Type and its hash function:
package types
type HashMap[V any] = hash.Map[Type, V, TypeHash]
but then you can't pass a types.HashMap to a func f[K comparable, V any](m *hash.Map[K,V]); indeed, you can't even define f this way, without doing one of two things:
(i) adding a third type parameter H to f (meaning the alias hasn't really hidden anything from us), or
(ii) changing the type of f's parameter m to an abstract Map[K, V] interface type, adding a(nother) level of dynamic indirection, which is what we were aiming to avoid.
Comment From: mateusz834
It doesn't solve the escape problem: the alias is just a short way of writing MapStatic[K, V, Hasher[K]], whose implementation has H equal to an interface type, so Get would requires dynamic calls and conservative heap allocation.
Yes, I didn't mean to solve that with this; it is just an idea that we can have two map types without having two separate structs.
but then you can't pass a types.HashMap to a func f[K comparable, V any](m *hash.Map[K,V]); indeed, you can't even define f this way, without doing one of two things.
MapStatic[K, V, H] could have a method to convert it to a Map[K, V].
Nevertheless, an interface solves the same problem. But now as I think about it, the type-alias approach allows us to add type Map[K, V any] struct{...} and then (if needed) we could later add type MapStatic[K, V any, H Hasher[K]]; Map[K, V] could then be just an alias over MapStatic.
Comment From: aaronbee
we should be driven by actual use cases, and my primary motivation is where K=types.Type and the Type has already escaped. If others have different needs, they should speak up.
I agree that in practice the key type will have already escaped for most uses of hash.Map. In my experience, the reasons to use something like hash.Map over the built-in map are usually around a large, gnarly key type that you pass around by pointer.
For example, the key is a map, or a struct containing a map or slice. Or it's a large struct that you don't want a copy of inside the map. Counterexamples would be using a slice as a key, especially []byte, where the header would escape.
Comment From: DmitriyMV
@aaronbee
Only the byte array under the slice pointer will escape, since the slice header is copied during the Get(K) call where K is []byte{...}.
One interesting thing is that if hash.Hasher is used as an interface value, it means that *maphash.Hash in hash.Hasher.Hash always escapes, even if the key is int or float64. So the data structure has to account for that (by using a sync.Pool of *maphash.Hash, perhaps)?
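A hedged sketch of that sync.Pool workaround, assuming a map type that stores a dynamic Hasher[K] value; the m.hasher and m.seed fields are assumptions of the sketch, and the maphash calls are the real API.

var hashPool = sync.Pool{
    New: func() any { return new(maphash.Hash) },
}

func (m *Map[K, V]) hashKey(k K) uint64 {
    mh := hashPool.Get().(*maphash.Hash)
    mh.SetSeed(m.seed) // SetSeed also discards any previously written data
    m.hasher.Hash(mh, k)
    sum := mh.Sum64()
    hashPool.Put(mh)
    return sum
}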
Comment From: Merovius
@adonovan Option 1 also eliminates dynamic calls and doesn't require you to carry around the hash as a type parameter. I think the only reason that we are predisposed against that is that it requires you to define a new type to override the hash (and as a special case, it doesn't work with predeclared types). Personally, I'm not convinced that downside outweighs the downsides of the alternative approaches.
Comment From: adonovan
[@Merovius] Option 1 also eliminates dynamic calls and doesn't require you to carry around the hash as a type parameter.
It's true that it doesn't require you to specify the hash as an additional type parameter, but it does require you to encode the hash in the existing type parameter K. So, for types.Type, you need to declare a wrapper type HashableType that essentially combines the concepts of K=types.Type and H=types.Hash, and then explicitly wrap/unwrap this type around every operation:
type Hashable[K any] interface {
Hash(*maphash.Hash)
Equals(K) bool
}
type Map[K Hashable[K], V any] struct{}
func (m *Map[K, V]) Get(k K) (_ V) { k.Hash(nil); k.Equals(k); return }
type HashableType struct{ t types.Type }
func (ht HashableType) Hash(*maphash.Hash) { types.Hash(ht.t) }
func (x HashableType) Equals(y HashableType) bool { return types.Identical(x.t, y.t) }
func main() {
var m Map[HashableType, int]
var t types.Type
_ = m.Get(HashableType{t})
}
So it's good to explore it for completeness, but I don't think it has any real advantage over option 3.
Comment From: cr7pt0gr4ph7
I like that option 1 encapsulates everything that is related to keys and their comparison in a single type parameter:
* Just having two type parameters instead of three might make code a bit easier to read for the uninitiated reader.
* Also, if a value of some type K is equal in one context, it will be considered equal in all other contexts (e.g. when transferring map keys to a set) without having to care about anything else. Transitioning to a different equality regime is clearly marked and explicit.
* Wrapping a type to implement an additional interface is a well-known idiom in Go.
* Authors of generic types that wrap such maps cannot "forget" to expose the corresponding Hasher type parameters.
A few questions came to my mind related to both option 1 & 3:
* What should happen if K is an interface type and nil is passed for the key? Especially given that nil values may or may not have a dynamic type (i.e. vtable) stored.
* Is changing the notion of equality used by an exported map value a breaking change?
* Is changing the hash algorithm but keeping equality relations the same on an exported map value a breaking change?
Comment From: cr7pt0gr4ph7
Also, another afterthought:
While equally powerful on a technical level, the two options will likely tend to steer API design in a certain direction (for types that you control).
Option 1:
- Library authors will likely tend to implement the equality algorithm they need on the type itself.
- There's no way to expose a type without exposing its equality methods (disregarding private wrapper types, of course).
- Having one kind of equality be implemented on the type itself, and others separately, will likely cause library users to perceive the former as a kind of "default".
- There's a heightened risk of embedding a type but only overriding Equals and not Hash.
- There's the potential for confusion about whether the built-in map uses the Equals method or not (it doesn't, but that might not be clear to Go newcomers).
Option 3:
- There's no hierarchy between different kinds of equality for a type.
- It's more easily possible to parameterize a Hasher for some struct on Hashers for its component fields.
Comment From: Merovius
@adonovan I think the critical part of that example is that types.Type is an interface type, so you can't just add an extra method to it (which you could for non-interface types), correct? But what's more, types.Type is a relatively special interface type, as it is really conceptually a closed sum - so we could add a method to all the concrete implementations and then do an interface type-assertion (or arguably even embed hash.Hashable into types.Type, due to its closed nature. But probably not).
For everyday interfaces (like io.Reader) it really doesn't make sense to wrap them, as you don't know if all implementations can be hashed. And for concrete types, you can add a method to the type proper without problems.
I don't think it has any real advantage over option 3.
The advantages are IMO pretty real - in the common case. They don't work for a relatively special case like types.Type, but I don't think we should really make a relatively fundamental decision like this based on a special case like that.
Comment From: cr7pt0gr4ph7
I was originally in favor of option 1 due to its simplicity, but am now in favor of option 3 because it avoids boilerplate for a likely very common case that hasn't been mentioned yet AFAICT:
Given some library-defined data structure that requires a Hasher/Hashable, option 3 makes it easily possible to provide a hasher.Default[T comparable] (likely based on maphash.Comparable) to user code with zero additional boilerplate.
Option 1 would instead require users to either write boilerplate (*MyType).Equals/Hash methods, or wrap their keys into a hasher.DefaultHashable[T comparable].
It therefore seems to me that option 3 is likely preferable.
It still makes sense to also provide a hasher.Hashable[T] interface (as well as a hasher.FromHashable[T Hashable[T]] implementation), albeit as an exception rather than the norm, as users will have to write such a thing themselves anyway when they are creating class hierarchies through struct embedding + interfaces.
Comment From: aclements
I'm putting this back on the docket.
For option 3 (using maphash.Hasher as a dynamic interface), I'm not so concerned about the key escaping. I'm concerned that the maphash.Hash will escape, causing an allocation on every single Get or Set call. An implementation could use something like a pool of maphash.Hashes, but I don't like an API where the obvious default use is expensive. @adonovan 's going to measure the cost of this, but I can't imagine it will be low enough that this will seem acceptable.
Option 1 (a type is its own hasher) has a lot of appeal. That's how lots of other languages do this. Lots of types are only going to have one reasonable definition of equality, in which case it's better to bundle it with the type than to require the user to find some second type that goes with the first type. However, it requires a "wrapper" type in three situations:
- You don't control the type.
- It's an interface type and you can't change the interface definition (or need to deal with nil values).
- There's more than one equivalence relation.
I think we've been pursuing a separate hasher type because we haven't had this until now, so it seems likely people will want custom equality for a lot of types they don't control, and because this version of this discussion was driven by hashing go/types.Type, which is both an interface type and has more than one equivalence relation. My sense is that problem 1 will fade over time, and that problems 2 and 3 will be relatively rare.
Option 1 requires a different API, like what @adonovan gave above:
// Hashable types provide a custom equality predicate and accompanying hash function.
type Hashable[T any] interface {
// Hash updates hash to reflect the contents of this value.
//
// If two values are [Equal], they must also Hash the same.
// Specifically, if Equal(a, b) is true, then Hash(h, a) and Hash(h, b)
// must write identical streams to h.
Hash(hash *maphash.Hash)
Equal(T) bool
}
Updating my example from above, a Hashable type would look like:
type Strings []string
func (ss Strings) Hash(mh *maphash.Hash) {
for _, s := range ss {
mh.WriteString(s)
}
}
func (ss Strings) Equal(b Strings) bool {
return slices.Equal(ss, b)
}
type Thing struct {
ss Strings
i int
}
func (t Thing) Hash(mh *maphash.Hash) {
t.ss.Hash(mh)
maphash.WriteComparable(mh, t.i)
}
func (t Thing) Equal(b Thing) bool {
return t.i == b.i && t.ss.Equal(b.ss)
}
And an example hash-based collection would look like:
// Example hash-based set
type Set[T Hashable[T]] struct {
seed maphash.Seed
data []T
// ...
}
func (s *Set[T]) Has(val T) bool {
var mh maphash.Hash
mh.SetSeed(s.seed)
val.Hash(&mh)
offset := mh.Sum64()
for base := range s.data {
i := (uint64(base) + offset) % uint64(len(s.data))
// ... break if this slot is empty ...
if val.Equal(s.data[i]) {
return true
}
}
return false
}
var s Set[Thing]
Comment From: aclements
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — aclements for the proposal review group
Comment From: ianlancetaylor
For what it's worth, I don't think it's desirable to require types to implement specific methods in order to support hashing. That is, of course, what we did for sort.Interface, and it led to a lot of boilerplate code for sorting. Adding sort.Slice and slices.SortFunc was a big step forward. See also https://go.dev/blog/when-generics#for-type-parameters-prefer-functions-to-methods.
I'm not sure how seriously we should take the issue of escaping. It seems to me that it only arises for structs where the struct is stored in the hash map by value. We can expect those structs to be reasonably small. Since the hash function/method will presumably take arguments by value, the structs as such don't escape. The only thing that can escape is when the struct contains pointers/slices/strings that otherwise would not escape. Is that really a common scenario?
Comment From: adonovan
@aclements wrote:
You could write a local generic function with a short name that does maphash.WriteComparable, ...
Local functions cannot be named, and thus cannot be generic. But see https://github.com/golang/go/issues/73502.
Comment From: adonovan
I agree with @ianlancetaylor that the methods should not be in the values (option 1), and I already argued against "programming with types" (option 3). Option 2 has regrettable performance implications (@aclements points out that it's not just the key that escapes, but the likely much larger maphash.Hash state). Since that's an empirical question I wrote a quick benchmark of three different map implementations:
- the built-in map[types.Type]ast.Expr;
- the proposed hash.Map[types.Type, ast.Expr], with dynamic dispatch;
- the alternative hash.Map[K, V, H], with a statically known Hasher.
The results:
BenchmarkMap/builtin_map-8 148677 7325 ns/op 9032 B/op 9 allocs/op
BenchmarkMap/dynamic_Map-8 60730 19742 ns/op 25391 B/op 201 allocs/op
BenchmarkMap/static_Map-8 76262 15391 ns/op 7606 B/op 90 allocs/op
Looking at profiles, nearly all the allocation in 'static' is appending entries to the bucket. A smarter implementation (possibly the runtime's swissmap?) would really help to bring this down. But looking at 'dynamic', about 70% of the allocation is indeed the escaping local var maphash.Hash in Map.hash, and this is not going to be reduced by a smarter map. So I think @aclements has a point. There still remains the question of whether the performance of this map matters at all.
Comment From: Merovius
@adonovan This isn't just about adding a map for go/types. It's about standardizing the idiom for hashable types. It is likely what third-party package authors will rely on if they implement their own hash-based data structures. Which they will do, because they have performance-sensitive use cases not well served by the standard library (a sync.Map with different tradeoffs more suited to other workloads comes to mind). Ultimately, third-party containers were probably the most important motivation for adding generics.
I think paying attention to performance is important here.
Comment From: ianlancetaylor
Thanks for the reminder of the actual problem.
Since we are assuming maphash.Hash here, what if the hash/maphash package adds a pool of maphash.Hash values, such that you can get a value by calling maphash.PooledHash(maphash.Seed) *maphash.Hash? Perhaps a value of this type would be automatically returned to the pool after a call to the Sum or Sum64 method. This would essentially replace
var mh maphash.Hash
mh.SetSeed(s.seed)
s.hash.Hash(&mh, val)
with
mh := maphash.PooledHash(s.seed)
s.hash.Hash(mh, val)
Comment From: adonovan
A pool could solve the allocation problem, but there's an awful lot of explanation to give to the novice who asks "how do I add a hashCode method to my type?"
Comment From: mateusz834
Since we are assuming maphash.Hash here, what if the hash/maphash package adds a pool of maphash.Hash values, such that you can get a value by calling maphash.PooledHash(maphash.Seed) *maphash.Hash. Perhaps a value of this type would be automatically returned to the pool after a call to the Sum or Sum64 method. This would essentially replace
The maphash.Hash could be stored directly in the Map struct, and then the map can Reset() it before it is used in the Hasher.
Comment From: adonovan
The maphash.Hash could be stored directly in the Map struct, and then the Map can Reset() it before it is used in the Hasher.
That's not concurrency safe.
Comment From: mateusz834
That's not concurrency safe.
That is true, but (I assume) the Map from #69559 is not safe for concurrent use, same as the built-in map. I would also assume that the final version of it would be similar to the runtime swiss map (which has a writing field used to detect concurrent uses).
Comment From: adonovan
That is true, but (i assume) the Map from https://github.com/golang/go/issues/69559 is not safe for concurrent use. Same as the built-in map.
Not so: as with the built-in map, concurrent reads are safe and expected to be common.
Comment From: mateusz834
Not so: as with the built-in map, concurrent reads are safe and expected to be common.
Ahh, right right. Missed that case.
Comment From: ianlancetaylor
It's possible that this proposal is trying to straddle a couple of cases, making it harder to understand.
One case is providing a convention for how hash tables should hash types. That is the simple, but useful, idea that the hash package would define an interface that hash tables can use. That is approximately what was accepted in https://github.com/golang/go/issues/70471#issuecomment-2773694290.
The other case is thinking about how to use the existing hash/maphash package to build hashes for arbitrary types.
The cases were mixed because the accepted interface takes an argument of type maphash.Seed. That inherently pushes people toward using the hash/maphash package for their hash code. There's actually no reason for that: the hash.Hasher proposal could just as easily use uint64 as the type of the seed. That would let hash tables provide a seed without tying them to hash/maphash. It would also mean that hash tables could use non-random seeds for cases where that is appropriate, such as tables that survive a process, or are shared between processes, or for debugging. That said, a uint64 seed wouldn't work well with hash/maphash as currently defined, as there is no way to convert a uint64 to a maphash.Seed.
Perhaps it will be easier to think about this proposal if we can separate the two issues: 1) a convention for hash tables; 2) using hash/maphash for arbitrary types.
I think the concerns I see expressed here are with using hash/maphash, not with the idea of a standardized hash function, which is how this proposal started.
The main issue I see with using hash/maphash for generalized types is that it is a repetitive, boilerplate process. You have to iterate over the elements of some data structure. If those elements participate in whatever equality function you are using, you have to hash them in a way that matches the equality function. If we assume that we want to avoid reflection and code generation, we ideally want something like
mh := maphash.MakeHasher(seed)
hashField1(mh, v.Field1)
hashField2(mh, v.Field2)
...
return mh.Sum64()
Given the way that hash/maphash works today, I don't think we can do any better. And this is pretty close to how we have to write a custom equality function.
Do you see some other approach?
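For concreteness, here is roughly what the hand-written boilerplate looks like with today's maphash API; the Point type and its fields are invented for illustration:

package example

import "hash/maphash"

type Point struct {
	Name string
	X, Y int
}

// hashPoint writes Point's identity into mh; it must visit exactly the
// fields that the corresponding equality function compares.
func hashPoint(mh *maphash.Hash, p Point) {
	mh.WriteString(p.Name)
	maphash.WriteComparable(mh, p.X)
	maphash.WriteComparable(mh, p.Y)
}

// equalPoint is the matching equality function.
func equalPoint(a, b Point) bool {
	return a.Name == b.Name && a.X == b.X && a.Y == b.Y
}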
Comment From: Merovius
Just to put that out there (I considered saying this earlier, but I'm not sure how much I believe myself): Fundamentally, hashing a type is about collecting all the information relevant to distinguish a value and then mixing it into the hash. That is actually almost identical to the problem of faithfully encoding a value. Almost, because the latter 1. doesn't require determinism and 2. usually tries to make the representation easy to read back, as well.
But this suggests the alternative of using encoding.BinaryAppender (and an interface for equality) as "the interface for hashable types". As the key is a type parameter anyway, the method can be statically dispatched and (hopefully) the passed-in []byte does not escape. Or the table can pull temporary []bytes to encode into from a pool. The encoded binary representation can then be written into whatever hash the table likes. This would both be zero-alloc and avoid repetitive method definitions for types that are already encodable to binary.
The interface wouldn't be terribly precise, because of the differences I mention above. But that could be documented ("the encoding must be deterministic…"). If we are unhappy with that - or are concerned that a type might want to be hashable, but not encodable - we could also replicate the same signature and semantics under a differently named method.
Again, just putting a thought rumbling around in my head out there.
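A minimal sketch of that idea, assuming keys constrained to encoding.BinaryAppender and a caller-supplied scratch buffer (perhaps pulled from a pool):

package example

import (
	"encoding"
	"hash/maphash"
)

// hashViaEncoding mixes a key into mh by way of its binary encoding.
// Per the caveat above, the encoding must be deterministic for equal keys.
func hashViaEncoding[K encoding.BinaryAppender](mh *maphash.Hash, key K, scratch []byte) error {
	b, err := key.AppendBinary(scratch[:0])
	if err != nil {
		return err
	}
	mh.Write(b)
	return nil
}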
Comment From: ianlancetaylor
I agree that encoding.BinaryAppender works fine in many cases. But not all. If we are talking about a general convention for hashing types, we need to support arbitrary cases. But if we are talking about supporting arbitrary types in the maphash package, then BinaryAppender is a good path. One could imagine testing for BinaryAppender (and TextAppender) before falling back to a reflection-based approach.
Comment From: adonovan
@Merovius While I agree that it's conceptually very useful to think of the hash and equal functions as operations on the string produced by a canonical encode function for your data type, in my experience it is rare to implement hash and equal this way except in the simplest of cases, because (a) most encode functions can't be assumed to produce a canonical message, and (b) encoding typically allocates more memory than is necessary.
I tried to capture the analogy in the commentary in https://go.dev/cl/657296, but I think it's no more than an analogy, and I would be quite surprised to find that the presence of an AppendBinary method caused my type to become hashable (or overrode its default reflect-based hashing strategy).
Comment From: aclements
I think what I wrote above still stands:
I think we've been pursuing a separate hasher type because we haven't had this until now, so it seems likely people will want custom equality for a lot of types they don't control, and because this version of this discussion was driven by hashing go/types.Type, which is both an interface type and has more than one equivalence relation.
I think we're overindexing on types.Type. I'd like some examples of places where people currently use custom hashing/equality, to get a sense of how often 1) you don't control the type being hashed, 2) it's an interface type, or 3) there's more than one equivalence relation.
One I remember writing is in x/perf/benchproc.Projection, which has a custom hash table of Keys defined here and used here. In this case, I control the type, it's not an interface, and there's one clear equivalence relation. The hashing is also completely internal, since this is used by an intern table to ensure pointer equality is the same as my custom equality.
Comment From: aclements
For what it's worth, I don't think it's desirable to require types to implement specific methods in order to support hashing. That is, of course, what we did for sort.Interface, and it led to a lot of boilerplate code for sorting. Adding sort.Slice and slices.SortFunc was a big step forward.
That's an interesting parallel. Though I think you're much more likely to need different sort orders than different equivalence relations.
The cases were mixed because the accepted interface takes an argument of type maphash.Seed. That inherently pushes people toward using the hash/maphash package for their hash code. There's actually no reason for that: the hash.Hasher proposal could just as easily use uint64 as the type of the seed.
I disagree that there's no reason for that. Much of the discussion early in this issue assumed some sort of seed would be passed in and discussed no seed vs. a uint64 vs. a maphash.Seed, but we eventually turned toward passing *maphash.Hash because passing around just a seed is really awkward for recursive hashing.
If those elements participate in whatever equality function you are using, you have to hash them in a way that matches the equality function.
You're absolutely right about this, but I'm not sure how to do better.
In a sense, hashing is "decompose this into a stream of basic types and hash that" and equality is "decompose both of these into streams of basic types and compare those." It's really tempting to say that a type just needs to provide a single "decompose this into a stream of basic types" and we can derive hashing and equality from there. But they're different enough that I don't know how to unify them, at least not efficiently. In hashing, you're only walking one value, but in equality you're walking two values. And of course you don't want to reify the whole stream. In equality, you also want to stop as soon as the streams differ, but in hashing you always want to go to the end.
I can imagine some sort of coroutine thing to jump back and forth between, but that's a lot of overhead compared to testing equality. Lazy evaluation could do this, but I don't see how to fit that in. Maybe there's some way to take advantage of both sides of the equality being the same code, just walking different values?
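As a sketch of the kind of overhead I mean, a lockstep comparison over iterator-based decompositions might look like this; the Decomposer interface is hypothetical:

package example

import "iter"

// Decomposer yields a value's identity as a stream of words.
type Decomposer interface {
	Decompose() iter.Seq[uint64]
}

// equalByStream walks both streams in lockstep, stopping at the first
// difference; iter.Pull is where the coroutine overhead comes in.
func equalByStream(a, b Decomposer) bool {
	next, stop := iter.Pull(b.Decompose())
	defer stop()
	for x := range a.Decompose() {
		y, ok := next()
		if !ok || x != y {
			return false
		}
	}
	_, ok := next()
	return !ok // equal only if b's stream is also exhausted
}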
Comment From: adonovan
In a sense, hashing is "decompose this into a stream of basic types and hash that" and equality is "decompose both of these into streams of basic types and compare those." It's really tempting to say that a type just needs to provide a single "decompose this into a stream of basic types" and we can derive hashing and equality from there. But they're different enough that I don't know how to unify them, at least not efficiently. In hashing, you're only walking one value, but in equality you're walking two values. And of course you don't want to reify the whole stream. In equality, you also want to stop as soon as the streams differ, but in hashing you always want to go to the end.
I can imagine some sort of coroutine thing to jump back and forth between, but that's a lot of overhead compared to testing equality. Lazy evaluation could do this, but I don't see how to fit that in. Maybe there's some way to take advantage of both sides of the equality being the same code, just walking different values?
Don't forget that if you need to hash an unordered set of n items within a large data type, you have to either (a) sort them first, which allocates, or (b) hash each item with a sub-hasher using the same seed, combine the subhashes commutatively, then write the result into the main hasher. By contrast, to compute equivalence, you just use maps.Equal. The structures of the two algorithms are quite different.
Comment From: ianlancetaylor
I disagree that there's no reason for that. Much of the discussion early in this issue assumed some sort of seed would be passed in and discussed no seed vs. a uint64 vs. a maphash.Seed, but we eventually turned toward passing *maphash.Hash because passing around just a seed is really awkward for recursive hashing.
OK, but isn't that another way of saying that we are trying to figure out how to use the hash/maphash package to hash arbitrary types? That seems like a worthy goal but it's not the same as figuring out a standardized way for hash tables to compute a hash code.
For an example of using a different hash code for the same type I'll cite the gofrontend code. In that code I use hash tables of types such that identical types are equivalent, and I also use a hash table of types such that alias types are hashed differently from the types to which they are aliased, although of course alias types are identical to their target type. This is used to ensure that type aliases appear in export data. (There are other ways to do that, of course, but the point is that C++ permits me to pick the hash function to use, and it was convenient to do so.)
Comment From: mateusz834
There is also a possibility of doing it this way:
type Hasher[T any] interface {
Hash(hash maphash.Hash, value T) maphash.Hash
Equal(T, T) bool
}
Here Hash() is given a copy of the Hash value, and the implementation returns it (with the updated state). This way the Hash struct is not going to be heap-allocated, at the cost of copying memory.
But this might be a bit confusing and error-prone:
func (someHasher) Hash(h maphash.Hash, value someStruct) maphash.Hash {
	someOtherHasher.Hash(h, value.someField) // BUG: ignores the updated (returned) Hash
	// h = someOtherHasher.Hash(h, value.someField) // the proper way
	return h
}
Comment From: adonovan
Another data point in support of a static solution (encoding the hasher among the type parameters):
Last week I replaced the array in token.FileSet with a balanced binary tree, expecting it to improve the asymptote of appending out of order but to hurt the common-case cost of lookups. To my surprise, lookups also got faster. This is counterintuitive: lookups in an array and in a binary tree are essentially the same algorithm, but obviously the array has much better locality. The reason for this result is that the binary search on the array used slices.BinarySearchFunc, which makes dynamic calls to the key-comparison function, whereas the tree, specialized for one data type, makes static calls.
Comment From: aclements
A while back, we accepted this proposal as an "equality type". We reopened it because in practice it seemed annoying to have to provide an additional type parameter to collection types specifying the hashing algorithm. However, having explored alternatives, I think we've returned to the original conclusion. Sometimes you need multiple equivalence relations, and static dispatch is important both for raw call performance and for escape analysis. Putting those together, the hasher must be its own type parameter.
Comment From: aclements
I've been noodling on ways to make that extra type parameter "nicer" and have a few extremely underbaked ideas that I'll share just in case they become useful inspiration.
Most of the time there's a single reasonable default hasher for a type, which suggests some sort of "default value" syntax for type parameters. Go, famously, does not have defaults for function parameters, but type inference already allows omitting type parameters, so there's some precedent in that space.
Then we also need some way to say "T's default hasher type is H." This is a "type mapping" and it's something we've also butted heads with several times in the SIMD work.
Putting these together, one could imagine extending Go types and interfaces to allow types to have members that are themselves types, similar to how C++ represents type mappings. That would look something like:
type Thing struct {
}
// ThingHasher implements maphash.Hasher for Thing
type ThingHasher struct {...}
// New: This defines a "DefaultHasher" member of Thing that is itself a type.
// The syntax is analogous to how you define a method.
// (If we limit this to aliases, we avoid hairy questions of how you
// define methods on type members.)
type (Thing) DefaultHasher = ThingHasher
// DefaultHashable is the set of types that provide a default hasher,
// via a member called "DefaultHasher" that is itself a type.
type DefaultHashable interface {
// New: A type member, analogous to an interface method.
type DefaultHasher
}
// New: The "Hasher" type parameter of "Set" provides a
// default value. Sketchy: There's a new subset of
// expression syntax hiding here, since this needs to
// be evaluated statically, and needs to have a concept
// of "this type expression failed, so you have to
// provide a value for this parameter." That subset might
// be quite small, though.
type Set[Key any, Hasher maphash.Hasher = Key.(DefaultHashable).DefaultHasher] struct {
h Hasher
...
}
Comment From: mateusz834
Wouldn't it be possible to make this a:
type Set[H Hasher[K], K any] struct{}
And improve type inference, so that we don't have to specify K when it can be inferred from H? That is, allow
Set[maphash.Hasher[int]]
as shorthand for
Set[maphash.Hasher[int], int]
Comment From: aclements
@mateusz834 , that's a good point. It feels weird to put the hasher first, but it would solve the problem without complicated language additions!
Comment From: aclements
Have all remaining concerns about this proposal been addressed?
The proposal is to add the following API definition to package maphash:
// A Hasher is a type that implements hashing and equality for type T.
//
// A Hasher must be stateless. Hence, typically, a Hasher will be an empty struct.
type Hasher[T any] interface {
// Hash updates hash to reflect the contents of value.
//
// If two values are [Equal], they must also Hash the same.
// Specifically, if Equal(a, b) is true, then Hash(h, a) and Hash(h, b)
// must write identical streams to h.
Hash(hash *maphash.Hash, value T)
Equal(T, T) bool
}
(This is as previously accepted.)
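For illustration, a custom hasher under this interface might look like the following sketch; the Key type and its case-insensitive equivalence are invented:

package example

import (
	"hash/maphash"
	"strings"
)

type Key struct {
	Name string
	ID   int
}

// ciHasher is a stateless Hasher[Key] whose equivalence ignores the case
// of Name. Hash visits exactly what Equal compares, lower-cased, so equal
// keys write identical streams.
type ciHasher struct{}

func (ciHasher) Hash(h *maphash.Hash, k Key) {
	h.WriteString(strings.ToLower(k.Name))
	maphash.WriteComparable(h, k.ID)
}

func (ciHasher) Equal(a, b Key) bool {
	return a.ID == b.ID && strings.ToLower(a.Name) == strings.ToLower(b.Name)
}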
Comment From: jimmyfrasche
Should there also be a ComparableHasher that uses maphash.Comparable and ==? It seems like that would otherwise need to be implemented all over the place, so it would be good to just have it from the start in one place.
Comment From: adonovan
Should there also be a ComparableHasher that uses maphash.Comparable and ==? It seems like that would need to be implemented all over the place otherwise so it would be good to just have that from the start in one place.
Indeed. That was in the CL (https://go-review.googlesource.com/c/go/+/657296/8/api/next/70471.txt) though not in the proposal. I meant (but forgot) to reflect the change here. So I propose we also add:
package maphash
// Default is an implementation of [Hasher] for comparable types.
// Its Equal(x, y) method is consistent with x == y.
type Default[T comparable] struct{}
Comment From: jimmyfrasche
I don't think Default is a good name. maphash.Default makes it seem like it's related to Hash rather than Hasher. I think it should have Hasher in the name.
DefaultHasher would work, but it's not really the default per se. maphash.ComparableHasher is the most accurate name, but it is a bit long, so maybe BasicHasher would be a decent compromise?
package maphash
// BasicHasher is an implementation of [Hasher] for comparable types.
// Its Equal(x, y) method is consistent with x == y and
// its Hash(hash, value) method is consistent with [WriteComparable].
type BasicHasher[T comparable] struct{}
Comment From: adonovan
"Basic" might suggest "good only for basic types" (which is what maphash.Comparable is for, among other things).
DefaultHasher seems fine: it is the default in the sense that it is the hash function that corresponds to the default equivalence relation -- the one defined by ==.
ComparableHasher seems fine too; it is long, but it's not a name one would use often.
Comment From: mateusz834
Do we have an answer for https://github.com/golang/go/issues/70471#issuecomment-2810371814?
I think it has only been partially answered: the consensus is that the hasher should be specified as a type parameter, but is the zero value always a valid hasher?
Comment From: Merovius
Just quickly noting that if we want that, an easy way to do it would be to put ~struct{} into the Hasher interface. That also means it can only be used as a type parameter constraint. Which might be a good thing or not, depending on your disposition.
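A minimal sketch of that variant; the ~struct{} term makes Hasher a constraint-only interface, so it can only appear as a type parameter bound and its zero value is always a usable hasher:

package example

import "hash/maphash"

// Hasher with a ~struct{} term: only (named) empty struct types can
// implement it, so any H satisfying the constraint is stateless and
// its zero value is valid.
type Hasher[T any] interface {
	~struct{}
	Hash(hash *maphash.Hash, value T)
	Equal(T, T) bool
}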