Proposal Details

Proposal Details

I propose to add the Zip function to the slices package. The zip function is a critical part of the Python toolkit in the context of data science and machine learning.

Zip can be considered a supper set of transpose but is less restrictive regarding input shape. Silently adjusting output on a best-effort basis.

Example

Implementation

// Zip returns a slice, where the i-th slice contains the i-th element from each argument.
// If all the slices are the same length, the operation is equivalent to a matrix transpose.
// Excessive elements are discarded.
// Length of the returned slice equals the length of the shortest slice passed as an argument.
// Calling Zip again on the result will reverse the process. However, the result may be different from the original argument.
func Zip[T ~[]E, E any](unzipped ...T) [][]E {
    if unzipped == nil {
        return nil
    }

    if len(unzipped) == 0 {
        return [][]E{}
    }

    cols := len(unzipped)
    rows := len(unzipped[0])
    for _, slice := range unzipped {
        if len(slice) < rows {
            rows = len(slice)
        }
    }

    zipped := make([][]E, rows)
    arr := make([]E, rows*len(unzipped))
    for i := 0; i < rows; i++ {
        row := arr[:cols:cols]
        arr = arr[cols:]
        for j := 0; j < cols; j++ {
            row[j] = unzipped[j][i]
        }
        zipped[i] = row
    }

    return zipped
}

Edit: The allocation pattern changed as suggested by @adonovan

Test

func TestZip_matrix(t *testing.T) {
    cases := map[string]struct {
        given     [][]int
        exp, exp2 [][]int
    }{
        "nil": {
            given: nil,
            exp:   nil,
            exp2:  nil,
        },
        "empty": {
            given: [][]int{},
            exp:   [][]int{},
            exp2:  [][]int{},
        },
        "one": {
            given: [][]int{
                {1, 2, 3},
            },
            exp: [][]int{
                {1},
                {2},
                {3},
            },
            exp2: [][]int{
                {1, 2, 3},
            },
        },
        "transpose": {
            given: [][]int{
                {1, 3, 5},
                {2, 4, 6},
            },
            exp: [][]int{
                {1, 2},
                {3, 4},
                {5, 6},
            },
            exp2: [][]int{
                {1, 3, 5},
                {2, 4, 6},
            },
        },
        "uneven": {
            given: [][]int{
                {1, 2, 3},
                {4, 5},
            },
            exp: [][]int{
                {1, 4},
                {2, 5},
            },
            exp2: [][]int{
                {1, 2},
                {4, 5},
            },
        },
    }
    for hint, c := range cases {
        t.Run(hint, func(t *testing.T) {
            got := sliceutil.Zip[[]int](c.given...)
            if cmp.Diff(c.exp, got) != "" {
                t.Fatalf("unexpected result, diff: %s", cmp.Diff(c.exp, got))
            }

            got = sliceutil.Zip[[]int](got...)
            if cmp.Diff(c.exp2, got) != "" {
                t.Fatalf("unexpected second result, diff: %s", cmp.Diff(c.exp2, got))
            }
        })
    }
}

func TestZip_tensor(t *testing.T) {
    cases := map[string]struct {
        given     [][][]int
        exp, exp2 [][][]int
    }{
        "nil": {
            given: nil,
            exp:   nil,
            exp2:  nil,
        },
        "empty": {
            given: [][][]int{},
            exp:   [][][]int{},
            exp2:  [][][]int{},
        },
        "uneven": {
            given: [][][]int{
                {{1, 2}, {2, 3}, {4, 5}},
                {{6, 7}, {8, 9}},
            },
            exp: [][][]int{
                {{1, 2}, {6, 7}},
                {{2, 3}, {8, 9}},
            },
            exp2: [][][]int{
                {{1, 2}, {2, 3}},
                {{6, 7}, {8, 9}},
            },
        },
    }
    for hint, c := range cases {
        t.Run(hint, func(t *testing.T) {
            got := sliceutil.Zip[[][]int](c.given...)
            if cmp.Diff(c.exp, got) != "" {
                t.Fatalf("unexpected result, diff: %s", cmp.Diff(c.exp, got))
            }

            got = sliceutil.Zip[[][]int](got...)
            if cmp.Diff(c.exp2, got) != "" {
                t.Fatalf("unexpected second result, diff: %s", cmp.Diff(c.exp2, got))
            }
        })
    }
}

Benchmark

func BenchmarkZip(b *testing.B) {
    data := [][]int{
        {1, 3, 5},
        {2, 4, 6},
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = sliceutil.Zip[[]int](data...)
    }
}

Comment From: seankhliao

how useful is this when all slices have to be of the same type?

Comment From: piotrkowalczuk

how useful is this when all slices have to be of the same type?

@seankhliao The first thing that comes to my mind is machine learning and, to be more precise, tensor processing. In this context, data must be normalized.

Std lib lacks the capabilities for that field; maybe this function could fit better under some newly formed package.

Comment From: thediveo

I've seen this coming up as part of the Go iterator activities. Seeing the Python implementation working on iterators I wonder if this would be better move to the Go iterator field?

Comment From: piotrkowalczuk

I've seen this coming up as part of the Go iterator activities. Seeing the Python implementation working on iterators I wonder if this would be better move to the Go iterator field?

Interesting point @thediveo. I don't know the overall direction or whether the community want iter.Seq to become the way to go for processing collections in general. I would enjoy the iter package if it would be an opt-in bridge between various types. At the same time, having access to a similar API layer in both maps and slices packages. To keep simple things simple.

Comment From: adonovan

I agree that this function should wait till the question of generalized sequences and iterators is settled.

BTW: it might be slightly faster (and better locality) to make a single allocation a slice of m*n elements upfront, and then break off pieces (setting the cap each time) for the individual rows of the result (like this).

Comment From: piotrkowalczuk

Now that iterators are available I would like to revisit this proposal.

This function is intended to work only with ordered sequences of data. Iterators here would introduce some ambiguity as there is no way to enforce this requirement on the signature level. I see two courses of action from here:

  • keep the signature hard to misuse (as it is how)
  • change the signature to be interoperable with iterators and explain caveats in the documentation.

Comment From: adonovan

Now that iterators are available I would like to revisit this proposal.

This function is intended to work only with ordered sequences of data. Iterators here would introduce some ambiguity as there is no way to enforce this requirement on the signature level. I see two courses of action from here:

  • keep the signature hard to misuse (as it is how)
  • change the signature to be interoperable with iterators and explain caveats in the documentation.

I'm not sure what ambiguity you're referring to; an iterator (iter.Seq) defines an ordered sequence. It should be possible to write a Zip function that zips n > 0 iterators over (possibly infinite) input sequences and returns an iterator over the output sequence (of tuples).

The challenge is how to represent the tuple. If the element types are all constrained to be equal, as in your example, then the element type of the output sequence would be a []E slice. However, without a way to create arbitrary n-ary product types, the only way to make it general is to use []any for the output type, which discards type information. Either choice rather diminishes the utility of the Zip function.

Comment From: ianlancetaylor

Note that the x/exp/xiter proposal #61898 includes a Zip function. It seems to make more sense to provide a Zip function on iterators than one on slices.

Comment From: piotrkowalczuk

@adonovan Zip would behave non-deterministically if at least one iterator produced a sequence from a map. As iterating over a map is non-deterministic. It makes maps, not the best candidate to be used in conjunction with Zip.

@ianlancetaylor I see, then this one can be closed.

Comment From: ianlancetaylor

OK, closing in favor of #61898. Thanks.

Comment From: adonovan

Note that the x/exp/xiter proposal https://github.com/golang/go/issues/61898 includes a Zip function.

Ah, I see it constrains n=2, allowing it to use a generic pair struct type for the zipped elements, preserving type information. Python's zip allows zipping of n iterators, but I suspect the utility diminishes fairly rapidly above n=2.