Proposal Details
The container ecosystem (podman, docker) spends its days creating and consuming huge .tar files. There is potential for significant speed-up here by having the tar package use zero-copy file transport.
The change is straightforward, but it involves an API change, so I am opening a proposal.
With the following change, tarring up a 2G file from tmpfs to tmpfs goes from 2.0s to 1.3s:
```diff
diff -u /home/hanwen/vc/go/src/archive/tar/writer.go hacktar/writer.go
--- /home/hanwen/vc/go/src/archive/tar/writer.go	2024-08-22 14:56:29.586690369 +0200
+++ hacktar/writer.go	2024-12-12 15:01:22.150045055 +0100
@@ -9,6 +9,7 @@
 	"fmt"
 	"io"
 	"io/fs"
+	"log"
 	"path"
 	"slices"
 	"strings"
@@ -491,7 +492,7 @@
 //
 // TODO(dsnet): Re-export this when adding sparse file support.
 // See https://golang.org/issue/22735
-func (tw *Writer) readFrom(r io.Reader) (int64, error) {
+func (tw *Writer) ReadFrom(r io.Reader) (int64, error) {
 	if tw.err != nil {
 		return 0, tw.err
 	}
@@ -550,6 +551,16 @@
 }
 
 func (fw *regFileWriter) ReadFrom(r io.Reader) (int64, error) {
+	log.Println("hanwen")
+	if _, ok := fw.w.(io.ReaderFrom); ok {
+		n, err := io.Copy(fw.w, r)
+		if n > fw.nb {
+			return n, fmt.Errorf("read %d bytes, beyond max %d", n, fw.nb)
+		}
+		fw.nb -= n
+		return n, err
+	}
+
 	return io.Copy(struct{ io.Writer }{fw}, r)
 }
```
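To make the intended effect concrete, here is a minimal usage sketch against the patched package; the file names are invented, and `tw.ReadFrom` is the method exported by the diff above. Since the tar writer's underlying writer is an `*os.File`, the inner `io.Copy` can use `copy_file_range(2)` on Linux instead of shuttling the data through user-space buffers:

```go
package main

import (
	"archive/tar"
	"log"
	"os"
)

func main() {
	out, err := os.Create("out.tar") // hypothetical destination archive
	if err != nil {
		log.Fatal(err)
	}
	tw := tar.NewWriter(out)

	src, err := os.Open("big.img") // hypothetical 2G input file
	if err != nil {
		log.Fatal(err)
	}
	st, err := src.Stat()
	if err != nil {
		log.Fatal(err)
	}
	if err := tw.WriteHeader(&tar.Header{
		Name: st.Name(),
		Mode: 0644,
		Size: st.Size(),
	}); err != nil {
		log.Fatal(err)
	}
	// With the proposed change the method is exported, so it can be
	// invoked directly (io.Copy can also reach it via its
	// io.ReaderFrom check, subject to its io.WriterTo preference for
	// the source). Both ends being *os.File, the inner io.Copy can
	// use copy_file_range(2) on Linux.
	if _, err := tw.ReadFrom(src); err != nil {
		log.Fatal(err)
	}
	src.Close()
	if err := tw.Close(); err != nil {
		log.Fatal(err)
	}
	if err := out.Close(); err != nil {
		log.Fatal(err)
	}
}
```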
**Comment From: hanwen**
A similar optimization exists for the reading side, of course.
**Comment From: ianlancetaylor**
Just to spell it out, I believe that the API change here is to define a new method on `archive/tar.Writer`:

```go
// ReadFrom implements [io.ReaderFrom].
func (tw *Writer) ReadFrom(r io.Reader) (int64, error)
```
Note that I think you could get a similar effect without the API change by writing

```go
if tw, ok := fw.w.(*Writer); ok {
	return tw.readFrom(r)
}
```
CC @dsnet
**Comment From: hanwen**
Your suggestion certainly improves `regFileWriter.ReadFrom`, but nobody calls that unless `Writer.ReadFrom` is exported. Am I missing something?
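That matches the documented contract of `io.Copy`: if `src` implements `io.WriterTo`, the copy is done via `src.WriteTo(dst)`; otherwise, if `dst` implements `io.ReaderFrom`, via `dst.ReadFrom(src)`. A sketch of that dispatch (a paraphrase, not the real `io.copyBuffer`):

```go
package dispatch

import "io"

// copyDispatch paraphrases the fast-path checks io.Copy performs
// before falling back to a buffered read/write loop (elided here).
func copyDispatch(dst io.Writer, src io.Reader) (int64, error) {
	if wt, ok := src.(io.WriterTo); ok {
		return wt.WriteTo(dst)
	}
	// The interface requires a method literally named ReadFrom, so an
	// unexported readFrom on *tar.Writer never matches here.
	if rf, ok := dst.(io.ReaderFrom); ok {
		return rf.ReadFrom(src)
	}
	return 0, nil // buffered copy loop elided
}
```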
**Comment From: hanwen**
For the reader, this works:

```diff
 //
 // TODO(dsnet): Re-export this when adding sparse file support.
 // See https://golang.org/issue/22735
-func (tr *Reader) writeTo(w io.Writer) (int64, error) {
+func (tr *Reader) WriteTo(w io.Writer) (int64, error) {
 	if tr.err != nil {
 		return 0, tr.err
 	}
@@ -688,6 +688,12 @@
 }
 
 func (fr *regFileReader) WriteTo(w io.Writer) (int64, error) {
+	_, ok1 := fr.r.(io.WriterTo)
+	wrf, ok2 := w.(io.ReaderFrom)
+	if ok1 && ok2 {
+		return wrf.ReadFrom(&io.LimitedReader{R: fr.r, N: fr.nb})
+	}
 	return io.Copy(w, struct{ io.Reader }{fr})
 }
```
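For completeness, a minimal usage sketch of the reading side, assuming the exported `WriteTo` above; the archive name is invented. Because the tar reader's underlying reader is an `*os.File`, the zero-copy branch can hand both files to `copy_file_range(2)` on Linux:

```go
package main

import (
	"archive/tar"
	"io"
	"log"
	"os"
)

func main() {
	f, err := os.Open("in.tar") // hypothetical archive
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	tr := tar.NewReader(f)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		if hdr.Typeflag != tar.TypeReg {
			continue
		}
		out, err := os.Create(hdr.Name)
		if err != nil {
			log.Fatal(err)
		}
		// With the export, this dispatches to the zero-copy branch
		// above whenever both ends support it.
		if _, err := tr.WriteTo(out); err != nil {
			log.Fatal(err)
		}
		out.Close()
	}
}
```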
**Comment From: ianlancetaylor**
It's probably me that was missing something.
**Comment From: mvdan**
cc @dsnet given the TODO above
**Comment From: Jorropo**
Do we want to include logic to pad out the tar to align content files to the destination's blocksize if it's an `*os.File` in the writer ?
Performance improvements go from single digit `x` improvement to multiple thousands through reflink at the cost of making the exact bytes of the tar dependent on the output.
Tar natively pad to 512 :'(
https://github.com/golang/go/blob/e39e965e0e0cce65ca977fd0da35f5bfb68dc2b8/src/archive/tar/format.go#L143
**Comment From: hanwen**
@Jorropo - Fascinating insight, thanks! I can confirm that on btrfs, if I set the blockSize to 4096, I can write a 2G tar file in 0.08s, which is amazing.
Unfortunately, the block size is not variable in the tar format, so this needs to be done in a different way. Fortunately, one could simply add as many empty files as needed to pad the tar file out to 4096 bytes (or whatever the destination block size is); see the sketch below. This can be done without changing the tar package at all.
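A minimal sketch of that idea, assuming the caller tracks the archive byte offset itself (the tar package does not expose it); `padToBlock` and the placeholder entry name are made up for illustration. Each zero-length file entry with a short name costs exactly one 512-byte header block, so a handful of them can push the next file's data onto a 4096-byte boundary:

```go
package tarpad

import "archive/tar"

// padToBlock writes zero-length dummy entries until the *data* of the
// next real file (which sits one 512-byte header block past offset)
// lands on an align-byte boundary. offset is the number of bytes
// written to the archive so far, tracked by the caller.
func padToBlock(tw *tar.Writer, offset, align int64) (int64, error) {
	const block = 512
	for (offset+block)%align != 0 {
		if err := tw.WriteHeader(&tar.Header{
			Name:     ".tar-align-pad", // arbitrary placeholder name
			Typeflag: tar.TypeReg,
			Mode:     0600,
			Size:     0,
		}); err != nil {
			return offset, err
		}
		offset += block // a short-named, zero-size entry is one header block
	}
	return offset, nil
}
```

The visible cost is that extractors will materialize the empty placeholder entries, so real tooling would need a convention for naming or skipping them.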
**Comment From: Jorropo**
I am not suggesting we change that field; 512 is a hardcoded part of the tar format.
If we want to do this, we might be able to figure out a way to inject no-op fields that parsers will ignore, in order to bump the content into the correct 512-byte bucket so that it is 4KiB (or whatever) aligned.
**Comment From: hanwen-flow**
> If we want to do this
Having the generated bytes depend on a non-obvious property of the destination sounds confusing to me; I think we wouldn't want to do this. Regardless, it should go into its own proposal. Let's not side-track this discussion.
**Comment From: hanwen**
A change with some numbers is at https://go-review.googlesource.com/c/go/+/642736. For tmpfs and ext4, it yields a 10-20% speed improvement.
I have trouble measuring consistent results with BTRFS on Fedora 41. A standalone program can seemingly copy large files around in no time at all (speeds of ~1000 GB/s), but these speeds aren't reflected in the benchmarks. Despite this, `strace` shows copy_file_range being executed in the same way in both the standalone program and the benchmark.
**Comment From: Jorropo**
@hanwen see https://github.com/golang/go/issues/70807#issuecomment-2543339037
Tar aligns content to 512 bytes; BTRFS defaults to an alignment of 4096. Assuming random file sizes, about one in four times tar's alignment will be equal to or greater than BTRFS's, leading to [reflinks](https://btrfs.readthedocs.io/en/latest/Reflink.html) being made.
We could force that by injecting no-op data in the tar file.
**Comment From: hanwen**
> Tar aligns content to 512 bytes; BTRFS defaults to an alignment of 4096. Assuming random file sizes, about one in four times tar's alignment will be equal to or greater than BTRFS's, leading to reflinks being made.
> We could force that by injecting no-op data in the tar file.

I am injecting no-op data to align to 4k. The problem is that BTRFS has a certain statefulness that I don't quite understand:
```
$ rm input output ; for i in 0 1 2 3 ; do go run copyfilerange.go -write input${i} output${i}; go run copyfilerange.go -write input${i} output${i}; go run copyfilerange.go input${i} output-2-${i}; done
2025/01/15 17:27:40 wrote input0 (1310720000 byte) in 786.06707ms: 1.552925 gb/s
2025/01/15 17:27:41 copy input0 -> output0: 1310720000 bytes in 781.925033ms: 1.561151 gb/sec
2025/01/15 17:27:43 wrote input0 (1310720000 byte) in 1.730116289s: 0.705561 gb/s
2025/01/15 17:27:43 copy input0 -> output0: 1310720000 bytes in 171.372842ms: 7.123084 gb/sec
2025/01/15 17:27:44 copy input0 -> output-2-0: 1310720000 bytes in 306.201µs: 3986.607245 gb/sec
2025/01/15 17:27:45 wrote input1 (1310720000 byte) in 680.93875ms: 1.792677 gb/s
2025/01/15 17:27:46 copy input1 -> output1: 1310720000 bytes in 773.442031ms: 1.578274 gb/sec
2025/01/15 17:27:48 wrote input1 (1310720000 byte) in 1.698480634s: 0.718703 gb/s
2025/01/15 17:27:48 copy input1 -> output1: 1310720000 bytes in 184.603247ms: 6.612577 gb/sec
2025/01/15 17:27:48 copy input1 -> output-2-1: 1310720000 bytes in 117.583µs: 10381.629360 gb/sec
2025/01/15 17:27:49 wrote input2 (1310720000 byte) in 671.39695ms: 1.818154 gb/s
2025/01/15 17:27:50 copy input2 -> output2: 1310720000 bytes in 1.036467151s: 1.177754 gb/sec
2025/01/15 17:27:52 wrote input2 (1310720000 byte) in 1.620079021s: 0.753484 gb/s
2025/01/15 17:27:53 copy input2 -> output2: 1310720000 bytes in 1.339927295s: 0.911022 gb/sec
2025/01/15 17:27:54 copy input2 -> output-2-2: 1310720000 bytes in 140.894µs: 8663.982320 gb/sec
2025/01/15 17:27:55 wrote input3 (1310720000 byte) in 672.611772ms: 1.814870 gb/s
2025/01/15 17:27:56 copy input3 -> output3: 1310720000 bytes in 952.568565ms: 1.281486 gb/sec
2025/01/15 17:27:58 wrote input3 (1310720000 byte) in 1.615837186s: 0.755462 gb/s
2025/01/15 17:27:59 copy input3 -> output3: 1310720000 bytes in 1.354446291s: 0.901256 gb/sec
2025/01/15 17:28:00 copy input3 -> output-2-3: 1310720000 bytes in 405.642µs: 3009.311474 gb/sec
```
$ cat copyfilerange.go

```go
package main

import (
	"bytes"
	"flag"
	"io"
	"log"
	"os"
	"time"
)

func gbyteps(n int, d time.Duration) float64 {
	return float64(n) / float64(1<<30) / d.Seconds()
}

func writeLargeFile(fn string, N int) error {
	start := time.Now()
	src, err := os.Create(fn)
	if err != nil {
		return err
	}
	blockSize := 4096
	writeSize := 16 * blockSize
	block := bytes.Repeat([]byte{42}, writeSize)

	size := 0
	for i := 0; i < N; i++ {
		if n, err := src.Write(block); err != nil {
			return err
		} else {
			size += n
		}
	}
	if err := src.Close(); err != nil {
		return err
	}
	dt := time.Now().Sub(start)
	log.Printf("wrote %s (%d byte) in %v: %f gb/s", fn, size, dt, gbyteps(size, dt))
	return nil
}

func main() {
	wr := flag.Bool("write", false, "")
	sleep := flag.Duration("sleep", 0, "")
	flag.Parse()

	in := flag.Arg(0)
	if *wr {
		if err := writeLargeFile(in, 20000); err != nil {
			log.Fatal(err)
		}
		if *sleep > 0 {
			log.Printf("sleeping %v", *sleep)
			time.Sleep(*sleep)
		}
	}
	out := flag.Arg(1)
	fin, err := os.Open(in)
	if err != nil {
		log.Fatal(err)
	}
	fout, err := os.Create(out)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := fout.Write(bytes.Repeat([]byte{42}, 4096)); err != nil {
		log.Fatal(err)
	}
	start := time.Now()
	n, err := io.Copy(fout, fin)
	if err != nil {
		log.Fatal(err)
	}
	fin.Close()
	if err := fout.Close(); err != nil {
		log.Fatal(err)
	}
	dt := time.Now().Sub(start)
	log.Printf("copy %s -> %s: %d bytes in %s: %f gb/sec", in, out, n, dt,
		float64(n)/float64(1<<30)/dt.Seconds())
}
```
**Comment From: Jorropo**
This is getting quite off the rails:
> Regardless, it should go into its own proposal. Let's not side-track this discussion.
If you want, join the Discord Gophers channel (https://discord.gg/golang) and I'll explain why it does this.
**Comment From: hanwen**
It looks like btrfs needs to have fsync called on the file for it to be eligible for metadata-only copies. Adding `src.Sync()` to the `writeLargeFile` function (sketched after the output below) yields:
```
2025/01/15 17:39:27 wrote input0 (1310720000 byte) in 1.64345491s: 0.742766 gb/s
2025/01/15 17:39:27 copy input0 -> output0: 1310720000 bytes in 483.089µs: 2526.870049 gb/sec
2025/01/15 17:39:29 wrote input0 (1310720000 byte) in 1.873767098s: 0.651470 gb/s
2025/01/15 17:39:29 copy input0 -> output0: 1310720000 bytes in 310.081µs: 3936.723388 gb/sec
2025/01/15 17:39:30 copy input0 -> output-2-0: 1310720000 bytes in 137.397µs: 8884.496204 gb/sec
```
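For concreteness, the change to `writeLargeFile` is just a flush before the close; the comment reflects the behaviour observed here, not documented btrfs semantics:

```diff
+	// Flush before closing: without the fsync, btrfs does not appear
+	// to satisfy the later copy_file_range with a metadata-only
+	// (reflink-style) copy.
+	if err := src.Sync(); err != nil {
+		return err
+	}
 	if err := src.Close(); err != nil {
 		return err
 	}
```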