Go version

go version go1.21.1 linux/amd64

Output of go env in your module/workspace:

GO111MODULE='on'
GOARCH='amd64'
GOBIN=''
GOCACHE='/xxx/.cache/go-build'
GOENV='/xxx/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOMODCACHE='/xxx/go/pkg/mod'
GOOS='linux'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVERSION='go1.21.1'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build789774400=/tmp/go-build -gno-record-gcc-switches'

What did you do?

When iterating over files within a zip archive using the Go standard library's zip package, there is an inconsistency in filename encoding. Specifically, when a file is located at the root level of the zip archive, its filename is retrieved with invalid encoding, showing characters such as question marks in place of the original characters. However, the filename is correctly encoded when the same file is inside a folder within the zip archive.

What did you see happen?

Steps to Reproduce: Create a zip archive containing files with filenames that include non-ASCII characters, such as "·". Iterate over the files in the zip archive using the zip package in Go. Observe the filenames retrieved when files are located at the root level versus within a folder structure.

Actual Behavior: Filenames retrieved from files at the root level of the zip archive exhibit incorrect encoding, displaying invalid characters such as question marks. Filenames within folders in the zip archive are correctly encoded.

What did you expect to see?

Filenames retrieved during iteration should maintain consistent encoding regardless of their location within the zip archive. The original characters in the filenames, including non-ASCII characters, should be preserved.

Comment From: seankhliao

can you provide an example and code for a reproducer?

Comment From: ZeinabAshjaei

@seankhliao

1. Create a zip archive containing a file named file·name.xml at the root level.
2. Iterate over the files in the zip archive using the zip package.
3. Observe the retrieved filename for file·name.xml.
4. Create another zip archive, but this time place the file file·name.xml inside a folder, e.g., test/file·name.xml.
5. Iterate over the files in the new zip archive using the zip package in Go.
6. Observe the retrieved filename for test/file·name.xml.

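// readZipFile prints the name of every entry in the zip archive at file's path.
func readZipFile(file *os.File) error {
    zipFile, err := zip.OpenReader(file.Name()) // open the archive by path
    if err != nil {
        return err
    }
    defer zipFile.Close()

    for _, fileEntry := range zipFile.File { // iterate over the archive entries
        fmt.Println(fileEntry.Name)
    }
    return nil
}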

Comment From: gabyhelp

Similar Issues

  • https://github.com/golang/go/issues/44187
  • https://github.com/golang/go/issues/36760
  • https://github.com/golang/go/issues/30627
  • https://github.com/golang/go/issues/32617
  • https://github.com/golang/go/issues/41402


Comment From: rsc

@ZeinabAshjaei Here is a Go program that creates Unicode files in the root and subdirectories and it seems to work fine: https://go.dev/play/p/T6tNxT1HH8M?v=gotip.

What program are you using to create the zip file? My guess is that program is writing bad zip file entries, or at least entries that are incompatible with Go's zip package. If you can attach a small example of a zip file that Go does not handle correctly, that would be helpful. Thanks.
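For readers who cannot open the playground link, the following is a minimal sketch of that kind of check. It is not the linked program; the helper name roundTrip is hypothetical, and the two entry names are the ones from the steps above. It writes both names with archive/zip into an in-memory archive, reads the archive back, and reports whether each name is returned unchanged.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"log"
)

// roundTrip (a hypothetical helper) writes one empty entry per name into an
// in-memory zip archive, reads the archive back, and returns the names as
// archive/zip reports them.
func roundTrip(names []string) ([]string, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	for _, name := range names {
		if _, err := w.Create(name); err != nil {
			return nil, err
		}
	}
	if err := w.Close(); err != nil {
		return nil, err
	}

	r, err := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
	if err != nil {
		return nil, err
	}
	var got []string
	for _, f := range r.File {
		got = append(got, f.Name)
	}
	return got, nil
}

func main() {
	// One entry at the root level and one inside a folder.
	want := []string{"file·name.xml", "test/file·name.xml"}
	got, err := roundTrip(want)
	if err != nil {
		log.Fatal(err)
	}
	for i, name := range got {
		fmt.Printf("%q same as written: %t\n", name, name == want[i])
	}
}
```

Because Go writes both names as UTF-8, this only shows that archive/zip treats root-level and nested entries the same way; it says nothing about archives produced by other tools.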

Comment From: ZeinabAshjaei

@rsc Thanks for the investigation. I agree; it seems that only the zip file I tested is not producing the correct file names. The attached zip file includes 3 PNG files, generated by AI.

GHTest.zip

Comment From: ianlancetaylor

Thanks. In the zip file you provided I see the same results using Go's archive/zip package and using unzip -l running on a Linux system. In both cases I see DALL�E, where the non-UTF-8 character is \372. Do you see different results that suggest an inconsistency in archive/zip rather than in whatever is generating the zip file?

Comment From: rsc

This is working as intended. The archive/zip reader never attempts to translate the names found in the zip file to valid UTF-8. It simply presents the bytes in the zip file, which in the test file are "DALL\x{fa}E" as Ian said.

% hexdump -C GHTest.zip |grep DAL
0012eed0  00 00 44 00 09 00 44 41  4c 4c fa 45 20 32 30 32  |..D...DALL.E 202|
0042f6b0  44 41 4c 4c fa 45 20 32  30 32 33 2d 30 37 2d 31  |DALL.E 2023-07-1|
0042f810  00 00 00 00 00 d0 f1 29  00 44 41 4c 4c fa 45 20  |.......).DALL.E |
% 

The zip reader does set f.NonUTF8 for these names as a signal to client code that they might need to be careful.
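As an illustration of how client code might look at that signal, here is a small sketch. The helper names buildArchive and inspectNames and the type nameInfo are hypothetical, and the entry name "DALL\xfaE.png" is made up to mimic the bytes in the hexdump above rather than taken from the attached file. The sketch builds an in-memory archive whose entry name contains the raw byte 0xFA, then reads it back and reports f.Name, f.NonUTF8, and whether the name is valid UTF-8.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"log"
	"unicode/utf8"
)

// buildArchive (hypothetical helper) returns an in-memory zip holding one
// empty entry whose name is written as-is and not marked as UTF-8.
func buildArchive(name string) ([]byte, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	if _, err := w.CreateHeader(&zip.FileHeader{Name: name, NonUTF8: true}); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// nameInfo records what archive/zip reports for one entry name.
type nameInfo struct {
	Name      string
	NonUTF8   bool
	ValidUTF8 bool
}

// inspectNames (hypothetical helper) reads the archive and collects, for each
// entry, the raw name, the NonUTF8 flag, and a UTF-8 validity check.
func inspectNames(data []byte) ([]nameInfo, error) {
	r, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	var out []nameInfo
	for _, f := range r.File {
		out = append(out, nameInfo{f.Name, f.NonUTF8, utf8.ValidString(f.Name)})
	}
	return out, nil
}

func main() {
	data, err := buildArchive("DALL\xfaE.png")
	if err != nil {
		log.Fatal(err)
	}
	infos, err := inspectNames(data)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range infos {
		// %q makes the undecoded byte visible instead of printing it raw.
		fmt.Printf("%q NonUTF8=%t validUTF8=%t\n", e.Name, e.NonUTF8, e.ValidUTF8)
	}
}
```

When NonUTF8 is true, deciding which legacy encoding to decode the name with is left to the caller; the sketch deliberately does not guess one.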