Golang cmd/go: revisit allowed set of characters in module, import, and file paths

Currently, import paths have the following lexical restrictions (see module.CheckImportPath):

Must consist of valid path elements, separated by slashes. Must not begin or end with a slash.
A valid path element is a non-empty string that consists of ASCII letters, ASCII digits, and the punctuation characters - . _ ~. Must not end with a dot or contain two dots in a row.
A path element prefix up to the first dot must not be a reserved name on Windows, regardless of case (CON, com1, ...). An element must not have a suffix of a tilde followed by ASCII digits (like a Windows short name).
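
The import path rules above can be sketched in plain Go. This is a minimal illustration, not the code in module.CheckImportPath: the function names (validImportElem, validImportPath, isDigits) are hypothetical, and the reserved-name list is deliberately partial.

```go
package main

import (
	"fmt"
	"strings"
)

// reservedNames is a partial, illustrative list of Windows reserved names
// (assumption: the real list is longer).
var reservedNames = []string{"con", "prn", "aux", "nul", "com1", "lpt1"}

// isDigits reports whether s is non-empty and consists only of ASCII digits.
func isDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// validImportElem applies the element rules listed above to one path element.
func validImportElem(elem string) bool {
	if elem == "" || strings.HasSuffix(elem, ".") || strings.Contains(elem, "..") {
		return false
	}
	for _, r := range elem {
		ok := 'a' <= r && r <= 'z' || 'A' <= r && r <= 'Z' || '0' <= r && r <= '9' ||
			r == '-' || r == '.' || r == '_' || r == '~'
		if !ok {
			return false
		}
	}
	// The prefix up to the first dot must not be a reserved name, in any case.
	prefix := elem
	if i := strings.Index(elem, "."); i >= 0 {
		prefix = elem[:i]
	}
	for _, name := range reservedNames {
		if strings.EqualFold(prefix, name) {
			return false
		}
	}
	// Reject a suffix of a tilde followed by digits (Windows short name).
	if i := strings.LastIndex(elem, "~"); i >= 0 && isDigits(elem[i+1:]) {
		return false
	}
	return true
}

// validImportPath splits on slashes; a leading or trailing slash yields an
// empty element, which validImportElem rejects.
func validImportPath(path string) bool {
	for _, elem := range strings.Split(path, "/") {
		if !validImportElem(elem) {
			return false
		}
	}
	return true
}

func main() {
	paths := []string{"github.com/user/repo", "/leading/slash", "example.com/题目", "example.com/foo~1"}
	for _, p := range paths {
		fmt.Printf("%q valid=%v\n", p, validImportPath(p))
	}
}
```

Under this sketch, a path element containing Chinese characters is rejected by the character loop, which is the behavior the comments below ask to relax.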

Module paths have the same restrictions as import paths, with additional constraints (see module.CheckPath):

The first path element (by convention, a domain name) must contain only lower-case ASCII letters, ASCII digits, dots, and dashes. It must contain at least one dot and must not start with a dash.
If the path ends with /vN where N consists of ASCII digits and dots, N must not begin with 0, must not be 1, and must not contain any dots (there's a separate special case for gopkg.in/... module paths).
No path element may begin with a dot.
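
The additional module path constraints can be sketched the same way. This is a hedged illustration only: the function names are hypothetical, the gopkg.in special case is not handled, and the general import path rules are assumed to be checked separately.

```go
package main

import (
	"fmt"
	"strings"
)

// validFirstElem checks the domain-like first element: lower-case ASCII
// letters, digits, dots, and dashes; at least one dot; no leading dash.
func validFirstElem(elem string) bool {
	if elem == "" || elem[0] == '-' || !strings.Contains(elem, ".") {
		return false
	}
	for _, r := range elem {
		ok := 'a' <= r && r <= 'z' || '0' <= r && r <= '9' || r == '.' || r == '-'
		if !ok {
			return false
		}
	}
	return true
}

// validMajorSuffix checks a trailing /vN, where N consists of digits and dots.
// Paths without such a suffix pass unchanged.
func validMajorSuffix(path string) bool {
	i := strings.LastIndex(path, "/v")
	if i < 0 {
		return true
	}
	n := path[i+2:]
	if n == "" {
		return true
	}
	for _, r := range n {
		if r != '.' && (r < '0' || r > '9') {
			return true // not a version suffix at all
		}
	}
	return n[0] != '0' && n != "1" && !strings.Contains(n, ".")
}

// modulePathExtrasOK combines the three extra rules listed above.
func modulePathExtrasOK(path string) bool {
	elems := strings.Split(path, "/")
	if !validFirstElem(elems[0]) {
		return false
	}
	for _, e := range elems {
		if strings.HasPrefix(e, ".") {
			return false
		}
	}
	return validMajorSuffix(path)
}

func main() {
	paths := []string{"github.com/user/repo/v2", "github.com/user/repo/v1", "GitHub.com/user/repo", "example.com/.hidden"}
	for _, p := range paths {
		fmt.Printf("%q ok=%v\n", p, modulePathExtrasOK(p))
	}
}
```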

File paths have the same restrictions as import paths, but the set of allowed characters is larger (see module.CheckFilePath):

Path elements may consist of Unicode letters, ASCII digits, ASCII spaces, and ASCII punctuation characters ! # $ % & ( ) + , - . = @ [ ] ^ _ { } ~. The remaining ASCII punctuation characters " * < > ? ` ' | / \ : are excluded.
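
The wider file path character set can be sketched as a per-rune check. This is an illustration under stated assumptions, not the implementation behind module.CheckFilePath: fileCharOK and firstBadRune are hypothetical names, and only the character set is checked, not the other element rules.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// allowedPunct holds the ASCII punctuation listed above, plus the ASCII space.
const allowedPunct = "!#$%&()+,-.=@[]^_{}~ "

// fileCharOK reports whether r may appear in a file path element:
// ASCII letters, digits, the listed punctuation, or any non-ASCII Unicode letter.
func fileCharOK(r rune) bool {
	if r <= unicode.MaxASCII {
		return 'a' <= r && r <= 'z' || 'A' <= r && r <= 'Z' || '0' <= r && r <= '9' ||
			strings.ContainsRune(allowedPunct, r)
	}
	return unicode.IsLetter(r)
}

// firstBadRune returns the first disallowed rune in elem, if there is one.
func firstBadRune(elem string) (rune, bool) {
	for _, r := range elem {
		if !fileCharOK(r) {
			return r, true
		}
	}
	return 0, false
}

func main() {
	for _, name := range []string{"node:alpine", "题目小马过河.txt", "😻.txt", "my file (1).txt"} {
		if r, bad := firstBadRune(name); bad {
			fmt.Printf("%q: invalid char %q\n", name, r)
		} else {
			fmt.Printf("%q: ok\n", name)
		}
	}
}
```

In this sketch, Chinese characters pass because they are Unicode letters, while a colon (excluded ASCII punctuation) and an emoji (a symbol, not a letter) do not.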

These restrictions are generally in place for good reasons (see Unicode restrictions):

Module paths are frequently written and encoded into URLs, and we don't want to allow strings that interfere with that (for example, non-ASCII domain names).
Module contents are extracted into directories on a variety of systems. We don't want to allow strings that aren't valid file names or might collide with a different string (on case-insensitive or Unicode normalizing systems). We don't want to allow strings that are reserved, might be interpreted by the shell, might be interpreted as a flag (starting with -), or might be interpreted as a repository (.git).

That being said, these restrictions are more English-centric than necessary (#45507). They're also more restrictive than GOPATH (#29101).

We should come up with a wider set of characters that may be allowed without causing compatibility problems, particularly for import and file paths.

cc @bcmills @matloob

Comment From: duolabmeng6

Please support Chinese characters

Comment From: ddbxyrj

For cultural diversity, maybe we should take more Unicode character types into consideration.

Comment From: FiloSottile

Related: the handling of punycode domains. #20210

Comment From: FiloSottile

Also related, the conclusion that it's up to review tooling to keep homoglyph or LTR/RTL attacks at bay. https://research.swtch.com/trojan

Comment From: sxin0

Please support Chinese characters

Comment From: FiloSottile

Also related, #44970 discusses spec interactions.

Comment From: CodeNightOwl

Please support Chinese characters

go1.15.15 (this version works normally; later versions report errors)

Comment From: cx-shahar-septon

Proposal: skip checking resource file names. For example, the package "github.com/google/wuffs" contains a file named 😻.txt. The file is not part of the module, but a resource used for tests. Its path is within Unicode standards. I would like to think the rules can be more flexible here ;)

Comment From: yangyile1990

When I use Go 1.15 without go.mod, my Go package can be named "ACM题目小马过河".

But after I add go.mod with Go 1.20 or Go 1.21, it reports that the name is not supported.

For me, "ACM题目小马过河" is easier to understand than "ACM topic Pony Crossing the River".

So I think it's important to support native languages.

If you think this could cause problems, you could add a flag such as "support_native_language". When I turn it on, my package may never become popular, but it is only for fun.

Comment From: CodeNightOwl

For now I am simply using version 1.15; let's talk again once there is a solution.

Comment From: SgtCoDFish

Since #66243 was closed as a dupe of this issue, it's worth pointing out here that this issue seems to break the Go Sum DB. As an example, https://sum.golang.org/lookup/github.com/!doppler!h!q/cli@v0.5.9 currently has the following output:

not found: create zip: docker/node:alpine: malformed file path "docker/node:alpine": invalid char ':'
docker/python:alpine: malformed file path "docker/python:alpine": invalid char ':'
docker/ruby:alpine: malformed file path "docker/ruby:alpine": invalid char ':'

This seems to be because the repo contains files whose names include colons.

(It seems like maybe a separate bug that the Go sum DB prints errors like that as output)
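
For context on the `!doppler!h!q` segment in the lookup URL above: module paths are escaped for case-insensitive file systems by replacing each upper-case ASCII letter with an exclamation mark followed by its lower-case form. A minimal sketch of that escaping follows; escapePath is a hypothetical name, the unescaped path github.com/DopplerHQ/cli is inferred from the URL, and the sketch skips the validation that a real implementation would perform.

```go
package main

import (
	"fmt"
	"strings"
)

// escapePath replaces each upper-case ASCII letter with '!' plus the
// lower-case letter, so paths differing only in case do not collide on disk.
func escapePath(path string) string {
	var b strings.Builder
	for _, r := range path {
		if 'A' <= r && r <= 'Z' {
			b.WriteByte('!')
			b.WriteRune(r + 'a' - 'A')
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(escapePath("github.com/DopplerHQ/cli")) // github.com/!doppler!h!q/cli
}
```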

Comment From: matloob

Closing this issue since #67562 has been opened as a proposal to do something similar.