Current proposal status: https://github.com/golang/go/issues/67434#issuecomment-2565780150

This is a proposal for a new package to aid in testing concurrent code.

// Package synctest provides support for testing concurrent code.
package synctest

// Run executes f in a new goroutine.
//
// The new goroutine and any goroutines transitively started by it form a group.
// Run waits for all goroutines in the group to exit before returning.
//
// Goroutines in the group use a synthetic time implementation.
// The initial time is midnight UTC 2000-01-01.
// Time advances when every goroutine is idle.
// If every goroutine is idle and there are no timers scheduled,
// Run panics.
func Run(f func())

// Wait blocks until every goroutine within the current group is idle.
//
// A goroutine is idle if it is blocked on a channel operation,
// mutex operation,
// time.Sleep,
// a select with no cases,
// or is the goroutine calling Wait.
//
// A goroutine blocked on an I/O operation, such as a read from a network connection,
// is not idle. Tests which operate on a net.Conn or similar type should use an
// in-memory implementation rather than a real network connection.
//
// The caller of Wait must be in a goroutine created by Run,
// or a goroutine transitively started by Run.
// If it is not, Wait panics.
func Wait()

This package has two main features:

It permits using a fake clock to test code which uses timers. The test can control the passage of time as observed by the code under test.
It permits a test to wait until an asynchronous operation has completed.

As an example, let us say we are testing an expiring concurrent cache:

type Cache[K comparable, V any] struct{}

// NewCache creates a new cache with the given expiry.
// f is called to create new items as necessary.
func NewCache[K comparable, V any](expiry time.Duration, f func(K) V) *Cache {}

// Get returns the cache entry for K, creating it if necessary.
func (c *Cache[K,V]) Get(key K) V {}

A naive test for this cache might look something like this:

func TestCacheEntryExpires(t *testing.T) {
    count := 0
    c := NewCache(2 * time.Second, func(key string) int {
        count++
        return fmt.Sprintf("%v:%v", key, count)
    })

    // Get an entry from the cache.
    if got, want := c.Get("k"), "k:1"; got != want {
        t.Errorf("c.Get(k) = %q, want %q", got, want)
    }

    // Verify that we get the same entry when accessing it before the expiry.
    time.Sleep(1 * time.Second)
    if got, want := c.Get("k"), "k:1"; got != want {
        t.Errorf("c.Get(k) = %q, want %q", got, want)
    }

    // Wait for the entry to expire and verify that we now get a new one.
    time.Sleep(3 * time.Second)
    if got, want := c.Get("k"), "k:2"; got != want {
        t.Errorf("c.Get(k) = %q, want %q", got, want)
    }
}

This test has a couple problems. It's slow, taking four seconds to execute. And it's flaky, because it assumes the cache entry will not have expired one second before its deadline and will have expired one second after. While computers are fast, it is not uncommon for an overloaded CI system to pause execution of a program for longer than a second.

We can make the test less flaky by making it slower, or we can make the test faster at the expense of making it flakier, but we can't make it fast and reliable using this approach.

We can design our Cache type to be more testable. We can inject a fake clock to give us control over time in tests. When advancing the fake clock, we will need some mechanism to ensure that any timers that fire have executed before progressing the test. These changes come at the expense of additional code complexity: We can no longer use time.Timer, but must use a testable wrapper. Background goroutines need additional synchronization points.

The synctest package simplifies all of this. Using synctest, we can write:

func TestCacheEntryExpires(t *testing.T) {
        synctest.Run(func() {
                count := 0
                        c := NewCache(2 * time.Second, func(key string) int {
                        count++
                        return fmt.Sprintf("%v:%v", key, count)
                })

                // Get an entry from the cache.
                if got, want := c.Get("k"), "k:1"; got != want {
                        t.Errorf("c.Get(k) = %q, want %q", got, want)
                }

                // Verify that we get the same entry when accessing it before the expiry.
                time.Sleep(1 * time.Second)
                synctest.Wait()
                if got, want := c.Get("k"), "k:1"; got != want {
                        t.Errorf("c.Get(k) = %q, want %q", got, want)
                }

                // Wait for the entry to expire and verify that we now get a new one.
                time.Sleep(3 * time.Second)
                synctest.Wait()
                if got, want := c.Get("k"), "k:2"; got != want {
                        t.Errorf("c.Get(k) = %q, want %q", got, want)
                }
        })
}

This is identical to the naive test above, wrapped in synctest.Run and with the addition of two calls to synctest.Wait. However:

This test is not slow. The time.Sleep calls use a fake clock, and execute immediately.
This test is not flaky. The synctest.Wait ensures that all background goroutines have idled or exited before the test proceeds.
This test requires no additional instrumentation of the code under test. It can use standard time package timers, and it does not need to provide any mechanism for tests to synchronize with it.

A limitation of the synctest.Wait function is that it does not recognize goroutines blocked on network or other I/O operations as idle. While the scheduler can identify a goroutine blocked on I/O, it cannot distinguish between a goroutine that is genuinely blocked and one which is about to receive data from a kernel network buffer. For example, if a test creates a loopback TCP connection, starts a goroutine reading from one side of the connection, and then writes to the other, the read goroutine may remain in I/O wait for a brief time before the kernel indicates that the connection has become readable. If synctest.Wait considered a goroutine in I/O wait to be idle, this would cause nondeterminism in cases such as this,

Tests which use synctest with network connections or other external data sources should use a fake implementation with deterministic behavior. For net.Conn, net.Pipe can create a suitable in-memory connection.

This proposal is based in part on experience with tests in the golang.org/x/net/http2 package. Tests of an HTTP client or server often involve multiple interacting goroutines and timers. For example, a client request may involve goroutines writing to the server, reading from the server, and reading from the request body; as well as timers covering various stages of the request process. The combination of fake clocks and an operation which waits for all goroutines in the test to stabilize has proven effective.

@gabyhelp's overview of this issue: https://github.com/golang/go/issues/67434#issuecomment-2593973640

Comment From: aclements

I really like how simple this API is.

Time advances when every goroutine is idle.

How does time work when goroutines aren't idle? Does it stand still, or does it advance at the usual rate? If it stands still, it seems like that could break software that assumes time will advance during computation (that maybe that's rare in practice). If it advances at the usual rate, it seems like that reintroduces a source of flakiness. E.g., in your example, the 1 second sleep will advance time by 1 second, but then on a slow system the checking thread may still not execute for a long time.

What are the bounds of the fake time implementation? Presumably if you're making direct system calls that interact with times or durations, we're not going to do anything about that. Are we going to make any attempt at faking time in the file system?

If every goroutine is idle and there are no timers scheduled, Run panics.

What if a goroutine is blocked on a channel that goes outside the group? This came to mind in the context of whether this could be used to coordinate a multi-process client/server test, though I think it would also come up if there's any sort of interaction with a background worker goroutine or pool.

or is the goroutine calling Wait.

What happens if multiple goroutines in a group call Wait? I think the options are to panic or to consider all of them idle, in which case they would all wake up when every other goroutine in the group is idle.

What happens if you have nested groups, say group A contains group B, and a goroutine in B is blocked in Wait, and then a goroutine in A calls Wait? I think your options are to panic (though that feels wrong), wake up both if all of the goroutines in group A are idle, or wake up just B if all of the goroutines in B are idle (but this block waking up A until nothing is calling Wait in group B).

Comment From: neild

How does time work when goroutines aren't idle?

Time stands still, except when all goroutines in a group are idle. (Same as the playground behaves, I believe.) This would break software that assumes time will advance. You'd need to use something else to test that case.

What are the bounds of the fake time implementation?

The time package: Now, Since, Sleep, Timer, Ticker, etc.

Faking time in the filesystem seems complicated and highly specialized, so I don't think we should try. Code which cares about file timestamps will need to use a test fs.FS or some such.

What if a goroutine is blocked on a channel that goes outside the group?

As proposed, this would count as an idle goroutine. If you fail to isolate the system under test this will probably cause problems, so don't do that.

What happens if multiple goroutines in a group call Wait?

As proposed, none of them ever wake up and your test times out, or possibly panics if we can detect that all goroutines are blocked in that case. Having them all wake at the same time would also be reasonable.

What happens if you have nested groups

Oh, I didn't think of that. Nested groups are too complicated, Run should panic if called from within a group.

Comment From: apparentlymart

This is a very interesting proposal!

I feel worried that the synctest.Run characteristic of establishing a "goroutine group" and blocking until it completes might make it an attractive nuisance for folks who see it as simpler than arranging for the orderly completion of many goroutines using other synchronization primitives. That is: people may be tempted to use it in non-test code.

Assuming that's a valid concern (if it isn't then I'll retract this entire comment!), I could imagine mitigating it in two different ways: - Offer "goroutine groups" as a standalone synchronization primitive that synctest.Run is implemented in terms of, offering the "wait for completion of this and any other related goroutines" mechanism as a feature separate from synthetic time. Those who want to use it in non-test code can therefore use the lower-level function directly, instead of using synctest.Run. - Change the synctest.Run design in some way that makes it harder to misuse. One possible idea: make synctest.Run take a testing.TB as an additional argument, and then in every case where the proposal currently calls for a panic use t.FailNow() instead. It's inconvenient (though of course not impossible) to obtain a testing.TB implementation outside of a test case or benchmark, which could be sufficient inconvenience for someone to reconsider what they were attempting.

(I apologize in advance if I misunderstood any part of the proposal or if I am missing something existing that's already similarly convenient to synctest.Run.)

Comment From: neild

The fact that synctest goroutine groups always use a fake clock will hopefully act as discouragement to using them in non-test code. Defining goroutines blocked on I/O as not being idle also discourages use outside of tests; any goroutine reading from a network connection defeats synctest.Wait entirely.

I think using idle-wait synchronization outside of tests is always going to be a mistake. It's fragile and fiddly, and you're better served by explicit synchronization. (This prompts the question: Isn't this fragile and fiddly inside tests as well? It is, but using a fake clock removes much of the sources of fragility, and tests often have requirements that make the fiddliness a more worthwhile tradeoff. In the expiring cache example, for example, non-test code will never need to guarantee that a cache entry expires precisely at the nanosecond defined.)

So while perhaps we could offer a standalone synchroniziation primitive outside of synctest, I think we would need a very good understanding of when it would be appropriate to use it.

As for passing a testing.TB to synctest.Run, I don't think this would do much to prevent misuse, since the caller could just pass a &testing.T{}, or just nil. I don't think it would be wise to use synctest outside of tests, but if someone disagrees, then I don't think it's worth trying to stop them.

Comment From: gh2o

Interesting proposal. I like that it allows for waiting for a group of goroutines, as opposed to all goroutines in my proposal (#65336), though I do have some concerns:

Complexity of implementation: Having to modify every time-related function may increase complexity for non-test code. Would it make more sense to outsource the mock time implementation to a third party library? The Wait() function should be sufficient for the third party library to function deterministically, and goroutines started by Run() would behave like normal goroutines in all aspects.
Timeouts: In my proposal, WaitIdle() returns a <-chan struct{} since it allows for a test harness to abort the test if it takes too long (e.g. 30 seconds in case the test gets stuck in an infinite loop). Would it make sense for the Wait() function here to return a chan too to allow for timeouts?

Comment From: neild

One of the goals of this proposal is to minimize the amount of unnatural code required to make a system testable. Mock time implementations require replacing calls to idiomatic time package functions with a testable interface. Putting fake time in the standard library would let us just write the idiomatic code without compromising testability.

For timeouts, the -timeout test flag allows aborting too-slow tests. Putting an explicit timeout in test code is usually a bad idea, because how long a test is expected to run is a property of the local system. (I've seen a lot of tests inside Google which set an explicit timeout of 5 or 10 seconds, and then experience flakiness when run with -tsan and on CI systems that execute at a low batch priority.)

Also, it would be pointless for Wait to return a <-chan struct{}, because Wait must be called from within a synctest group and therefore the caller doesn't have access to a real clock.

Comment From: neild

I wanted to evaluate practical usage of the proposed API.

I wrote a version of Run and Wait based on parsing the output of runtime.Stack. Wait calls runtime.Gosched in a loop until all goroutines in the current group are idle.

I also wrote a fake time implementation.

Combined, these form a reasonable facsimile of the proposed synctest package, with some limitations: The code under test needs to be instrumented to call the fake time functions, and to call a marking function after creating new goroutines. Also, you need to call a synctest.Sleep function in tests to advance the fake clock.

I then added this instrumentation to net/http.

The synctest package does not work with real network connections, so I added an in-memory net.Conn implementation to the net/http tests.

I also added an additional helper to net/http's tests, which simplifies some of the experimentation below:

var errStillRunning = errors.New("async op still running")

// asyncResult is the result of an asynchronous operation.
type asyncResult[T any] struct {}

// runAsync runs f in a new goroutine,
// and returns an asyncResult which is populated with the result of f when it finishes.
// runAsync calls synctest.Wait after running f.
func runAsync[T any](f func() (T, error)) *asyncResult[T]

// done reports whether the asynchronous operation has finished.
func (r *asyncResult[T]) done() bool

// result returns the result of the asynchronous operation.
// It returns errStillRunning if the operation is still running.
func (r *asyncResult[T]) result() (T, error)

One of the longest-running tests in the net/http package is TestServerShutdownStateNew (https://go.googlesource.com/go/+/refs/tags/go1.22.3/src/net/http/serve_test.go#5611). This test creates a server, opens a connection to it, and calls Server.Shutdown. It asserts that the server, which is expected to wait 5 seconds for the idle connection to close, shuts down in no less than 2.5 seconds and no more than 7.5 seconds. This test generally takes about 5-6 seconds to run in both HTTP/1 and HTTP/2 modes.

The portion of this test which performs the shutdown is:

shutdownRes := make(chan error, 1)
go func() {
    shutdownRes <- ts.Config.Shutdown(context.Background())
}()
readRes := make(chan error, 1)
go func() {
    _, err := c.Read([]byte{0})
    readRes <- err
}()

// TODO(#59037): This timeout is hard-coded in closeIdleConnections.
// It is undocumented, and some users may find it surprising.
// Either document it, or switch to a less surprising behavior.
const expectTimeout = 5 * time.Second

t0 := time.Now()
select {
case got := <-shutdownRes:
    d := time.Since(t0)
    if got != nil {
        t.Fatalf("shutdown error after %v: %v", d, err)
    }
    if d < expectTimeout/2 {
        t.Errorf("shutdown too soon after %v", d)
    }
case <-time.After(expectTimeout * 3 / 2):
    t.Fatalf("timeout waiting for shutdown")
}

// Wait for c.Read to unblock; should be already done at this point,
// or within a few milliseconds.
if err := <-readRes; err == nil {
    t.Error("expected error from Read")
}

I wrapped the test in a synctest.Run call and changed it to use the in-memory connection. I then rewrote this section of the test:

shutdownRes := runAsync(func() (struct{}, error) {
    return struct{}{}, ts.Config.Shutdown(context.Background())
})
readRes := runAsync(func() (int, error) {
    return c.Read([]byte{0})
})

// TODO(#59037): This timeout is hard-coded in closeIdleConnections.
// It is undocumented, and some users may find it surprising.
// Either document it, or switch to a less surprising behavior.
const expectTimeout = 5 * time.Second

synctest.Sleep(expectTimeout - 1)
if shutdownRes.done() {
    t.Fatal("shutdown too soon")
}

synctest.Sleep(2 * time.Second)
if _, err := shutdownRes.result(); err != nil {
    t.Fatalf("Shutdown() = %v, want complete", err)
}
if n, err := readRes.result(); err == nil || err == errStillRunning {
    t.Fatalf("Read() = %v, %v; want error", n, err)
}

The test exercises the same behavior it did before, but it now runs instantaneously. (0.01 seconds on my laptop.)

I made an interesting discovery after converting the test: The server does not actually shut down in 5 seconds. In the initial version of this test, I checked for shutdown exactly 5 seconds after calling Shutdown. The test failed, reporting that the Shutdown call had not completed.

Examining the Shutdown function revealed that the server polls for closed connections during shutdown, with a maximum poll interval of 500ms, and therefore shutdown can be delayed slightly past the point where connections have shut down.

I changed the test to check for shutdown after 6 seconds. But once again, the test failed.

Further investigation revealed this code (https://go.googlesource.com/go/+/refs/tags/go1.22.3/src/net/http/server.go#3041):

st, unixSec := c.getState()
// Issue 22682: treat StateNew connections as if
// they're idle if we haven't read the first request's
// header in over 5 seconds.
if st == StateNew && unixSec < time.Now().Unix()-5 {
    st = StateIdle
}

The comment states that new connections are considered idle for 5 seconds, but thanks to the low granularity of Unix timestamps the test can consider one idle for as little as 4 or as much as 6 seconds. Combined with the 500ms poll interval (and ignoring any added scheduler delay), Shutdown may take up to 6.5 seconds to complete, not 5.

Using a fake clock rather than a real one not only speeds up this test dramatically, but it also allows us to more precisely test the behavior of the system under test.

Another slow test is TestTransportExpect100Continue (https://go.googlesource.com/go/+/refs/tags/go1.22.3/src/net/http/transport_test.go#1188). This test sends an HTTP request containing an "Expect: 100-continue" header, which indicates that the client is waiting for the server to indicate that it wants the request body before it sends it. In one variation, the server does not send a response; after a 2 second timeout, the client gives up waiting and sends the request.

This test takes 2 seconds to execute, thanks to this timeout. In addition, the test does not validate the timing of the client sending the request body; in particular, tests pass even if the client waits

The portion of the test which sends the request is:

resp, err := c.Do(req)

I changed this to:

rt := runAsync(func() (*Response, error) {
    return c.Do(req)
})
if v.timeout {
    synctest.Sleep(expectContinueTimeout-1)
    if rt.done() {
        t.Fatalf("RoundTrip finished too soon")
    }
    synctest.Sleep(1)
}
resp, err := rt.result()
if err != nil {
    t.Fatal(err)
}

This test now executes instantaneously. It also verifies that the client does or does not wait for the ExpectContinueTimeout as expected.

I made one discovery while converting this test. The synctest.Run function blocks until all goroutines in the group have exited. (In the proposed synctest package, Run will panic if all goroutines become blocked (deadlock), but I have not implemented that feature in the test version of the package.) The test was hanging in Run, due to leaking a goroutine. I tracked this down to a missing net.Conn.Close call, which was leaving an HTTP client reading indefinitely from an idle and abandoned server connection.

In this case, Run's behavior caused me some confusion, but ultimately led to the discovery of a real (if fairly minor) bug in the test. (I'd probably have experienced less confusion, but I initially assumed this was a bug in the implementation of Run.)

At one point during this exercise, I accidentally called testing.T.Run from within a synctest.Run group. This results in, at the very best, quite confusing behavior. I think we would want to make it possible to detect when running within a group, and have testing.T.Run panic in this case.

My experimental implementation of the synctest package includes a synctest.Sleep function by necessity: It was much easier to implement with an explicit call to advance the fake clock. However, I found in writing these tests that I often want to sleep and then wait for any timers to finish executing before continuing.

I think, therefore, that we should have one additional convenience function:

package synctest

// Sleep pauses the current goroutine for the duration d,
// and then blocks until every goroutine in the current group is idle.
// It is identical to calling time.Sleep(d) followed by Wait.
//
// The caller of Sleep must be in a goroutine created by Run,
// or a goroutine transitively started by Run.
// If it is not, Sleep panics.
func Sleep(d time.Duration) {
    time.Sleep(d)
    Wait()
}

The net/http package was not designed to support testing with a fake clock. This has served as an obstacle to improving the state of the package's tests, many of which are slow, flaky, or both.

Converting net/http to be testable with my experimental version of synctest required a small number of minor changes. A runtime-supported synctest would have required no changes at all to net/http itself.

Converting net/http tests to use synctest required adding an in-memory net.Conn. (I didn't attempt to use net.Pipe, because its fully-synchronous behavior tends to cause problems in tests.) Aside from this, the changes required were very small.

My experiment is in https://go.dev/cl/587657.

Comment From: rsc

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

Comment From: gh2o

Commenting here due to @rsc's request:

Relative to my proposal #65336, I have the following concerns:

Goroutine grouping: the only precedent for goroutine having a user-visible identity is runtime.LockOSThread(), and even then, it is set-only: a goroutine can not know whether it is locked to a thread or not without parsing runtime.Stack() output. Having these special "test mode" goroutines feels like a violation of goroutines being interchangeable anonymous workers, insofar as the Go runtime hides the goroutine ID from user code. Having a global wait is acceptable in the case of tests since it is unlikely for background goroutines to be present to interfere with the wait (and possibly actually desirable to catch those too).
Overriding standard library behavior: again, there is no precedent for standard library functions to behave differently based on what goroutine they are called from. The standard idiomatic way to do this is to define an interface (e.g. fs.FS) and direct all calls through the interface, and the implementation of the interface can be mocked at test time. If it is desirable to keep the current Run()/Wait() API, I would still strongly advocate for not changing the behavior of the standard time package, and instead incorporate a mock clock implementation in another package (likely under testing).

Comment From: neild

Regarding overriding the time package vs. providing a testing implementation:

The time package provides a number of commonly-used, exported functions, where code that makes direct use of these functions cannot be properly tested. I think this makes it unique in the standard library. For example, code which directly calls time.Sleep cannot be tested properly, because inserting a real delay inherently makes a test slow, and because there is no non-flaky way for a test to ensure that a certain amount of time has elapsed.

In contrast, we can test code which calls os.Open by providing it with the name of a file in a test directory. We can test code which calls net.Listen by listening on a loopback interface. The io/fs.FS interface may be used to create a testable seam in a system, but it isn't required.

Time is fundamentally different in that there is no way to use real time in a test without making the test flaky and slow.

Time is also different from an fs.File or a net.Conn in that there is only one plausible production implementation of time. A fs.FS might be the local filesystem, or an embedded set of static files, or a remote filesystem of some kind. A net.Conn might be a TCP or TLS connection. But it is difficult to come up with occasions outside of tests when time.Sleep should do anything other than sleep for the defined amount of time.

Since we can't use real time in tests, we can insert a testable wrapper around the time package as you propose. This requires that we avoid the idiomatic and easy-to-use time package functions. We essentially put an asterisk next to every existing function in the time package that deals with the system clock saying, "don't actually use this, or at least not in code you intend to test".

In addition, if we define a standard testable wrapper around the clock, we are essentially declaring that all public packages which deal with time should provide a way to plumb in a clock. (Some packages do this already, of course; crypto/tls.Config.Time is an example in std).

That's an option, of course. But it would be a very large change to the Go ecosystem as a whole.

Comment From: DmitriyMV

the only precedent for goroutine having a user-visible identity is runtime.LockOSThread()

The pprof.SetGoroutineLabels disagrees.

insofar as the Go runtime hides the goroutine ID from user code

It doesn't try to hide it, more like tries to restrict people from relying on numbers.

Having a global wait is acceptable in the case of tests since it is unlikely for background goroutines to be present to interfere with the wait (and possibly actually desirable to catch those too).

If I understood proposal correctly, it will wait for any goroutine (and recursively) that was started using go statement from the func passed to Run. It will not catch anything started before or sidewise. Which brings the good question: @neild will it also wait for time.AfterFunc(...) goroutines if time.AfterFunc(...) was called in the chain leading to synctest.Run?

Comment From: neild

@neild will it also wait for time.AfterFunc(...) goroutines if time.AfterFunc(...) was called in the chain leading to synctest.Run?

Yes, if you call AfterFunc from within a synctest group then the goroutine started by AfterFunc is also in the group.

Comment From: gh2o

Given that there's more precedent for goroutine identity than I had previously thought, and seeing how pprof.Do() works, I am onboard with the idea of goroutine groups.

However, I'm still a little ambivalent about goroutine groups affecting time package / standard library behavior, and theoretically a test running in synctest mode may want to know the real world time for logging purposes (I guess that could be solved by adding a time.RealNow() or something similar). The Wait() primitive seems to provide what is necessary for a third-party package to provide the same functionality without additional runtime support, so it could be worth exploring this option a bit more.

That being said, I agree that plumbing a time/clock interface through existing code is indeed tedious, and having time modified to conditionally use a mock timer may be the lesser evil. But it still feels a little icky to me for some reason.

Comment From: aclements

Thanks for doing the experiment. I find the results pretty compelling.

I think, therefore, that we should have one additional convenience function: [synctest.Sleep]

I don't quite understand this function. Given the fake time implementation, if you sleep even a nanosecond past timer expiry, aren't you already guaranteed that those timers will have run because the fake time won't advance to your sleep deadline until everything is blocked again?

Nested groups are too complicated, Run should panic if called from within a group.

Partly I was wondering about nested groups because I've been scheming other things that the concept of a goroutine group could be used for. Though it's true that, even if we have groups for other purposes, it may make sense to say that synctest groups cannot be nested, even if in general groups can be nested.

Comment From: neild

Given the fake time implementation, if you sleep even a nanosecond past timer expiry, aren't you already guaranteed that those timers will have run because the fake time won't advance to your sleep deadline until everything is blocked again?

You're right that sleeping past the deadline of a timer is sufficient. The synctest.Wait function isn't strictly necessary at all; you could use time.Sleep(1) to skip ahead a nanosecond and ensure all currently running goroutines have parked.

It's fairly natural to sleep to the exact instant of a timer, however. If a cache entry expires in some amount of time, it's easy to sleep for that exact amount of time, possibly using the same constant that the cache timeout was initialized with, rather than adding a nanosecond.

Adding nanoseconds also adds a small but real amount of confusion to a test in various small ways: The time of logged events drifts off the integer second, rate calculations don't come out as cleanly, and so on.

Plus, if you forget to add the necessary adjustment or otherwise accidentally sleep directly onto the instant of a timer's expiry, you get a race condition.

Cleaner, I think, for the test code to always resynchronize after poking the system under test. This doesn't have to be a function in the synctest package, of course; synctest.Sleep is a trivial two-liner using exported APIs. But I suspect most users of the package would use it, or at least the ones that make use of the synthetic clock.

I've been scheming other things that the concept of a goroutine group could be used for.

I'm very intrigued! I've just about convinced myself that there's a useful general purpose synchronization API hiding in here, but I'm not sure what it is or what it's useful for.

Comment From: rsc

For what it's worth, I think it's a good thing that virtual time is included in this, because it makes sure that this package isn't used in production settings. It makes it only suitable for tests (and very suitable).

Comment From: rsc

It sounds like the API is still:

// Package synctest provides support for testing concurrent code.
package synctest

// Run executes f in a new goroutine.
//
// The new goroutine and any goroutines transitively started by it form a group.
// Run waits for all goroutines in the group to exit before returning.
//
// Goroutines in the group use a synthetic time implementation.
// The initial time is midnight UTC 2000-01-01.
// Time advances when every goroutine is idle.
// If every goroutine is idle and there are no timers scheduled,
// Run panics.
func Run(f func())

// Wait blocks until every goroutine within the current group is idle.
//
// A goroutine is idle if it is blocked on a channel operation,
// mutex operation,
// time.Sleep,
// a select with no cases,
// or is the goroutine calling Wait.
//
// A goroutine blocked on an I/O operation, such as a read from a network connection,
// is not idle. Tests which operate on a net.Conn or similar type should use an
// in-memory implementation rather than a real network connection.
//
// The caller of Wait must be in a goroutine created by Run,
// or a goroutine transitively started by Run.
// If it is not, Wait panics.
func Wait()

Damien suggested adding also:

// Sleep pauses the current goroutine for the duration d,
// and then blocks until every goroutine in the current group is idle.
// It is identical to calling time.Sleep(d) followed by Wait.
//
// The caller of Sleep must be in a goroutine created by Run,
// or a goroutine transitively started by Run.
// If it is not, Sleep panics.
func Sleep(d time.Duration) {
    time.Sleep(d)
    Wait()
}

The difference between time.Sleep and synctest.Sleep seems subtle enough that it seems like you should have to spell out the Wait at the call sites where you need it. The only time you really need Wait is if you know someone else is waking up at that very moment. But then if they've both done the Sleep+Wait form then you still have a problem. You really only want some of the call sites (maybe just one) to use the Sleep+Wait form. I suppose that the production code will use time.Sleep since it's not importing synctest, so maybe it's clear that the test harness is the only one that will call Sleep+Wait. On the other hand, fixing a test failure by changing s/time.Sleep/synctest.Sleep/ will be a strange-looking bug fix. Better to have to add synctest.Wait instead. If we really need this, it could be synctest.SleepAndWait but that's what statements are for. Probably too subtle and should just limit the proposal to Run and Wait.

Comment From: gh2o

Some additional suggestions for the description of the Wait() function:

// A goroutine is idle if it is blocked on a channel operation, // mutex operation (...), // time.Sleep, // a select operation with or without cases, // or is the goroutine calling Wait. // // A goroutine blocked on an I/O operation, such as a read from a network connection, // is not idle. Tests which operate on a net.Conn or similar type should use an // in-memory implementation rather than a real network connection. // // A goroutine blocked on a direct syscall (via the syscall package) is also not idle, // even if the syscall itself sleeps.

Additionally, for "mutex operation", let's list out the the exact operations considered for implementation/testing completeness: - sync.Cond.Wait() - sync.Mutex.Lock() - sync.RWMutex.Lock() - sync.RWMutex.RLock() - sync.WaitGroup.Wait()

Comment From: nightlyone

The API looks simple and that is excellent.

What I am worried about is the unexpected failure modes, leading to undetected regressions, which might need tight support in the testing package to detect.

Imagine you unit test your code but are unable to mock out a dependency. Maybe due to lack of experience or bad design of existing code I have to work with.

That dependency that suddenly starts calling a syscall (e.g. to lazily try to tune the library using a sync.Once instead of on init time and having a timeout).

Without support in testing you will never detect that now and only your tests will suddenly time out after an innocent minor dependency update.

Comment From: nightlyone

May I ortgogonally to the previous comment suggest to limit this package to standard library only to gather more experience with that approach before ?

That would also allow to sketch out integration with the testing package in addition to finding more pitfalls.

Comment From: neild

What I am worried about is the unexpected failure modes, leading to undetected regressions, which might need tight support in the testing package to detect.

Can you expand more on what you mean by undetected regressions?

If the code under test (either directly, or through a dependency) unexpectedly calls a blocking syscall, Wait will wait for that syscall to complete before proceeding. If the syscall completes normally (the code is using os/exec to execute a subprocess, for example), then everything should operate as expected--the operation completes and the test proceeds. If the syscall is waiting on some event (reading from a network socket, perhaps), then the test will hang, which is a detectable event. You can look at goroutine stacks from the timed-out test to analyze the reason for the hang.

Without support in testing

What kind of support are you thinking of?

Comment From: ChrisHines

What does this do?

func TestWait(t *testing.T) {
    synctest.Run(func() {
        synctest.Wait()
    })
}

Does it succeed or panic? It's not clear to me from the API docs because:

If every goroutine is idle and there are no timers scheduled, Run panics.

A goroutine is idle if it [...] is the goroutine calling Wait.

This is obviously a degenerate case, but I think it also applies if a test wanted to get the fake time features when testing otherwise non-concurrent code.

Comment From: gh2o

What does this do?

func TestWait(t *testing.T) { synctest.Run(func() { synctest.Wait() }) }

In this case, the goroutine calling synctest.Wait() should never enter idle because there's nothing to wait for, and hence a panic should not occur.

Comment From: neild

func TestWait(t *testing.T) {
    synctest.Run(func() {
        synctest.Wait()
    })
}

This should succeed.

Perhaps the Wait documentation would be more clearly phrased as:

// Wait blocks until every goroutine within the current group,
// other than the current goroutine,
// is idle.
//
// A goroutine is idle if it is blocked on:
//     * a channel send or receive
//     * a select statement
//     * sync.Mutex.Lock
//     * sync.RWMutex.Lock or sync.RWMutex.RLock
//     * sync.Cond.Wait
//     * time.Sleep

This also resolves the question of what happens if two goroutines call Wait at the same time: Deadlock, because each is waiting for the other.

Comment From: gh2o

If multiple goroutines call Wait at the same time, I'm not sure if deadlock is the best solution since it's feasible for multiple goroutines to call it at once (e.g. multiple calls to the original synctest.Sleep() implementation).

Some possible alternative behaviors:

Wake up all Wait callers at the same time.
Wake up the first Wait caller in a FIFO manner, and keep the other Wait callers blocked.
Wake up a random Wait caller, similar to how select selects a random channel if multiple are ready.

Comment From: neild

What's the use case for supporting multiple simultaneous Wait calls? I think I'd want to see a concrete example before attempting to define semantics.

Comment From: neild

I tried my hand at implementing the runtime changes: https://go.dev/cl/591997

I have no idea what I'm doing in the runtime, so this implementation is likely mostly or all wrong, but it may suffice for testing out the proposal in practice.

Comment From: gopherbot

Change https://go.dev/cl/591997 mentions this issue: testing/synctest: new package for testing concurrent code

Comment From: rsc

The new API in https://go-review.googlesource.com/c/go/+/591997/8/api/next/67434.txt looks great. How do you feel about the implementation as far as being confident the API is right? How are the tests?

Comment From: rsc

There is an interesting question about what happens when creating a timer or timer channel outside the bubble and then using it inside the bubble, or vice versa (or move a timer from one bubble to another). To start with it seems like we should crash in those cases. If we find meaningful uses for crossing bubble boundaries we can always un-crash later.

Comment From: neild

I had just started writing a comment to that effect. Implementing the proposal made it clear that moving timers across bubble boundaries is confusing and difficult to implement correctly (if you can even define what "correctly" is). I agree that crashing is the right behavior. We should document that you aren't allowed to call methods on a time.Timer created in a different bubble.

We might also want to document that you aren't allowed to listen on a timer channel created in a different bubble, although I'm not sure if this is meaningfully different from operating on any other channel that crosses bubble boundaries.

I'm pretty confident in the API being right, based on https://go.dev/cl/591997. (I'm much less confident about the actual implementation being right; I'm not terribly familiar with the runtime guts. This has been an interesting learning experience.) I've tried using this in various net/http tests, and it's worked very well.

Test coverage in my implementation isn't great yet, since the last test I wrote exposed a race condition I haven't had a chance yet to figure out.

Comment From: neild

I think we should also include something like:

package synctest

// Running reports whether the current goroutine is or descends from a goroutine started by Run.
func Running() bool

The testing package can use this to avoid accidental calls of t.Run from within synctest.Run. (As mentioned above, this produces extremely confusing results, and we should just disallow it.) We could achieve that without an exported API, but this seems potentially useful and harmless to provide.

Comment From: kmorkos

What is the behavior of time.Now in a synctest.Run block? Would it report the current system clock time, or would it only report an advanced time after calls to time.Sleep? For example, I'm considering code like this:

func TestTimeNow(t *testing.T) {
        synctest.Run(func() {
                now1 := time.Now()

                // do stuff that does not call time.Sleep ...

                now2 := time.Now()

                time.Sleep(time.Millisecond)

                now3 := time.Now()

                // is now1 guaranteed to be equal to now2?
                // is now3 guaranteed to be now2+1ms?
        })
}

Comment From: neild

What is the behavior of time.Now in a synctest.Run block?

time.Now reports the synthetic clock. In your example, now1 is UTC 2000-01-01 00:00:00 in the local timezone, now2 is the same as now1, and now3 is now1+1ms.

In a synctest.Run block time.Now does not include a monotonic component, so now3.Sub(now1) is 1ms.

Comment From: narqo

This package has two main features: - It permits using a fake clock to test code which uses timers. The test can control the passage of time as observed by the code under test. - It permits a test to wait until an asynchronous operation has completed.

Have you explored if it could be method in testing.T, that would "freeze" time inside the test function where it was invoked?

That is, the call to t.Freeze() puts the calling test function into stasis, causing suspension of time:

func TestCacheEntryExpires(t *testing.T) {
    tf := t.Freeze()

    c := NewCache(2*time.Second, ···)

    // Get an entry from the cache.
    if got, want := c.Get("k"), "k:1"; got != want {
        t.Errorf("c.Get(k) = %q, want %q", got, want)
    }

    // Verify that we get the same entry when accessing it before the expiry.
    time.Sleep(1 * time.Second)
    tf.Wait()

    if got, want := c.Get("k"), "k:1"; got != want {
        t.Errorf("c.Get(k) = %q, want %q", got, want)
    }

    // Wait for the entry to expire and verify that we now get a new one.
    time.Sleep(3 * time.Second)
    tf.Wait()

    if got, want := c.Get("k"), "k:2"; got != want {
        t.Errorf("c.Get(k) = %q, want %q", got, want)
    }
}

Above, the call to t.Freeze returns a handler tf, with one method Wait. The method behaves similar to what is described in the original proposal.

I feel such API would be more cohesive with standard library, given that the test functions are already carefully controlled and orchestrated by a "runtime" in testing, and that there are already special methods in testing.T, that tweaks the behaviour around the functions (e.g. T.Skip or T.Parallel).

Comment From: neild

Putting the API in the testing package would be straightforward. At least in my draft implementation, all the real changes are in the runtime package, so the exported API can go anywhere we want.

Moving a goroutine into a synctest group (bubble, frozen environment, isolate), as with the proposed Freeze call, has more edge cases than having each goroutine consistently participate or not participate as part of a group. If we were to put this API in the testing package, I think it'd look something like (modulo naming):

func Test(t *testing.T) {
    t.RunIsolated("name", func(t *testing.T, group *testing.Isolate) {
        // do things
        group.Wait()
        // check things
    })
}

I don't know how much advantage this has over a separate package; the testing package is already large and contains many features. A separate synctest package gives us a good place to put documentation and clearly lays out the boundaries of this feature.

Comment From: narqo

Similar to what @apparentlymart mentioned above, I think the exposed API should prohibit the functionality from being used outside the testing context. So I think binding the API to testing.T could be a plus. It seems logical that everything under the testing/xyz package is for testing. But being able to "control time" of a running program is a dangerously powerful but very attractive idea.

Although, as a counterargument to that point, there might be an argument for providing this functionality for testing libraries that (can) work in parallel with typical testing functions (e.g. ginkgo or godog). I'm not sure that this should be a goal, though.

Comment From: bmizerany

During testing of this promising new feature, I observed something unexpected (which I did not find unexpected, since this patch still may contain bees). I'm posting here since this work is still in progress and a proper issue probably isn't warranted.

I ran the below test with go test -count=10, expecting to see every log of the context error to be "context canceled", but instead I saw it print "context canceled" sometimes, and other times "nil".

// go test -count=10
func TestSyncTestContextDeadline(t *testing.T) {
    synctest.Run(func() {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        t.Logf("before sleep: %s", time.Now())
        time.Sleep(10 * time.Second)
        t.Logf("after sleep: %s", time.Now())
        t.Logf("ctx.Err(): %v", ctx.Err())
    })
}

Go version:

go version devel go1.23-6bb09323ba Fri Jun 21 16:00:46 2024 -0700 darwin/arm64

go env

GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/bmizerany/Library/Caches/go-build'
GOENV='/Users/bmizerany/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT='rangefunc'
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/bmizerany/go/pkg/mod'
GONOPROXY='github.com/ollama/*'
GONOSUMDB='github.com/ollama/*'
GOOS='darwin'
GOPATH='/Users/bmizerany/go'
GOPRIVATE='github.com/ollama/*'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/Users/bmizerany/sdk/gotip'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/Users/bmizerany/sdk/gotip/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='devel go1.23-6bb09323ba Fri Jun 21 16:00:46 2024 -0700'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/Users/bmizerany/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOARM64='v8.0'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/Users/bmizerany/src/pqlite/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/db/svmm3t1x3yn4d1skpbq3ddv00000gn/T/go-build2933758817=/tmp/go-build -gno-record-gcc-switches -fno-common'

Comment From: magical

@bmizerany I think (if i've understood the proposal correctly, which is not guaranteed) that you need a synctest.Wait() call between the time.Sleep and ctx.Err calls. After the context's timer expires, it launches a goroutine which is responsible for canceling the context (setting Err and other bookkeeping). Under normal circumstances, 10 seconds would be plenty of time for that to happen, but inside synctest.Run, Sleep happens instantaneously so it's very possible that the goroutine has not had a chance to run yet. That's what synctest.Wait is for.

Comment From: bmizerany

@magical Thank you.

I've updated, and encountered a deadlock. It's probably something I'm misunderstand, but sharing in case it's helpful:

package x

import (
    "context"
    "testing"
    "testing/synctest"
    "time"
)

// go test -count=10
func TestSyncTestContextDeadline(t *testing.T) {
    synctest.Run(func() {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        t.Logf("before sleep: %s", time.Now())
        time.Sleep(10 * time.Second)
        synctest.Wait()
        t.Logf("after sleep: %s", time.Now())
        t.Logf("ctx.Err(): %v", ctx.Err())
    })
}

; gotip test -count=1000   
--- FAIL: TestSyncTestContextDeadline (0.00s)
    x_test.go:16: before sleep: 1999-12-31 16:00:00 -0800 PST
    x_test.go:19: after sleep: 1999-12-31 16:00:10 -0800 PST
    x_test.go:20: ctx.Err(): <nil>
panic: deadlock: all goroutines in group are blocked [recovered]
    panic: deadlock: all goroutines in group are blocked

goroutine 724 [running]:
testing.tRunner.func1.2({0x10030a280, 0x100335780})
    /Users/bmizerany/sdk/gotip/src/testing/testing.go:1632 +0x1bc
testing.tRunner.func1()
    /Users/bmizerany/sdk/gotip/src/testing/testing.go:1635 +0x334
panic({0x10030a280?, 0x100335780?})
    /Users/bmizerany/sdk/gotip/src/runtime/panic.go:785 +0x124
testing/synctest.Run(0x140002f9ca0)
    /Users/bmizerany/sdk/gotip/src/runtime/synctest.go:169 +0x198
x.TestSyncTestContextDeadline(0x14000324d00)
    /private/var/folders/db/svmm3t1x3yn4d1skpbq3ddv00000gn/T/tmp.gfmtFoMUYC/x_test.go:12 +0x5c
testing.tRunner(0x14000324d00, 0x100334d00)
    /Users/bmizerany/sdk/gotip/src/testing/testing.go:1690 +0xe4
created by testing.(*T).Run in goroutine 1
    /Users/bmizerany/sdk/gotip/src/testing/testing.go:1743 +0x314
exit status 2
FAIL    x   0.102s

Comment From: neild

It's subtle, but in this case you shouldn't need a synctest.Wait because the context expires 5s after the test starts and the sleep is for 10s. Expected behavior is for the runtime to detect that all goroutines are idle, advance the synthetic clock to the time of the next event (T+5s), wait for all goroutines to become idle again, and then advance the clock again to T+10s.

I think this test is running afoul of a bug in CL 591997. I've uploaded a new version of the CL with a couple fixes that make this test pass consistently for me, although there are some other bugs in there that (optimistically) I haven't worked out yet.

Surprising-to-me edge cases I've discovered in implementation so far:

A goroutine waking from a select briefly exists in a state where it is still parked, but has committed to reading from or writing to one of the channels in the select. This state is indicated by g.selectDone being set to 1. We need to consider a goroutine in this state as not being idle.
A goroutine doing GC mark assist work can block. We need to consider a goroutine in this state as not idle.

(So far, the bugs I've encountered in implementation all seem to indicate that this feature is tricky to implement, but not problems with the user-facing API.)

Comment From: bmizerany

I too am now able to get 1000 runs without fail. Excellent! This is a very exciting feature/addition. I'm confident you'll suss out all the edge cases. :)

(So far, the bugs I've encountered in implementation all seem to indicate that this feature is tricky to implement, but not problems with the user-facing API.)

Agreed. The API seems solid.

Comment From: magical

It's subtle, but in this case you shouldn't need a synctest.Wait because the context expires 5s after the test starts and the sleep is for 10s. Expected behavior is for the runtime to detect that all goroutines are idle, advance the synthetic clock to the time of the next event (T+5s), wait for all goroutines to become idle again, and then advance the clock again to T+10s.

Ah, interesting - thanks for the clarification. (In fact i see that you and Austin already discussed this. My bad.)

If i understand correctly, it sounds like we can break Sleep's behaviour down into three cases: Upon Sleeping until time t, for all events scheduled for time u... * if u < t: the event fires and we wait for all its goroutines to become idle * if u = t: the event fires but we do not wait for its goroutines * if u > t: the event does not fire (yet)

The u=t case seems like a sharp edge, since it's the only time when the user would actually have to call Wait. As Austin pointed out, sleeping even 1ns past t seems to fill the same role as Sleep+Wait. They asked if that makes Wait unnecessary because it could be replaced by Sleep(1ns). I would turn it around and ask, why is that extra nanosecond necessary? Is there any reason why Sleep can't always Wait afterwards?

(I imagine maybe there is a good reason, like it would inject too much synchronization into the code under test, but i haven't completely thought it through)

Continuing along the same vein... is there any difference between these functions?

synctest.Run(func(){
    time.AfterFunc(5*time.Second, func() { println(5) } )
    time.AfterFunc(10*time.Second, func() { println(10) } )
    time.AfterFunc(15*time.Second, func() { println(15) } )

    time.Sleep(5*time.Second) // +5
    time.Sleep(5*time.Second) // +10
    time.Sleep(5*time.Second) // +15
    time.Sleep(5*time.Second) // +20
})

synctest.Run(func(){
    time.AfterFunc(5*time.Second, func() { println(5) } )
    time.AfterFunc(10*time.Second, func() { println(10) } )
    time.AfterFunc(15*time.Second, func() { println(15) } )

    time.Sleep(20*time.Second)
})

In the first one, the Sleep and AfterFunc times line up perfectly, so we should only encounter the u=t and u>t cases, so there shouldn't be any implicit Wait - unless Sleep unconditionally Waits when it starts? Is Sleep(0) equivalent to Wait?

Comment From: neild

I would turn it around and ask, why is that extra nanosecond necessary? Is there any reason why Sleep can't always Wait afterwards?

What happens if two Sleep calls happen at the same time? How do we decide which one waits for the other?

We could say that Sleep is lower priority than timers: First wake non-Sleep timers, and then wake Sleep calls when all other timers have finished. But that seems more subtle than an explicit Wait, and we still have the question of multiple Sleeps happening at the same time.

Another possibility might be to say that the goroutine started by Run has a lower priority than any other: It only wakes when every other goroutine is idle. But again, this is subtle.

Comment From: paskozdilar

I love this proposal so far. I have ran into a problem of long-running tests that just call time.Sleep (or other languages' equivalents) so many times, and having to refactor them to be more testable. This would alleviate (or ease) that problem in many Go tests.

Maybe this is out of scope, but I am looking at this from the perspective of Communicating Sequential Processes (CSP) - in CSP, a concurrent program is correct iff each trace (possible sequence of events) is correct. E.g. when we call time.Sleep (with close-enough values) from two goroutines A and B, the program is correct iff "A wakes up before B" is correct and "B wakes up before A" is correct too.

This would, of course, complicate the implementation, since we will not be able to use the "idle" state as a global synchronization point, but we'd have a DAG of "idle" states many of which would be independent from others.

Would it be possible to generate all possible traces of goroutines (using "idle" as the "event" in context of CSP) and test whether or not all of them result in correct behavior (i.e. pass the test)?

Comment From: QuinnCiccoretti

This may be irrelevant, but the go playground docs mention

The playground version of the runtime maintains its own internal clock. When the modified scheduler detects a deadlock it checks whether any timers are pending. If so, it advances the internal clock to the trigger time of the earliest timer and then wakes the timer goroutine. Execution continues and the program believes that time has passed, when in fact the sleep was nearly instantaneous. These changes to the scheduler can be found in proc.c and time.goc.

It looks like these changes may have since been abandoned?

Comment From: ianlancetaylor

Those changes still exist and can be enabled using go build -tags=faketime. But it builds the runtime in a different mode such that it doesn't work normally, which is not the goal of this proposal.

Comment From: neild

I think the draft CL (https://go.dev/cl/591997) is approaching something real. Still needs more tests, but I think I've squashed all the race conditions I know about.

Currently open questions/suggestions:

@rsc suggests (https://github.com/golang/go/issues/67434#issuecomment-2192252265) that moving a timer between bubbles (what I've been calling a "synctest group") should crash. I agree, and the current version of CL 591997 implements this.

I believe we can make channel operations and coroutine switches that cross bubble boundaries crash as well. Should we? Specifically, we would disallow:

A channel send, receive, or close which unparks a goroutine in a different group. Channel operations which do not unpark a goroutine would not cause a crash, so under many circumstances the crash would be nondeterministic. For example, you could put an item in a buffered channel in one bubble and consume it in another, so long as neither sender nor receiver blocks.
A coroutine switch where the two goroutines are in different bubbles.

@narqo suggests (https://github.com/golang/go/issues/67434#issuecomment-2195486907) putting the API in the testing package. Perhaps something like:

func Test(t *testing.T) {
    t.RunIsolated("name", func(t *testing.T, group *testing.Isolate) {
        // do things
        group.Wait()
        // check things
    })
}

I don't have particularly strong feelings about this, but my inclination is to stick with a separate synctest package. The testing package API is quite large, and a separate package gives us a good place to document these features in isolation.

I suggest adding one more function to synctest, to report whether the current goroutine is in a bubble or not:

package synctest

// Running reports whether the current goroutine was started by Run,
// or is a descendant of a goroutine started by Run.
func Running() bool

Comment From: adamluzsi

Hello. If you have components which contain time-bound logic, and testing them is difficult without being affected by slow running tests, or keep a consistent timeline between components that each expects a time-related dependency, then the testcase library has a few useful packages.

I'm not opposing the proposal, just wanted to share the package that I made for a similar purpose, just a bit more generic to use. (More of a timetest than a synctest, but perfectly suitable for testing sync related topics as well)

https://pkg.go.dev/go.llib.dev/testcase
https://github.com/adamluzsi/testcase
clock -> a time like a package that can partially replace the usage of time package. Its functionality behaves exactly like the related functions in the time package, except it reacts to time travelling.
go pkg examples
timecop -> allows you to manipulate time in your tests such as travelling, freezing, speeding it up, slowing it down, etc

The following cases are covered:

clock.Now()

By default, it just returns the current time
when Timecop is moving in time
then the time it just returns the current time
then time is still moving forward
and time moved to a specific time given in the non-local format
- then the time it just returned in the same Local as time.Now()
- then the time is what Travel set
when Timecop is altering the flow of time
then the speed of time is affected

clock.Sleep

By default, it just sleeps as time.Sleep()
when Timecop change the flow of time
then the time it takes to sleep is affected

clock.After

By default, it behaves as time.After()
when Timecop change the flow of time's speed
then clock.After goes faster
when time travel happens while waiting on the result of clock.After, then it will affect them.
when the wait time is zero, clock.After returns instantly

clock.NewTicker

By default, it behaves as time.NewTicker
when Timecop change the flow of time's speed
then ticks will adapt to the new flow
when time travel happens
then ticks will react to the time-travelling

example from the proposal would look like

func TestCacheEntryExpires(t *testing.T) {
    count := 0
    c := NewCache(2 * time.Second, func(key string) int {
        count++
        return fmt.Sprintf("%v:%v", key, count)
    })

    // Get an entry from the cache.
    if got, want := c.Get("k"), "k:1"; got != want {
        t.Errorf("c.Get(k) = %q, want %q", got, want)
    }

    // Verify that we get the same entry when accessing it before the expiry.
    timecop.Travel(t, 1 * time.Second)
    if got, want := c.Get("k"), "k:1"; got != want {
        t.Errorf("c.Get(k) = %q, want %q", got, want)
    }

    // back to the future
    timecop.Travel(t, 3 * time.Second)

    if got, want := c.Get("k"), "k:2"; got != want {
        t.Errorf("c.Get(k) = %q, want %q", got, want)
    }
}

Comment From: adamluzsi

Could we consider a more flattened API such as:

func TestXXX(t *testing.T) {
    synctest.Enable(t)
    // or
    stop := synctest.Start()
    defer stop()
}

This approach allows helper functions to handle test arrangements/setups. It would enable this feature's usability in popular testing frameworks that rely on helper methods, like ginkgo, godog, or goconvey.

The API would remain straightforward, and we could defer the teardown process using testing.TB.Cleanup:

func TestXXX(t *testing.T) {
    t.Cleanup(synctest.Start())
}

Comment From: neild

Could we consider a more flattened API

I don't think this offers any significant advantages over synctest.Run, and has some disadvantages.

With Run, a goroutine is always part of a synctest group or not. Within a goroutine, time.Now either returns a fake time value or it does not. The extent of Run's influence is clearly delineated, and there's no possibility of accidentally leaking that influence somewhere unintended.

For example, if we use a switch like Enable or Start, the testing package now needs to consider the possibility that the goroutine running a test returns from the test function using a fake clock.

I don't see how Enable or Start would be needed by testing frameworks either. A testing helper function that uses synctest can accept a function to run inside a synctest.Run environment.

Comment From: cherrymui

we would disallow: A channel send, receive, or close which unparks a goroutine in a different group. Channel operations which do not unpark a goroutine would not cause a crash, so under many circumstances the crash would be nondeterministic.

Is it important to distinguish whether it unparks? Or disallowing the nonblocking case would make it too restrictive?

The original question was about timers. The new restriction would apply to all type of values? (This probably makes sense, otherwise we'd have to distinguish which values could contain time, which couldn't.)

Comment From: neild

Is it important to distinguish whether it unparks? Or disallowing the nonblocking case would make it too restrictive?

I don't know how to detect any other cases.

We can detect when a channel operation unparks a goroutine in a separate bubble--when waking the blocked goroutine, we just check to see whether the goroutine performing the channel operation and the one being unparked are in the same bubble.

But unless we associate each channel with a bubble, I don't know how to detect the case where a goroutine in one bubble closes a channel or performs some other non-blocking operation, and at a later time a goroutine in a different bubble observes the result of the operation.

We could associate channels with bubbles, but that's an added field on runtime.hchan. I've been trying to avoid increasing the size of any runtime structures, aside from one additional pointer on g to record bubble membership. (g is already pretty big, so hopefully one more word isn't a meaningful cost.)

I think a reasonable compromise is to say that channel operations aren't allowed to cross bubble boundaries, to panic when we can detect that this restriction has been violated, but to not worry about cases where we can't easily detect a violation.

The original question was about timers. The new restriction would apply to all type of values?

Yes.

The original question (I think) was about operations on a time.Timer. For example, if a goroutine in one bubble creates a time.Timer (which will use the bubble's fake clock) and a goroutine in a different bubble resets the timer, which clock is the timer now using? Panicking seems reasonable here, since this doesn't seem like the sort of thing one should be doing intentionally.

Comment From: rsc

I think the panicking really needs to be only about timer channels, not all channels. Consider a goroutine serving as a mutex protecting some shared state. That goroutine can be created during func init (outside any bubbles) and then code in a bubble can send and receive to and from that goroutine to access the shared state. Surely that should be allowed, just as code in a bubble can lock and unlock a global mutex. (The same is true for a mutex that may or may not park or unpark across a bubble.)

Restricting to timer channels also makes the panic behavior independent of buffering, so that you get consistent panics as oppposed to panics that depend on scheduling details.

Comment From: neild

@prattmic raises a very interesting (in the worst sense of the word) point in review https://go.dev/cl/591997, which I'm copying here for visibility:

A goroutine blocked in sync.Mutex.Lock is idle.

Run panics if all goroutines in the bubble are idle (deadlock).

crypto/rand.Reader is a global with a mutex. reflect.FuncOf has a global mutex protecting a cache. There are probably other cases of global mutexes in std. In general, a synctest goroutine can't safely access any resource guarded by a global lock, because it will appear idle if it blocks in lock acquisition. That's a problem for common functions with a hidden lock, like reflect.FuncOf.

One possible way to resolve this would be to say that goroutines blocked in mutex acquisition are not idle. For the common case of a mutex held for a brief period of time while accessing some state, this would cause no issues for synctest users. This change would make synctest unsuitable for testing code which holds a mutex for a long time across a blocking operation, however. That may be an uncommon enough case to not be a problem. (I don't think this limitation would affect synctest's usability in testing net/http, for example.)

Another possibility might be to distinguish between mutexes acquired by a goroutine within a synctest bubble and ones not. We might be able to set a bit on Mutex.sema indicating that it's held by a bubbled goroutine, and only consider a goroutine idle if its blocked on a mutex with that bit set. I haven't worked this option through in much detail.

Either approach would still leave the same problem with a global channel serving as a mutex, as @rsc describes in the previous comment. We could say we don't cover this case; a synctest goroutine cannot communicate with a goroutine outside its bubble using channels. Another possibility would be to associate channels with bubbles: If a channel is created by a goroutine within a bubble, it is tied to the bubble. A goroutine blocked on a channel within its bubble is idle, but one blocked on a channel from outside its bubble is not. This would be a fairly straightforward change to make (although I'm not immediately sure how to do it without increasing the size of runtime.hchan, which would be unfortunate).

Comment From: Merovius

@neild Is it critical for Run to panic if all goroutines in a bubble are idle? ISTM that for debugging, it's fine to rely on the normal deadlock-detection, as dumping all the stacks will also contain the information on which test is stuck. It's a bit harder to find in that case, but that seems okay. Or is there an implementation reason why that panic is important?

Comment From: neild

It isn't strictly critical for Run to panic if all goroutines in a bubble are idle, but if goroutines can become non-durably "idle" in, for example, a call to reflect.FuncOf then the bubble's fake clock may be advanced prematurely and Wait calls may return before every active goroutine has truly idled.

Having slept on this, I propose the following amendments to the proposal:

A goroutine blocked on a mutex lock operation is not idle. This change addresses the reflect.FuncOf (etc.) problem, and has no notable impact on code using synctest when irrelevant when mutexes are held for only a short time and no blocking operations are performed while a mutex is held. Code doing something trickier with mutexes may not be testable with synctest; so it goes.
A goroutine blocked in sync.Cond.Wait is still idle.
Channels created within a synctest bubble are associated with that bubble.
A goroutine blocked on a channel that is not in a synctest bubble is not idle (from synctest's perspective). A goroutine blocked on a channel in a bubble is idle. This addresses the case where some shared state outside a bubble is guarded by a channel.
It is a fatal error for a goroutine not in a synctest bubble to send to, receive from, close, or select on a channel that is in a bubble. This enforces the property that an idle goroutine in a bubble is durably idle--no waking up unexpectedly based on external inputs.

As an implementation detail, to avoid increasing the size of runtime.hchan, we'll only track whether a channel is in some bubble (requires one bit, and hchan has free bits) rather than which bubble it is in (would require one word).

We could make idle detection work with mutexes: When locking a mutex, record whether it was locked by a goroutine in a synctest bubble or not. A goroutine blocked on a mutex operation is idle if the mutex is held by a bubble, non-idle otherwise. However, that's much more complexity than I want to propose adding to the mutex code, and I don't think it justifies the cost.

I'll try amending the implementation with this change to see how it works in practice.

Comment From: prattmic

A goroutine blocked on a mutex operation is idle if the mutex is held by a bubble, non-idle otherwise. However, that's much more complexity than I want to propose adding to the mutex code, and I don't think it justifies the cost.

Plus critical sections don't really have an owner. That is, Unlock does not have to be called by the same goroutine as Lock. This "feature" has stopped many of our clever ideas...

Comment From: neild

Plus critical sections don't really have an owner. That is, Unlock does not have to be called by the same goroutine as Lock.

I don't think that would be a problem here.

I AM NOT PROPOSING THE FOLLOWING, this is a thought experiment.
We define a new sync.Mutex.state bit indicating that a mutex is held by a bubble. Mutex lock operations set this bit if the locking goroutine is in a bubble.
When locking a bubbled mutex, we never spin (always act as if the mutex is in starvation mode).
When blocking on a bubbled mutex, we use a different wait reason (waitReasonSyncMutexLockSynctest). Goroutines in this state are idle in synctest terms, ones in normal mutex-lock-wait are not.
When unlocking a mutex, we throw if the unlocking goroutine's bubble state doesn't match that of the mutex.

I don't think this requires any inefficiencies for non-synctest code. It lets us identify mutexes held by a goroutine within a bubble, and goroutines blocked on a mutex held within a bubble. It prevents weird cases where a mutex is locked inside a bubble and unlocked outside it or vice-versa.

The rules are a bit complicated to lay out, but you wouldn't need to think about them in practice: If you're using synctest, Run idles when all goroutines in the bubble are blocked waiting for other goroutines in the bubble, and it doesn't idle when goroutines are blocked waiting for something outside the bubble.

It's probably not even all that complicated to implement.

But I'm doubtful that it's necessary. The fact that mutex lock operations spin before parking makes mutexes strictly inferior to channels for the case where you expect the lock to block for an extended period of time, which is the only case where excluding mutexes from synctest's consideration would be a problem.

Comment From: hherman1

Why is the solution for channels different than the solution for mutexes? Why shouldn’t they both have the in/out of bubble check

Comment From: neild

I have updated https://go.dev/cl/591997.

The current package documentation from that CL is:

// Package synctest provides support for testing concurrent code.
package synctest

// Run executes f in a new goroutine.
//
// The new goroutine and any goroutines transitively started by it form
// an isolated "bubble".
// Run waits for all goroutines in the bubble to exit before returning.
//
// Goroutines in the bubble use a synthetic time implementation.
// The initial time is midnight UTC 2000-01-01.
//
// Time advances when every goroutine in the bubble is idle.
// For example, a call to time.Sleep will block until all other
// goroutines are idle and return after the bubble's clock has
// advanced.
//
// If every goroutine is idle and there are no timers scheduled,
// Run panics.
//
// Channels, time.Timers, and time.Tickers created within the bubble
// are associated with it. Operating on a bubbled channel, timer, or ticker
// from outside the bubble panics.
func Run(f func())

// Wait blocks until every goroutine within the current bubble,
// other than the current goroutine, is idle.
//
// A goroutine is idle if it is blocked on:
//   - a send or receive on a channel from within the bubble
//   - a select statement where every case is a channel within the bubble
//   - sync.Cond.Wait
//   - time.Sleep
//
// A goroutine executing a system call or waiting for
// an external event such as a network operation is never idle.
// For example, a goroutine blocked reading from an network connection
// is not idle, even if no data is currently available on the connection.
//
// A goroutine is not idle when blocked on a send or receive on a channel
// that was not created within its bubble.
func Wait()

Changes from the previous version:

The documentation to uses the word "bubble" rather than "group" to describe the semi-isolated environment a collection of goroutines started by Run exists in. It seems we've naturally shifted to using "bubble" in discussion here, which makes me think it's a more intuitive term.
Channels are now associated with bubbles:
- A bubbled goroutine blocked on a bubbled channel is idle.
- A bubbled goroutine blocked on a non-bubbled channel is not idle.
- A non-bubbled goroutine operating on a bubbled channel panics.
Bubbled goroutines blocked on a mutex are no longer idle.
A non-bubbled goroutine using sync.Cond.Signal or sync.Cond.Broadcast to wake a bubbled goroutine throws.

To recap the rationale for the changes to channels and mutexes:

A goroutine in a synctest bubble can become temporarily blocked while acquiring some globally synchronized resource. For example, crypto/rand.Read performs lazy initialization of its randomness source and/or guards access to that source with a mutex. reflect.FuncOf contains a mutex-guarded cache.

My initial thought when proposing synctest was that code under test can and should avoid accessing global mutexes. However, I think the case of reflect.FuncOf shows that this isn't practical: There are functions in the standard library which acquire a global mutex, and where that fact isn't something the user can be expected to be aware of.

To avoid this problem, we define the rules on when a goroutine in a synctest bubble is "idle" to be more limited: A goroutine is idle when we can say with a high degree of certainty that it is blocked on another goroutine in its bubble. A goroutine blocked on something outside the bubble is not idle. This is the same principle that says that a bubbled goroutine blocked on a network operation or syscall is not idle: We can't say with confidence whether the goroutine is durably idle, or will be woken by some event from outside the bubble.

For channel operations, we handle this by associating channels with the bubble (if any) that creates the channel. A channel created within a bubble may only be used by goroutines in that bubble, and so a goroutine blocked on a bubbled channel can only be woken by some other goroutine in the bubble.

Mutexes are not explicitly created, so we cannot associate a sync.Mutex with a bubble at creation time. We could instead mark a mutex at Lock time as held by a bubbled goroutine, and then apply similar rules as for channels: A bubbled goroutine blocked on a bubbled mutex is idle, a bubbled goroutine blocked on an unbubbled mutex is not idle, and it is a fatal error for an unbubbled goroutine to unlock a bubbled mutex.

However, mutexes are generally not well-suited for cases where the lock operation blocks for an extended period of time. Mutex locks spin before parking, making acquisition inefficient in this case. There is no way to interrupt a blocked lock operation on timeout or cancelation. I hypothesize that there is little to no benefit to taking on more complexity in the mutex implementation to support synctest, and so the current proposal just assumes goroutines blocked on mutex acquisition are not durably idle.

sync.Cond is a different case: Condition variables are intended to waited on for longer periods of time. A Cond is explicitly constructed with NewCond, so we could choose to associate Conds with the bubble (if any) they are constructed in. For simplicity of implementation, I've instead implemented this as a prohibition against an unbubbled goroutine waking a bubbled goroutine from cond.Wait.

One final implementation detail: For simplicity and efficiency, the current CL does not attempt to track which bubble a channel is associated with. This means we will not detect cases where a channel created in one bubble is used in a different one. I don't think this situation is likely to come up often.

Proposal committee: Do these changes seem acceptable?

Comment From: neild

Why is the solution for channels different than the solution for mutexes? Why shouldn’t they both have the in/out of bubble check

One big difference is that channels are explicitly created, but mutexes are not. We can't distinguish mutexes created within a bubble from ones created outside one.

We could, as I describe above, distinguish mutexes locked by a bubble. There is no association between a mutex and the goroutine that locked it, but we could set a bit on a mutex that was locked by a bubbled goroutine. This would be sufficient to make mutexes "just work" with synctest.

The argument against is that this is more complexity in mutexes, which are already complicated and also highly performance critical. I don't think it's a lot of complexity, but I'm also not certain that it's necessary. Proposal committee, do you have opinions?

Comment From: prattmic

A channel created within a bubble may only be used by goroutines in that bubble.

I suspect that this will be OK in practice, though there is a small worry in the back of my head about lazy initialization of a global channel occurring inside the bubble and thus panicking when used outside. A theoretical example of this would be a sync.Once to initialize a channel inside of some opaque API, but I don't have concrete examples.

Comment From: prattmic

crypto/internal/randutil.MaybeReadByte does exactly this and is used by a variety of public crypto APIs. :(

Edit: FWIW, I don't understand why this function is implemented this way. The select implementation ends up using runtime.cheaprand, not some particularly fancy RNG. Perhaps this issue here was just the import graph? But that is mostly beside the point; this is just an example that such lazy initialization exists in practice.

Comment From: cherrymui

@rsc suggested above https://github.com/golang/go/issues/67434#issuecomment-2315909083 that we restrict the panic behavior to timer channels. That would eliminate @prattmic 's concern.

The updated proposal/CL applies the restriction to all bubbled channels, regardless of timers or data type. While the new update does solve the issue about buffering and (mostly) the issue about global channels, is there a good reason to do all channels instead of just timer channels? Or would time channels suffice for the intended testing cases?

Comment From: rsc

"Just timers" does seem better if we can make it work.

Someone can implement what is semantically a mutex using (a) sync.Mutex, (b) channels, or (c) sync.Cond. It seems weird that one of these methods has different behavior than the other two. It seems like all three should be acceptable, not just sync.Mutex.

Comment From: neild

To recap, the proposed panic behavior is: If a non-bubbled goroutine operates on a bubbled channel, panic.

The rationale for having a distinction between bubbled and unbubbled channels is to allow a bubbled goroutine to access channel-synchronized resources from outside its bubble. For example, let's imagine a simple case where a channel is being used as a lock:

var lockChan = make(chan struct{}, 1) // locked when the channel contains a value

// Get fetches some resource.
func Get() T {
  lockChan <- struct{}{} // lock acquired
  v := acquireResource()
  <-lockChan // lock released
  return v
}

If we didn't have a distinction between bubbled and unbubbled channels, then a bubbled goroutine calling Get while the lock is held from outside the bubble will run into problems:

synctest.Run(func() {
  Get()
})

Get blocks writing to lockChan, the only goroutine in the bubble is now idle, Run panics because all goroutines are deadlocked.

Making a distinction between bubbled and unbubbled channels means that instead when Get blocks on lockChan, synctest can recognize that it is blocked on something outside the bubble and not actually idle.

If lockChan is lazily created, however, it might be inadvertently created within a synctest bubble. Now we fall back to the previous behavior: Some goroutine outside the bubble acquires lockChan, Get blocks on lockChan, lockChan is in the bubble, Run panics. But things are much more confusing than before, because the behavior depends on when lockChan was created.

To avoid this confusion, we panic when the unbubbled goroutine writes to lockChan. An unbubbled goroutine accessing a bubbled channel indicates that something has gone wrong. The fix to the problem will probably be to ensure any lazy initialization happens outside the bubble.

I think that if we distinguish between bubbled and unbubbled chans, then we need to prevent unbubbled goroutines from accessing bubbled chans to avoid confusion. If we don't distinguish between bubbled and unbubbled chans, then the overall model is simpler, but bubbled goroutines can't access global resources synchronized by a channel, which is unfortunate.

Comment From: bmizerany

I've greatly enjoyed using the preview of this addition. It's been very useful in my work. However, now that synctest is internal, it's become challenging to test and use outside stdlib.

Would it be possible to make it accessible through a less restrictive means than a stdlib internal package? Perhaps a GOEXPERIMENT flag could work? The functionality is valuable enough that I'd consider vendoring it if it weren't so tightly coupled with Go's internals.

This is such an awesome new addition. I'm eager to keep using it, even in an experimental state. :)

Comment From: neild

Out of curiosity, I tried implementing the sync.Mutex semantics I described above: https://go.dev/cl/613515

To be clear, I am not currently proposing we do this. This is just an experiment to see how intrusive the changes to sync.Mutex might be.

To recap, this adds the following rules:

A locked sync.Mutex tracks whether the locking goroutine was in a bubble or not.
A goroutine blocked on sync.Mutex.Lock is idle if and only if the locking goroutine was in a bubble.

This essentially means that a mutex used within a bubble counts for idleness detection, but a global mutex shared with goroutines outside the bubble (such as the reflect type cache mutexes) does not.

The changes required to sync.Mutex are not huge, but are not entirely trivial either. I'm still not convinced the value of mutex support in synctest is worth changing such a performance-critical type.

Comment From: aclements

I am not fully caught up on this, but my inclination is that we need to take a step back and rethink the concept of bubbled and non-bubbled synchronization primitives. Given that the heap is shared, it seems fundamental that synchronization between inside and outside a bubble needs to work without panicking.

Comment From: neild

To summarize the current state of affairs as I understand them:

The synctest package depends on identifying when bubbled goroutines are durably blocked. ("Durably": The goroutine is not just parked, it isn't going to unpark without some other goroutine in its bubble taking an action.) The synctest fake clock advances when all bubbled goroutines are durably blocked, and the Wait function lets tests wait for background work to complete.

A bubbled goroutine can block on some non-bubbled resource. For example, reflect.TypeOf has a mutex-guarded cache, so a bubbled goroutine which calls TypeOf can block waiting for a non-bubbled goroutine. This goroutine is not durably blocked--it will resume executing when the non-bubbled goroutine releases the mutex.

The reflect.TypeOf case demonstrates that synctest must gracefully handle the case of a bubbled goroutine non-durably blocked on a goroutine outside the bubble. We can impose limitations on what you can do inside a synctest bubble, but "don't call reflect.TypeOf" is too much of a limitation.

There are three types of resource a bubbled goroutine can durably block on: Mutexes (sync.Mutex, sync.RWMutex), condition variables (sync.Cond), and channels. In all cases, a bubbled goroutine can block waiting for some other goroutine in the bubble, or block waiting for some global resource held by a goroutine outside the bubble.

For a bubbled goroutine blocked on any of these resources, we can - always say that it is durably blocked; - always say that it is not durably blocked; - or make a decision based on the state of the resource it is blocked on.

In the last case, we distinguish between a resource (mutex, channel, cond) "in the bubble" and one "out of the bubble". (For efficiency and implementation simplicity, this probably takes the form of a boolean "in some bubble" state, rather than tracking the actual bubble, but that's an implementation detail.)

A goroutine blocked on a mutex is always blocked because some goroutine acquired the mutex and has not released it. We could move a mutex into a bubble when a bubbled goroutine locks it, and move it out when it is unlocked. (This is demonstrated in https://go.dev/cl/613515.)

This trick doesn't work for channels and conds. However, channels and conds are created with a constructor, so we could mark them as bubbled or non-bubbled at creation time.

If resources have a bubbled/non-bubbled state, then there are several scenarios to consider:

A bubbled goroutine acts on a bubbled resource: If the action blocks the goroutine is durably idle.
A bubbled goroutine acts on a non-bubbled resource: Fine, but if the action blocks the goroutine is not durably idle.
A non-bubbled goroutine acts on a non-bubbled resource: Business as usual.
A non-bubbled goroutine acts on a bubbled resource: Something has gone wrong, and we should panic.

This last case is the one where panicking can (and I think must) occur. If a bubbled goroutine has been marked durably idle, it should not be woken by some event outside the bubble--the entire notion of "durably idle" is that the goroutine is waiting only for events produced within the bubble. If we can mark channels as being bubbled, then it is an error for a bubbled channel to be operated on from outside the bubble, since a "bubbled channel" is specifically a channel that isn't supposed to escape its bubble.

That's a lot of theory. I think the practical choices available to us are:

Mutexes:
Bubbled goroutines blocked on a mutex are not idle.

Pro: Simple. Allows synchronization between inside and outside a bubble. Con: A bubble doesn't become idle if a goroutine is durably blocked on a mutex. 2. Bubbled goroutines blocked on a mutex are idle if and only if the mutex was locked by a bubbled goroutine.

Pro: Allows synchronization between inside and outside a bubble. Mutexes inside a bubble just work. Con: A bit more complexity in Mutex. See https://go.dev/cl/613515. - Channels: 1. Bubbled goroutines blocked on a channel are always idle.

Pro: Simple. Channels are used less often than mutexes for global synchronization. Con: Does not allow synchronization between inside and outside a bubble. 2. Channels are marked as bubbled or non-bubbled. We panic if a non-bubbled goroutine accesses a bubbled channel.

Pro: Allows synchronization between inside and outside a bubble. Everything pretty much just works. Con: If a global synchronization channel is lazily constructed from within a bubble, the next non-bubbled goroutine to access it panics. - sync.Cond: 1. Bubbled goroutines blocked on sync.Cond.Wait are always idle.

Pro: Simple. Does anybody ever use a sync.Cond for global synchronization? Con: Does not allow synchronization between inside and outside a bubble. 2. Do whatever we do for channels.

Pro: Consistent. Con: If we mark channels as bubbled/non-bubbled, a bit more code in sync.Cond.

The simplest set of choices would be option 1 from all the above: blocking on a mutex is not idle, blocking on a chan or cond is always idle. That might be good enough, but it means synctest bubbles must not access global channels. That might be fine. (We can fix crypto/internal/randutil to not use a chan.)

Comment From: neild

We had a VC discussion about how to progress. To summarize my understanding of our conclusions:

For the moment, we're going to go with: - Channels are marked as bubbled when created in a bubble. Bubbled goroutines are idle when blocked on bubbled channels. A non-bubbled goroutine acting on a bubbled channel panics. - Bubbled goroutines are not idle when blocked on a mutex. - Bubbled goroutines are idle when blocked on sync.Cond.Wait.

This is the approach currently implemented in https://go.dev/cl/613515.

The rationale for these choices is based on what we're least likely to regret: - If we don't make it a panic for a non-bubbled goroutine to act on a bubbled channel now, we probably can't add it in the future. We can remove the panic if we decide it's a bad idea, though. - If we implement the more complex mutex behavior (idle only when blocking on a mutex acquired by another bubbled goroutine), we probably can't remove it in the future. We can add it later if we decide we want it, though. - sync.Cond generally doesn't get used for global resource locks, so we can do the simple thing now and decide to be more complicated later if necessary.

We will initially add the package with a GOEXPERIMENT=synctest guard to give people a chance to try out the API before we commit to it. (The package will only exist when using Go compiled with GOEXPERIMENT=synctest.)

I will also pick out a few existing third-party modules that use fake clocks in tests, and try rewriting their tests to use synctest instead to provide some more examples of whether it provides any significant benefit compared to existing approaches to testing.

Comment From: aclements

@neild , just for logistics, would you mind creating a new mini proposal issue for landing this with your stated semantics as a GOEXPERIMENT?

Comment From: neild

Done: #69687

Comment From: neild

I've encountered an unexpected problem caused by the new channel semantics (specifically: sending to a bubbled channel from outside the bubble panics).

I have a net/http test which (highly simplified) does something like this:

func Test(t *testing.T) {
  synctest.Run(func() {
    ch := make(chan struct{})
    t.Cleanup(func() {
      close(ch)
    })
    // ...
  })
}

(The actual test contains many more layers, of course, and the channel creation and t.Cleanup happen a couple levels of object initialization down.)

Since the t.Cleanup func executes after synctest.Run returns, it is executed outside the bubble and the close(ch) operation panics.

One obvious solution here is to say that I shouldn't do this. I can arrange for cleanup to happen inside the bubble. This is a bit unfortunate, however, since t.Cleanup really is tremendously useful and it would be a shame to take that tool off the shelf.

Or perhaps this indicates that panicking when operating on a bubbled channel from outside the bubble is a mistake. I'm not convinced, though: Waking up an "idle" bubble from outside the bubble is never correct. On the other hand, in this particular case nothing in the bubble is being woken--the bubble is already gone.

Another possibility is to add API to the testing package (as proposed by https://github.com/golang/go/issues/67434#issuecomment-2195486907 and others) to run a subtest in a bubble:

package testing

// RunIsolated runs f as a subtest of t, in a bubble.
// Cleanup funcs run in the bubble.
func (t *testing.T) RunIsolated(name string, f func())

This could either be in addition to the proposed synctest package, or a complete replacement of it: synctest.Run becomes testing.T.RunIsolated, and synctest.Wait becomes (perhaps) testing.T.Wait.

I am currently leaning towards saying that this case indicates that, while I'm reluctant to propose increasing the testing package's API surface, there needs to be a version of T.Run that starts a bubble and arranges for cleanup to happen within it. Probably on B and F as well for consistency, although I'm very dubious that any benchmark using a bubble will produce useful results.

Comment From: Merovius

@neild why not package synctest; func Run(t *testing.T, name string, f func(*testing.T))? That is, why does it have to be a method on *testing.T, instead of being a top-level function accepting a *testing.T? That would not increase the testing API, while still allowing it to be a subtest with its own cleanup.

Comment From: neild

@Merovius That's an interesting idea.

Having a version of t.Run that isn't in the testing package seems a bit strange to me, but perhaps that would be okay.

Implementation would be messy. I think the only way I see to do it is to implement the Run function in package testing and linkname it over to testing/synctest. The problem is that Run(t *testing.T, name string, f func(*testing.T)) needs to run f and the cleanup funcs registered by f within a bubble, but needs to run the surrounding test infrastructure (which accounts for test timeouts, among other things) outside the bubble--it can't just call testing.T.Run, because either it calls it outside a bubble (and cleanup funcs run outside the bubble) or inside a bubble (and the testing package's timing mechanisms run with the fake clock).

I think the messiness of implementation indicates that this isn't feasible. Maybe I'm missing a good way to do it.

Comment From: neild

I'm going to propose two options:

Option 1: Keep testing/synctest, add a method to testing.TB.

We keep all of the existing proposal unchanged. We add one method in the testing package to T, B, F, and TB:

package testing

// RunIsolated runs f as a subtest of t called name.
// It runs f in a synctest bubble (as if called by [synctest.Run]).
// Cleanup functions run in the bubble.
// RunIsolated otherwise behaves like [t.Run].
func (t *T) RunIsolated(name string, f func()) bool

This option has the advantage of keeping all the complicated synctest documentation in its own package.

Option 2: Move the API to the testing package.

We drop the testing/synctest package and add the following to the testing package:

package testing

// RunIsolated runs f as a subtest of t called name.
// Run reports whether f succeeded (or at least did not fail before calling t.Parallel).
//
// Run runs f in a new goroutine.
// This goroutine and any goroutines transitively started by it form
// an isolated "bubble".
// Run waits for all goroutines in the bubble to exit, or for f to call t.Parallel.
//
// Goroutines in the bubble use a fake time implementation.
// The initial time is midnight UTC 2000-01-01.
//
// A goroutine in the bubble is idle if it is blocked on:
//   - a send or receive on a channel created within the bubble
//   - a select statement where every case is a channel within the bubble
//   - sync.Cond.Wait
//   - time.Sleep
//
// The above cases are the only times a goroutine is idle.
// In particular, a goroutine is NOT idle when blocked on:
//   - system calls
//   - cgo calls
//   - I/O, such as reading from a network connection with no data
//   - sync.Mutex.Lock or sync.Mutex.RLock
//
// Time advances when every goroutine in the bubble is idle.
// For example, a call to time.Sleep will block until all goroutines
// are idle, and return after the bubble's clock has advanced
// by the sleep duration.
//
// If every goroutine is idle and there are no timers scheduled,
// Run panics.
//
// Channels, time.Timers, and time.Tickers created within the bubble
// are associated with it. Operating on a bubbled channel, timer, or ticker
// from outside the bubble panics.
func Run(f func(*Sync))

// A Sync is used by RunIsolated for running isolated subtests.
type Sync

// Wait blocks until every goroutine within the current bubble,
// other than the current goroutine, is idle.
func (s *Sync) Wait()

This option has the advantage of not adding a synctest.Run that, presumably, just about nobody will ever use (because testing.T.RunIsolated is the preferable choice), but it adds a big chunk of documentation to the testing package.

Of the two, I lean towards option 2, but only slightly.

Comment From: ianthehat

Would it be possible to unbubble channels that are owned by a bubble when the bubble goes away, such that operating on them no longer panics after the bubble is ended? it might be too expensive to implement but I think it would logically solve this case in a clean way?

Comment From: neild

Unbubbling channels after the bubble ends is an interesting idea. It would fix the panic in the case that I have. It wouldn't help a test that starts a background goroutine and signals that goroutine to exit in a t.Cleanup. (Think: Start a server that listens for connections in a separate goroutine, stop it in cleanup.)

Implementation would be a bit tricky, since we need to store an association between channels and bubbles. The obvious and easy way is to put a pointer to the bubble on the channel, but that grows the size of a channel by a word; too expensive for a test-only feature. Alternatively, the bubble could have a set of weak references to bubbled channels, or we could have a global weak map of channel-to-bubble.

Comment From: neild

This is a report on attempting to apply the proposed synctest API to existing, real-world code.

etcd is a popular distributed key-value store. "github.com/jonboulle/clockwork" is an also-popular fake time package. The etcd repository (https://github.com/etcd-io/etcd) contains a number of Go modules, some of which use the clockwork package for testing.

In this experiment, I rewrote the tests for package "go.etcd.io/etcd/server/v3/etcdserver/api/v3compactor" to use the proposed synctest API.

To give away the ending: The synctest package was able to replace the fake clock, simplifying the system under test (SUT). The synctest package was also able to replace some complex manual synchronization between test and SUT, leading to a simpler, more robust test.

In the following, I will go over one specific test in detail, explaining its behavior before and after my changes. The test is TestPeriodicPause: https://github.com/etcd-io/etcd/blob/ac3d5d77ea5fdbc12ef07a6f6fe1722f06d75b24/server/etcdserver/api/v3compactor/periodic_test.go#L132-L175

System Under Test

The system under test (SUT) is a Periodic. A Periodic operates on a RevGetter and a Compactable:

type RevGetter interface {
    Rev() int64
}

type Compactable interface {
    Compact(ctx context.Context, r *pb.CompactionRequest) (*pb.CompactionResponse, error)
}

A Periodic maintains a background goroutine which periodically polls RevGetter.Rev, and calls Compactable.Compact when certain conditions are met. The internal details of RevGetter and Compactable are not important to the test, which uses a fake implementation of both.

This is a straightforward system: The inputs are time and the revisions, and the output is a series of Compact calls.

Test Infrastructure

TestPeriodicPause uses a testutil.Recorder to synchronize and monitor the SUT.

type Action struct {
    Name   string
    Params []interface{}
}

type Recorder interface {
    // Record publishes an Action (e.g., function call) which will
    // be reflected by Wait() or Chan()
    Record(a Action)
    // Wait waits until at least n Actions are available or returns with error
    Wait(n int) ([]Action, error)
    // Action returns immediately available Actions
    Action() []Action
    // Chan returns the channel for actions published by Record
    Chan() <-chan Action
}

A Recorder records a sequence of Actions (events) performed by the SUT. The fake implementations of RevGetter and Compactable record actions for each Rev/Compact call.

The implementation of the Record interface is interesting and relevant to us. The testutil package contains two implementations of Recorder.

A RecorderBuffered records each action to an internal slice of unbounded length. Record calls do not block. Wait calls attempt to wait for all pending record calls to finish before returning:

func (r *RecorderBuffered) Wait(n int) (acts []Action, err error) {
    // legacy racey behavior
    WaitSchedule()
    // ...
}

// WaitSchedule briefly sleeps in order to invoke the go scheduler.
// TODO: improve this when we are able to know the schedule or status of target go-routine.
func WaitSchedule() {
    time.Sleep(10 * time.Millisecond)
}

Note the comment about "legacy racey behavior", and the WaitSchedule function.

A recorderStream, in contrast, records each action to an unbuffered channel. Record calls block until a Wait call consumes the action. A recorderStream is created with a timeout, where a timeout of 0 indicates no timeout. Wait(n) waits up to the timeout (or indefinitely when timeout==0) or until n actions are received.

TestPeriodicPause uses blocking recorderStreams to synchronize with the SUT:

rg := &fakeRevGetter{testutil.NewRecorderStreamWithWaitTimout(0), 0}
compactable := &fakeCompactable{testutil.NewRecorderStreamWithWaitTimout(10 * time.Millisecond)}

Note that the fakeRevGetter Recorder is created with no timeout, while the fakeCompatable Recorder has a 10ms timeout. The test uses the fakeRevGetter's Recorder to synchronize with the Periodic's background goroutine. For example, when advancing over an interval of time, the test reads actions from the RevGetter:

func TestPeriodicPause(t *testing.T) {
    fc := clockwork.NewFakeClock()
    // ...

// tb will collect 3 hours of revisions but not compact since paused
for i := 0; i < n*3; i++ {
    waitOneAction(t, rg)
    fc.Advance(tb.getRetryInterval())
}

// ...
}

func waitOneAction(t *testing.T, r testutil.Recorder) {
    if actions, _ := r.Wait(1); len(actions) != 1 {
        t.Errorf("expect 1 action, got %v instead", len(actions))
    }
}

In each iteration of the loop above:

The SUT polls the RevGetter for the current revision. This blocks recording an action.
The test consumes the RevGetter action, unblocking the SUT.
The fake RevGetter automatically increments the current revision.
The test advances the fake clock.

This is a complicated dance. Every call of RevGetter.Get by the SUT must be paired with a Recorder.Wait call by the test. It would be fairly easy to desynchronize the SUT and the test, with confusing results. However, the use of the Recorder to create synchronization points between the SUT and test allows tests to create carefully orchestrated scenarios with a fake clock.

Using synctest

Using the synctest package, we can eliminate much of this test's infrastructure, and simplify what remained.

I made the following changes:

Use the time package in the SUT, rather than a testable wrapper.
Remove the use of testutil.Recorder entirely.
Remove the automatic incrementing of revisions in the fake RevGetter. Previously, every RevGetter.Rev call by the SUT was paired with a Recorder.Wait call by the test. Each Wait corresponded to an increment in the revision. We don't need the Wait calls for synchronization any more, but it seems useful to keep the changes in revision number explicit in the test.
Change the fake Compactable to record one compaction event. It reports an error if an unexpected compaction occurs. This is an improvement over the Recorder-based design, which could ignore unexpected compactions.

The rewritten test:

func TestPeriodicPause(t *testing.T) {
    synctest.Run(func() {
        testPeriodicPause(t)
    })
}
func testPeriodicPause(t *testing.T) {
    retentionDuration := time.Hour
    rg := &fakeRevGetter{rev: 1}
    compactable := newFakeCompactible(t)
    tb := newPeriodic(zaptest.NewLogger(t), retentionDuration, rg, compactable)
    defer tb.Stop()

    tb.Run()
    tb.Pause()

    n := tb.getRetentions()

    // tb will collect 3 hours of revisions but not compact since paused
    for i := 0; i < n*3; i++ {
        rg.IncRev()
        time.Sleep(tb.getRetryInterval())
    }

    compactable.Want(nil) // no compaction

    // tb resumes to being blocked on the clock
    // will kick off a compaction at T=3h6m by retry
    tb.Resume()
    time.Sleep(tb.getRetryInterval())
    compactable.Want(&pb.CompactionRequest{Revision: int64(1 + 2*n + 1)})
}

And, for contrast, the original:

func TestPeriodicPause(t *testing.T) {
    fc := clockwork.NewFakeClock()
    retentionDuration := time.Hour
    rg := &fakeRevGetter{testutil.NewRecorderStreamWithWaitTimout(0), 0}
    compactable := &fakeCompactable{testutil.NewRecorderStreamWithWaitTimout(10 * time.Millisecond)}
    tb := newPeriodic(zaptest.NewLogger(t), fc, retentionDuration, rg, compactable)

    tb.Run()
    tb.Pause()

    n := tb.getRetentions()

    // tb will collect 3 hours of revisions but not compact since paused
    for i := 0; i < n*3; i++ {
        waitOneAction(t, rg)
        fc.Advance(tb.getRetryInterval())
    }
    // t.revs = [21 22 23 24 25 26 27 28 29 30]

    select {
    case a := <-compactable.Chan():
        t.Fatalf("unexpected action %v", a)
    case <-time.After(10 * time.Millisecond):
    }

    // tb resumes to being blocked on the clock
    tb.Resume()
    waitOneAction(t, rg)

    // unblock clock, will kick off a compaction at T=3h6m by retry
    fc.Advance(tb.getRetryInterval())

    // T=3h6m
    a, err := compactable.Wait(1)
    if err != nil {
        t.Fatal(err)
    }

    // compact the revision from hour 2:06
    wreq := &pb.CompactionRequest{Revision: int64(1 + 2*n + 1)}
    if !reflect.DeepEqual(a[0].Params[0], wreq) {
        t.Errorf("compact request = %v, want %v", a[0].Params[0], wreq.Revision)
    }
}

The complete code is at:
https://github.com/neild/etcd/commit/57e8a4d1fc7117870c76f470e1674be39caf6307

What went well

The synctest package was effective at providing synchronization and fake time for this test.

The synctest version of the test is slightly shorter than the original, although some of that reduction in size is thanks to moving some functionality to the fakeCompactable type.

The synctest version of the test is, I believe, easier to modify: The original depends on precisely pairing every recorded action in the SUT with a Wait call in the test.

The original version of the test contains 10ms waits in various places, waiting for the SUT to stabilize. The synctest version just waits for the SUT to stabilize.

What went less well

The synctest.Run call indents all the test code by a level. Not really a big concern.

I forgot to follow a time.Sleep call with a synctest.Wait a couple times.

One test (TestRevisionPause) left a background goroutine running. This produces confusing results: The test hangs after executing, because synctest.Run keeps advancing the fake clock into the future and restarting the background goroutine. The fix was simple–stop the goroutine before finishing the test–but identifying the problem is a bit difficult.

These last two points make me wonder if it would be clearer to have a function that explicitly advances the fake clock. Or perhaps we should stop advancing the fake clock when the root goroutine started by synctest.Run returns.

Comment From: nightlyone

Since a common error condition in tests is to still have Go routines running when one actually doesn't want to, identifying and nothing Go routines started but not exited would contribute to the isolation of this test feature. But that is probably its own proposal.

When implementing synctest.Run as testing.T.RunIsolated, it would be great to keep that door open.

Comment From: cherrymui

If we want to keep synctest.Run and also t.Cleanup work as expected, perhaps we can have synctest.Run take a func(*testing.T), and we pass a wrapped t that runs t.Cleanup functions in the same bubble? The wrapping may need to access some internals of the testing package (I haven't looked into the detail), but we probably could make that happen. If we do this, it probably becomes synctest.Run(*testing.T, func(*testing.T)), where it takes the original t, and passed the wrapped one to the closure.

Comment From: neild

We can have synctest.Run(*testing.T, func(*testing.T)), but it'll probably need to do some internal linkname shenanigans to cooperate with the testing package. I think we'd probably implement the function in the testing package (it'll be simple, just a few lines long), and then use linkname to make it visible in synctest.

I'm fine with that, but it might be simpler to just say that if Run uses a *testing.T it belongs in the testing package.

Comment From: aclements

Personally I like the unbubbling idea, perhaps because I'm already a bit skeptical of the idea of bubbling in general. You could do it without increasing the size of chan by keeping a side table of bubbled channels.

One test (TestRevisionPause) left a background goroutine running.

It sounds like this should be easy to debug from a stack dump, no?

Or perhaps we should stop advancing the fake clock when the root goroutine started by synctest.Run returns.

Does that mean that in a situation like the background goroutine, sleeps would block the goroutine forever?

Comment From: neild

It sounds like this should be easy to debug from a stack dump, no?

This is easy to debug from a stack dump--it's just briefly confusing, because the failure mode is for the test to hang indefinitely rather than giving you an immediate stack dump. Once I ran the test with -timeout=2s the stacks made it clear what had happened.

Does that mean that in a situation like the background goroutine, sleeps would block the goroutine forever?

In the case with the background goroutine, the main test goroutine started by Run would return, and Run would then panic complaining that all remaining goroutines are blocked. (As opposed to advancing the fake clock, which is what it currently does.)

I think we should do this; I'm not sure there's a use case for continuing to run the fake clock after the main test goroutine has returned, and we can change our minds later if we want.

Comment From: neild

Personally I like the unbubbling idea, perhaps because I'm already a bit skeptical of the idea of bubbling in general. You could do it without increasing the size of chan by keeping a side table of bubbled channels.

I've been thinking about this, and I think unbubbling channels at the end of the bubble isn't the right choice.

The problem is that it's reasonable for a test to want to shut down background goroutines in a cleanup function. For example, a test may start a server listening on a fake network socket (with a background goroutine blocked in net.Listener.Accept) and stop the server in a cleanup function. If the cleanup function runs after the bubble exits, then it runs too late: a bubble never exits cleanly while any bubbled goroutines are still executing.

Cleanup functions registered in a bubble should execute in the bubble.

This is independent of the question of whether bubbling channels is a good idea or not--even if we don't associate channels with bubbles, we still want to be able to shut down a test completely before Run exits and its bubble ends.

Comment From: aclements

Putting on hold for experience with synctest under a GOEXPERIMENT (#69687). Discussion can of course continue, but this way we'll hold off on looking at this each week until there's more experience with it.

Comment From: aclements

Placed on hold.

Comment From: gopherbot

Change https://go.dev/cl/629735 mentions this issue: testing/synctest: add experimental synctest package

Comment From: gopherbot

Change https://go.dev/cl/629856 mentions this issue: runtime: avoid deadlock in synctest changegstatus when copying stacks

Comment From: mvdan

Should this issue be closed now that https://github.com/golang/go/issues/69687 is closed as implemented?

Comment From: cherrymui

Should this issue be closed now that https://github.com/golang/go/issues/69687 is closed as implemented?

No. #69687 is adding experimental support (behind a GOEXPERIMENT), to gain experience and perhaps adjust the API/implementation accordingly. This issue is for making it final and generally available.

Comment From: neild

70512 is an interesting case which could conceivably occur in user code:

syscall/js.FuncOf takes a Go function and returns a value which can be passed to JavaScript for execution. This is used by the js version of syscall to invoke callback-based JS APIs.

This causes a problem when the caller of FuncOf is in a synctest bubble. In #70512, the caller waits for an asynchronous JS API to finish by creating a channel and waiting for the FuncOf function to send a value on it. This caused a cross-bubble channel send, and a panic. (The panic in this case is helpful, and I think this is a point in favor of associating channels with bubbles: The bubbled goroutine waiting on the channel is not "durably blocked"--it's essentially blocked on a syscall, and will wake when the call completes.)

I'm fixing this in https://go.dev/cl/631055 by making syscall/js.FuncOf aware of synctest. If a bubbled goroutine creates a function with FuncOf, its bubble will not become idle until the function is released with Func.Release, and the Go callback function will be executed in the bubble of the creator. (Fortunately syscall/js.Func already requires manual cleanup with an existing Release method, so we can easily track the lifetime of these functions.)

For the common case, this makes everything just work and is transparent to the user. If users use syscall/js.FuncOf directly, however, it's a complication that they may need to be aware of.

Comment From: cherrymui

Interesting case!

I think it makes sense to consider an asynchronous callback to be in the same bubble. In some sense, it is conceptually like the caller creates the goroutine for the callback to run on.

On the other hand, I think a js.FuncOf function can also be called synchronously, e.g. Go -> JS -> Go, all synchronous calls. In this case the callback runs on the same goroutine for the synchronous call. Usually it is the same goroutine that creates the FuncOf, so the same bubble makes sense. But it could be also a different goroutine, potentially in a different bubble?

Comment From: danp

Thanks for this, @neild!

As part of writing up something to promote the experiment I noticed that tickers require a Stop for synctest.Run to return.

For example, with this test:

func Test(t *testing.T) {
    synctest.Run(func() {
        ctx := context.Background()

        ctx, cancel := context.WithCancel(ctx)

        var hits atomic.Int32
        go func() {
            tick := time.NewTicker(time.Millisecond)
            // no defer tick.Stop()
            for {
                select {
                case <-ctx.Done():
                    return
                case <-tick.C:
                    hits.Add(1)
                }
            }
        }()

        time.Sleep(3 * time.Millisecond)
        cancel()

        got := int(hits.Load())
        if want := 3; got != want {
            t.Fatalf("got %v, want %v", got, want)
        }
    })
}

It emits the message from Fatalf and then hangs.

Once I added the defer it returned. Since 1.23 came out I've gotten out of the habit of deferring Stop for Tickers and such so it was surprising.

Comment From: neild

As part of writing up something to promote the experiment I noticed that tickers require a Stop for synctest.Run to return.

Thanks for testing the package out!

I'm convinced now that advancing the fake clock after the root goroutine returns is confusing and a mistake. I ran into a similar confusion in https://github.com/golang/go/issues/67434#issuecomment-2430263934, where a background goroutine in a poll loop unexpectedly kept synctest.Run executing forever.

I think the right behavior should be:

When the function called by synctest.Run returns, time stops advancing in the bubble. Run waits for all other goroutines in the bubble to exit before returning, but it does not advance the fake clock. If all remaining goroutines are durably blocked, Run panics.

Comment From: gopherbot

Change https://go.dev/cl/635375 mentions this issue: _content/doc/go1.24: add issue link to synctest relnotes

Comment From: seankhliao

Could the same underlying implementation be used to replace the faketime in the playground?

Comment From: paskozdilar

I've tested out testing/synctest for some of my pet projects, and I've found a somewhat ugly, but functional, way for testing packages that use net.Listen and net.DialContext within the same process.

I've used Monkey for monkeypatching, channel-based mock-TCP implementation like this, and the following code in TestMain:

func TestMain(m *testing.M) {
    // Patch net.Listen to use mock.Listen
    monkey.Patch(net.Listen, mock.Listen)

    // Patch Dialer.DialContext to use mock.Dialer.DialContext
    monkey.PatchInstanceMethod(
        reflect.TypeOf(&net.Dialer{}),
        "DialContext",
        func(d *net.Dialer, ctx context.Context, network, address string) (net.Conn, error) {
            md := mock.Dialer{}
            return md.DialContext(ctx, network, address)
        },
    )
}

Running the tests that contain network calls using go test -gcflags=-l (as instructed in Monkey docs), all the tests that use DialContext/Listen, testing context cancellation, timeouts, etc. were passing instantly.

Comment From: neild

Could the same underlying implementation be used to replace the faketime in the playground?

Possibly, but there are enough differences in behavior that it might not be a simplification. I don't have any plans to try to unify the implementations at this time. If someone wants to give it a shot, it's probably first best to wait and see what the outcome of the synctest experiment is (possibly this all gets reverted!).

Comment From: jellevandenhooff

From working on https://github.com/jellevandenhooff/gosim, which simulates time like testing/synctest but also tries to do much more, I have some thoughts and questions: - with synctest the core Go runtime is 95% of the way there to deterministic simulation testing. Would it be possible and/or desirable to make the scheduler inside of a synctest bubble behave reproducibly? Running only one thread at a time seems doable, but always context-switching at the some spots seems much trickier. - for managing the bubble/non-bubble boundary, what would a design look like that runs bubbled code in a separate child process? In gosim the boundary between bubbled/non-bubbled code is more fragile because maps are not compatible across the boundary, and I moved all code-under-test to its own subprocess along with tooling to run tests running in a sub-process. With a child process there is never any question about unexpected interactions between bubbled and non-bubbled code. Additionally, it would mean that any bubble-only runtime logic need not be compiled into the main binary which might remove size or speed concerns for complicated detections of paused goroutines. - I imagine that users of synctest will want to mock eg. the network and filesystem. I think one thing this proposal demonstrates is the power of a global mock, where all users of channels and time end up using the same bubbled logic. Supporting OS-level APIs is not part of this proposal, but thinking about what to do to set those use-cases up for success seems good. Would you want a fs.FS interface with write support? A network.Network? Do they need to integrate with synctest? - A crazy thing to test is how programs behave when different nodes experience clock drift. I think that with synctest you could wrap an interface around time.Now (and others) that offsets clocks on different simulated nodes and then relies on synctest to advance time. That would require using that wrapper, which is too bad because it means all users of time need to be modified which seems unlikely to be adopted for eg. the timeouts in the gRPC or HTTP libraries. Supporting clock drift at the core of synctest, on the other hand, seems like a big request for a more niche use case. - Should clock values be this perfectly rounded? They are not in real applications. Maybe the perfectly synchronized clocks will catch some bugs where times are compared less-than-equals instead of equals (or vice-versa) which you would not find with fuzzier clocks. But maybe it hides other bugs?

Comment From: neild

Would it be possible and/or desirable to make the scheduler inside of a synctest bubble behave reproducibly?

This is an interesting idea, but probably out of scope for the synctest proposal.

Currently, the runtime does the opposite when running under the race detector: Goroutine scheduling is made less deterministic, to aid in detecting accidental dependencies on the current scheduler behavior.

for managing the bubble/non-bubble boundary, what would a design look like that runs bubbled code in a separate child process?

I don't know. I think that's a different feature than synctest is attempting to provide.

Inside Google, we have a test library to aid in executing test functions in a subprocess. It's used something like this:

func Test(t *testing.T) {
  cmd := exectest.Command(t, "unrecovered panic", func(t *testing.T) {
    os.Exit(2)
  })
  _, err := cmd.CombinedOutput()
  if err == nil {
    t.Fatalf("expected unsuccessful exit, got none")
  }
}

The exectest.Command function runs the current test program in a subprocess, rerunning the current test up to the Command call and executing its parameter. If a test contains multiple exectest.Command calls, Command returns an ErrSkipped error from every call except the current one.

This is useful for testing functions that terminate the current process (like os.Exit). However, it's a fairly sharp-edged package--any init funcs rerun in the subprocess and Command necessarily reruns the entire test function up to the point where it was called. I don't see how to use this approach in synctest without adding a number of confusing corner cases.

Supporting OS-level APIs is not part of this proposal, but thinking about what to do to set those use-cases up for success seems good. Would you want a fs.FS interface with write support? A network.Network?

Time is, I think, somewhat unique in terms of testing.

The time package does not provide any testable abstractions. It exports functions like time.Now which can't be replaced in tests, and concrete types like time.Timer which can't be replaced with fakes. One could imagine a different design for the package with an abstract clock type which can be passed around, but that's not what we have.

It's also not possible to use real time in tests, or at least not well: Tests which sleep for some amount of real time are always slow and/or flaky.

In contrast, the net package provides a number of useful abstractions: You can write code which operates on a net.Conn or a net.Listener and use a fake implementation in tests. It's also possible to use real network connections on a loopback interface in tests. The fs package's filesystem abstraction may not include write support yet, but you can use the real filesystem in tests.

Another difference is that there's less variation in what users may want from a fake time implementation: time.Sleep should wait until the fake clock has advanced by some amount of time, time.AfterFunc should execute a function when the fake clock advances by some amount, and so on. In contrast, a test using a fake net implementation might want to inject faults, or drop packets, or introduce latency.

I do think that if the synctest experiment is successful, it will increase the need for good fake implementations of net.Listener, net.Conn, and net.PacketConn, since it mostly isn't possible to use loopback network connections with synctest. However, these can be developed outside the standard library.

A crazy thing to test is how programs behave when different nodes experience clock drift.

Should clock values be this perfectly rounded?

Clock drift seems highly specialized and out of scope for what synctest is trying to provide.

The precision of time under synctest is definitely artificial: A timer set to fire at a certain time will always fire at exactly that time, and time does not pass except when goroutines block. I don't see a way to avoid this; we could introduce clock drift, but at the expense of making it harder to write deterministic tests.

Comment From: bmizerany

While profiling allocations using the new (*testing.B).Loop in some code that sleeps, I thought I'd make things spicy and try synctest to speed things up. It works great for finding allocations! I did, however, notice the ns/op reported was a little off. I'm assuming it's because the new Loop is not in the "bubble" or using some other timer or other?

; cat x_test.go 
package x

import (
    "testing"
    "testing/synctest"
    "time"
)

func Benchmark(b *testing.B) {
    synctest.Run(func() {
        for b.Loop() {
            time.Sleep(1 * time.Millisecond)
        }
    })
}
; go test -bench=.          
goos: darwin
goarch: arm64
pkg: x
cpu: Apple M3 Max
Benchmark-16           1200 -788254364987334 ns/op
PASS
ok      x   0.181s

Switching back to for range b.N { ... } does not present the same output:

; cat x_test.go             
package x

import (
    "testing"
    "testing/synctest"
    "time"
)

func Benchmark(b *testing.B) {
    synctest.Run(func() {
        for range b.N {
            time.Sleep(1 * time.Millisecond)
        }
    })
}
; go test -bench=.
goos: darwin
goarch: arm64
pkg: x
cpu: Apple M3 Max
Benchmark-16        2996448       390.9 ns/op
PASS
ok      x   1.756s

I would not assume the ns/op be accurate since the code is not really sleeping anymore, but the large negative duration is a little surprising.

Comment From: prattmic

b.Loop handles scaling b.N internally, which involves a time.Since on b.start, which is time.Now before calling the benchmark function. Since synctest uses a completely different clock, time.Since on a time from outside the bubble won't work properly.

With a normal b.N loop, scaling is handled by calling the benchmark function multiple times and timing the entire thing, which means all timing occurs outside the bubble.

Comment From: bmizerany

@prattmic Thank you for the insights. I had a hunch it was something along those lines.

I'm wondering if there is a way to detect the use of b.Loop in a bubble and show some helpful output to explain that timings are unable to be accurate and therefore are not displayed; or, if it is incorrect to mix synctest a b.Loop, refuse to build the benchmark?

Comment From: neild

We could also have B.Loop use the real clock, even when in a bubble. Not saying that's the right choice, but it's an option.

If we don't have B.Loop use the real clock, it should probably panic when invoked in a bubble.

Comment From: neild

Current proposal status

The testing/synctest package will be available in Go 1.24 as an experimental package. The package is not available by default. To use it, build your program with the environment variable GOEXPERIMENT=synctest.

The experimental package API is as follows:

// Run executes f in a new goroutine.
//
// The new goroutine and any goroutines transitively started by it form
// an isolated "bubble".
// Run waits for all goroutines in the bubble to exit before returning.
//
// Goroutines in the bubble use a synthetic time implementation.
// The initial time is midnight UTC 2000-01-01.
//
// Time advances when every goroutine in the bubble is blocked.
// For example, a call to time.Sleep will block until all other
// goroutines are blocked and return after the bubble's clock has
// advanced. See [Wait] for the specific definition of blocked.
//
// If every goroutine is blocked and there are no timers scheduled,
// Run panics.
//
// Channels, time.Timers, and time.Tickers created within the bubble
// are associated with it. Operating on a bubbled channel, timer, or ticker
// from outside the bubble panics.
func Run(f func())

// Wait blocks until every goroutine within the current bubble,
// other than the current goroutine, is durably blocked.
// It panics if called from a non-bubbled goroutine,
// or if two goroutines in the same bubble call Wait at the same time.
//
// A goroutine is durably blocked if can only be unblocked by another
// goroutine in its bubble. The following operations durably block
// a goroutine:
//   - a send or receive on a channel from within the bubble
//   - a select statement where every case is a channel within the bubble
//   - sync.Cond.Wait
//   - time.Sleep
//
// A goroutine executing a system call or waiting for an external event
// such as a network operation is not durably blocked.
// For example, a goroutine blocked reading from an network connection
// is not durably blocked even if no data is currently available on the
// connection, because it may be unblocked by data written from outside
// the bubble or may be in the process of receiving data from a kernel
// network buffer.
//
// A goroutine is not durably blocked when blocked on a send or receive
// on a channel that was not created within its bubble, because it may
// be unblocked by a channel receive or send from outside its bubble.
func Wait()

As this package is experimental, it is not subject to the Go 1 compatibility promise. Depending on experience and feedback, we may promote testing/synctest to a fully-supported package, possibly with incompatible changes from the current version; continue the experiment in future versions, again possibly with amendments; or drop the experiment entirely and remove the package.

Your feedback is essential! If you try out testing/synctest, please report your experiences, both positive and negative, in this issue.

Known issues

Time advances forever after main test goroutine exits

This call to Run never returns:

synctest.Run(func() {
  time.NewTicker(1 * time.Second)
})

The problem is that Run does not return so long as advancing the fake clock causes some event to happen. The time.Ticker provides a never-ending source of events. A similar problem happens with any background goroutine that performs a periodic action:

synctest.Run(func() {
  go func() {
    for {
      time.Sleep(1 * time.Second)
      // do something
    }
  }()
}

A potential fix for this issue is to say that the fake clock stops advancing once the goroutine started by Run returns. With this change the first example with a time.Ticker would return, and the second example with a sleeping background goroutine would panic.

Confusing error when `go.mod` Go version is less than 1.23

Go 1.23 added a new implementation of channel-based timers (https://go.dev/wiki/Go123Timer), with slightly different semantics. The new implementation is used in programs where package main is in a module with a go.mod declaring go1.23 or later. The GODEBUG setting asynctimerchan can force the old or new semantics.

The current testing/synctest implementation is not compatible with the old implementation. If you try to use testing/synctest in a program where package main's go.mod declares a pre-1.23 version of Go, you get a confusing error about "synctest.Run not supported with asynctimerchan!=0".

We either need to support asynctimerchan=1 or improve the error message.

`(testing.TB).Cleanup` does not work well in synctest bubbles

In the following code, the cleanup function is executed outside the bubble, and panics due to operating on a bubbled channel from outside the bubble:

func Test(t *testing.T) {
  synctest.Run(func() {
    ch := make(chan struct{})
    t.Cleanup(func() {
      close(ch)
    })
  })
}

In general, T.Cleanup doesn't work well when used inside a synctest bubble. This is unfortunate, because T.Cleanup is tremendously useful.

`(*testing.B).Loop` does not work in synctest bubbles

When used inside a synctest bubble, B.Loop uses the fake clock and gets confused.

Either B.Loop should use the real clock, or it should panic when used inside a bubble.

Deadlock errors don't print useful stacks

When every goroutine in a bubble is durably blocked and advancing the current time won't unblock any of them, Run panics: "deadlock: all goroutines in bubble are blocked".

In a test, this panic results in the stack for the goroutine that called Run being printed, but not the stacks of any other goroutines. In this situation, the user almost certainly wants to see the stacks of all the goroutines in the bubble.

Comment From: gopherbot

Change https://go.dev/cl/641176 mentions this issue: testing/synctest: add some examples

Comment From: gopherbot

Change https://go.dev/cl/642422 mentions this issue: main.star: set GOEXPERIMENT=synctest for builders testing >=go1.24

Comment From: gopherbot

Change https://go.dev/cl/642675 mentions this issue: main.star: set GOEXPERIMENT=synctest for x/net builders testing >=go1.24

Comment From: gabyhelp

AI-generated issue overview

Original Post Summary

The original post proposes a new package, testing/synctest, for testing concurrent Go code. (@neild, issue) The package offers two primary features: simulated time control and synchronization for asynchronous operations. (@neild, issue)

The core API consists of synctest.Run, which executes a function in a separate goroutine with synthetic time, and synctest.Wait, which blocks until all goroutines within the Run-created group are idle. (@neild, issue) Synthetic time starts at "midnight UTC 2000-01-01" and only advances when all goroutines within a group are idle. (@neild, issue) The Run function waits for all goroutines in the group to exit before returning. (@neild, issue) A goroutine is considered idle if blocked on channel operations, mutexes, time.Sleep, selects without cases, or the Wait function itself. Goroutines blocked on I/O are specifically not idle. (@neild, issue)

The post includes an example of testing a concurrent cache, demonstrating how synctest eliminates flakiness and slowness compared to traditional time.Sleep based tests. (@neild, issue) The post also mentions experience with golang.org/x/net/http2 tests as motivation and acknowledges limitations related to I/O operations, suggesting in-memory replacements like net.Pipe for network connections. (@neild, issue) A later edit suggests a synctest.Sleep function that combines time.Sleep and Wait for convenience. (@neild, issue)

Discussion Themes

API Design and Placement: Discussion arose regarding the API's simplicity and potential misuse outside of testing. (@aclements, issue comment; @apparentlymart, issue comment) Suggestions included providing standalone synchronization primitives, requiring a testing.TB argument for Run, or limiting the package to the standard library. One commenter supported the inclusion of virtual time to explicitly prevent production use. (@rsc, issue comment) There were also discussions about making the API a method on testing.T for better integration with existing testing infrastructure. (@narqo, issue comment; @neild, issue comment) Some later comments suggested alternative APIs like synctest.Enable or synctest.Start for easier use in testing frameworks, but @neild expressed reservations about these approaches. (@adamluzsi, issue comment; @neild, issue comment)
Synchronization and Idle Goroutines: Commenters explored edge cases involving idle goroutine detection, nested groups, interactions with external goroutines, and the behavior of multiple simultaneous Wait calls. (@aclements, issue comment; @neild, issue comment) The implications of time freezing during non-idle periods and the scope of the fake time implementation were also discussed. The author clarified that nested groups are not supported and that Run should panic if called within a group. (@neild, issue comment) Later, further complications were discovered relating to goroutines blocked on global mutexes like those in crypto/rand and reflect. (@neild, issue comment)
Bubbling Behavior and Interaction with Standard Library: Debate focused on whether or not to bubble channels and the handling of timeouts. Concerns were raised about the complexity of modifying time-related functions and the potential for interference with background goroutines. (@gh2o, issue comment) Suggestions included using a third-party mock time implementation and implementing timeouts for Wait. Later, discussion centered on the behavior of channels crossing bubble boundaries, and whether or not to panic in such cases, particularly with respect to timer channels versus all channels. (@cherrymui, issue comment; @rsc, issue comment) A concern about lazy initialization of global channels within a bubble was raised. (@prattmic, issue comment) The interaction with the testing.T lifecycle, specifically t.Cleanup, presented a challenge, as cleanup functions running outside the bubble could panic when interacting with bubbled channels. (@neild, issue comment)
Real World Examples and Testing: @neild provided detailed reports and code examples of using a prototype implementation of synctest with the net/http package and the etcd project. (@neild, issue comment; @neild, issue comment) These experiments highlighted both successes and challenges in applying the proposed API, including issues with leaked goroutines, the need for synctest.Wait after time.Sleep, and complexities in handling network connections within tests. A subsequent attempt to apply synctest to a project using the clockwork fake time library showed potential for simplifying tests and the system under test. (@neild, issue comment) Another commenter shared a successful application of synctest with network mocking using a third-party library. (@paskozdilar, issue comment)

Next Steps

The testing/synctest package will be included in Go 1.24 as an experimental package behind the GOEXPERIMENT=synctest build tag. (issue comment) The proposal remains open for feedback and further iteration based on real-world usage during the experimental phase. Several known issues are documented, including the indefinite advancement of time after the main test goroutine exits, compatibility issues with older Go versions, interactions with t.Cleanup and B.Loop, and unhelpful stack traces in deadlock situations. These issues, along with community feedback, will inform the future direction of the package, potentially leading to its promotion, further amendments, or removal.

_{(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)}

Comment From: gopherbot

Change https://go.dev/cl/645675 mentions this issue: testing/synctest: add an example of testing networked code

Comment From: gopherbot

Change https://go.dev/cl/645719 mentions this issue: runtime: print stack traces for bubbled goroutines on synctest deadlock

Comment From: take0fit

@neild Thank you for this great proposal! I was really excited while reading through it, and while going over the example code, I happened to notice a very minor issue that’s unrelated to the main content of the proposal.

In the TestCacheEntryExpires function, the callback function is defined to return an int, but it uses fmt.Sprintf to return a string. This causes a small type mismatch.

c := NewCache(2 * time.Second, func(key string) int {
    count++
    return fmt.Sprintf("%v:%v", key, count)
})

Suggested correction:

c := NewCache(2 * time.Second, func(key string) string {
    count++
    return fmt.Sprintf("%v:%v", key, count)
})

and

func TestCacheEntryExpires(t *testing.T) {
        synctest.Run(func() {
                count := 0
                        c := NewCache(2 * time.Second, func(key string) int {
                        count++
                        return fmt.Sprintf("%v:%v", key, count)
                })

Suggested correction:

func TestCacheEntryExpires(t *testing.T) {
        synctest.Run(func() {
                count := 0
                        c := NewCache(2 * time.Second, func(key string) string {
                        count++
                        return fmt.Sprintf("%v:%v", key, count)
                })

This is just a minor detail and doesn’t affect the overall proposal, but I thought I’d point it out. Thanks again for the proposal—I’m really looking forward to seeing how this develops!

Comment From: huww98

Great feature! Really exciting! I've tried this on my project, and it works very well. And I've also learned a lot from reading your discussions.

You're right that sleeping past the deadline of a timer is sufficient. The synctest.Wait function isn't strictly necessary at all; you could use time.Sleep(1) to skip ahead a nanosecond and ensure all currently running goroutines have parked.

I found a case where synctest.Wait() maybe necessary:

var a int

func TestPrintA(t *testing.T) {
    synctest.Run(func() {
        go func() { println(a) }()
        synctest.Wait()
        a = 2
    })
}

Running this case with go test -race will pass. However, if I replace synctest.Wait() with time.Sleep(1), I will get race detected during execution of test. I think this should be the expected behavior. The one that uses sleep should behave like it were executed out of bubble. But we can still use synctest.Wait to access internals without triggering data race detector. This is very convenient. But note that this can only sequence test code after the system-under-test, not before, e.g., this will not work and still reports data race:

var a int

func printA() {
    time.Sleep(1 * time.Second)
    println(a)
}

func TestPrintA(t *testing.T) {
    synctest.Run(func() {
        go printA()
        a = 2
        synctest.Wait()
    })
}

But I can just move a = 2 before creating the goroutine.

However, while race detector is aware of synctest.Wait(), it seems not aware of that synctest.Run() also waits for all goroutines to finish. e.g. this will report an unexpected data race:

var a int

func TestUpdate(t *testing.T) {
    synctest.Run(func() {
        println(a)
    })
    synctest.Run(func() {
        a = 2
    })
}

==================
WARNING: DATA RACE
Write at 0x0001029e08d0 by goroutine 9:
  huww98.cn/playground/a.TestUpdate.func2()
      /Users/huww98/source/playground/a/synctest_test.go:15 +0x28

Previous read at 0x0001029e08d0 by goroutine 8:
  huww98.cn/playground/a.TestUpdate.func1()
      /Users/huww98/source/playground/a/synctest_test.go:12 +0x28

Goroutine 9 (running) created at:
  internal/synctest.Run()
      /Users/huww98/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24rc3.darwin-arm64/src/runtime/synctest.go:178 +0x104
  testing.tRunner()
      /Users/huww98/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24rc3.darwin-arm64/src/testing/testing.go:1792 +0x180
  testing.(*T).Run.gowrap1()
      /Users/huww98/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24rc3.darwin-arm64/src/testing/testing.go:1851 +0x40

Goroutine 8 (finished) created at:
  internal/synctest.Run()
      /Users/huww98/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24rc3.darwin-arm64/src/runtime/synctest.go:178 +0x104
  testing.tRunner()
      /Users/huww98/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24rc3.darwin-arm64/src/testing/testing.go:1792 +0x180
  testing.(*T).Run.gowrap1()
      /Users/huww98/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24rc3.darwin-arm64/src/testing/testing.go:1851 +0x40
==================
--- FAIL: TestUpdate (0.00s)
    testing.go:1490: race detected during execution of test

This can also cause failure when invoke the test with go test -race -count 2:

var a int

func TestUpdate(t *testing.T) {
    synctest.Run(func() {
        a = 2
        println(a)
    })
}

Comment From: neild

However, while race detector is aware of synctest.Wait(), it seems not aware of that synctest.Run() also waits for all goroutines to finish. e.g. this will report an unexpected data race:

Thanks, that's an interesting point I hadn't thought of. I think it should be straightforward to tell the race detector that there's a happens-before relationship between a bubble's goroutines exiting and Run returning.

Comment From: gopherbot

Change https://go.dev/cl/647755 mentions this issue: runtime: establish happens-before between goroutine and bubble exit

Comment From: pierrre

The release note https://go.dev/doc/go1.24#testing-synctest mentions the package documentation https://pkg.go.dev/testing/synctest But the link is broken :(

Comment From: rittneje

@pierrre That's a known issue - see https://github.com/golang/go/issues/71488.

Comment From: gopherbot

Change https://go.dev/cl/651555 mentions this issue: runtime: print stack traces for bubbled goroutines on synctest deadlock

Comment From: marwan-at-work

👋🏼 Hi there, not sure if this was mentioned but time.NewTicker that doesn't call Stop does not end up panicking. The test instead just hangs forever. Here's a repro:

func TestTicker(t *testing.T) {
    synctest.Run(func() {
        time.NewTicker(time.Second)
    })
}

The above test hangs forever. If I instead call Stop, then it will pass correctly.

I expect the above should panic similar to when a goroutine is left hanging when func ends like so:

func TestDanglingGoroutine(t *testing.T) {
    synctest.Run(func() {
        go func() {
            select {}
        }()
    })
}

More general feedback:

Overall, I really like how my concurrent tests are looking code-wise and the fact that tests run fast due to the fake clock is even better.
Onboarding was a little difficult because even though the synctest.Run(func () { ... }) API is simple, there are a lot of implicit rules such as: all goroutines must exit by the end of the of func. That said, the rule itself was great because it forced me to write more correct code that handled cleaning up goroutines and also gave me confidence that I'm not leaking any goroutines either.

Comment From: neild

Thanks for the report!

time.NewTicker that doesn't call Stop does not end up panicking. The test instead just hangs forever.

This is on my known issues list. (https://github.com/golang/go/issues/67434#issuecomment-2565780150)

I think that the fix here is going to be to stop advancing time once the goroutine started by Run exits. Any leftover Tickers or Timers will not fire. If there are any goroutines blocked reading from a ticker, Run will panic with a deadlock error.

there are a lot of implicit rules such as: all goroutines must exit by the end of the of func. That said, the rule itself was great because it forced me to write more correct code that handled cleaning up goroutines and also gave me confidence that I'm not leaking any goroutines either.

Thanks especially for the experience report on this behavior. It's a bit unusual to enforce goroutine cleanup (there's a lot of code out there that leaks goroutines, especially in tests), so I'm glad to hear that it seems to have worked out as intended for you.

Comment From: dsnet

Thank you @neild for "synctest". I was recently working on a task queue execution engine, and this would have been nearly impossible to thoroughly test without either lots of hooks, tests being slow, and/or being flaky. With "synctest", the tasks execute very quickly and I can easily craft the exact conditions for potential races and verifying that our locking patterns properly handle them. I am reasonably confident in the correctness of my code because of this package.

Some thoughts: * I found the default starting period of 2000-01-01 unsuitable for our use-cases. We have a database that must be initialized outside of the time bubble, and so has rows in it from the current wall clock. In order to maintain monotonic progression of time transitioning from wall clock to bubble clock, I was always doing this: go now := time.Now() synctest.Run(func() { time.Sleep(now.UTC().Sub(time.Now().UTC())) // align bubble to real time ... }) I wonder if the starting time should be an explicit argument to synctest.Run. I can imagine scenarios where you really do want a completely isolated bubble of time with no correlation to the wall clock. Other times (like in my case), you need to connect the wall clock and bubble clock in some way.

I believe that calling time.Now inside the bubble should not record a monotonic reading. Consider the above example, if you instead do: go now := time.Now() synctest.Run(func() { time.Sleep(now.Sub(time.Now())) // overflows in arithmetic ... }) this will not work correctly since calling time.Now in both cases will record a monotonic reading. Consequently, the time.Time.Sub method will try to use the monotonic readings to perform arithmetic, but be completely off since the monotonic readings are from a completely different clock context. I argue that time.Now within a bubble should not have a motonic reading since time can only ever progress forwards in time, so there is no need for a monotonic reading. Thus arithmetic between time.Time that both originated in the bubble work correctly, and also arithmetic between time.Time from inside and outside the bubble.
I sometimes found myself questioning whether "synctest" was working correctly (turned out to actual bugs in my code), but it would be interesting if there could be a mode where "synctest" runs with a real wall clock, instead of a synthetic clock. Obviously tests run slower, but provide the guarantee that "synctest" isn't masquerading some real problem.

Comment From: neild

I found the default starting period of 2000-01-01 unsuitable for our use-cases.

In the case where you want to start the synthetic time at some other point, I think time.Sleep(when.Sub(time.Now()) is not much worse than any other option. Most tests don't care about the specific time, so I think it's okay if we put a tiny bit more work on the ones that do rather than requiring everyone to specify a starting time.

2000-01-01 has a few advantages:

It's very obviously not the current time, so if you leave the bubble clock alone it's easy to distinguish real and fake times.
It's consistent. We could start the bubble time at the current time, but then each test run would be different.
It's in the past, so if you want to test behavior at some specific point in time you can time.Sleep until that time. For example, if you want to test your code's behavior at a leap year transition, you can pick any leap year after 2000 without worrying that your test will stop working at some point in the future.

I believe that calling time.Now inside the bubble should not record a monotonic reading.

Oops, that's a bug. I thought we weren't recording a monotonic reading. I'll fix that.

but it would be interesting if there could be a mode where "synctest" runs with a real wall clock, instead of a synthetic clock

That's an interesting thought. I think it would need to be a GODEBUG setting or similar; we don't want to make it easy to use synctest.Wait in non-test code. (Perhaps there's a good use case for a Wait-style operation in production code, but if so we should design that as a separate feature outside of synctest.)

Would the bubble clock still start at a synthetic base time? I can see some tricky concerns either way.

Comment From: gopherbot

Change https://go.dev/cl/657415 mentions this issue: runtime, time: don't use monotonic clock inside synctest bubbles

Comment From: mvdan

What is the general guidance for integration tests of HTTP servers which currently use https://pkg.go.dev/net/http/httptest? https://github.com/golang/go/issues/71345 correctly points out that the docs say:

A goroutine executing a system call or waiting for an external event such as a network operation is not durably blocked. For example, a goroutine blocked reading from an network connection is not durably blocked even if no data is currently available on the connection, because it may be unblocked by data written from outside the bubble or may be in the process of receiving data from a kernel network buffer.

While this is true in the general sense, in practice many tests are currently written with httptest in such a way that it's only the same Go test and goroutines which interact via HTTP clients. I have some of those tests which currently use a fake clock to mimic e.g. cookie or access token expiry, and it's not clear to me how I should attempt to use synctest in these tests.

Comment From: comerc

testing/synctest was not included in 1.24 ((

Comment From: paskozdilar

It is included with GOEXPERIMENT=synctest.

Comment From: seankhliao

I think for httptest we'd need #14200 net/http seems to already have an internal implementation at https://go.googlesource.com/go/+/bceade5ef8ab6d28ad363cd7ca60a9be89990a00/src/net/http/netconn_test.go#5

Comment From: neild

You can use synctest and httptest together, but it's a bit tricky at the moment. You need to provide a fake network, which you can inject by setting httptest.Server.Listener and http.Transport.DialContext to appropriate fake implementations.

A very barebones example: https://go.dev/play/p/HJQKRKuEYQf

The net/http package contains a more full-featured approach in its internal tests: https://go.googlesource.com/go/+/refs/tags/go1.24.0/src/net/http/clientserver_test.go#190

If synctest is promoted out of experimental status, I think we'll want to consider whether we should provide an easy-to-use fake network for httptest (#14200).

Comment From: SpikesDivZero

I've been using synctest and am pretty happy, so you all have my gratitude for this experiment.

One rough edge I've felt concerns its potential interaction with testing.T.Context, also introduced in go1.24. Should users regard it as safe to use t.Context().Done() inside a synctest bubble?

This is reasonably safe as of the current version but may not be safe in future versions.

If we want the T Context to be safe to use inside a bubble, then there's a case to be made for an added test. I took a look around and didn't see a test covering this.

This may also add a point in favor of the earlier T.RunIsolated (or similar) discussion above: preventing a user from accessing the Done channel before entering the synctest bubble.

// Trivialized example of how a user may use testing.T.Context.Done

func codeUnderTest(ctx context.Context, timeout time.Duration) {
    select {
    case <-ctx.Done():
    case <-time.After(timeout):
    }
}

func TestTContextDone(t *testing.T) {
    synctest.Run(func() {
        const timeout = time.Hour // as if from a table test

        t0 := time.Now()
        codeUnderTest(t.Context(), timeout)
        if d := time.Since(t0); d != timeout {
            t.Errorf("duration is %v, wanted %v", d, timeout)
        }
    })
}

Comment From: neild

I've been thinking on the interactions between the synctest and testing packages. T.Context is a problem, as is T.Cleanup.

I've got a few ideas so far that seem workable, with various tradeoffs.

The simplest I can think of is to add synctest.Start (and probably remove synctest.Run):

package synctest

// Start places the current goroutine in a new synctest bubble.
// The goroutine must not be in a bubble already.
func Start()

We could then make the testing package aware of bubbles: T.Context would return a different (bubbled) context, T.Cleanup would register functions that run inside the bubble, and the testing package would wait for bubbled goroutines to exit before finishing a test.

This has the advantage of simplicity, but the disadvantage of losing the clean delineation of the bubble lifetime that synctest.Run provides.

Another option would be to provide a function that starts a bubble with a testing.T, B, or F that is bubble-aware. This could be in the synctest package:

package synctest

// Test runs f in a bubble in a new goroutine.
func Test[T testing.TB](t T, f func(T))

Or in the testing package:

package testing

// RunIsolated runs f as a subtest in a new bubble.
func (t *T) RunIsolated(t *testing.T, name string, f func(*T))

Comment From: thatnealpatel

TL;DR

I am exceedingly happy with synctest.

My use-case is a single-binary that implements an execution platform. The binary is instrumented using concurrency primitives to handle tight-looped, time-based events alongside event-based triggers. In addition, it uses REST to communicate with a broker and SQLite for state management (via sqlc code generation).

My main (self-inflicted) pain points were:

New solution for database durability between epochs when using synctest
Re-instrumenting HTTP integrations to keep as much production code in the system under test (SUT) as possible

The only downside was giving up remote logging (via cloud.google.com/go/logging) in my integration tests. However, given the sheer utility of synctest, this was an extremely acceptable trade-off. Ironically, a majority of the value to me for having these remote logs was to aid in debugging test flakes due to injected clocks.

Despite those minor points, in <6 hours, I was able to:

Fully convert to use synctest idioms
Produce a net negative code diff
Reduce _practical_ testing execution time from 100s -> 8s*
Eliminate all non-deterministic behavior and test flakes
Increase expressiveness of integration test cases through generalization around durable test time
Catch subtle concurrency bugs (e.g. leaked testing goroutines)

It would be an understatement to say that I would be sad if synctest did not make it into go1.25.

as further noted below, I was virtually forced to use -race to get my tests to mostly* pass with injected clocks. With synctest, this is no longer true; therefore, this is a material performance gain to me. The "real diff" is approximately 60% faster integration test execution (18s->8s).

**run on M3 Macbook Air

Details

General Architecture

The following is an illustrative subset of the use-case I have; most notably it includes: * frequent network I/O via HTTP * frequent disk I/O via database interactions in SQLite * highly concurrent execution using mixed time-based / event-based channels

Prior to synctest, the below would use https://github.com/jmhodges/clock.Clock passed via injected Params for each of the components.

package broker

// Broker dispatchers HTTP to the broker via REST.
func Broker(ctx context.Context, ...) (chan<- Request, chan<- Refresh) {

  reqs := make(chan Request)
  refresh := make(chan Refresh)
  db, cleanup, _ := dataplane.OpenRWConnection(dataplane.Broker, ...)
  clientCache := make(map[dataplane.ExecutorT]*http.Client) // oauth2 details abstracted away here
  logclient, flush, _ := monitoring.NewCloudLogger(...)

  go func() {
    for {
      select {
      case <-ctx.Done():
        flush()
        cleanup()
        return
      case <-time.After(time.Minute): // lightweight thing
      case r := <-reqs:               // 5 QPS, network I/O thing, uses `clientCache`
      case r := <-refresh:            // infrequent, network I/O thing, mutates `clientCache`
      }
    }
  }()
  return reqs, refresh
}

package cache

// CacheLayer provides synchronized access to post-processed broker data.
func CacheLayer(ctx context.Context, dispatch chan<- broker.Request, ...) chan<- Request {

  reqs := make(chan Request)
  logclient, flush, _ := monitoring.NewCloudLogger(...)
  cache := make(map[any]any) // abstracting details away

  go func() {
    for {
      select {
      case <-ctx.Done():
        flush()
        return
      case <-time.After(time.Second): // heavy calculation (1000s of PDEs), network I/O via `dispatch`
      case r := <-reqs:               // 50 QPS, uses post-processed cache
      }
    }
  }()
  return reqs
}

package executor 

// Executor performs core business-logic using the data via the broker.
func Executor(ctx context.Context, dispatch chan<- broker.Request, oracle chan<- cache.Request, ...) {

  db, cleanup, _ := dataplane.OpenRWConnection(dataplane.Executor, ...)
  logclient, flush, _ := monitoring.NewCloudLogger(...)
  trig := make(chan struct{}, 1)

  go func() {
    for {
      select {
      case <-ctx.Done():
        flush()
        cleanup()
        return
      case <-time.After(time.Minute):   // uses `db`, `dispatch`, `oracle` [cheap, non-critical]
      case <-time.After(4*time.Second): // uses `db`, `dispatch`, `oracle` [expensive, critical]
      case <-trig:                      // uses `db`, `dispatch`, `oracle` [cheap, critical]
      }
    }
  }()
}

Testing Strategy

Previous to synctest, I had to use injected clocks (clock.FakeClock); I used https://github.com/jmhodges/clock which was perfectly fine for a simple API.

My testing platform implements 2 fake components: 1. simulated broker platform (to be used with httptest) that allows for dynamic injection of server-side logic to test my production client-side code in the SUT 2. simulated, time-dependent data platform (used in the simulated broker platform)

These components are implemented in a similar manner to the above; there are a mix of time-based and event-based triggers with similar costs and criticality. I chose this strategy due to the high coverage of production code in a SUT. Additionally, it covers more components than those listed here for free.

Some details are omitted below; however, it is a faithful representation of the use-case.

package executor 

func setupExecutorIntegrationTest(ctx context.Context) (chan<- sim.Request, func(), *sql.DB) { // with sqlc, you can just return `*<db-pkg>.Queries` in place of `*sql.DB`
  brokersim := sim.BrokerPlatform(ctx, ...)
  httpclient, netclean , _ := newTestHttpServer(ctx, brokersim) // large-ish implementation of switch { case ... } on real API paths

  dispatch, refresh := broker.Broker(ctx, ...) // via abstraction, httpclient.Client() is passed here
  oracle := cache.CacheLayer(ctx, dispath, ...)
  Executor(ctx, dispatch, oracle, ...) // in-memory db via abstraction

  db, dbclean, _ := dataplane.OpenReadOnlyConnection(dataplane.Executor, ...) // in-memory db

  return brokersim, func() { netclean(); dbclean() }, db
}

Downsides

There are numerous, well-known downsides to this sort of testing instrumentation; in my case, the most relevant: 1. speed of the integration test 2. ease of parallelization of the integration tests 3. flakiness of the integrations tests

For example, I virtually was forced to run all my tests with -race to eliminate most (but not all) flakiness associated with spinning the clock and time-dependent calculations:

func TestIntegration_SomeCase(...) {
  t.Parallel()
  ctx, cancel := context.WithTimeout(context.Background(), defaultTimeout)

  brokersim, clk, cleanup, db := setupExecutorIntegrationTest(ctx)
  t.Cleanup(cleanup)
  clk.Set(...)

  simdata(brokersim, ...) // do some set up
  for range 120 { // invariant is average flaky
    clk.Add(time.Second)
  }
  assert(t, db, ...) // invariant for event-based / time-based behavior

  simdata(brokersim, ...) // do some set up
  for range 50 { // invariant is less flaky
    clk.Add(time.Second)
  }
  assert(t, db, ...) // invariant for event-based / time-based behavior

  simdata(brokersim, ...) // do some set up
  for range 250 { // invariant is more flaky
    clk.Add(time.Second)
  }
  assert(t, db, ...) // invariant for event-based / time-based behavior
}

`synctest` to the rescue

Before synctest, I had no concrete way to write more expressive, generalized integration test cases; however, the determinism allowed me to do so.

The learning curve was not as steep as I expected; however, I spent most of the time tracking down leaked goroutines.

key challenge: giving up `cloud.google.com/go/logging` in SUT

Due to how fast synctest is, the asynchronous logging buffers fill up too quickly for the library to flush; though I did not put much effort into debugging that (<60 seconds), it appears that the library makes some call into the bubble causing a panic. This behavior does not happen when using -race which "naturally" slows down execution.

As noted above, I did not spend much effort here because the primary use-case for having remote logging in my integration tests was to debug the very problems synctest solved for me.

My solution was simply to make my monitoring package aware of test.v; instead of bringing up the actual CloudLogging client in my thin wrapper, it simply sets all of the logging interfaces to log.Default() then io.Discards them in tests.

key challenge: `http.Client -> http.Transport`

The snippet generously provided by @neild at https://go.dev/play/p/HJQKRKuEYQf was virtually a drop-in replacement. I threw it into a ./internal/fakehttp package (for the time being).

I only needed to change the semantics of using *http.Client everywhere (e.g. Broker) to using http.RoundTripper (via *http.Transport). This was material to me because it allowed me to keep production code in the SUT which had some oauth2 components used for token exchange.

Due to my general lack of cleanup knowledge, the order in which these resources needed to get cleaned up did take a few minutes to grok; however, once I found the correct permutation, it all made sense.

key challenge: `SQLite`

Prior, all the databases used in the SUT were in-memory (e.g. file:mydb?mode=memory).

Due to my specific use-case, there are many events that fire (as seen in the general architecture section) in a given day; in addition, there are many time+event based triggers that happen 6+ hours a part that are critical invariants to test.

Obviously, synctest majorly speeds up the testing vs. real wall-time; however, with a "congested" SUT, it still takes materially long to finish. With injected clocks, it was easy to just make a clk.Set(...) call.

In order to preserve this behavior of "run something for 17 seconds now, then in 6 hours check this invariant that affects that thing," my solution was to serialize the states of the SUT (both the SQLite portion and the sim.BrokerPlatform state) as .testdb and .json flat-files, respectively. For non-SQLite implementations, I imagine a similar approach would be taken to avoid making uninteresting network calls.

In effect, every test case uses ephemeral ./testdata/x.testdb and ./testdata/x.json that it is responsible for creating and cleaning up. This allowed me to use the following paradigm to greatly speed up invariant testing without loss of durability:

type epoch struct {
  start       time.Time
  timeout     time.Duration
  sleep       time.Duration
  sim:        []sim.Request
  invariants: map[invT]any // abstracting details away here
}

func runEpochs(t *testing.T, configs []config, epochs []epoch) {
  t.Helper()
  for _, e := range epochs {
    time.Sleep(e.start.Sub(time.Now())

    ctx, cancel := context.WithTimeout(context.Background(), e.timeout)
    defer cancel()
    timeout := false
    context.AfterFunc(ctx, func() { timeout = true })

    brokersim, cleanup, db := setupExecutorIntegrationTest(ctx, t.Name(), configs) // serializes to `.json`, `.testdb`
    defer cleanup()

    simdata(brokersim, e.sim)
    time.Sleep(e.sleep)
    synctest.Wait()

    if timeout {
      t.Fatalf("%s timed out", t.Name())
    }
    assert(t, e.invariants)

    cleanup()
    cancel()
    synctest.Wait()
  }
}

func TestIntegration_SomeCase(t *testing.T) { 
  t.Parallel()
  defer func() { os.Remove(...); os.Remove(...) }()
  configs := []config{{...}}
  epochs := []epoch{{...}, {...}, {...}}
  runEpochs(t, configs, epochs) // wow!
}

While it is true that generalizing some notion of epoch for test cases could have happened before synctest, I had virtually no confidence in doing so because non-durably blocked execution would have yielded a bloated struct that just made the flake-reducing behavior hidden and less readable.

Comment From: gh2o

@neild

I've been thinking on the interactions between the synctest and testing packages. T.Context is a problem, as is T.Cleanup.

To avoid introducing an additional level of nesting or function indirection, what do you think about adding a T.Isolated method that functions similarly to T.Parallel?

// Isolated signals that this test is to be run inside an isolated "bubble".
// The test function will first be run as normal. When this function is called
// for the first time, the test function will exit immediately (via runtime.Goexit()),
// and then the test function will be called again from a bubbled goroutine.
// The second call to this function will be a no-op.
func (t *T) Isolated()

We can then write tests in the following manner:

func TestSleep(t *testing.T) {
  t.Isolated()

  a := time.Now()
  time.Sleep(24 * time.Hour)
  b := time.Now()

  if b.Sub(a) != 24 * time.Hour {
    t.Error("sleep not exact")
  }
}

T.Context and T.Cleanup can now also interact correctly with the bubble, now that the T type itself is bubble-aware.

Comment From: neild

T.Isolated is essentially the same as synctest.Start with a different spelling. Both replace the current Run operation of running a function in a bubble with an operation that moves the current goroutine into a bubble.

T.Isolated has the advantage of being scoped to the testing package.

You still need to call synctest.Wait in many circumstances, so the bubble API is now spread across two packages. We could address this by dropping the synctest package entirely and moving Wait into the testing package. An advantage of the current layout is that all the documentation for bubbles naturally exists in testing/synctest. We would lose that, but perhaps the loss is balanced by better discoverability in the testing package.

If we did keep the testing/synctest package, the name T.Isolated doesn't suggest a connection to that package. Perhaps we could rename either T.Isolated or testing/synctest to make the relation clearer. Or if we move synctest.Wait into testing then this problem does not apply.

Comment From: apparentlymart

I personally like t.Isolated because it suggests that the scope of the bubble is connected with the test that called it, even if the other functions for manipulating an already-active bubble live elsewhere.

But if significant API changes are still on the table, a further idea that the above discussion inspired was to have t.Isolated return an object of a type that could itself still be defined in the synctest packages -- let's say synctest.Bubble for the sake of example -- and have the bubble manipulations be methods of that type rather than package-level functions.

This would mean that the bubble manipulations can be done only by code that has access to that "bubble" object. I could imagine arguing that both as an advantage (it means you don't need to worry about spooky action at a distance with random code elsewhere fiddling with the bubble in ways that are not obvious from the test code) and as a disadvantage (you now have another value in addition to t that needs to be passed around to all of your test helper functions).

Comment From: neild

I've created a sub-issue with a proposal for a revised version of the API that replaces Run with Start: #73062

(Creating as a separate proposal to keep the discussion of this specific change organized.)

Comment From: MarcoPolo

I've been using synctest lately, and it's pretty nice. I have one small request. Could we add to the comment on Wait to specifically call out that goroutines blocked on Mutexes are not durably blocked? I know that an astute reader would notice that Mutexes are not in list of operations that would durably block and thus infer that a call to Mutex.Lock would not durably block a goroutine in a bubble. However, a small change to the comment text would make this clearer.

Consider also that there were a lot of discussions about whether a goroutine blocked on a Mutex should be durably blocked, and the initial proposal text mentioned mutex operations as supported. To quote Russ:

Someone can implement what is semantically a mutex using (a) sync.Mutex, (b) channels, or (c) sync.Cond. It seems weird that one of these methods has different behavior than the other two. It seems like all three should be acceptable, not just sync.Mutex.

I agree it's weird and especially non-obvious. I understand the reasons for not support sync.Mutex right now. I'm not suggesting a change to support them. I am suggesting an addition to the comment on Wait to explicitly call out mutex operations as a non-durably blocking.

Comment From: stroiman

First of all, I haven't used synctest, so my comment is purely based on reading the docs. But I'd like to see more control over the wall clock.

For testing line-of-business applications, setting a specific wall clock is often extremely useful in tests. E.g, setting the time to 23:59:59 the last day of the month, vs. 00:00:01 the following day, to check if a rule was applied. This becomes even more critical for multi-time zone systems (i.e., the ability to control UTC wall clock for testing a multi-time zone system is critical)

According to the docs.

Goroutines in the bubble use a synthetic time implementation. The initial time is midnight UTC 2000-01-01.

This should make it technically possible to forward time to the desired simulated wall clock.

time.Sleep(MustParseISOTime("2025-04-01T12:00-01:00").Sub(time.Now()))

But it's not as explicit as something like SetTime(MustParseISOTime("2025-04-01T12:00-01:00")).

A suggestion could be passing a value to the bubble to interact with time itself?

type T interface {
    SetTime(time.Time)
}
Run(func(T))

Comment From: neild

This should make it technically possible to forward time to the desired simulated wall clock.

This is exactly the right way to set a precise simulated time: time.Sleep until the desired point in time.

We could add a SetTime function, but this would be additional API surface with no additional functionality. We'd have to define what happens when time goes backwards. The interactions with other goroutines also seem ambiguous: If one goroutine calls Sleep(1*Time.Second) and another calls SetTime(time.Now().Add(2*time.Second)), does the first goroutine wake after one simulated second or two?

Using time.Sleep is (in my opinion) clearer: Time can't go backward, and intervening events occur before the sleeping goroutine wakes.

Comment From: gopherbot

Change https://go.dev/cl/662155 mentions this issue: runtime, testing/synctest: stop advancing time when main goroutine exits

Comment From: aclements

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — aclements for the proposal review group

Comment From: aclements

Placed on hold. — aclements for the proposal review group

Comment From: jfancher

I've found synctest to be a perfect fit for a utility I was having a hard time testing. It's great! Like others have mentioned, I'd also appreciate more knobs in how to control the clock in tests; however, those situations often involve business logic, not just synchronization, so it may be hard to define something simple that would cover all needs.

That aside, I wanted to re-raise the interaction between the clock and the race detector. The following code reliably passes, but fails under -race:

func TestTimeRace(t *testing.T) {
    synctest.Run(func() {
        var val int
        go func() {
            val = 1
            time.Sleep(1)
            val = 2
        }()
        synctest.Wait() // ensure the goroutine has started
        time.Sleep(2)
        if val != 2 {
            t.Fatalf("expected 2, got %v", val)
        }
    })
}

This was mentioned as expected by https://github.com/golang/go/issues/67434#issuecomment-2642915522, but is that right? My reading of the docs on time under synctest implies that val = 2 definitely happens-before the read, since time freezes in a bubble until all goroutines are durably blocked. It's simple enough to work around (add a second synctest.Wait() after the time.Sleep(2)), but if my understanding is correct it'd be nice for the race detector to know about it. Otherwise, it sends mixed messages about how deterministic the flow of time is in a bubble.

Comment From: huww98

I think it is expected because I think synctest should still run race detector as if the code is in normal (non-test) go routine. i.e.: this test also fails:

func TestTimeRace(t *testing.T) {
    var val int
    go func() {
        val = 1
        time.Sleep(1)
        val = 2
    }()
    time.Sleep(2)
    if val != 2 {
        t.Fatalf("expected 2, got %v", val)
    }
}

In my mind, synctest just simulates an infinitely fast machine, which can execute infinite instructions even in a single nanoseconds. So we should not expect race detector to change its output based on how fast the machine runs. After all, race detector is designed to detect the race that not really happens.

Think this another way, here is a truly racy case, how can synctest differentiate them?

type codeUnderTest struct {
    val int
}

func (c *codeUnderTest) Go() {
    go func() {
        c.val = 1
        time.Sleep(1)
        c.val = 2
    }()
}

func (c *codeUnderTest) Get() int {
    return c.val
}

func TestTimeRace(t *testing.T) {
    synctest.Run(func() {
        var c codeUnderTest
        c.Go()
        synctest.Wait() // ensure the goroutine has started
        time.Sleep(2)
        if c.Get() != 2 {
            t.Fatalf("expected 2")
        }
    })
}

Comment From: jfancher

Hmm... to be clear, you're saying your second example is "truly racy" because application code would not be able to use it without races, not that execution would happen any differently when run under synctest, right?

I guess I wouldn't expect the race detector to detect anything when running under synctest for that, because a synctest bubble changes the flow of time in a way that eliminates the race. As you say, it simulates a machine that not only can, but is guaranteed to, run infinitely many instructions between time t and t+1 assuming no blocks. Or put another way, doesn't time.Sleep(>0) in a bubble necessarily imply something very close to synctest.Wait()? Time would not have advanced to t if all other goroutines were not durably blocked at t-1 (it's only not an exact equivalence because multiple goroutines could have unblocked at t, that could still be an actual race).

When I look at the solution, adding a synctest.Wait() after the time.Sleep(2), I just find it confusing. I read it and think "But what could it possibly need to wait for at this point?" After I've carefully constructed the timeline of concurrent events and convinced myself they're happening exactly in the order I expect, it's jarring to add a seemingly-unnecessary wait. I obviously understand it now as I'm writing the case, but I worry that in the future my at-a-glance reaction will be the same as seeing an arbitrary time.Sleep(..) in a non-synctest case. Like, "🤷 , something else is probably happening in the background, give it a bit to sort itself out". Which is the exact kind of thing we're trying to avoid by using synctest.

I'm sure I can get used to it if the design is kept as-is, but I do feel like it's a rough edge (again, unless I'm completely misunderstanding something and there is a race here somehow).

Comment From: huww98

application code would not be able to use it without races

Yes, exactly.

Maybe we can make race detector aware of the guarantee of bubble. But we will not discover bugs like this example then. Just to save a Wait()? I think it is not worth.

We don't want Sleep(..) because we don't know how long to sleep. If too long, the test is slow. If too short, the test is flaky. Wait() does not have this issue.

Comment From: mkadirtan

// SetDeadline sets the deadline associated with the listener.
// A zero time value disables the deadline.
func (l *UnixListener) SetDeadline(t time.Time) error {
    if !l.ok() {
        return syscall.EINVAL
    }
    return l.fd.SetDeadline(t)
}

Great feature, but is there anyway we can make this work with system calls, too? Following code waits till the actual timeout even with synctest:

listener.(*net.UnixListener).SetDeadline(time.Now().Add(10 * time.Second))
listener.Accept()

Comment From: neild

The race detector complains about the code in https://github.com/golang/go/issues/67434#issuecomment-2817266446, because time.Sleep doesn't establish a happens-before relationship. For example, to simplify it even further:

func Test(t *testing.T) {
  synctest.Run(func() {
    ran := false
    go func() {
      ran = true
    }()
    time.Sleep(1 * time.Nanosecond)
    // synctest.Wait()
    if !ran {
      t.Errorf("goroutine did not run")
    }
  })
}

This test consistently passes, because the time.Sleep call does block until every other goroutine in the bubble has blocked. However, it fails under the race detector, because the Sleep does not establish a happens-before relationship. This is intentional, because this code is obviously racy outside of a synctest bubble.

To make this test non-racy, we can insert an explicit synchronization point by uncommenting the synctest.Wait line. This waits for other goroutines to block and establishes a happens-before relationship on them.

I think this behavior is useful, because it lets us catch real races in code under test. In this example, all the goroutines and sleeps happen within test code, but we can imagine a system which contains a similar data race in real code.

As @huww98's suggests (https://github.com/golang/go/issues/67434#issuecomment-2818402973), you can think of a synctest bubble as an infinitely fast machine which executes all actions instantaneously, and where time only passes when a goroutine explicitly sleeps. Even on this infinitely fast machine the race detector doesn't consider time.Sleep to establish a happens-before relationship, allowing it to detect races that will occur on slower (and more realistic) machines.

Comment From: neild

Great feature, but is there anyway we can make this work with system calls, too?

Unfortunately, this is pretty much impossible, because we don't have any way to detect when a system call is durably blocked.

For example, consider two goroutines communicating over a localhost socket. One writes data to the socket, and the other reads from it. If we write data and call synctest.Wait, we would expect Wait to block until the reading goroutine reads the data. However, due to internal kernal buffering, there can be a window of time after the write succeeds before the data becomes available to read. There's no way I can see to detect this.

Comment From: paskozdilar

Having an infinitely-fast virtual network (akin to the infinitely-fast processor which executes all non-blocking code in an instant) in context of synctest would be very useful.

One could write code as to be able to "plug in" a mock net implementation, but same argument applies for "time" itself, which synctest overrides.

Would similar overriding of "net" package be acceptable in any form?

Comment From: mkadirtan

Great feature, but is there anyway we can make this work with system calls, too?

Unfortunately, this is pretty much impossible, because we don't have any way to detect when a system call is durably blocked.

For example, consider two goroutines communicating over a localhost socket. One writes data to the socket, and the other reads from it. If we write data and call synctest.Wait, we would expect Wait to block until the reading goroutine reads the data. However, due to internal kernal buffering, there can be a window of time after the write succeeds before the data becomes available to read. There's no way I can see to detect this.

Thanks for the explanation 🙏

Comment From: spenczar

I have found synctest completely delightful. It has solved real problems for me, and helped me find bugs.

My system has workers in separate processes which coordinate to allocate work. When a worker joins, a rebalance procedure takes place to reallocate work. Before synctest, this was very hard to test with go test. I had to reach deep into the coordination internals to set up hook points, using channels to imitate a kind of synchronization. But then I had to check those channels with timeouts, and things were flaky.

Synctest has made my tests very reliable. I was surprised at how much simpler it has made my tests, too. Gone are the callbacks and hooks and selects to check if a signal channel has closed within a timeout; now I can just synctest.Wait() for things to settle, and it all kind of works.

Two things that gave me (very minor!) difficulty: 1. My test leaked a goroutine, and thus didn't terminate. I spent a long time barking up wrong trees on this. I should have read the documentation more closely. I suppose I came in thinking synctest is about synchronization, and not about goroutine leaks. 2. I found it tempting to sprinkle synctest.Wait() all over the place. I wish I had a way to know which of my synctest.Wait() calls were pointless. If the bubble root's goroutine is the only goroutine that has advanced between two synctest.Wait() calls, maybe I should get a stern warning of some kind.

Overall, I immediately feel like synctest is something I will want to reach for and use to improve many concurrent tests. This is great!

Comment From: dfinkel

I found it tempting to sprinkle synctest.Wait() all over the place. I wish I had a way to know which of my synctest.Wait() calls were pointless. If the bubble root's goroutine is the only goroutine that has advanced between two synctest.Wait() calls, maybe I should get a stern warning of some kind.

Perhaps Wait() should return a count of goroutines that woke/slept since the beginning of the bubble?

Operating at a different level, but my go-clocks package's fake clock implementation includes a number of methods for returning the number of times certain things happened. In particular, AdvanceClock() returns the number of goroutines that were awoken. That doesn't translate directly, but Wait() could return the number of goroutines that became durably blocked in the bubble (possibly as an event-count since the start of the bubble)

Comment From: gh2o

I'm glad that the synctest framework has been getting positive feedback :)

One thing that I was a little iffy about was the fact that the time package is now serving double duty as both a real clock and a virtual clock, and I'm worried about extra complexity when reasoning about cases where bubbled and non-bubbled code interact with one another.

To that end, I'm of the opinion that synctest should not affect the behavior of the time package at all, and any clock mocking should be moved into another package altogether. This functionality can probably even be implemented in a third-party package, since the synctest.Wait() primitive is sufficient to implement such a fake clock.

For this explicit fake clock, I'm thinking it could have an interface like this one, and it could either be explicitly passed around test code, or implicitly passed around by exposing it on testing.T.

Comment From: neild

Previous discussion on overriding the time package vs. providing a testing implementation: https://github.com/golang/go/issues/67434#issuecomment-2138381376

Comment From: spenczar

I want to amplify this comment from the fake time discussion:

We essentially put an asterisk next to every existing function in the time package that deals with the system clock saying, "don't actually use this, or at least not in code you intend to test".

My experience with synctest, as a user, was this: we found a bug in production. We suspected it was due to a logical race. Synctest made it simple to verify this. If I had to use a fake time implementation, then I would be signing up for a fairly involved refactor, adding the fake time interface to both exported and unexported functions and types. Other test code would need to change to instantiate this fake object that would get handed off, even when time issues were not under test. It just would be very intrusive - so I would end up writing a flaky test.

synctest makes it easy to test the behavior of existing code, not stuff made with time testing in mind from the start.

Comment From: pierrre

Is it Ok to call t.Fatal() or t.FailNow() from the goroutine started by synctest.Run() ? According to testing documentation, we must not call these functions from other goroutines. I can't find anything about this in the issue, but maybe I didn't read properly.

Comment From: neild

Is it Ok to call t.Fatal() or t.FailNow() from the goroutine started by synctest.Run() ?

No, because as you point out, you can only call those functions from the test goroutine.

This is another point of interaction with the testing package that #73567 would address.

Comment From: glycerine

This is pretty awesome. I'm using it to write a channel based network emulator for my RPC system implementing Raft elections so I won't have to modify the production code's timers (timeouts are an essential mechanism in Raft leader election). I have a basic proof of principle working, but I'll be able to give more feedback once it is finished.

re other time packages--the best thing here is being able to test the actual code production code without modification--so I don't like the suggestion above about having to use a different time package.

re drawback: Deadlock errors don't print useful stacks? you just have to set export GOTRACEBACK=all first; I get useful stacks then.

also: re "Time advances forever after main test goroutine exits": This doesn't seem any different from the normal need to shutdown all the goroutines for a service in the middle or at the end of a test run. I developed standard operating procedures to do this years ago. It is why I wrote https://github.com/glycerine/idem which provides idempotently closable channels (essential when there are lots of shutdown execution paths and many goroutines want to tell everyone they are about to vamoose; i.e. almost always) and supervision trees. Hence I don't see this concern as much of a demerit.

edit: I did have a question too. Is there/should there be a warning if I "burst my bubble" by accident? I created a file on disk, and my sim still runs fast through all the sleeps... is the consequence just that Wait() hangs forever? After some thought, I can see how it would be nice to get input from "outside the bubble" like doing some network or disk OS calls, before doing wait. In fact that is probably super useful/essential. But it would be nice to be able to detect the situation where I think I should be able to Wait for an instant but in fact I've not isolated something... might be useful to have a real-time timeout on Wait() to let me detect such problems (that there are non-durably blocked things continuously in my test that I didn't expect)? Or maybe there is a way to do this I'm not seeing... I could use an "outside" timer (obviously an "inside the bubble" timer won't do...) but then ... can I force Wait to panic somehow to unstick? Ah this is a Chicken-and-egg problem; might need a small bit of additional API for it; I cannot use an outside timer from inside the bubble to detect if the bubble is not blocked, because that timer itself would prevent the bubble from blocking... Docs question: does this system still work after recovering from a Wait panic? I'd have to isolate the synctest.Wait() in a function to recover from its panic... which just seems ugly when a timeout (real-time timeout) parameter would solve this cleanly. Or have a parameter that is passed to wait that causes it to return an error? Maybe this has already been discussed somewhere? Just brainstorming...

Comment From: neild

I created a file on disk, and my sim still runs fast through all the sleeps... is the consequence just that Wait() hangs forever?

If you do file I/O within a bubble, the only consequence is that synctest.Wait waits for any file I/O operation to complete before returning. For example, synctest.Wait won't return while another goroutine is in the middle of os.ReadFile, because while it might be blocked in a read that read won't be "durably" blocking.

This is generally what you want. Unless you're using non-regular files (pipes, sockets, etc.), file access within a bubble will just work as expected.

Comment From: rittneje

Unless you're using non-regular files (pipes, sockets, etc.), file access within a bubble will just work as expected.

@neild Am I correct in assuming that one caveat here is the file mtime will still reflect real time and not synctest fake time, which could be a problem for some use cases?

Comment From: neild

@neild Am I correct in assuming that one caveat here is the file mtime will still reflect real time and not synctest fake time, which could be a problem for some use cases?

Yes, that's correct. The synctest bubble doesn't affect the OS at all, so the filesystem times will reflect the real clock.

Comment From: cederlys

I've used this to test idle timeouts in a HTTP/2 server, where I have some mock code for the underlying net.Conn. This works fantastically well. It has really been helpful.

I did find that deadlocks were hard to track down, though. I wish the GOTRACEBACK=all trick mentioned by @glycerine above had been mentioned earlier.

Instead, I ended up instrumenting my test code with tracker.Block() calls before each potentially blocking operation, and tracker.Unblock() after. I had one tracker per goroutine, and they all forwarded the block state to a central manager. If it wrote "Goroutines 1, 2 and 4 are currently potentially blocked" just before synctest.Run reported a deadlock, I knew where to look. (And Block actually reported the callsite.) Of course, this method only works when you have deadlock or goroutine leak in the test code itself. It wouldn't have helped if the issue was in the code being tested. And instrumenting the code like this is very intrusive. I don't recommend it.

TL;DR: synctest is great! But something needs to be done about how deadlocks are reported.

Comment From: glycerine

I finished writing a channel-based network simulator that uses synctest. My RPC tests and actor async-style peer/circuit tests are fairly thorough, and I am happy to report that synctest did not glitch under any of them. My simulator handles timers, network message sends, and network message receives within my RPC framework.

For others wanting something similar for testing network code, here is a link -- with the large caveat that of course any simulator will depend heavily on the RPC/network library that one uses. This is not, at the moment, a "general purpose" network mock. It wouldn't be a huge lift to make it so, but I did not need full generality for my use case, which allows me to skip a ton of slow serialization and deserialization. The simnet code is aimed squarely at testing Raft's randomized timeouts over rpc25519 (that part is up next; still todo). Still, this would be a massive head-start; you would just need to update the semantics to use []byte and implement net.Conn instead of readMessage()/sendMessage().

https://github.com/glycerine/rpc25519/blob/master/simnet.go

I am very stoked to be able to have test and production code remain identical; by using synctest.

The only advice I would have for users, and perhaps this is guidance worth mentioning in the docs... be sure to get your tests working well before you run them under synctest.

Since I was learning how to use synctest while simultaneously writing my network emulator, I ended up mixing up the two (new tests and synctest) and sometimes puzzled as to whether a hung test was my bug or my not understanding synctest semantics. Invariably thus far it has been my bug, as running without synctest enabled showed. In short, start with tests that are already debugged and working.

If synctest didn't exist, I would have had to implement something far cruder and more invasive to the final production code.

Comment From: glycerine

I did have one more thought. It is about stable API design while maintaining flexibility to fix bugs and provide advanced features. It is invariably useful to have a pointer to a struct being passed through a function API, like how the testing.T might get passed around in possible future version of Run. If it turns out that Run or Test of Bubble or whatever it gets called does not end up taking a t testing.T, then I recommend adding a pointer to an opaque struct instead.

The reason this is useful is that you can then later add control flags to that struct without changing the function signature and breaking API compatibility. Here is an example, in my suggested nomenclature...

Instead of

faketime.Bubble(func() {
        // user test code here
})

ctrl := faketime.BubbleControls()
faketime.Bubble(ctrl, func() {
     // user test code here
})

This may seem like a minor extra bit of boilerplate to ask users to invoke, but believe you me, this kind of design has saved my bacon so many times that I consider it a minor crime to design an API without such a "control" or "config" struct handle. Otherwise you end up not being able to modify the frozen API surface. Here the ctrl struct can have private members that can change or be added to in the future, sliding "right through the hole" in the frozen surface; they will need no user facing code changes.

Designing a new API that does not have users yet? Jump on it. This is your last chance to get this kind of future flexibility for free. It means you can fix major bugs in minor point releases. While I share Mssrs Pike, Thompson, and Griesemer's love of minimalist design, this is a one-line tiny up-front cost to get super stable but strong API in the future.

Comment From: cederlys

The only advice I would have for users, and perhaps this is guidance worth mentioning in the docs... be sure to get your tests working well before you run them under synctest.

Once my test was complete, I tried to run it without synctest, just to be able to tell my colleagues how much faster the test is thanks to synctest. ("It now runs in 0.2 ms even though it tests a 20-second timeout!") But it turns out that was hard to do. My testcase checked that the timeout actually happened after exactly 20 seconds. Without synctest, you can't compare times exactly, so I had to add several lines of code that did fuzzy comparisons. Some other checks that were stable under synctest turned out to be flaky when the time advanced in a normal fashion.

While it is true that a working test will run well under synctest, I think it is also true that you can test more things in a reliable way when you are using synctest. So I'm not sure this is good advice, in general. In my opinion, there is no need to limit yourself to only testing things that are testable without synctest.

Comment From: glycerine

I'm seeing time not reset to zero between tests, making some of my fresh tests when run in a batch fail. They run green when alone, not with other tests. I filed https://github.com/golang/go/issues/73615

Comment From: glycerine

I am really stumped about why A) time.Sleep and B) synctest.Wait do not return in two scenarios: https://gist.github.com/glycerine/48cea61f4d635085f76c36c6f95fc23b

I've looked at every goroutine in the stack dumps, and I can't see it. All the goroutines in the synctest group look durably blocked. My guess would be, if I had to guess, that the synctest is ignoring the mutex that the sync.Cond is using... but just a guess.

It would be incredibly helpful to my understanding if I could grok why synctest doesn't let the sleep or the wait proceed.

Comment From: glycerine

cederlys good point. I've realized as much with another day's worth of experience with synctest. I've started to picture it as its own universe, where you have time on the Real number line -- where the space between integers in infinite -- and for clock labelling, all events get slammed back to the floor (nearest minimum integer) of their actual R (for Reals) number. In this alternative synctest universe, time only increments when there is nothing left to do except take the next nearest timeout and jump to it's integer clock label... certainly it should still be a self contained correct, but the corresponding expanded into our actual universe code need not be "correct" ; bugs or not stand on their own in the synctest universe. The relationship between runs under synctest and a run without is that the synctest run is drawn from a subset of possible "behaviors" in Lamport's temporal logic terms, and that set has more determinism in its clock labels (not complete determinism, but more, which is why it is useful).

edit: Of course since the single behavior you draw on running a synctest program is a possible interleaving of events that your program might observer, you want your program to be correct in both universes, since your program could get a draw (single sample of behavior from either when running in production), but having tests be red in synctest and green in the non-synctest universe, or green in synctest and red without, says nothing about correctness when only considering the synctest universe's subset. Hence, as you point out, my heuristic is a useful guideline for program development, but not a strict necessity for correctness/safety/liveness concerns.

Comment From: glycerine

I re-read the synctest blog post after having some more experience with synctest. (reference https://go.dev/blog/synctest)

I'd like to describe my understanding at this point, so that

a) others can check if it is correct (please point out errors);

b) so that I can describe what I would ideally like synctest to do, even if it doesn't do that today; and

c) we can distinguish between what synctest should or can do (assuming no bugs in implementation--which I'm not convinced is the case at this point, per my comment and gist above in https://github.com/golang/go/issues/67434#issuecomment-2857318636; specifically https://gist.github.com/glycerine/48cea61f4d635085f76c36c6f95fc23b ) given the practical limitations of having to operate within the Go runtime and all existing package facilities.

Warning: this is a bit long, I apologize, but I cannot see how to abridge it without sacrificing comprehension. Grab a coffee and give yourself an uninterruptable space-time coordinate to concentrate in.

The context is that I'm writing and testing a Raft implementation. Since Multi-Paxos in general is easy to get wrong and hard to get right (Raft as a more tightly constrained subset of Multi-Paxos that makes it easier to get right, but it is still far from trivial). One of the best ways to tackle testing my implementation would be to have FoundationDB/Antithesis (https://www.youtube.com/watch?v=4fFDFbi3toc; https://antithesis.com/blog/is_something_bugging_you/) style complete determinism and repeatable simulation-test runs. I suggested a more deterministic Go scheduler algorithm (Kendo; http://groups.csail.mit.edu/commit/papers/09/asplos073-olszewski.pdf) to the Go team in 2017 (https://github.com/golang/go/issues/21916), but Russ and Austin felt it was too big a project to tackle at that point. I'm delighted that we are inching towards this today. While Jelle van den Hooff's Gosim gets close --https://github.com/jellevandenhooff/gosim/blob/main/docs/design.md -- it requires transcompiling Go into another Go program that mocks time/native maps/random number generators with deterministic versions he has written for the network, filesystem, and so forth, it is still a young project and getting a compiler right is so hard that I worry the source-to-source transpilation will not be faithful to the original in all cases; moreover having multiple independent approaches that bite off different parts of the pie in different ways gives a better statistical likelihood of finding bugs. I point out his project also since his mocks may be useful to others in conjunction with synctest.

With that background in mind, and knowing that we cannot expect fully repeatable, deterministic execution, lets take as our shared goal: create as close to a deterministic network simulator as possible (partly done; https://github.com/glycerine/rpc25519/blob/grid_testsync/simnet.go#L18 gives the state of it today, with my questions). What follows is a condensed version of that initial read on the synctest blog.

My point of view is that of the "master" scheduler goroutine that wants to manipulate the network simulator to introduce network partitions, message drops, delayed reads, re-ordered and duplicated messages, and general chaos to the Raft cluster of machines by restarting servers, partitioning them, and just halting them completely--in the most deterministic and repeatable way possible.

Looking at the "Testing Time" section, the snippet TestWithTimeout in the blog post, we see:

// Wait just less than the timeout.
time.Sleep(timeout - time.Nanosecond)
synctest.Wait()

This was immediately puzzling. A priori, I naively thought, "isn't that backwards?" What is the point of the sleep before the wait? Isn't it just wasted? If I've just slept, and then woken up, any number of other goroutines could also have woken at the exact same instant on matching timers, so how can I reason at all about their execution order?

My questions at this point were: we know faketime advances when all goroutines are otherwise blocked (in a bubble), but does faketime only advance then? Is that guaranteed to be the only way (by synctest)? Is it possible that faketime advances under any other circumstances? I would need to know, in order to reason rigorously. For the moment, lacking answers, let's make some assumptions and proceed. Again, please correct me if I've misunderstood anything, even in shades of gray. Let me know if you have answers. Also let me know if these are goals, even if they aren't fully met today.

Let's assume, then, that faketime can only advance when all goroutines are durably blocked. "Time advances in the bubble when all goroutines are blocked," is what the blog says, but I think (again, please correct me), I suspect this should have said durably blocked rather than omitting that rather critical adjective.

So we (the main goroutine that executed time.Sleep) are about to wake up from the time.Sleep call (we are assuming running under synctest throughout here). Other goroutines representing servers/nodes in the network were probably actively setting alarms, doing reads, doing writes, crashing, restarting, etc. At the instant that I, the master-network-scheduler goroutine, wake up and look at the current faketime reading, what do I know? I'm awake, but so is any other goroutine with matching timer deadline.

Now what does the next step, the synctest.Wait call do for us? I conjecture that it does (or ideally should) do this: we, the scheduler, get to wait for "everyone else" to finish first, and get durably blocked. Wow! This is great. I, the scheduler, don't have to worry about any other goroutine operating concurrently with me! I know that they are all done. If we think about all possible interleavings of the scheduler and other network node goroutines, there are a ton, but hopefully they all arrive at the same repeatable state, and we just did a barrier operation that has left all of them blocked, and only I, the scheduler, am awake. Fantastic! Now the scheduler can manipulate the network atomically, deterministically, and reproducibly! I can partition a server node and know that I don't have a data race with any of its reads/writes/alarm changes. The scheduler is sure that it is "going last", at the end of everyone else's execution, and this is a powerful and potent guarantee. If we get to go last, and since everyone else is blocked, then my actions are applied on top of whatever state the cluster has arrived at, and this dramatically increases the reproducibility and determinism of the simulation!

In effect, we have "frozen time" just at the end of this nanosecond. We can adjust our queues, match sends and reads, set new timers, process expired timers, change node state to partitioned, etc, with the full knowledge that we are not operating concurrently with anyone else, but just after all of them have done any business they are going to do at the top of this nanosecond. We are now "at the very end" of the nanosecond current nanosecond (at the supremum, perhaps, if you liked my Real analysis metaphor above), just before the next clock tick (if we so choose to tick the clock by sleeping again next). Nice.

What if we did the wait, then slept (the opposite order)?

After the wait, we also know that everyone else is durably blocked. So in fact, we don't need to sleep ourselves to know that nobody else will be doing anything concurrently. It is guaranteed that we are the last, no matter what time it is. In fact, we might want to have other goroutines do stuff for us now, respond to us, and take any actions they are going to take, get blocked on timers or reads or sends. They can do so, operating independently and concurrently with us, until the next time.Sleep or synctest.Wait(), IF we want to be operating with them concurrently (but that is MUCH more non-deterministic!) then this is fine. BUT if we want the determinism guarantee that we are "going last" and have seen everyone else's poker hand before placing our bets, then we will want to do the Sleep then Wait approach.

This is a great. It is a fabulous situation to be in with respect to testing Raft.

But: it is still just a conjecture. I've made assumptions about synctest above that may not hold.

Let's write a test to check that we are indeed "last" after the double shot of sleep + synctest.Wait? How can we know if there is anyone else "active"? Maybe this is simple? Block ourselves with a select{}. If we were the only ones active, then the synctest system will panic, reporting deadlock.

My efforts to test synctest in this way resulted in the gist above, and were not very successful at establishing whether synctest is operating under the principles just described. Specifically, I do not see deadlocks when I expect them (with N=2 or above). I cannot tell if that is due to bugs in synctest, or bugs in my understanding. In an attempt to resolve this question, I've outlined my understanding in detail above. I apologize for the length, but precision and pointing out areas of ambiguity will hopefully result in a robust bug-free synctest, clear documentation of its guarantees and lack thereof, and an improvement in my ability to use it correctly.

Comment From: neild

I am really stumped about why A) time.Sleep and B) synctest.Wait do not return in two scenarios:

You're running afoul of this (from the synctest.Wait documentation:

A goroutine is not durably blocked when blocked on a send or receive on a channel that was not created within its bubble, because it may be unblocked by a channel receive or send from outside its bubble.

The barrierUp channel is created at init time, outside of any synctest bubble, so a select which includes a case <-barrierUp is not durably blocked.

Comment From: glycerine

The barrierUp channel is created at init time, outside of any synctest bubble, so a select which includes a case <-barrierUp is not durably blocked.

The hazards of global variables strike again! Thanks Neil.

Comment From: neild

As of https://go.dev/cl/645719, the error reporting for bubble deadlocks should be better. When a "deadlock: all goroutines in bubble are blocked" panic occurs, the runtime will now by default print the stack for the panicking goroutine (the one which called synctest.Run, followed by stacks for all goroutines in the bubble.

Comment From: glycerine

Suggestion: It could be useful during development to have an env variable that would cause synctest to immediately abort and give the developer an error at test runtime if:

a) there is a sync.Wait() call in progress; and

b) a select within a bubble is not durably blocking because it is receiving on a channel from outside the bubble.

(I could have used such an env variable to have the synctest help me catch my mistake, without needing expert advice).

Obviously some "outside to inside the bubble" channel operations might be needed in general, and so should not be verboten, but I think this could be a big help during development. The system then teaches beginners to synctest where they are going wrong. It would allow the dev to convey intent to the synctest runtime, and would allow the synctest runtime in turn to provide helpful assistance in writing the intended test logic. (edit: Or a new test flag that did the same, equivalently. Just like we run with go test -race during testing, we could run with go test -durcheck ).

Comment From: neild

Perhaps the stack traces for bubbled goroutines should include an explicit mention of whether the goroutine is durably blocked or not. The goroutine status actually already tells you this, but you need to know an unreasonable amount about synctest internals to know what statuses are durably blocking and which are not.

So something like this, where the first goroutine is reading from a non-bubbled channel and the second is reading from a bubbled one:

goroutine 36 [chan receive, synctest bubble 34, not durably blocked]:
_.Test.func1.1()
    /tmp/s/main_test.go:12 +0x24
created by _.Test.func1 in goroutine 35
    /tmp/s/main_test.go:11 +0x60

goroutine 37 [chan receive (synctest), synctest bubble 34, durably blocked]:
_.Test.func1.2()
    /tmp/s/main_test.go:16 +0x24
created by _.Test.func1 in goroutine 35
    /tmp/s/main_test.go:15 +0xb4

Then if you have a test that's hanging unexpectely, you can run it with -timeout=1s (or some other appropriately short interval) and observe which goroutines are keeping Wait from returning.

Comment From: glycerine

include an explicit mention of whether the goroutine is durably blocked or not.

Oooh. That would be awesome. I like it.

I still think the -durcheck flag is also worth it, hopefully saving alot of support tickets and confused newcomers. I mean, if you like hand holding dozens of beginners, it's your time... or just say "run with go test -durcheck first". :)

edit: Also the problem might occur after running for several minutes; just like a data race condition caught by an extended run on go test -race. Since my simulations are "model checking light", I do expect to be running them overnight (16 hours+). (This might be an argument for an env variable rather than a test flag, so that it could be altered during a run... where setup might need to skip to initialize properly, then set it once steady state is reached... presuming it is not cached. Maybe it should just be an API method call...to allow caching it. I can imagine wanting to tell the simulator to pause on precisely the scenario that locates a problem to be debugged, e.g. everything set up to re-step through the problem trace... Bah. There easier ways to handle this use case. Just the flag for simple teaching of new users should suffice.)

I used my newfound intuition about outside channels preventing durable blocking to construct a "pausable simulation" proof-of-principle:

https://github.com/glycerine/rpc25519/blob/grid_testsync/play/pausable_scheduler_demo_test.go

Comment From: glycerine

I especially like the labeling of the durabilty in a stack trace because I'm still wondering if my conjecture above is correct, and at least I could spot check it. To extract the essential question from my previous long query:

Does the goroutine that does time.Sleep() and then finishes a synctest.Wait() now know that they are the only runnable goroutine, that all others are durably blocked?

I think so, but I've assumed there is only one synctest.Wait() possible at a time. If I assume that, then the Sleep wakes only those goro who should be awake at this point in faketime (not blocked on a timer), and the Wait lets every other goroutine durably block before returning to its caller. Yep. There it is, in the docs. That must be why it is guaranteed:

[synctest.Wait] panics if called from a non-bubbled goroutine, or if two goroutines in the same bubble call Wait at the same time.

Nevermind(!) I've answered my own question. :)

Comment From: coxley

Minor feedback: using synctest.Run means you can't take advantage of t.Context() auto-cancelling when the test finishes, since it's part of the test. It's not the end of the world, but tripped me up when converting existing tests.

It's easily worked around, so not sure it's worth polluting the API.

Basic example:

package main

import (
    "testing"
    "testing/synctest"
    "time"
)

func TestSystemServer(t *testing.T) {
    synctest.Run(func() {
        ctx := t.Context()
        // This is the "fix"
        // ctx, cancel := context.WithCancel(t.Context())
        // defer cancel()

        ch := make(chan struct{})
        go func() {
            timer := time.NewTimer(time.Second)
            for {
                select {
                case <-ctx.Done():
                    return
                case <-timer.C:
                    ch <- struct{}{}
                }
            }
        }()

        synctest.Wait()
        <-ch
    })
}

Wrapping in a sub-test is one work-around:

package main

import (
    "testing"
    "testing/synctest"
    "time"
)

func TestSystemServer(t *testing.T) {
    synctest.Run(func() {
        t.Run("child", func(t *testing.T) {
            ctx := t.Context()

            ch := make(chan struct{})
            go func() {
                timer := time.NewTimer(time.Second)
                for {
                    select {
                    case <-ctx.Done():
                        return
                    case <-timer.C:
                        ch <- struct{}{}
                    }
                }
            }()

            synctest.Wait()
            <-ch
        })
    })
}

Comment From: spenczar

Minor feedback: using synctest.Run means you can't take advantage of t.Context() auto-cancelling when the test finishes, since it's part of the test. It's not the end of the world, but tripped me up when converting existing tests.

Previous discussion on the subject: https://github.com/golang/go/issues/67434#issuecomment-2749224778

I find the synctest.RunIsolated API appealing, after some experience with Synctest. I like that it has similarities to t.Run.

It would be really nice if it were t.RunIsolated but I understand the reasons you want to approach that cautiously.

If you had a handle to *testing.T, you could use T.Log to print useful information, as well, like how many goroutines progressed in a synctest.Wait call (discussed somewhat in https://github.com/golang/go/issues/67434#issuecomment-2840284395).

Also, you could use T.FailNow instead of panicking, and include a clean error message (first suggested in https://github.com/golang/go/issues/67434#issuecomment-2118224156)

Comment From: gopherbot

Change https://go.dev/cl/670976 mentions this issue: runtime: print blocking status of bubbled goroutines in stacks

Comment From: glycerine

Neil wrote:

Perhaps the stack traces for bubbled goroutines should include an explicit mention of whether the goroutine is durably blocked or not. The goroutine status actually already tells you this, but you need to know an unreasonable amount about synctest internals to know what statuses are durably blocking and which are not.

Change https://go.dev/cl/670976 mentions this issue: runtime: print blocking status of bubbled goroutines in stacks

Sweet. Thanks for posting the CL, Neil.

I was stuck in some slow scenarios, and wondering if it was because of waiting for durable blocking. So I tried to decode the CL and put together this cheat sheet. If anyone else is blocked on decoding the durability hints in the goroutine status... I think it is right, but please double check it, as I could have misunderstood the CL...

from go1.24.2/src/runtime/runtime2.go:1173 and :1178 and :1090
var waitReasonStrings = [...]string{

// ====================================================
// =============    durably blocked    (my surmises)   ============
// ====================================================

waitReasonChanReceiveNilChan:    "chan receive (nil chan)",
waitReasonChanSendNilChan:       "chan send (nil chan)",
waitReasonSelectNoCases:         "select (no cases)",

waitReasonSleep:                 "sleep",
waitReasonSyncCondWait:          "sync.Cond.Wait",
waitReasonSyncWaitGroupWait:     "sync.WaitGroup.Wait",
waitReasonCoroutine:             "coroutine",

waitReasonSynctestRun:           "synctest.Run",
waitReasonSynctestWait:          "synctest.Wait",
waitReasonSynctestChanReceive:   "chan receive (synctest)",
waitReasonSynctestChanSend:      "chan send (synctest)",
waitReasonSynctestSelect:        "select (synctest)",

Here's a mnemonic for remembering them: "SCONES are durable":

S sleep
C coroutine
O obviously not mutexes (always ignored, even with the word "sync")
N nil chan send/recv
E empty select
S "sync", the word, appears in the status

Comment From: gopherbot

Change https://go.dev/cl/671961 mentions this issue: testing/synctest: add Test

Comment From: gopherbot

Change https://go.dev/cl/671960 mentions this issue: runtime: disallow closing bubbled chans from outside bubble

Comment From: aclements

Overall, it seems like people are very happy with synctest and it solves a lot of problems. The current API has some rough interactions with the testing package, which we think will be resolved by the fairly small API change in https://github.com/golang/go/issues/73567#issuecomment-2851576126.

We'd put this on hold to gain experience with it and I think that's been very successful. We're ready to take it off hold and consider this along with #73567.

Comment From: aclements

Based on the discussion above, this proposal seems like a likely accept. — aclements for the proposal review group

The proposal is to add a new package, testing/synctest, with the following API:

// Package synctest provides support for testing concurrent code.
package synctest

// Test executes f in a new goroutine.
//
// The new goroutine and any goroutines transitively started by it form
// an isolated "bubble".
// Test waits for all goroutines in the bubble to exit before returning.
//
// Goroutines in the bubble use a synthetic time implementation.
// The initial time is midnight UTC 2000-01-01.
//
// Time advances when every goroutine in the bubble is blocked.
// For example, a call to time.Sleep will block until all other
// goroutines are blocked and return after the bubble's clock has
// advanced. See [Wait] for the specific definition of blocked.
//
// If every goroutine is blocked and there are no timers scheduled,
// Test panics.
//
// Channels, time.Timers, and time.Tickers created within the bubble
// are associated with it. Operating on a bubbled channel, timer, or ticker
// from outside the bubble panics.
//
// If Test is called from within an existing bubble, it panics.
//
// The [*testing.T] provided to f has the following properties:
//
//    - Functions registered with T.Cleanup run inside the bubble,
//       immediately before Test returns.
//    - The [context.Context] returned by T.Context has a Done channel
//       associated with the bubble.
//    - T.Run may not be called within the bubble.
func Test(t *testing.T, f func(*testing.T))

// Wait blocks until every goroutine within the current bubble,
// other than the current goroutine, is durably blocked.
// It panics if called from a non-bubbled goroutine,
// or if two goroutines in the same bubble call Wait at the same time.
//
// A goroutine is durably blocked if can only be unblocked by another
// goroutine in its bubble. The following operations durably block
// a goroutine:
//   - a send or receive on a channel from within the bubble
//   - a select statement where every case is a channel within the bubble
//   - sync.Cond.Wait
//   - time.Sleep
//
// A goroutine executing a system call or waiting for an external event
// such as a network operation is not durably blocked.
// For example, a goroutine blocked reading from an network connection
// is not durably blocked even if no data is currently available on the
// connection, because it may be unblocked by data written from outside
// the bubble or may be in the process of receiving data from a kernel
// network buffer.
//
// A goroutine is not durably blocked when blocked on a send or receive
// on a channel that was not created within its bubble, because it may
// be unblocked by a channel receive or send from outside its bubble.
func Wait()

This API comes from https://github.com/golang/go/issues/73567#issuecomment-2851576126, except that I changed Test to forbid T.Run without guaranteeing it will panic. We may also want to loosen the type of failure when every goroutine is blocked to allow, for example, t.Fatal in that case.

The existing synctest.Run function will continue to exist for one more Go release behind GOEXPERIMENT=synctest to ease transition to the new Test API.

Comment From: glycerine

Comment: The proposal text above mixes "blocked" (in the Test doc) and "durably blocked" (in the Wait doc) which is slightly confusing. Is this distinction intended? If so, it is worth explaining why blocked but not durably blocked suffices in the Test case, as such care went into explaining it for the Wait case. Otherwise it would be clearer to have the adjective used consistently, either just "blocked", or just "durably blocked", in both cases. I like it with durable in both, but only if that is accurate, of course.

Comment From: mgarstecki

I encountered unexpected and worrying flakiness in the stdlib when run under synctest. Here's a simple test that randomly panics:

package flaky_test

import (
    "encoding/json"
    "testing"
    "testing/synctest"
)

func TestFlaky(t *testing.T) {
    t.Parallel()

    for range 100 {
        t.Run("", func(t *testing.T) {
            t.Parallel()

            flakyScenario(t)
        })
    }
}

func flakyScenario(t *testing.T) {
    synctest.Run(func() {
        _, err := json.Marshal([]string{})
        if err != nil {
            t.Fatalf("failed to serialize: %v", err)
        }
    })
}

When run in a loop (ex: while GOEXPERIMENT=synctest go test .; do go clean -testcache ; done), it will fail after a few seconds with panic: deadlock: all goroutines in bubble are blocked.

This is boiled down from one of our real test cases. It fails faster under load or when you increase the parallelism, and we reproduced it several times in our CI because to this.

I haven't deeply analyzed the cause, but I see that the json package uses a global sync.Pool and I suspect it is racy under the concurrency model of synctest.

This issue was a total surprise to me, and took a while to debug: the actual test in our codebase also uses t.TempDir() and os.WriteFile() and those were, to me, much more likely candidates for flakyness since they break the bubble by calling to the OS.

I'm not sure this is an actual bug of the json package, more of a leaky abstraction: as a user, I expect the json.Marshal algorithm to be straighforward, thread-agnostic code, and a naïve but correct implementation would be. But, likely for optimizations, it uses shared state, which wasn't designed for the modified concurrency model of synctest.

There may be other seemingly benign algorithms in the stdlib that fail unexpectedly under synctest, which will create debugging headaches. I think those should be adapted to be stable under synctest (ideally) or documented as unsafe (reducing heavily the valid use cases for synctest).

Comment From: mgarstecki

To be clear, I love synctest and I definitely want it to be accepted. This case is extracted from a heavily asynchronous codebase, which has several agents acting on tickers and communicating through channels. Our previous testing approach relied heavily on carefully designed mocks that yield values very precisely to lockstep the tests, and careful setup of timeouts.

synctest made the tests faster, less flaky and easier to write and read. Our old mocks were piles of channels and mutexes to be defensive against data races. Synctest allows to convert them to basic implementations, relying on synctest.Wait() to know when the tested component has settled and we can run assertions and manipulate the mocks.

It has also helped me catch some goroutine leaks on cleanup that were asymptomatic but still unexpected. It feels like strict-mode goroutines, which is perfect to ensure correctness.

But it may be hard to use if there are such pitfalls in the stdlib 😕

Comment From: mgarstecki

I can't reproduce the issue when using sync.Pool directly. The issue is likely elsewhere in the marshaller, I'll try to isolate it later tonight.

Comment From: neild

Thanks for the report, @mgarstecki! I was able to reproduce the problem with your example.

The problem is that encoding/json uses a WaitGroup to coordinate access to a cache between multiple goroutines: https://go.googlesource.com/go/+/refs/tags/go1.24.3/src/encoding/json/encode.go#359

When multiple synctest bubbles access the cache at the same time, a goroutine in one bubble can become blocked on a goroutine in the other bubble. Since we don't track whether the WaitGroup was created in a bubble, or which bubble it was created in, this looks like a deadlock.

I'm not certain yet what the right fix is.

Comment From: spenczar

I wouldn't be offended if t.Parallel() were forbidden somehow when using synctest. At best, it is unclear what parallel tests would mean for synctest's manipulation of time.

Comment From: neild

Each bubble has its own independent clock. Barring some form of cross-bubble communication (which is what's happening here), parallel tests should just work.

Comment From: gopherbot

Change https://go.dev/cl/673335 mentions this issue: encoding/json: avoid supurious synctest deadlock detection

Comment From: mgarstecki

I wouldn't be offended if t.Parallel() were forbidden somehow when using synctest

That would be sad, because parallel fits well with synctest: both features combine to speed up the tests on a large codebase, it would be sad to have to choose between them. It also feels opposed to the promise of synctest to increase isolation between tests. Tests that run in synctest should be inherently parallelizable.

But it modifies the behavior of Goroutines, so I'm not surprised to find incompatible algorithms in existing code. I do wonder what should be done if further cases are found in the stdlib. We have a promising fix for json, but other algorithms may prove impossible to fix without unacceptable regressions.

If such cases are found in the stdlib, should they be documented as incompatible with synctest, and documented somewhere ?

Comment From: spenczar

I wouldn't be offended if t.Parallel() were forbidden somehow when using synctest

It also feels opposed to the promise of synctest to increase isolation between tests. Tests that run in synctest should be inherently parallelizable.

That's fair. I have different reasons for liking synctest. To me, the appeal is test reliability. Where there is tension between test speed and reliability, I will choose reliability, but my test suites aren't huge.

Perhaps using t.Parallel should mean accepting some reliability risk (because of global synchronization structures) with a promise of speed, and the Go team should work on issues as they are detected and reported; users can fall back to removing t.Parallel when they find a problem.

That's not completely satisfying, but maybe it is pragmatic?

Comment From: gopherbot

Change https://go.dev/cl/674515 mentions this issue: runtime: record synctest bubble ownership in hchan

Comment From: gopherbot

Change https://go.dev/cl/675257 mentions this issue: runtime, testing/synctest: run cleanups/finalizers outside bubbles

Comment From: aclements

Thanks for discovering the WaitGroup problem. I believe we have the latitude to refine details of how synchronization primitives interact in bubbles in cases where they currently don't, like for example with WaitGroups. Though we should explicitly document this. Thus, I see no need to hold up the proposal.

@neild , for the WaitGroup fix, I'd say ask for a freeze exception and try to get the fix in this week.

Comment From: aclements

No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — aclements for the proposal review group

The proposal is to add a new package, testing/synctest, with the below API. For the complete (and extensive) package documentation, please see CL 671961 PS 10.

// Package synctest provides support for testing concurrent code.
package synctest

// Test executes f in a new bubble.
//
// Test waits for all goroutines in the bubble to exit before returning.
// If the goroutines in the bubble become deadlocked, the test fails.
//
// Test must not be called from within a bubble.
//
// The [*testing.T] provided to f has the following properties:
//
//   - T.Cleanup functions run inside the bubble,
//     immediately before Test returns.
//   - T.Context returns a [context.Context] with a Done channel
//     associated with the bubble.
//   - T.Run, T.Parallel, and T.Deadline must not be called.
func Test(t *testing.T, f func(*testing.T))

// Wait blocks until every goroutine within the current bubble,
// other than the current goroutine, is durably blocked.
//
// Wait must not be called from outside a bubble.
// Wait must not be called concurrently by multiple goroutines
// in the same bubble.
func Wait()

The existing synctest.Run function will continue to exist for one more Go release behind GOEXPERIMENT=synctest to ease transition to the new Test API.

Comment From: gopherbot

Change https://go.dev/cl/675617 mentions this issue: runtime, testing/synctest: breaking bubble isolation with Cond is fatal

Comment From: neild

Accepted and implemented, so closed.

Comment From: gopherbot

Change https://go.dev/cl/680695 mentions this issue: internal/fetch: remove goexperiment.synctest

Golang testing/synctest: new package for testing concurrent code

clock.Now()

clock.Sleep

clock.After

clock.NewTicker

example from the proposal would look like

System Under Test

Test Infrastructure

Using synctest

What went well

What went less well

70512 is an interesting case which could conceivably occur in user code:

Current proposal status

Known issues

Time advances forever after main test goroutine exits

Confusing error when go.mod Go version is less than 1.23

(testing.TB).Cleanup does not work well in synctest bubbles

(*testing.B).Loop does not work in synctest bubbles

Deadlock errors don't print useful stacks

AI-generated issue overview

Original Post Summary

Discussion Themes

Next Steps

TL;DR

Details

General Architecture

Testing Strategy

Downsides

synctest to the rescue

key challenge: giving up cloud.google.com/go/logging in SUT

key challenge: http.Client -> http.Transport

key challenge: SQLite

Confusing error when `go.mod` Go version is less than 1.23

`(testing.TB).Cleanup` does not work well in synctest bubbles

`(*testing.B).Loop` does not work in synctest bubbles

`synctest` to the rescue

key challenge: giving up `cloud.google.com/go/logging` in SUT

key challenge: `http.Client -> http.Transport`

key challenge: `SQLite`