In https://github.com/golang/go/issues/14383#issuecomment-220657111, @rsc said:
There is an argument for making
go tr := time.NewTimer(0) tr.Stop() // There may or may not be a value in the channel now. But definitely // one should not be added after we receive it. select { case <-tr.C: default: }
work, but it doesn't today, and we're certainly not going to do that for Go 1.7. Maybe early in Go 1.8 if someone wants to file a different bug.
As far as I can tell, no different bug was filed: the documentation for (*Timer).Stop
still says:
https://github.com/golang/go/blob/7d2473dc81c659fba3f3b83bc6e93ca5fe37a898/src/time/sleep.go#L57-L66
and that behavior is still resulting in subtle bugs (#27169). (Note that @rsc himself assumed that a select
should work in this way in https://github.com/golang/go/issues/14038#issuecomment-219909704.)
I think we should tighten up the invariants of (*Timer).Stop
to eliminate this subtlety.
CC @empijei @matttproud @the80srobot @ianlancetaylor
Comment From: bcmills
For an example program illustrating why a user might reasonably expect (and attempt to make use of) this behavior, see https://play.golang.org/p/mDkMG67ehAI.
Comment From: empijei
Thanks for filing this.
I also found this behavior puzzling and apparently I got it wrong too as I was expecting no values to be sent after t.Stop()
returned.
If, as I understand now, a value might be sent after the call to Stop
has returned, is there any way to correctly re-use a timer?
Comment From: bcmills
If, as I understand now, a value might be sent after the call to
Stop
has returned, is there any way to correctly re-use a timer?
If you know that no other goroutine is receiving on the C
channel, then you can receive unconditionally after the call to Stop
returns false
, and it won't actually matter whether the value was delivered before or after the call to Stop
returned.
(The select
-with-default
after Stop
is where the bug occurs, and none of the examples in your post do that — they don't need to, because in your examples the reads on t.C
are all on the same goroutine as the calls to t.Stop
.)
Comment From: ianlancetaylor
In the current implementation, this can probably be done by changing runOneTimer
to move the Cas
away from timerRunning
until after the invocation of f
.
Comment From: rsc
Added to proposal process to resolve whether to do this, after mention in #38945.
Comment From: rsc
This made its first appearance in the proposal minutes last week. You can imagine ways to break Go programs with this, such as by
t := time.NewTimer(1*time.Second)
time.Sleep(2*time.Second)
t.Stop()
<-t.C
This would succeed today, but if we clear the channel during Stop
it would start blocking.
We would have to define that in the case where Stop
pulls the element back out of the channel buffer, it returns true, meaning it "stopped" the timer (false means the timer already expired; expired would now mean "expired and was observed by a channel receive on t.C").
On the other hand, there is technically a race here, and it's not guaranteed that this snippet never blocks, especially on a loaded system. The correct snippet, even today, is:
t := time.NewTimer(1*time.Second)
time.Sleep(2*time.Second)
if !t.Stop() {
<-t.C
}
That example would be unaffected, because Stop would pull the buffered element back out of the channel and then return true
, signaling to the caller that there is nothing to receive.
We could make the change at the start of a release cycle and be ready to roll it back if real (as opposed to hypothetical) problems arose.
Thoughts?
Comment From: bcmills
While I like the behavior of pulling the value back out of the channel and having Stop
return true
, I don't think it is actually backward-compatible, depending on how the len
builtin for channels interacts with the memory model.
For example, this program today is not reported as a race and the receive is guaranteed to never block, but would deadlock if Stop
clears the channel:
package main
import (
"fmt"
"time"
)
func main() {
t := time.NewTimer(1 * time.Second)
for len(t.C) == 0 {
time.Sleep(time.Millisecond)
}
t.Stop()
<-t.C
fmt.Println("ok")
}
That is why I had restricted my original proposal to guarantee only that no value is sent after Stop
returns.
Comment From: ianlancetaylor
Sorry, I'm not sure I understand what it means to guarantee that no value is sent on the channel after Stop
returns. As far as I can see, that is already guaranteed--maybe not in the documentation, but certainly in the implementation. What is the actual change that you are suggesting?
Comment From: bcmills
@ianlancetaylor, the program given in https://play.golang.org/p/Wm1x8DmYoQo should run indefinitely without reporting any delayed times. Instead, it reports a nonzero rate of delayed timer sends.
On my Xeon workstation:
example.com$ gotip version
go version devel +b2ce3931 Tue May 26 22:28:45 2020 +0000 linux/amd64
example.com$ gotip run main.go
Timer stopped at 2020-05-27 22:16:46.951252699 -0400 EDT m=+2.886904949 subsequently received time
2020-05-27 22:16:46.951252454 -0400 EDT m=+2.886904697.
Timer stopped at 2020-05-27 22:17:00.794561424 -0400 EDT m=+16.730213661 subsequently received time
2020-05-27 22:17:00.794559263 -0400 EDT m=+16.730211512.
Timer stopped at 2020-05-27 22:17:00.774413697 -0400 EDT m=+16.710065937 subsequently received time
2020-05-27 22:17:00.77440917 -0400 EDT m=+16.710061421.
Timer stopped at 2020-05-27 22:17:03.679525664 -0400 EDT m=+19.615177943 subsequently received time
2020-05-27 22:17:03.679522832 -0400 EDT m=+19.615178663.
Timer stopped at 2020-05-27 22:17:06.024962313 -0400 EDT m=+21.960614549 subsequently received time
2020-05-27 22:17:06.024960025 -0400 EDT m=+21.960612270.
Timer stopped at 2020-05-27 22:17:06.692670731 -0400 EDT m=+22.628322974 subsequently received time
2020-05-27 22:17:06.692669848 -0400 EDT m=+22.628322092.
Timer stopped at 2020-05-27 22:17:22.942118175 -0400 EDT m=+38.877770415 subsequently received time
2020-05-27 22:17:22.942117468 -0400 EDT m=+38.877769713.
Timer stopped at 2020-05-27 22:17:22.901857123 -0400 EDT m=+38.837509358 subsequently received time
2020-05-27 22:17:22.901855958 -0400 EDT m=+38.837508205.
Timer stopped at 2020-05-27 22:17:27.342696328 -0400 EDT m=+43.278348567 subsequently received time
2020-05-27 22:17:27.342693224 -0400 EDT m=+43.278345463.
Timer stopped at 2020-05-27 22:17:28.993568136 -0400 EDT m=+44.929220391 subsequently received time
2020-05-27 22:17:28.993571601 -0400 EDT m=+44.929223839.
Received 10 delayed times after 128517449 iterations.
In all cases, the failure mode is the same: at some point after Stop
returns, the channel receives a very-slightly-earlier timestamp. That is: the return from Stop
is not synchronized with the actual send to the channel.
Instead, I propose that the send to the channel, if it occurs at all, ought to happen before the return from Stop
.
Comment From: ianlancetaylor
Thanks for the example. If the code is written per the documentation of time.Timer.Stop
, it works correctly (https://play.golang.org/p/KlC__ujuE5M).
And now I see how this happens. It's because in runOneTimer
we set the status to timerNoStatus
before we call the function that sends a value on the channel, which permits deltimer
and therefore time.Timer.Stop
to return false
before we send the value. The return of false
does reliably mean that a value will be sent on the channel, but the value may not have been sent when Stop
returns.
I don't see an easy fix in the current implementation. We can't leave the timer in timerRunning
state while we unlock pp.timersLock
in order to run the function.
Comment From: bcmills
Would it make sense to change the timer to the timerModifying
state while running the callback (which should be short), and then to timerNoStatus
after the callback completes?
(Or would that potentially induce starvation for a time.Ticker
with a very short period?)
Comment From: bcmills
The same send-after-Stop
problem appears to also applies to time.Ticker
, which I suppose is not a surprise because they share the same underlying timer
implementation.
(Code in https://play.golang.org/p/Y_Hz4xkYr07, but it doesn't reproduce there due to the Playground's faketime
patch.)
Comment From: ianlancetaylor
It's actually a bit more complicated than I thought. If we leave the timer status as timerRunning
(or timerModifying
), then if the timer function does anything that touches the timer, it can hang. For example, the scavenger uses a timer with a function of wakeScavenger
that expects to stop the timer. That would hang waiting for the timer to leave timerRunning
status. The netpoll code also manipulates the timer in the timer function, though I think it wouldn't cause any problems here.
Comment From: rsc
Let's try to separate the discussion of semantics from implementation.
The docs say:
To ensure the channel is empty after a call to Stop, check the return value
and drain the channel. For example, assuming the program has not received
from t.C already:
if !t.Stop() {
<-t.C
}
The proposed change in semantics is to make it that no receive can ever happen after t.Stop returns. That would mean that under the assumption - "assuming the program has not received from t.C already" - t.Stop would never return false anymore. So the above code would no longer be required (and if left alone would just never execute the if body).
This would avoid the problem of people not knowing to write that fixup code or not understanding the subtlety involved. Certainly we've all confused ourselves enough just thinking through this over the years.
That's the proposed semantic change.
There are also questions about how to implement that - it's not just the easy "pull the item back out of the buffer" that I suggested before - but certainly it is possible. Let's figure out if we agree about the semantic change.
(Another side-effect would probably be that len(t.C) would always return 0, meaning the channel would appear to be unbuffered. But we've never promised buffering in the docs. The buffering is only there because I was trying to avoid blocking the timer thread. It's a hack.)
Comment From: bcmills
I agree that if we could change the semantics to ensure that no receive is possible after Stop
, that would be preferable to both the current behavior and the originally-proposed change to only disallow a send-after-Stop
.
However, I don't see any way to do that without breaking the example in https://play.golang.org/p/r77N1PfXuu5.
That program explicitly relies on the buffering behaviors of t.C
: specifically, that len(t.C)
becomes nonzero when the timer fires, and that the buffered value remains receivable after t.Stop()
. Neither of those behaviors is strictly guaranteed by the documentation, but nor does anything in the documentation explicitly reserve them as subject to change — and they have been observable since CL 4063043 over nine(!) years ago.
Disallowing a receive after Stop
must necessarily break one of those two behaviors. If we marked the buffer as non-empty but then retracted the buffered value in t.Stop()
, then the program would block forever on the subsequent receive. If we instead made the channel unbuffered — or, equivalently, hooked the runtime to avoid sending to it until some goroutine is ready to receive — then the program would loop forever waiting for len(t.C)
to become nonzero.
So the only way I see to make that change in semantics would be to declare that this example program is already incorrect, because it is relying on undocumented behavior regardless of the stability of that behavior. I could be wrong, but given how long this behavior has been in place I suspect that changing it would break real programs.
Comment From: ianlancetaylor
Do you know of any programs that call len
on a timer channel?
Comment From: bcmills
I do not know of any specifically, but given the 9-year interval and Hyrum's law I would be surprised if they did not exist.
If we are willing to assume that such programs do not exist (or declare that they are invalid if they do exist, because the buffering behavior was never documented), then preventing the receive seems ok.
Comment From: rsc
Above, we identified that one possible way that the semantic change being discussed might break code is if the code used len
to poll whether the timer had gone off. We would have to make len
report zero all the time, simulating an unbuffered channel. Code that depended on len
sometimes being non-zero could possibly break.
My intuition was that using len
on a channel is racy and almost never better than a select with a default case, so there's a decent chance that not too much code is using len. To test this hypothesis, I searched my local corpus of Go packages for length on timer channels and sifted through the results with @bcmills. We found that use of len
on timer channels is (1) rare, (2) almost always incorrect, (3) when incorrect, always fixed by the new semantics.
The corpus I am using is a collection of the latest of each major version of each module listed in the Go index (index.golang.org) as of late March. There are 134,485 modules with distinct paths in my corpus. Of those, 17,021 mention time.NewTimer
or time.NewTicker
. Of those, 15 modules (slightly under 0.1% of the 17,021) contain code matching len\(.*\.C\)
in which the len
operation applies to a timer or ticker channel.
Of the 15 modules applying len
to a timer or ticker channel: one is an awkward simulation of a non-blocking select, already removed in a later commit; one uses len
in debug prints to monitor the ticker; and the remaining 13 are buggy.
The semantic change proposed in this issue would make all the buggy code correct, assuming we apply the same change to Reset
as we've discussed for Stop
. We should probably also apply the change to time.Ticker
's Reset
and Stop
.
Use of len to simulate non-blocking select
github.com/packetzoom/logslammer
github.com/packetzoom/logslammer version v0.0.3 output/elasticsearch/elasticsearch.go (April 7, 2016) contains the following code in output/elasticsearch/elasticsearch.go:
func readInputChannel(idx *Indexer, receiveChan chan *buffer.Event) {
// Drain the channel only if we have room
if idx.bulkService.NumberOfActions() < esSendBuffer {
select {
case ev := <-receiveChan:
idx.index(ev)
}
} else {
log.Printf("Internal Elasticsearch buffer is full, waiting")
time.Sleep(1 * time.Second)
}
}
...
// Loop events and publish to elasticsearch
tick := time.NewTicker(time.Duration(esFlushInterval) * time.Second)
for {
readInputChannel(idx, receiveChan)
if len(tick.C) > 0 || len(es.term) > 0 {
select {
case <-tick.C:
idx.flush()
case <-es.term:
tick.Stop()
log.Println("Elasticsearch received term signal")
break
}
}
}
This code seems to be using the len
checks as an alternative to a non-blocking select
on those two channels. It's also worth noting that when received events are rare, the tick receive and corresponding idx.flush
may be arbitrarily delayed by the blocking receive in readInputChannel
. It's possible that select was meant to be non-blocking.
The semantic change proposed in this issue would make the code never process ticks, since len(tick.C)
would always be zero. However, in the latest untagged commit (February 17, 2017), the test of len(tick.C)
has been removed. Also, the lack of any commits in the last three years, nor even a tag of the updated code, suggests the code is unmaintained and no longer used.
Use of len for logging/debugging prints
github.com/uber/cherami-server
github.com/uber/cherami-server version v1.28.1 services/outputhost/messagecache.go (November 29, 2017) has a logPumpHealth
method that logs information about the capacity and length of its ticker channel, among others:
func (msgCache *cgMsgCache) logPumpHealth() {
logPumpHealthOnce.Do(func() {
msgCache.lclLg.WithFields(bark.Fields{
...
`RedeliveryTickerCap`: cap(msgCache.redeliveryTicker.C),
}).Info(`cache pump health capacities`)
})
msgCache.lclLg.WithFields(bark.Fields{
...
`RedeliveryTickerLen`: len(msgCache.redeliveryTicker.C),
}).Info(`cache pump health`)
lg := msgCache.lclLg
now := common.Now()
...
if msgCache.lastRedeliveryTickerFull > 0 {
lg = lg.WithField(`RedeliveryTicker`, common.UnixNanoTime(now-msgCache.lastRedeliveryTickerFull).ToSeconds())
}
lg.Info(`cache pump health last channel full times`)
}
And the `updatePumpHealth method does:
func (msgCache *cgMsgCache) updatePumpHealth() int {
now := common.Now()
fullCount := 0
...
if cap(msgCache.redeliveryTicker.C) <= len(msgCache.redeliveryTicker.C) {
msgCache.lastRedeliveryTickerFull = now
fullCount++
}
return fullCount
}
The logging would never trigger anymore, and the pump health update would register the channel as “full” for being unbuffered (length and capacity both 0).
However, the repository's README says that the project is “deprecated and not maintained.”
Racy use of len to drain channel after Stop
github.com/Freezerburn/go-coroutine
github.com/Freezerburn/go-coroutine v1.0.1 src/github.com/Freezerburn/coroutine/embed.go:
// Guarantee the timer channel is drained.
// Despite the docs saying to only check the return value of Stop, it's possible for Stop to return false without
// there being an item in the channel, so we also double-check if there is an item in the channel before attempting
// to drain it.
if !e.waitTimer.Stop() && len(e.waitTimer.C) > 0 {
<-e.waitTimer.C
}
e.waitTimer.Reset(duration)
<-e.waitTimer.C
This use of len
is potentially racing with the pending send that Stop
did not stop. The correct code today would be to track whether e.waitTimer.C
has been received from already and use that state in place of the length check. In fact, since this code is the only use of e.waitTimer
, the timer will have only already fired and not been received from on the first execution of the method containing this code. Every time the method completes, the timer has expired and been read from. The first execution is different because e.waitTimer
is initialized to time.NewTimer(0)
.
The proposed semantic change to Stop
would make this code correct: on the first call, e.waitTimer.Stop
would return true—the Stop
prevented the zero-duration timer event from being sent. On subsequent iterations, e.waitTimer.Stop
would return false—the timer event was already received by the previous call—and len
would be zero, bypassing the no-longer-necessary draining code.
There is another instance of the pattern later in the file:
if !e.receiveTimer.Stop() && len(e.receiveTimer.C) > 0 {
<-e.receiveTimer.C
}
e.receiveTimer.Reset(duration)
select {
case <-e.receiver:
case <-e.receiveTimer.C:
}
Like e.waitTimer
, e.receiveTimer
is initialized to time.NewTimer(0)
.
On top of that, the code does not record whether the select received from
e.receiveTimer.C
, as it must for proper interpretation of the Stop
result.
This lack of memory is partly compensated for by the racy (buggy) len
check.
The same analysis applies: if the Stop
guarantees to stop any future send,
then either the receive has already happened, in which case Stop
returns false
and len
is zero, so or the receive is prevented, in which case Stop
returns true.
Either way, the code would now bypass the no-longer-necessary draining code.
The current latest commit rewrites both of those code snippets to use a helper that is different (avoiding len
entirely) but still buggy:
func resetTimer(t *time.Timer, d time.Duration) {
if !t.Stop() {
// I've seen cases where Stop returns false, but there isn't an item in the channel, causing the receive to
// hang forever.
select {
case <-t.C:
default:
}
}
t.Reset(d)
}
The same analysis applies, since a non-blocking select is just a less racy “len
plus receive” operation. Removing the race between the length check and the receive did not eliminate the race with the delayed timer send. But the proposed semantic change would.
github.com/MatrixAINetwork/go-matrix
github.com/MatrixAINetwork/go-matrix v1.1.7 p2p/buckets.go (May 20, 2020):
for {
select {
case <-timeoutTimer.C:
...
if !timeoutTimer.Stop() && len(timeoutTimer.C) > 0 {
<-timeoutTimer.C
}
timeoutTimer.Reset(time.Second * 60)
case h := <-b.blockChain:
if !timeoutTimer.Stop() && len(timeoutTimer.C) > 0 {
<-timeoutTimer.C
}
timeoutTimer.Reset(time.Second * 60)
...
}
}
This code is trying to make sure that each iteration of the select falls into the timeout case after 60 seconds.
For the most part it succeeds, but a receive from b.blockChain
right at the moment the timer went off would fail to drain timeoutTimer.C
, causing the next iteration to fall into the timeout case immediately.
Same race as in previous example; proposed semantic change applies the same fix.
github.com/cobolbaby/log-agent
github.com/cobolbaby/log-agent 71f7f9f watchdog/watchdog.go:
timer := time.NewTimer(delay)
...
for {
select {
case e = <-eventChan:
...
// timer may be not active, and fired
if !timer.Stop() && len(timer.C) > 0 {
<-timer.C //ctry to drain from the channel
}
timer.Reset(delay)
case <-timer.C:
...
return
}
}
Same race; same fix.
github.com/dcarbone/go-cacheman
github.com/dcarbone/go-cacheman v1.1.2 key_manager.go:
if !lifespan.Stop() && len(lifespan.C) > 0 {
<-lifespan.C
}
lifespan.Reset(km.idleTTL)
Same race; same fix (appears four times in file).
github.com/myENA/consultant/v2
github.com/myENA/consultant/v2 v2.1.1 service.go:
if !refreshTimer.Stop() && len(refreshTimer.C) > 0 {
<-refreshTimer.C
}
refreshTimer.Reset(ms.refreshInterval)
Same race; same fix (appears three times in file).
github.com/smartcontractkit/chainlink
github.com/smartcontractkit/chainlink v0.7.8 core/services/fluxmonitor/flux_monitor.go:
// stopTimer stops and clears the timer as suggested by the documentation.
func stopTimer(arg *time.Timer) {
if !arg.Stop() && len(arg.C) > 0 {
// Residual events are the timer's channel and need to be cleared.
//
// Refer to timer.Stop's documentation or
// https://developpaper.com/detailed-explanation-of-the-trap-of-timer-in-golang/
<-arg.C
}
}
Same race; same fix. This is of course not the code suggested by the documentation, although it does match the (incorrect) linked blog post.
This code was removed without comment in v0.8.0.
github.com/vntchain/go-vnt
github.com/vntchain/go-vnt v0.6.4-alpha.6 producer/worker.go:
// The stop command may be happen when the round timer is timeout but not deal with
// the timeout event, so cleaning the channel of roundTimer is needed.
if false == self.roundTimer.Stop() && len(self.roundTimer.C) > 0 {
<-self.roundTimer.C
log.Warn("worker.roundTimer.C still has a expired event, now has been cleaned")
}
Same race; same fix.
github.com/eBay/akutan
github.com/eBay/akutan 9a750f2 src/github.com/ebay/akutan/util/clocks/wall.go:
timer.Stop()
for len(timer.C) > 0 {
<-timer.C
}
...
timer.Reset(wake.Sub(Wall.Now()))
Same race; same fix.
github.com/piotrnar/gocoin
github.com/piotrnar/gocoin 820d7ad client/main.go:
SaveBlockChain.Stop()
for len(SaveBlockChain.C) > 0 {
<-SaveBlockChain.C
}
if common.BlockChainSynchronized {
SaveBlockChain.Reset(SaveBlockChainAfter)
} else {
SaveBlockChain.Reset(SaveBlockChainAfterNoSync)
}
Same race; same fix.
Racy use of len to drain channel before Reset
github.com/qlcchain/go-qlc
github.com/qlcchain/go-qlc v1.3.5 p2p/sync.go:
if len(ss.pullTimer.C) > 0 {
<-ss.pullTimer.C
}
ss.pullTimer.Reset(pullReqTimeOut)
This is the usual pattern from the previous section, except there is no call to Stop
,
so the receives are racing against an unstopped timer instead of the dying gasps of a stopped timer. But the race is the same.
If we change Reset
to have the same guarantee as Stop
, namely that no send from the old timer setting (before the Reset
) can complete after Reset
returns, then a plain call to Reset
without Stop
like in this code becomes correct, even without the if
statement. With the proposed change, the if
body would never execute.
Racy use of len to drain channel after Reset
github.com/chenjie199234/Corelib
github.com/chenjie199234/Corelib 2d8c16542cbe logger/logger.go:
for {
select {
case <-tmer.C:
...
tmer.Reset(500 * time.Millisecond)
if len(tmer.C) > 0 {
<-tmer.C
}
case <-l.update:
...
tmer.Reset(500 * time.Millisecond)
if len(tmer.C) > 0 {
<-tmer.C
}
case <-l.refresh:
...
}
}
This is like the previous example but the draining happens after Reset
, making it race with the newly reset timer. A bad scheduling pause that introduced a half-second delay in goroutine execution between the timer being reset and the draining would cause a legitimate timer event to be discarded.
This entire package was removed on April 20.
Racy use of len to drain channel without Stop/Reset
github.com/indeedsecurity/carbonbeat
github.com/indeedsecurity/carbonbeat v1.0.0 app/carbonbeat.go:
ticker := time.NewTicker(bt.config.Period)
for {
select {
case <-bt.done:
...
return nil
case <-ticker.C:
//Remove all ticks from the ticker channel, do not "catch up"
for len(ticker.C) > 0 {
<-ticker.C
}
}
...
}
Strictly speaking, this code is not wrong, but it is also not useful.
The ticker channel has a buffer of length 1.
In the instant after the receive just completed, the buffer is almost certainly empty.
The check is racing with the next send, but very unlikely to lose that race.
Changing the len
to always return 0 will not have any discernible impact
on this code.
github.com/indeedsecurity/carbonbeat/v2
github.com/indeedsecurity/carbonbeat/v2 v2.0.2 app/carbonbeat.go contains the same code.
Comment From: rsc
For what it's worth, seeing all that buggy code as evidence of people struggling to use the current t.Stop and t.Reset correctly in single-goroutine use cases only makes me that much more convinced we should try the proposed change.
Comment From: rsc
Discussion ongoing, so leaving this for another week, but it seems headed for likely accept. Please speak up if you think this is a bad idea.
Comment From: seebs
I think this is an excellent idea and I'd bet that I've written at least one piece of code which lacks, but needs, the fixup code.
Comment From: rogpeppe
Is this actually the same issue as this old one? https://github.com/golang/go/issues/11513
Comment From: rsc
@rogpeppe, yes it certainly subsumes that one. The example I gave there in https://github.com/golang/go/issues/11513#issuecomment-150332714 would definitely change but after seeing real uses, I think more code will be fixed than would break. It can be an early in cycle change.
Also, in that comment I said it's not the library's job to "throw away values already sent on the channel". The fix I'm suggesting above makes the channel (indistinguishable from) unbuffered instead, so there are no values "already sent".
Comment From: rsc
Based on the discussion above, this sounds like a likely accept.
One possible implementation would be: 1. Find a bit in the channel struct to mark timer channels. 2. Stop inlining chanlen/chancap. 3. Make chanlen/chancap return 0 for timer channels even though the representation has buffer size 1. 4. In Stop/Reset, clear the buffer if there's something in it.
There are other more involved implementations (see for example #8898), but this simple change seems like the shortest path to the semantics we've been discussing.
Comment From: rsc
No change in consensus, so accepting.
Comment From: gopherbot
Change https://go.dev/cl/568341 mentions this issue: time: avoid stale receives after Timer/Ticker Stop/Reset return
Comment From: gopherbot
Change https://go.dev/cl/568375 mentions this issue: cmd/compile: compile cap(ch) as call to runtime.chancap
Comment From: rsc
Looks like this may happen for Go 1.23, finally.
In the discussion above https://github.com/golang/go/issues/37196#issuecomment-638344744 I said the channel would "appear to be unbuffered", but I don't think I realized at the time (or I forgot) that that's literally the entire semantic change: timer channels will now be synchronous aka unbuffered. All the simplification follows from that. (The implementation is more subtle, including a hidden buffer as an optimization, but the semantic effect is simply that timer channels will now behave as if unbuffered.)
For a graceful rollout, this change and the change for #8898 can be disabled by GODEBUG=asynctimerchan=1.
Comment From: kevinburkesegment
Just found this thread while trying to track down a test flake - the test was trying to drain the channel with a <-timer.C
but also provided a default:
flag which caused the old event to fail to be read. I am glad that the behavior here is changing to make it more difficult to do the subtly wrong thing.
Comment From: zigo101
With tip, the finalizer of a timer will not get executed. Is it intended?
package main
import "time"
import "runtime"
func main() {
timer := time.NewTimer(time.Second)
runtime.SetFinalizer(timer, func(c *time.Timer){
println("done")
})
timer.Stop()
runtime.GC()
time.Sleep(time.Second)
}
$ gotv 1.22. run main.go
[Run]: $HOME/.cache/gotv/tag_go1.22.1/bin/go run main.go
done
$ gotv :tip run main.go
[Run]: $HOME/.cache/gotv/bra_master/bin/go run main.go
$
Comment From: zigo101
BTW, since Go 1.23, does the Stop method of a Timer still need to be called to make the Timer collectable as earlier as possible?
Comment From: kscooo
BTW, since Go 1.23, does the Stop method of a Timer still need to be called to make the Timer collectable as earlier as possible?
https://go-review.googlesource.com/c/go/+/512355#:~:text=This%20CL%20fixes,be%20GC%27ed%20immediately.
Comment From: ianlancetaylor
@go101 The finalizer change sounds like a bug to me. Please open a new issue for that. Thanks.
Comment From: zigo101
@ianlancetaylor I'm not sure whether or not it is a bug, because the comment in the PR mentions a Timer object might be not allocated on heap, Anyway, I created the issue.
Comment From: ianlancetaylor
Thanks for creating the issue. Seems to me it is a bug either in the implementation or in the documentation.
Comment From: prashantv
With this change, I'm not sure if it will be possible for fake clocks to be used for deterministic tests and behave similarly to the stdlib timer methods.
Given some production code, say:
func delayedPush(clock clock.Clock) {
c := clock.Timer(time.Second)
<-c
push()
}
And the following test:
func TestDelayedPush(t *testing.T) {
fake := clock.NewFake()
go delayedPush(fake)
fake.Advance(time.Second)
if !pushed {
t.Errorf("expected push")
}
}
This test is flaky: it passes if the background goroutine has created the timer before the fake advances time, but fails otherwise since. This is a really common issue so tests often use runtime.Gosched()
or time.Sleep(...)
to work around these, and while that reduces the likelihood of failure, it can still occur -- it's not fully deterministic.
This problem can be solved with fakes that support blocking until the timer is created, e.g.,BlockUntil
,
func TestDelayedPush(t *testing.T) {
fake := clock.NewFake()
go delayedPush(fake)
fake.BlockUntil(1) // wait for background Timer call
fake.Advance(time.Second)
if !pushed {
t.Errorf("expected push")
}
}
Prior to this change, since the channel was buffered, the Advance
would write to the channel regardless of whether the background delayedPush
goroutine was waiting on the channel or not. With this change, assuming the fake behaves consistently with the stdlib (non-blocking write to an unbuffered channel), then the test is flaky again: it passes if the background goroutine is reading from the channel, otherwise, the write is dropped, and the test fails.
I'm currently working on a fake clock library, primarily to help with writing deterministic, non-flaky tests, and with this change, I'm not sure that'll be possible within the current API surface.
Comment From: ianlancetaylor
The code that uses the timers shouldn't care whether the channel is buffered or not. I don't see why a fake clock library would have to use an unbuffered channel.
Please see #67434 for some recent ideas on how to test timers and the like.
Comment From: prashantv
The fake can use a buffered channel, but it's behavior will not be able to match the stdlib:
* len
and cap
would return different values. Based on Russ's analysis above, I think this is probably OK.
* Stop
will behave differently since it can't unsend values like the stdlib.
I was hoping to build a fake that matches the stdlib behavior closely for high fidelity testing.
In this case, it's not a huge issue since the new stdlib behavior is less error-prone, and continues to work with code written against the previous model.
After reading the time PR more closely, it's likely that the fake can emulate the stdlib more closely by using the expensive approach that the change emulates -- use a separate goroutine to write to the channel, cancelling it when Stop
is called, etc.
Comment From: powerman
go func delayedPush(clock clock.Clock) { c := clock.Timer(time.Second) <-c push() }
To me it looks like test for this code will be racy in any case - this just can't be solved by doing something with timers, because all you can get as result is sync with <-c
line, but not push()
line, while test actually needs sync with push()
line.