Adam Roach
2017-05-09 04:51:24 UTC
Adam Roach has entered the following ballot position for
draft-ietf-core-coap-tcp-tls-08: Discuss
When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)
Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
for more information about IESG DISCUSS and COMMENT positions.
The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-core-coap-tcp-tls/
----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------
- Part of the document is outside the scope of the charter of the WG
which requested its publication
While I understand that this document requires a WebSockets mechanism for
.well-known, and that such a mechanism doesn’t yet exist, it seems pretty
far out of scope for the CORE working group to take on defining this
itself (unless I missed something in its charter, which is entirely
possible: it’s quite long). Specifically, I fear that this venue is
unlikely to bring such a change to the attention of those people best
positioned to comment on whether .well-known is appropriate for
WebSockets.
Even if this is in scope for CORE, it really needs to be its own
document. If some future document comes along at a later point and wants
to make use of its own .well-known path with WebSockets, it would be
really quite strange to require it to reference this document in
describing .well-known for WS.
----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------
General — this is a very bespoke approach to what could have been mostly
solved with a single four-byte “length” header; it is complicated both on
the wire and in implementation; and the format variations among CoAP over
UDP, coap+tls, and coap+ws are going to make gateways much harder to
implement and less efficient (as they will necessarily have to
disassemble messages and rebuild them to change between formats). The
protocol itself mentions gateways in several places, but does not discuss
how they are expected to map among the various flavors of CoAP defined in
this document. Some of the changes seem unnecessary, but it could be that
I’m missing the motivation for them. Ideally, the introduction would work
harder at explaining why CoAP over these transports is as different from
CoAP over UDP as it is, focusing in particular on why the complexity of
having three syntactically incompatible headers is justified by the
benefits provided by such variations.
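For concreteness, here is a rough sketch of the simpler alternative I
have in mind: length-prefix framing of otherwise unmodified
CoAP-over-UDP messages. The four-byte length field and the function
names are illustrative only, not anything from the draft.

```python
import struct

def frame(coap_message: bytes) -> bytes:
    """Prefix an unmodified CoAP-over-UDP message with a four-byte
    big-endian length so it can travel over a byte stream."""
    return struct.pack("!I", len(coap_message)) + coap_message

def deframe(stream: bytes) -> list:
    """Split a received byte stream back into the original messages.

    A real implementation would buffer partial reads; this sketch
    assumes the whole stream is available.
    """
    messages = []
    offset = 0
    while offset + 4 <= len(stream):
        (length,) = struct.unpack_from("!I", stream, offset)
        offset += 4
        messages.append(stream[offset:offset + length])
        offset += length
    return messages
```

A gateway built on this scheme never has to reparse or rebuild a
message to move it between UDP and TCP; it only adds or strips the
length prefix.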
Additionally, it’s not clear from the introduction what the motivation
for using the mechanisms in this document is as compared to the
techniques described in section 10 (and its subsections) of RFC 7252.
With the exception of subscribing to resource state (which could be
added), it seems that such an approach is significantly easier to
implement and more clearly defined than what is in this document; and it
appears to provide the combined benefits of all four transports discussed
in this document. My concern here is that an explosion of transport
options makes it less likely that a client and server can find two in
common: the limit of the probability of two implementations having a
transport in common as the number of transports approaches infinity is
zero. Due to this likely decrease in interoperability, I’d expect to see
some pretty powerful motivation in here for defining a third, fourth,
fifth, and sixth way to carry CoAP when only TCP is available (I count
RFC 7252 http and https as the first and second ways in this
accounting).
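To illustrate the interoperability concern with a deliberately
simplified model: if each implementation independently picked exactly
one of n available transports uniformly at random, the chance that a
client and server could talk at all would be 1/n, which shrinks toward
zero as n grows. Real implementations support more than one transport,
so this is only a toy calculation, but it shows the direction of the
pressure.

```python
from fractions import Fraction

def p_common_transport(n_transports: int) -> Fraction:
    """Toy model: each side independently implements exactly one of
    n transports, chosen uniformly; they interoperate only if they
    happen to pick the same one, with probability 1/n."""
    return Fraction(1, n_transports)

# Two transports: even odds.  Six transports (the count above): 1 in 6.
```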
I’m also a bit puzzled that CoAP already has an inherent mechanism for
blocking messages off into chunks, which this document circumvents for
TCP connections (by allowing Max-Message-Size to be increased), and then
is forced to offer remedies for the resultant head-of-line blocking
issues. Had this feature not been introduced, messages with a two-byte
token add six bytes of overhead for every 1024 bytes of content — less
than 0.6% size inflation. It seems like a lot of complicated machinery —
which has a built-in foot-gun that you have to warn people about misusing
— for a very tiny gain. I know it’s relatively late in the process, but
if these trade-offs haven't had a lot of discussion yet, it’s probably
worth at least giving them some additional thought.
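The overhead arithmetic above is easy to check. The six-byte figure is
my assumption for the per-message cost of a header plus a two-byte
token on each 1024-byte block.

```python
BLOCK_SIZE = 1024       # largest regular block size in block-wise transfer
PER_BLOCK_OVERHEAD = 6  # assumed per-message cost: header plus two-byte token

inflation = PER_BLOCK_OVERHEAD / BLOCK_SIZE  # about 0.59%, under 0.6%
```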
I’ll note that the entire BERT mechanism seems to fall into the same trap
of adding extra complexity for virtually nonexistent savings. CoAP
headers are, by design, tiny. It seems like a serious over-optimization
to try to eliminate them in this fashion. In particular, you’re making
the actual implementation code larger to save a trivial number of bits on
the wire; I was under the impression that many of the implementation
environments CoAP is intended for had some serious on-chip restrictions
that would point away from this kind of additional complexity.
Specific comments follow.
Section 3.3, paragraph 3 says that an initiator may send messages prior
to receiving the remote side’s CSM, even though the message may be larger
than would be allowed by that CSM. What should the recipient of an
oversized message do in this case? In fact, I don’t see in here what a
recipient of a message larger than it allowed for in its CSM is supposed
to do in response at *any* stage of the connection. Is it an error? If
so, how do you indicate it? Or is the Max-Message-Size option just a
suggestion for the other side? This definitely needs clarification.
(Aside — it seems odd and somewhat backwards that TCP connections are
provided an affordance for fine-grained control over message sizes, while
UDP communications are not.)
Section 4.4 has a prohibition against using WebSockets keepalives in
favor of using CoAP ping/pong. Section 3.4 has no similar prohibition
against TCP keepalives, while the rationale would seem to be identical.
Is this asymmetry intentional? (I’ll also note that the presence of
keepalive mechanisms in both TCP and WebSockets would seem to make the
addition of new CoAP primitives for the same purpose unnecessary, but I
suspect this has already been debated).
Section 5 and its subsections define a new set of message types,
presumably for use only on connection-oriented protocols, although this
is only implied, and never stated. For example, some implementors may see
CSM, Ping, and Pong as potentially useful in UDP; and, finding no
prohibition in this document against using them, decide to give it a go.
Is that intended? If not, I strongly suggest an explicit prohibition
against using these in UDP contexts.
Section 5.3.2 says that implementations supporting block-wise transfers
SHOULD indicate the Block-wise Transfer Option. I can't figure out why
this is anything other than a "MUST". It seems odd that this document
would define a way to communicate this, and then choose to leave the
communicated options as “YES” and “YOUR GUESS IS AS GOOD AS MINE” rather
than the simpler and more useful “YES” and “NO”.
I find the described operation of the Custody Option in the operation of
Ping and Pong to be somewhat problematic: it allows the Pong sender to
unilaterally decide to set the Custody Option, and consequently
quarantine the Pong for an arbitrary amount of time while it processes
other operations. This seems impossible to distinguish from a
failure-due-to-timeout from the perspective of the Ping sender. Why not
limit this behavior only to Ping messages that include the Custody
Option?
I find the unmotivated definition of the default port for “coaps+tcp” to
443 — a port that is already assigned to https — to be surprising, to put
it mildly. This definitely needs motivating text, and I suspect it's
actually wrong.
I am similarly perplexed by the hard-coded “must do ALPN *unless* the
designated port takes the magical value 5684” behavior. I don’t think
I’ve ever seen a protocol that has such variation based on a hard-coded
port number, and it seems unlikely to be deployed correctly (I’m imagining
the frustration of: “I changed both the server and the client
configuration from the default port of 5684 to 49152, and it just stopped
working. Like, literally the *only* way it works is on port 5684. I've
checked firewall settings everywhere and don't see any special handling
for that port -- I just can't figure this out, and it's driving me
crazy.”). Given the nearly universal availability of ALPN in pretty much
all modern TLS libraries, it seems much cleaner to just require ALPN
support and call it done. Or *don’t* require ALPN at all and call it
done. But *changing* protocol behavior based on magic port numbers seems
like it’s going to cause a lot of operational heartburn.
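To spell out the special case as I read it (the names here are mine,
not the draft's): the ALPN requirement flips on a single hard-coded
port value, which is exactly the kind of branch that gets
misconfigured in deployment.

```python
MAGIC_COAPS_TCP_PORT = 5684  # the one port exempted from the ALPN requirement

def alpn_required(port: int) -> bool:
    """As I read the draft: ALPN is mandatory on every port
    except the magic value 5684."""
    return port != MAGIC_COAPS_TCP_PORT

# Moving both endpoints from 5684 to, say, 49152 silently changes
# whether the TLS handshake must carry ALPN, even though the operator
# changed nothing else.
```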
The final paragraph of section 8.1 is very confusing, making it somewhat
unclear which of the three modes must be implemented on a CoAP client,
and which must be implemented on a CoAP server. Read naïvely, this sounds
like clients are required to do only one (but one of their choosing) of
these three, while servers are required to also do only one (again, of
their choosing). It seems that the chance of finding devices that could
interoperate under such circumstances is going to be relatively low: to
work together, you would have to find a client and a server that happened
to make the same implementation choice among these three. What I’m used
to in these kinds of cases is: (a) server must implement all, client can
choose to implement only one (or more), (b) client must implement all,
server can choose to implement only one (or more), or (c) client and
server must implement a specifically named lowest common denominator, and
can negotiate up from there. Pretty much anything else (aside from
strange “everyone must implement two of three” schemes) will end up with
interop issues.
Although the document clearly expects the use of gateways and proxies
between these connection-oriented usages of CoAP and UDP-based CoAP,
Appendix A seems to omit discussion or consideration of how this
gatewaying can be performed. The following list of problems is
illustrative of this larger issue, but likely not exhaustive. (I'll note
that all of these issues evaporate if you move to a simpler scheme that
merely frames otherwise unmodified UDP CoAP messages.)
Section A.1 does not indicate what gateways are supposed to do with
out-of-order notifications. The TCP side requires these to be delivered
in order; does this mean that a gateway observing a gap in sequence
numbers needs to quarantine the newly received message so that it can
deliver the missing one first? Or does it deliver the newly received
message and then discard the “stale” one when it arrives? I don’t think
that leaving this up to implementations is particularly advisable.
Section A.3 is a bit more worrisome. I understand the desired
optimization here, but where you reduce traffic in one direction, you run
the risk of exploding it in the other. For example, consider a coap+tcp
client connecting to a gateway that communicates with a CoAP-over-UDP
server. When that client wants to check the health of its observations,
it can send a Ping and receive a Pong that confirms that they are all
alive and well. In order to be able to send a Pong that *means* “all your
observations are alive and well,” the gateway has to verify that all the
observations are alive and well. A simple implementation of a gateway
will likely check on each observed resource individually when it gets a
Ping, and then send a Pong after it hears back about all of them. So, as
a client, I can set up, let’s say, two dozen observations through this
gateway. Then, with each Ping I send, the gateway sends two dozen checks
towards the server. This kind of message amplification attack is an
awesome way to DoS both the gateway and the server. I believe the
document needs a treatment of how UDP/TCP gateways handle notification
health checks, along with techniques for mitigating this specific
attack.
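The amplification factor in this scenario is simple to quantify under
the naive-gateway assumption above: each single-message Ping from the
client fans out into one UDP health check per observation. The numbers
below are just the illustration from the text.

```python
observations = 24        # two dozen observations, as in the example above
pings_from_client = 10   # arbitrary count of Pings sent by the client

udp_checks_sent = observations * pings_from_client  # messages toward the server
amplification = udp_checks_sent / pings_from_client  # 24x per Ping
```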
Section A.4 talks about the rather different ways of dealing with
unsubscribing from a resource. Presumably, gateways that get a reset to a
notification are expected to synthesize a new GET to deregister on behalf
of the client? Or is it okay if they just pass along the reset, and
expect the server to know that it means the same thing as a
deregistration? Without explicit guidance here, I expect server and
gateway implementors to make different choices and end up with a lack of
interop.
** There is 1 instance of too long lines in the document, the longest one
being 3 characters in excess of 72.