Discussion:
[core] Adam Roach's Discuss on draft-ietf-core-coap-tcp-tls-08: (with DISCUSS and COMMENT)
Adam Roach
2017-05-09 04:51:24 UTC
Adam Roach has entered the following ballot position for
draft-ietf-core-coap-tcp-tls-08: Discuss

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)


Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
for more information about IESG DISCUSS and COMMENT positions.


The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-core-coap-tcp-tls/



----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

- Part of the document is outside the scope of the charter of the WG
which requested its publication

While I understand that this document requires a WebSockets mechanism for
.well-known, and that such a mechanism doesn’t yet exist, it seems pretty
far out of scope for the CORE working group to take on defining this
itself (unless I missed something in its charter, which is entirely
possible: it’s quite long). Specifically, I fear that this venue is
unlikely to bring such a change to the attention of those people best
positioned to comment on whether .well-known is appropriate for
WebSockets.

Even if this is in scope for CORE, it really needs to be its own
document. If some future document comes along at a later point and wants
to make use of its own .well-known path with WebSockets, it would be
really quite strange to require it to reference this document in
describing .well-known for WS.


----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

General — this is a very bespoke approach to what could have been mostly
solved with a single four-byte “length” header; it is complicated on the
wire, and in implementation; and the format variations among CoAP over
UDP, coap+tls, and coap+ws are going to make gateways much harder to
implement and less efficient (as they will necessarily have to
disassemble messages and rebuild them to change between formats). The
protocol itself mentions gateways in several places, but does not discuss
how they are expected to map among the various flavors of CoAP defined in
this document. Some of the changes seem unnecessary, but it could be that
I’m missing the motivation for them. Ideally, the introduction would work
harder at explaining why CoAP over these transports is as different from
CoAP over UDP as it is, focusing in particular on why the complexity of
having three syntactically incompatible headers is justified by the
benefits provided by such variations.
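
For concreteness, a minimal sketch of the single-length-header
alternative I have in mind (Python; illustrative only, not taken from
any draft):

    import struct

    def send_frame(sock, coap_msg: bytes) -> None:
        # Prefix the unmodified UDP-format CoAP message with a
        # four-byte big-endian length.
        sock.sendall(struct.pack("!I", len(coap_msg)) + coap_msg)

    def recv_exact(sock, n: int) -> bytes:
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed mid-frame")
            buf += chunk
        return buf

    def recv_frame(sock) -> bytes:
        (length,) = struct.unpack("!I", recv_exact(sock, 4))
        return recv_exact(sock, length)
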

Additionally, it’s not clear from the introduction what the motivation
for using the mechanisms in this document is as compared to the
techniques described in section 10 (and its subsections) of RFC 7252.
With the exception of subscribing to resource state (which could be
added), it seems that such an approach is significantly easier to
implement and more clearly defined than what is in this document; and it
appears to provide the combined benefits of all four transports discussed
in this document. My concern here is that an explosion of transport
options makes it less likely that a client and server can find two in
common: the limit of the probability of two implementations having a
transport in common as the number of transports approaches infinity is
zero. Due to this likely decrease in interoperability, I’d expect to see
some pretty powerful motivation in here for defining a third, fourth,
fifth, and sixth way to carry CoAP when only TCP is available (I count
RFC 7252 http and https as the first and second ways in this
accounting).
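
To put a number on that limit argument, a toy model (assuming each
side independently implements exactly k of the n available transports,
chosen uniformly; the numbers are assumptions, not deployment data):

    from math import comb

    def p_common(n: int, k: int) -> float:
        # Chance two independent k-subsets of n transports intersect.
        return 1 - comb(n - k, k) / comb(n, k)

    print(p_common(6, 1))  # one of six each: ~0.17
    print(p_common(6, 2))  # two of six each: ~0.60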

I’m also a bit puzzled that CoAP already has an inherent mechanism for
blocking messages off into chunks, which this document circumvents for
TCP connections (by allowing Max-Message-Size to be increased), and then
is forced to offer remedies for the resultant head-of-line blocking
issues. If you didn’t introduce this feature, messages with a two-byte
token add six bytes of overhead for every 1024 bytes of content — less
than 0.6% size inflation. It seems like a lot of complicated machinery —
which has a built-in foot-gun that you have to warn people about misusing
— for a very tiny gain. I know it’s relatively late in the process, but
if these trade-offs haven't had a lot of discussion yet, it’s probably
worth at least giving them some additional thought.
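
Checking that arithmetic (four-byte RFC 7252 fixed header plus a
two-byte token, per full 1024-byte block):

    overhead = 4 + 2         # fixed header + two-byte token
    block = 1024             # largest RFC 7959 block size
    print(overhead / block)  # ~0.0059, i.e. under 0.6% inflation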

I’ll note that the entire BERT mechanism seems to fall into the same trap
of adding extra complexity for virtually nonexistent savings. CoAP
headers are, by design, tiny. It seems like a serious over-optimization
to try to eliminate them in this fashion. In particular, you’re making
the actual implementation code larger to save a trivial number of bits on
the wire; I was under the impression that many of the implementation
environments CoAP is intended for had some serious on-chip restrictions
that would point away from this kind of additional complexity.

Specific comments follow.

Section 3.3, paragraph 3 says that an initiator may send messages prior
to receiving the remote side’s CSM, even though the message may be larger
than would be allowed by that CSM. What should the recipient of an
oversized message do in this case? In fact, I don’t see in here what a
recipient of a message larger than it allowed for in its CSM is supposed
to do in response at *any* stage of the connection. Is it an error? If
so, how do you indicate it? Or is the Max-Message-Size option just a
suggestion for the other side? This definitely needs clarification.
(Aside — it seems odd and somewhat backwards that TCP connections are
provided an affordance for fine-grained control over message sizes, while
UDP communications are not.)

Section 4.4 has a prohibition against using WebSockets keepalives in
favor of using CoAP ping/pong. Section 3.4 has no similar prohibition
against TCP keepalives, while the rationale would seem to be identical.
Is this asymmetry intentional? (I’ll also note that the presence of
keepalive mechanisms in both TCP and WebSockets would seem to make the
addition of new CoAP primitives for the same purpose unnecessary, but I
suspect this has already been debated).

Section 5 and its subsections define a new set of message types,
presumably for use only on connection-oriented protocols, although this
is only implied, and never stated. For example, some implementors may see
CSM, Ping, and Pong as potentially useful in UDP; and, finding no
prohibition in this document against using them, decide to give it a go.
Is that intended? If not, I strongly suggest an explicit prohibition
against using these in UDP contexts.

Section 5.3.2 says that implementations supporting block-wise transfers
SHOULD indicate the Block-wise Transfer Option. I can't figure out why
this is anything other than a "MUST". It seems odd that this document
would define a way to communicate this, and then choose to leave the
communicated options as “YES” and “YOUR GUESS IS AS GOOD AS MINE” rather
than the simpler and more useful “YES” and “NO”.

I find the described operation of the Custody Option in the operation of
Ping and Pong to be somewhat problematic: it allows the Pong sender to
unilaterally decide to set the Custody Option, and consequently
quarantine the Pong for an arbitrary amount of time while it processes
other operations. This seems impossible to distinguish from a
failure-due-to-timeout from the perspective of the Ping sender. Why not
limit this behavior only to Ping messages that include the Custody
Option?

I find the unmotivated definition of the default port for “coaps+tcp” to
443 — a port that is already assigned to https — to be surprising, to put
it mildly. This definitely needs motivating text, and I suspect it's
actually wrong.

I am similarly perplexed by the hard-coded “must do ALPN *unless* the
designated port takes the magical value 5684” behavior. I don’t think
I’ve ever seen a protocol that has such variation based on a hard-coded
port number, and it seems unlikely to be deployed correctly (I'm imagining
the frustration of: “I changed both the server and the client
configuration from the default port of 5684 to 49152, and it just stopped
working. Like, literally the *only* way it works is on port 5684. I've
checked firewall settings everywhere and don't see any special handling
for that port -- I just can't figure this out, and it's driving me
crazy.”). Given the nearly universal availability of ALPN in pretty much
all modern TLS libraries, it seems much cleaner to just require ALPN
support and call it done. Or *don’t* require ALPN at all and call it
done. But *changing* protocol behavior based on magic port numbers seems
like it’s going to cause a lot of operational heartburn.
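
To spell out the behavior in question, this is roughly what a client
would have to implement (Python ssl; port 5684 and the "coap" ALPN ID
are from the draft, everything else is illustrative):

    import socket
    import ssl

    def connect_coaps_tcp(host: str, port: int) -> ssl.SSLSocket:
        ctx = ssl.create_default_context()
        if port != 5684:
            # ALPN is required on every port *except* the default.
            ctx.set_alpn_protocols(["coap"])
        tcp = socket.create_connection((host, port))
        return ctx.wrap_socket(tcp, server_hostname=host)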

The final paragraph of section 8.1 is very confusing, making it somewhat
unclear which of the three modes must be implemented on a CoAP client,
and which must be implemented on a CoAP server. Read naïvely, this sounds
like clients are required to do only one (but one of their choosing) of
these three, while servers are required to also do only one (again, of
their choosing). It seems that the chance of finding devices that could
interoperate under such circumstances is going to be relatively low: to
work together, you would have to find a client and a server that happened
to make the same implementation choice among these three. What I’m used
to in these kinds of cases is: (a) server must implement all, client can
choose to implement only one (or more), (b) client must implement all,
server can choose to implement only one (or more), or (c) client and
server must implement a specifically named lowest-common denominator, and
can negotiate up from there. Pretty much anything else (aside from
strange “everyone must implement two of three” schemes) will end up with
interop issues.

Although the document clearly expects the use of gateways and proxies
between these connection-oriented usages of CoAP and UDP-based CoAP,
Appendix A seems to omit discussion or consideration of how this
gatewaying can be performed. The following list of problems is
illustrative of this larger issue, but likely not exhaustive. (I'll note
that all of these issues evaporate if you move to a simpler scheme that
merely frames otherwise unmodified UDP CoAP messages.)

Section A.1 does not indicate what gateways are supposed to do with
out-of-order notifications. The TCP side requires these to be delivered
in-order; so, does this mean that gateways observing a gap in sequence
numbers need to quarantine the newly received message so that it can
deliver the missing one first? Or does it deliver the newly-received
message and then discard the “stale” one when it arrives? I don’t think
that leaving this up to implementations is particularly advisable.

Section A.3 is a bit more worrisome. I understand the desired
optimization here, but where you reduce traffic in one direction, you run
the risk of exploding it in the other. For example, consider a coap+tcp
client connecting to a gateway that communicates with a CoAP-over-UDP
server. When that client wants to check the health of its observations,
it can send a Ping and receive a Pong that confirms that they are all
alive and well. In order to be able to send a Pong that *means* “all your
observations are alive and well,” the gateway has to verify that all the
observations are alive and well. A simple implementation of a gateway
will likely check on each observed resource individually when it gets a
Ping, and then send a Pong after it hears back about all of them. So, as
a client, I can set up, let’s say, two dozen observations through this
gateway. Then, with each Ping I send, the gateway sends two dozen checks
towards the server. This kind of message amplification attack is an
awesome way to DoS both the gateway and the server. I believe the
document needs a treatment of how UDP/TCP gateways handle notification
health checks, along with techniques for mitigating this specific
attack.

Section A.4 talks about the rather different ways of dealing with
unsubscribing from a resource. Presumably, gateways that get a reset to a
notification are expected to synthesize a new GET to deregister on behalf
of the client? Or is it okay if they just pass along the reset, and
expect the server to know that it means the same thing as a
deregistration? Without explicit guidance here, I expect server and
gateway implementors to make different choices and end up with a lack of
interop.
** There is 1 instance of too long lines in the document, the longest one
being 3 characters in excess of 72.
Hannes Tschofenig
2017-05-09 08:09:51 UTC
Hi Adam,

thanks for your review.
Post by Adam Roach
----------------------------------------------------------------------
----------------------------------------------------------------------
- Part of the document is outside the scope of the charter of the WG
which requested its publication
While I understand that this document requires a WebSockets mechanism for
.well-known, and that such a mechanism doesn’t yet exist, it seems pretty
far out of scope for the CORE working group to take on defining this
itself (unless I missed something in its charter, which is entirely
possible: it’s quite long). Specifically, I fear that this venue is
unlikely to bring such a change to the attention of those people best
positioned to comment on whether .well-known is appropriate for
WebSockets.
Even if this is in scope for CORE, it really needs to be its own
document. If some future document comes along at a later point and wants
to make use of its own .well-known path with WebSockets, it would be
really quite strange to require it to reference this document in
describing .well-known for WS.
The authors of the document have different views about the inclusion
of WebSockets support in the document. I leave it to the responsible
AD to decide what the best document structure is and what is indeed
covered as part of the CORE working group charter.
Post by Adam Roach
----------------------------------------------------------------------
----------------------------------------------------------------------
You have a couple of comments, namely

* Variable length format

The group decided to have a variable-length format. I argued for a
fixed-size length format and lost the argument.

* Gateways and their complexity

We are using gateway functionality today in our deployments, but these
gateways are not just simple protocol translations as described in RFC
7252 or RFC 8075. Instead, the two protocols on each side of the
gateway have different semantics and functionality. As such, the
considerations in those two RFCs don't apply to us, and we are not
seeing any of that complexity.

* Too many transport options

We care only about CoAP over TLS. We are not going to use the WebSockets
part of the document. In practice, for many companies, there will not
be a problem with too many transports, since they will only use
specific ones in their deployment.

* Block-wise transport with CoAP over TCP

Maybe this needs to be better explained, but CoAP is tailored to small
data transmissions only. Unfortunately, there are some larger payloads
to be shuffled around as well, particularly firmware updates.

When RFC 7959 is used with TCP, we found that the performance is
quite bad, since the block-wise transfer spec limits the size of the
chunks to a really small size (1024 bytes). The addition in this spec
is to increase the size of the chunks.
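
As a rough back-of-the-envelope (the image size and RTT below are
assumptions):

    image = 1_000_000               # bytes, assumed firmware size
    block = 1024                    # largest RFC 7959 block size
    rtt = 0.2                       # seconds, assumed round-trip time
    exchanges = -(-image // block)  # ~977 lock-step exchanges
    print(exchanges * rtt)          # ~195 s spent waiting on RTTs alone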

I will see whether the text can be improved to get this message across.
Post by Adam Roach
General — this is a very bespoke approach to what could have been mostly
solved with a single four-byte “length” header; it is complicated on the
wire, and in implementation; and the format variations among CoAP over
UDP, coap+tls, and coap+ws are going to make gateways much harder to
implement and less efficient (as they will necessarily have to
disassemble messages and rebuild them to change between formats). The
protocol itself mentions gateways in several places, but does not discuss
how they are expected to map among the various flavors of CoAP defined in
this document. Some of the changes seem unnecessary, but it could be that
I’m missing the motivation for them. Ideally, the introduction would work
harder at explaining why CoAP over these transports is as different from
CoAP over UDP as it is, focusing in particular on why the complexity of
having three syntactically incompatible headers is justified by the
benefits provided by such variations.
Additionally, it’s not clear from the introduction what the motivation
for using the mechanisms in this document is as compared to the
techniques described in section 10 (and its subsections) of RFC 7252.
With the exception of subscribing to resource state (which could be
added), it seems that such an approach is significantly easier to
implement and more clearly defined than what is in this document; and it
appears to provide the combined benefits of all four transports discussed
in this document. My concern here is that an explosion of transport
options makes it less likely that a client and server can find two in
common: the limit of the probability of two implementations having a
transport in common as the number of transports approaches infinity is
zero. Due to this likely decrease in interoperability, I’d expect to see
some pretty powerful motivation in here for defining a third, fourth,
fifth, and sixth way to carry CoAP when only TCP is available (I count
RFC 7252 http and https as the first and second ways in this
accounting).
I’m also a bit puzzled that CoAP already has an inherent mechanism for
blocking messages off into chunks, which this document circumvents for
TCP connections (by allowing Max-Message-Size to be increased), and then
is forced to offer remedies for the resultant head-of-line blocking
issues. If you didn’t introduce this feature, messages with a two-byte
token add six bytes of overhead for every 1024 bytes of content — less
than 0.6% size inflation. It seems like a lot of complicated machinery —
which has a built-in foot-gun that you have to warn people about misusing
— for a very tiny gain. I know it’s relatively late in the process, but
if these trade-offs haven't had a lot of discussion yet, it’s probably
worth at least giving them some additional thought.
I’ll note that the entire BERT mechanism seems to fall into the same trap
of adding extra complexity for virtually nonexistent savings. CoAP
headers are, by design, tiny. It seems like a serious over-optimization
to try to eliminate them in this fashion. In particular, you’re making
the actual implementation code larger to save a trivial number of bits on
the wire; I was under the impression that many of the implementation
environments CoAP is intended for had some serious on-chip restrictions
that would point away from this kind of additional complexity.
Specific comments follow.
Section 3.3, paragraph 3 says that an initiator may send messages prior
to receiving the remote side’s CSM, even though the message may be larger
than would be allowed by that CSM. What should the recipient of an
oversized message do in this case? In fact, I don’t see in here what a
recipient of a message larger than it allowed for in its CSM is supposed
to do in response at *any* stage of the connection. Is it an error? If
so, how do you indicate it? Or is the Max-Message-Size option just a
suggestion for the other side? This definitely needs clarification.
(Aside — it seems odd and somewhat backwards that TCP connections are
provided an affordance for fine-grained control over message sizes, while
UDP communications are not.)
I personally would set a minimum requirement for the size of message
the remote side needs to support. Thereby, the initiator can be sure
that messages up to a certain size are supported. If it wants to send
larger messages, then it has to wait until the remote side provides
its CSM.

In our environment this would not be a problem, since the TCP server
is actually not on the IoT device but rather on the cloud-based (or
on-premises) server. The TCP client is running on the IoT device.
Post by Adam Roach
Section 4.4 has a prohibition against using WebSockets keepalives in
favor of using CoAP ping/pong. Section 3.4 has no similar prohibition
against TCP keepalives, while the rationale would seem to be identical.
Is this asymmetry intentional? (I’ll also note that the presence of
keepalive mechanisms in both TCP and WebSockets would seem to make the
addition of new CoAP primitives for the same purpose unnecessary, but I
suspect this has already been debated).
The issue was that TCP keepalives sometimes get blocked or modified
by firewalls, whereas CoAP ping/pong on top of TLS won't be.
Post by Adam Roach
Section 5 and its subsections define a new set of message types,
presumably for use only on connection-oriented protocols, although this
is only implied, and never stated. For example, some implementors may see
CSM, Ping, and Pong as potentially useful in UDP; and, finding no
prohibition in this document against using them, decide to give it a go.
Is that intended? If not, I strongly suggest an explicit prohibition
against using these in UDP contexts.
I believe a similar issue came up recently (raised by Jim) when he was
asking whether this mechanism is also applicable to a transport over SMS.

If there is functionality in the document that is useful for other
transports in the future then that's great. I wouldn't rule out such use
just because we cannot imagine it today.
Post by Adam Roach
Section 5.3.2 says that implementations supporting block-wise transfers
SHOULD indicate the Block-wise Transfer Option. I can't figure out why
this is anything other than a "MUST". It seems odd that this document
would define a way to communicate this, and then choose to leave the
communicated options as “YES” and “YOUR GUESS IS AS GOOD AS MINE” rather
than the simpler and more useful “YES” and “NO”.
Sounds reasonable to me.
Post by Adam Roach
I find the described operation of the Custody Option in the operation of
Ping and Pong to be somewhat problematic: it allows the Pong sender to
unilaterally decide to set the Custody Option, and consequently
quarantine the Pong for an arbitrary amount of time while it processes
other operations. This seems impossible to distinguish from a
failure-due-to-timeout from the perspective of the Ping sender. Why not
limit this behavior only to Ping messages that include the Custody
Option?
I push this to Carsten, who I believe wrote this text.
Post by Adam Roach
I find the unmotivated definition of the default port for “coaps+tcp” to
443 — a port that is already assigned to https — to be surprising, to put
it mildly. This definitely needs motivating text, and I suspect it's
actually wrong.
If we don't do this then we do not get through firewalls.
Post by Adam Roach
I am similarly perplexed by the hard-coded “must do ALPN *unless* the
designated port takes the magical value 5684” behavior. I don’t think
I’ve ever seen a protocol that has such variation based on a hard-coded
port number, and it seems unlikely to be deployed correctly (I'm imagining
the frustration of: “I changed both the server and the client
configuration from the default port of 5684 to 49152, and it just stopped
working. Like, literally the *only* way it works is on port 5684. I've
checked firewall settings everywhere and don't see any special handling
for that port -- I just can't figure this out, and it's driving me
crazy.”). Given the nearly universal availability of ALPN in pretty much
all modern TLS libraries, it seems much cleaner to just require ALPN
support and call it done. Or *don’t* require ALPN at all and call it
done. But *changing* protocol behavior based on magic port numbers seems
like it’s going to cause a lot of operational heartburn.
It is fine for me to require ALPN always.
Post by Adam Roach
The final paragraph of section 8.1 is very confusing, making it somewhat
unclear which of the three modes must be implemented on a CoAP client,
and which must be implemented on a CoAP server. Read naïvely, this sounds
like clients are required to do only one (but one of their choosing) of
these three, while servers are required to also do only one (again, of
their choosing). It seems that the chance of finding devices that could
interoperate under such circumstances is going to be relatively low: to
work together, you would have to find a client and a server that happened
to make the same implementation choice among these three. What I’m used
to in these kinds of cases is: (a) server must implement all, client can
choose to implement only one (or more), (b) client must implement all,
server can choose to implement only one (or more), or (c) client and
server must implement a specifically named lowest-common denominator, and
can negotiate up from there. Pretty much anything else (aside from
strange “everyone must implement two of three” schemes) will end up with
interop issues.
This follows from what is said in CoAP. Even the HTTP spec does not
go into such a level of detail.

In terms of interoperability, clearly IoT devices that implement
pre-shared secrets are not going to talk to devices that only implement
certificates. This is, however, not a real interoperability issue since
the use of CoAP will most likely be part of a device management
framework like LwM2M or stuff the OIC is working on. Companies deploying
IoT devices then need to figure out what they want to accomplish and
what security threats they care about.
Post by Adam Roach
Although the document clearly expects the use of gateways and proxies
between these connection-oriented usages of CoAP and UDP-based CoAP,
Appendix A seems to omit discussion or consideration of how this
gatewaying can be performed. The following list of problems is
illustrative of this larger issue, but likely not exhaustive. (I'll note
that all of these issues evaporate if you move to a simpler scheme that
merely frames otherwise unmodified UDP CoAP messages.)
As mentioned before, I personally don't see any issue with this at
all, since we are not shuffling CoAP over UDP on one side to CoAP over
TCP on the other side. In fact, I don't know anyone doing that.

~snip~

Ciao
Hannes
Alexey Melnikov
2017-05-09 10:21:47 UTC
Hi Adam,
Post by Adam Roach
Adam Roach has entered the following ballot position for
draft-ietf-core-coap-tcp-tls-08: Discuss
[snip]
Post by Adam Roach
----------------------------------------------------------------------
----------------------------------------------------------------------
- Part of the document is outside the scope of the charter of the WG
which requested its publication
While I understand that this document requires a WebSockets mechanism for
.well-known, and that such a mechanism doesn’t yet exist, it seems pretty
far out of scope for the CORE working group to take on defining this
itself (unless I missed something in its charter, which is entirely
possible: it’s quite long). Specifically, I fear that this venue is
unlikely to bring such a change to the attention of those people best
positioned to comment on whether .well-known is appropriate for
WebSockets.
Mark Nottingham's review pointed out that .well-known is not registered
for ws:/wss:, so the obvious fix was to define it.
The WG working on WebSockets is closed. Carsten emailed the HYBI
mailing list (the WebSocket WG mailing list) about this change:

https://www.ietf.org/mail-archive/web/hybi/current/msg10768.html

There was no objection.

I was one of the two editors of the WebSocket RFC, so I thought just
defining .well-known for WebSockets was a sensible thing to do.
Post by Adam Roach
Even if this is in scope for CORE, it really needs to be its own
document. If some future document comes along at a later point and wants
to make use of its own .well-known path with WebSockets, it would be
really quite strange to require it to reference this document in
describing .well-known for WS.
I didn't push for a separate document just for that (it would be one
page), but if the IESG really wants a separate document, we will do
that.
Carsten Bormann
2017-05-10 18:31:13 UTC
Hi Adam,

thank you for your extensive review.

Alexey has addressed your procedural DISCUSS, and I don’t have anything to add there.

I will try to make one quick round through the COMMENTs.
Post by Adam Roach
----------------------------------------------------------------------
----------------------------------------------------------------------
General — this is a very bespoke approach to what could have been mostly
solved with a single four-byte “length” header; it is complicated on the
wire, and in implementation; and the format variations among CoAP over
UDP, coap+tls, and coap+ws are going to make gateways much harder to
implement and less efficient (as they will necessarily have to
disassemble messages and rebuild them to change between formats). The
protocol itself mentions gateways in several places, but does not discuss
how they are expected to map among the various flavors of CoAP defined in
this document. Some of the changes seem unnecessary, but it could be that
I’m missing the motivation for them. Ideally, the introduction would work
harder at explaining why CoAP over these transports is as different from
CoAP over UDP as it is, focusing in particular on why the complexity of
having three syntactically incompatible headers is justified by the
benefits provided by such variations.
I think I addressed this one in the comments to Mirja.

I’m still amazed how the arrangement of the first few bytes of the header can cause so much interest.
Post by Adam Roach
Additionally, it’s not clear from the introduction what the motivation
for using the mechanisms in this document is as compared to the
techniques described in section 10 (and its subsections) of RFC 7252.
I read this as “why don’t you use HTTP when you need TCP”?
CoAP over TCP is more complex than CoAP over UDP, but still much less complex than either one of the HTTPs.
Post by Adam Roach
With the exception of subscribing to resource state (which could be
added),
A very big exception — the observe option is fundamental to many interactions with Things, and we currently don’t have a way to map this on HTTP.
Post by Adam Roach
it seems that such an approach is significantly easier to
implement and more clearly defined than what is in this document; and it
appears to provide the combined benefits of all four transports discussed
in this document. My concern here is that an explosion of transport
options makes it less likely that a client and server can find two in
common: the limit of the probability of two implementations having a
transport in common as the number of transports approaches infinity is
zero. Due to this likely decrease in interoperability, I’d expect to see
some pretty powerful motivation in here for defining a third, fourth,
fifth, and sixth way to carry CoAP when only TCP is available (I count
RFC 7252 http and https as the first and second ways in this
accounting).
These motivations are in the draft.
Post by Adam Roach
I’m also a bit puzzled that CoAP already has an inherent mechanism for
blocking messages off into chunks, which this document circumvents for
TCP connections (by allowing Max-Message-Size to be increased), and then
is forced to offer remedies for the resultant head-of-line blocking
issues. If you didn’t introduce this feature, messages with a two-byte
token add six bytes of overhead for every 1024 bytes of content — less
than 0.6% size inflation. It seems like a lot of complicated machinery —
which has a built-in foot-gun that you have to warn people about misusing
— for a very tiny gain. I know it’s relatively late in the process, but
if these trade-offs haven't had a lot of discussion yet, it’s probably
worth at least giving them some additional thought.
There are indeed different ways of handling large bodies (which occur as payloads only exceptionally in our use cases).
The current set of features was defined after OCF complained that the lock-step nature of the block protocol slowed down their firmware transfers too much; this was easy to fix by making the message size more flexible.
Post by Adam Roach
I’ll note that the entire BERT mechanism seems to fall into the same trap
of adding extra complexity for virtually nonexistent savings. CoAP
headers are, by design, tiny. It seems like a serious over-optimization
to try to eliminate them in this fashion. In particular, you’re making
the actual implementation code larger to save a trivial number of bits on
the wire; I was under the impression that many of the implementation
environments CoAP is intended for had some serious on-chip restrictions
that would point away from this kind of additional complexity.
There is a spectrum of devices out there, and some may benefit from this — performance issues for larger messages tend to come up with larger nodes.
Post by Adam Roach
Specific comments follow.
Section 3.3, paragraph 3 says that an initiator may send messages prior
to receiving the remote side’s CSM, even though the message may be larger
than would be allowed by that CSM. What should the recipient of an
oversized message do in this case? In fact, I don’t see in here what a
recipient of a message larger than it allowed for in its CSM is supposed
to do in response at *any* stage of the connection. Is it an error? If
so, how do you indicate it? Or is the Max-Message-Size option just a
suggestion for the other side? This definitely needs clarification.
Given the introduction of Max-Message-Size, there was some speculation that it would be good to allow going below the default CoAP value of 1152. In UDP CoAP, preferences of this kind are indicated in the “control usage” of the Block options, and there is an assumption that violating them will lead to suboptimal performance, not to malfunction. But we never really thought that the possibility of reducing Max-Message-Size below 1152 would motivate making the state machine more complex; maybe we should state the obvious and add the warning that indicating a smaller Max-Message-Size is no protection against receiving a message that was sent off before that new value was known to the peer.
Post by Adam Roach
(Aside — it seems odd and somewhat backwards that TCP connections are
provided an affordance for fine-grained control over message sizes, while
UDP communications are not.)
Connections provide a context for performing this control; for transports that don’t have connections, it is less clear what the control state should be attached to. (Certainly not four-tuples.) So far, we have handled DTLS like UDP; theoretically, we could use CSM-style messages in DTLS, but that hasn’t been explored yet (and so far I’m not aware of interest in changing CoAP over DTLS to make use of this theoretical possibility).
Post by Adam Roach
Section 4.4 has a prohibition against using WebSockets keepalives in
favor of using CoAP ping/pong. Section 3.4 has no similar prohibition
against TCP keepalives, while the rationale would seem to be identical.
Is this asymmetry intentional?
Successful TCP keepalives are usually not visible to the application, so they are not quite in the same league as CoAP PING/PONG or the WebSocket mechanism. This is a SHOULD NOT because there may be reasons why the WebSocket mechanism might be the one to use; the CoAP-level PING/PONG are closer to the application and provide some CoAP-specific functionality (such as Custody).
Post by Adam Roach
(I’ll also note that the presence of
keepalive mechanisms in both TCP and WebSockets would seem to make the
addition of new CoAP primitives for the same purpose unnecessary, but I
suspect this has already been debated).
One problem with such mechanisms is that they are often buried in layers that are hard to access/control, so it may be simpler to solve the problem at the application layer. (This is a bit of a TLS vs. IPsec argument.)
Post by Adam Roach
Section 5 and its subsections define a new set of message types,
presumably for use only on connection-oriented protocols, although this
is only implied, and never stated.
Now https://github.com/core-wg/coap-tcp-tls/issues/152
Post by Adam Roach
For example, some implementors may see
CSM, Ping, and Pong as potentially useful in UDP; and, finding no
prohibition in this document against using them, decide to give it a go.
Is that intended? If not, I strongly suggest an explicit prohibition
against using these in UDP contexts.
Yes.
Post by Adam Roach
Section 5.3.2 says that implementations supporting block-wise transfers
SHOULD indicate the Block-wise Transfer Option. I can't figure out why
this is anything other than a "MUST”.
Well, they may not want to disclose this capability…
What the text probably should say is that if they want to make this capability available to the peer they MUST indicate it.

Now https://github.com/core-wg/coap-tcp-tls/issues/153
Post by Adam Roach
It seems odd that this document
would define a way to communicate this, and then choose to leave the
communicated options as “YES” and “YOUR GUESS IS AS GOOD AS MINE” rather
than the simpler and more useful “YES” and “NO”.
Indeed.
Post by Adam Roach
I find the described operation of the Custody Option in the operation of
Ping and Pong to be somewhat problematic: it allows the Pong sender to
unilaterally decide to set the Custody Option, and consequently
quarantine the Pong for an arbitrary amount of time while it processes
other operations.
Oh. The intention was that the PINGer gives permission to delay with the Custody option.
I now realize that the text isn’t clear about that.

s/Unless there is an
option with delaying semantics/Unless the PING carries an
option with delaying semantics/

Now https://github.com/core-wg/coap-tcp-tls/issues/154
Post by Adam Roach
This seems impossible to distinguish from a
failure-due-to-timeout from the perspective of the Ping sender. Why not
limit this behavior only to Ping messages that include the Custody
Option?
(Or a similar future option.)
Post by Adam Roach
I find the unmotivated definition of the default port for “coaps+tcp” to
443 — a port that is already assigned to https — to be surprising, to put
it mildly. This definitely needs motivating text, and I suspect it's
actually wrong.
That is possible because of the way ALPN works; it seems that 443 is the best choice for actually getting connectivity.
Post by Adam Roach
I am similarly perplexed by the hard-coded “must do ALPN *unless* the
designated port takes the magical value 5684” behavior. I don’t think
I’ve ever seen a protocol that has such variation based on a hard-coded
port number, and it seems unlikely to be deployed correctly (I'm imagining
the frustration of: “I changed both the server and the client
configuration from the default port of 5684 to 49152, and it just stopped
working. Like, literally the *only* way it works is on port 5684. I've
checked firewall settings everywhere and don't see any special handling
for that port -- I just can't figure this out, and it's driving me
crazy.”). Given the nearly universal availability of ALPN in pretty much
all modern TLS libraries, it seems much cleaner to just require ALPN
support and call it done. Or *don’t* require ALPN at all and call it
done. But *changing* protocol behavior based on magic port numbers seems
like it’s going to cause a lot of operational heartburn.
Interesting scenario.

The current rules are an attempt to use ALPN as it is the future, but also allow environments (which were specifically cited, but I forget which ones they were) without ALPN to play. I think that objective is worth some complexity, but I’ve opened an issue to discuss this nonetheless.

Now https://github.com/core-wg/coap-tcp-tls/issues/155
Post by Adam Roach
The final paragraph of section 8.1 is very confusing, making it somewhat
unclear which of the three modes must be implemented on a CoAP client,
The entirety of CoAP over TCP is optional for a CoAP client or server.
Post by Adam Roach
and which must be implemented on a CoAP server. Read naïvely, this sounds
like clients are required to do only one (but one of their choosing) of
these three, while servers are required to also do only one (again, of
their choosing).
Generally, this is determined by what kind of security workflows are used; writing down a preference here will have little impact.

EKR has pointed out that we need to make explicit that we want to deviate from the TLS 1.2 MTI for constrained-to-cloud, so some changes will be required here. See also https://github.com/core-wg/coap-tcp-tls/issues/145
Post by Adam Roach
It seems that the chance of finding devices that could
interoperate under such circumstances is going to be relatively low: to
work together, you would have to find a client and a server that happened
to make the same implementation choice among these three. What I’m used
to in these kinds of cases is: (a) server must implement all, client can
choose to implement only one (or more), (b) client must implement all,
server can choose to implement only one (or more), or (c) client and
server must implement a specifically named lowest-common denominator, and
can negotiate up from there. Pretty much anything else (aside from
strange “everyone must implement two of three” schemes) will end up with
interop issues.
The server and client roles are not really defined very well here, as either side can use a connection either way once it has been set up.

My advice would be: If you want interoperability, use CoAP over UDP with the security model that you have a security workflow for.
If there are operational restrictions making this impossible, respond to those operational restrictions.
Post by Adam Roach
Although the document clearly expects the use of gateways and proxies
between these connection-oriented usages of CoAP and UDP-based CoAP,
Appendix A seems to omit discussion or consideration of how this
gatewaying can be performed. The following list of problems is
illustrative of this larger issue, but likely not exhaustive. (I'll note
that all of these issues evaporate if you move to a simpler scheme that
merely frames otherwise unmodified UDP CoAP messages.)
Section A.1 does not indicate what gateways are supposed to do with
out-of-order notifications. The TCP side requires these to be delivered
in-order; so, does this mean that gateways observing a gap in sequence
numbers need to quarantine the newly received message so that it can
deliver the missing one first? Or does it deliver the newly-received
message and then discard the “stale” one when it arrives? I don’t think
that leaving this up to implementations is particularly advisable.
Yes, we had that discussion on the list. There is little point in trying to do anything here but the latter.
We could state that more explicitly here, or leave that to implementers’ advice documents such as draft-ietf-lwig-coap (we’ll need to rework section 6 of that now anyway).
Post by Adam Roach
Section A.3 is a bit more worrisome. I understand the desired
optimization here, but where you reduce traffic in one direction, you run
the risk of exploding it in the other. For example, consider a coap+tcp
client connecting to a gateway that communicates with a CoAP-over-UDP
server. When that client wants to check the health of its observations,
it can send a Ping and receive a Pong that confirms that they are all
alive and well. In order to be able to send a Pong that *means* “all your
observations are alive and well,” the gateway has to verify that all the
observations are alive and well.
Oh. The PONG says the proxy hasn’t crashed, so the observations are alive and well in the proxy. How that makes sure it gets fresh data from the next hop (by observe, by polling) is up to the proxy. There is no intention that a PING suddenly makes a proxy do a check, when it otherwise has ignored its duty to check that before.
Post by Adam Roach
A simple implementation of a gateway
will likely check on each observed resource individually when it gets a
Ping, and then send a Pong after it hears back about all of them. So, as
a client, I can set up, let’s say, two dozen observations through this
gateway. Then, with each Ping I send, the gateway sends two dozen checks
towards the server. This kind of message amplification attack is an
awesome way to DoS both the gateway and the server. I believe the
document needs a treatment of how UDP/TCP gateways handle notification
health checks, along with techniques for mitigating this specific
attack.
See above. We probably need to say that PINGs check the connection (and thus the fate-sharing observation state) but are not meant to make proxies frantically check the observation relationships to their data sources. Now https://github.com/core-wg/coap-tcp-tls/issues/156
Post by Adam Roach
Section A.4 talks about the rather different ways of dealing with
unsubscribing from a resource. Presumably, gateways that get a reset to a
notification are expected to synthesize a new GET to deregister on behalf
of the client?
A proxy that translates from a UDP client to a TCP server may want to deregister if the client sends a Reset. Or not, if the proxy has other clients that want this data (or if it is configured to watch that TCP server).
Post by Adam Roach
Or is it okay if they just pass along the reset,
In that scenario, there is no way to pass on the Reset, as there are no resets over the TCP connection.
(Well, you could close the TCP connection :-)
Post by Adam Roach
and
expect the server to know that it means the same thing as a
deregistration? Without explicit guidance here, I expect server and
gateway implementors to make different choices and end up with a lack of
interop.
A UDP CoAP server needs to handle resets on notifications, so it cannot make a choice here.
A TCP CoAP server can never see a Reset, so it only has to handle the explicit deregistration (or a connection teardown).
The proxy can make a choice to perform explicit deregistration to a UDP CoAP server or send a Reset on the next notification; I would generally assume that proxies will use explicit deregistration.
Food for section 6 of draft-ietf-lwig-coap, I’d say.
Post by Adam Roach
** There is 1 instance of too long lines in the document, the longest one
being 3 characters in excess of 72.
Thanks, now https://github.com/core-wg/coap-tcp-tls/issues/157

Again, thank you for this extensive review — getting feedback from someone who hasn’t followed the work since its start really can open one’s eyes on how a document can be confusing.

Grüße, Carsten
Adam Roach
2017-05-10 19:46:24 UTC
I'm starting a new thread on this one issue in particular because it has
much larger architectural implications for the future of the Internet at
large.
Post by Carsten Bormann
Post by Adam Roach
I find the unmotivated definition of the default port for “coaps+tcp” to
443 — a port that is already assigned to https — to be surprising, to put
it mildly. This definitely needs motivating text, and I suspect it's
actually wrong.
That is possible because of the way ALPN works; it seems that 443 is the best choice for actually getting connectivity.
I understand that it's *possible*, but that hardly seems to make it
*advisable*.

At the moment, the primary component in the design of most of our layer
4 protocols -- including TCP -- for distinguishing among different
services is the port number. The way ALPN is currently used is for
distinguishing among protocols on a single port that fundamentally
accomplish the same thing. Users of "https" URLs use it to switch among
HTTP, SPDY, and H2. Users of "stun" URLs use it to switch between relay
server functionality and NAT discovery functionality. Users of WebRTC
datachannels use it to differentiate between confidential and
non-confidential media.

All of those are basically using variations of the same protocol on a
single port.

You're proposing a fairly large departure from that, in that CoAP --
while inspired by HTTP and its relatives -- is really a different thing
than what port 443 is assigned for, and the argument being offered is
that port 443 tends to work through firewalls better than other ports.
Allowing a default port of 443 for coaps+tcp would mark the moment where
we first assigned overlapping default ports to different protocols for
the purposes of circumventing firewall policies that -- through
intention or misconfiguration -- prevent the use of a unique assigned
port. Taken to its logical end, all future TLS-using protocols could
equally claim a default port of 443 on the basis that:

1. It works through firewalls better, and
2. ALPN makes it possible

This list seems to be a complete accounting of your rationale for using
a default port of 443 for CoAP; correct me if I'm wrong.

If we're going to take the first steps down this path, I want them to be
made deliberately and after due consideration of the consequences. If we
decide that we really need to evaluate such an approach, I have a number
of concrete, severely detrimental real-world consequences that would
ensue from allowing the generalized assignment of port 443 to non-HTTPS
protocols. I'm not raising them here because I think enough people will
find the decision to move differentiation among unrelated protocols from
port numbers to ALPN IDs sufficiently architecturally distasteful that
debating the specific consequences will be unnecessary.

/a
Adam Roach
2017-05-10 22:22:19 UTC
Post by Carsten Bormann
Hi Adam,
thank you for your extensive review.
Alexey has addressed your procedural DISCUSS, and I don’t have anything to add there.
I will try to make one quick round through the COMMENTs.
Post by Adam Roach
----------------------------------------------------------------------
----------------------------------------------------------------------
General — this is a very bespoke approach to what could have been mostly
solved with a single four-byte “length” header; it is complicated on the
wire, and in implementation; and the format variations among CoAP over
UDP, coap+tls, and coap+ws are going to make gateways much harder to
implement and less efficient (as they will necessarily have to
disassemble messages and rebuild them to change between formats). The
protocol itself mentions gateways in several places, but does not discuss
how they are expected to map among the various flavors of CoAP defined in
this document. Some of the changes seem unnecessary, but it could be that
I’m missing the motivation for them. Ideally, the introduction would work
harder at explaining why CoAP over these transports is as different from
CoAP over UDP as it is, focusing in particular on why the complexity of
having three syntactically incompatible headers is justified by the
benefits provided by such variations.
I think I addressed this one in the comments to Mirja.
I’m still amazed how the arrangement of the first few bytes of the header can cause so much interest.
It's not so much the specific arrangement itself as much as the creation
of three mutually incompatible versions of the arrangement that is
causing me heartburn. I'd be much happier with something baroque like
little-endian byte packing than with so many variations on a theme. I'm
seeing that some implementations will have to have three different
parsers (or equivalent complexity in terms of alternate code paths) and
three different serializers if they're going to implement all three
variations. That is very unfriendly to developers and testers alike.
Post by Carsten Bormann
Post by Adam Roach
Additionally, it’s not clear from the introduction what the motivation
for using the mechanisms in this document is as compared to the
techniques described in section 10 (and its subsections) of RFC 7252.
I read this as “why don’t you use HTTP when you need TCP”?
CoAP over TCP is more complex than CoAP over UDP, but still much less complex than either one of the HTTPs.
I really hope this doesn't come across as snarky, as such is really not
my intention, but that explanation thoroughly falls apart when I get to
section 4.
Post by Carsten Bormann
Post by Adam Roach
With the exception of subscribing to resource state (which could be
added),
A very big exception — the observe option is fundamental to many interactions with Things, and we currently don’t have a way to map this on HTTP.
I would suggest becoming familiar with RFC 8030.
Post by Carsten Bormann
Post by Adam Roach
it seems that such an approach is significantly easier to
implement and more clearly defined than what is in this document; and it
appears to provide the combined benefits of all four transports discussed
in this document. My concern here is that an explosion of transport
options makes it less likely that a client and server can find two in
common: the limit of the probability of two implementations having a
transport in common as the number of transports approaches infinity is
zero. Due to this likely decrease in interoperability, I’d expect to see
some pretty powerful motivation in here for defining a third, fourth,
fifth, and sixth way to carry CoAP when only TCP is available (I count
RFC 7252 http and https as the first and second ways in this
accounting).
These motivations are in the draft.
To be clear: I think it needs to clearly say "we are making interop less
likely, and find these benefits to be sufficiently compelling to justify
doing so." I suspect that, phrased that way, the current justification
won't hold up to your own scrutiny, at least not without being made
substantially more convincing.
Post by Carsten Bormann
Post by Adam Roach
I’m also a bit puzzled that CoAP already has an inherent mechanism for
blocking messages off into chunks, which this document circumvents for
TCP connections (by allowing Max-Message-Size to be increased), and then
is forced to offer remedies for the resultant head-of-line blocking
issues. If you didn’t introduce this feature, messages with a two-byte
token add six bytes of overhead for every 1024 bytes of content — less
than 0.6% size inflation. It seems like a lot of complicated machinery —
which has a built-in foot-gun that you have to warn people about misusing
— for a very tiny gain. I know it’s relatively late in the process, but
if these trade-offs haven't had a lot of discussion yet, it’s probably
worth at least giving them some additional thought.
There are indeed different ways of handling large bodies (which occur as payloads only exceptionally in our use cases).
The current set of features was defined after OCF complained that the lock-step nature of the block protocol slowed down their firmware transfers too much; this was easy to fix by making the message size more flexible.
Ah, so the current block transfer mode is kind of like TFTP in that it
has a window size of 1? I didn't realize that. In that context, the
current design makes a bit more sense. Perhaps an explanation closer to
the front that ties Max-Message-Size, block transfer, and the BERT
mechanism together would be useful.
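
The analogy suggests a concrete sketch (Python; request_block is a
hypothetical helper that performs one RFC 7959 exchange):

    def blockwise_get(request_block, total_blocks: int) -> bytes:
        # RFC 7959 lock-step: one block in flight at a time, one round
        # trip per block -- TFTP with a window size of 1. BERT amortizes
        # this by letting one exchange carry many 1024-byte blocks.
        body = b""
        for num in range(total_blocks):
            body += request_block(num)  # waits out a full RTT
        return body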

[snip]
Post by Carsten Bormann
Post by Adam Roach
Specific comments follow.
Section 3.3, paragraph 3 says that an initiator may send messages prior
to receiving the remote side’s CSM, even though the message may be larger
than would be allowed by that CSM. What should the recipient of an
oversized message do in this case? In fact, I don’t see in here what a
recipient of a message larger than it allowed for in its CSM is supposed
to do in response at *any* stage of the connection. Is it an error? If
so, how do you indicate it? Or is the Max-Message-Size option just a
suggestion for the other side? This definitely needs clarification.
Given the introduction of Max-Message-Size, there was some speculation that it would be good to allow going below the default CoAP value of 1152.
Including the rationale in the document would be good.
Post by Carsten Bormann
In UDP CoAP, preferences of this kind are indicated in the “control usage” of the Block options, and there is an assumption that violating them will lead to suboptimal performance, not to malfunction. But we never really thought that the possibility of reducing Max-Message-Size below 1152 would motivate making the state machine more complex; maybe we should state the obvious and add the warning that indicating a smaller Max-Message-Size is no protection against receiving a message that was sent off before that new value was known to the peer.
Yes. I guarantee that the current phrasing will make some implementors
want to treat messages larger than their advertised Max-Message-Size as
an error. If it is supposed to be not an error, you need explicit
language that it is not an error.
Post by Hannes Tschofenig
Post by Adam Roach
(Aside — it seems odd and somewhat backwards that TCP connections are
provided an affordance for fine-grained control over message sizes, while
UDP communications are not.)
Connections provide a context for performing this control; for transports that don’t have connections, it is less clear what the control state should be attached to. (Certainly not four-tuples.) So far, we have handled DTLS like UDP; theoretically, we could use CSM-style messages in DTLS, but that hasn’t been explored yet (and so far I’m not aware of interest in changing CoAP over DTLS to make use of this theoretical possibility).
The "block transfer window size is one" clarification nullifies this
observation.
Post by Hannes Tschofenig
Post by Adam Roach
Section 4.4 has a prohibition against using WebSockets keepalives in
favor of using CoAP ping/pong. Section 3.4 has no similar prohibition
against TCP keepalives, while the rationale would seem to be identical.
Is this asymmetry intentional?
Successful TCP keepalives are usually not visible to the application, so they are not quite in the same league as CoAP PING/PONG or the WebSocket mechanism. This is a SHOULD NOT because there may be reasons why the WebSocket mechanism might be the one to use; the CoAP-level PING/PONG are closer to the application and provide some CoAP-specific functionality (such as Custody).
You seem to have answered some inverse of what I was asking, so I'll try
to be clearer: should section 3.4 say "SHOULD NOT use TCP keepalives"?

[snip]
Post by Hannes Tschofenig
Post by Adam Roach
I find the described operation of the Custody Option in the operation of
Ping and Pong to be somewhat problematic: it allows the Pong sender to
unilaterally decide to set the Custody Option, and consequently
quarantine the Pong for an arbitrary amount of time while it processes
other operations.
Oh. The intention was that the PINGer gives permission to delay with the Custody option.
I now realize that the text isn’t clear about that.
s/Unless there is an
option with delaying semantics/Unless the PING carries an
option with delaying semantics/
Now https://github.com/core-wg/coap-tcp-tls/issues/154
That fixes part of the problem. 5.4.1 also says "When responding to a
Ping message, the receiver can include an elective Custody Option in the
Pong message," making it again sound like it's the entity sending the
Pong making the decision.

And then I can't read the second paragraph of section 5.4.1 in any way
other than "in addition to the (clearly unilateral) decision to include
a Custody Option in a Pong, the sender of the Ping can request that this
happen by including one in the Ping."

I think most of the text in 5.4.1 needs to be rewritten to make it clear
-- and I do suggest normative language here -- that a Ping MAY include a
Custody Option, and a Pong MUST NOT include a Custody Option unless the
corresponding Ping also contained one.
Post by Hannes Tschofenig
(Or a similar future option.)
Yes. Or a similar future option.
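A minimal sketch of the rule as proposed above, with messages reduced to their option sets (the caveat raised later in the thread -- that a Pong might legitimately carry a spontaneous, non-delaying Custody option -- is deliberately ignored here):

    def pong_options(ping_options):
        # A Ping MAY include Custody; the Pong only echoes it back --
        # and with it the permission to delay -- when the Ping had it.
        opts = set()
        if "Custody" in ping_options:
            opts.add("Custody")
        return opts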
Post by Hannes Tschofenig
Post by Adam Roach
I am similarly perplexed by the hard-coded “must do ALPN *unless* the
designated port takes the magical value 5684” behavior. I don’t think
I’ve ever seen a protocol that has such variation based on a hard-coded
port number, and it seems unlikely to be deployed correctly (I’m imagining
the frustration of: “I changed both the server and the client
configuration from the default port of 5684 to 49152, and it just stopped
working. Like, literally the *only* way it works is on port 5684. I've
checked firewall settings everywhere and don't see any special handling
for that port -- I just can't figure this out, and it's driving me
crazy.”). Given the nearly universal availability of ALPN in pretty much
all modern TLS libraries, it seems much cleaner to just require ALPN
support and call it done. Or *don’t* require ALPN at all and call it
done. But *changing* protocol behavior based on magic port numbers seems
like it’s going to cause a lot of operational heartburn.
Interesting scenario.
The current rules are an attempt to use ALPN as it is the future, but also allow environments (which were specifically cited,
Not in this document, as far as I can tell. Including design rationale
for non-obvious choices is generally a Good Thing....
Post by Hannes Tschofenig
but I forget which ones they were)
...and that's why.
Post by Hannes Tschofenig
without ALPN to play. I think that objective is worth some complexity, but I’ve opened an issue to discuss this nonetheless.
Now https://github.com/core-wg/coap-tcp-tls/issues/155
I think this conversation will be very difficult to properly reason
about unless someone can unearth the rationale that you describe above.
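For reference, the client-side behavior being debated comes down to something like this (a sketch using Python's standard ssl module; "coap" as the ALPN ID and 5684 as the designated port are taken from the draft, everything else is illustrative):

    import socket, ssl

    def connect_coaps_tcp(host, port=5684):
        ctx = ssl.create_default_context()
        if port != 5684:
            # Off the designated port, the draft requires ALPN.
            ctx.set_alpn_protocols(["coap"])
        sock = socket.create_connection((host, port))
        return ctx.wrap_socket(sock, server_hostname=host)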
Post by Hannes Tschofenig
Post by Adam Roach
The final paragraph of section 8.1 is very confusing, making it somewhat
unclear which of the three modes must be implemented on a CoAP client,
The entirety of CoAP over TCP is optional for a CoAP client or server.
s/CoAP client/coaps+tls client/ -- sorry for the imprecision, as I
thought this would be obvious from context.
Post by Hannes Tschofenig
Post by Adam Roach
and which must be implemented on a CoAP server. Read naïvely, this sounds
like clients are required to do only one (but one of their choosing) of
these three, while servers are required to also do only one (again, of
their choosing).
Generally, this is determined by what kind of security workflows are used; writing down a preference here will have little impact.
This again points towards something about CoAP that's giving me a vague
sense of unease about the whole ecosystem as designed. Right now, you
have the mutually incompatible options of:

1. COAP/UDP
2. COAP/DTLS/UDP
3. COAP mapped to HTTP per RFC 7252
4. COAP mapped to HTTPS per RFC 7252
5. COAP/TCP
6. COAP/TLS/TCP with PSK
7. COAP/TLS/TCP with Raw Public Key
8. COAP/TLS/TCP with Certs
9. COAP/WS/HTTP
10. COAP/WS/HTTPS


It might be worse than this, as I suspect that your handling of DTLS
probably parallels your handling of TLS. And these are all
*configuration* options, not negotiated options. It's made worse by the
fact that you're defining the resource spaces for many of these to be
different from the others, so you can't just switch among them to find
one you have in common: if something is available via coaps+tcp, then
there's no mechanical way to determine whether it's also available over
coaps (on UDP), so even an assertion that UDP is MTI doesn't make things
work.
Post by Hannes Tschofenig
We care only about CoAP over TLS. We are not going to use the WebSockets
part of the document. In practice for many companies there will not be a
problem with too many transports since they will only use specific ones
in their deployment.
Basically what I'm hearing is that each deployment will have to pick one
of these mutually incompatible (and increasingly unrelated) flavors of
COAP. At some point, this becomes indistinguishable from each
installation having its own proprietary protocol in terms of the
benefits derived from standardization in the first place.

I'm sad that these protocols are syntactically different from each other
and have different state machines, requiring multiple or unnecessarily
complex libraries to implement them. I'm sad that these protocols have
disjoint resource spaces, preventing automated failover among them to
find one in common. I'm sad that these protocols have no in-built
negotiation among themselves. I'm sad that these protocols have no MTI
among the myriad of options, and particularly sad that the disjoint
resource spaces will forever slam the door on fixing that flaw. And I'm
sad that this list appears to be long and growing longer.

Mostly, I'm sad that saying "our device uses CoAP" is becoming an
increasingly meaningless statement, as you have to be very clear about
which incompatible variation of CoAP it implements.
Post by Hannes Tschofenig
EKR has pointed out that we need to make explicit that we want to deviate from the TLS 1.2 MTI for constrained-to-cloud, so some changes will be required here. See also https://github.com/core-wg/coap-tcp-tls/issues/145
If EKR is on top of the TLS binding issue, I'm sure he is better
equipped to drive it to a reasonable conclusion than I am.
Post by Hannes Tschofenig
Post by Adam Roach
It seems that the chance of finding devices that could
interoperate under such circumstances is going to be relatively low: to
work together, you would have to find a client and a server that happened
to make the same implementation choice among these three. What I’m used
to in these kinds of cases is: (a) server must implement all, client can
choose to implement only one (or more), (b) client must implement all,
server can choose to implement only one (or more), or (c) client and
server must implement a specifically named lowest-common denominator, and
can negotiate up from there. Pretty much anything else (aside from
strange “everyone must implement two of three” schemes) will end up with
interop issues.
The server and client roles are not really defined very well here, as either side can use a connection either way once it has been set up.
My advice would be: If you want interoperability, use CoAP over UDP with the security model that you have a security workflow for.
If there are operational restrictions making this impossible, respond to those operational restrictions.
This seems precluded by the determination that each transport has a
disjoint resource space.
Post by Hannes Tschofenig
Post by Adam Roach
Although the document clearly expects the use of gateways and proxies
between these connection-oriented usages of CoAP and UDP-based CoAP,
Appendix A seems to omit discussion or consideration of how this
gatewaying can be performed. The following list of problems is
illustrative of this larger issue, but likely not exhaustive. (I'll note
that all of these issues evaporate if you move to a simpler scheme that
merely frames otherwise unmodified UDP CoAP messages)
Section A.1 does not indicate what gateways are supposed to do with
out-of-order notifications. The TCP side requires these to be delivered
in-order; so, does this mean that gateways observing a gap in sequence
numbers need to quarantine the newly received message so that it can
deliver the missing one first? Or does it deliver the newly-received
message and then discard the “stale” one when it arrives? I don’t think
that leaving this up to implementations is particularly advisable.
Yes, we had that discussion on the list. There is little point in trying to do anything here but the latter.
We could state that more explicitly here, or leave that to implementers’ advice documents such as draft-ietf-lwig-coap (we’ll need to rework section 6 of that now anyway).
Say that out loud, or someone will get it wrong. I mean, I outlined the
two potentially sensible things. There are infinite nonsensical things.
Absent guidance, people will implement out of both categories.
Post by Hannes Tschofenig
Post by Adam Roach
Section A.3 is a bit more worrisome. I understand the desired
optimization here, but where you reduce traffic in one direction, you run
the risk of exploding it in the other. For example, consider a coap+tcp
client connecting to a gateway that communicates with a CoAP-over-UDP
server. When that client wants to check the health of its observations,
it can send a Ping and receive a Pong that confirms that they are all
alive and well. In order to be able to send a Pong that *means* “all your
observations are alive and well,” the gateway has to verify that all the
observations are alive and well.
Oh. The PONG says the proxy hasn’t crashed, so the observations are alive and well in the proxy.
That's not what this document says, and why I was careful to quote from
the document above.
Post by Hannes Tschofenig
How that makes sure it gets fresh data from the next hop (by observe, by polling) is up to the proxy. There is no intention that a PING suddenly makes a proxy do a check, when it otherwise has ignored its duty to check that before.
You are making a bunch of assumptions here that are completely undocumented.
Post by Hannes Tschofenig
Post by Adam Roach
A simple implementation of a gateway
will likely check on each observed resource individually when it gets a
Ping, and then send a Pong after it hears back about all of them. So, as
a client, I can set up, let’s say, two dozen observations through this
gateway. Then, with each Ping I send, the gateway sends two dozen checks
towards the server. This kind of message amplification attack is an
awesome way to DoS both the gateway and the server. I believe the
document needs a treatment of how UDP/TCP gateways handle notification
health checks, along with techniques for mitigating this specific
attack.
See above. We probably need to say that PINGs check the connection (and thus the fate-sharing observation state) but are not meant to make proxies frantically check the observation relationships to their data sources. Now https://github.com/core-wg/coap-tcp-tls/issues/156
Yes. That's the solution. I strongly suggest outlining this attack as
one consequence if implementors ignore such advice. Sending checks when
you get a Ping is the easiest thing to code, by far, so unless you
explain that it could be potentially quite harmful, people *will* do it
even if you say not to. (In fact, saying not to without being clear
about *why* not could be worse than saying nothing; cf.
<https://en.wikipedia.org/wiki/Wikipedia:Don%27t_stuff_beans_up_your_nose>)
Post by Hannes Tschofenig
Post by Adam Roach
Section A.4 talks about the rather different ways of dealing with
unsubscribing from a resource. Presumably, gateways that get a reset to a
notification are expected to synthesize a new GET to deregister on behalf
of the client?
A proxy that translates from a UDP client to TCP server may want to deregister if the client sends a Reset. Or not, if the proxy has other clients that want this data (or if it is configured to watch that TCP server).
Post by Adam Roach
Or is it okay if they just pass along the reset,
In that scenario, there is no way to pass on the Reset, as there are no resets over the TCP connection.
(Well, you could close the TCP connection :-)
Post by Adam Roach
and
expect the server to know that it means the same thing as a
deregistration? Without explicit guidance here, I expect server and
gateway implementors to make different choices and end up with a lack of
interop.
A UDP CoAP server needs to handle resets on notifications, so it cannot make a choice here.
A TCP CoAP server never can see a reset, so it only has to handle the explicit deregistration (or a connection teardown).
The proxy can make a choice to perform explicit deregistration to a UDP CoAP server or send a Reset on the next notification; I would generally assume that proxies will use explicit deregistration.
Food for section 6 of draft-ietf-lwig-coap, I’d say.
Okay.
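In code terms, the explicit-deregistration choice described above would look roughly like this at the proxy (hypothetical proxy/message API; Observe=1 as the deregistration value comes from RFC 7641):

    def on_client_reset(proxy, observation, client):
        # Drop the resetting client; if nobody else still wants the
        # resource, deregister upstream with an explicit GET.
        observation.clients.discard(client)
        if not observation.clients:
            request = Message(code="GET", uri=observation.uri)  # hypothetical API
            request.options["Observe"] = 1   # 1 = deregister (RFC 7641)
            proxy.send_upstream(request)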

[snip]

/a
Carsten Bormann
2017-05-11 06:55:28 UTC
Post by Carsten Bormann
I think I addressed this one in the comments to Mirja.
I’m still amazed how the arrangement of the first few bytes of the header can cause so much interest.
It's not so much the specific arrangement itself as much as the creation of three mutually incompatible versions of the arrangement that is causing me heartburn.
Well, there is only one arrangement, with a variable-size length extension.
I’m afraid the box notation we used doesn’t make the simplicity very transparent.

This box notation might have been better:


 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Len  |  TKL  |      Code     |  Length extension (if any)
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Token (if any, TKL bytes) ...
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Options (if any) ...
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1 1 1 1 1 1 1 1|  Payload (if any) ...
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
I'd be much happier with something baroque like little-endian byte packing than with so many variations on a theme. I'm seeing that some implementations will have to have three different parsers (or equivalent complexity in terms of alternate code paths) and three different serializers if they're going to implement all three variations. That is very unfriendly to developers and testers alike.
Well, we’ll have to fix the presentation so people don’t think there is more to this than there is.
(Variable-size lengths already occur in other places in the CoAP protocol, so this is not surprising to CoAP developers.)

Now https://github.com/core-wg/coap-tcp-tls/issues/159
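To make the "one arrangement" point concrete, here is a minimal reader for the framing in the diagram above, written against plain blocking sockets. The field order follows the diagram; the Len-extension offsets (13, 269, 65805) are assumed to match the draft's rules, and the rest is an illustrative sketch:

    def recv_exact(sock, n):
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed mid-message")
            buf += chunk
        return buf

    def read_frame(sock):
        first = recv_exact(sock, 1)[0]
        len_nibble, tkl = first >> 4, first & 0x0F
        code = recv_exact(sock, 1)[0]
        if len_nibble <= 12:                 # length fits in the nibble
            length = len_nibble
        elif len_nibble == 13:               # 8-bit extension
            length = recv_exact(sock, 1)[0] + 13
        elif len_nibble == 14:               # 16-bit extension
            length = int.from_bytes(recv_exact(sock, 2), "big") + 269
        else:                                # 32-bit extension
            length = int.from_bytes(recv_exact(sock, 4), "big") + 65805
        token = recv_exact(sock, tkl)
        rest = recv_exact(sock, length)      # options, 0xFF marker, payload
        return code, token, rest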
Post by Carsten Bormann
Post by Adam Roach
Additionally, it’s not clear from the introduction what the motivation
for using the mechanisms in this document is as compared to the
techniques described in section 10 (and its subsections) of RFC 7252.
I read this as “why don’t you use HTTP when you need TCP”?
CoAP over TCP is more complex than CoAP over UDP, but still much less complex than either one of the HTTPs.
I really hope this doesn't come across as snarky, as such is really not my intention, but that explanation thoroughly falls apart when I get to section 4.
Well, I was talking about CoAP over TCP and not CoAP over Websockets.
There is a different simplicity argument for the latter: Not having to translate application semantics simplifies the application.
Post by Carsten Bormann
Post by Adam Roach
With the exception of subscribing to resource state (which could be
added),
A very big exception — the observe option is fundamental to many interactions with Things, and we currently don’t have a way to map this on HTTP.
I would suggest becoming familiar with RFC 8030.
Web Push is about forwarding messages (“events”), not about observing resources.
I’m sure we can create a mapping from resources to messages back to resources, but that is exactly the kind of complexity that direct use of CoAP can avoid.
Post by Carsten Bormann
Post by Adam Roach
it seems that such an approach is significantly easier to
implement and more clearly defined than what is in this document; and it
appears to provide the combined benefits of all four transports discussed
in this document. My concern here is that an explosion of transport
options makes it less likely that a client and server can find two in
common: the limit of the probability of two implementations having a
transport in common as the number of transports approaches infinity is
zero. Due to this likely decrease in interoperability, I’d expect to see
some pretty powerful motivation in here for defining a third, fourth,
fifth, and sixth way to carry CoAP when only TCP is available (I count
RFC 7252 http and https as the first and second ways in this
accounting).
These motivations are in the draft.
To be clear: I think it needs to clearly say "we are making interop less likely, and find these benefits to be sufficiently compelling to justify doing so." I suspect that, phrased that way, the current justification won't hold up to your own scrutiny, at least not without being made substantially more convincing.
While reducing options always is a good objective, there are also cases where creating options increases interoperability.
Which of the two effects prevails depends a lot on the specifics of the domain.

[snip]
Perhaps an explanation closer to the front that ties Max-Message-Size, block transfer, and the BERT mechanism together would be useful.
Makes sense. Now https://github.com/core-wg/coap-tcp-tls/issues/160
[snip]
Post by Carsten Bormann
Post by Adam Roach
Specific comments follow.
Section 3.3, paragraph 3 says that an initiator may send messages prior
to receiving the remote side’s CSM, even though the message may be larger
than would be allowed by that CSM. What should the recipient of an
oversized message do in this case? In fact, I don’t see in here what a
recipient of a message larger than it allowed for in its CSM is supposed
to do in response at *any* stage of the connection. Is it an error? If
so, how do you indicate it? Or is the Max-Message-Size option just a
suggestion for the other side? This definitely needs clarification.
Given the introduction of Max-Message-Size, there was some speculation that it would be good to allow going below the default CoAP value of 1152.
Including the rationale in the document would be good.
Post by Carsten Bormann
In UDP CoAP, preferences of this kind are indicated in the “control usage” of the Block options, and there is an assumption that violating them will lead to suboptimal performance, not to malfunction. But we never really thought that the possibility of reducing Max-Message-Size below 1152 would motivate making the state machine more complex; maybe we should state the obvious and add the warning that indicating a smaller Max-Message-Size is no protection against receiving a message that was sent off before that new value was known to the peer.
Yes. I guarantee that the current phrasing will make some implementors want to treat messages larger than their advertised Max-Message-Size as an error. If it is supposed to be not an error, you need explicit language that it is not an error.
We should cover this in https://github.com/core-wg/coap-tcp-tls/issues/160
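If the lenient reading is indeed the intended one, the receiver side would amount to something like this (handler names hypothetical):

    def on_frame(raw, advertised_max, handle):
        # Exceeding our advertised Max-Message-Size is NOT an error:
        # the peer may have sent this frame before processing our CSM.
        if len(raw) > advertised_max:
            print("note: frame exceeds our advertised Max-Message-Size")
        handle(raw)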

[snip]
Post by Carsten Bormann
Post by Adam Roach
Section 4.4 has a prohibition against using WebSockets keepalives in
favor of using CoAP ping/pong. Section 3.4 has no similar prohibition
against TCP keepalives, while the rationale would seem to be identical.
Is this asymmetry intentional?
Successful TCP keepalives are usually not visible to the application, so they are not quite in the same league as CoAP PING/PONG or the WebSocket mechanism. This is a SHOULD NOT because there may be reasons why the WebSocket mechanism might be the one to use; the CoAP-level PING/PONG are closer to the application and provide some CoAP-specific functionality (such as Custody).
You seem to have answered some inverse of what I was asking, so I'll try to be clearer: should section 3.4 say "SHOULD NOT use TCP keepalives”?
The assumption here is that we cannot mandate anything about TCP keepalives because it may be outside of the control of the application. But there is nothing in the protocol that would create interoperability problems if an application did have control and used them. So I wouldn’t add that.
[snip]
Post by Carsten Bormann
Post by Adam Roach
I find the described operation of the Custody Option in the operation of
Ping and Pong to be somewhat problematic: it allows the Pong sender to
unilaterally decide to set the Custody Option, and consequently
quarantine the Pong for an arbitrary amount of time while it processes
other operations.
Oh. The intention was that the PINGer gives permission to delay with the Custody option.
I now realize that the text isn’t clear about that.
s/Unless there is an
option with delaying semantics/Unless the PING carries an
option with delaying semantics/
Now
https://github.com/core-wg/coap-tcp-tls/issues/154
That fixes part of the problem. 5.4.1 also says "When responding to a Ping message, the receiver can include an elective Custody Option in the Pong message," making it again sound like it's the entity sending the Pong making the decision.
And then I can't read the second paragraph of section 5.4.1 in any way other than "in addition to the (clearly unilateral) decision to include a Custody Option in a Pong, the sender of the Ping can request that this happen by including one in the Ping."
I think most of the text in 5.4.1 needs to be rewritten to make it clear -- and I do suggest normative language here -- that a Ping MAY include a Custody Option, and a Pong MUST NOT include a Custody Option unless the corresponding Ping also contained one.
Well, the latter MUST NOT would be only about delaying — a PONG MAY very well include a spontaneous Custody option (or similar).
Post by Carsten Bormann
(Or a similar future option.)
Yes. Or a similar future option.
Post by Carsten Bormann
Post by Adam Roach
I am similarly perplexed by the hard-coded “must do ALPN *unless* the
designated port takes the magical value 5684” behavior. I don’t think
I’ve ever seen a protocol that has such variation based on a hard-coded
port number, and it seems unlikely to be deployed correctly (I’m imagining
the frustration of: “I changed both the server and the client
configuration from the default port of 5684 to 49152, and it just stopped
working. Like, literally the *only* way it works is on port 5684. I've
checked firewall settings everywhere and don't see any special handling
for that port -- I just can't figure this out, and it's driving me
crazy.”). Given the nearly universal availability of ALPN in pretty much
all modern TLS libraries, it seems much cleaner to just require ALPN
support and call it done. Or *don’t* require ALPN at all and call it
done. But *changing* protocol behavior based on magic port numbers seems
like it’s going to cause a lot of operational heartburn.
Interesting scenario.
The current rules are an attempt to use ALPN as it is the future, but also allow environments (which were specifically cited,
Not in this document, as far as I can tell. Including design rationale for non-obvious choices is generally a Good Thing....
Post by Carsten Bormann
but I forget which ones they were)
...and that's why.
But which environments do not support ALPN is very ephemeral information.
(I seem to remember we specifically talked about Microsoft .NET — I don’t think we should build this up to an implementation survey about ALPN.)
Post by Carsten Bormann
without ALPN to play. I think that objective is worth some complexity, but I’ve opened an issue to discuss this nonetheless.
Now
https://github.com/core-wg/coap-tcp-tls/issues/155
I think this conversation will be very difficult to properly reason about unless someone can unearth the rationale that you describe above.
Post by Carsten Bormann
Post by Adam Roach
The final paragraph of section 8.1 is very confusing, making it somewhat
unclear which of the three modes must be implemented on a CoAP client,
The entirety of CoAP over TCP is optional for a CoAP client or server.
s/CoAP client/coaps+tls client/ -- sorry for the imprecision, as I thought this would be obvious from context.
The MTI would then address the constrained-to-cloud use case and ignore that there are other use cases.
So we would recommend RFC 7925 cipher suites and make RPK MTI because it is MTI in RFC 7252.

(The MTIs revert to default for coaps+ws.)
Post by Carsten Bormann
Post by Adam Roach
and which must be implemented on a CoAP server. Read naïvely, this sounds
like clients are required to do only one (but one of their choosing) of
these three, while servers are required to also do only one (again, of
their choosing).
Generally, this is determined by what kind of security workflows are used; writing down a preference here will have little impact.
• COAP/UDP
• COAP/DTLS/UDP
• COAP mapped to HTTP per RFC 7252
• COAP mapped to HTTPS per RFC 7252
• COAP/TCP
• COAP/TLS/TCP with PSK
• COAP/TLS/TCP with Raw Public Key
• COAP/TLS/TCP with Certs
• COAP/WS/HTTP
• COAP/WS/HTTPS
They are incompatible only in the sense that IP over Ethernet is incompatible with IP over WiFi: I can’t connect my laptop (which connects fine over WiFi) to Ethernet without a dongle. But of course there is a switch connecting the two. Actually, the WebSockets use case pretty much *requires* the use of a proxy to translate between WebSockets from the browser and the device protocol (which will likely be CoAP/UDP or CoAP/DTLS/UDP). Devices should not have to implement WebSockets.
It might be worse than this, as I suspect that your handling of DTLS probably parallels your handling of TLS. And these are all *configuration* options, not negotiated options. It's made worse by the fact that you're defining the resource spaces for many of these to be different from the others, so you can't just switch among them to find one you have in common: if something is available via coaps+tcp, then there's no mechanical way to determine whether it's also available over coaps (on UDP), so even an assertion that UDP is MTI doesn't make things work.
Yes, the siloing of URIs by URI schemes is unfortunate.
Post by Carsten Bormann
We care only about CoAP over TLS. We are not going to use the WebSockets
part of the document. In practice for many companies there will not be a
problem with too many transports since they will only use specific ones
in their deployment.
Basically what I'm hearing is that each deployment will have to pick one of these mutually incompatible (and increasingly unrelated) flavors of COAP. At some point, this becomes indistinguishable from each installation having its own proprietary protocol in terms of the benefits derived from standardization in the first place.
Very much not so, because they are all bound together by the Web model of permissionless innovation enabled by Web proxies.
I'm sad that these protocols are syntactically different from each other and have different state machines, requiring multiple or unnecessarily complex libraries to implement them. I'm sad that these protocols have disjoint resource spaces, preventing automated failover among them to find one in common. I'm sad that these protocols have no in-built negotiation among themselves. I'm sad that these protocols have no MTI among the myriad of options, and particularly sad that the disjoint resource spaces will forever slam the door on fixing that flaw. And I'm sad that this list appears to be long and growing longer.
Mostly, I'm sad that saying "our device uses CoAP" is becoming an increasingly meaningless statement, as you have to be very clear about which incompatible variation of CoAP it implements.
We are very aware of that danger.
In fact, the biggest obstacle to interoperation is incompatible security flows, and groups such as ACE have been set up to get more interoperability there.

Until recently, we tried to get by with UDP transport only, but the realities of today’s Internet cannot be ignored.

[snip]
Post by Carsten Bormann
Post by Adam Roach
It seems that the chance of finding devices that could
interoperate under such circumstances is going to be relatively low: to
work together, you would have to find a client and a server that happened
to make the same implementation choice among these three. What I’m used
to in these kinds of cases is: (a) server must implement all, client can
choose to implement only one (or more), (b) client must implement all,
server can choose to implement only one (or more), or (c) client and
server must implement a specifically named lowest-common denominator, and
can negotiate up from there. Pretty much anything else (aside from
strange “everyone must implement two of three” schemes) will end up with
interop issues.
The server and client roles are not really defined very well here, as either side can use a connection either way once it has been set up.
My advice would be: If you want interoperability, use CoAP over UDP with the security model that you have a security workflow for.
If there are operational restrictions making this impossible, respond to those operational restrictions.
This seems precluded by the determination that each transport has a disjoint resource space.
That’s where the link bundles come in that we discussed with Ben.
Infrastructure such as the Resource Directory can mitigate this somewhat.

I have to run now; more about the rest of your message (Observe and proxies) later.

Grüße, Carsten
Carsten Bormann
2017-05-11 13:32:07 UTC
Post by Carsten Bormann
Post by Adam Roach
Although the document clearly expects the use of gateways and proxies
between these connection-oriented usages of CoAP and UDP-based CoAP,
Appendix A seems to omit discussion or consideration of how this
gatewaying can be performed. The following list of problems is
illustrative of this larger issue, but likely not exhaustive. (I'll note
that all of these issues evaporate if you move to a simpler scheme that
merely frames otherwise unmodified UDP CoAP messages)
Section A.1 does not indicate what gateways are supposed to do with
out-of-order notifications. The TCP side requires these to be delivered
in-order; so, does this mean that gateways observing a gap in sequence
numbers need to quarantine the newly received message so that it can
deliver the missing one first? Or does it deliver the newly-received
message and then discard the “stale” one when it arrives? I don’t think
that leaving this up to implementations is particularly advisable.
Yes, we had that discussion on the list. There is little point in trying to do anything here but the latter.
We could state that more explicitly here, or leave that to implementers’ advice documents such as draft-ietf-lwig-coap (we’ll need to rework section 6 of that now anyway).
Say that out loud, or someone will get it wrong. I mean, I outlined the two potentially sensible things. There are infinite nonsensical things. Absent guidance, people will implement out of both categories.
Let’s do that. Now https://github.com/core-wg/coap-tcp-tls/issues/161
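Stated as code, the "deliver what arrives, drop the stale one" rule at a gateway could look like this (a sketch using the 24-bit Observe reordering test from RFC 7641, ignoring its 128-second timer for brevity):

    OBS_MOD = 1 << 24   # Observe values are 24-bit sequence numbers

    def is_fresher(v_new, v_last):
        # RFC 7641 freshness test, in modular form.
        return 0 < (v_new - v_last) % OBS_MOD < (1 << 23)

    def on_udp_notification(state, v_new, message, forward):
        if is_fresher(v_new, state["last_seq"]):
            state["last_seq"] = v_new
            forward(message)   # the TCP side sees in-order delivery
        # else: a late, reordered notification; silently drop it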
Post by Carsten Bormann
Post by Adam Roach
Section A.3 is a bit more worrisome. I understand the desired
optimization here, but where you reduce traffic in one direction, you run
the risk of exploding it in the other. For example, consider a coap+tcp
client connecting to a gateway that communicates with a CoAP-over-UDP
server. When that client wants to check the health of its observations,
it can send a Ping and receive a Pong that confirms that they are all
alive and well. In order to be able to send a Pong that *means* “all your
observations are alive and well,” the gateway has to verify that all the
observations are alive and well.
Oh. The PONG says the proxy hasn’t crashed, so the observations are alive and well in the proxy.
That's not what this document says, and why I was careful to quote from the document above.
Since observation relationships are hop-by-hop in CoAP (i.e., terminate at the next proxy), I can’t agree with your reading.
But again, we can state all this more explicitly. Now https://github.com/core-wg/coap-tcp-tls/issues/162
Post by Carsten Bormann
How that makes sure it gets fresh data from the next hop (by observe, by polling) is up to the proxy. There is no intention that a PING suddenly makes a proxy do a check, when it otherwise has ignored its duty to check that before.
You are making a bunch of assumptions here that are completely undocumented.
Post by Carsten Bormann
Post by Adam Roach
A simple implementation of a gateway
will likely check on each observed resource individually when it gets a
Ping, and then send a Pong after it hears back about all of them. So, as
a client, I can set up, let’s say, two dozen observations through this
gateway. Then, with each Ping I send, the gateway sends two dozen checks
towards the server. This kind of message amplification attack is an
awesome way to DoS both the gateway and the server. I believe the
document needs a treatment of how UDP/TCP gateways handle notification
health checks, along with techniques for mitigating this specific
attack.
See above. We probably need to say that PINGs check the connection (and thus the fate-sharing observation state) but are not meant to make proxies frantically check the observation relationships to their data sources. Now https://github.com/core-wg/coap-tcp-tls/issues/156
Yes. That's the solution. I strongly suggest outlining this attack as one consequence if implementors ignore such advice. Sending checks when you get a Ping is the easiest thing to code, by far, so unless you explain that it could be potentially quite harmful, people *will* do it even if you say not to. (In fact, saying not to without being clear about *why* not could be worse than saying nothing; cf. <https://en.wikipedia.org/wiki/Wikipedia:Don%27t_stuff_beans_up_your_nose>)
Thanks. I added that to 156 (and I learned a new term :-).
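The agreed shape of the Ping handler, then, is roughly this (hypothetical proxy API; pong_for stands in for Pong construction):

    def on_ping(connection, ping):
        # The Pong asserts only that this connection -- and the
        # observation state fate-sharing with it -- is alive here.
        connection.send(pong_for(ping))   # answered from local state
        # Deliberately no upstream checks: one check per observation
        # would let each Ping amplify into N messages toward origins.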

[snip]

Grüße, Carsten

Adam Roach
2017-05-10 20:16:40 UTC
Adam Roach has entered the following ballot position for
draft-ietf-core-coap-tcp-tls-08: Discuss

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)


Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
for more information about IESG DISCUSS and COMMENT positions.


The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-core-coap-tcp-tls/



----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

ISSUE 1: WebSockets and .well-known

- Part of the document is outside the scope of the charter of the WG
which requested its publication

While I understand that this document requires a WebSockets mechanism for
.well-known, and that such a mechanism doesn’t yet exist, it seems pretty
far out of scope for the CORE working group to take on defining this
itself (unless I missed something in its charter, which is entirely
possible: it’s quite long). Specifically, I fear that this venue is
unlikely to bring such a change to the attention of those people best
positioned to comment on whether .well-known is appropriate for
WebSockets.

Even if this is in scope for CORE, it really needs to be its own
document. If some future document comes along at a later point and wants
to make use of its own .well-known path with WebSockets, it would be
really quite strange to require it to reference this document in
describing .well-known for WS.

==================================================

ISSUE 2: Assignment of port 443 as default

- Widespread deployment would be damaging to the Internet or an
enterprise network for reasons of congestion control, scalability, or the
like.

I'd like to thank the authors for helping me to understand the intention
with the use of port 443 more clearly. Based on their clarifications, I
need to move my issue about assigning a default of port 443 to coaps+tcp
from my Comment into the Discuss, as it does have implications for the
Internet at large that will have long-term damaging effects.

The rationale being offered for using the already-assigned port 443
as a default is that it tends to go through firewalls that other ports
may not, and that doing so is fine because ALPN makes it possible. These
arguments, if we accept them, are manifestly true for all future
TLS-using protocols. Allowing CoAP to re-use an assigned port on this
basis establishes a precedent for pretty much all future protocols to do
so, effectively moving the protocol demux point for future protocols from
port numbers to ALPN IDs (all over port 443). It is hard to imagine an
outcome *other* *than* firewall manufacturers starting to whitelist
desired ALPN IDs, which effectively ossifies the available set of IDs to
whatever is defined at that moment, destroying the future utility of the
mechanism.

There are other issues having to do with software architecture, protocol
demultiplexing in user space rather than kernel space, and operational
considerations that come into play as well, but they don't technically
fall under discuss criteria.


----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

General — this is a very bespoke approach to what could have been mostly
solved with a single four-byte “length” header; it is complicated on the
wire, and in implementation; and the format variations among CoAP over
UDP, coap+tls, and coap+ws are going to make gateways much harder to
implement and less efficient (as they will necessarily have to
disassemble messages and rebuild them to change between formats). The
protocol itself mentions gateways in several places, but does not discuss
how they are expected to map among the various flavors of CoAP defined in
this document. Some of the changes seem unnecessary, but it could be that
I’m missing the motivation for them. Ideally, the introduction would work
harder at explaining why CoAP over these transports is as different from
CoAP over UDP as it is, focusing in particular on why the complexity of
having three syntactically incompatible headers is justified by the
benefits provided by such variations.

Additionally, it’s not clear from the introduction what the motivation
for using the mechanisms in this document is as compared to the
techniques described in section 10 (and its subsections) of RFC 7252.
With the exception of subscribing to resource state (which could be
added), it seems that such an approach is significantly easier to
implement and more clearly defined than what is in this document; and it
appears to provide the combined benefits of all four transports discussed
in this document. My concern here is that an explosion of transport
options makes it less likely that a client and server can find two in
common: the limit of the probability of two implementations having a
transport in common as the number of transports approaches infinity is
zero. Due to this likely decrease in interoperability, I’d expect to see
some pretty powerful motivation in here for defining a third, fourth,
fifth, and sixth way to carry CoAP when only TCP is available (I count
RFC 7252 http and https as the first and second ways in this
accounting).

I’m also a bit puzzled that CoAP already has an inherent mechanism for
blocking messages off into chunks, which this document circumvents for
TCP connections (by allowing Max-Message-Size to be increased), and then
is forced to offer remedies for the resultant head-of-line blocking
issues. If you didn’t introduce this feature, messages with a two-byte
token add six bytes of overhead for every 1024 bytes of content — less
than 0.6% size inflation. It seems like a lot of complicated machinery —
which has a built-in foot-gun that you have to warn people about misusing
— for a very tiny gain. I know it’s relatively late in the process, but
if these trade-offs haven't had a lot of discussion yet, it’s probably
worth at least giving them some additional thought.

I’ll note that the entire BERT mechanism seems to fall into the same trap
of adding extra complexity for virtually nonexistent savings. CoAP
headers are, by design, tiny. It seems like a serious over-optimization
to try to eliminate them in this fashion. In particular, you’re making
the actual implementation code larger to save a trivial number of bits on
the wire; I was under the impression that many of the implementation
environments CoAP is intended for had some serious on-chip restrictions
that would point away from this kind of additional complexity.

Specific comments follow.

Section 3.3, paragraph 3 says that an initiator may send messages prior
to receiving the remote side’s CSM, even though the message may be larger
than would be allowed by that CSM. What should the recipient of an
oversized message do in this case? In fact, I don’t see in here what a
recipient of a message larger than it allowed for in its CSM is supposed
to do in response at *any* stage of the connection. Is it an error? If
so, how do you indicate it? Or is the Max-Message-Size option just a
suggestion for the other side? This definitely needs clarification.
(Aside — it seems odd and somewhat backwards that TCP connections are
provided an affordance for fine-grained control over message sizes, while
UDP communications are not.)

Section 4.4 has a prohibition against using WebSockets keepalives in
favor of using CoAP ping/pong. Section 3.4 has no similar prohibition
against TCP keepalives, while the rationale would seem to be identical.
Is this asymmetry intentional? (I’ll also note that the presence of
keepalive mechanisms in both TCP and WebSockets would seem to make the
addition of new CoAP primitives for the same purpose unnecessary, but I
suspect this has already been debated).

Section 5 and its subsections define a new set of message types,
presumably for use only on connection-oriented protocols, although this
is only implied, and never stated. For example, some implementors may see
CSM, Ping, and Pong as potentially useful in UDP; and, finding no
prohibition in this document against using them, decide to give it a go.
Is that intended? If not, I strongly suggest an explicit prohibition
against using these in UDP contexts.

Section 5.3.2 says that implementations supporting block-wise transfers
SHOULD indicate the Block-wise Transfer Option. I can't figure out why
this is anything other than a "MUST". It seems odd that this document
would define a way to communicate this, and then choose to leave the
communicated options as “YES” and “YOUR GUESS IS AS GOOD AS MINE” rather
than the simpler and more useful “YES” and “NO”.

I find the described operation of the Custody Option in the operation of
Ping and Pong to be somewhat problematic: it allows the Pong sender to
unilaterally decide to set the Custody Option, and consequently
quarantine the Pong for an arbitrary amount of time while it processes
other operations. This seems impossible to distinguish from a
failure-due-to-timeout from the perspective of the Ping sender. Why not
limit this behavior only to Ping messages that include the Custody
Option?

[Moved from Comment to Discuss: I find the unmotivated definition of the
default port for “coaps+tcp” to 443 — a port that is already assigned to
https — to be surprising, to put it mildly. This definitely needs
motivating text, and I suspect it's actually wrong.]

I am similarly perplexed by the hard-coded “must do ALPN *unless* the
designated port takes the magical value 5684” behavior. I don’t think
I’ve ever seen a protocol that has such variation based on a hard-coded
port number, and it seems unlikely to be deployed correctly (I’m imagining
the frustration of: “I changed both the server and the client
configuration from the default port of 5684 to 49152, and it just stopped
working. Like, literally the *only* way it works is on port 5684. I've
checked firewall settings everywhere and don't see any special handling
for that port -- I just can't figure this out, and it's driving me
crazy.”). Given the nearly universal availability of ALPN in pretty much
all modern TLS libraries, it seems much cleaner to just require ALPN
support and call it done. Or *don’t* require ALPN at all and call it
done. But *changing* protocol behavior based on magic port numbers seems
like it’s going to cause a lot of operational heartburn.

The final paragraph of section 8.1 is very confusing, making it somewhat
unclear which of the three modes must be implemented on a CoAP client,
and which must be implemented on a CoAP server. Read naïvely, this sounds
like clients are required to do only one (but one of their choosing) of
these three, while servers are required to also do only one (again, of
their choosing). It seems that the chance of finding devices that could
interoperate under such circumstances is going to be relatively low: to
work together, you would have to find a client and a server that happened
to make the same implementation choice among these three. What I’m used
to in these kinds of cases is: (a) server must implement all, client can
choose to implement only one (or more), (b) client must implement all,
server can choose to implement only one (or more), or (c) client and
server must implement a specifically named lowest-common denominator, and
can negotiate up from there. Pretty much anything else (aside from
strange “everyone must implement two of three” schemes) will end up with
interop issues.

Although the document clearly expects the use of gateways and proxies
between these connection-oriented usages of CoAP and UDP-based CoAP,
Appendix A seems to omit discussion or consideration of how this
gatewaying can be performed. The following list of problems is
illustrative of this larger issue, but likely not exhaustive. (I'll note
that all of these issues evaporate if you move to a simpler scheme that
merely frames otherwise unmodified UDP CoAP messages)

Section A.1 does not indicate what gateways are supposed to do with
out-of-order notifications. The TCP side requires these to be delivered
in-order; so, does this mean that gateways observing a gap in sequence
numbers need to quarantine the newly received message so that it can
deliver the missing one first? Or does it deliver the newly-received
message and then discard the “stale” one when it arrives? I don’t think
that leaving this up to implementations is particularly advisable.

Section A.3 is a bit more worrisome. I understand the desired
optimization here, but where you reduce traffic in one direction, you run
the risk of exploding it in the other. For example, consider a coap+tcp
client connecting to a gateway that communicates with a CoAP-over-UDP
server. When that client wants to check the health of its observations,
it can send a Ping and receive a Pong that confirms that they are all
alive and well. In order to be able to send a Pong that *means* “all your
observations are alive and well,” the gateway has to verify that all the
observations are alive and well. A simple implementation of a gateway
will likely check on each observed resource individually when it gets a
Ping, and then send a Pong after it hears back about all of them. So, as
a client, I can set up, let’s say, two dozen observations through this
gateway. Then, with each Ping I send, the gateway sends two dozen checks
towards the server. This kind of message amplification attack is an
awesome way to DoS both the gateway and the server. I believe the
document needs a treatment of how UDP/TCP gateways handle notification
health checks, along with techniques for mitigating this specific
attack.

Section A.4 talks about the rather different ways of dealing with
unsubscribing from a resource. Presumably, gateways that get a reset to a
notification are expected to synthesize a new GET to deregister on behalf
of the client? Or is it okay if they just pass along the reset, and
expect the server to know that it means the same thing as a
deregistration? Without explicit guidance here, I expect server and
gateway implementors to make different choices and end up with a lack of
interop.
** There is 1 instance of too long lines in the document, the longest one
being 3 characters in excess of 72.
Alexey Melnikov
2017-05-11 12:31:58 UTC
Hi Adam,
Post by Adam Roach
Adam Roach has entered the following ballot position for
draft-ietf-core-coap-tcp-tls-08: Discuss
[snip]
Post by Adam Roach
ISSUE 2: Assignment of port 443 as default
- Widespread deployment would be damaging to the Internet or an
enterprise network for reasons of congestion control, scalability, or the
like.
I'd like to thank the authors for helping me to understand the intention
with the use of port 443 more clearly. Based on their clarifications, I
need to move my issue about assigning a default of port 443 to coaps+tcp
from my Comment into the Discuss, as it does have implications for the
Internet at large that will have long-term damaging effects.
The rationale being offered for using the already-assigned port 443
as a default is that it tends to go through firewalls that other ports
may not, and that doing so is fine because ALPN makes it possible. These
arguments, if we accept them, are manifestly true for all future
TLS-using protocols. Allowing CoAP to re-use an assigned port on this
basis establishes a precedent for pretty much all future protocols to do
so, effectively moving the protocol demux point for future protocols from
port numbers to ALPN IDs (all over port 443). It is hard to imagine an
outcome *other* *than* firewall manufacturers starting to whitelist
desired ALPN IDs, which effectively ossifies the available set of IDs to
whatever is defined at that moment, destroying the future utility of the
mechanism.
There are other issues having to do with software architecture, protocol
demultiplexing in user space rather than kernel space, and operational
considerations that come into play as well, but they don't technically
fall under discuss criteria.
I've missed use of port 443 in my review. I agree with you that this is
an issue for coaps+tcp URIs.

It is not an issue for coaps+ws URIs, which are similar to wss URIs
which also use 443 as the default.

Best Regards,
Alexey
Carsten Bormann
2017-05-11 13:10:42 UTC
Post by Alexey Melnikov
Hi Adam,
Post by Adam Roach
Adam Roach has entered the following ballot position for
draft-ietf-core-coap-tcp-tls-08: Discuss
[snip]
Post by Adam Roach
ISSUE 2: Assignment of port 443 as default
- Widespread deployment would be damaging to the Internet or an
enterprise network for reasons of congestion control, scalability, or the
like.
I'd like to thank the authors for helping me to understand the intention
with the use of port 443 more clearly. Based on their clarifications, I
need to move my issue about assigning a default of port 443 to coaps+tcp
from my Comment into the Discuss, as it does have implications for the
Internet at large that will have long-term damaging effects.
The rationale being offered for using the already-assigned port 443
as a default is that it tends to go through firewalls that other ports
may not, and that doing so is fine because ALPN makes it possible. These
arguments, if we accept them, are manifestly true for all future
TLS-using protocols. Allowing CoAP to re-use an assigned port on this
basis establishes a precedent for pretty much all future protocols to do
so, effectively moving the protocol demux point for future protocols from
port numbers to ALPN IDs (all over port 443). It is hard to imagine an
outcome *other* *than* firewall manufacturers starting to whitelist
desired ALPN IDs, which effectively ossifies the available set of IDs to
whatever is defined at that moment, destroying the future utility of the
mechanism.
There are other issues having to do with software architecture, protocol
demultiplexing in user space rather than kernel space, and operational
considerations that come into play as well, but they don't technically
fall under discuss criteria.
I've missed use of port 443 in my review. I agree with you that this is
an issue for coaps+tcp URIs.
Without having verified this with the WG:
I don’t think that the choice of the default port 443 for coaps+tcp:// is in any way essential.

We chose 443 as a matter of course -- both RFC 7301 (ALPN) and operational practice suggested it.
But I believe we can also live with the coaps:// port 5684 as the default port for coaps+tcp://.
Post by Alexey Melnikov
It is not an issue for coaps+ws URIs, which are similar to wss URIs
which also use 443 as the default.
Indeed, coaps+ws:// should keep 443 as the default (as should coap+ws:// with port 80), as they are mapped to wss:// and ws://.

Grüße, Carsten