[P4-dev] Mapping between egress port and egress pipeline

Andy Fingerhut andy.fingerhut at gmail.com
Wed Jul 19 12:28:06 EDT 2017

The packet buffer in the P4_14 switch architecture figures is
implementation-specific, but a typical implementation in a device would be
that the packet buffer storage is shared across output ports, i.e. shared
across queues for different output ports.  Thus if at a particular moment
in time, the queues for most of N output ports are short, but one is long,
the one long queue would be allowed to use more than 1/N fraction of the
packet buffer storage space.  Typically there would be some amount of
packet buffer storage less than a fraction 1/N, e.g. maybe 1/(2N), that would
be reserved for each output port, and the remainder would be shareable on a
first-come, first-served basis.  Which configuration options a packet buffer
offers in this regard is _not_ specified by any P4 'standards'
document -- it is device-dependent.
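
To make the reserved-plus-shared idea concrete, here is a rough Python
sketch of one possible admission policy.  Everything in it (the class name,
counting in fixed-size "cells", the numbers) is made up for illustration and
is not taken from any real device or from any P4 specification.

```python
class SharedBuffer:
    """Toy model: each port has a reserved quota; the rest is a shared pool."""

    def __init__(self, total_cells, num_ports, reserved_per_port):
        assert reserved_per_port * num_ports <= total_cells
        self.reserved_per_port = reserved_per_port
        # Whatever is not reserved is shared first-come, first-served.
        self.shared_free = total_cells - reserved_per_port * num_ports
        self.used = [0] * num_ports  # cells held by each port's queues

    def admit(self, port, cells=1):
        """Account for a packet and return True if it fits, else False."""
        reserved_left = max(0, self.reserved_per_port - self.used[port])
        from_shared = max(0, cells - reserved_left)
        if from_shared > self.shared_free:
            return False  # would be dropped (or paused) in a real device
        self.shared_free -= from_shared
        self.used[port] += cells
        return True

    def release(self, port, cells=1):
        """Give cells back when a packet leaves the port's queue."""
        over_before = max(0, self.used[port] - self.reserved_per_port)
        self.used[port] -= cells
        over_after = max(0, self.used[port] - self.reserved_per_port)
        self.shared_free += over_before - over_after


# N = 4 ports, 80 cells total, 1/(2N) of the buffer = 10 cells reserved each.
buf = SharedBuffer(total_cells=80, num_ports=4, reserved_per_port=10)
long_queue = sum(buf.admit(port=0) for _ in range(100))
print(long_queue)         # 50: 10 reserved + 40 shared, more than 80/4 = 20
print(buf.admit(port=1))  # True: port 1 still has its reserved cells
```

The one long queue gets well over a 1/N share, yet every other port keeps a
guaranteed minimum.  Real devices may use quite different schemes, e.g.
dynamic thresholds.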

Whether a device has one egress pipeline or multiple egress pipelines is
also an implementation detail that can change from one P4 device
implementation to another.  Barefoot has published [1] that their 6.5 Tbps
Tofino ASIC has 4 ingress pipelines and 4 egress pipelines, where each of
those pipelines is shared by 1/4 of the Ethernet ports on the device.

One could imagine a device that has a separate egress pipeline per output
port, but I don't know if anyone has built one.  A potential downside of
such separate egress pipelines is that any tables you access in the egress
pipeline might have to be separate physical copies, one per pipeline.  For a
resource that is never modified by the P4 program, that is an extra cost,
but it doesn't hurt
the behavior of your program.  For a resource that is modified by the P4
program itself, e.g. P4 registers that your P4 program writes, now you need
to be aware that there are separate such sets of P4 registers for different
subsets of output ports, each doing their writes to independent physical
registers, not sharing their modifications with each other.  That might
affect what kind of features you can implement in P4.
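
A small Python sketch of that effect, assuming a hypothetical device with 16
ports and 4 egress pipelines, where each pipeline holds its own physical copy
of a register array.  The port-to-pipeline mapping and all names are invented
for illustration.

```python
NUM_PORTS = 16
NUM_PIPES = 4
PORTS_PER_PIPE = NUM_PORTS // NUM_PIPES
REGISTER_SIZE = 8


def egress_pipe_for_port(port):
    # Hypothetical mapping: contiguous blocks of ports share a pipeline.
    return port // PORTS_PER_PIPE


# One independent copy of the "same" P4 register array per egress pipeline.
pkt_count = [[0] * REGISTER_SIZE for _ in range(NUM_PIPES)]


def egress_count(port, index):
    """Model of egress code doing a read-modify-write on a register."""
    pipe = egress_pipe_for_port(port)
    pkt_count[pipe][index] += 1
    return pkt_count[pipe][index]


print(egress_count(port=0, index=3))  # 1
print(egress_count(port=1, index=3))  # 2: same pipeline, same copy
print(egress_count(port=4, index=3))  # 1: different pipeline, separate copy
```

The P4 program sees one register named pkt_count, but no single copy holds
the device-wide total.  Something outside the data plane would have to sum
the per-pipeline copies.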

I am not aware of any differences between the concept of a queue in
OpenFlow and one in P4.  For the reasons I mentioned in an earlier email,
dedicating particular queues for packets going to particular output ports
makes it possible to implement a work-conserving system -- violating that
property means you either give up work conserving behavior, or achieve it
in some more complex way that I can't easily imagine.
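
As a back-of-the-envelope illustration of that earlier argument, here is a
Python sketch with invented helper names.  The first function is the
arithmetic a per-port scheduler needs to avoid idle time on the link (it
ignores preamble and inter-frame gap).  The second shows which ports are left
with nothing to send if egress processing could rewrite the output port
after scheduling.

```python
def serialization_time_ns(num_bytes, link_gbps):
    """Time in ns to transmit num_bytes on a link of link_gbps Gbit/s.

    1 Gbit/s is 1 bit per nanosecond, so bits / Gbps gives nanoseconds.
    """
    return num_bytes * 8 / link_gbps


def idle_ports(scheduled_ports, final_ports):
    """Ports that were scheduled a packet but end up with none to send."""
    return sorted(set(scheduled_ports) - set(final_ports))


# A 1500-byte packet on a 100 Gbit/s port occupies the link for 120 ns, so
# the scheduler should pick the next packet for that port about 120 ns later.
print(serialization_time_ns(1500, 100))  # 120.0

# Packets scheduled for ports 1..10, all rewritten to port 5 in egress:
print(idle_ports(range(1, 11), [5] * 10))  # [1, 2, 3, 4, 6, 7, 8, 9, 10]
```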

At least for the bmv2 software P4 emulator, enq_qdepth is documented to be "the
depth of the queue when the packet was first enqueued" [2].  I would expect
that this would be the case for other P4 implementations that implement
this feature, but I don't have enough knowledge of all P4 implementations
to say for sure.  The Portable Switch Architecture (PSA) document has not
been published yet, but my hope is that it will define this kind of feature
for devices that claim they are compliant with the PSA.
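
To illustrate the distinction being asked about, here is a minimal Python
model in which enq_qdepth is the depth of the one queue the packet joins, not
the occupancy of the whole packet buffer.  The class is invented for
illustration.  Measuring depth in packets, and sampling it just before the
packet is added, are assumptions of this sketch.

```python
from collections import deque


class PortQueues:
    """Toy traffic manager with one FIFO queue per output port."""

    def __init__(self, num_ports):
        self.queues = [deque() for _ in range(num_ports)]

    def enqueue(self, port, packet):
        q = self.queues[port]
        # Depth of this port's queue at enqueue time, in packets.
        meta = {"egress_port": port, "enq_qdepth": len(q)}
        q.append((packet, meta))
        return meta

    def dequeue(self, port):
        """Return (packet, metadata) for the head of the port's queue."""
        return self.queues[port].popleft()


tm = PortQueues(num_ports=2)
print(tm.enqueue(0, "a")["enq_qdepth"])  # 0
print(tm.enqueue(0, "b")["enq_qdepth"])  # 1
print(tm.enqueue(1, "c")["enq_qdepth"])  # 0: port 1's queue is still empty
```

The metadata travels with the packet, so egress code would see the value
captured at enqueue time, regardless of how the queue changed while the
packet waited.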


in particular the talk on the hardware by Dan Lenoski

On Tue, Jul 18, 2017 at 11:48 PM, Eric Ruan <ruanweizhang at gmail.com> wrote:

> Hi Andy,
> as a follow-up to my last email, I wonder whether I understand the queue
> correctly. Is the queue in P4 different from the queue in OpenFlow? In
> OpenFlow, a queue is attached to a specific output port,
> whereas in a P4 switch the queue seems like one huge buffer. So when I read
> the intrinsic metadata *enq_qdepth*, am I reading the depth/size of this
> buffer, or the depth of the queue that is attached to a specific
> output port?
> Thanks in advance.
> Best,
> Eric
> 2017-07-19 14:13 GMT+08:00 Eric Ruan <ruanweizhang at gmail.com>:
>> Hi Andy,
>> I have one more question. As far as I know, queueing happens between the
>> ingress pipeline and the egress pipeline, and each queue is dedicated to a
>> specific output port, though each output port may have more than one queue.
>> So is it true that each packet is processed by the ingress pipeline and then
>> routed/replicated to different queues, and that each queue has its own egress
>> pipeline? After packets are processed by the egress pipeline, they go
>> out from the specific output port or recirculate. The number of egress
>> pipelines is the same as the number of queues. Do I understand correctly?
>> Thanks in advance.
>> Best,
>> Eric
>> 2017-07-18 22:59 GMT+08:00 Eric Ruan <ruanweizhang at gmail.com>:
>>> Hi Andy,
>>> many thanks for your detailed explanation, it helps a lot!
>>> Best,
>>> Eric
>>> 2017-07-18 22:36 GMT+08:00 Andy Fingerhut <andy.fingerhut at gmail.com>:
>>>> Eric:
>>>> I will give my understanding for why it would be bad if a P4 program
>>>> for a switch ASIC that follows the switch architecture described in the
>>>> P4_14 spec [1] could change the egress port in the egress control block
>>>> (and for P4_16, the Portable Switch Architecture (PSA) is planned to be
>>>> similar in many ways to the architecture in the P4_14 spec).  Hopefully
>>>> others can jump in and add other reasons, if I am missing anything
>>>> significant.
>>>> It is usually desirable for a switch to be work conserving [2] on each
>>>> of its output ports.  That is, if the switch contains at least one packet
>>>> that is finished processing, and destined for output port X, then it should
>>>> already be transmitting a packet on output port X, or should start to do so
>>>> very soon.
>>>> If there was no egress processing at all, then packets would be
>>>> transmitted soon after they were scheduled from the queue for that output
>>>> port.  The hardware scheduler for output port X could monitor that link,
>>>> and know that if the last packet it scheduled from the output port X queue
>>>> was N bytes long, and that output port transmits at 100 gigabits/second,
>>>> for example, it can easily calculate when it needs to choose another packet
>>>> to transmit out port X, leaving no idle time between packets.  If it
>>>> schedules the next packet too soon, then another buffer is needed before
>>>> the output port to store the packet somewhere, waiting for port X to be
>>>> finished with the previous packet.  If it schedules the next packet too
>>>> late, then port X will go idle for a time, and the switch is not work
>>>> conserving.
>>>> All of the description above assumes that there are one or more queues
>>>> containing packets, all of which are known to be destined for output port
>>>> X, and this choice of output port will not change after the packet has been
>>>> chosen from that output port's queue.
>>>> If egress processing can change that output port selection, then there
>>>> is no way to make the system work conserving.  For example, the scheduler
>>>> might schedule packets that ingress processing specified will go to output
>>>> ports 1 through 10, but if egress processing changes them all to output
>>>> port 5, then all of those packets except the first to be transmitted need
>>>> to be buffered somewhere, and all ports except 5 will be idle until the
>>>> scheduler chooses a packet that goes to them.
>>>> Could you have another set of queues and a big packet buffer after
>>>> egress processing?  Sure, I can imagine a switch ASIC designed like that.
>>>> However, even then, if some other processing can change the output port
>>>> after _that_ packet buffer's scheduler, then the switch cannot achieve work
>>>> conserving behavior.
>>>> [1] https://p4lang.github.io/p4-spec/
>>>> [2] https://en.wikipedia.org/wiki/Work-conserving_scheduler
>>>> On Tue, Jul 18, 2017 at 3:35 AM, Eric Ruan <ruanweizhang at gmail.com>
>>>> wrote:
>>>>> Dear Antonin, Andy and all,
>>>>> I wonder whether the mapping between egress port and egress pipeline is
>>>>> 1:1 or many:1.
>>>>> To the best of my knowledge, the egress port is the output port of the
>>>>> switch. The egress port has to be set in the ingress pipeline before the
>>>>> packet is routed to the egress pipeline.
>>>>> If 1:1 is the case, then it makes sense that the egress port cannot be
>>>>> changed in the egress pipeline. But what is the design principle behind
>>>>> this? It seems like a waste of resources.
>>>>> If many:1 is the case, then why is it not allowed to change the egress
>>>>> port in the egress pipeline?
>>>>> Thanks in advance.
>>>>> Best,
>>>>> Eric
>>> --
>>> 阮偉章
>>> Eric Yuen
>>> PhD Program
>>> Institute of Network Engineering
>>> National Chiao Tung University
>>> Room 616, Engineering Building 3 (EC616)
>>> Email: ruanweizhang at gmail.com
>>> <http://ruanweizhang@gmail.com/619026859@qq.com>