[P4-dev] Mapping between egress port and egress pipeline

Eric Ruan ruanweizhang at gmail.com
Wed Jul 19 13:08:13 EDT 2017

Andy, you answered all the questions I had in mind. Thanks very much, bro.


2017-07-20 0:28 GMT+08:00 Andy Fingerhut <andy.fingerhut at gmail.com>:

> The packet buffer in the P4_14 switch architecture figures is
> implementation-specific, but a typical implementation in a device would be
> that the packet buffer storage is shared across output ports, i.e. shared
> across queues for different output ports.  Thus if at a particular moment
> in time, the queues for most of N output ports are short, but one is long,
> the one long queue would be allowed to use more than 1/N fraction of the
> packet buffer storage space.  Typically there would be some amount of
> packet buffer storage less than a fraction 1/N, e.g. maybe 1/2N, that would
> be reserved for each output port, and the remainder would be shareable on a
> first-come, first-served basis.  What the configuration options are for a
> packet buffer in this regard is _not_ specified by any P4 'standards'
> document -- it is device-dependent.
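
The reserved-plus-shared policy described above can be sketched as follows. This is only a toy model under stated assumptions: the class name `SharedBuffer`, the unit "cells", and the drop-on-full admission rule are all hypothetical, and the 1/2N reservation is just the example figure from the paragraph, not the behavior of any real device.

```python
from dataclasses import dataclass, field


@dataclass
class SharedBuffer:
    """Toy packet buffer: a reserved slice per port plus a shared pool."""
    total_cells: int
    num_ports: int
    used: dict = field(init=False)  # port -> cells currently held

    def __post_init__(self):
        # Reserve 1/(2N) of the buffer per port, as in the "maybe 1/2N" example.
        self.reserved = self.total_cells // (2 * self.num_ports)
        self.shared_total = self.total_cells - self.reserved * self.num_ports
        self.used = {p: 0 for p in range(self.num_ports)}

    def _shared_use(self, cells_held: int) -> int:
        # Cells held beyond the reservation come out of the shared pool.
        return max(0, cells_held - self.reserved)

    def admit(self, port: int, cells: int) -> bool:
        """Accept a packet only if the shared pool can cover any overflow."""
        shared_now = sum(self._shared_use(u) for u in self.used.values())
        new_held = self.used[port] + cells
        extra = self._shared_use(new_held) - self._shared_use(self.used[port])
        if shared_now + extra > self.shared_total:
            return False  # drop: reservation exhausted and shared pool is full
        self.used[port] = new_held
        return True


buf = SharedBuffer(total_cells=800, num_ports=4)
long_queue_ok = buf.admit(port=0, cells=500)   # more than 1/N = 200 cells
overflow_ok = buf.admit(port=0, cells=1)       # shared pool is now exhausted
other_port_ok = buf.admit(port=1, cells=100)   # still fits in port 1's reservation
```

In this sketch one long queue can take well over a 1/N share while the other queues are short, yet every other port can still fill its own reservation.
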
> Whether a device has one egress pipeline or multiple egress pipelines is
> also an implementation detail that can change from one P4 device
> implementation to another.  Barefoot has published [1] that their 6.5 Tbps
> Tofino ASIC has 4 ingress pipelines and 4 egress pipelines, where each of
> those pipelines is shared by 1/4 of the Ethernet ports on the device.
> One could imagine a device that has a separate egress pipeline per output
> port, but I don't know if anyone has built one.  A potential downside of
> such separate egress pipelines is that any tables you access in the egress
> pipeline might have to be separate copies, one per pipeline.  For a resource
> that is never modified by the P4 program, that is an extra cost, but doesn't hurt
> the behavior of your program.  For a resource that is modified by the P4
> program itself, e.g. P4 registers that your P4 program writes, now you need
> to be aware that there are separate such sets of P4 registers for different
> subsets of output ports, each doing their writes to independent physical
> registers, not sharing their modifications with each other.  That might
> affect what kind of features you can implement in P4.
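
To make the register point concrete, here is a minimal sketch of what independent per-pipeline register sets would look like from the program's point of view. The port-to-pipeline mapping (`port // ports_per_pipeline`), the port numbers, and the class name are assumptions made for illustration only; they do not describe any real ASIC.

```python
class EgressPipeline:
    """Toy egress pipeline that owns its own physical register array."""

    def __init__(self, num_registers):
        self.registers = [0] * num_registers

    def process(self, index):
        # Stand-in for a P4 register write done by the egress control block.
        self.registers[index] += 1


def pipeline_for_port(port, ports_per_pipeline):
    # Hypothetical mapping: consecutive ports share one pipeline.
    return port // ports_per_pipeline


# Assume 4 pipelines with 16 output ports each.
pipelines = [EgressPipeline(num_registers=1) for _ in range(4)]
for port in [0, 1, 16, 17, 33]:
    pipelines[pipeline_for_port(port, 16)].process(index=0)

# Each pipeline only sees the packets that went through it.
per_pipeline_counts = [p.registers[0] for p in pipelines]
total_packets = sum(per_pipeline_counts)
```

A program that wants a device-wide packet count at register index 0 would not find it in any single pipeline; something outside the data plane would have to read and add the per-pipeline values.
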
> I am not aware of any differences between the concept of a queue in
> OpenFlow and one in P4.  For the reasons I mentioned in an earlier email,
> dedicating particular queues for packets going to particular output ports
> makes it possible to implement a work-conserving system -- violating that
> property means you either give up work conserving behavior, or achieve it
> in some more complex way that I can't easily imagine.
> At least for the bmv2 software P4 emulator, enq_qdepth is documented to be
> "the depth of the queue when the packet was first enqueued" [2].  I would
> expect that this would be the case for other P4 implementations that
> implement this feature, but I don't have enough knowledge of all P4
> implementations to say for sure.  The Portable Switch Architecture (PSA)
> document has not been published yet, but my hope is that it will define
> this kind of feature for devices that claim they are compliant with the PSA.
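
A small sketch of the per-output-port reading of enq_qdepth quoted above. The class `PortQueues` and the metadata dictionary are hypothetical, and two details are assumptions rather than facts from the documentation: that depth is counted in packets, and that it is sampled just before the packet is added.

```python
from collections import deque


class PortQueues:
    """Toy traffic manager with one FIFO queue per output port."""

    def __init__(self, num_ports):
        self.queues = [deque() for _ in range(num_ports)]

    def enqueue(self, port, packet):
        # Assumption: depth is sampled before this packet is added,
        # and is counted in packets, for this port's queue only.
        meta = {"egress_port": port, "enq_qdepth": len(self.queues[port])}
        self.queues[port].append((packet, meta))
        return meta


tm = PortQueues(num_ports=2)
first = tm.enqueue(0, "pkt-a")
second = tm.enqueue(0, "pkt-b")
other = tm.enqueue(1, "pkt-c")
```

Here the value recorded for `pkt-c` reflects only the queue for port 1, not the total occupancy of the shared buffer.
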
> Andy
> [1] http://techfieldday.com/appearance/barefoot-networks-presents-at-networking-field-day-14/
> in particular the talk on the hardware by Dan Lenoski
> [2] https://github.com/p4lang/behavioral-model/blob/master/docs/simple_switch.md
> On Tue, Jul 18, 2017 at 11:48 PM, Eric Ruan <ruanweizhang at gmail.com>
> wrote:
>> Hi Andy,
>> As a follow-up to my last email, I wonder whether I understand the queue
>> correctly. Is the queue in P4 different from the queue in OpenFlow? In
>> OpenFlow, a queue is attached to a specific output port.
>> Meanwhile, in a P4 switch, the queue seems like one huge buffer. So when I
>> read the intrinsic metadata *enq_qdepth*, am I reading the depth/size of
>> this buffer, or the depth of the queue attached to a specific
>> output port?
>> Thanks in advance.
>> Best,
>> Eric
>> 2017-07-19 14:13 GMT+08:00 Eric Ruan <ruanweizhang at gmail.com>:
>>> Hi Andy,
>>> I have one more question. As far as I know, queueing happens between the
>>> ingress pipeline and the egress pipeline, and each queue is dedicated to a
>>> specific output port, though each output port may have more than one queue.
>>> So is it true that each packet is processed by the ingress pipeline and then
>>> routed/replicated to different queues, and that each queue has its own egress
>>> pipeline? After packets are processed by the egress pipeline, they go
>>> out of the specific output port or recirculate. The number of egress
>>> pipelines is the same as the number of queues. Do I understand correctly?
>>> Thanks in advance.
>>> Best,
>>> Eric
>>> 2017-07-18 22:59 GMT+08:00 Eric Ruan <ruanweizhang at gmail.com>:
>>>> Hi Andy,
>>>> many thanks for your detailed explanation, it helps a lot!
>>>> Best,
>>>> Eric
>>>> 2017-07-18 22:36 GMT+08:00 Andy Fingerhut <andy.fingerhut at gmail.com>:
>>>>> Eric:
>>>>> I will give my understanding for why it would be bad if a P4 program
>>>>> for a switch ASIC that follows the switch architecture described in the
>>>>> P4_14 spec [1] could change the egress port in the egress control block
>>>>> (and for P4_16, the Portable Switch Architecture (PSA) is planned to be
>>>>> similar in many ways to the architecture in the P4_14 spec).  Hopefully
>>>>> others can jump in and add other reasons, if I am missing anything
>>>>> significant.
>>>>> It is usually desirable for a switch to be work conserving [2] on each
>>>>> of its output ports.  That is, if the switch contains at least one packet
>>>>> that is finished processing, and destined for output port X, then it should
>>>>> already be transmitting a packet on output port X, or should start to do so
>>>>> very soon.
>>>>> If there was no egress processing at all, then packets would be
>>>>> transmitted soon after they were scheduled from the queue for that output
>>>>> port.  The hardware scheduler for output port X could monitor that link,
>>>>> and know that if the last packet it scheduled from the output port X queue
>>>>> was N bytes long, and that output port transmits at 100 gigabits/second,
>>>>> for example, it can easily calculate when it needs to choose another packet
>>>>> to transmit out port X, leaving no idle time between packets.  If it
>>>>> schedules the next packet too soon, then another buffer is needed before
>>>>> the output port to store the packet somewhere, waiting for port X to be
>>>>> finished with the previous packet.  If it schedules the next packet too
>>>>> late, then port X will go idle for a time, and the switch is not work
>>>>> conserving.
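
The timing calculation described above is simple arithmetic. The sketch below uses hypothetical function names and, as a simplifying assumption, ignores per-packet overhead on the wire such as the Ethernet preamble and inter-packet gap.

```python
def transmit_time_ns(packet_bytes, link_gbps):
    """Time to serialize a packet on the link, in nanoseconds.

    Bits divided by (gigabits per second) gives nanoseconds directly.
    """
    return packet_bytes * 8 / link_gbps


def next_schedule_time_ns(start_ns, packet_bytes, link_gbps):
    # Schedule the next packet exactly when the previous one finishes:
    # earlier would need extra buffering, later would leave the port idle.
    return start_ns + transmit_time_ns(packet_bytes, link_gbps)


t_1500 = transmit_time_ns(1500, 100)          # 12000 bits at 100 Gb/s
t_next = next_schedule_time_ns(1000, 1500, 100)
```

Under these assumptions a 1500-byte packet occupies a 100 gigabits/second port for 120 ns, so the scheduler has that long to pick the next packet for the same port.
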
>>>>> All of the description above assumes that there are one or more queues
>>>>> containing packets, all of which are known to be destined for output port
>>>>> X, and that this choice of output port will not change after the packet has
>>>>> been chosen from that output port's queue.
>>>>> If egress processing can change that output port selection, then there
>>>>> is no way to make the system work conserving.  For example, the scheduler
>>>>> might schedule packets that ingress processing specified will go to output
>>>>> ports 1 through 10, but if egress processing changes them all to output
>>>>> port 5, then all of those packets except the first to be transmitted need
>>>>> to be buffered somewhere, and all ports except 5 will be idle until the
>>>>> scheduler chooses a packet that goes to them.
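
The ports 1 through 10 example can be played out in a few lines. This is a toy model with hypothetical names, assuming the scheduler releases one packet per port in a time slot and each port can transmit at most one packet in that slot.

```python
def one_slot(scheduled_ports, rewrite):
    """Return (packets that must be buffered, ports left idle) for one slot.

    scheduled_ports: the port each scheduled packet was queued for.
    rewrite: function egress processing applies to each packet's port.
    """
    after_egress = [rewrite(p) for p in scheduled_ports]
    transmitting = set(after_egress)  # at most one packet per port per slot
    must_buffer = len(after_egress) - len(transmitting)
    idle = sorted(set(scheduled_ports) - transmitting)
    return must_buffer, idle


ports = list(range(1, 11))

# Egress leaves the port alone: nothing is buffered, no port is idle.
unchanged = one_slot(ports, lambda p: p)

# Egress redirects every packet to port 5.
redirected = one_slot(ports, lambda p: 5)
```

In the redirected case nine packets have nowhere to go and the nine ports other than 5 sit idle for the slot, even though the scheduler did its job correctly.
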
>>>>> Could you have another set of queues and a big packet buffer after
>>>>> egress processing?  Sure, I can imagine a switch ASIC designed like that.
>>>>> However, even then, if some other processing can change the output port
>>>>> after _that_ packet buffer's scheduler, then the switch cannot achieve work
>>>>> conserving behavior.
>>>>> [1] https://p4lang.github.io/p4-spec/
>>>>> [2] https://en.wikipedia.org/wiki/Work-conserving_scheduler
>>>>> On Tue, Jul 18, 2017 at 3:35 AM, Eric Ruan <ruanweizhang at gmail.com>
>>>>> wrote:
>>>>>> Dear Antonin, Andy and all,
>>>>>> I wonder whether the mapping between egress port and egress pipeline is 1:1
>>>>>> or many:1.
>>>>>> To the best of my knowledge, the egress port is the output port of the
>>>>>> switch. The egress port has to be set in the ingress pipeline before the
>>>>>> packet is routed to the egress pipeline.
>>>>>> If 1:1 is the case, then it makes sense that the egress port cannot
>>>>>> be changed in the egress pipeline. But what is the design principle behind
>>>>>> this? It seems like a waste of resources.
>>>>>> If many:1 is the case, then why is it not allowed to change the
>>>>>> egress port in the egress pipeline?
>>>>>> Thanks in advance.
>>>>>> Best,
>>>>>> Eric
>>>> --
>>>> 阮偉章
>>>> Eric Yuen
>>>> PhD Program
>>>> Institute of Network Engineering
>>>> National Chiao Tung University
>>>> Engineering Building 3, Room 616 (EC616)
>>>> Email: ruanweizhang at gmail.com

Eric Yuen
PhD Program
Institute of Network Engineering
National Chiao Tung University

Email: ruanweizhang at gmail.com