[P4-dev] General question on P4

Andy Fingerhut andy.fingerhut at gmail.com
Sun Apr 9 21:41:01 EDT 2017

"1 packet per clock cycle _throughput_".  The whole idea of a pipelined
architecture, whether it is packet forwarding or a general purpose CPU
executing Intel x86 instructions, is to increase throughput by parallelism,
breaking up the problem into small assembly line steps that can each be
completed at a high rate.

Many packet forwarding ASICs can achieve a packet forwarding _throughput_
of 1 packet every 1 cycle, or 1 packet every 2 cycles.  If you look at any
individual packet, it might take many steps in a pipeline, each of which
takes only 1 or 2 clock cycles, but then it moves on to the next pipeline
stage and the next arriving one takes its place.  The _latency_ of an
individual packet might be dozens or hundreds of clock cycles, because of
all the pipeline stages that it must go through to finish its processing.
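As a rough illustration of that throughput/latency distinction, here is a quick sketch with made-up numbers (the 1 GHz clock and 100-stage depth are assumptions for the example, not figures for any real ASIC):

```python
# Hypothetical figures for illustration only; real pipeline depths and
# clock rates vary by ASIC and are not published in this thread.
CLOCK_HZ = 1.0e9          # assume a 1 GHz pipeline clock
PACKETS_PER_CYCLE = 1     # 1 packet per clock cycle throughput
PIPELINE_STAGES = 100     # assume ~100 stages of 1 cycle each

# Throughput depends only on how often a new packet can enter the pipeline.
throughput_pps = CLOCK_HZ * PACKETS_PER_CYCLE

# Latency depends on how many stages each packet must traverse.
latency_s = PIPELINE_STAGES / CLOCK_HZ

print(f"throughput: {throughput_pps / 1e9:.1f} Gpps")   # 1.0 Gpps
print(f"per-packet latency: {latency_s * 1e9:.0f} ns")  # 100 ns
```

The point is that the two numbers are independent: doubling the pipeline depth doubles latency but leaves throughput untouched.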

Yes, you can always write a program for such an ASIC that requires more
processing than the ASIC can complete within the pipeline depth it was
designed with.  The hope is to create an ASIC with a pipeline deep enough
for most practical purposes, not for everything one can imagine.  I have
heard Barefoot engineers mention during a January 2017 Networking Field
Day [1] that the performance of their ASICs is known and deterministic,
_if the program fits in the chip_.

If a program doesn't fit in the chip, then its performance is 0 for that chip.
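The line-rate arithmetic from my earlier message quoted below (65 x 100GE ports, 4 to 6 billion packets/sec) can be sanity-checked with a short script.  This is a back-of-envelope check of the quoted figures, not a vendor specification:

```python
# Sanity check of the line-rate arithmetic quoted below; the port count
# and packet rates are the figures from this thread.
ports = 65
port_gbps = 100
total_bps = ports * port_gbps * 1e9   # 6.5 Tb/s aggregate line rate

for pps in (4e9, 6e9):
    # Average packet size that fully utilizes the aggregate line rate
    # at this forwarding rate.
    avg_pkt_bytes = total_bps / (pps * 8)
    print(f"{pps / 1e9:.0f} Gpps -> avg packet size {avg_pkt_bytes:.0f} bytes")
```

This raw division gives roughly 203 bytes at 4 Gpps (matching the ~200-byte figure quoted below) and roughly 135 bytes at 6 Gpps; subtracting per-packet Ethernet framing overhead (preamble plus inter-packet gap) brings that second figure close to the quoted 111 bytes.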



On Sun, Apr 9, 2017 at 5:58 PM, Michael Borokhovich <michaelbor at gmail.com>
wrote:

> Hi Andy,
> Regarding the "1 packet per clock cycle". It depends on the program,
> right? What if the pipeline includes several field extraction, traversing
> of several flow tables and then modification of some fields?
> So it looks like the P4 approach is more efficient than a general NPU since it
> provides a limited set of abstractions in the hardware that limits
> flexibility but improves performance. The NPU gives maximum flexibility at
> the price of performance. If this is the main differentiator, this makes
> very good sense to me.
> Thanks,
> Michael.
> On Sat, Apr 8, 2017 at 3:39 PM, Andy Fingerhut <andy.fingerhut at gmail.com>
> wrote:
>> I haven't seen published packets/sec numbers for Tofino, either, but they
>> do publish 65 x 100GE interfaces on their fastest device, and they
>> published that there are 4 independent pipelines, each most likely handling
>> 1/4 of those ports.  I know that their architecture can achieve 1 packet
>> per clock cycle throughput, and clock rates in the 1 GHz to 1.5 GHz range
>> are definitely achievable, if not even higher.  Conservatively that puts
>> them at 4 billion packets per second, which at 65 x 100GE ports would
>> handle an average packet size of 200 bytes at line rate.  6 billion packets
>> per second would drive down the average packet size to 111 bytes.  Their
>> fastest device is likely somewhere in that 4 to 6 billion packets/sec range.
>> Andy
>> On Sat, Apr 8, 2017 at 11:48 AM, Michael Borokhovich <
>> michaelbor at gmail.com> wrote:
>>> Hi Andy,
>>> Thank you for the insight. If this is the performance difference, then
>>> of course the advantage of P4 ASIC (e.g., Tofino) is obvious. I see that
>>> EZchip NP5 supports 300 million packets per second. But I didn't find a
>>> similar spec for Tofino. Also, this comparison should be done for
>>> comparable programs since each additional piece of functionality
>>> (parsing/modifying an additional header field or doing an additional table
>>> search) affects this pps metric.
>>> But again, if Tofino indeed achieves ~10 times more pps than e.g.,
>>> EZchip NP5 for the same program, then I clearly see the benefit and the
>>> novelty.
>>> Michael.
>>> On Fri, Apr 7, 2017 at 5:52 PM, Andy Fingerhut <andy.fingerhut at gmail.com
>>> > wrote:
>>>> In case it isn't obvious, max packet rate that you can achieve in an
>>>> ASIC turns into a significant difference in cost when buying the equipment
>>>> and paying the power bill for a network.
>>>> Suppose you have a choice of a programmable ASIC that goes at 2 billion
>>>> packets per second, and an NPU that goes up to 200 million packets per
>>>> second, and they both cost roughly the same amount and consume the same
>>>> power.
>>>> You have some part of a data center connecting a bunch of hosts
>>>> together where you decide that kind of programmability is important.  You
>>>> do some calculations to determine those hosts need 200 billion packets per
>>>> second of forwarding capacity between them.
>>>> Do you want to buy and provide power for 200/2 = 100 fast programmable
>>>> ASICs, or 200/.2 = 1,000 programmable NPUs?
>>>> Andy
>>>> On Fri, Apr 7, 2017 at 2:37 PM, Andy Fingerhut <
>>>> andy.fingerhut at gmail.com> wrote:
>>>>> I don't have experience with all NPUs, but many I have seen top out on
>>>>> the order of hundreds of millions of packets per second with current
>>>>> technology.
>>>>> With the same current technology, it is possible to design fixed
>>>>> function ASICs, and programmable ASICs like Barefoot's Tofino, that achieve
>>>>> billions of packets per second.
>>>>> The main difference that I am aware of is that many NPUs are based on
>>>>> parallel arrays of 32-bit or 64-bit processor cores, and each core requires
>>>>> many cycles for things like constructing table search keys and performing
>>>>> side effects on the 'packet vector' (state maintained while forwarding the
>>>>> packet about that packet only).  If you want to go at billions of packets
>>>>> per second, the only way I know to get there is to have fixed or
>>>>> configurable hardware that can do those things in 1 or 2 clock cycles per
>>>>> packet.
>>>>> You can write a compiler that compiles a P4 program to run on an NPU
>>>>> as described above, and it will achieve portability of the P4 program, but
>>>>> it won't make that NPU able to go at billions of packets per second.  It is
>>>>> limited in performance by its hardware architecture.
>>>>> There are proprietary methods for programming some ASICs that can go
>>>>> at billions of packets per second, but all that I know of are lower level
>>>>> than P4 and non-portable.
>>>>> Andy
>>>>> On Thu, Apr 6, 2017 at 6:37 PM, Michael Borokhovich <
>>>>> michaelbor at gmail.com> wrote:
>>>>>> Hi Remy,
>>>>>> I'm not confusing hardware with the language... What I mean is that
>>>>>> P4 + ASIC that supports it claims to give us programmable data-plane and
>>>>>> this is claimed to be the innovation. But that is exactly the purpose of
>>>>>> NPUs - to give us programmable data-plane and NPUs are around for a very
>>>>>> long time. So maybe I'm missing the point of innovation that P4 + ASIC that
>>>>>> supports it gives. As Nate said, and I agree, one big advantage is
>>>>>> portability and the other - ability to do verification.
>>>>>> So, P4 brings kind of an open standard for programmable ASICs which
>>>>>> is analogous to a programming language (e.g., C) for regular CPUs. While
>>>>>> each NPU currently has its own language and a programming style.
>>>>>> What do you think?
>>>>>> Thanks,
>>>>>> Michael.
>>>>>> On Thu, Apr 6, 2017 at 2:07 PM, Remy Chang <remy at barefootnetworks.com
>>>>>> > wrote:
>>>>>>> Hi Michael,
>>>>>>> It seems you're conflating hardware with language.  NPU,
>>>>>>> programmable ASIC, general purpose CPU, and even GPU can all potentially
>>>>>>> execute P4 code.
>>>>>>> Regards,
>>>>>>> Remy
>>>>>>> On Apr 6, 2017 10:57, "Michael Borokhovich" <michaelbor at gmail.com>
>>>>>>> wrote:
>>>>>>> Thanks for the reply Nate!
>>>>>>> So, to summarize, the benefits of P4 approach are: portability and
>>>>>>> performance. Other than that you probably can achieve the same (if not
>>>>>>> better) flexibility/programmability with an NPU. Is this correct?
>>>>>>> On Thu, Apr 6, 2017 at 1:01 AM, Nate Foster <jnfoster at cs.cornell.edu
>>>>>>> > wrote:
>>>>>>>> Your question seems to be more about the relative merits of various
>>>>>>>> architectures than the P4 language. But yes an ASIC is generally more
>>>>>>>> efficient than an NPU, at least at scale.
>>>>>>>> Beyond efficiency there are other benefits to expressing a data
>>>>>>>> plane algorithm in an open framework like P4. For example, a P4 program
>>>>>>>> should be relatively easy to port to a different target. The same is
>>>>>>>> unlikely to be true for C programs written against closed SDKs.
>>>>>>>> -N
>>>>>>>> On Wed, Apr 5, 2017 at 6:59 PM, Michael Borokhovich <
>>>>>>>> michaelbor at gmail.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>> P4 allows for configurable data-plane, e.g., we can easily support
>>>>>>>>> new custom protocols. However, the same functionality may be achieved by
>>>>>>>>> using a network processor, e.g., EZchip (the one I had experience with).
>>>>>>>>> As I understand, the advantages of a programmable ASIC/FPGA that
>>>>>>>>> supports P4 are better performance and a lower price than a network
>>>>>>>>> processor?
>>>>>>>>> What do you think?
>>>>>>>>> Thanks!
>>>>>>>>> Michael.
>>>>>>>>> _______________________________________________
>>>>>>>>> P4-dev mailing list
>>>>>>>>> P4-dev at lists.p4.org
>>>>>>>>> http://lists.p4.org/mailman/listinfo/p4-dev_lists.p4.org
