[P4-dev] General question on P4

Michael Borokhovich michaelbor at gmail.com
Tue Apr 11 09:29:04 EDT 2017

Andy, thanks a lot for the detailed and clear explanation. It really helps!
Also, thank you for the reference!

On Sun, Apr 9, 2017 at 9:41 PM, Andy Fingerhut <andy.fingerhut at gmail.com>
wrote:

> "1 packet per clock cycle _throughput_".  The whole idea of a pipelined
> architecture, whether it is packet forwarding or a general purpose CPU
> executing Intel x86 instructions, is to increase throughput by parallelism,
> breaking up the problem into small assembly line steps that can each be
> completed at a high rate.
> Many packet forwarding ASICs can achieve a packet forwarding _throughput_
> of 1 packet every 1 cycle, or 1 packet every 2 cycles.  If you look at any
> individual packet, it might take many steps in a pipeline, each of which
> takes only 1 or 2 clock cycles, but then it moves on to the next pipeline
> stage and the next arriving one takes its place.  The _latency_ of an
> individual packet might be dozens or hundreds of clock cycles, because of
> all the pipeline stages that it must go through to finish its processing.
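[Editor's note: the throughput-versus-latency distinction above can be sketched with a toy model. All numbers here are illustrative, not taken from any real chip.]

```python
# Toy model of a forwarding pipeline: latency and throughput are
# independent quantities.  A packet visits every stage, so latency
# grows with pipeline depth; throughput is set only by how often a
# stage can accept the next packet.

def pipeline_stats(stages, cycles_per_stage, clock_hz):
    latency_cycles = stages * cycles_per_stage    # one packet, end to end
    latency_sec = latency_cycles / clock_hz
    throughput_pps = clock_hz / cycles_per_stage  # a new packet enters each time a stage frees up
    return latency_cycles, latency_sec, throughput_pps

# A hypothetical 100-stage pipeline at 1 GHz, 1 cycle per stage:
cycles, secs, pps = pipeline_stats(stages=100, cycles_per_stage=1, clock_hz=1e9)
print(cycles, pps)  # 100 cycles of per-packet latency, yet 1e9 packets/sec throughput
```

Deepening the pipeline raises per-packet latency but leaves throughput untouched, which is exactly the trade described above.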
> Yes, you can always write a program for such an ASIC that requires more
> processing than the ASIC can finish within the pipeline depth it was designed with.
> The hope is to create an ASIC with a pipeline deep enough for most
> practical purposes, not everything one can imagine.  I have heard Barefoot
> engineers mention in a January 2017 Networking Field Day [1] that the
> performance of their ASICs is known and deterministic, _if the program
> fits in the chip_.
> If a program doesn't fit in the chip, then its performance is 0 for that chip.
> [1] http://techfieldday.com/appearance/barefoot-networks-presents-at-networking-field-day-14/
> Andy
> On Sun, Apr 9, 2017 at 5:58 PM, Michael Borokhovich <michaelbor at gmail.com>
> wrote:
>> Hi Andy,
>> Regarding the "1 packet per clock cycle": it depends on the program,
>> right? What if the pipeline includes several field extractions, traversal
>> of several flow tables, and then modification of some fields?
>> So it looks like the P4 approach is more efficient than a general NPU,
>> since it provides a limited set of abstractions in the hardware, which
>> limits flexibility but improves performance. The NPU gives maximum
>> flexibility at the price of performance. If this is the main
>> differentiator, it makes very good sense to me.
>> Thanks,
>> Michael.
>> On Sat, Apr 8, 2017 at 3:39 PM, Andy Fingerhut <andy.fingerhut at gmail.com>
>> wrote:
>>> I haven't seen published packets/sec numbers for Tofino, either, but
>>> they do publish 65 x 100GE interfaces on their fastest device, and they
>>> published that there are 4 independent pipelines, each most likely handling
>>> 1/4 of those ports.  I know that their architecture can achieve 1 packet
>>> per clock cycle throughput, and clock rates in the 1 GHz to 1.5 GHz range
>>> are definitely achievable, if not even higher.  Conservatively that puts
>>> them at 4 billion packets per second, which at 65 x 100GE ports would
>>> handle an average packet size of 200 bytes at line rate.  6 billion packets
>>> per second would drive down the average packet size to 111 bytes.  Their
>>> fastest device is likely somewhere in that 4 to 6 billion packets/sec range.
>>> Andy
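[Editor's note: the arithmetic above can be checked back-of-the-envelope. This sketch ignores Ethernet preamble, inter-frame gap, and other framing overhead, which is likely part of why the quoted figures come out somewhat lower.]

```python
# Average packet size sustainable at line rate, given aggregate port
# bandwidth and a packets-per-second budget.  Framing overhead
# (preamble, inter-frame gap) is deliberately ignored here.

def avg_packet_bytes_at_line_rate(ports, gbps_per_port, pps):
    total_bytes_per_sec = ports * gbps_per_port * 1e9 / 8
    return total_bytes_per_sec / pps

# 65 x 100GE at 4 and 6 billion packets/sec:
print(avg_packet_bytes_at_line_rate(65, 100, 4e9))  # ~203 bytes
print(avg_packet_bytes_at_line_rate(65, 100, 6e9))  # ~135 bytes
```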
>>> On Sat, Apr 8, 2017 at 11:48 AM, Michael Borokhovich <
>>> michaelbor at gmail.com> wrote:
>>>> Hi Andy,
>>>> Thank you for the insight. If this is the performance difference, then
>>>> of course the advantage of P4 ASIC (e.g., Tofino) is obvious. I see that
>>>> EZchip NP5 supports 300 million packets per second. But I didn't find a
>>>> similar spec for Tofino. Also, this comparison should be done for
>>>> comparable programs since each additional piece of functionality
>>>> (parsing/modifying an additional header field or doing an additional table
>>>> search) affects the pps metric.
>>>> But again, if Tofino indeed achieves ~10 times more pps than, e.g.,
>>>> EZchip NP5 for the same program, then I clearly see the benefit and the
>>>> novelty.
>>>> Michael.
>>>> On Fri, Apr 7, 2017 at 5:52 PM, Andy Fingerhut <
>>>> andy.fingerhut at gmail.com> wrote:
>>>>> In case it isn't obvious, max packet rate that you can achieve in an
>>>>> ASIC turns into a significant difference in cost when buying the equipment
>>>>> and paying the power bill for a network.
>>>>> Suppose you have a choice of a programmable ASIC that goes at 2
>>>>> billion packets per second, and an NPU that goes up to 200 million packets
>>>>> per second, and they both cost roughly the same amount and consume the same
>>>>> power.
>>>>> You have some part of a data center connecting a bunch of hosts
>>>>> together where you decide that kind of programmability is important.  You
>>>>> do some calculations to determine those hosts need 200 billion packets per
>>>>> second of forwarding capacity between them.
>>>>> Do you want to buy and provide power for 200/2 = 100 fast programmable
>>>>> ASICs, or 200/.2 = 1,000 programmable NPUs?
>>>>> Andy
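[Editor's note: the sizing arithmetic above, spelled out. The pps figures are the hypothetical ones from the message, not real product specs.]

```python
import math

# How many devices of a given forwarding capacity are needed to reach
# an aggregate packets-per-second target.  Partial devices round up.

def devices_needed(required_pps, pps_per_device):
    return math.ceil(required_pps / pps_per_device)

print(devices_needed(200e9, 2e9))    # 100 programmable ASICs
print(devices_needed(200e9, 0.2e9))  # 1000 NPUs: 10x the boxes to buy and power
```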
>>>>> On Fri, Apr 7, 2017 at 2:37 PM, Andy Fingerhut <
>>>>> andy.fingerhut at gmail.com> wrote:
>>>>>> I don't have experience with all NPUs, but many I have seen top out
>>>>>> on the order of hundreds of millions of packets per second with current
>>>>>> technology.
>>>>>> With the same current technology, it is possible to design fixed
>>>>>> function ASICs, and programmable ASICs like Barefoot's Tofino, that achieve
>>>>>> billions of packets per second.
>>>>>> The main difference that I am aware of is that many NPUs are based on
>>>>>> parallel arrays of 32-bit or 64-bit processor cores, and each core requires
>>>>>> many cycles for things like constructing table search keys and performing
>>>>>> side effects on the 'packet vector' (per-packet state maintained while
>>>>>> forwarding that packet).  If you want to go at billions of packets
>>>>>> per second, the only way I know to get there is to have fixed or
>>>>>> configurable hardware that can do those things in 1 or 2 clock cycles per
>>>>>> packet.
>>>>>> You can write a compiler that compiles a P4 program to run on an NPU
>>>>>> as described above, and it will achieve portability of the P4 program, but
>>>>>> it won't make that NPU able to go at billions of packets per second.  It is
>>>>>> limited in performance by its hardware architecture.
>>>>>> There are proprietary methods for programming some ASICs that can go
>>>>>> at billions of packets per second, but all that I know of are lower level
>>>>>> than P4 and non-portable.
>>>>>> Andy
>>>>>> On Thu, Apr 6, 2017 at 6:37 PM, Michael Borokhovich <
>>>>>> michaelbor at gmail.com> wrote:
>>>>>>> Hi Remy,
>>>>>>> I'm not confusing hardware with the language... What I mean is that
>>>>>>> P4 + ASIC that supports it claims to give us programmable data-plane and
>>>>>>> this is claimed to be the innovation. But that is exactly the purpose of
>>>>>>> NPUs - to give us programmable data-plane and NPUs are around for a very
>>>>>>> long time. So maybe I'm missing the point of innovation that P4 + ASIC that
>>>>>>> supports it gives. As Nate said, and I agree, one big advantage is
>>>>>>> portability and the other - ability to do verification.
>>>>>>> So, P4 brings a kind of open standard for programmable ASICs, which
>>>>>>> is analogous to a programming language (e.g., C) for regular CPUs,
>>>>>>> while each NPU currently has its own language and programming style.
>>>>>>> What do you think?
>>>>>>> Thanks,
>>>>>>> Michael.
>>>>>>> On Thu, Apr 6, 2017 at 2:07 PM, Remy Chang <
>>>>>>> remy at barefootnetworks.com> wrote:
>>>>>>>> Hi Michael,
>>>>>>>> It seems you're conflating hardware with language.  NPU,
>>>>>>>> programmable ASIC, general-purpose CPU, and even GPU can all potentially
>>>>>>>> execute P4 code.
>>>>>>>> Regards,
>>>>>>>> Remy
>>>>>>>> On Apr 6, 2017 10:57, "Michael Borokhovich" <michaelbor at gmail.com>
>>>>>>>> wrote:
>>>>>>>> Thanks for the reply Nate!
>>>>>>>> So, to summarize, the benefits of the P4 approach are portability and
>>>>>>>> performance. Other than that, you probably can achieve the same (if not
>>>>>>>> better) flexibility/programmability with an NPU. Is this correct?
>>>>>>>> On Thu, Apr 6, 2017 at 1:01 AM, Nate Foster <
>>>>>>>> jnfoster at cs.cornell.edu> wrote:
>>>>>>>>> Your question seems to be more about the relative merits of
>>>>>>>>> various architectures than about the P4 language. But yes, an ASIC is
>>>>>>>>> generally more efficient than an NPU, at least at scale.
>>>>>>>>> Beyond efficiency, there are other benefits to expressing a data-plane
>>>>>>>>> algorithm in an open framework like P4. For example, a P4 program
>>>>>>>>> should be relatively easy to port to a different target. The same is
>>>>>>>>> unlikely to be true for C programs written against closed SDKs.
>>>>>>>>> -N
>>>>>>>>> On Wed, Apr 5, 2017 at 6:59 PM, Michael Borokhovich <
>>>>>>>>> michaelbor at gmail.com> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> P4 allows for configurable data-plane, e.g., we can easily
>>>>>>>>>> support new custom protocols. However, the same functionality may be
>>>>>>>>>> achieved by using a network processor, e.g., EZchip (the one I had
>>>>>>>>>> experience with).
>>>>>>>>>> As I understand it, the advantages of a programmable ASIC/FPGA that
>>>>>>>>>> supports P4 are better performance and a lower price than a network
>>>>>>>>>> processor?
>>>>>>>>>> What do you think?
>>>>>>>>>> Thanks!
>>>>>>>>>> Michael.
>>>>>>>>>> _______________________________________________
>>>>>>>>>> P4-dev mailing list
>>>>>>>>>> P4-dev at lists.p4.org
>>>>>>>>>> http://lists.p4.org/mailman/listinfo/p4-dev_lists.p4.org
