[P4-dev] Parallelism and Manual TDGs

David Hancock dhancock at cs.utah.edu
Thu Jan 11 23:33:15 EST 2018

Thank you both, very helpful.


On 01/11/2018 05:29 PM, Antonin Bas wrote:
> I think Andy gave a very good answer, including regarding bmv2.
> As of today bmv2 simple_switch uses a single thread for the ingress 
> pipeline and N threads for the egress pipeline. Egress ports are 
> statically assigned to one of the egress threads to avoid packet 
> re-ordering. I believe the reason we have multiple egress threads is 
> to avoid too much contention in the queueing logic and more accurately 
> enforce the rate-limiting parameters specified by the user. We could 
> also add support for multiple ingress threads.
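
To illustrate the static egress assignment described above, here is a minimal Python sketch. It is not bmv2's actual code: the modulo mapping, the thread count of 4, and the names egress_thread_for_port and dispatch are all assumptions made for illustration. The only point is that a fixed port-to-thread mapping keeps every packet for a given egress port on one thread, in arrival order.

```python
from collections import defaultdict

NUM_EGRESS_THREADS = 4  # hypothetical value of N


def egress_thread_for_port(port, num_threads=NUM_EGRESS_THREADS):
    """Statically map an egress port to a thread index (assumed: modulo)."""
    return port % num_threads


def dispatch(packets, num_threads=NUM_EGRESS_THREADS):
    """Place (egress_port, packet_id) pairs on per-thread queues.

    Each queue preserves arrival order, so packets for one port are
    never re-ordered relative to each other.
    """
    queues = defaultdict(list)
    for port, pkt_id in packets:
        thread = egress_thread_for_port(port, num_threads)
        queues[thread].append((port, pkt_id))
    return dict(queues)


# Made-up arrivals: ports 1 and 5 share thread 1 when N = 4.
arrivals = [(1, "a"), (5, "b"), (1, "c"), (2, "d"), (5, "e")]
queues = dispatch(arrivals)
port1_order = [pkt for port, pkt in queues[1] if port == 1]
```

Here port1_order is ["a", "c"], the same order in which the port 1 packets arrived.
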
> When I started developing bmv2 I briefly considered trying to emulate 
> the RMT architecture: I wanted to split the tables in stages based on 
> the TDG and run each stage in its own thread. I could also have 
> introduced the notion of pipe to leverage even more cores. I decided 
> not to go with this solution for the following reasons:
> 1) I believed it would have made debugging more difficult. For example 
> the packet logs would have been very intertwined even when switching a 
> single flow (this could have been "fixed" by filtering the logs based 
> on the packet id)
> 2) I believed it may not have been a very efficient use of resources, 
> with some stages taking more CPU cycles than others (the slowest stage 
> would have determined the throughput of the pipe).
> 3) Finally I believed this would have required a synchronized queue 
> between each pair of adjacent stages / threads, to hold packets 
> waiting to be processed by the next thread. My guess was that it 
> would actually slow bmv2 down. This was mostly guess-work and maybe 
> the issue could be avoided with a lock-free queue.
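
Point 2 above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only an idealized model with made-up per-stage costs. It ignores the queueing and synchronization overhead from point 3, and it assumes the alternative is the same number of threads each running the whole pipeline for a packet (which would need some care to avoid re-ordering). The function names are hypothetical.

```python
def staged_throughput(stage_costs_us):
    """Packets per microsecond with one thread per stage.

    In steady state the slowest stage is the bottleneck.
    """
    return 1.0 / max(stage_costs_us)


def run_to_completion_throughput(stage_costs_us, num_threads):
    """Packets per microsecond when each thread runs all stages itself."""
    return num_threads / sum(stage_costs_us)


# Made-up per-packet costs in microseconds for three stages.
costs = [1.0, 1.0, 4.0]
staged = staged_throughput(costs)                       # 1 / 4
rtc = run_to_completion_throughput(costs, len(costs))   # 3 / 6
```

With these numbers the staged design tops out at 0.25 packets per microsecond, while three run-to-completion threads reach 0.5, because two of the three stage threads sit idle most of the time.
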
> I don't think it would be very hard to rewrite bmv2 (and more 
> precisely the Pipeline class: 
> https://github.com/p4lang/behavioral-model/blob/master/src/bm_sim/pipeline.cpp#L39) 
> to use stages but I am not sure it is worth it at this time...
> Thanks,
> Antonin
> On Thu, Jan 11, 2018 at 3:42 PM, Andy Fingerhut 
> <andy.fingerhut at gmail.com> wrote:
>     I will let others answer authoritatively regarding your question
>     about bmv2 simple_switch taking advantage of multiple CPU cores,
>     but I would guess for the case you ask about, likely the answer is
>     no.  Individual primitive operations are usually relatively few in
>     number per table action (e.g. 1 to a dozen or so), and each takes
>     only a tiny amount of computation to complete (e.g. copy a handful
>     of bytes from one place in memory to another), so the overhead of
>     distributing the work to multiple CPU cores and synchronizing them
>     would very likely result in slower packet forwarding, not faster,
>     if you tried that approach.
>     If one wanted to achieve actual speedup for high rates of packet
>     forwarding on a multi-core CPU, I would expect an approach like
>     "use one CPU core for packets that arrive on ports 0 thru 3, and a
>     separate CPU core for packets that arrive on ports 4 thru 7" would
>     be far more likely to produce actual speedup over only using one
>     CPU core (with a device having 8 ports in my example scenario). 
>     This is the same kind of parallelism that many packet switching
>     ASICs take advantage of, for those that are designed with multiple
>     parallel pipelines inside of them.
>     A table dependency graph (TDG) seems unlikely to me to be able to
>     help make a software based P4 forwarding plane like bmv2
>     simple_switch run any faster.  A TDG can help if you have a
>     hardware design like that described in the RMT paper, e.g. as far
>     as I know from public information Barefoot's Tofino ASIC has such
>     a hardware architecture.  Such an architecture has a fixed number
>     of pipeline stages in which packet processing must be completed,
>     and calculating a TDG and doing many other device-specific steps
>     in the compiler can help execute multiple tables in parallel in
>     the same pipeline stage of an RMT device, instead of only
>     executing one per pipeline stage.
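
As a rough illustration of how a TDG lets a compiler place several tables in the same stage, the Python sketch below assigns each table to the earliest stage after all of its dependencies. It is a simplification under stated assumptions: the table names are hypothetical, every dependency is treated as forcing a later stage, and per-stage resource limits are ignored. A real RMT compiler distinguishes dependency types and has to fit memory and other constraints. It needs Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter


def assign_stages(deps):
    """Map each table to the earliest stage that follows all its dependencies.

    deps maps a table name to the set of tables it depends on.
    """
    stage = {}
    # static_order() yields every table after all of its dependencies.
    for table in TopologicalSorter(deps).static_order():
        preds = deps.get(table, ())
        stage[table] = 1 + max((stage[p] for p in preds), default=-1)
    return stage


# Hypothetical TDG: ipv4_lpm and acl both depend only on port_vlan.
deps = {
    "port_vlan": set(),
    "ipv4_lpm": {"port_vlan"},
    "acl": {"port_vlan"},
    "nexthop": {"ipv4_lpm"},
}
stages = assign_stages(deps)
```

In this example ipv4_lpm and acl both land in stage 1, so four tables fit in three stages instead of four.
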
>     Andy
>     On Thu, Jan 11, 2018 at 3:21 PM, David Hancock
>     <dhancock at cs.utah.edu> wrote:
>         Hello,
>         If I have an action with independent primitives, will bmv2
>         simple_switch run them in parallel if multiple cores are
>         available?
>         Also I recall the table dependency graph (TDG) being touted in
>         the original P4 paper as able to support construction of the
>         optimally efficient pipeline.  I think, though, that TDG
>         generation was turned off by default in p4c-bmv2, possibly in
>         response to my request, because at least in my case (hundreds
>         of tables) it could not be computed in a reasonable amount of
>         time.
>         Would it be possible to manually specify a TDG? If so, how
>         could I then use that to deploy a more efficient image of my
>         program to bmv2 simple_switch?
>         Thanks,
>         David
>         _______________________________________________
>         P4-dev mailing list
>         P4-dev at lists.p4.org
>         http://lists.p4.org/mailman/listinfo/p4-dev_lists.p4.org
> -- 
> Antonin
