[P4-dev] Parallelism and Manual TDGs

Andy Fingerhut andy.fingerhut at gmail.com
Thu Jan 11 18:42:09 EST 2018

I will let others answer authoritatively regarding your question about bmv2
simple_switch taking advantage of multiple CPU cores, but I would guess for
the case you ask about, likely the answer is no.  Individual primitive
operations are usually relatively few in number per table action (e.g. 1 to
a dozen or so), and each takes only a tiny amount of computation to
complete (e.g. copy a handful of bytes from one place in memory to
another), so the overhead of distributing the work to multiple CPU cores
and synchronizing them would very likely result in slower packet
forwarding, not faster, if you tried that approach.
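
As a rough illustration of that overhead argument, here is a
back-of-the-envelope cost model in Python.  It is only a sketch: the
function names are hypothetical, and every number in it (5 ns per
primitive, 200 ns to hand work to other cores and wait for them) is a
made-up placeholder, not a measurement of bmv2.

```python
import math

def serial_ns(num_primitives, cost_ns):
    """Time to run all primitives of one action back to back on one core."""
    return num_primitives * cost_ns

def parallel_ns(num_primitives, cost_ns, cores, sync_overhead_ns):
    """Time when the primitives are split evenly across cores.

    sync_overhead_ns is an ASSUMED fixed cost for distributing the work
    and synchronizing the cores afterwards.
    """
    per_core = math.ceil(num_primitives / cores)
    return sync_overhead_ns + per_core * cost_ns

# Placeholder numbers, chosen only to show the shape of the trade-off.
one_core = serial_ns(num_primitives=12, cost_ns=5)
four_cores = parallel_ns(num_primitives=12, cost_ns=5, cores=4,
                         sync_overhead_ns=200)
print(one_core, four_cores)
```

With these placeholder values the single-core version is faster, because
the fixed synchronization cost is larger than all of the useful work in
the action put together.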

If one wanted to achieve actual speedup for high rates of packet forwarding
on a multi-core CPU, I would expect an approach like "use one CPU core for
packets that arrive on ports 0 through 3, and a separate CPU core for packets
that arrive on ports 4 through 7" would be far more likely to produce actual
speedup over only using one CPU core (with a device having 8 ports in my
example scenario).  This is the same kind of parallelism that many packet
switching ASICs take advantage of, for those that are designed with
multiple parallel pipelines inside of them.
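
A minimal Python sketch of that port-based split follows.  It is not how
bmv2 is implemented; the names worker_for_port and shard_packets are
hypothetical, and it assumes the number of ports divides evenly among the
workers (8 ports and 2 workers, as in the example above).

```python
from collections import defaultdict

NUM_PORTS = 8     # assumed device size, matching the example scenario
NUM_WORKERS = 2   # one worker per CPU core

def worker_for_port(ingress_port, num_ports=NUM_PORTS,
                    num_workers=NUM_WORKERS):
    """Map an ingress port to the worker (core) that owns it."""
    if not 0 <= ingress_port < num_ports:
        raise ValueError("unknown ingress port: %r" % (ingress_port,))
    ports_per_worker = num_ports // num_workers
    return ingress_port // ports_per_worker

def shard_packets(packets):
    """Group (ingress_port, payload) pairs into one queue per worker.

    Arrival order is preserved within each queue, and no per-packet
    synchronization between workers is needed.
    """
    queues = defaultdict(list)
    for ingress_port, payload in packets:
        queues[worker_for_port(ingress_port)].append((ingress_port, payload))
    return dict(queues)

queues = shard_packets([(0, b"a"), (5, b"b"), (3, b"c"), (7, b"d")])
```

Each worker then runs the whole pipeline on its own packets, so the unit
of parallelism is a packet rather than an individual primitive.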

A table dependency graph (TDG) seems unlikely to me to be able to help make
a software based P4 forwarding plane like bmv2 simple_switch run any
faster.  A TDG can help if you have a hardware design like that described
in the RMT paper, e.g. as far as I know from public information Barefoot's
Tofino ASIC has such a hardware architecture.  Such an architecture has a
fixed number of pipeline stages in which packet processing must be
completed, and calculating a TDG and doing many other device-specific steps
in the compiler can help execute multiple tables in parallel in the same
pipeline stage of an RMT device, instead of only executing one per pipeline
stage.
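
To show how a TDG can let tables share a stage, here is a small Python
sketch that places each table in the earliest stage after everything it
depends on.  It is a simplification under stated assumptions, not what
any real compiler does: it ignores per-stage resource limits and treats
all kinds of dependency alike.  The table names and the assign_stages
function are hypothetical.

```python
def assign_stages(tables, depends_on):
    """Return {table: stage}, using the earliest stage each table can run in.

    tables:     list of table names
    depends_on: dict mapping a table to the tables that must finish first
    """
    stage = {}

    def stage_of(table, visiting=()):
        if table in stage:
            return stage[table]
        if table in visiting:
            raise ValueError("dependency cycle at %r" % (table,))
        preds = depends_on.get(table, ())
        # A table with no predecessors lands in stage 0.
        stage[table] = 1 + max(
            (stage_of(p, visiting + (table,)) for p in preds), default=-1)
        return stage[table]

    for table in tables:
        stage_of(table)
    return stage

# Hypothetical program: acl does not depend on the routing tables.
tables = ["ipv4_lpm", "acl", "nexthop", "rewrite"]
depends_on = {"nexthop": ["ipv4_lpm"], "rewrite": ["nexthop"]}
stages = assign_stages(tables, depends_on)
```

Here ipv4_lpm and acl both land in stage 0, so the four tables fit in
three stages instead of four.  That only pays off on hardware where each
stage can really execute several tables at once.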


On Thu, Jan 11, 2018 at 3:21 PM, David Hancock <dhancock at cs.utah.edu> wrote:

> Hello,
> If I have an action with independent primitives, will bmv2 simple switch
> run them in parallel if multiple cores are available?
> Also I recall the table dependency graph (TDG) being touted in the
> original P4 paper as able to support construction of the optimally
> efficient pipeline.  I think, though, that TDG generation was turned off by
> default in p4c-bmv2, possibly in response to my request, because at least
> in my case (hundreds of tables) it could not be computed in a reasonable
> amount of time.
> Would it be possible to manually specify a TDG?  If so, how could I then
> use that to deploy a more efficient image of my program to bmv2 simple
> switch?
> Thanks,
> David