[P4-dev] Parallelism and Manual TDGs
antonin at barefootnetworks.com
Thu Jan 11 19:29:11 EST 2018
I think Andy gave a very good answer, including regarding bmv2.
As of today bmv2 simple_switch uses a single thread for the ingress
pipeline and *N* threads for the egress pipeline. Egress ports are
statically assigned to one of the egress threads to avoid packet
re-ordering. I believe the reason we have multiple egress threads is to
avoid too much contention in the queueing logic and more accurately enforce
the rate-limiting parameters specified by the user. We could also add
support for multiple ingress threads.
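To illustrate the point above, here is a minimal sketch (hypothetical, not actual bmv2 code) of the static egress-port-to-thread assignment described: because a given port always maps to the same worker thread, packets destined to that port cannot be reordered. The modulo mapping and the constant name are assumptions for illustration.

```python
# Hypothetical sketch of a bmv2-style static egress mapping: each egress
# port is pinned to one of N worker threads, so packets for a given port
# are always handled by the same thread and cannot be reordered.
N_EGRESS_THREADS = 4

def egress_thread_for_port(port: int) -> int:
    # Static assignment: a simple modulo keeps the mapping stable.
    return port % N_EGRESS_THREADS

# All packets to port 6 go to the same worker, preserving their order.
assert egress_thread_for_port(6) == egress_thread_for_port(6)
```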
When I started developing bmv2 I briefly considered trying to emulate the
RMT architecture: I wanted to split the tables in stages based on the TDG
and run each stage in its own thread. I could also have introduced the
notion of pipe to leverage even more cores. I decided not to go with this
solution for the following reasons:
1) I believed it would have made debugging more difficult. For example,
the packet logs would have been very intertwined even when switching a
single flow (this could have been "fixed" by filtering the logs based on
the packet id).
2) I believed it might not have been a very efficient use of resources,
with some stages taking more CPU cycles than others (the slowest stage
would have determined the throughput of the pipe).
3) Finally, I believed this would have required a synchronized queue
between each stage / thread, with packets waiting to be processed by the
next thread. My guess was that this would actually slow bmv2 down; this
was mostly guess-work, and maybe the issue could be avoided with a
lock-free queue.
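For concreteness, here is a minimal sketch of the stage-per-thread design described above (hypothetical, not bmv2 code): each stage runs in its own thread and hands packets to the next stage through a synchronized queue. The per-packet locking on every hop between stages is exactly the overhead worried about in point 3.

```python
import queue
import threading

# Hypothetical sketch: one thread per pipeline stage, connected by
# synchronized queues. A None sentinel shuts each stage down in order.
def make_stage(work, inq, outq):
    def run():
        while True:
            pkt = inq.get()          # blocking, synchronized handoff
            if pkt is None:          # sentinel: propagate shutdown
                if outq is not None:
                    outq.put(None)
                break
            pkt = work(pkt)
            if outq is not None:
                outq.put(pkt)
    return threading.Thread(target=run)

q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
stages = [
    make_stage(lambda p: p + ["stage0"], q0, q1),
    make_stage(lambda p: p + ["stage1"], q1, q2),
]
for t in stages:
    t.start()

q0.put([])                           # one "packet" enters the pipe
q0.put(None)
out = q2.get()                       # processed by both stages, in order
for t in stages:
    t.join()
```

Note that the slowest stage bounds the throughput of the whole pipe: every packet must cross every queue, however unevenly the work is split.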
I don't think it would be very hard to rewrite bmv2 to use stages, but I
am not sure it is worth it at this time...
On Thu, Jan 11, 2018 at 3:42 PM, Andy Fingerhut <andy.fingerhut at gmail.com> wrote:
> I will let others answer authoritatively regarding your question about
> bmv2 simple_switch taking advantage of multiple CPU cores, but I would
> guess for the case you ask about, likely the answer is no. Individual
> primitive operations are usually relatively few in number per table action
> (e.g. 1 to a dozen or so), and each takes only a tiny amount of computation
> to complete (e.g. copy a handful of bytes from one place in memory to
> another), so the overhead of distributing the work to multiple CPU cores
> and synchronizing them would very likely result in slower packet
> forwarding, not faster, if you tried that approach.
> If one wanted to achieve actual speedup for high rates of packet
> forwarding on a multi-core CPU, I would expect an approach like "use one
> CPU core for packets that arrive on ports 0 thru 3, and a separate CPU core
> for packets that arrive on ports 4 thru 7" would be far more likely to
> produce actual speedup over only using one CPU core (with a device having 8
> ports in my example scenario). This is the same kind of parallelism that
> many packet switching ASICs take advantage of, for those that are designed
> with multiple parallel pipelines inside of them.
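The port-sharding idea above can be sketched in a few lines (a hypothetical illustration; the worker count and ports-per-worker split are assumptions matching the 8-port example): each worker owns a disjoint range of ports, so workers never touch the same port's packet stream and need no synchronization on the fast path.

```python
# Hypothetical sketch of port-sharded parallelism: worker 0 owns ports
# 0-3, worker 1 owns ports 4-7. Disjoint port ranges mean the workers
# share no per-port state and need no locking between them.
WORKERS = 2
PORTS_PER_WORKER = 4

def worker_for_port(port: int) -> int:
    # Integer division maps each contiguous port range to one worker.
    return port // PORTS_PER_WORKER

assert worker_for_port(2) == 0   # ports 0-3 -> worker 0
assert worker_for_port(5) == 1   # ports 4-7 -> worker 1
```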
> A table dependency graph (TDG) seems unlikely to me to be able to help
> make a software based P4 forwarding plane like bmv2 simple_switch run any
> faster. A TDG can help if you have a hardware design like that described
> in the RMT paper, e.g. as far as I know from public information Barefoot's
> Tofino ASIC has such a hardware architecture. Such an architecture has a
> fixed number of pipeline stages in which packet processing must be
> completed, and calculating a TDG and doing many other device-specific steps
> in the compiler can help execute multiple tables in parallel in the same
> pipeline stage of an RMT device, instead of only executing one table per
> pipeline stage.
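To make the TDG idea above concrete, here is a small sketch (hypothetical table names and dependencies, not from any real program) of how a compiler might layer tables into RMT stages: tables with no dependency between them can share a stage, while a table that reads another's result must be placed in a later stage.

```python
# Hypothetical table dependency graph: each table maps to the set of
# tables whose results it depends on.
deps = {
    "ipv4_lpm": set(),           # no predecessors
    "acl": set(),                # independent of ipv4_lpm
    "nexthop": {"ipv4_lpm"},     # reads ipv4_lpm's result
}

def assign_stages(deps):
    # Longest-path layering: a table's stage is one past the latest
    # stage among its dependencies (stage 0 if it has none).
    stage = {}
    def visit(table):
        if table not in stage:
            stage[table] = max((visit(d) + 1 for d in deps[table]),
                               default=0)
        return stage[table]
    for table in deps:
        visit(table)
    return stage

stages = assign_stages(deps)
# ipv4_lpm and acl can share stage 0; nexthop must wait for stage 1.
```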
> On Thu, Jan 11, 2018 at 3:21 PM, David Hancock <dhancock at cs.utah.edu> wrote:
>> If I have an action with independent primitives, will bmv2 simple switch
>> run them in parallel if multiple cores are available?
>> Also I recall the table dependency graph (TDG) being touted in the
>> original P4 paper as able to support construction of the optimally
>> efficient pipeline. I think, though, that TDG generation was turned off by
>> default in p4c-bmv2, possibly in response to my request, because at least
>> in my case (hundreds of tables) it could not be computed in a reasonable
>> amount of time.
>> Would it be possible to manually specify a TDG? If so, how could I then
>> use that to deploy a more efficient image of my program to bmv2 simple
>> switch?
>> P4-dev mailing list
>> P4-dev at lists.p4.org