11 July 2016
Black Holes and Blind Spots in the Performance Analysis of the MongoDB Java Driver
In this article I show how to measure and analyze the performance of client drivers that talk to backend storage services, such as MongoDB, through the use of the Satoris egress metering extension, which creates a derived meter from a base meter, such as clock.time, representing the outbound service (latency) time. This particular measurement technique can be applied to any other type of outbound service call made from an application or client driver library.
MongoDB (from “humongous”) is a cross-platform document-oriented database. Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster.
When doing a performance analysis of a software application that communicates with an external service or storage system via some client interfacing (driver) library, I initially start with a simple conceptual model having 3 basic layers: the application, the client driver and the networking (or file system). The application represents the entry point (ingress) for interactions that eventually pass through the driver (transit) and out to the external system and service (egress). When I measure latency and cost (consumption) at the application entry it includes both the transit and the egress costs, but invariably I like to get a breakdown of the cost for each of these call flow classifications.

One way to do this is to aggregate latency and execution cost at the package level and then group each package into a particular call flow class. But this has its problems, because you need to track the ingress, transit and egress timing and costing within the ingress, transit and egress layers themselves. It becomes recursive: when software makes a call into a package it typically makes various internal calls within its own package before calling into another package that has a different call flow classification grouping. Another limitation with package grouping is that you want to have the breakdown on a per service entry point basis. For example, how much of the request processing time for transaction X can be attributed to remote calls to service B? If you view this in terms of a call tree, you are looking to compare the cost at a root call node with the summed time spent in leaf call nodes. Frequently you don't really care about the breakdown of latency, or cost, across packages for a particular application entry point call. You just want to know how much of the transaction time is actually spent in the JVM: how much you have to work with in optimizing the processing, assuming the external service is not under your own change control.
There are two measurement concerns to bear in mind during such performance analysis: black holes and blind spots. A black hole is the term I use when referring to outbound calls to a remote service. I am aware that there is this black box the application communicates with, but I am not sure how I can determine the surface area (leaf call nodes) and the time spent (lost) there, in what I refer to as the "outer space". This task is further complicated by a client driver library that creates an "inner space" around a portal to the remote destination.

A blind spot occurs in the context of black holes when there is some form of coordination (and coupling) between two inner space call interaction points that communicate with the outer space. A request is dispatched to the server and then there is a wait within the inner space, the client driver library, for a paired response. If we only measure the writing of the client request and then measure the reading of the server response, the time (and possibly cost) between these two events is not visible. If the calling thread is in some way suspended from doing other work while waiting for a response, we should really attribute this time in the inner space to the outer space.

There is a slightly different form of this in the typical web request processing pattern observed in JEE applications. A web request comes in and makes a number of SQL query statement executions. The time to execute the SQL queries is measured, but not the time spent iterating over the resulting record sets, which can actually have a much greater latency than what was measured in retrieving the initial set of records when executing the SQL query. As the application iterates over the record set, the client driver library communicates back and forth with the remote server to retrieve more of the records matching the query.
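This record set blind spot can be simulated without any database at all. In the sketch below, all names and costs are invented for illustration: a cursor pays a simulated round-trip cost on each batch boundary, so a probe that only measures the initial query execution misses most of the remote time.

```java
import java.util.Iterator;

// Simulates the record set blind spot: fetching happens in batches, and only
// the first batch is visible to a probe that measures query execution alone.
public class BlindSpot {
    static final class Cursor implements Iterator<Integer> {
        private final int total, batchSize;
        private int served = 0;
        long fetchCost = 0;                 // accumulated simulated round-trip cost

        Cursor(int total, int batchSize) { this.total = total; this.batchSize = batchSize; }

        @Override public boolean hasNext() { return served < total; }

        @Override public Integer next() {
            if (served % batchSize == 0) fetchCost += 5;   // one simulated round trip per batch
            return served++;
        }
    }

    public static void main(String[] args) {
        Cursor cursor = new Cursor(1000, 10);   // 1000 records, fetched 10 at a time
        long queryCost = 5;                     // the first batch: what a naive probe measures
        while (cursor.hasNext()) cursor.next(); // iteration drives the remaining round trips
        System.out.println("measured query cost: " + queryCost);
        System.out.println("hidden iteration cost: " + (cursor.fetchCost - queryCost));
    }
}
```

Here the hidden iteration cost is 99 times the measured query cost, which is the point: the measurement boundary, not the workload, decides what you see.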
Hopefully the following example based on the MongoDB Java Driver API will make this far more understandable. Here is the mini application that will be used to execute a series of common remote database storage interactions: insert, find and remove. The code was inspired by an example within the MongoDB GitHub repository.
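A minimal sketch of such a mini application, against the legacy com.mongodb API, could look like the following. The class name Stress matches the article; the database, collection, iteration count and document shape are my assumptions, not the exact benchmark code. It requires a running mongod and is shown for illustration only.

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

// Drives a tight loop of insert, find and remove against a local MongoDB,
// giving the metering agent a steady stream of driver calls to measure.
public class Stress {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017);
        try {
            DB db = client.getDB("test");
            DBCollection collection = db.getCollection("stress");
            for (int i = 0; i < 1_000_000; i++) {
                BasicDBObject doc = new BasicDBObject("i", i);
                collection.insert(doc);                         // insert
                collection.findOne(new BasicDBObject("i", i));  // find
                collection.remove(doc);                         // remove
            }
        } finally {
            client.close();
        }
    }
}
```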
Let's start with a configuration of the hotspot metering extension, enabled by default within the Satoris instrumentation agent. Methods on average (not in the statistical sense) having a clock time of less than 5 microseconds will be disabled.
Here is the resulting
hotspot metering model collected on completion of the application execution. The majority of the inherent (self) clock time is spent in the
Note: Probes having the same total and inherent total are leaf call nodes in terms of the metering engine.
IMPORTANT NOTICE: The focus of the article is on the measurement technique employed, what questions it answers and the challenges in doing so. The numbers are not to be taken as representative considering the benchmark test executed for only 7 minutes and was run local to the database server. I will do a more lengthy benchmark run and append the results to the article later this week.
In the following configuration I enabled the
nonreentrant metering extension. This metering extension prevents the metering engine from measuring (metering) a method within the execution scope of itself. An alternative and more efficient metering extension,
nonrecursive, is available for filtering on direct reentrant method calls. I generally see the turning on of this particular extension as a data cleaning exercise, as the averaging of reentrant calls to the same named method always seems somewhat confusing and misleading to me.
j.s.p.hotspot.threshold=5 j.s.p.hotspot.threshold.inherent=1 j.s.p.nonreentrant.enabled=true
The biggest change in the metering model with the
nonreentrant metering extension enabled is the reduction in the measurement count for
Bits.readFully, dropping down from 9M to 3M. The reason for this is that the
Bits.readFully method is overloaded and chained. This is also the case for various methods in the
DBCollectionImpl classes dropping down from 2M to 1M.
Let's try going a bit lower with the threshold, hopefully collecting measurements for methods previously disabled by the
hotspot metering extension.
j.s.p.hotspot.threshold=1 j.s.p.hotspot.threshold.inherent=1 j.s.p.nonreentrant.enabled=true
The metering table filtered on
hotspot labeled probes is now much larger. There is a slight increase in the response times, but even that is probably overstated due to the relatively short run duration and the fact that it would take longer for the metering engine to eliminate extraneous probes and their measurement. I could always switch in other types of latency meters that eliminate clock access from the metered thread context, like I did in my previous article titled An Allocation (Re)Action to Reactor, but I don't believe it is warranted considering the remote execution nature of the test. Let's move on.
Now begins the more interesting part of the analysis with the enablement of the
egress metering extension and the creation of a new meter,
io.time, derived from the base
clock.time meter. What the
egress metering extension does is track the
clock.time metering for all leaf call nodes. A leaf is a metered probe that did not have any nested metered probes. The keyword here is metered as all metering extensions operate on the methods that are instrumented and continually metered.
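In other words, for each probe the derived io.time is the clock.time summed over the leaf call nodes beneath it. A toy sketch of that aggregation follows; the names and data structures are invented for illustration and are not the Satoris implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of an egress-style derived meter: a probe's io.time is the
// clock.time summed over all leaf call nodes beneath it in the call tree.
public class EgressMeter {
    static final class Probe {
        final String name;
        final long clockTime;                        // total clock.time metered for this probe
        final List<Probe> children = new ArrayList<>();

        Probe(String name, long clockTime) { this.name = name; this.clockTime = clockTime; }

        Probe add(Probe child) { children.add(child); return this; }
    }

    // Derived io.time: for a leaf, its own clock.time; otherwise the sum over children.
    static long ioTime(Probe p) {
        if (p.children.isEmpty()) return p.clockTime;
        long sum = 0;
        for (Probe c : p.children) sum += ioTime(c);
        return sum;
    }

    public static void main(String[] args) {
        // Mirrors the article's findOne numbers: 117us total, 103us in leaf calls.
        Probe findOne = new Probe("DBCollection.findOne", 117)
                .add(new Probe("DBPort.doOperation", 103));
        System.out.println("clock.time=" + findOne.clockTime + " io.time=" + ioTime(findOne));
    }
}
```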
j.s.p.hotspot.threshold=5 j.s.p.hotspot.threshold.inherent=1 j.s.p.nonreentrant.enabled=true j.s.p.egress.enabled=true j.s.p.egress.meter.clock.time.name=io.time
With the above configuration we now have two metering records for each probe (named method):
io.time is the actual
clock.time for leaf probes called directly or indirectly by a probe. Looking at the metering records for the
DBCollection.findOne probe we can now see that out of an average
clock.time of 117 microseconds we can attribute 103 microseconds to the
io.time, or the
clock.time spent in leaf probes. Incidentally, this is a good example of where average reporting is useful. The average makes for a far easier comparison value than a significantly large total. What I particularly like about this is that I don't need to walk through a typical profiler call tree, only to get lost in expanding a large number of nodes and seeing a large % of attributed time disappear in front of me without any useful insight offered up in the meantime.
The above metering table only showed those probes labeled as a
hotspot. In the metering table below, sorted on the inherent total, many of the probes observed as call leafs by the
egress metering extension are shown. Some of these probes would probably not be deemed IO related, especially the object constructors.
Any metering extension activated by the metering engine can have its visibility reduced using filters. In the following configuration the
egress metering extension now only sees those probes within the
org.bson.io package and sub packages.
j.s.p.hotspot.threshold=5 j.s.p.hotspot.threshold.inherent=1 j.s.p.nonreentrant.enabled=true j.s.p.egress.enabled=true j.s.p.egress.meter.clock.time.name=io.time j.s.p.egress.filter.enabled=true j.s.p.egress.filter.include.name.groups=org.bson.io
Here is the adjusted metering table now with the
io.time metering only pertaining to leaf calls within the
org.bson.io package, called directly or indirectly by a probe. The
DBCollection.findOne() probe has a total
clock.time of 118 microseconds and an
io.time metering of 103 microseconds. Only 15 microseconds is actually spent executing non-IO related methods within the scope of the
findOne() method. For
DBCollection.remove() the delta is 31 microseconds (127 – 96) and 27 microseconds (154 – 127) respectively.
Eventually the calls made in the
Stress class on the
DBCollection end up invoking a
doOperation() method in class
DBTCPConnector. Below is the code snippet from the
DBPort class. What do you see? At least two separate IO operations. The piping (or dispatch) of a client request and then the reading of the result, done strangely enough in the
Response object constructor. A possible blind spot depending on how much runtime execution cost follows the last IO write and how much precedes the first IO read. It may be better to consider the
doOperation() as the exit (egress) point in the remote procedure call (RPC) plumbing.
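To recap the shape being discussed, here is a heavily paraphrased rendering of that code from memory; the method and field names are approximations of the legacy driver internals, not the verbatim DBPort source.

```java
// Paraphrased shape of the driver's operation dispatch (names approximate).
synchronized Response doOperation(DBCollection coll, OutMessage msg) throws IOException {
    ensureOpen();                  // connection housekeeping
    msg.pipe(out);                 // IO operation 1: write the request to the socket
    // Anything executed between the write above and the read below sits in the
    // potential blind spot discussed earlier.
    return new Response(coll, in); // IO operation 2: the response is read inside a constructor
}
```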
In the following configuration I’ve changed the filtering on the
egress metering extension to only include the two
doOperation() methods mentioned above. I’ve also renamed the
io.time meter to rpc.time.
j.s.p.hotspot.threshold=5 j.s.p.hotspot.threshold.inherent=1 j.s.p.nonreentrant.enabled=true j.s.p.egress.enabled=true j.s.p.egress.meter.clock.time.name=rpc.time j.s.p.egress.filter.enabled=true j.s.p.egress.filter.include.name.groups=com.mongodb.DBPort.doOperation,com.mongodb.DBTCPConnector.doOperation
The delta for the
DBCollection.findOne() method is now 9 microseconds (115 – 106). For
DBCollection.remove() the delta is 20 microseconds (125 – 105) and 17 microseconds (152 – 135) respectively. All
rpc.time metering averages have increased over the
io.time meterings and the delta, with
clock.time, has decreased. It is now relatively easy to take a look at any method at the surface of the Java driver and very quickly determine whether a possible code optimization investigation is worth considering, knowing how much time is actually spent calling outbound.
You might be tempted to change the
egress filtering from the
DBPort.doOperation() method to the
DBPort.call() methods but then the
rpc.time metering would include the time spent within the runtime performing concurrency control as the
doOperation() method is
synchronized. This may well be acceptable with a suitable renaming of the meter.
- Life is too short to be spending time clicking around huge call trees in typical Java profiling tools, especially when very little is gained.
- Painting a black box up on some slide is not terribly useful when questions need to be answered. Boxes come in various shapes and sizes.
- Top software performance engineers look to understand the nature of code. They’re not shy of code, cost, consumption and control.
- The most important skill is in the framing of a question and then figuring out a pathway from a measure to a set of probable answers.
- 8 out of 10 Java profilers can only help you find 2 out of 10 performance problems. They just hope that the 2 happens 8 times out of 10.
- Don’t do your thinking within a performance profiling solution otherwise nothing much will come of it. Instead drive the tool towards the answer.
- If a profiler promises to find a problem in minutes then it’s already limited the questions it will answer. Rarely is it that simple outside demos.
I don't offer Satoris, or many of the other products I've developed over the years to help with performance investigation, for commercial license outside of a professional service contract. I do offer expertise to clients, both onsite and offsite, that utilize these technologies in the course of performance tuning workshops, software performance benchmark analysis and production problem diagnosis. Feel free to contact me if you need to move fast in speeding up and scaling out your software services and systems.
I changed the
Stress class to run 10 times longer and also made some configuration changes to measure more, more accurately and more cheaply.
j.s.p.hotspot.initial=10000 j.s.p.hotspot.lower=20000 j.s.p.hotspot.threshold=1 j.s.p.hotspot.threshold.inherent=1 j.s.p.nonrecursive.enabled=true j.s.p.egress.enabled=true j.s.p.egress.meter.clock.time.name=rpc.time j.s.p.egress.filter.enabled=true j.s.p.egress.filter.include.name.groups=com.mongodb.DBPort.doOperation,com.mongodb.DBTCPConnector.doOperation
Here is the resulting metering model, filtered on hotspot labeled probes.