QUIC flow control stream parameters explanatory graphic

I have started to spend time contributing to Neqo, Mozilla’s implementation of the QUIC/HTTP3 standard. I got a foothold in the code by attempting to tackle an easy first task. N.B.: Names can be deceiving. In the process, I found a few of the transport parameters of a QUIC connection (values exchanged between peers at the start of a connection) very, very confusing:

  • initial_max_stream_data_bidi_local
  • initial_max_stream_data_bidi_remote

Each of these parameters is described in the standard itself (https://tools.ietf.org/html/draft-ietf-quic-transport-24#section-18.2), but I had a very hard time parsing the semantics of the descriptions. For example, here is the description of initial_max_stream_data_bidi_local from the standard:

This parameter is an integer value specifying the initial flow control limit for locally-initiated bidirectional streams.  This limit applies to newly created bidirectional streams opened by the endpoint that sends the transport parameter.  In client transport parameters, this applies to streams with an identifier with the least significant two bits set to 0x0; in server transport parameters, this applies to streams with the least significant two bits set to 0x1.
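
To make the bit arithmetic concrete, here is a small Go sketch (my own illustration, not code from Neqo or the spec) that classifies a stream by the two least significant bits of its identifier, which is exactly what the parameter descriptions are talking about:

package main

import "fmt"

// classify reports who initiated a QUIC stream and whether it is
// bidirectional, based on the two least significant bits of its ID:
// 0x0 client bidi, 0x1 server bidi, 0x2 client uni, 0x3 server uni.
func classify(streamID uint64) string {
    switch streamID & 0x3 {
    case 0x0:
        return "client-initiated, bidirectional"
    case 0x1:
        return "server-initiated, bidirectional"
    case 0x2:
        return "client-initiated, unidirectional"
    default:
        return "server-initiated, unidirectional"
    }
}

func main() {
    for _, id := range []uint64{0, 1, 2, 3, 4, 5} {
        fmt.Printf("stream %d: %s\n", id, classify(id))
    }
}

Read with that in mind, the initial_max_stream_data_bidi_local value a client sends applies to the bidirectional streams the client itself opens (IDs ending in 0x0), while the same parameter sent by a server applies to the bidirectional streams the server opens (IDs ending in 0x1): "local" always means local to the endpoint that sent the parameter.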

That’s a mouthful, isn’t it? After several days of textual analysis, I thought I had figured it out and consulted with QUIC experts Robin Marx, Daniel Stenberg and Lucas Pardue (besides being experts, the members of this triumvirate are really great people) to make sure I understood.

As I discussed my understanding of the parameters with them, it became clear to me that I should attempt to diagram the intent of these parameters in a way that might save others the same headache.

If you’d like to use this diagram in any form, please do so. I have licensed it under CC BY-SA. I am happy to provide it in other formats as well!

Update: The first version of this blog post included the wrong section of the QUIC spec quoted for initial_max_stream_data_bidi_local. I think it is correct now! Sorry.

From Awesome Bar to Parser (1/n)

I recently had the chance to fix a timing bug deep in the HTML parser of Firefox. As a result of the investigation, I got to learn lots about the way that the browser efficiently loads and parses HTML as it is downloaded from the network. One of the things that has always bothered me is that I never had a handle on the process, from start to finish, of how Firefox turns a user’s text input in the Awesome Bar into a network connection, then into a stream of HTML and, ultimately, into a rendered page on the screen. In this post I will explore the first part of one of those three areas by starting to answer the following question: How does Firefox translate a URL the user enters in the Awesome Bar into network activity?

At a very, very high level, the parts of Firefox visible to the end user are built from two separate components. First, there is an engine known as Gecko that renders HTML/CSS into a visual representation and executes JavaScript. Second, there is the part with which the user interacts that controls Gecko. This so-called chrome around Gecko consists of, for instance, the history UI, the developer tools, the preferences UI, and, most important for the purposes of this post, the Awesome Bar. See the image below.

The areas of a Firefox window under the control of Gecko are shown in green; the areas of a Firefox window under the control of the chrome are shown in blue.

Known internally as the URL Bar, the Awesome Bar is implemented with a model/view/controller design pattern. The controller, known as UrlbarController and implemented in browser/components/urlbar/UrlbarController.jsm, handles the input that the user types into the Awesome Bar. So, when the user types in, say, cnn.com and then presses Enter, the UrlbarController hears those key presses and is responsible for carrying out the action that the user expects. In this case, that action is to tell Gecko to load the HTML of cnn.com and render it on the screen.

We start down the rabbit hole here, then. The switch statement in the handleKeyNavigation function of the UrlbarController is executed each time the user presses a key while the cursor is in the Awesome Bar. When the user presses Enter, the code on line 309 runs and calls the handleCommand method of the controller’s input field. The handleCommand method checks for any special actions that the user may have expected to happen based on their input. If no special actions are to be taken (e.g., using a search engine to query the internet), the handleCommand method assumes that the user typed a URL and meant for Gecko to load that website. handleCommand takes that URL from the Awesome Bar and passes control to its _loadURL method. Because the URL comes directly from user input, it is considered a trusted link. _loadURL uses openTrustedLinkIn of the window object to continue loading.

The openTrustedLinkIn method is defined in browser/base/content/utilityOverlay.js. This file contains a set of global functions that are needed throughout the implementation of the browser’s chrome. After guaranteeing the validity of its parameters, openTrustedLinkIn passes control to openUILinkIn, defined in the same file. Execution has a cup of coffee in openUILinkIn before continuing to openLinkIn. openLinkIn's primary responsibility is to translate its where parameter into a target browser, the browser that will render the contents of the URL entered by the user. In this case, where is the string "current", which is a canonical reference to the browser in the foreground tab. After translating the where parameter to a target browser, openLinkIn calls the target browser’s loadURI function.

The target browser’s loadURI method is bound to the global _loadURI function defined in browser/base/content/browser.js and this is the point where we can start to see the light at the end of the tunnel (sorry for mixing metaphors!). _loadURI invokes loadURI on browser's webNavigation object. The webNavigation object is …

It turns out that finishing that sentence is harder than it seems. The webNavigation object is a property of browser. That means that Javascript gives the implementer of browser the opportunity to write a getter function for it. And, remember, browser is just a normal web browser, right? Wrong!

Throughout the implementation of Firefox, “browser” is a term that carries several meanings. On many occasions, the reason for referring to an element as a “browser” is historical and no longer even remotely applicable. I have it on good authority that our engineers and developers apologize for this. In this context, browser is a UI element implemented as a XULFrameElement that represents the content area (i.e., where the contents of the web page are rendered) of the currently selected tab.

The normal implementation of the getter for the webNavigation property of a XULFrameElement is overridden and has very interesting behavior, to say the least. In this case, browser's isRemoteWebBrowser value is set to true and the getter returns its remote web browser. I am not even going to pretend that I understand. Fortunately, the immensely talented and smart Mike Conley came to my rescue:

The reason is historical. The WebNavigation property is supposed to be the interface through which one can command the browser to navigate somewhere. Before multi-process Firefox, this was an interface to the underlying DocShell that was running in the parent process.
With multi-process Firefox, that interface was kept, but we overrode it so that the commands were being “remoted” to another process over the message manager.
More recently, that remote-ing is now going mostly over IPC directly in the native layer by calling methods on the BrowsingContext instead.

Private correspondence with the author.

On marches Firefox by executing the loadURI method on a remote web browser, which turns out to be a canonical browsing context (1, 2).

Here, finally, Firefox transitions from executing code written in JavaScript to executing code written in C++. The journey's end draws near. A canonical browsing context is implemented by CanonicalBrowsingContext, a subclass of BrowsingContext. Both have LoadURI methods, but their parameter lists differ, so the two are distinct overloads rather than one overriding the other. Execution continues in CanonicalBrowsingContext::LoadURI before eventually continuing to BrowsingContext::LoadURI. CanonicalBrowsingContext::LoadURI creates a load state using the values of the parameters set by its caller. As its final act, CanonicalBrowsingContext::LoadURI invokes BrowsingContext::LoadURI with the newly created load state and NULL pointers for the other two parameters. Calling BrowsingContext::LoadURI in this way triggers a specific execution path.

The chrome of Firefox (along with several other subsystems of the browser) is implemented in a so-called parent process. Firefox uses one or more child processes to control the “execution” of the content in browser tabs. This post is following a journey initiated by the user from the UI, so the actions executed to this point have been done in the context of the parent process. Because of the way that CanonicalBrowsingContext::LoadURI invoked BrowsingContext::LoadURI and the fact that it is executing in the context of the parent process, BrowsingContext::LoadURI simply transfers the user’s request to load a page to the proper child process – the one controlling the “execution” of the content of the current foreground tab, which in a previous context was called the target browser.

As I’ve been writing, I’ve been imagining myself telling this story over a campfire to a circle of middle schoolers who thought they were going to hear a scary story. Depending on how you have felt reading this, they may or may not be disappointed by what they’ve heard so far.

Either way, this is right about where the story gets really good so it seems like a good place to pause, take a break and inhale. When I started writing, I thought that I knew the entire execution path from the research I had done trying to fix the timing bug. I was wrong. Putting this journey into words has forced me to reconsider several of my assumptions and, as a result, I’ve learned some very important lessons.

Join me again soon when we continue our trek From Awesome Bar to Parser!

I want to say a special thank you to Mike Conley for providing valuable feedback on this post. Any errors that remain are my fault, however. Try as he might, sometimes I just couldn’t be taught.

Mobile Redirection Statistics of the Alexa Top 50 Sites

As a follow-up to my previous post on redirection behavior of the Alexa Top 50 sites, my coworkers suggested that I run the same analysis when accessing those sites from a mobile browser.

Of course I didn’t gather the data in the first study manually and the only way to fetch the contents of a URL from the command line is curl, obviously. So, the first step in conducting this data analysis was reading the curl man page to figure out how to include a custom user-agent string in requests:

-A, --user-agent <name>

(HTTP) Specify the User-Agent string to send to the HTTP server. To encode blanks in the string, surround the string with single quote marks. This can also be set with the -H, --header option of course.

If this option is used several times, the last one will be used.

I chose to use the UA String from Fenix:

Mozilla/5.0 (Android 9; Mobile; rv:71.0) Gecko/71.0 Firefox/71.0
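
If you would rather not reach for curl, the same request can be made in a few lines of Go. This is just a sketch of the idea (the URL is a placeholder, not one of the sites in my data set): it sends the Fenix UA string and prints the status line and Location header of the first response without following any redirect.

package main

import (
    "fmt"
    "net/http"
)

const mobileUA = "Mozilla/5.0 (Android 9; Mobile; rv:71.0) Gecko/71.0 Firefox/71.0"

func main() {
    // Do not follow redirects; we want to see the first response as-is.
    client := &http.Client{
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            return http.ErrUseLastResponse
        },
    }

    req, err := http.NewRequest("GET", "http://example.com", nil)
    if err != nil {
        panic(err)
    }
    req.Header.Set("User-Agent", mobileUA)

    resp, err := client.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    fmt.Println(resp.Status, resp.Header.Get("Location"))
}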

The results were surprising: Only 15 of the Alexa Top 50 used HTTP redirects to send a mobile user to a custom mobile-friendly URL. The other 35, presumably,

  • Use JS redirection to send the user to a mobile-friendly version of their website;
  • Use responsive design to build their website;
  • Give the same user experience to desktop users and mobile users.

As before, the results are available online.

Redirection Statistics of the Alexa Top 50 Sites

We were asking a question at work about whether it would be useful to prefetch http://www.url.tld, https://url.tld or https://www.url.tld if a user browses to http://url.tld.

I did a study on the sites ranked in the Alexa Top 50 to see how often a prefetch like that would be useful. In other words, I attempted to count how often a site, bare.tld, from the Alexa Top 50 redirects to www.bare.tld and/or whether the redirection changes protocols from HTTP to HTTPS.

In order to make the study easier, I only considered redirects done with HTTP response headers. In other words, if a connection would be redirected by some JavaScript in the page, that redirection is not reflected in these stats.
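
As a rough sketch of the per-site check (an illustration of the logic, not the actual script I ran; the hostname is a placeholder), the test boils down to fetching http://bare.tld without following redirects and comparing the request with the Location header of the first response:

package main

import (
    "fmt"
    "net/http"
    "net/url"
    "strings"
)

// checkRedirect fetches http://<bare> without following redirects and reports
// whether the first Location header adds a "www." prefix and/or upgrades the
// scheme to https. A full analysis would follow the whole redirect chain.
func checkRedirect(bare string) (addsWWW, upgradesHTTPS bool, err error) {
    client := &http.Client{
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            return http.ErrUseLastResponse
        },
    }

    resp, err := client.Get("http://" + bare)
    if err != nil {
        return false, false, err
    }
    defer resp.Body.Close()

    loc := resp.Header.Get("Location")
    if loc == "" {
        return false, false, nil // no header-based redirect at all
    }

    target, err := url.Parse(loc)
    if err != nil {
        return false, false, err
    }

    addsWWW = strings.HasPrefix(target.Hostname(), "www.")
    upgradesHTTPS = target.Scheme == "https"
    return addsWWW, upgradesHTTPS, nil
}

func main() {
    www, https, err := checkRedirect("example.com")
    if err != nil {
        panic(err)
    }
    fmt.Printf("adds www: %v, upgrades to https: %v\n", www, https)
}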

Here are the results:

  1. 32 of the 50 (64%) redirect from bareurl.tld to www.bareurl.tld
  2. 37 of the 50 (74%) redirect from http to https
  3. 27 of the 50 (54%) do both

Here is the data, if you want to take a look at it yourself.

Update: As it turns out, some redirection behavior is User-Agent-String (UA String) specific. The default UA String in curl is curl/7.58.0 and using it gives the results shown above. If, however, you specify a more reasonable UA String (Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:69.0) Gecko/20100101 Firefox/69.0), the results are (slightly) different:

  1. 32 of the 50 (64%) redirect from bareurl.tld to www.bareurl.tld
  2. 41 of the 50 (82%) redirect from http to https
  3. 31 of the 50 (62%) do both

Notice that there is no change in the number of sites that redirect from bareurl.tld to www.bareurl.tld but there is an increase in the number of websites that do redirects for secure HTTP.

Data for these results are available here.

Importance of the Order of Calls to postMessage and addEventListener

Things are strange in the world of JavaScript. I wanted to take a few minutes to document something that is, well, interesting.

In the living version of the specification of HTML from the WHATWG, there is an API known as Cross-document messaging. This allows two documents, from the same or different domains, to send messages to one another. The API allows, for example, two documents embedded alongside one another in iframes to exchange messages to coordinate behavior.

Although this is not how it is defined in the specification, consider that a message logically has a sender and zero or more receivers. In the design of a message-passing framework, there are (at least) two different ways to deliver messages:

  1. deliver the message to existing registered receivers and delete the message or
  2. deliver the message to all registered receivers and store that message for delivery to receivers that register in the future.

Obviously, 1 makes the most sense. To implement delivery method 2 would mean that a single sender transmitting a high rate of messages could cripple the delivery system by forcing it to store every sent message for potential future delivery.

The Cross-document messaging API implements 1. Therefore, it would make sense that if a receiver r wanted to hear a message sent by sender s, r must be registered before s sends a message. The HTML spec says that s calls postMessage() to send a message and r calls addEventListener() to become a receiver.

However, that logical ordering is not always required to use the Cross-document messaging API! Don’t believe me? Go to the executable version of the code below and click the sendMessageAndSetHandler button! You’ll find that, even though the listener registers after the message is sent, it still receives the sender’s transmission!

<html>
<head>
<script>
function sendMessage() {
  window.postMessage("I can hear you!");
}
function setMessageHandler() {
  function messageHandler(event) {
    alert("message: " + event.data);
  }
  window.addEventListener("message", messageHandler);
}
function sendMessageAndSetHandler() {
  sendMessage();
  setMessageHandler();
}
</script>
</head>
<body>
<button onclick="sendMessageAndSetHandler();">sendMessageAndSetHandler</button><br>
</body>
</html>

How is this possible?

The listener registration takes place immediately but, as the specification says, postMessage() only queues a task to send the message and does not actually send it. The task added to the queue by the sender to deliver the message is only actually executed when it is at the top of the so-called posted message task source queue and that queue is serviced by the browser. The posted message task source queue is serviced during the execution of the event loop and (modulo my understanding of very intricate JS behavior) the execution of a JavaScript function will never be interrupted to service the event loop.

In other words, between the call to sendMessage() and setMessageHandler() in the example above, the browser does not actually transmit the message. Therefore, by the time the message is actually sent, the receiver will be listening!

Makes perfect sense, right?

The Impact of android:debuggable on Android Application Startup Time

For several weeks members of my team and I have been trying to reconcile discrepancies in performance data for our application’s startup when measured locally and when measured on our CI platform. We weren’t really sure what was causing the discrepancies and resorted to brainstorming about the possible causes.

One of our most promising hypotheses was that there was more of a difference between the measurement techniques than we had imagined. While we weren’t sure how the application’s startup time was being measured on the CI platform, we knew how it was being measured locally: with the Android Profiler.

We were testing the same build variant (a “release” build) locally and on the CI platform which meant that we were executing the same byte code on both. What other variables could account for the difference in measurements? It occurred to us that the CI platform might be taking its measurements with some tool other than the Android Profiler. One of the prerequisites of the Android Profiler is that the application under profile be built with the android:debuggable flag set to true.

Our team operated under the assumption that the value of this flag simply controlled an Android OS permission that dictated whether a debugger/profiler could attach to the process at runtime. In other words, we believed that the value of this flag should not change the way the Android OS’s runtime treated the application. However, this was the only potential difference we could imagine between the local and CI platform test environment, so we decided to investigate.

Turns out this was a good idea.

To properly isolate whether the debuggable flag was the cause of the change in startup performance, we had to figure out how to make two versions of the application with identical byte code but different values of the debuggable flag, so that we could definitively attribute any difference in startup time to how the runtime treats that flag. Ideally, then, we would toggle the flag on two copies of the same compiled version of the application (an apk).

The android:debuggable flag exists in an apk’s AndroidManifest.xml file. In older versions of Android, this file was, as the extension suggests, plain text xml. However, in modern versions of Android, the AndroidManifest.xml file is stored in a binary format in the apk.

There is not much publicly available documentation (that we could find quickly) about this file format. Several tools are available for inspecting its contents (e.g., aapt from the Android SDK) and we found some tools that promised to help modify the contents of an existing apk (e.g., apktool) but couldn’t get them to work properly.

That meant we were going to have to modify the binary-formatted xml file by hand! Fortunately, there is a good, open-source, standalone tool, axmldec, whose source code implicitly contains enough information about the file format to give us an idea of the binary format’s structure.

The android:debuggable flag is an attribute of an element; in the case of the android:debuggable flag, that element is the application itself. According to the source code, each element attribute is stored on disk in a 20-byte structure:

  • ns: 4 bytes
  • name: 4 bytes
  • raw_value: 4 bytes
  • value: 8 bytes

Taking a cue from Java class file encoding techniques, this binary format stores every string in a table, and the use of a string is stored simply as a pointer to an entry in that table. Because the namespace (ns) and name (name) of the attribute are strings, their values in this structure are 4-byte pointers into the string table. In this case, those strings are “android” and “debuggable”. The raw_value and value fields are more interesting. If the raw_value field is not 0xffffffff, then it is a pointer to a string in the string table; in other words, the value of the attribute is that string and the value field is meaningless. On the other hand, if the raw_value field is 0xffffffff, then the attribute’s value is some primitive encoded in the 8-byte value field.

A separate format defines the structure of the data in the 8-byte value field:

  • size: 2 bytes
  • res0: 1 byte
  • data_type: 1 byte
  • data: 4 bytes

The data and data_type fields are the most useful. Again, according to the source code, a boolean (the type we suppose the android:debuggable attribute to be) has a data_type of 0x12. For a boolean, “false” is 0 in the data field and “true” is anything else.
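
To make the layout concrete, here is a small Go sketch that decodes one such attribute from a byte slice. The struct simply mirrors the 20-byte layout described above (the field names are mine, and the sample bytes and string-table indices are invented for illustration); the real file stores these values little-endian.

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
)

// xmlAttribute mirrors the 20-byte attribute structure described above.
type xmlAttribute struct {
    NS       uint32 // string-table index of the namespace ("android")
    Name     uint32 // string-table index of the attribute name ("debuggable")
    RawValue uint32 // 0xffffffff when the value is a primitive, else a string index
    Size     uint16 // size of the typed value that follows (8)
    Res0     uint8  // reserved
    DataType uint8  // 0x12 means boolean
    Data     uint32 // 0 means false, anything else means true
}

const (
    noRawValue  = 0xffffffff
    typeBoolean = 0x12
)

func main() {
    // A hypothetical android:debuggable="true" attribute as it might sit on disk.
    raw := []byte{
        0x01, 0x00, 0x00, 0x00, // ns
        0x02, 0x00, 0x00, 0x00, // name
        0xff, 0xff, 0xff, 0xff, // raw_value: not a string
        0x08, 0x00, // size
        0x00, // res0
        0x12, // data_type: boolean
        0xff, 0xff, 0xff, 0xff, // data: true
    }

    var attr xmlAttribute
    if err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &attr); err != nil {
        panic(err)
    }

    if attr.RawValue == noRawValue && attr.DataType == typeBoolean {
        fmt.Println("boolean attribute, value =", attr.Data != 0)
    }
}

Flipping the flag by hand, as we did next, amounts to finding that final four-byte data field in the file and zeroing it.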

With that understanding, we compiled a version of the application with the debuggable flag set to true and unzipped the resulting apk to get access to AndroidManifest.xml. Then, we used a modified version of axmldec to find the offset of the android:debuggable attribute in the application element. Using our newfound knowledge of the format of the elements, we used vi and xxd to manually change the attribute’s value from “true” to “false”. We used that modified AndroidManifest.xml file to create a non-debuggable version of the apk by zipping the contents of the directory generated when we unzipped the original apk. Finally, we signed the apk with our key and verified that our change to the debuggable flag worked ($ aapt d badging <path to apk> | grep -i debug).

Breathe.

After all this work we learned …? Nothing. Yet.

We did, however, have a basis for a/b testing: Two applications, identical except for their debuggable flags.

We had to answer one final question before starting to test: how could we fairly measure application startup time? Remember, because only one of the two variants is marked as debuggable, we cannot use the Android Profiler. We found our answer in the am_* flags.

The Android ActivityManager emits am_* events when certain meaningful things happen. You can see a list of all the am_* flags here. We decided to rely on the am_proc_start flag, which contains information about the time an application takes to start.
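
To give a flavor of how these events can be collected (a sketch, not our actual tooling): dump the events log buffer and keep only the am_proc_start lines, for example by piping adb logcat -b events -d through a small filter like this one.

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// Reads logcat "events" buffer output on stdin and prints only the
// am_proc_start lines, which mark the moment a process was started.
func main() {
    scanner := bufio.NewScanner(os.Stdin)
    for scanner.Scan() {
        line := scanner.Text()
        if strings.Contains(line, "am_proc_start") {
            fmt.Println(line)
        }
    }
    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, "error reading input:", err)
        os.Exit(1)
    }
}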

We didn’t need to build a raw data set where n was, say, 1000. We just needed something “rough and ready”. Our test tool ran each version of the application 5 times and recorded the am_proc_start times for each run; we then analyzed the values.

You can see the raw data here.

Well, wow! The results indicate that our assumption was almost entirely wrong! Simply changing the value of the android:debuggable flag has a significant impact on the time it takes an application to start.

For us, the implications of this discovery are enormous.

  1. The measurements that we are taking of the time it takes to start our application on the CI platform and in the profiler cannot be compared directly with the reported startup times of comparable applications (unless, of course, they are also measuring versions of their application with the android:debuggable flag set).
  2. A corollary to (1) is that we can use timing results from Android Profiler and the CI platform to monitor performance changes over time. The caveat is that we have to be vigilant against the possibility that our optimizations are only improving the performance of startup components that the runtime executes when the debuggable flag is set. While these optimizations will not negatively affect startup performance under routine conditions, spending time on them would be a waste of engineering resources.
  3. We must, now, consider whether the value of the android:debuggable attribute changes other aspects of runtime performance. Although we do not believe that it does, these results show that we can no longer be certain of that.

As a result of this investigation, we are going to spend time researching where the runtime checks the debuggable flag and how it changes its behavior based on that value. We have several theories.

If you have any information leading to the arrest of the party responsible for the lost performance, please contact us in the comments!

Thank you to my fantastic teammate Anny who provided valuable feedback on a draft of this post. If you think that the post is well written and informative, it’s because of her. If you think that the post is confusing and boring, I take full responsibility.

Correcting ProGuard Configuration To Improve the Startup Time of Firefox’s New Mobile Browser

As a member of the performance engineering group at Mozilla, I am one of the people helping Firefox Preview’s development team with optimizing its startup time. Over several weeks, we have implemented many “easy” optimizations that significantly improved application startup time. Easy is in quotes to emphasize the irony — none of those startup performance optimizations were truly easy. The entire Firefox Preview team contributed to that effort and we were able to improve application startup by, primarily but not exclusively, deferring or parallelizing time-consuming startup tasks. More blog posts to come detailing the team’s earlier startup optimization achievements; the scope of this entry is more limited.

After shaking free all the low-hanging fruit from the startup speed tree, we wanted to step back and look at application startup time from a wider perspective. I began to wonder if there was a way that we could optimize the application’s on-disk representation to make it easier for the Android OS to load it into memory. I invited Jonathan Almeida, a talented, smart colleague, into the investigation, and the first thing we noticed was that on a Pixel 2 it took the OS more than three quarters of a second to load the application before even starting the Activity lifecycle.

Using the Android Profiler, we found that almost the entirety of application loading was being spent reading and parsing the application’s dex files. dex files contain the application’s byte code in Dalvik Executable format. Using Android Studio’s tools for analyzing an apk, the compressed file that contains all the application resources (including code, images, layouts, etc), Jonathan and I were able to inspect the size and composition of Firefox Preview’s dex files.

Figure 1
Figure 1

Figure 1 shows Studio’s Apk Analyzer at work. We saw that this version of Firefox Preview’s code requires three dex files, making it a so-called multi-dex application. Because Firefox Preview’s code did not fit in a single dex file, and because Jonathan knew that multiple dex files have implications for runtime performance, we hypothesized that they also negatively affect startup performance.

There are several reasons an application may end up with multiple dex files. Size is one of them. Firefox Preview has a robust set of dependencies and we weren’t sure how well we minimized and tracked the library dependencies listed in our Gradle build files. If there were dependencies listed that we did not use, removing those would reduce the size of Firefox Preview’s code and, Jonathan and I hoped, shrink the size of the bytecode to the point where it could fit in a single dex file.

Following best practices of Android development, we already used ProGuard within our build process to strip out unused methods and functions. The use of ProGuard means that a manual accounting of dependencies like this should have been unnecessary. Jonathan and I decided to do the manual inspection for excess code regardless, mostly out of curiosity. We found just a single dependent library that was listed in the Gradle build file but not actually used: Glide.

Glide is a library that ‘offers an easy to use API, a performant and extensible resource decoding pipeline and automatic resource pooling’ for ‘image loading … focused on smooth scrolling’.

The next step was to see if that code made it into Firefox Preview’s dex files. Again, because we were already using ProGuard, that code should not have been present. But, lo and behold, it was.

Figure 2

Figure 2 shows that the code for Glide did end up in Firefox Preview’s dex file. Relative to the overall size of Firefox Preview’s three dex files, removing these unused classes and methods was not going to have a major impact on their size.

Critically, though, it did indicate that something about our use of ProGuard was amiss.

There are a bewildering number of options available for configuring ProGuard. By complete accident, we noticed the -whyareyoukeeping option and configured ProGuard to tell us why it thought that it could not remove Glide from the output dex file despite the fact that Firefox Preview never instantiated any of its classes, used any of its static methods, etc.

Figure 3

Figure 3 shows ProGuard thought it needed to keep Glide’s code not because it was used but rather because it was obligated to follow a rule that told it to keep public and static methods from all classes. Those public methods, in turn, introduced a cascading set of dependencies on most of the library’s methods and classes.

The presence of such a catch-all rule was odd — it wasn’t anywhere in Firefox Preview’s codebase! Perhaps it was in the baseline rules included in the Android SDK? Nope, not there. We widened our search into the source code for the libraries that Firefox Preview uses. We didn’t get far before we found the rule in the ProGuard configuration for GeckoView (GV).

Neither Jonathan nor I is an expert at ProGuard, so it was a riddle how a ProGuard rule in a dependent library could “escape” and “infect” higher-level packages. We started to search through ProGuard information on the Internet and stumbled upon this “helpful” tidbit:

For library modules and for libraries distributed as AARs, the maintainer of the library can specify rules that will be supplied with the AAR and automatically exposed to the library consumer’s build system by adding this snippet to the module’s build.gradle file: … The rules that you put in the consumer-proguard.txt file will be appended to the main ProGuard configuration and used during the full application build.

Troubleshooting ProGuard issues on Android

Well, oops. GeckoView’s conservative ProGuard configuration seemed to be effectively negating Firefox Preview’s ProGuard code-shrinking optimizations. We were cautiously optimistic that fixing this issue would result in not only the removal of Glide’s methods and classes from Firefox Preview but an overall smaller amount of byte code and, perhaps, an application whose code fit entirely within a single dex file.

It was time to do some work. Jonathan and I removed that directive from GeckoView’s ProGuard configuration, rebuilt Firefox Preview and went back to our friend, the Android Studio Apk Analyzer.

Figure 4

Wow. From three dex files totaling almost 5.5MB to a single dex file totaling 3.4MB all with a single change. Major progress! But, there are still two outstanding questions.

First, why was Glide still referenced in the new, smaller dex file? Although there are fewer methods kept, it was puzzling to see it there at all. Remember, Firefox Preview never uses it! To answer this question, we went back to the trusty -whyareyoukeeping option in ProGuard. Again, we found that ProGuard kept those methods because of a configuration rule.

Figure 5

Funny, though, this rule existed nowhere in any of Mozilla’s code. Grepping through the Firefox Preview source code and the codebases of the various Mozilla components used by Firefox Preview confirmed this. The only remaining option was that a dependency used by Firefox Preview was doing the same thing that GeckoView was doing: leaking a static ProGuard configuration to its consumers that overrode ProGuard’s willingness to strip Glide entirely from the output dex file. We guessed that it was more than likely in Glide itself.

Jonathan and I found that very ProGuard configuration directive in the Glide source code. The sad implication of this finding is that we could not rely on ProGuard to automatically remove Glide entirely and we would have to manually update Firefox Preview’s dependency configuration. Following the appropriate two-line change to Firefox Preview’s build.gradle file, we rebuilt Firefox Preview’s apk to test our work.

Figure 6

Whew. Glide is finally exorcised from the dex file!

Second, and more importantly: Did going from a multi-dex to a single-dex application have an impact on startup performance? The Android Profiler could help Jonathan and me definitively answer this question. Figures 7 and 8 show the relevant portion of the profile of Firefox Preview startup before and after the ProGuard configuration change:

Figure 7
Figure 8

Profiling confirmed our initial hypothesis: multi-dex Android applications negatively affect startup performance and runtime performance.

The path Jonathan and I followed to achieve a decrease in the time the Android OS takes to load Firefox Preview led us on an Alice-in-Wonderland-like trip where we found ourselves profiling application startup, analyzing dex files and apks, and learning more about ProGuard than we ever wanted to know. The savings we gained were far more than we could have imagined when we started and more than justified the time we spent traveling this long and winding road.

Update: This post has been edited to remove references to Mozilla-internal names and to explicitly reference my previously unnamed, mystery collaborator.

Update 2: Removed the first paragraph because it contained an implicit promise that I did not intend.

Range Semantics in Go

Introduction

A few months ago, famous Gopher Dave Cheney posted the following question on Twitter:

The answer, for the impatient, is that it does terminate. He and I went back and forth about why and it got me thinking about the semantics of the range statement. I promised Dave that I would write up the results of my experiments and I am finally doing so in this blog post.

The range clause controls a for loop and is used when the programmer wants to loop through all the values of an array, slice, map or channel. In this post I will deal only with the range clause over an array or slice.

Implicit in Dave’s question is how the range clause applied to an array behaves when the loop body modifies that array. In pseudo code, that looks something like this:

for index, value of all indexes and values of A {
    if there is an element of A at (index + 1)
        A[index + 1] = value + 1;
    print value
}

That may seem contrived, but it’s a relatively straightforward way to illustrate the semantics of the range clause.

For the purposes of this post, there will be two arrays:

a = {1,2,3}
b = {4,5,6}

The first thing to remember is that “[t]he range expression is evaluated once before beginning the loop” [1]. Yes, there is an exception to that rule. No, I am not going to focus on that now. The second important thing to remember is that assignments in Go are done by copy. That means that functions receive copies of the parameters placed in the caller’s invocation. That means that a caller gets a copy of the return values from a function it invokes. And, with respect to the range clause, that means that the range clause creates a copy of the input variable to use during iteration.
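
A two-line illustration of that last point (a toy example of my own, separate from the ones below): assigning an array to a new variable copies its contents, so a write through the copy never shows up in the original.

package main

import "fmt"

func main() {
    x := [3]int{1, 2, 3}
    y := x            // arrays are values: y is a complete copy of x
    y[0] = 99         // modifies only the copy
    fmt.Println(x, y) // prints [1 2 3] [99 2 3]
}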

By Value Range Clause

The first example explores the semantics of, what I will call, the “by value range clause”.

package main

import (
    "fmt"
)

func array_by_value() {
    var a, b [3]int

    a = [3]int{1, 2, 3}
    b = [3]int{4, 5, 6}
    
    fmt.Printf("Ref: 1 2 3\n")
    fmt.Printf("Act: ")
    for i, v := range a {
        fmt.Printf("%v ", v)
        if i < len(a)-1 {
            a[i+1] = b[i+1]
        }
    }
    fmt.Printf("\n")
}

func main() {
    array_by_value()
}

To run this example, you can go to the playground https://play.golang.org/p/fzKuCGhC4n

The output from this example is

Ref: 1 2 3
Act: 1 2 3

So, what happened to our assignments? Why aren’t the updated values printed at each iteration?

When the range statement is evaluated (again, remember that the range clause is evaluated exactly once at the start of the loop), the range clause copies a (again, remember that everything in Go is a copy). Because a is the array, its contents are copied into a temporary. That temporary value is indexed at each iteration of the loop to determine the value v.

See Figure 1 for a visual representation of the contents of memory at the start of the 0th iteration and the start of the 1st iteration. The array values highlighted in red are those indexed at each iteration to produce the value v.

Range by Value

What is important to notice about the above is that Go makes a copy of the contents of the array at the beginning of the loop. This is not a problem when the array contains only a few elements. There are, however, very interesting implications when the array has many elements. I will discuss the implications of range clause semantics for application performance below.

By Reference Range Clause

The second example explores the semantics of, what I will call, the “by reference range clause”. I call it “by reference” because passing an address or a pointer is how programmers achieve “by reference” semantics in “by value” languages like C[2] (and Go).

package main

import (
    "fmt"
)

func array_by_ref() {
    var a, b [3]int

    a = [3]int{1, 2, 3}
    b = [3]int{4, 5, 6}
    
    fmt.Printf("Ref: 1 5 6\n")
    fmt.Printf("Act: ")
    for i, v := range &a {
        fmt.Printf("%v ", v)
        if i < len(a)-1 {
            a[i+1] = b[i+1]
        }
    }
    fmt.Printf("\n")
}

func main() {
    array_by_ref()
}

To run this example, you can go to the playground
https://play.golang.org/p/TFYKoq0csJ

The output from this example is

Ref: 1 5 6
Act: 1 5 6

And, wow. In this version we do get the updated values! So, what’s going on?

When the range statement is evaluated (again, remember that the range clause is evaluated exactly once at the start of the loop), the range clause copies &a (again, remember that everything in Go is a copy). Because &a is just the address of the array, the address is copied into a temporary. That temporary value is indexed at each iteration of the loop to determine the value v.

See the figure below for a visual representation of what is happening in this “by reference” range clause example at the start of the 0th iteration and the start of the 1st iteration. The array values highlighted in red are those indexed at each iteration to produce the value v.

Range by Reference

To reiterate, Go is copying the reference and not the array to the temporary at the start of the loop. I will discuss the performance implications of this later.

By Slice Value Range Clause

The third example explores the semantics of, what I will call, the “by slice value range clause”. I think the name will be self-explanatory when you see the code. The name reflects the fact that the range clause copies a slice by value.

package main

import (
    "fmt"
)

func slice_by_value() {
    var a, b [3]int

    a = [3]int{1, 2, 3}
    b = [3]int{4, 5, 6}
    c := a[:]

    fmt.Printf("Ref: 1 5 6\n")
    fmt.Printf("Act: ")
    for i, v := range c {
        fmt.Printf("%v ", v)
        if i < len(a)-1 {
            a[i+1] = b[i+1]
        }
    }
    fmt.Printf("\n")
}

func main() {
    slice_by_value()
}

To run this example, you can go to the playground
https://play.golang.org/p/1OYBLTd30P

The output from this example is

Ref: 1 5 6
Act: 1 5 6 

And, wow again! We are seeing updated values despite the “by value” copy of the slice. What is going on here?

The slice is very much like a reference. Above we saw that a reference is a simple pointer to an underlying object. The slice is a pointer and metadata. The pointer is to the underlying array that supplies the slice’s values and the metadata is the length and capacity of the slice. For more information on the internals of slices in Go, see [3].
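
A quick illustration of that sharing (again, a toy example of my own): copying a slice copies only the pointer-plus-metadata header, so a write through one copy is visible through every other copy and through the underlying array.

package main

import "fmt"

func main() {
    a := [3]int{1, 2, 3}
    s := a[:] // slice header: pointer to a, length 3, capacity 3
    t := s    // copies the header only, not the backing array
    t[0] = 99 // visible through a, s and t alike
    fmt.Println(a, s, t) // prints [99 2 3] [99 2 3] [99 2 3]
}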

In the slice_by_value example, the slice is copied by value. In other words, the pointer and the metadata are copied, not the array itself. For a visual representation, see the figure below. Again, the array values highlighted in red are those indexed at the start of iteration 0 and iteration 1 to produce the value v.

Range by Slice

To reiterate, Go is copying the slice and not the array to the temporary at the start of the loop. I will discuss the performance implications of the three different range semantics next.

Performance Implications

As mentioned earlier, each of the different semantics of the range clause has implications for performance. To understand these implications, let’s consider a scenario where the program repeatedly executes a loop over an array that contains a large number of elements.

For our example, we will have an array called base that we are going to loop over 50000 times. base will contain 50000 entries. In this example the inner loop will be relatively contrived.[4]

Here is the Go version of the code that we will test:

package main

import (
    "fmt"
    "time"
)

func speed_of_value_range() {
    const size int = 50000
    const iterations int = 50000
    var base [size]int
    var start_time time.Time
    var k int

    for i := 0; i < size; i++ {
        base[i] = i
    }

    start_time = time.Now()
    for i := 0; i < iterations; i++ {
        for _, j := range base {
            j++
            k = j
        }
    }
    fmt.Printf("%v Speed: %v\n", k, time.Now().Sub(start_time))
}

func main() {
    speed_of_value_range()
}

On my machine, this code takes 2.153868183s to execute.

Let’s compare that with the runtime for two almost identical functions. The only difference will be the way that we use the range clause.

In the next example, we will execute the range clause by reference.

package main

import (
    "fmt"
    "time"
)

func speed_of_ref_range() {
    const size int = 50000
    const iterations int = 50000
    var base [size]int
    var start_time time.Time
    var k int

    for i := 0; i < size; i++ {
        base[i] = i
    }

    start_time = time.Now()
    for i := 0; i < iterations; i++ {
        for _, j := range &base {
            j++
            k = j
        }
    }
    fmt.Printf("%v Speed: %v\n", k, time.Now().Sub(start_time))
}

func main() {
    speed_of_ref_range()
}

On my machine, this code takes 1.323862716s to execute. Wow! That’s half the time!

Let’s try a third and final example where we will execute the range clause by slice value.

package main

import (
    "fmt"
    "time"
)

func speed_of_slice_range() {
    const size int = 50000
    const iterations int = 50000
    var base [size]int
    var start_time time.Time
    var k int

    c := base[:]

    for i := 0; i < size; i++ {
        base[i] = i
    }

    start_time = time.Now()
    for i := 0; i < iterations; i++ {
        for _, j := range c {
            j++
            k = j
        }
    }
    fmt.Printf("%v Speed: %v\n", k, time.Now().Sub(start_time))
}

func main() {
    speed_of_slice_range()
}

On my machine, this takes 1.279790841s to execute. Wow, even faster still (albeit by a smaller amount).

It seems obvious that we should execute range clauses by reference or by slice value whenever possible in order to gain as much performance as possible. Why is this?

If we look at the object code generated by the Go compiler for each of these loops, the answer is clear!

Here is the relevant section of object code for the first example, speed_of_value_range():

  486296: 31 db                 xor    ebx,ebx
  486298: 31 f6                 xor    esi,esi
  48629a: 48 81 fb 50 c3 00 00  cmp    rbx,0xc350
  4862a1: 7d 59                 jge    4862fc 
  4862a3: 48 8d bc 24 e0 1a 06  lea    rdi,[rsp+0x61ae0]
  4862aa: 00
  4862ab: 49 89 f0              mov    r8,rsi
  4862ae: 48 8d 74 24 60        lea    rsi,[rsp+0x60]
  4862b3: 41 89 c9              mov    r9d,ecx
  4862b6: 48 c7 c1 50 c3 00 00  mov    rcx,0xc350
  4862bd: f3 48 a5              rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
  4862c0: 31 c9                 xor    ecx,ecx
  4862c2: 48 8d b4 24 e0 1a 06  lea    rsi,[rsp+0x61ae0]
  4862c9: 00
  4862ca: 48 81 f9 50 c3 00 00  cmp    rcx,0xc350
  4862d1: 7d 17                 jge    4862ea 
  4862d3: 4c 8b 16              mov    r10,QWORD PTR [rsi]
  4862d6: 48 83 c6 08           add    rsi,0x8
  4862da: 48 ff c1              inc    rcx
  4862dd: 4d 8d 42 01           lea    r8,[r10+0x1]
  4862e1: 48 81 f9 50 c3 00 00  cmp    rcx,0xc350
  4862e8: 7c e9                 jl     4862d3 
  4862ea: 48 ff c3              inc    rbx
  4862ed: 44 89 c9              mov    ecx,r9d
  4862f0: 4c 89 c6              mov    rsi,r8
  4862f3: 48 81 fb 50 c3 00 00  cmp    rbx,0xc350
  4862fa: 7c a7                 jl     4862a3

And here is the relevant section of object code for the second example, speed_of_ref_range():

  4864d6: 31 db                 xor    ebx,ebx
  4864d8: 31 f6                 xor    esi,esi
  4864da: 48 81 fb 50 c3 00 00  cmp    rbx,0xc350
  4864e1: 7d 33                 jge    486516
  4864e3: 31 ff                 xor    edi,edi
  4864e5: 4c 8d 44 24 60        lea    r8,[rsp+0x60]
  4864ea: 48 81 ff 50 c3 00 00  cmp    rdi,0xc350
  4864f1: 7d 17                 jge    48650a
  4864f3: 4d 8b 08              mov    r9,QWORD PTR [r8]
  4864f6: 49 83 c0 08           add    r8,0x8
  4864fa: 48 ff c7              inc    rdi
  4864fd: 49 8d 71 01           lea    rsi,[r9+0x1]
  486501: 48 81 ff 50 c3 00 00  cmp    rdi,0xc350
  486508: 7c e9                 jl     4864f3
  48650a: 48 ff c3              inc    rbx
  48650d: 48 81 fb 50 c3 00 00  cmp    rbx,0xc350
  486514: 7c cd                 jl     4864e3

Although there are a number of differences between the two snippets of object code, one stands out. speed_of_value_range() contains this instruction:

rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]

How is that instruction causing us such a problem? It’s easy to see once we think about the semantics of that instruction. The rep modifier to the instruction means that we will repeat the movs operation the number of times contained in the count register (e/r)cx. The instruction before the rep movs is

mov    rcx,0xc350

In other words, we are moving 0xc350 quad words (eight bytes each) every time we enter the inner loop. 0xc350 is 50,000, so throughout the life of the speed_of_value_range() function the CPU will perform 50,000 * 50,000 = 2.5 billion quad-word moves, roughly 20 GB of copying, that it will not do in the speed_of_ref_range() function. [5]

That’s where we get the speedup!

Conclusion

I hope that this post helped you understand the different semantics of the range clause. If you update the values of the variable you are ranging over inside the body of the loop, the kind of range clause you used determines which values you see in later iterations.

There are also performance implications. The simple advice for the best performing code is to use by value range semantics sparingly!

I’d love to hear what you think about these results and whether you agree or disagree. Please let me know in the comments below!

 

References

[1] This comes from the Go Programming Language Specification at https://golang.org/ref/spec.

[2] For an explanation of this, see http://denniskubes.com/2012/08/20/is-c-pass-by-value-or-reference/

[3] Go slices: Usage and internals https://blog.golang.org/go-slices-usage-and-internals

[4] The inner loop body is contrived just so that we can trick the compiler. The compiler aggressively optimizes loops when the body does not use the range value and index variables. This will affect our timing results.

[5] It will not do those moves in speed_of_slice_range() either. Proving this is left as an exercise for the reader! I love when people say that!