Importance of the Order of Calls to postMessage and addEventListener

Things are strange in the world of JavaScript. I wanted to take a few minutes to document something that is, well, interesting.

In the living version of the HTML specification from the WHATWG, there is an API known as Cross-document messaging. It allows two documents, from the same or different origins, to send messages to one another. For example, two documents embedded alongside one another in iframes can use it to exchange messages and coordinate behavior.

Although this is not how it is defined in the specification, consider that a message logically has a sender and zero or more receivers. In the design of a message-passing framework, there are (at least) two different ways to deliver messages:

  1. deliver the message to the currently registered receivers and then discard it, or
  2. deliver the message to all currently registered receivers and also store it for delivery to receivers that register in the future.

Obviously, option 1 makes the most sense. Implementing delivery method 2 would mean that a single sender transmitting messages at a high rate could cripple the delivery system by forcing it to store every sent message for potential future delivery.

The Cross-document messaging API implements option 1. Therefore, it would make sense that if a receiver r wanted to hear a message sent by sender s, r must be registered before s sends the message. The HTML spec says that s calls postMessage() to send a message and r calls addEventListener() to become a receiver.

However, that logical ordering is not always required to use the Cross-document messaging API! Don’t believe me? Go to the executable version of the code below and click the sendMessageAndSetHandler button! You’ll find that, even though the listener registers after the message is sent, it still receives the sender’s transmission!

<html>
<head>
<script>
function sendMessage() {
  window.postMessage("I can hear you!");
}
function setMessageHandler() {
  function messageHandler(event) {
    alert("message: " + event.data);
  }
  window.addEventListener("message", messageHandler);
}
function sendMessageAndSetHandler() {
  sendMessage();
  setMessageHandler();
}
</script>
</head>
<body>
<button onclick="sendMessageAndSetHandler();">sendMessageAndSetHandler</button><br>
</body>
</html>

How is this possible?

The listener registration takes place immediately, but, as the specification says, postMessage() only queues a task to deliver the message; it does not deliver it synchronously. That task sits on the so-called posted message task source queue and only runs when the browser services that queue. The queue is serviced during the execution of the event loop, and (modulo my understanding of very intricate JS behavior) the execution of a JavaScript function will never be interrupted to service the event loop.

In other words, between the calls to sendMessage() and setMessageHandler() in the example above, the browser does not actually deliver the message. Therefore, by the time the message is actually delivered, the receiver will be listening!
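To make the queue-then-deliver model concrete, here is a toy analogy written in Go. To be clear, this is my own sketch of the idea, not browser code and not anything from the specification: a single-threaded “window” whose postMessage only queues a delivery task, and whose task queue is drained only after the current function has returned.

package main

import "fmt"

// window is a toy model of the browser behavior described above, not the real API.
type window struct {
	tasks    []func()            // the "posted message task source" queue
	handlers []func(data string) // registered "message" listeners
}

// postMessage only queues a delivery task; the handlers are looked up when the
// task actually runs, not when the message is "sent".
func (w *window) postMessage(data string) {
	w.tasks = append(w.tasks, func() {
		for _, h := range w.handlers {
			h(data)
		}
	})
}

func (w *window) addEventListener(h func(data string)) {
	w.handlers = append(w.handlers, h)
}

// runEventLoop stands in for the browser servicing the event loop after the
// current script has finished executing.
func (w *window) runEventLoop() {
	for _, task := range w.tasks {
		task()
	}
	w.tasks = nil
}

func main() {
	w := &window{}

	// The equivalent of sendMessageAndSetHandler(): send first, register second.
	w.postMessage("I can hear you!")
	w.addEventListener(func(data string) { fmt.Println("message:", data) })

	// Only now is the queue serviced, so the listener registered after
	// postMessage still hears the message.
	w.runEventLoop()
}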

Makes perfect sense, right?

The Impact of android:debuggable on Android Application Startup Time

For several weeks, members of my team and I have been trying to reconcile discrepancies between our application’s startup time as measured locally and as measured on our CI platform. We weren’t sure what was causing the discrepancies, so we resorted to brainstorming about possible causes.

One of our most promising hypotheses was that there was more of a difference between the measurement techniques than we had imagined. While we weren’t sure how the application’s startup time was being measured on the CI platform, we knew how it was being measured locally: using the Android Profiler.

We were testing the same build variant (a “release” build) locally and on the CI platform, which meant that we were executing the same byte code in both places. What other variables could account for the difference in measurements? It occurred to us that the CI platform might be taking its measurements with some tool other than the Android Profiler, and one of the prerequisites of the Android Profiler is that the application under profile be built with the android:debuggable flag set to true.

Our team operated under the assumption that the value of this flag simply controlled an Android OS permission that dictated whether a debugger/profiler could attach to the process at runtime. In other words, we believed that the value of this flag should not change the way the Android OS’s runtime treated the application. However, this was the only potential difference we could imagine between the local and CI platform test environment, so we decided to investigate.

Turns out this was a good idea.

To properly isolate whether the debuggable flag was the cause of the change in startup performance, we had to produce two versions of the application with identical byte code but different values of the debuggable flag, so that any difference in startup time could be definitively attributed to the runtime. Ideally, then, we would toggle the flag on two copies of the same compiled version of the application (an apk).

The android:debuggable flag exists in an apk’s AndroidManifest.xml file. In older versions of Android, this file was, as the extension suggests, plain-text XML. However, in modern versions of Android, the AndroidManifest.xml file is stored in a binary format inside the apk.

There is not much publicly available documentation (that we could find quickly) about this file format. Several tools are available for inspecting its contents (e.g., aapt from the Android SDK) and we found some tools that promised to help modify the contents of an existing apk (e.g., apktool) but couldn’t get them to work properly.

That meant we were going to have to modify the binary-formatted xml file by hand! Fortunately, there is a good, open-source, standalone tool, axmldec, whose source code implicitly contains enough information about the file format to give us an idea of the binary format’s structure.

The android:debuggable flag is an attribute of an element; in this case, that element is the application itself. According to the source code, each element attribute is stored on disk in a 20-byte structure:

ns          4 bytes
name        4 bytes
raw_value   4 bytes
value       8 bytes

Taking a cue from Java class file encoding techniques, this binary format stores every string in a table, and each use of a string is stored simply as a pointer to an entry in that table. Because the namespace (ns) and name (name) of the attribute are strings, their values in this structure are 4-byte pointers into the string table. In this case, those strings are “android” and “debuggable”. The raw_value and value fields are more interesting. If the raw_value field is not 0xffffffff, then it is a pointer to a string in the string table; in other words, the attribute’s value is that string and the value field is ignored. On the other hand, if the raw_value field is 0xffffffff, then the attribute’s value is a primitive that can be represented in the 8-byte value field.

A separate format defines the structure of the data in the value field:

size        2 bytes
res0        1 byte
data_type   1 byte
data        4 bytes

The data and data_type fields are the most useful. Again, according to the source code, a boolean (the type which we suppose the android:debuggable attribute to be) has a data_type value of 0x12. For a boolean, “false” is 0x00000000 in the data field and “true” is anything else.

With that understanding, we compiled a version of the application with the debuggable flag set to true and unzipped the resulting apk to get access to AndroidManifest.xml. Then, we used a modified version of axmldec to find the offset of the android:debuggable attribute in the application element. Using our newfound knowledge of the format of the elements, we used vi and xxd to manually change the attribute’s value from “true” to “false”. We used that modified AndroidManifest.xml file to create a non-debuggable version of the apk by zipping the contents of the directory generated when we unzipped the original apk. Finally, we signed the apk with our key and verified that our change to the debuggable flag worked ($ aapt d badging <path to apk> | grep -i debug).
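For reference, the same edit can be scripted instead of done by hand. The following is a rough Go sketch of the byte-level change described above, not a production tool: it assumes you already know the offset of the attribute’s 8-byte value structure (we found ours with a modified axmldec), that the layout matches the description above, and that the fields are little-endian. The file name and the zero offset are placeholders.

package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

// typeBoolean is the data_type that marks a boolean attribute value (0x12).
const typeBoolean = 0x12

// setDebuggable rewrites the 4-byte data field of a boolean attribute value
// located at valueOffset in the binary AndroidManifest.xml contents.
func setDebuggable(manifest []byte, valueOffset int, debuggable bool) error {
	v := manifest[valueOffset : valueOffset+8] // size(2), res0(1), data_type(1), data(4)
	if v[3] != typeBoolean {
		return fmt.Errorf("attribute at 0x%x is not a boolean (data_type=0x%x)", valueOffset, v[3])
	}
	data := uint32(0) // "false" is zero in the data field...
	if debuggable {
		data = 0xffffffff // ...and "true" is anything else
	}
	binary.LittleEndian.PutUint32(v[4:8], data)
	return nil
}

func main() {
	// Hypothetical usage: flip android:debuggable to false at a known offset.
	buf, err := os.ReadFile("AndroidManifest.xml") // the binary manifest from the unzipped apk
	if err != nil {
		panic(err)
	}
	valueOffset := 0x0 // placeholder: the offset reported by a tool such as axmldec
	if err := setDebuggable(buf, valueOffset, false); err != nil {
		panic(err)
	}
	if err := os.WriteFile("AndroidManifest.xml", buf, 0o644); err != nil {
		panic(err)
	}
}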

Breathe.

After all this work we learned …? Nothing. Yet.

We did, however, have a basis for A/B testing: two applications, identical except for the value of their debuggable flag.

We had to answer one final question before starting to test: how could we fairly measure application startup time? Remember, because only one of the two variants is marked as debuggable, we cannot use the Android Profiler. We found our answer in the am_* events.

The Android Activity Manager writes am_* entries to the event log when certain meaningful events occur. You can see a list of all the am_* tags here. We decided to rely on the am_proc_start event, which contains information about the time an application takes to start.
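Collecting those events is simple enough to script. Here is a rough Go sketch of how one might dump the events log buffer over adb and keep only the am_proc_start lines; the package name is a placeholder, and the exact fields in each line vary across Android versions.

package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Dump the (non-default) events log buffer and exit instead of streaming.
	out, err := exec.Command("adb", "logcat", "-b", "events", "-d").Output()
	if err != nil {
		panic(err)
	}
	scanner := bufio.NewScanner(strings.NewReader(string(out)))
	for scanner.Scan() {
		line := scanner.Text()
		// Keep only process-start events for our application (placeholder name).
		if strings.Contains(line, "am_proc_start") && strings.Contains(line, "com.example.app") {
			fmt.Println(line)
		}
	}
}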

We didn’t need to build a data set where n was, say, 1000; we just needed something “rough and ready”. Our test tool ran each version of the application 5 times and recorded the am_proc_start time for each run, and we analyzed those values.

You can see the raw data here.

Well, wow! The results indicate that our assumption was almost entirely wrong! Simply changing the value of the android:debuggable flag has a significant impact on the time it takes an application to start.

For us, the implications of this discovery are enormous.

  1. The measurements that we are taking of the time it takes to start our application on the CI platform and in the profiler cannot be compared directly with the reported startup times of comparable applications (unless, of course, they are also measuring versions of their application with the android:debuggable flag set).
  2. A corollary to (1) is that we can use timing results from Android Profiler and the CI platform to monitor performance changes over time. The caveat is that we have to be vigilant against the possibility that our optimizations are only improving the performance of startup components that the runtime executes when the debuggable flag is set. While these optimizations will not negatively affect startup performance under routine conditions, spending time on them would be a waste of engineering resources.
  3. We must, now, consider whether the value of the android:debuggable attribute changes other aspects of runtime performance. Although we do not believe that it does, these results show that we can no longer be certain of that.

As a result of this investigation, we are going to spend time researching where the runtime checks the debuggable flag and how it changes its behavior based on that value. We have several theories.

If you have any information leading to the arrest of the party responsible for the lost performance, please contact us in the comments!

Thank you to my fantastic teammate Anny who provided valuable feedback on a draft of this post. If you think that the post is well written and informative, it’s because of her. If you think that the post is confusing and boring, I take full responsibility.

Correcting ProGuard Configuration To Improve the Startup Time of Firefox’s New Mobile Browser

As a member of the performance engineering group at Mozilla, I am one of the people helping Firefox Preview’s development team with optimizing its startup time. Over several weeks, we have implemented many “easy” optimizations that significantly improved application startup time. Easy is in quotes to emphasize the irony — none of those startup performance optimizations were truly easy. The entire Firefox Preview team contributed to that effort and we were able to improve application startup by, primarily but not exclusively, deferring or parallelizing time-consuming startup tasks. More blog posts to come detailing the team’s earlier startup optimization achievements; the scope of this entry is more limited.

After shaking free all the low-hanging fruit from the startup speed tree, we wanted to step back and look at application startup time from a wider perspective. I began to wonder if there was a way that we could optimize the application’s on-disk representation to make it easier for the Android OS to load it into memory. I invited Jonathan Almeida, a talented, smart colleague, into the investigation, and the first thing we noticed was that on a Pixel 2 it took the OS more than three quarters of a second to load the application before even starting the Activity lifecycle.

Using the Android Profiler, we found that almost all of the application-loading time was spent reading and parsing the application’s dex files. dex files contain the application’s byte code in Dalvik Executable format. Using Android Studio’s tools for analyzing an apk (the compressed file that contains all of the application’s resources, including code, images, layouts, etc.), Jonathan and I were able to inspect the size and composition of Firefox Preview’s dex files.

Figure 1

Figure 1 shows Studio’s Apk Analyzer at work. We saw that this version of Firefox Preview’s code requires three dex files, making it a so-called multi-dex application. Upon discovering that Firefox Preview’s code did not fit in a single dex file and because Jonathan knew that multi-dex files have implications for runtime performance, we hypothesized that they also negatively affect startup performance.

There are several reasons an application may end up with multiple dex files. Size is one of them. Firefox Preview has a robust set of dependencies and we weren’t sure how well we minimized and tracked the library dependencies listed in our Gradle build files. If there were dependencies listed that we did not use, removing those would reduce the size of Firefox Preview’s code and, Jonathan and I hoped, shrink the size of the bytecode to the point where it could fit in a single dex file.

Following best practices of Android development, we already used ProGuard within our build process to strip out unused methods and functions. The use of ProGuard means that a manual accounting of dependencies like this should have been unnecessary. Jonathan and I decided to do the manual inspection for excess code regardless, mostly out of curiosity. We found just a single dependent library that was listed in the Gradle build file but not actually used: Glide.

Glide is a library that ‘offers an easy to use API, a performant and extensible resource decoding pipeline and automatic resource pooling’ for ‘image loading … focused on smooth scrolling’.

The next step was to see if that code made it into Firefox Preview’s dex files. Again, because we were already using ProGuard, that code should not have been present. But, lo and behold, it was.

Figure 2

Figure 2 shows that the code for Glide did end up in Firefox Preview’s dex file. Relative to the overall size of Firefox Preview’s three dex files, removing these unused classes and methods was not going to have a major impact on their size.

Critically, though, it did indicate that something about our use of ProGuard was amiss.

There are a bewildering number of options available for configuring ProGuard. By complete accident, we noticed the -whyareyoukeeping option and configured ProGuard to tell us why it thought that it could not remove Glide from the output dex file despite the fact that Firefox Preview never instantiated any of its classes, used any of its static methods, etc.

Figure 3

Figure 3 shows ProGuard thought it needed to keep Glide’s code not because it was used but rather because it was obligated to follow a rule that told it to keep public and static methods from all classes. Those public methods, in turn, introduced a cascading set of dependencies on most of the library’s methods and classes.

The presence of such a catch-all rule was odd — it wasn’t anywhere in Firefox Preview’s codebase! Perhaps it was in the baseline rules included in the Android SDK? Nope, not there. We widened our search into the source code for the libraries that Firefox Preview uses. We didn’t get far before we found the rule in the ProGuard configuration for GeckoView (GV).

Neither of us is a ProGuard expert, so it was a riddle how a ProGuard rule in a dependent library could “escape” and “infect” higher-level packages. We started to search through ProGuard information on the Internet and stumbled upon this “helpful” tidbit:

For library modules and for libraries distributed as AARs, the maintainer of the library can specify rules that will be supplied with the AAR and automatically exposed to the library consumer’s build system by adding this snippet to the module’s build.gradle file: … The rules that you put in the consumer-proguard.txt file will be appended to the main ProGuard configuration and used during the full application build.

Troubleshooting ProGuard issues on Android

Well, oops. GeckoView’s conservative ProGuard configuration seemed to be effectively negating Firefox Preview’s ProGuard code-shrinking optimizations. We were cautiously optimistic that fixing this issue would result in not only the removal of Glide’s methods and classes from Firefox Preview but an overall smaller amount of byte code and, perhaps, an application whose code fit entirely within a single dex file.

It was time to do some work. Jonathan and I removed that directive from GeckoView’s ProGuard configuration, rebuilt Firefox Preview and went back to our friend, the Android Studio Apk Analyzer.

Figure 4

Wow. From three dex files totaling almost 5.5MB to a single dex file totaling 3.4MB, all with a single change. Major progress! But there are still two outstanding questions.

First, why was Glide still referenced in the new, smaller dex file? Although fewer of its methods were kept, it was puzzling to see it there at all. Remember, Firefox Preview never uses it! To answer this question, we went back to the trusty -whyareyoukeeping option in ProGuard. Again, we found that ProGuard kept those methods because of a configuration rule.

Figure 5

Funny, though, this rule existed nowhere in any of Mozilla’s code. grepping through the Firefox Preview source code and the codebases of the various Mozilla components used by Firefox Preview confirmed this. The only remaining option was that a dependency used by Firefox Preview was doing the same thing that GeckoView was doing: leaking a static ProGuard configuration to its users that was overriding ProGuard’s willingness to strip Glide entirely from the output dex file. We guessed that it was more than likely in Glide itself.

Jonathan and I found that very ProGuard configuration directive in the Glide source code. The sad implication of this finding is that we could not rely on ProGuard to automatically remove Glide entirely and we would have to manually update Firefox Preview’s dependency configuration. Following the appropriate two-line change to Firefox Preview’s build.gradle file, we rebuilt Firefox Preview’s apk to test our work.

Figure 6

Whew. Glide is finally exorcised from the dex file!

Second, and more importantly: Did going from a multi-dex to a single-dex application have an impact on startup performance? The Android Profiler could help Jonathan and me definitively answer this question. Figures 7 and 8 show the relevant portion of the profile of Firefox Preview startup before and after the ProGuard configuration change:

Figure 7
Figure 8

Profiling confirmed our initial hypothesis: multi-dex Android applications negatively affect startup and runtime performance.

The path Jonathan and I followed to achieve a decrease in the time that the Android OS took to load Firefox Preview led us on an Alice-In-Wonderland-like trip where we found ourselves profiling application runtime, analyzing dex files and apks, and learning more about ProGuard than we ever wanted to know. The savings we gained were far more than we could have imagined when we started and more than justified the time we spent traveling this long and winding road.

Update: This post has been edited to remove references to Mozilla-internal names and to explicitly reference my previously unnamed, mystery collaborator.

Update 2: Removed the first paragraph because it contained an implicit promise that I did not intend.

Range Semantics in Go

Introduction

A few months ago, famous Gopher Dave Cheney posted the following question on Twitter:

The answer, for the impatient, is that it does terminate. He and I went back and forth about why and it got me thinking about the semantics of the range statement. I promised Dave that I would write up the results of my experiments and I am finally doing so in this blog post.

The range clause controls a for loop and is used when the programmer wants to loop through all the entries of an array, slice, string, map or channel. In this post I will deal only with the range clause over an array or slice.

Implicit in Dave’s question is how the range clause applied to an array behaves when the loop body modifies that array. In pseudo code, that looks something like this:

for index, value of all indexes and values of A {
    if there is an element of A at (index + 1)
        A[index + 1] = value + 1;
    print value
}

That may seem contrived, but it’s a relatively straightforward way to illustrate the semantics of the range clause.

For the purposes of this post, there will be two arrays:

a = {1,2,3}
b = {4,5,6}

The first thing to remember is that “[t]he range expression is evaluated once before beginning the loop”[1]. Yes, there is an exception to that rule. No, I am not going to focus on it now. The second important thing to remember is that assignments in Go are done by copy. That means that functions receive copies of the parameters placed in the caller’s invocation. It means that a caller gets a copy of the return values from a function it invokes. And, with respect to the range clause, it means that the range clause creates a copy of its input variable to use during iteration.
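Before looking at the range clause itself, here is a quick illustration of that copy rule for arrays:

package main

import "fmt"

func main() {
	a := [3]int{1, 2, 3}
	c := a            // arrays are values: this assignment copies all three elements
	a[0] = 99         // later writes to a are not visible through c
	fmt.Println(a, c) // prints: [99 2 3] [1 2 3]
}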

By Value Range Clause

The first example explores the semantics of, what I will call, the “by value range clause”.

package main

import (
    "fmt"
)

func array_by_value() {
    var a, b [3]int

    a = [3]int{1, 2, 3}
    b = [3]int{4, 5, 6}
    
    fmt.Printf("Ref: 1 2 3\n")
    fmt.Printf("Act: ")
    for i, v := range a {
        fmt.Printf("%v ", v)
        if i < len(a)-1 {
            a[i+1] = b[i+1]
        }
    }
    fmt.Printf("\n")
}

func main() {
    array_by_value()
}

To run this example, you can go to the playground https://play.golang.org/p/fzKuCGhC4n

The output from this example is

Ref: 1 2 3
Act: 1 2 3

So, what happened to our assignments? Why aren’t the updated values printed at each iteration?

When the range statement is evaluated (again, remember that the range clause is evaluated exactly once at the start of the loop), the range clause copies a (again, remember that everything in Go is a copy). Because a is the array, its contents are copied into a temporary. That temporary value is indexed at each iteration of the loop to determine the value v.

See Figure 1 for a visual representation of the contents of memory at the start of the 0th iteration and the start of the 1st iteration. The array values highlighted in red are those indexed at each iteration to produce the value v.

Range by Value

What is important to notice is that Go makes a copy of the entire contents of the array at the beginning of the loop. That is not a problem when the array contains only a few elements, but it has very interesting implications when the array has many elements. I will discuss the implications of range clause semantics for application performance below.

By Reference Range Clause

The second example explores the semantics of, what I will call, the “by reference range clause”. I call it “by reference” because passing an address or a pointer is how programmers achieve “by reference” semantics in “by value” languages like C[2] (and Go).

package main

import (
    "fmt"
)

func array_by_ref() {
    var a, b [3]int

    a = [3]int{1, 2, 3}
    b = [3]int{4, 5, 6}
    
    fmt.Printf("Ref: 1 5 6\n")
    fmt.Printf("Act: ")
    for i, v := range &a {
        fmt.Printf("%v ", v)
        if i < len(a)-1 {
            a[i+1] = b[i+1]
        }
    }
    fmt.Printf("\n")
}

func main() {
    array_by_ref()
}

To run this example, you can go to the playground
https://play.golang.org/p/TFYKoq0csJ

The output from this example is

Ref: 1 5 6
Act: 1 5 6

And, wow. In this version we do get the updated values! So, what’s going on?

When the range statement is evaluated (again, remember that the range clause is evaluated exactly once at the start of the loop), the range clause copies &a (again, remember that everything in Go is a copy). Because &a is just the address of the array, the address is copied into a temporary. That temporary value is indexed at each iteration of the loop to determine the value v.

See the figure below for a visual representation of what is happening in this “by reference” range clause example at the start of the 0th iteration and the start of the 1st iteration. The array values highlighted in red are those indexed at each iteration to produce the value v.

Range by Reference

To reiterate, Go is copying the reference and not the array to the temporary at the start of the loop. I will discuss the performance implications of this later.

By Slice Value Range Clause

The third example explores the semantics of, what I will call, the “by slice value range clause”. I think the name will be self-explanatory when you see the code: it reflects the fact that the range clause copies a slice by value.

package main

import (
    "fmt"
)

func slice_by_value() {
    var a, b [3]int

    a = [3]int{1, 2, 3}
    b = [3]int{4, 5, 6}
    c := a[:]

    fmt.Printf("Ref: 1 5 6\n")
    fmt.Printf("Act: ")
    for i, v := range c {
        fmt.Printf("%v ", v)
        if i < len(a)-1 {
            a[i+1] = b[i+1]
        }
    }
    fmt.Printf("\n")
}

func main() {
    slice_by_value()
}

To run this example, you can go to the playground
https://play.golang.org/p/1OYBLTd30P

The output from this example is

Ref: 1 5 6
Act: 1 5 6 

And, wow again! We are seeing updated values despite the “by value” copy of the slice. What is going on here?

A slice is very much like a reference. Above we saw that a reference is a simple pointer to an underlying object. A slice is a pointer plus metadata: the pointer refers to the underlying array that supplies the slice’s values, and the metadata is the length and capacity of the slice. For more information on the internals of slices in Go, see [3].

In the above example, the slice is copied by value. In other words, the pointer and the metadata are copied and not the array itself. For a visual representation, see the figure below. Again, the array values highlighted in red are those indexed at the start of iteration 0 and iteration 1 to produce the value v.

Range by Slice

To reiterate, Go is copying the slice and not the array to the temporary at the start of the loop. I will discuss the performance implications of the three different range semantics next.

Performance Implications

As mentioned earlier, each of the different range clause semantics has implications for performance. To understand those implications, let’s consider a scenario where a program repeatedly executes a loop over an array that contains a large number of elements.

For our example, we will have an array called base that we are going to loop over 50000 times. base will contain 50000 entries. In this example the inner loop will be relatively contrived.[4]

Here is the Go version of the code that we will test:

package main

import (
    "fmt"
    "time"
)

func speed_of_value_range() {
    const size int = 50000
    const iterations int = 50000
    var base [size]int
    var start_time time.Time
    var k int

    for i := 0; i < size; i++ {
        base[i] = i
    }

    start_time = time.Now()
    for i := 0; i < iterations; i++ {
        for _, j := range base {
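            // Contrived body: j and k are used so the compiler cannot optimize the loop away (see [4]).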
            j++
            k = j
        }
    }
    fmt.Printf("%v Speed: %v\n", k, time.Now().Sub(start_time))
}

func main() {
    speed_of_value_range()
}

On my machine, this code takes 2.153868183s to execute.

Let’s compare that with the runtime for two almost identical functions. The only difference will be the way that we use the range clause.

In the next example, we will execute the range clause by reference.

package main

import (
    "fmt"
    "time"
)

func speed_of_ref_range() {
    const size int = 50000
    const iterations int = 50000
    var base [size]int
    var start_time time.Time
    var k int

    for i := 0; i < size; i++ {
        base[i] = i
    }

    start_time = time.Now()
    for i := 0; i < iterations; i++ {
        for _, j := range &base {
            j++
            k = j
        }
    }
    fmt.Printf("%v Speed: %v\n", k, time.Now().Sub(start_time))
}

func main() {
    speed_of_ref_range()
}

On my machine, this code takes 1.323862716s to execute. Wow! That’s half the time!

Let’s try a third and final example where we will execute the range clause by slice value.

package main

import (
    "fmt"
    "time"
)

func speed_of_slice_range() {
    const size int = 50000
    const iterations int = 50000
    var base [size]int
    var start_time time.Time
    var k int

    c := base[:]

    for i := 0; i < size; i++ {
        base[i] = i
    }

    start_time = time.Now()
    for i := 0; i < iterations; i++ {
        for _, j := range c {
            j++
            k = j
        }
    }
    fmt.Printf("%v Speed: %v\n", k, time.Now().Sub(start_time))
}

func main() {
    speed_of_slice_range()
}

On my machine, this takes 1.279790841s to execute. Wow, even faster still (albeit by a smaller amount).

It seems obvious that we should execute range clauses by reference or by slice value whenever possible in order to gain as much performance as possible. Why is this?

If we look at the object code generated by the Go compiler for each of these loops, the answer is clear!

Here is the relevant section of object code for the first example, speed_of_value_range():

  486296: 31 db                 xor    ebx,ebx
  486298: 31 f6                 xor    esi,esi
  48629a: 48 81 fb 50 c3 00 00  cmp    rbx,0xc350
  4862a1: 7d 59                 jge    4862fc 
  4862a3: 48 8d bc 24 e0 1a 06  lea    rdi,[rsp+0x61ae0]
  4862aa: 00
  4862ab: 49 89 f0              mov    r8,rsi
  4862ae: 48 8d 74 24 60        lea    rsi,[rsp+0x60]
  4862b3: 41 89 c9              mov    r9d,ecx
  4862b6: 48 c7 c1 50 c3 00 00  mov    rcx,0xc350
  4862bd: f3 48 a5              rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
  4862c0: 31 c9                 xor    ecx,ecx
  4862c2: 48 8d b4 24 e0 1a 06  lea    rsi,[rsp+0x61ae0]
  4862c9: 00
  4862ca: 48 81 f9 50 c3 00 00  cmp    rcx,0xc350
  4862d1: 7d 17                 jge    4862ea 
  4862d3: 4c 8b 16              mov    r10,QWORD PTR [rsi]
  4862d6: 48 83 c6 08           add    rsi,0x8
  4862da: 48 ff c1              inc    rcx
  4862dd: 4d 8d 42 01           lea    r8,[r10+0x1]
  4862e1: 48 81 f9 50 c3 00 00  cmp    rcx,0xc350
  4862e8: 7c e9                 jl     4862d3 
  4862ea: 48 ff c3              inc    rbx
  4862ed: 44 89 c9              mov    ecx,r9d
  4862f0: 4c 89 c6              mov    rsi,r8
  4862f3: 48 81 fb 50 c3 00 00  cmp    rbx,0xc350
  4862fa: 7c a7                 jl     4862a3

And here is the relevant section of object code for the second example, speed_of_ref_range():

  4864d6: 31 db                 xor    ebx,ebx
  4864d8: 31 f6                 xor    esi,esi
  4864da: 48 81 fb 50 c3 00 00  cmp    rbx,0xc350
  4864e1: 7d 33                 jge    486516
  4864e3: 31 ff                 xor    edi,edi
  4864e5: 4c 8d 44 24 60        lea    r8,[rsp+0x60]
  4864ea: 48 81 ff 50 c3 00 00  cmp    rdi,0xc350
  4864f1: 7d 17                 jge    48650a
  4864f3: 4d 8b 08              mov    r9,QWORD PTR [r8]
  4864f6: 49 83 c0 08           add    r8,0x8
  4864fa: 48 ff c7              inc    rdi
  4864fd: 49 8d 71 01           lea    rsi,[r9+0x1]
  486501: 48 81 ff 50 c3 00 00  cmp    rdi,0xc350
  486508: 7c e9                 jl     4864f3
  48650a: 48 ff c3              inc    rbx
  48650d: 48 81 fb 50 c3 00 00  cmp    rbx,0xc350
  486514: 7c cd                 jl     4864e3

Although there are a number of differences between the two snippets of object code, one stands out: speed_of_value_range() contains this instruction:

rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]

How is that instruction causing us such a problem? It’s easy to see once we think about its semantics. The rep modifier means that the movs operation is repeated the number of times contained in the count register (e/r)cx. The instruction just before the rep movs is

mov    rcx,0xc350

In other words, we are copying 0xc350 quad words (eight bytes each) every time we begin the inner loop. That means that over the life of the speed_of_value_range() function the CPU will perform 0xc350 * 0xc350 moves that it will not perform in the speed_of_ref_range() function. [5]

That’s where we get the speedup!
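As a side note, a more idiomatic way to take these measurements is with the benchmarking support in Go’s testing package. Here is a rough sketch (the package and function names are mine); save it in a file whose name ends in _test.go and run go test -bench=. to reproduce the comparison:

package rangebench

import "testing"

const size = 50000

var base [size]int
var sink int // assigned in the loop bodies so the compiler cannot discard them (see [4])

func BenchmarkValueRange(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, j := range base { // the whole array is copied before the loop runs
			j++
			sink = j
		}
	}
}

func BenchmarkRefRange(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, j := range &base { // only the address of the array is copied
			j++
			sink = j
		}
	}
}

func BenchmarkSliceRange(b *testing.B) {
	for i := 0; i < b.N; i++ {
		c := base[:]
		for _, j := range c { // only the slice header is copied
			j++
			sink = j
		}
	}
}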

Conclusion

I hope that this post helped you understand the different semantics of the range clause. When the loop body updates the container you are ranging over, the form of the range expression determines which values you see in later iterations.

There are also performance implications. The simple advice for the best-performing code is to use by-value range semantics sparingly!

I’d love to hear what you think about these results and whether you agree or disagree. Please let me know in the comments below!

 

References

[1] This comes from the Go Programming Language Specification at https://golang.org/ref/spec.

[2] For an explanation of this, see http://denniskubes.com/2012/08/20/is-c-pass-by-value-or-reference/

[3] Go slices: Usage and internals https://blog.golang.org/go-slices-usage-and-internals

[4] The inner loop body is contrived just so that we can trick the compiler. The compiler aggressively optimizes loops when the body does not use the range value and index variables. This will affect our timing results.

[5] It will not do those moves in the speed_of_slice_range() either. Proving this is left as an exercise for the reader! I love when people say that!