When reviewing a team member's pull request recently, I saw code like this:
string.split("/").first
Split is a super useful method in Ruby. It does exactly what it says on the tin: it splits strings into arrays. The first parameter is the pattern on which to split (usually a string or regular expression), and the second is a limit on the number of substrings returned. That means that "dogs,cats,turtles".split(",", 2) will return the array ["dogs", "cats,turtles"].
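To make the effect of the limit concrete, here is a small sketch comparing a few values. The constant names are just for illustration; the behavior shown is that of Ruby's built-in String#split.

```ruby
PETS = "dogs,cats,turtles"

# No limit: every field becomes its own element.
NO_LIMIT = PETS.split(",")

# Limit of 2: at most two elements; the remainder is left unsplit.
LIMIT_2 = PETS.split(",", 2)

# Limit of 1: the whole string comes back as the only element.
LIMIT_1 = PETS.split(",", 1)

p NO_LIMIT # => ["dogs", "cats", "turtles"]
p LIMIT_2  # => ["dogs", "cats,turtles"]
p LIMIT_1  # => ["dogs,cats,turtles"]
```

Note that the last element of a limited split still contains the delimiter, since Ruby stops looking for matches once the limit is reached.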
Since we only care about the first substring in the above code, we can use this second parameter, but I wondered if we should. I decided to pull out benchmark-ips and take a look.
If you're unfamiliar with benchmark-ips, it measures how many times a piece of code can run per second, and then compares that figure against the other pieces of code. The first test I did was this (threw in regex just because I was curious):
require "benchmark/ips"

STR = "sup/" * 5
Benchmark.ips do |x|
  x.report("split no param") { STR.split("/").first }
  x.report("split param") { STR.split("/", 2).first }
  x.report("regex") { STR.gsub(/\/.*/, "") }
  x.compare!
end
Which yielded these results:
Warming up --------------------------------------
split no param 117.230k i/100ms
split param 214.826k i/100ms
regex 61.869k i/100ms
Calculating -------------------------------------
split no param 1.569M (± 6.7%) i/s - 7.854M in 5.033527s
split param 3.497M (± 5.7%) i/s - 17.616M in 5.054218s
regex 713.532k (± 7.6%) i/s - 3.588M in 5.061920s
Comparison:
split param: 3497489.0 i/s
split no param: 1568554.7 i/s - 2.23x slower
regex: 713532.1 i/s - 4.90x slower
Cool! So using the second parameter is about twice as fast as not using it. My guess is that the wins come from two places: not having to search the rest of the string (you can see in the source that execution is short-circuited based on limit), and not having to allocate memory for the additional string objects.
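As a quick sanity check on that guess, this sketch shows what each approach actually produces for the small test string. The constant names here are hypothetical and chosen so they don't clash with STR from the benchmark.

```ruby
SHORT_STR = "sup/" * 5 # "sup/sup/sup/sup/sup/"

# Without a limit, every segment becomes its own String object
# (trailing empty strings are dropped).
WITHOUT_LIMIT = SHORT_STR.split("/")

# With a limit of 2, only the first segment is split off;
# the rest of the string is returned untouched as the second element.
WITH_LIMIT = SHORT_STR.split("/", 2)

# The regex version deletes everything from the first slash onward.
VIA_REGEX = SHORT_STR.gsub(/\/.*/, "")

p WITHOUT_LIMIT # => ["sup", "sup", "sup", "sup", "sup"]
p WITH_LIMIT    # => ["sup", "sup/sup/sup/sup/"]
p VIA_REGEX     # => "sup"
```

All three give the same first substring, so the benchmark is comparing like with like; the difference is how many intermediate objects get built along the way.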
But do the performance wins scale, or are they relatively constant? I next tried the same experiment with a much larger string:
STR = "sup/" * 1000
Benchmark.ips do |x|
  x.report("split no param") { STR.split("/").first }
  x.report("split param") { STR.split("/", 2).first }
  x.report("regex") { STR.gsub(/\/.*/, "") }
  x.compare!
end
With good results!
Warming up --------------------------------------
split no param 1.438k i/100ms
split param 192.331k i/100ms
regex 495.000 i/100ms
Calculating -------------------------------------
split no param 16.318k (± 5.2%) i/s - 81.966k in 5.037491s
split param 3.263M (± 6.2%) i/s - 16.348M in 5.032953s
regex 5.011k (± 6.0%) i/s - 25.245k in 5.057195s
Comparison:
split param: 3262671.0 i/s
split no param: 16317.8 i/s - 199.95x slower
regex: 5011.1 i/s - 651.09x slower
Now it's about 200 times slower without the second parameter. Out of curiosity, I tried a string that is long before the first slash but doesn't contain many slashes:
STR = "#{"s" * 4000}/up/sup"
Benchmark.ips do |x|
  x.report("split no param") { STR.split("/").first }
  x.report("split param") { STR.split("/", 2).first }
  x.report("regex") { STR.gsub(/\/.*/, "") }
  x.compare!
end
Which gave some interesting results:
Warming up --------------------------------------
split no param 82.791k i/100ms
split param 91.227k i/100ms
regex 24.193k i/100ms
Calculating -------------------------------------
split no param 1.259M (±12.3%) i/s - 6.292M in 5.081082s
split param 1.511M (± 8.5%) i/s - 7.572M in 5.051203s
regex 278.416k (± 7.3%) i/s - 1.403M in 5.071412s
Comparison:
split param: 1511242.5 i/s
split no param: 1258551.1 i/s - same-ish: difference falls within error
regex: 278415.6 i/s - 5.43x slower
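Only slightly faster now, and the difference is within the benchmark's margin of error! Both versions still have to scan the 4,000 characters before the first slash, so the limit saves very little here. This points to the performance gains in the earlier examples coming from not having to walk (and split up) the rest of the string.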
It's worth noting that this only measures execution time, not memory consumption, where you might also see some wins. Either way, passing the limit to split seems worthwhile if you know how many substrings you'll actually care about.
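I didn't measure memory for this post, but here is a rough sketch of how one might compare allocations. It assumes MRI (CRuby), where GC.stat(:total_allocated_objects) is a running count of allocated objects; the allocations helper and the constant names are my own and not part of benchmark-ips.

```ruby
# Count how many objects are allocated while the block runs.
# Relies on MRI's GC.stat counter, so treat it as an approximation.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

LONG_STR = "sup/" * 1000

NO_LIMIT_ALLOCS = allocations { LONG_STR.split("/").first }
LIMIT_ALLOCS = allocations { LONG_STR.split("/", 2).first }

puts "split no param: #{NO_LIMIT_ALLOCS} objects allocated"
puts "split param:    #{LIMIT_ALLOCS} objects allocated"
```

The exact counts will vary by Ruby version, but the unlimited split has to create a string for every segment, so its count should be far higher on a string like this.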
And, if nothing else, don't use regex for this specific problem.