Javascript-Proposal proper string split: Proper string splitting for ECMAScript

Proper String Split

Image from @DasSurma

Status

Champion(s): Luca Casonato

Author(s): Luca Casonato

Stage: 0

Motivation

The string split method in JavaScript behaves unlike the string split methods in nearly all other languages. In JavaScript, a splitN (a split with at most N return values) is essentially a regular split with the output array truncated to the first N values.

In most other languages splitN instead splits the original string N amount of times. The last value in the returned array is the "remainder" of the string.

# Perl

print join("\n", split(/\|/, 'a|b|c|d|e|f', 2));

# a
# b|c|d|e|f

// Java

class Playground {
  public static void main(String[] args) {
    String s = "a|b|c|d|e|f";
    // String.split takes a regex, so the pipe has to be escaped
    for (String val : s.split("\\|", 2)) {
      System.out.println(val);
    }
  }
}

// a
// b|c|d|e|f

<!-- PHP -->
<?php

print join("\n", explode("|", "a|b|c|d|e|f", 2));

# a
# b|c|d|e|f

# Ruby

print 'a|b|c|d|e|f'.split('|', 2)

# ["a", "b|c|d|e|f"]

# Python

print('a|b|c|d|e|f'.split('|', 2))

# ['a', 'b', 'c|d|e|f']

// Go

package main

import (
  "fmt"
  "strings"
)

func main() {
  fmt.Printf("%#v", strings.SplitN("a|b|c|d|e|f", "|", 2))
}

// []string{"a", "b|c|d|e|f"}

// Rust

fn main() {
  let v = "a|b|c|d|e|f".splitn(2, "|").collect::<Vec<_>>();
  println!("{:?}", v);
}

// ["a", "b|c|d|e|f"]

// JavaScript

console.log("a|b|c|d|e|f".split("|", 2));

// ["a", "b"]

JavaScript is definitely the odd one out here. Python's behavior is a little unusual (its count is the number of splits rather than the number of parts), but at least no part of the input string is lost. All of the other tested languages agree with each other.

This behaviour is good because it makes split the inverse of join. In JavaScript, the assertion val.split(sep, N).join(sep) === val does not hold for all values of N.
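
The failing round trip can be checked directly with the existing split method. The snippet below is only a minimal illustration; the variable names are chosen for the example.

```javascript
// Round-trip check: does split followed by join reproduce the input?
const val = "a|b|c|d|e|f";
const sep = "|";

// Without a limit, split and join undo each other.
const full = val.split(sep).join(sep);
console.log(full === val); // true

// With a limit, everything after the first N values is dropped,
// so joining the result cannot reproduce the original string.
const limited = val.split(sep, 2).join(sep);
console.log(limited); // a|b
console.log(limited === val); // false
```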

Proposal

The proposal is to add a new String.prototype.splitn method that splits the string into at most N parts and returns the remainder as the last part, rather than performing a full split and throwing away everything after the first N values.

console.log("a|b|c|d|e|f".splitn("|", 2));
// ["a", "b|c|d|e|f"]

The naming is taken from Rust and Go.
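
As a rough illustration of the intended behaviour, here is a minimal sketch written as a standalone function rather than a prototype method. It is not a specification or an official polyfill: the argument order mirrors the example above, and it assumes a non-empty string separator and a positive integer N. How RegExp separators and N = 0 should behave is left open here.

```javascript
// Hypothetical helper for illustration only: splits `str` into at most `n`
// parts on a string separator, keeping the unsplit remainder in the last part.
function splitn(str, sep, n) {
  // Assumption: only non-empty string separators are supported in this sketch.
  if (typeof sep !== "string" || sep === "") {
    throw new TypeError("sep must be a non-empty string");
  }
  // Assumption: n must be a positive integer; n = 0 is rejected here.
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError("n must be a positive integer");
  }

  const parts = [];
  let start = 0;
  // Consume at most n - 1 separators, so that at most n parts are produced.
  while (parts.length < n - 1) {
    const idx = str.indexOf(sep, start);
    if (idx === -1) break; // no separators left
    parts.push(str.slice(start, idx));
    start = idx + sep.length;
  }
  // Whatever is left, separators included, becomes the final part.
  parts.push(str.slice(start));
  return parts;
}

console.log(splitn("a|b|c|d|e|f", "|", 2)); // ["a", "b|c|d|e|f"]
console.log(splitn("a|b|c|d|e|f", "|", 2).join("|") === "a|b|c|d|e|f"); // true
```

Because the remainder is kept, joining the result with the same separator gives back the original string for any N accepted by this sketch.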

Q&A

Could this be an extra option for the split method?

Yes. This could also be exposed through split itself, for example via a new options bag. Some possible forms:

console.log("a|b|c|d|e|f".split("|", { n: 2 }));
// or
console.log("a|b|c|d|e|f".split("|", 2, true));
// or
console.log("a|b|c|d|e|f".split("|", { n: 2, remainder: true }));

The first form may be confusing to users, as it is not obvious that split("|", 2) and split("|", { n: 2 }) return different values. Overloads of this kind exist on the web platform (e.g. addEventListener), but there the form you use does not impact behaviour.

The second form makes the difference more visible at the call site, but it is not obvious what the true in the third argument means.

The third form is the clearest, but it is also the most verbose. The verbosity may make it cumbersome to use.

Which of the 4 proposed options is ultimately used should be up to the committee as a whole. I don't really care (although I prefer splitn or the final option).

Comments

  • Interpretation of count argument

    Jan 19, 2022

    In #3 this was mentioned and it seems to merit its own thread:

    It seems to me like Python and JavaScript are the only ones doing the intuitive thing

    I agree, I’d also expect the count to refer to the number of splits, not the number of parts including the remainder. I guess this is pretty subjective and may be chiefly dependent on what languages one is accustomed to. However I think there are also some less subjective reasons to use Python's behavior / the existing count semantics:

    • having a different interpretation of the count argument between two similarly named methods that do almost the same thing could be a pretty nasty refactoring hazard.
    • people will still use split given most usage doesn’t require the count param and the name is familiar, so we’ll end up having to memorize a new arbitrary difference.
    • zero is a valid/rational count. positive-integer-but-not-zero seems uncommon in today's JS (toPrecision ... any others?).

    Re: zero, I’m pretty sure I’m splitting zero things at the moment, so it’d be weird to discover that’s mathematically impossible! Though I’m sure there are plenty of exceptions, when zero is not a valid number-of-things that seems like a hint that something might be off conceptually ... off by one.

  • Will a RegExp be allowed as the first argument?

    Jan 19, 2022

    The current split method accepts a RegExp as the first argument. In that case, the text matched by capturing groups is included in the returned array.

    EDIT: This is the original example which was incorrect as pointed out below.

    const a = 'a|b|c|d';
    a.split(/|/);    // ['a', '|', 'b', '|', 'c', '|', 'd']
    a.split(/|/, 2); // ['a', '|']
    

    EDIT: This is a valid example.

    const a = 'a|b|c|d';
    a.split(/(\|)/);    // ['a', '|', 'b', '|', 'c', '|', 'd']
    a.split(/(\|)/, 2); // ['a', '|']
    

    Will splitn allow RegExp arguments? If so, how will it treat them?

  • will having two split methods cause confusion?

    Jan 20, 2022

    I fear we're just going to create a "splice" scenario where most people are going to just have to look at the documentation every time to figure out what it does now that there's more than one option.

  • what are some examples of things that are difficult to do with the existing split method?

    Jan 20, 2022

    I'm struggling with the motivation for this proposal. It seems like it is a solution in search of a problem. "Other programming languages do it this way" is not significantly motivating if there's an easy way to accomplish the task with the existing method. Do you have any examples where it's hard to do it with what we provide today? Is it just about increased readability? Can you show what a polyfill of your proposed method would look like?

  • N seems off for most languages in the table

    Jan 12, 2022

    In most other languages splitN instead splits the original string N amount of times

    When passing 2, i'd thus expect two separators to be consumed - ie, Python's results. However, every other language in your table only consumes 1 separator.

    It seems to me like Python and JavaScript are the only ones doing the intuitive thing; they just disagree about what to do with the remainder of the string.

    (obviously another mental model could be, "2 means i want 2 parts", but in that case, #2 means it should be a red x, and JS could be green because it gives you two parts)

  • Options object as a third parameter

    Jan 10, 2022

    As an addition to the current signatures in the Q&A section: would it be feasible to also consider an options object as an extra parameter to the current .split?

    "a;b;c;d".split(";", 2, { remainder: true })
    // ["a", "b;c;d"]
    

    This would make it easy to stay backward compatible with the current specification while also being verbose about the behaviour.

    An options object as an extra parameter is also future-proof, as it allows more options to be added if needed. It doesn't require a new function to be added, and the same scheme is used by EventTarget.addEventListener(), which you mention in the proposal.

  • python has a green check but doesn't match the others

    Jan 12, 2022

    In your table, python doesn't return ['a', 'b|c|d|e|f'], it returns ['a', 'b', 'c|d|e|f'].
