The Groovy tokenize() method with 10 practical examples. Split strings into tokens, parse CSV data, and more. Tested on Groovy 5.x.
“Breaking strings into pieces is half of text processing. Knowing which tool to use for the job is the other half.”
Larry Wall, Programming Perl
Last Updated: March 2026 | Tested on: Groovy 5.x, Java 17+ | Difficulty: Beginner to Intermediate | Reading Time: 14 minutes
Parsing a CSV line, pulling apart a log entry, extracting words from user input – breaking strings into pieces is one of the most common tasks in text processing. Groovy's tokenize() method splits a string into a List of tokens, and while it looks similar to split() on the surface, the two behave quite differently under the hood.
In this post, we’re going deep on Groovy tokenize — the GDK method that splits a string into a List of tokens. We’ll cover what it does, how it differs from split(), and walk through 10 tested examples that show you exactly how to use it in real projects. If you haven’t already read our complete Groovy string tutorial, that’s a great companion to this post.
We’ll show you precisely when to reach for tokenize() and when split() is the better choice. We’ll also cover the Java StringTokenizer class for those times when you need finer control. And if you want to go deeper on split(), the next post in our series — Groovy Split String — covers it in full detail.
What is tokenize() in Groovy?
The tokenize() method is a GDK (Groovy Development Kit) enhancement to java.lang.String. It splits a string based on delimiter characters and returns a java.util.List of tokens. The critical word there is characters — each character in the delimiter string is treated as a separate delimiter, not as a whole string pattern.
According to the Groovy GDK documentation for String, tokenize() uses java.util.StringTokenizer internally. This means it inherits the same behavior: empty tokens are automatically discarded, and each character in the delimiter parameter acts independently.
Key Points:
- tokenize() returns a List<String>, not a String[] array
- Empty tokens are automatically removed from the result
- Each character in the delimiter string is treated as a separate delimiter
- If called with no arguments, it tokenizes on whitespace (spaces, tabs, newlines)
- It does NOT support regular expressions — use split() for regex-based splitting
- It is a GDK method added by Groovy, not available in plain Java
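All of these points are easy to verify in a Groovy console. Here is a quick sanity check covering the default whitespace behavior, the List return type, and the automatic removal of empty tokens:

```groovy
def s = "  alpha\tbeta\ngamma  "

// No arguments: tokenize on any whitespace (spaces, tabs, newlines)
def words = s.tokenize()
assert words == ['alpha', 'beta', 'gamma']

// The result is a java.util.List, so Groovy list operations work directly
assert words instanceof List
assert words[-1] == 'gamma'

// Empty tokens are dropped: consecutive delimiters produce nothing
assert 'a::b'.tokenize(':') == ['a', 'b']
```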
tokenize() vs split() – Key Differences
This is one of the most commonly asked questions in Groovy string processing, and it’s worth getting clear on before we start with examples. Here’s the breakdown:
tokenize() vs split() Comparison
def text = "one,,two,,,three"
// tokenize() - treats each char as delimiter, drops empties
def tokens = text.tokenize(',')
println "tokenize: ${tokens}"
println "Type: ${tokens.getClass().name}"
println "Size: ${tokens.size()}"
println "---"
// split() - uses regex, keeps empties
def parts = text.split(',')
println "split: ${parts as List}"
println "Type: ${parts.getClass().name}"
println "Size: ${parts.size()}"
Output
tokenize: [one, two, three]
Type: java.util.ArrayList
Size: 3
---
split: [one, , two, , , three]
Type: [Ljava.lang.String;
Size: 6
See the difference? With tokenize(','), the consecutive commas produced no empty tokens — they were silently dropped. With split(','), you get empty strings between consecutive delimiters.
Here’s a quick summary table:
- Return type: tokenize() returns List<String>; split() returns String[]
- Empty tokens: tokenize() removes them; split() keeps them
- Delimiter handling: tokenize() treats each character independently; split() treats the entire string as a regex pattern
- Regex support: tokenize() does not support regex; split() does
- Default delimiter: tokenize() uses whitespace by default; split() requires an argument
Syntax and Basic Usage
Method Signatures
tokenize() Method Signatures
// Tokenize on whitespace (default)
List tokenize()

// Tokenize on a specific character
List tokenize(Character delimiter)

// Tokenize on any of the characters in the string
List tokenize(String delimiters)
The simplest form — tokenize() with no arguments — splits on whitespace. When you pass a string like ",;", it treats both , and ; as separate delimiters. This is a key distinction from split(), which would interpret ,; as a regex pattern matching the literal two-character sequence.
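To make that distinction concrete, here is a small comparison on the same input, where the delimiter argument ",;" means "comma OR semicolon" to tokenize() but "a literal comma followed by a semicolon" to split():

```groovy
def data = 'a,;b,c;d'

// tokenize: ',' and ';' are each a delimiter on their own
assert data.tokenize(',;') == ['a', 'b', 'c', 'd']

// split: ',;' is a regex matching the literal two-character sequence,
// so only the single ",;" occurrence splits the string
assert data.split(',;') as List == ['a', 'b,c;d']
```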
10 Practical Tokenize Examples
Let’s get into the examples. Every single one has been tested on Groovy 5.x, and I’m showing you the real output — no guessing.
Example 1: Basic Tokenize on Whitespace
What we’re doing: Splitting a sentence into words using the default whitespace tokenization.
Example 1: Basic Whitespace Tokenize
def sentence = "Groovy is awesome for scripting"
def words = sentence.tokenize()
println "Words: ${words}"
println "Count: ${words.size()}"
println "First: ${words[0]}"
println "Last: ${words[-1]}"
Output
Words: [Groovy, is, awesome, for, scripting]
Count: 5
First: Groovy
Last: scripting
What happened here: tokenize() cleanly split the string into exactly 5 words, and even if the sentence had contained runs of multiple spaces, no empty strings would sneak in. Since it returns a List, you can use list indexing like [-1] to grab the last element. Try doing that on a String[] from split() — it doesn’t work the same way.
Example 2: Tokenize with a Single Delimiter
What we’re doing: Splitting a comma-separated string into individual values.
Example 2: Single Delimiter
def csv = "apple,banana,cherry,date,elderberry"
def fruits = csv.tokenize(',')
println "Fruits: ${fruits}"
println "Type: ${fruits.getClass().simpleName}"
// Since it's a List, you get all List operations
println "Contains banana? ${fruits.contains('banana')}"
println "Sorted: ${fruits.sort()}"
Output
Fruits: [apple, banana, cherry, date, elderberry]
Type: ArrayList
Contains banana? true
Sorted: [apple, banana, cherry, date, elderberry]
What happened here: We tokenized on commas, and the result is a full ArrayList. That means you immediately get access to methods like contains(), sort(), collect(), find(), and everything else in the Groovy collection toolkit. No need to call toList() first.
Example 3: Multiple Delimiters
What we’re doing: Splitting a string that uses several different separators.
Example 3: Multiple Delimiters
def messy = "one,two;three:four|five"
// Each character in the string is a separate delimiter
def tokens = messy.tokenize(',;:|')
println "Tokens: ${tokens}"
println "Count: ${tokens.size()}"
// Compare with split - you'd need a regex
def splitResult = messy.split('[,;:|]') as List
println "Split: ${splitResult}"
Output
Tokens: [one, two, three, four, five]
Count: 5
Split: [one, two, three, four, five]
What happened here: By passing ',;:|' to tokenize(), we told it to split on any comma, semicolon, colon, or pipe character. Each character in that string is a separate delimiter. With split(), you’d need a regex character class [,;:|] to get the same behavior. The tokenize() approach is cleaner for this use case.
Example 4: tokenize() Removes Empty Tokens
What we’re doing: Demonstrating how tokenize() handles consecutive delimiters and leading/trailing delimiters.
Example 4: Empty Token Handling
def data = ",,hello,,,world,,"
println "tokenize result: ${data.tokenize(',')}"
println "split result: ${data.split(',') as List}"
println "split(-1) result: ${data.split(',', -1) as List}"
// Leading and trailing whitespace
def spaced = " hello world "
println "tokenize spaces: ${spaced.tokenize()}"
Output
tokenize result: [hello, world]
split result: [, , hello, , , world]
split(-1) result: [, , hello, , , world, , ]
tokenize spaces: [hello, world]
What happened here: This is where tokenize() really shines. Leading commas, trailing commas, consecutive commas — all ignored. You get back only the actual content tokens. With split(), those consecutive delimiters produce empty strings. Sometimes you want that behavior (to preserve column positions in data), but often you just want the non-empty values, and tokenize() gives you exactly that.
Example 5: Parsing Simple CSV Data
What we’re doing: Parsing lines of CSV-like data where we know there are no empty fields we need to preserve.
Example 5: Parsing CSV Data
def csvLines = [
"Alice,30,Engineering,Senior",
"Bob,25,Marketing,Junior",
"Charlie,35,Engineering,Lead"
]
println "Name | Age | Department | Level"
println "-" * 45
csvLines.each { line ->
def tokens = line.tokenize(',')
def name = tokens[0].padRight(10)
def age = tokens[1].padRight(5)
def dept = tokens[2].padRight(13)
def level = tokens[3]
println "${name} | ${age}| ${dept}| ${level}"
}
// Quick stats using tokenize + collect
def ages = csvLines.collect { it.tokenize(',')[1].toInteger() }
println "\nAverage age: ${ages.sum() / ages.size()}"
Output
Name | Age | Department | Level
---------------------------------------------
Alice      | 30   | Engineering  | Senior
Bob        | 25   | Marketing    | Junior
Charlie    | 35   | Engineering  | Lead

Average age: 30
What happened here: We used tokenize(',') to split each CSV line into fields, then formatted them into a table. Because tokenize() returns a List, we could chain it right into collect() to extract all ages in one line. For simple CSV where you don’t need to worry about quoted fields or empty columns, tokenize() works great.
Example 6: Parsing Log Lines
What we’re doing: Extracting structured data from log entries using tokenize with multiple delimiters.
Example 6: Parsing Log Lines
def logLine = "2026-03-08 14:30:45 [INFO] UserService - User login successful: admin"
// First, tokenize on spaces to get the major parts
def parts = logLine.tokenize(' ')
println "All parts: ${parts}"
// Extract date and time
def date = parts[0]
def time = parts[1]
println "Date: ${date}, Time: ${time}"
// Extract log level (remove brackets)
def level = parts[2].tokenize('[]')[0]
println "Level: ${level}"
// Extract the message (everything after the dash)
def fullMessage = logLine.substring(logLine.indexOf('- ') + 2)
println "Message: ${fullMessage}"
// Parse the date components
def dateParts = date.tokenize('-')
println "Year: ${dateParts[0]}, Month: ${dateParts[1]}, Day: ${dateParts[2]}"
Output
All parts: [2026-03-08, 14:30:45, [INFO], UserService, -, User, login, successful:, admin]
Date: 2026-03-08, Time: 14:30:45
Level: INFO
Message: User login successful: admin
Year: 2026, Month: 03, Day: 08
What happened here: We used tokenize() at multiple levels. First on spaces to break the log line apart, then on '[]' to strip the brackets from [INFO], and finally on '-' to parse the date. Notice how tokenize('[]') treats [ and ] as separate delimiter characters, which is perfect for stripping brackets.
Example 7: tokenize() Returns a Real List
What we’re doing: Taking advantage of the fact that tokenize() returns a List, not an array, so we can chain Groovy collection methods.
Example 7: List Operations on Tokenized Results
def tags = "groovy, java, kotlin, scala, groovy, java, clojure"
def uniqueTags = tags.tokenize(', ')
.collect { it.trim() }
.unique()
.sort()
println "Unique sorted tags: ${uniqueTags}"
println "Count: ${uniqueTags.size()}"
// Find tags starting with specific letters
def gTags = tags.tokenize(', ').findAll { it.startsWith('g') || it.startsWith('G') }
println "G-tags: ${gTags}"
// Join back into a different format
def hashTags = tags.tokenize(', ')
.unique()
.collect { "#${it}" }
.join(' ')
println "Hashtags: ${hashTags}"
Output
Unique sorted tags: [clojure, groovy, java, kotlin, scala]
Count: 5
G-tags: [groovy, groovy]
Hashtags: #groovy #java #kotlin #scala #clojure
What happened here: Because tokenize() returns a List, we chained collect(), unique(), sort(), findAll(), and join() directly. Notice that we passed ', ' as the delimiter — that means both commas AND spaces are delimiters, which effectively handles the “comma space” separator in one call. No need for trim() if your data is consistently formatted this way.
Example 8: Tokenize with Character Delimiter
What we’re doing: Using a Character instead of a String as the delimiter, and exploring path parsing.
Example 8: Character Delimiter and Path Parsing
// Using a Character delimiter
def path = "/usr/local/bin/groovy"
char slash = '/'
def segments = path.tokenize(slash)
println "Path segments: ${segments}"
println "Root dir: ${segments[0]}"
println "Executable: ${segments[-1]}"
// Windows-style path
def winPath = "C:\\Users\\admin\\Documents\\project"
def winSegments = winPath.tokenize('\\')
println "\nWindows segments: ${winSegments}"
// URL parsing
def url = "https://technoscripts.com/groovy-string-tokenize-examples/"
def urlParts = url.tokenize('/:')
println "\nURL parts: ${urlParts}"
Output
Path segments: [usr, local, bin, groovy]
Root dir: usr
Executable: groovy

Windows segments: [C, Users, admin, Documents, project]

URL parts: [https, technoscripts.com, groovy-string-tokenize-examples]
What happened here: We used a char type for the delimiter, which works exactly the same way. The path parsing example shows how tokenize() naturally handles leading separators — the leading / in the Unix path doesn’t produce an empty first element. The URL example uses '/:' to split on both slashes and colons at once.
Example 9: Real-World Config File Parsing
What we’re doing: Parsing a simple properties-style configuration string into a map.
Example 9: Config File Parsing
def config = """
server.host=localhost
server.port=8080
database.url=jdbc:mysql://localhost:3306/mydb
database.user=admin
database.pool.size=10
app.name=MyGroovyApp
""".trim()
// Parse config into a Map
def configMap = [:]
config.tokenize('\n').each { line ->
def parts = line.tokenize('=')
if (parts.size() >= 2) {
def key = parts[0]
// Rejoin remaining parts in case value contains '='
def value = parts[1..-1].join('=')
configMap[key] = value
}
}
println "Config entries: ${configMap.size()}"
configMap.each { k, v -> println " ${k} = ${v}" }
// Access grouped configs
println "\nServer config:"
configMap.findAll { it.key.startsWith('server.') }
.each { k, v -> println " ${k.tokenize('.')[-1]}: ${v}" }
Output
Config entries: 6
  server.host = localhost
  server.port = 8080
  database.url = jdbc:mysql://localhost:3306/mydb
  database.user = admin
  database.pool.size = 10
  app.name = MyGroovyApp

Server config:
  host: localhost
  port: 8080
What happened here: We used tokenize('\n') to split the config string into lines, then tokenize('=') to split each line into key-value pairs. Notice the careful handling of the database URL — it contains = characters inside the value, so we rejoin everything after the first = using parts[1..-1].join('='). We also used tokenize('.') to extract the last part of dotted config keys.
Example 10: Parsing Command-Line Arguments
What we’re doing: Simulating command-line argument parsing using tokenize().
Example 10: Command-Line Argument Parsing
def commandLine = "--host=localhost --port=8080 --verbose --output=/tmp/results.csv --tags=groovy,java,kotlin"
def args = commandLine.tokenize()
println "Arguments: ${args}"
// Parse into a map of flags and values
def options = [:]
def flags = []
args.each { arg ->
if (arg.contains('=')) {
def parts = arg.tokenize('=')
def key = parts[0].replaceFirst('^--', '')
def value = parts[1..-1].join('=')
options[key] = value
} else {
flags << arg.replaceFirst('^--', '')
}
}
println "\nOptions:"
options.each { k, v -> println " ${k}: ${v}" }
println "\nFlags: ${flags}"
// Parse the tags option further
if (options.tags) {
def tagList = options.tags.tokenize(',')
println "\nTags: ${tagList}"
println "Tag count: ${tagList.size()}"
}
Output
Arguments: [--host=localhost, --port=8080, --verbose, --output=/tmp/results.csv, --tags=groovy,java,kotlin]

Options:
  host: localhost
  port: 8080
  output: /tmp/results.csv
  tags: groovy,java,kotlin

Flags: [verbose]

Tags: [groovy, java, kotlin]
Tag count: 3
What happened here: We used layered tokenization — first splitting the full command line on whitespace, then splitting each key-value argument on =, and finally splitting the tags value on commas. This is a common real-world pattern. Each tokenize() call produces a clean list without empties, making the parsing logic clean.
Java StringTokenizer Class in Groovy
Groovy’s tokenize() method is a convenience wrapper around Java’s java.util.StringTokenizer class. But you can also use StringTokenizer directly when you need extra features, like keeping the delimiters in the output.
Java StringTokenizer in Groovy
// Basic StringTokenizer usage
def st = new StringTokenizer("hello world groovy")
while (st.hasMoreTokens()) {
print "${st.nextToken()} | "
}
println()
// StringTokenizer with returnDelims = true
def expression = "10+20-30*5"
def stWithDelims = new StringTokenizer(expression, "+-*", true)
def tokens = []
while (stWithDelims.hasMoreTokens()) {
tokens << stWithDelims.nextToken()
}
println "Expression tokens: ${tokens}"
// Count tokens
def counter = new StringTokenizer("one two three four five")
println "Token count: ${counter.countTokens()}"
// Convert StringTokenizer to a List easily
def st2 = new StringTokenizer("a,b,c,d", ",")
def list = st2.collect { it }
println "As List: ${list}"
Output
hello | world | groovy |
Expression tokens: [10, +, 20, -, 30, *, 5]
Token count: 5
As List: [a, b, c, d]
The key feature StringTokenizer offers that tokenize() doesn’t is the returnDelims parameter. When set to true, the delimiters themselves appear as tokens in the output. This is useful for expression parsing where you need the operators as well as the operands. Also note that Groovy lets you iterate over a StringTokenizer with collect() because it implements Enumeration.
Edge Cases and Best Practices
Best Practices Summary
DO:
- Use tokenize() when you want to skip empty tokens automatically
- Use tokenize() when you need a List result for chaining Groovy collection methods
- Use tokenize() when you have multiple single-character delimiters
- Use tokenize() for word extraction from natural text
- Handle edge cases like null or empty strings before calling tokenize()
DON’T:
- Use tokenize() when you need to preserve empty fields (use split() instead)
- Use tokenize() when you need regex-based splitting (use split() instead)
- Expect tokenize('ab') to split on the string "ab" — it splits on 'a' OR 'b'
- Use tokenize() for complex CSV with quoted fields — use a proper CSV library
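The last point is worth seeing in action. A sketch of what goes wrong when a quoted CSV field contains the delimiter — tokenize() has no notion of quoting, so it happily splits inside the quotes:

```groovy
// A CSV line where the second field is quoted and contains a comma
def line = 'Alice,"Smith, Jane",Engineering'

// tokenize splits on every comma, including the one inside the quotes
def broken = line.tokenize(',')
assert broken == ['Alice', '"Smith', ' Jane"', 'Engineering']
assert broken.size() == 4   // expected 3 fields, got 4
```

A dedicated CSV parser understands quoting rules (and escaped quotes, embedded newlines, and so on), which is why the rule of thumb above says to reach for one as soon as your data can contain quoted fields.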
Edge Cases to Watch
Edge Cases
// Empty string
println "Empty: ${''.tokenize(',')}"
// String of only delimiters
println "Only delims: ${',,,,'.tokenize(',')}"
// No delimiter match
println "No match: ${'hello'.tokenize(',')}"
// Single character string
println "Single char: ${'x'.tokenize(',')}"
// Delimiter same as content
println "Delim in content: ${','.tokenize(',')}"
Output
Empty: []
Only delims: []
No match: [hello]
Single char: [x]
Delim in content: []
All edge cases return empty lists or single-element lists — no exceptions, no nulls. This makes tokenize() very safe for empty or delimiter-only input. One caveat: calling it on a null reference still throws, so use the safe-navigation operator (str?.tokenize(',')) when the string variable itself might be null.
Performance Considerations
Both tokenize() and split() are fast enough for everyday use. But if you’re processing millions of lines, here are a few things worth knowing:
- tokenize() uses StringTokenizer internally, which does not compile a regex. This makes it slightly faster than split() for simple delimiters.
- split() compiles a regex pattern, which has overhead. For repeated splits with the same pattern, consider precompiling the pattern.
- tokenize() creates a List (heap-allocated), while split() creates a String[] (also heap-allocated). The difference in memory is negligible.
- For extremely large strings, consider using StringTokenizer directly with a while loop to avoid creating the full list in memory.
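That last technique can be sketched in a few lines — instead of collecting every token into a list, you process each one as it is produced (the counting here stands in for whatever per-token work you need):

```groovy
def bigInput = 'a,b,c,d,e,' * 1000

// Stream tokens one at a time; no intermediate List is ever built
def st = new StringTokenizer(bigInput, ',')
int count = 0
while (st.hasMoreTokens()) {
    def token = st.nextToken()
    count++   // replace with real per-token processing
}
assert count == 5000
```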
Performance Comparison
def data = "a,b,c,d,e,f,g,h,i,j," * 1000
def start1 = System.nanoTime()
def r1 = data.tokenize(',')
def time1 = (System.nanoTime() - start1) / 1_000_000
def start2 = System.nanoTime()
def r2 = data.split(',')
def time2 = (System.nanoTime() - start2) / 1_000_000
println "tokenize(): ${r1.size()} tokens in ${time1}ms"
println "split(): ${r2.size()} parts in ${time2}ms"
Output
tokenize(): 10000 tokens in ~3ms
split(): 10000 parts in ~5ms
The difference is small for most applications. Choose based on the behavior you need (empty token handling, return type) rather than raw speed.
Common Pitfalls
Pitfall 1: Expecting Multi-Character Delimiter
Multi-Character Delimiter Pitfall
def data = "oneANDtwoANDthree"
// WRONG: This splits on 'A', 'N', and 'D' individually!
println "tokenize('AND'): ${data.tokenize('AND')}"
// CORRECT: Use split() for multi-character delimiters
println "split('AND'): ${data.split('AND') as List}"
Output
tokenize('AND'): [one, two, three]
split('AND'): [one, two, three]
Wait, they look the same? In this case, yes — because the letters A, N, and D only appear in “AND” separators. But consider this:
Multi-Character Delimiter – Broken Case
def data2 = "DanANDNancyANDAndy"
println "tokenize('AND'): ${data2.tokenize('AND')}"
println "split('AND'): ${data2.split('AND') as List}"
Output
tokenize('AND'): [an, ancy, ndy]
split('AND'): [Dan, Nancy, Andy]
Now the difference is clear. tokenize('AND') split on every occurrence of A, N, or D — butchering the names. split('AND') correctly treated “AND” as a single delimiter pattern. This is the number one mistake developers make with tokenize().
Pitfall 2: Lost Empty Fields in CSV
Lost Empty Fields
// CSV where some fields are intentionally empty
def csvLine = "Alice,,Engineering,,Senior"
def tokenized = csvLine.tokenize(',')
println "tokenize: ${tokenized}" // Missing fields!
println "Field count: ${tokenized.size()}" // Expected 5, got 3
def split = csvLine.split(',', -1) as List
println "split: ${split}" // Preserves empty fields
println "Field count: ${split.size()}" // Correct: 5
Output
tokenize: [Alice, Engineering, Senior]
Field count: 3
split: [Alice, , Engineering, , Senior]
Field count: 5
If your data uses empty fields to represent null or missing values, tokenize() will silently eat them. Use split() when column positions matter.
Conclusion
We’ve covered Groovy tokenize from every angle — basic whitespace splitting, single and multiple delimiters, the critical differences from split(), and real-world parsing scenarios including CSV data, log lines, config files, and command-line arguments. We also looked at Java’s StringTokenizer class for when you need features like returning delimiters as tokens.
The bottom line: use tokenize() when you want a clean list of non-empty tokens split on character delimiters. Use split() when you need regex patterns, multi-character delimiters, or when empty fields matter. And for more on split(), head over to our next post: Groovy Split String Examples.
Summary
- tokenize() returns a List<String>, making it ideal for chaining Groovy collection methods
- Empty tokens are automatically discarded — no extra filtering needed
- Each character in the delimiter string is treated as a separate delimiter, not as a pattern
- Use split() instead when you need regex, multi-character delimiters, or preserved empty fields
- Java's StringTokenizer gives you extra control like returning delimiters as tokens
If you also work with build tools, CI/CD pipelines, or cloud CLIs, check out Command Playground to practice 105+ CLI tools directly in your browser — no install needed.
Up next: Groovy Split String – Regex-Based String Splitting
Frequently Asked Questions
What is the difference between tokenize() and split() in Groovy?
tokenize() returns a List, removes empty tokens, and treats each character in the delimiter as a separate splitter. split() returns a String[], preserves empty tokens, and uses regex patterns. Use tokenize() for clean token extraction and split() when you need regex or must preserve empty fields.
Does Groovy tokenize() support regular expressions?
No. tokenize() does not support regex. It uses Java’s StringTokenizer internally, which only works with literal characters. If you need regex-based splitting, use split() instead. For example, split('\\s+') splits on one or more whitespace characters using regex.
What does tokenize() return when given an empty string?
tokenize() returns an empty ArrayList when called on an empty string. It never returns null and never throws an exception for empty input, making it safe to use without null checks. For example, ''.tokenize(',') returns [].
How do I tokenize a string on multiple delimiters in Groovy?
Pass all delimiter characters as a single string to tokenize(). For example, 'a,b;c:d'.tokenize(',;:') splits on commas, semicolons, and colons, returning [a, b, c, d]. Each character in the parameter string is treated as an independent delimiter.
Can I use tokenize() with a multi-character delimiter like AND or ::?
No, tokenize() treats each character independently. If you call tokenize('AND'), it splits on A, N, and D separately, not on the string AND. For multi-character delimiters, use split('AND') or split('::') instead, which treats the parameter as a regex pattern.
Related Posts
Previous in Series: Groovy Compare Strings – equals, compareTo, and More
Next in Series: Groovy Split String – Regex-Based String Splitting
Related Topics You Might Like:
- Groovy String Tutorial – The Complete Guide
- Groovy Substring – Extract Parts of a String
- Groovy Regular Expressions – Pattern Matching
This post is part of the Groovy & Grails Cookbook series on TechnoScripts.com