General Refine Expression Language (GREL)
Intro to GREL GREL is the scripting language that is supported by default in OpenRefine. GREL is a very flexible way of working with your data when the operation you want to perform isn’t supported out of the box.
Using GREL, you can: Transform Cells: Modify the value of cells Add Columns: Add new columns to your dataset using existing values Custom Facets: Define custom facets to explore your data in new ways
When creating a GREL, the expression is evaluated on every single row of the column which you’re acting on. Think of this as though you’re going down the column and repeating the same steps for every value you come across. Variables When editing a GREL expression, a number of variables are available to you. A variable is a named reference to a data value. Variables allow us to describe our data manipulations in a way that can be repeated across multiple values. The most common variables are:
Variable Description value The value in the cell that is currently being evaluated. row The current row of data that is being evaluated. This is a complex variable, which actually contains a number of distinct pieces of information including: row.index: The zero-based index of the current row row.cells: The cells contained in this row row.columnNames: The column names in the dataset row.starred: Indicates if the row is starred row.flagged: Indicates if the row is flagged
cells All of the cells that compose the current row of data. The cells object has fields associated with every column in the dataset. Individual field values can be accessed using one of two notations:
cells.columnname.value
Or
cells[“columnname”].value
Variable types Variable are often described by their type, which describes the kind of value that is contained in a variable. Some common variable types are:
Type Description Examples Boolean A value which represents the logical concepts of truth and falseness true, false Number A numerical value 1, -1, 2.223 String A textual value “John Smith”, “Data Wrangling” Array A collection of values [0, 1, 2, 3], [“Data”, “Is”, “Fun”] Date A date or time Nov-09, 1/4/2012 13:30:00
Functions Functions allow you to perform some operation on a value or set of values which are called “arguments”. When using a function a value is produced, which is the result of the operation which has been applied to the arguments. There are many different functions which take different types of arguments and produce some sort of useful result. To use a function you “call” it using the following syntax:
functionName(arg0, arg1, …)
Or
arg0.functionName(arg1, …)
As an example, consider the toLowercase function, which takes some textual value, which is called a string and returns that same string with all characters converted to lower case. You can apply or call the toLowercase function like so:
value.toLowercase()
Or
toLowercase(value)
OpenRefine provides a wide variety of functions for working with values. You can find a list of GREL functions on the OpenRefine Wiki. GREL Tasks These GREL formulas can be used in a number of places such as: Edit Column > Add column based on this column... Edit Cells > Transform… Facet > Custom text facet… Facet > Custom numeric facet... Replacing characters, words, etc value.replace(“Hello”, “Hi there”)
Running this GREL expression on the value string “Hello, Tom!” would produce “Hi there, Tom!”.
The replace function will also work with something called Regular Expressions. Regular expressions allow you to define patterns to match when replacing your values. For example:
value.replace(/[abc]/, ‘d’)
Running the above code on the value string “abcd” would produce the value “dddd”.
Removing Characters or Words Removing characters or words is a special case of string replacement, where the second argument of the replace function is empty.
value.replace(‘remove_me’, ‘’)
Changing Date Format For example to change a date in the format of M/D/Y to D/M/Y:
value.toDate("M/d/y").toString("d/M/y")
The arguments to the toDate() and the toString() functions follow the Simple Date Format specification, described here. Date Manipulation Date objects make it easy to alter date values using the inc function:
value.inc(15, ‘month’) value.inc(-3, ‘week’) Pad a Numerical Value with Leading Zeroes The following code will pad a number with up to four leading zeroes:
slice("0000", 0, 4-value.length()) + value
This is a bit complicated, but it’s not so bad once you break it down. A key thing to note is that in this code we’re treating the value like a string, even though it is numeric data. When used on strings, the ‘+’ operator means “join these strings together”. The slice function returns a subset of the characters in a string. It takes a string and two additional parameters, a starting position and a length of characters to include. Convert Numerical Units For example, to convert US pouds to kilograms:
value.toNumber()/2.20462 Round to a Number Decimal Points round(value*100) / 100.0
Note that the decimal point in the divisor is significant in that it forces the division operation to return a floating point value. Remove HTML entities (& > < ) value.unescape(“html”) Fix Odd Characters Often, data that is collected from multiple sources can contain strange characters. This issue is caused by interpreting the data using the wrong character encoding. To address this, OpenRefine allows you to change the character set used to interpret the value.
value.reinterpret(“utf-8”) value.reinterpret(“latin-1”)
These are the most common, but there are a number of options to choose from. Reverse Name Order Move from “Smith, John” to “John Smith”: value.split(", ")[1] + " " + value.split(", ")[0]
Move from“John Smith” to “Smith, John”: value.split(" ")[1] + ", " + value.split(" ")[0]
Parsing XML / HTML Content HTML Extract the content of a element with an ID: value.parseHtml().select("#content").toString()
Get the value of an element attribute: value.parseHtml().select(“a#home”).htmlAttr(“href”).toString() XML Assume that we have an XML value that looks something like:
XML - Extract value of an element: value.parseHTML.select(‘imdb|Movie’)[0].htmlText()
Loop over a collection of elements and extract a value from each: forEach(value.parseHtml().select('imdb|Movie imdb|director')[0].htmlText(), n, n.toTitlecase()).join(‘, ’) Parsing JSON content Assume that you have the following JSON structure:
{ "movie": { "title": "Gone With the Wind", "year": 1939, "directors": [ "Victor Flemming", "George Cukor", "Sam Wood" ], "cast": [ "Thomass Mitchell", "Barbara O'Neil", "Vivien Leigh" ], "budget": 3977000, "gross": 198655278 } }
To extract a value: value.parseJson().movie.title
Advanced Operations Create Custom Facets Using the H-1B data, let’s consider approval terms: Add a custom facet by clicking the dropdown for a column and selecting facet > Custom facet diff(cells['Certified_End_Date'].value.toDate(), cells['Certified_Begin_Date'].value.toDate(), 'months') Add a Column by Fetching URLs For example, let’s Geocode the H1B data. Add a column to create a field that contains the full address, name the column “Full_Address”:
join(forEach(filter([cells['Address_1'], cells['Address_2'], cells['City'], cells['State'], cells['Zip_Code']], v, isNonBlank(v)), v, v.value), " ")
Next add another column, but this time select the dropdown on the Full_Address column and select “Add column by fetching URLs…”
Call this column Google_Geocode_Result, set the Throttle Delay to 200 milliseconds, and provide the following GREL code to build the URL for the request:
"https://maps.googleapis.com/maps/api/geocode/json?address=" + value.escape(‘url’)
Now go make a coffee or tea while you wait for the process to finish. When it’s done, you should see a new column containing the JSON formatted results of your API call. If you want to get a better idea of the contents of this response, try pasting the response into a service like https://jsonformatter.curiousconcept.com/ which will help you to better understand the structure of the response.
Now that we’ve got our JSON responses, let’s pull the latitude and longitude values out of this response and into their own columns.
For latitude: value.parseJson().results[0].geometry.location.lat
For longitude: value.parseJson().results[0].geometry.location.lng
There is a wide variety of APIs that can be used in exactly this same way. Here’s a list of just a few: Wikipedia Google books Twitter - Count number of times a URL has been used iTunes search data Spotify search data Is the ISS overhead? Flikr Open Movie Database (OMDB) Country data Google maps geocoding Reconcile Values Against a Controlled Vocabulary Additional Notes For large datasets, you might need to reconfigure the application so it will be able to use more memory. If your operations are taking a very long time to finish, or if you are receiving “OutOfMemoryErrors” then it’s a good sign that you need to increase the memory allocations for OpenRefine.
Exercises Importing data Resolving inconsistencies Clustering and merging Working will mixed data type cells (e.g.12 million) Splitting columns Cleaning dates Exporting data