Importing a dataset

Basic Operations

Several basic operations can be completed using menu driven mass editing and transformation features. Types of operations that can be performed using these features include:

  • Renaming and reordering columns
  • Splitting multi-valued cells into several columns
  • Fixing spelling errors (e.g. color → colour) or coding values (e.g. 1=‘Yes’, 2=‘No’, 3=‘Maybe’, -9999=‘Valid skip’)
  • Merging similar values into one value (e.g. ‘UK’, ‘U.K.’, ‘United Kingdom’ → United Kingdom)
  • Removing rows

All basic operations can be accessed using the drop-down options menu at the top of each column.

Editing columns

Column operations include renaming, reordering, creating a new column based on an existing column, and splitting a column into multiple columns.

Splitting a column into several columns

If a column is composed of multi-valued cells, this option can be used to split the cell into multiple columns.

For example if you have a variable that represents Locations which is formatted ‘Location1, Location2, Location3’, the column Locations could be split into three columns based on the comma delimiter.

Locations Location 1 Location 2 Location 3
School, Convenience Store, Community Centre School Convenience Store Community Centre

Click on the arrow beside the column name to open the drop-down menu Select Edit Column to open the sidebar menu listing all the operations that can be performed on the column Select Split into several columns In the Split column [ColumnName] into several columns window, you can select how the column should be split By separator (e.g. comma, space) By field length Check off by separator Enter the separator value (e.g. comma ‘,’) Enter the number of columns the current column should be split into Click OK Renaming a column Click on the arrow beside the column name to open the drop-down menu Select Edit Column Select Rename this column from the side-bar menu In the Enter new column name window, enter the new name and click OK Reordering column placement Click on the arrow beside the column name to open the drop-down menu Select Edit Column There are four different options for changing the placement of a column within the data Move column to beginning Move column to end Move column left Move column right Sorting Data rows can be sorted based on a specific column. Click on the arrow beside the column name to open the drop-down menu Select Sort The Sort by [ColumnName] window will open Sorting options are dependent on the data type assigned to the column Select how you would like to sort (i.e. for text ‘a-z’ or ‘z-a’; for numbers ‘smallest first’ or ‘largest first’) Re-order the sort position of valid values, errors and blanks if desired by dragging and dropping in the re-order window Click OK To remove the sort, click on the column menu, select Sort and then Remove sort Faceting Faceting can be used to explore the data to determine the range and frequency of values found within a variable, to filter the data based on a value, to locate any inconsistencies or errors in the values, and to perform basic editing of the values.

There are four types of facets: text, numeric, timeline, and scatterplot. Text and numeric are the most commonly used types of facets. Creating a facet Click on the arrow beside the column name to open the drop-down menu Select Facet Select Text facet from the side-bar menu The facet is created in the left bar Filtering by facet value Click on a value in the facet Only matching records are now displayed Editing facet values You may need to perform edits on a value such as correcting a spelling error or coding string values into numeric values. Place your cursor on the value you wish to edit The options Edit and Include will appear to the right of the value Click Edit Enter a new value Click Apply Clustering The cluster operation can be used to find values within a variable which are similar and group them together. You can then select a group and merge the values together into one value.

For example: ‘Tim Hortons’ and ‘tim hortons’ would be grouped ‘February’, ‘Fabruary’ and ‘Febuary’ would be grouped ‘USA’, ‘U.S.A.’ and ‘U.S.’ would be grouped ‘’Coté, Mary’ and ‘Cote, Mary’ would be grouped

To cluster: Click Edit Cells Click Cluster and edit The Cluster & Edit column “ColumnName’ window opens Clusters are automatically created based on an algorithm. Some experimentation with the Method and Keying Function comparison algorithms may be required to get the best solution. Select which cluster(s) to merge by checking off the box in the Merge column Enter the correct value for the cluster in the New Cell Value text box Click Merge Selected & Close Create a cluster using a facet You can also cluster based on a facet. Create a facet In the facet window in the left bar click the Cluster button. The Cluster & Edit column “ColumnName’ window opens. Follow the same steps as described in the Clustering section above. Editing rows You may be required to remove rows from the data such as in the case of duplicate records or blank rows. Removing rows You may need to remove a particular row or set of rows from the data. For instance, extra spaces within cells or carriage returns at the end of a row can lead to the creation of a blank row that needs to be removed.

To select and remove rows, use the following steps: Star (or flag) a row by clicking on the star icon in the All column for the row Create a facet based on starred rows by clicking on the drop-down menu for the All column, select Facet and Facet by star In the facet window click on the value True to show only those rows that meet the criterion. In the drop-down menu for the All column, select Edit rows and Remove all matching rows Removing duplicate rows Sometimes there are duplicate entries in a data file that will need to be removed. This process involves first identifying duplicate rows, blanking out the duplicate row and then removing the blank rows from the data. Note that to identify duplicate rows, the data should be sorted based on a variable that is composed of unique values such as a ‘Unique ID’ or ‘Record ID’ variable.

Click on the arrow beside the ‘Unique ID’ column name to open the drop-down menu Select Sort and select to sort by numbers A new Sort menu opens at the top of the OpenRefine window Click Sort to open the drop-down menu Select Reorder rows permanently Identical rows are now adjacent to each other For the same column, open the drop-down menu and select Edit cells Select Blank down from the side-bar menu All cells in the ‘Unique ID’ column that have the same value as the the row above will be blanked, marking it as a duplicate For the same column, open the drop-down menu and select Facet Select Customized facet Select Facet by blank In the facet, select True Open the drop-down menu for the All column Select Edit rows Select Remove all matching rows This eliminates all rows that have a blank cell in the ‘Unique ID’ column Cell Transformations Several common transformations can be completed using menu options, without having to write script. Transformations are applied to all cells within a column. Select Edit cells Select Common transforms Select the transformation you wish to perform

Transformation Explanation Trim leading and trailing whitespace Remove any extra whitespace from the beginning or end of a value Collapse consecutive whitespace Remove any extra whitespace from within the value To titlecase Converts the first letter of each word in a string to uppercase To uppercase Converts entire string to uppercase To lowercase Convents entire string text to lowercase To number Converts cell data type to number (if the column is composed of numerical values) To date Converts cell data type to date To text Converts cell data type to string (text) (if the column is composed of string values)

Reversing order of string content (e.g. names) Transposing and Reshaping Tracking data cleaning operations All operations performed on the data are tracked in OpenRefine. A copy of the operation history can be extracted and saved as a text file. Click on Undo/Redo to view the operation log Click on Extract In the Extract Operation History window, select the log (which is presented in JSON) and copy it (CTRL-C) Paste the copied log into Notepad (or any text editor) and save as a text file with an explanatory file name Exporting the cleaned data The cleaned data can be exported out of OpenRefine in a variety of formats. Export project The entire project is exported using this option including the operation history, data file, and metadata for the exported data file. Click Export in the top right hand corner to open the drop-down menu of options The entire project can be exported as a ‘tar.gz’ zipped and compressed archive by selecting Export project Export file This option can be used to export only the data file. Click Export in the top right hand corner to open the drop-down menu of options Select a file format. The data can be exported in a variety of file formats including: Tab-separated values (.tab) Comma-separated value (.csv) Excel (.xls) or Excel 2007+ (.xlsx) HTML table Template (JSON) Export a subset of the data Use this option to export a portion of the entire dataset. Select the columns and the order of the columns you wish to export. Click Export in the top right hand corner to open the drop-down menu of options Select Custom tabular exporter Use the options in the Content tab of the Custom tabular exporter window to Select the variables to export by checking off the box next to the variable name Reorder the variables by clicking on a variable name and dragging it to a new position in the list Set output characteristics In the Download tab of the Custom tabular exporter, select the file format for download and click the Download button In the Upload tab of the Custom tabular exporter, you can chose to upload the data to: A new Google spreadsheet A new Google Fusion table In the Option Code tab of the Custom tabular exporter, you can extract text which describes the options you set in the other tabs. Closing OpenRefine To ensure no data it lost or corrupted, the OpenRefine application must be properly shut down. Close the OpenRefine application in your browser Close the command window Windows - In the command window press CTRL-C OSX - Click on the OpenRefine application icon in the dock and invoke its Quit command

results matching ""

    No results matching ""