Tech Tip: Pattern Matching with Regular Expressions
Quick Fields and Workflow let you use regular expressions to find information buried in text, so you can quickly locate and use the data you need. In Workflow, the information found with a regular expression is stored in a token that you can insert into fields, annotations or anywhere else tokens are permitted. In Quick Fields, you can use regular expressions to edit tokens, identify a document as belonging to a document class, automatically annotate text that matches the pattern or substitute different text.
Example: You scan an application and use regular expressions to pick out the social security number from the page of text. Then you place the social security number in a field to store it as part of the document’s metadata.
A regular expression is a series of special symbols used to identify specific patterns of text. Regular expressions let you extract information from tokens in Workflow, such as finding information in text that the Retrieve Text activity returns. In addition to extracting information from tokens in Quick Fields, you can also use regular expressions with the Auto-Annotation, Pattern Matching, Substitution and Text Identification processes. Anywhere regular expressions can be used in Workflow and Quick Fields, there will be a pattern matching icon that you can click for a list of regular expression symbols and a dialog to test your regular expression.
Here’s what the icon looks like:
Here are some examples of common uses for regular expressions:
[section-header] Example 1: Extracting a Social Security Number [/section-header]
You want to fill in the Social Security Number field with a social security number listed on a scanned form. Quick Fields (with the OCR process) or Workflow (with the Retrieve Text activity) produces a token that will be replaced by the entire text of a document. You can find the social security number in the block of document text by looking for a pattern of characters with the regular expression: (\d\d\d-\d\d-\d\d\d\d). Here, \d matches any single digit, so the pattern describes three digits, a hyphen, two digits, a hyphen and four digits.
Note: The parentheses determine which information is extracted from the text. The other characters determine the pattern that will be looked for. For example, \d\d\d-\d\d-(\d\d\d\d) will find the social security number and return only its last four digits.
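To see how the parentheses change what is returned, the following minimal sketch uses Python's re module as a stand-in for the pattern matching engine in Quick Fields and Workflow. The sample text, the variable names and the extract_group helper are hypothetical and exist only for illustration; the Laserfiche engine may differ in details.

```python
import re

# Hypothetical OCR output standing in for the document text token.
ocr_text = "Applicant: Jane Doe\nSSN: 123-45-6789\nDate: 01/02/2013"

def extract_group(pattern, text):
    """Return the parenthesized part of the first match, or None if no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Parentheses around the whole pattern return the full number.
full_ssn = extract_group(r"(\d\d\d-\d\d-\d\d\d\d)", ocr_text)

# Parentheses around the last four digits return only those digits,
# although the whole pattern must still match.
last_four = extract_group(r"\d\d\d-\d\d-(\d\d\d\d)", ocr_text)

print(full_ssn)   # 123-45-6789
print(last_four)  # 6789
```

Note that the date in the sample text is not picked up, because it does not follow the three-two-four digit pattern separated by hyphens.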
[section-header] Example 2: Removing Spaces Before and After Numbers [/section-header]
A Zone OCR process captures the total from an invoice and places that number in the Invoice Total field. Because the constraints on that field allow only numbers and not whitespace, any whitespace that Zone OCR accidentally captures before or after the number must be removed. You can clean up these extra spaces by applying the regular expression (\d+) to the Zone OCR token when it is assigned to the Invoice Total field.
Tip: To remove extra spaces in a list field, use the regular expression (\S+). In regular expressions, “\S” refers to non-whitespace characters, and the “+” symbol indicates you are looking for one or more of those characters.
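The trimming effect of (\d+) and (\S+) can be sketched in Python, again assuming that Python's re module behaves like the Laserfiche engine for these simple patterns. The first_group helper and the sample values are hypothetical.

```python
import re

def first_group(pattern, text):
    """Return the parenthesized part of the first match, or None if no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Zone OCR value with stray whitespace around the number.
invoice_total = first_group(r"(\d+)", "  1250 ")

# List field value with stray whitespace around a single word.
list_value = first_group(r"(\S+)", "\t Approved  ")

print(repr(invoice_total))  # '1250'
print(repr(list_value))     # 'Approved'
```

Two limits are worth keeping in mind. In this sketch, (\d+) stops at the first non-digit, so a value such as "1250.75" would yield only "1250", and (\S+) stops at the first space, so it returns only the first word of a multi-word value.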
[section-header] Example 3: Removing Spaces Between Numbers [/section-header]
Your Zone OCR process has a zone around the Product ID number you want to extract from a form. However, this process sometimes adds blank spaces in the middle of the string of ID numbers. You can remove those spaces by using the Pattern Matching process in Quick Fields to apply the regular expression \d+ with the option “All matches (combined with no spaces).” Each run of digits counts as one match, and the matches are then joined together into a single value.
Note: This screenshot is a preview of Quick Fields 8.3. The details and appearances of certain elements may change between now and the final release.
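The “All matches (combined with no spaces)” behavior from Example 3 can be approximated outside Quick Fields by finding every match and joining the results. The sketch below uses Python's re module; the combine_digit_runs function and the sample Product ID are hypothetical, and the sketch assumes the option simply concatenates matches in the order they appear.

```python
import re

def combine_digit_runs(text):
    """Find every run of digits and join the runs with nothing between them."""
    return "".join(re.findall(r"\d+", text))

# Zone OCR value with blank spaces inserted inside the Product ID.
product_id = combine_digit_runs("48 213 07")

print(product_id)  # 4821307
```

Because only digits are matched, any other stray characters between the digit runs, such as letters or punctuation, are dropped along with the spaces.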