
TigerGraph Docs: GSQL Language Reference Part 1 - Defining Graphs and Loading Data v1.0

Version 1.0

Copyright © 2015-2017 TigerGraph. All Rights Reserved.
For technical support on this topic, contact support@tigergraph.com with a subject line beginning with "GSQL"




Part 1 includes system basics, defining a graph schema, and loading data.  Part 2 includes querying.

Introduction


The GSQL™ software program is TigerGraph's comprehensive environment for designing graph schemas, loading and managing data to build a graph, and querying the graph to perform data analysis.  In short, TigerGraph users do most of their work via the GSQL program. This document presents the syntax and features of the GSQL language.

This document is a reference manual, not a tutorial. The user should read GSQL Tutorial and Demo Examples prior to using this document. There are also User Guides or Tutorials for particular aspects of the GSQL environment. This document is best used when the reader already has some basic familiarity with running GSQL and then wants a more detailed understanding of a particular topic.

This document is Part 1 of the GSQL Language Reference, which describes system basics, defining a graph schema, and loading data.  Part 2 describes querying.

A handy GSQL Reference Card lists the syntax for the most commonly used GSQL commands for graph definition and data loading. Look for the reference card on our User Document home page.

GSQL Workflow

The GSQL workflow has four major steps:

  1. Define a graph schema or model.
  2. Load data into the TigerGraph system.
  3. Create and install queries.
  4. Run queries.

After initial data and queries have been installed, the user can run queries or go back to load more data and create additional queries. This document provides specifications and details for steps 1 and 2. The Appendix contains flowcharts which provide a visual understanding of the required and allowed sequence of commands to proceed through the workflow.

Language Basics

  • Identifiers
    Identifiers are user-defined names. An identifier consists of letters, digits, and the underscore.  Identifiers may not begin with a digit.  Identifiers are case sensitive.

  • Keywords and Reserved Words
    Keywords are words with a predefined semantic meaning in the language. Keywords are not case sensitive. Reserved words are set aside for use by the language, either now or in the future. Reserved words may not be reused as user-defined identifiers.  In most cases, a keyword is also a reserved word.  For example, VERTEX is a keyword.  It is also a reserved word, so VERTEX may not be used as an identifier.

  • Statements
    Each line corresponds to one statement (except in multi-line mode). Usually, there is no punctuation at the end of a top-level statement. Some statements, such as CREATE LOADING JOB, are block statements which enclose a set of statements within themselves. Some punctuation may be needed to separate the statements within a block.

  • Comments
    Within a command file, comments are text that is ignored by the language interpreter.
    Single-line comments begin with either # or //. A comment may be on the same line as interpreted code. Text to the left of the comment marker is interpreted, and text to the right of the marker is ignored.
    Multi-line comment blocks begin with /* and end with */. A short example follows this list.
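
The short sketch below illustrates both comment styles; the Person vertex type is a hypothetical placeholder, not part of the examples used elsewhere in this document.

Example: GSQL comments
# This entire line is ignored
// This entire line is also ignored
CREATE VERTEX Person (PRIMARY_ID id UINT, name STRING)   # text after the marker is ignored
/* A multi-line comment block:
   everything between the markers is ignored */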

Documentation Notation

In the documentation, code examples are either template code (formally describing the syntax of part of the language) or actual code examples. Actual code examples show code that can be run exactly as shown, e.g., copy-and-paste. Template code, on the other hand, cannot be run exactly as shown because it uses placeholder names and additional symbols to explain the syntax. It should be clear from context whether an example is template code or actual code.

This guide uses conventional notation for software documentation.  In particular, note the following:

  • Shell prompts
    Most of the examples in this document take place within the GSQL shell.  When clarity is needed, the GSQL shell prompt is represented by a greater-than arrow: >
    When a command is to be issued from the operating system, outside of the GSQL shell, the prompt is the following: os$

  • Keywords
    In the GSQL language, keywords are not case sensitive, but user-defined identifiers are case sensitive. In code examples, keywords are in ALL CAPS to make clear the distinction between keywords and user-defined identifiers.

    In a very few cases, some option keywords are case-sensitive. For example, in the command to delete all data from the graph store,
    clear graph store -HARD

    the option -HARD must be in all capital letters.



  • Placeholder identifiers and values
    In template code, any token that is not a keyword, a literal value, or punctuation is a placeholder identifier or a placeholder value.
    Example:

    CREATE UNDIRECTED EDGE edge_type_name (FROM vertex_type_name1, TO vertex_type_name2,
    attribute_name type [DEFAULT default_value],...)

    The user-defined identifiers are edge_type_name, vertex_type_name1, vertex_type_name2, attribute_name, and default_value. As explained in the CREATE VERTEX section, type is one of the attribute data types.
  • Quotation Marks
    When quotation marks are shown, they are to be typed as shown (unless stated otherwise). A placeholder for a string value will not have quotation marks in the template code, but if a template is converted to actual code, quotation marks should be used around string values.

  • Choices
    The vertical bar | separates choices when the syntax requires the user to pick one value from a set. Example: either the keyword VERTEX or EDGE is to be used. Also, note the inclusion of quotation marks.

    Template:
    LOAD " file_path " TO VERTEX|EDGE object_type_name VALUES (id_expr, attr_expr1 , attr_expr2 ,...)

    Possible actual values:
    LOAD "data/users.csv" TO VERTEX user VALUES ($0, $1, $2)

  • Optional content
    Square brackets are used to enclose a portion that is optional.  Options can be nested. Square brackets themselves are rarely used as part of the GSQL language itself.
    Example: In the RUN JOB statement, the -n flag is optional.  If used, -n is to be followed by a value.

    RUN JOB [-n count] job_name

    Sometimes, options are nested, which means that an inner option can only be used if the outer option is used:

    RUN JOB [-n [first_line_num,] last_line_num] job_name

    means that first_line_num may be specified only if last_line_num is also specified. These options provide three possible forms for this statement:
    RUN JOB job_name
    RUN JOB -n last_line_num job_name

    RUN JOB -n first_line_num, last_line_num job_name

  • Repeated zero or more times
    In template code, it is sometimes desirable to show that a term is repeated an arbitrary number of times. For example, a vertex definition contains zero or more user-defined attributes. A loading job contains one or more LOAD statements. In formal template code, if an asterisk (Kleene star) immediately follows option brackets, then the bracketed term can be repeated zero or more times.  For example:

    TO VERTEX|EDGE object_name VALUES (id_expr [, attr_expr]*)

    means that the VALUES list contains one id expression, which may be followed by any number of attribute expressions (possibly none). Each additional attribute expression must be preceded by a comma.

  • Long lines

    For more convenient display, long statements in this guide may sometimes be displayed on multiple lines.  This is for display purposes only; the actual code must be entered as a single line (unless the multiline mode is used).  When necessary, the examples may show a shell prompt before the start of a statement, to clearly mark where each statement begins.
    Example: A SELECT query is grammatically a single statement, so GSQL requires that it be entered as a single line.

    Long statement displayed as one line
    SELECT *|attribute_name FROM vertex_type_name [WHERE conditions] [ORDER BY attribute1,attribute2,...] [LIMIT k]

    However, the statement is easier to read and to understand when displayed one clause per line:

    Long statement displayed on multiple lines but with only one prompt
    > SELECT *|attribute_name 
        FROM vertex_type_name
        [WHERE conditions]
        [ORDER BY attribute1,attribute2,...]
        [LIMIT k]

System and Language Basics


Running GSQL

To run GSQL and enter the interactive command shell, type gsql from an operating system shell prompt:
os$ gsql
GSQL>

To exit the GSQL shell, type either exit or quit at the GSQL prompt:
GSQL> EXIT
or
GSQL> QUIT

Multiple Shell Sessions

Starting with v0.8 of the TigerGraph platform, it is possible to run multiple shell sessions of GSQL at the same time.  This feature can be used to have multiple clients (human or machine) using the system to perform concurrent operations. A basic locking scheme is used to maintain isolation and consistency.

Multi-line Mode - BEGIN, END, ABORT

In interactive mode, the default behavior is to treat each line as one statement; the GSQL interpreter will activate as soon as the End-Of-Line character is entered.

Multi-line mode allows the user to enter several lines of text without triggering immediate execution.  This is useful when a statement is very long and the user would like to split it into multiple lines. It is also useful when defining a JOB, because jobs typically contain multiple statements.

To enter multi-line mode, use the command BEGIN.  The end-of-line character is now disabled from triggering execution.  The shell remains in multi-line mode until the command END is entered.  The END command also triggers the execution of the multi-line block.  In the example below, BEGIN and END are used to allow the SELECT statement to be split into several lines:

Example: BEGIN and END defining a multi-line block
BEGIN
SELECT member_id, last_name, first_name, date_joined, status
    FROM Member
    WHERE age >= 21
    ORDER BY last_name, first_name
END

Alternately, the ABORT command exits multi-line mode and discards the multi-line block.
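
Continuing the example above, the brief sketch below abandons a partially entered statement; nothing is executed, and the shell returns to single-line mode.

Example: ABORT discarding a multi-line block
BEGIN
SELECT member_id, last_name
    FROM Member
ABORT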

Command Files

A command file is a text file containing a series of GSQL statements.  Blank lines and comments are ignored. By convention, GSQL command files end with the suffix .gsql, but this is not a requirement. Command files are automatically treated as multi-line mode, so BEGIN and END statements are not needed. Command files may be run either from within the GSQL shell by prefixing the filename with an @ symbol:
GSQL> @file.gsql

or from the operating system (i.e., a Linux shell) by giving the filename as the argument after gsql:
os$ gsql file.gsql
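
As a sketch, a command file might contain a small schema definition like the one below; the file name and the Person/knows/social types are hypothetical placeholders. The file could then be run with either method shown above.

Example: contents of a command file (schema.gsql)
# schema.gsql - a minimal schema definition
CREATE VERTEX Person (PRIMARY_ID id UINT, name STRING)
CREATE UNDIRECTED EDGE knows (FROM Person, TO Person)
CREATE GRAPH social (*)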

Help and Information

The help command displays a summary of the available GSQL commands:

GSQL> HELP [BASIC|QUERY]

Note that the HELP command has options for showing more details about certain categories of commands.

The ls command displays the catalog: all the vertex types, edge types, graphs, queries, jobs, and session parameters which have been defined by the user.

--reset option

The --reset option will clear the entire graph data store and erase all related definitions (graph schema, loading jobs, and queries) from the Dictionary.  The data deletion cannot be undone; use with extreme caution. The REST++, GPE, and GSE modules will be turned off.

os$ gsql --reset
Resetting the catalog.
Shutdown restpp gse gpe ...
Graph store /home/tigergraph/tigergraph/gstore/0/ has been cleared!
The catalog was reset and the graph store was cleared.


Summary

The table below summarizes the basic system commands introduced so far.

Command Description

HELP [BASIC|QUERY]
Display the help menu for all or a subset of the commands.

LS
Display the catalog, which records all the vertex types, edge types, graphs, queries, jobs, and session parameters that have been defined.

BEGIN
Enter multi-line edit mode (only for console mode within the shell).

END
Finish multi-line edit mode and execute the multi-line block.

ABORT
Abort multi-line edit mode and discard the multi-line block.

@file.gsql
Run the gsql statements in the command file file.gsql from within the GSQL shell.

os$ gsql file.gsql
Run the gsql statements in the command file file.gsql from an operating system shell.

os$ gsql --reset
Clear the graph store and erase the dictionary.

Session Parameters

Session parameters are built-in system variables whose values are valid during the current session; their values do not endure after the session ends. In interactive command mode, a session starts and ends when entering and exiting interactive mode, respectively. When running a command file, the session lasts during the execution of the command file.

Use the SET command to set the value of a session parameter:

SET session_parameter = value


Session Parameter and Meaning/Usage

sys.data_root
The value should be a string, representing the absolute or relative path to the folder where data files are stored. After the parameter has been set, a loading statement can reference this parameter with $sys.data_root.

gsql_src_dir
The value should be a string, representing the absolute or relative path to the root folder for the gsql system installation. After the parameter has been set, a loading statement can reference this parameter with $gsql_src_dir.
exit_on_error

When this parameter is true (default), if a semantic error occurs while running a GSQL command file, the GSQL shell will terminate. Accepted parameter values: true, false (case insensitive). If the parameter is set to false, then a command file which is syntactically correct will continue running, even if certain runtime errors in individual commands occur. Specifically, this affects these commands:

  • CREATE
  • INSTALL QUERY
  • RUN JOB

Semantic errors include a reference to a nonexistent entity or an improper reuse of an entity.

This session parameter does not affect GSQL interactive mode; GSQL interactive mode does not exit on any error.

This session parameter does not affect syntactic errors: GSQL will always exit on a syntactic error.

Example of exit_on_error = FALSE
# exitOnError.gsql
SET exit_on_error = FALSE
CREATE VERTEX v(PRIMARY_ID id INT, name STRING)
CREATE VERTEX v(PRIMARY_ID id INT, weight FLOAT)   #error 1: can't define VERTEX v twice
CREATE UNDIRECTED EDGE e2 (FROM u, TO v)           #error 2: vertex type u doesn't exist
CREATE UNDIRECTED EDGE e1 (FROM v, TO v)
CREATE GRAPH g(v)                                  #error 3: the graph definition has no edge type
CREATE GRAPH g2(*)
Results
os$ gsql exitOnError.gsql
The vertex type v is created.
Semantic Check Fails: The vertex name v is used by another object! Please use a different name.
failed to create the vertex type v
Semantic Check Fails: FROM or TO vertex type does not exist!
failed to create the edge type e2
The edge type e1 is created.
Semantic Check Fails: There is no edge type specified! Please specify at least one edge type!
The graph g could not be created!
Restarting gse gpe restpp ...
Finish restarting services in 11.955 seconds!
The graph g2 is created.


Attribute Data Types

Each attribute of a vertex or edge has an assigned data type. The following types are currently supported.

Primitive Types

INT
Default value: 0. Valid input format (regex): [-+]?[0-9]+. Range: -2^63 to +2^63 - 1 (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807). 8-byte signed integer.

UINT
Default value: 0. Valid input format (regex): [0-9]+. Range: 0 to 2^64 - 1 (18,446,744,073,709,551,615). 8-byte unsigned integer.

FLOAT
Default value: 0.0. Valid input format (regex): [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?. Range: +/- 3.4 E +/-38, about 7 significant decimal digits. 4-byte single-precision floating point number. Examples: 3.14159, .0065e14, 7E23. See note below.

DOUBLE
Default value: 0.0. Valid input format (regex): same as FLOAT. Range: +/- 1.7 E +/-308, about 15 significant decimal digits. 8-byte double-precision floating point number. Has the same input and output format as FLOAT, but the range and precision are greater. See note below.

BOOL
Default value: false. Valid input: "true", "false" (case insensitive), 1, 0. Boolean true and false, represented within GSQL as true and false, and represented in input and output as 1 and 0.

STRING
Default value: empty string. Valid input format (regex): .*. UTF-8 character string. The string value can optionally be enclosed by single quote marks or double quote marks. Please see the QUOTE parameter in Section "Other Optional LOAD Clauses".

For FLOAT and DOUBLE values, the GSQL Loader supports exponential notation as shown (e.g., 1.25 E-7).

The GSQL Query Language currently only reads values without exponents. It may display output values with exponential notation, however.

Some numeric expressions may return a non-numeric string result, such as "inf" for Infinity or "NaN" for Not a Number.

Advanced Types

STRING COMPRESS
Default value: empty string. Supported data format (regex): .*. UTF-8 string with a finite set of categorical values. The GSQL system uses dictionary encoding to assign a unique integer to each new string value, and then stores the values as integers.

DATETIME
Default value: UTC time 0. Supported data format: see Section "Loading a DATETIME Attribute". Range: 1582-10-15 00:00:00 to 9999-12-31 23:59:59. Date and time (UTC), stored as the number of seconds elapsed since the start of Jan 1, 1970. Time zones are not supported. Displayed in YYYY-MM-DD hh:mm:ss format.

FIXED_BINARY(n)
Default value: N/A. Stream of n binary-encoded bytes.

GSQL also supports the following complex data types:

Complex Types

  • User Defined Tuple (UDT): A UDT represents an ordered structure of several fields of the same or different types. The supported field types are listed below. Each field in a UDT has a fixed size. A STRING field must be given a size in characters, and the loader will only load the first given number of characters. An INT or UINT field can optionally be given a size in bytes.

    INT: user-specified size optional; size choices are 1, 2, 4 (default), or 8 bytes; range -2^(N*8 - 1) to 2^(N*8 - 1) - 1, where N is the size.
    UINT: user-specified size optional; size choices are 1, 2, 4 (default), or 8 bytes; range 0 to 2^(N*8) - 1.
    FLOAT: no user-specified size; same range as a FLOAT attribute.
    DOUBLE: no user-specified size; same range as a DOUBLE attribute.
    DATETIME: no user-specified size; same range as a DATETIME attribute.
    BOOL: no user-specified size; values true and false.
    STRING: a size N (in characters) is required; holds any string of at most N characters.


    Below is an example of defining a UDT:

    Example of a UDT
    TYPEDEF TUPLE <field1 INT (1), field2 UINT, field3 STRING (10), field4 DOUBLE> myTuple

    In this example, myTuple is the name of this UDT. It contains four fields: a 1-byte INT field named field1, a 4-byte UINT field named field2, a 10-character STRING field named field3, and an 8-byte DOUBLE field named field4.

  • LIST/SET: A set is an unordered collection of unique elements of the same type; a list is an ordered collection of elements of the same type. A list can contain duplicate elements; a set cannot. The default value of either is an empty list/set. The supported element types of a list or a set are INT, DOUBLE, STRING, STRING COMPRESS, DATETIME, and UDT. To declare a list or set type, use <> brackets to enclose the element type, e.g.,
    SET<INT>, LIST<STRING COMPRESS>.

    Due to multithreaded GSQL loading, the initial order of elements loaded into a LIST might be different than the order in which they appeared in the input data.

  • MAP: A map is a collection of key-value pairs. It cannot contain duplicate keys, and each key maps to one value. The default value is an empty map. The supported key types are INT, STRING, STRING COMPRESS, and DATETIME. The supported value types are INT, DOUBLE, STRING, STRING COMPRESS, DATETIME, and UDT. To declare a map type, use <> to enclose the types, with a comma to separate the key and value types, e.g.,
    MAP<INT, DOUBLE>. A declaration sketch follows this list.
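
The sketch below declares a vertex type with one attribute of each collection type; the vertex and attribute names are hypothetical placeholders, not part of the schemas used elsewhere in this document.

Example: declaring SET, LIST, and MAP attributes (hypothetical names)
CREATE VERTEX Measurement (PRIMARY_ID id STRING, tags SET<STRING>, samples LIST<INT>, readings MAP<INT, DOUBLE>)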



Defining a Graph Schema


Before data can be loaded into the graph store, the user must define a graph schema. A graph schema is a "dictionary" that defines the types of entities (vertices and edges) in the graph and how those types of entities are related to one another. In the figure below, circles represent vertex types, and lines represent edge types. The labeling text shows the name of each type. This example has four types of vertices: User, Occupation, Book, and Genre. It also has three types of edges: user_occupation, user_book_rating, and book_genre. Note that this diagram does not say anything about how many users or books are in the graph database. It also does not indicate the cardinality of the relationship. For example, it does not specify whether a User may connect to multiple occupations.

An edge connects two vertices; in TigerGraph terminology these two vertices are the source vertex and the target vertex. An edge type can be either directed or undirected. A directed edge has a clear semantic direction, from the source vertex to the target vertex. For example, if there is an edge type that represents a plane flight segment, each segment needs to distinguish which airport is the origin (source vertex) and which airport is the destination (target vertex).  In the example schema below, all of the edges are undirected. A useful test to decide whether an edge should be directed or undirected is the following: "An edge type is directed if knowing there is a relationship from A to B does not tell me whether there is a relationship from B to A." Having nonstop service from Chicago to Shanghai does not automatically imply there is nonstop service from Shanghai to Chicago.

Figure 1 - A schema for a User-Book-Rating graph

An expanded schema is shown below, containing all the original vertex and edge types plus three additional edge types: friend_of, sequel_of, and user_book_read. Note that friend_of joins a User to a User. The friendship is assumed to be bidirectional, so the edge type is undirected. Sequel_of joins a Book to a Book, but it is directed, as evidenced by the arrowhead. The Two Towers is the sequel of The Fellowship of the Ring, but the reverse is not true. User_book_read is added to illustrate that there may be more than one edge type between a pair of vertex types.

Figure 2 - Expanded-User-Book-Rating schema with additional edges

The TigerGraph system user designs a graph schema to fit the source data and the user's needs and interests. The TigerGraph system user should consider what type of relationships are of interest and what type of analysis is needed. The TigerGraph system lets the user modify an existing schema, so the user is not locked into the initial design decision.

In the first schema diagram above, there are seven entities: four vertex types and three edge types. You may wonder why it was decided to make Occupation a separate vertex type instead of an attribute of User. Likewise, why is Genre a vertex type instead of an attribute of Book?  These are examples of design choices. Occupation and Genre were separated out as vertex types because in graph analysis, if an attribute will be used as a query variable, it is often easier to work with as a vertex type.

Once the graph designer has chosen a graph schema, the schema is ready to be formalized into a series of GSQL statements.

CREATE VERTEX

The CREATE VERTEX statement defines a new vertex type, with a name and an attribute list.  At a high level of abstraction, the format is

CREATE VERTEX vertex_type_name (PRIMARY_ID id type [,attribute_list]) [ vertex_options ]

More specifically, the syntax is as follows:

CREATE VERTEX Syntax
CREATE VERTEX vertex_type_name (PRIMARY_ID id_name type [, attribute_name type [DEFAULT default_value] ]*) [WITH STATS="none"|"outdegree"|"outdegree_by_edgetype"]


PRIMARY_ID

The primary_id is a required field whose purpose is to uniquely identify each vertex instance. Only two data types are permitted, either STRING or UINT.

Vertex Attribute List

The attribute list, enclosed in parentheses, is a list of one or more id definitions and attribute descriptions separated by commas:

(PRIMARY_ID id_name type,
[ attribute_name type [DEFAULT default_value] ]*)

    1. The first item is required to be PRIMARY_ID id type, where id is an identifier, and type is one of the attribute data types listed in the "Language Basics" section.
    2. Discontinued Feature

      The NULL and NOT NULL properties are not supported. NULL is not a supported value in the graph database.
      If an input value is not specified for an attribute when a vertex or edge instance is being created, then the attribute will have the default value for that data type.

    3. Every attribute data type has a built-in default value (e.g., the default value for INT type is 0). The DEFAULT default_value option overrides the built-in value.
    4. Any number of additional attributes may be listed after the id attribute. Each attribute has a name, type, and optional default value (for primitive-type, DATETIME, or STRING COMPRESS attributes only)

Example:

  • Create vertex types for the graph schema of Figure 1.

    Vertex definitions for User-Book-Rating graph
    CREATE VERTEX User (PRIMARY_ID user_id UINT, age UINT, gender STRING, postalCode STRING) WITH STATS="outdegree_by_edgetype"
    CREATE VERTEX Occupation (PRIMARY_ID occ_id STRING, occ_name STRING) WITH STATS="outdegree"
    CREATE VERTEX Book (PRIMARY_ID bookcode STRING, title STRING, pub_year UINT)
    CREATE VERTEX Genre (PRIMARY_ID genre_id UINT, genre_name STRING) WITH STATS="outdegree"

    Unlike the tables in a relational database, vertex types do not need to have a foreign key attribute for one vertex type to have a relationship to another vertex type.  Such relationships are handled by edge types.

WITH STATS

By default, when the loader stores a vertex and its attributes in the graph store, it also stores some statistics about the vertex's outdegree – how many connections it has to other vertices.  The optional WITH STATS clause lets the user control how much information is recorded. Recording the information in the graph store will speed up queries which need degree information, but it increases the memory usage.  There are three options. If "outdegree" is chosen, then each vertex records how many outward edges are connected to it (i.e., edges for which this vertex is the source vertex). If "outdegree_by_edgetype" is chosen, then each vertex also records a list of degree count values, one value for each type of edge in the schema. If "none" is chosen, then no degree statistics are recorded with each vertex. If the WITH STATS clause is not used, the loader acts as if "outdegree_by_edgetype" were selected.

If outdegree information is recorded, it can be retrieved in a query using the vertex's outdegree() function.

Example :
The graph below has two types of edges between persons: phone_call and text.  For Bobby, the "outdegree" option records the total number of connections to other vertices (Bobby.outdegree() = 3). The "outdegree_by_edgetype" option also records separate values for how many phone calls Bobby made (1) and how many text messages Bobby sent (2).

Figure 3 - Outdegree stats illustration

WITH STATS option (case insensitive) | Bobby.outdegree() | Bobby.outdegree("text") | Bobby.outdegree("phone_call")
"none" | not available | not available | not available
"outdegree" | 3 | not available | not available
"outdegree_by_edgetype" (default) | 3 | 2 | 1

CREATE EDGE

There are two forms of the CREATE EDGE statement, one for directed edges and one for undirected edges.  Each edge type must specify that it connects FROM one vertex type TO another vertex type.  Additional attributes may be added.  Each attribute follows the same requirements as described in the Attribute List subsection for the "CREATE VERTEX" section.

CREATE UNDIRECTED EDGE
CREATE UNDIRECTED EDGE edge_type_name (FROM vertex_type_name, TO vertex_type_name [, attribute_name type [DEFAULT default_value]]* )


CREATE DIRECTED EDGE
CREATE DIRECTED EDGE edge_type_name (FROM vertex_type_name, TO vertex_type_name [, attribute_name type [DEFAULT default_value]]* ) [WITH REVERSE_EDGE="rev_name"]

Viewed at a higher level of abstraction, the format is

CREATE UNDIRECTED|DIRECTED EDGE edge_type_name (FROM vertex_type_name , TO vertex_type_name ,
edge_attribute_list ) [ edge_options ]

Note that edges do not have a PRIMARY_ID field. Instead, each edge is uniquely identified by a FROM vertex, a TO vertex, and optionally other attributes.  The edge type may also be a distinguishing characteristic. For example, as shown in Figure 2 above, there are two types of edges between User and Book.  Therefore, both types would have attribute lists which begin (FROM User, TO Book,...).

Discontinued Feature

The NULL and NOT NULL properties are not supported. NULL is not a supported value in the graph database.

An edge type can be defined which connects FROM any type of vertex and/or TO any type of vertex.  Use the wildcard symbol * to indicate "any vertex type". For example, the any_edge type below can connect from any vertex to any other vertex:

Wildcard edge type
CREATE DIRECTED EDGE any_edge (FROM *, TO *, label STRING)


WITH REVERSE_EDGE

If a CREATE DIRECTED EDGE statement includes the WITH REVERSE_EDGE="rev_name" optional clause, then an additional directed edge type called rev_name is automatically created, with the FROM and TO vertices swapped.  Moreover, whenever a new edge is created, a reverse edge is also created. The reverse edge will have the same attributes, and whenever the principal edge is updated, the corresponding reverse edge is also updated.

In a TigerGraph system, reverse edges provide the most efficient way to perform graph queries and searches that need to look "backwards". For example, referring to the schema of Figure 2, the query "What is the sequel of Book X, if it has one?" is a forward search, using sequel_of edges.  However, the query "Is Book X a sequel? If so, what Book came before X?" requires examining reverse edges.

Example:

Create undirected edges for the three edge types in Figure 1.

Edge definitions for User-Book-Rating graph
CREATE UNDIRECTED EDGE user_occupation (FROM User, TO Occupation)
CREATE UNDIRECTED EDGE book_genre (FROM Book, TO Genre)
CREATE UNDIRECTED EDGE user_book_rating (FROM User, TO Book, rating UINT, date_time UINT)

The user_occupation and book_genre edges have no attributes. A user_book_rating edge symbolizes that a user has assigned a rating to a book. Therefore it includes an additional attribute, rating. In this case the rating attribute is defined to be an integer, but it could just as easily have been defined as a float attribute.
Example:

Create the additional edges depicted in Figure 2.

Additional Edge definitions for Expanded-User-Book-Rating graph
CREATE UNDIRECTED EDGE friend_of (FROM User, TO User, on_date UINT)
CREATE UNDIRECTED EDGE user_book_read (FROM User, TO Book, on_date UINT)
CREATE DIRECTED EDGE sequel_of (FROM Book, TO Book) WITH REVERSE_EDGE="preceded_by"

Every time the GSQL loader creates a sequel_of edge, it will also automatically create a preceded_by edge, pointing in the opposite direction.

Special Options

Sharing a Compression Dictionary

The STRING COMPRESS and STRING_SET COMPRESS data types achieve compression by mapping each unique attribute value to a small integer. The mapping table ("this string" = "this integer") is called the dictionary. If two such attributes have the same or similar sets of possible values, then it is desirable to have them share one dictionary because it uses less storage space.

When a STRING COMPRESS attribute is declared in a vertex or edge, the user can optionally provide a name for the dictionary. Any attributes which share the same dictionary name will share the same dictionary. For example, v1.att1, v1.att2, and e.att1 below share the same dictionary named "e1".

Shared STRING COMPRESS dictionaries
CREATE VERTEX v1 (PRIMARY_ID main_id STRING, att1 STRING COMPRESS e1, att2 STRING COMPRESS e1)
CREATE UNDIRECTED EDGE e (FROM v1, TO v2, att1 STRING COMPRESS e1)

CREATE GRAPH

After all the required vertex and edge types are created, the CREATE GRAPH command defines a graph schema which contains the given vertex types and edge types, and prepares the graph store to accept data. The vertex types and edge types may be listed in any order. Currently, the TigerGraph system supports only one graph definition at a time.

CREATE GRAPH syntax
CREATE GRAPH gname (vertex_or_edge_type, vertex_or_edge_type...)


Instead of providing a list of specific vertex types and edge types, it is also possible to define a graph type which includes all the available vertex types and edge types:

CREATE GRAPH including all necessary vertex and edge types
CREATE GRAPH gname (*)

If there is already a graph defined, the previous definition must be removed, using DROP GRAPH (or DROP ALL) before the new graph can be defined. Note that data from the previous graph is cleared.


Examples :

Create graph Book_rating for the edge and vertex types defined for Figure 1:

Graph definition for User-Book-Rating graph
CREATE GRAPH Book_rating (*)


The following code example shows the full set of statements to define the expanded user-book-rating graph:

Full definition for the Expanded User-Book-Rating graph
CREATE VERTEX User (PRIMARY_ID user_id UINT, age UINT, gender STRING, postalCode STRING) WITH STATS="outdegree_by_edgetype"
CREATE VERTEX Occupation (PRIMARY_ID occ_id STRING, occ_name STRING) WITH STATS="outdegree"
CREATE VERTEX Book (PRIMARY_ID bookcode STRING, title STRING, pub_year UINT)
CREATE VERTEX Genre (PRIMARY_ID genre_id UINT, genre_name STRING) WITH STATS="outdegree"
CREATE UNDIRECTED EDGE user_occupation (FROM User, TO Occupation)
CREATE UNDIRECTED EDGE book_genre (FROM Book, TO Genre)
CREATE UNDIRECTED EDGE user_book_rating (FROM User, TO Book, rating UINT, date_time UINT)
CREATE UNDIRECTED EDGE friend_of (FROM User, TO User, on_date UINT)
CREATE UNDIRECTED EDGE user_book_read (FROM User, TO Book, on_date UINT)
CREATE DIRECTED EDGE sequel_of (FROM Book, TO Book) WITH REVERSE_EDGE="preceded_by"
CREATE GRAPH Book_rating (*)

DROP GRAPH

DROP GRAPH syntax
DROP GRAPH gname

The DROP GRAPH command not only deletes all the data in the graph, but it also drops the definitions of all vertex types, all edge types, and the graph type itself.  To delete only selected vertex types or edge types, see DROP VERTEX | EDGE in the Section "Modifying a Graph Schema".  You must DROP GRAPH before defining a new graph type.



Modifying a Graph Schema


After a graph schema has been created, it can be modified. Data already stored in the graph and which is not logically part of the change will be retained. For example, if you had 100 Book vertices and then added an attribute to the Book schema, you would still have 100 Books, with default values for the new attribute. If you dropped a Book attribute, you still would have all your books, but one attribute would be gone.

To safely update the graph schema, the user should follow this procedure:

  • Create a SCHEMA_CHANGE JOB, which defines a sequence of ADD, ALTER and/or DROP statements.
  • Run the SCHEMA_CHANGE JOB (i.e. RUN JOB job_name ), which will do the following:
    • Attempt the schema change.
    • If the change is successful, invalidate any loading job or query definitions which are incompatible with the new schema.
    • If the change is unsuccessful, report the failure and return to the state before the attempt.

A schema change will invalidate any loading jobs or query jobs which relate to an altered part of the schema. Specifically:

  • A loading job becomes invalid if it refers to a vertex type or an edge type which has been dropped (deleted) or altered.
  • A query becomes invalid if it refers to a vertex type, an edge type, or an attribute which has been dropped.

Invalid loading jobs are dropped, and invalid queries are uninstalled. After the schema update, the user will need to create and install new loading jobs and queries based on the new schema.

Jobs and queries for unaltered parts of the schema will still be available and do not need to be reinstalled.  However, even though these jobs are valid (e.g., they can be run), the user may wish to examine whether they still perform the preferred operations (e.g., do you want to run them?)

Load or query operations which begin before the schema change will be completed based on the pre-change schema. Load or query operations which begin after the schema change, and which have not been invalidated, will be completed based on the post-change schema.


CREATE SCHEMA_CHANGE JOB

The CREATE SCHEMA_CHANGE JOB block defines a sequence of ADD, ALTER, and DROP statements for changing a particular graph. It does not perform the schema change.

CREATE SCHEMA_CHANGE JOB syntax
CREATE SCHEMA_CHANGE JOB job_name FOR GRAPH graph_name {
  [sequence of DROP, ALTER, and ADD statements, each ending with a semicolon]
}

By its nature, a SCHEMA_CHANGE JOB may contain multiple statements. If the job block is used in the interactive GSQL shell, then the BEGIN and END commands should be used to permit the SCHEMA_CHANGE JOB to be entered on several lines. If the job is stored in a command file to be read in batch mode, then BEGIN and END are not needed.

Remember to include a semicolon at the end of each DROP, ALTER, or ADD statement within the JOB block.

If a SCHEMA_CHANGE JOB defines a new edge type which connects to a new vertex type, the ADD VERTEX statement should precede the related ADD EDGE statement. However, the ADD EDGE and ADD VERTEX statements can be in the same SCHEMA_CHANGE JOB.

ADD VERTEX | EDGE

The ADD statement defines a new type of vertex or edge and automatically adds it to a graph schema. The syntax for the ADD VERTEX | EDGE statement is analogous to that of the CREATE VERTEX | EDGE | GRAPH statements.  It may only be used within a SCHEMA_CHANGE JOB.

ADD VERTEX / UNDIRECTED EDGE / DIRECTED EDGE
ADD VERTEX vertex_type_name (PRIMARY_ID id type [, attribute_list]) [WITH STATS="none"|"outdegree"|"outdegree_by_edgetype"];
ADD UNDIRECTED EDGE edge_type_name (FROM vertex_type_name, TO vertex_type_name [, edge_attribute_list]);
ADD DIRECTED EDGE edge_type_name (FROM vertex_type_name, TO vertex_type_name [, edge_attribute_list]) [WITH REVERSE_EDGE="rev_name"];
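
For illustration, the sketch below adds a hypothetical Publisher vertex type and a connecting edge type to the Book_rating graph defined earlier; the new type and attribute names are placeholders. Note that the ADD VERTEX statement precedes the ADD EDGE statement that refers to it.

Example: SCHEMA_CHANGE JOB using ADD statements (hypothetical types)
CREATE SCHEMA_CHANGE JOB add_publisher FOR GRAPH Book_rating {
  ADD VERTEX Publisher (PRIMARY_ID pub_id UINT, pub_name STRING);
  ADD UNDIRECTED EDGE book_publisher (FROM Book, TO Publisher);
}
RUN JOB add_publisher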

ALTER VERTEX | EDGE

The ALTER statement is used to add attributes to or remove attributes from an existing vertex type or edge type. It may only be used within a SCHEMA_CHANGE JOB.  The basic format is as follows:

ALTER VERTEX / EDGE
ALTER VERTEX|EDGE object_type_name ADD|DROP (attribute_list);

ALTER ... ADD

Added attributes are appended to the end of the schema.  The new attributes may include DEFAULT fields:

ALTER ... ADD
ALTER VERTEX|EDGE object_type_name ADD ATTRIBUTE (attribute_name type [DEFAULT default_value] [, attribute_name type [DEFAULT default_value]]*);
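
As a sketch, the statement below (which would appear inside a SCHEMA_CHANGE JOB block) appends a hypothetical language attribute with a default value to the Book vertex type defined earlier:

Example: ALTER ... ADD (hypothetical attribute)
ALTER VERTEX Book ADD ATTRIBUTE (language STRING DEFAULT "en");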

ALTER ... DROP

ALTER ... DROP
ALTER VERTEX|EDGE object_type_name DROP ATTRIBUTE ( attribute_name [, attribute_name]* );

DROP VERTEX | EDGE

The DROP statement removes the specified vertex type or edge type from the catalog. The DROP statement should only be used when graph operations are not in progress.

drop vertex / edge
DROP VERTEX vertex_type_name [, vertex_type_name]*
DROP EDGE edge_type_name [, edge_type_name]*

RUN SCHEMA_CHANGE JOB

RUN JOB job_name will perform the schema change job. After the schema has been changed, the GSQL system checks all existing GSQL queries (described in "GSQL Language Reference, Part 2: Querying"). If an existing GSQL query uses a dropped vertex, edge, or attribute, the query becomes invalid, and GSQL will show the message "Query query_name becomes invalid after schema update, please update it.".

Below is an example. The schema change job drop_year drops a year attribute from the user_occupation edge.

SCHEMA_CHANGE JOB example
CREATE SCHEMA_CHANGE JOB drop_year FOR GRAPH Book_rating {
  ALTER EDGE user_occupation DROP (year);
}
RUN JOB drop_year



Creating a Loading Job


After a graph schema has been created, the GSQL program is ready to load data into the graph store. The GSQL system can read data from any text file formatted in tabular "CSV" format.  That is, each line in the data file contains a series of data values, separated by commas, tabs, spaces, or any other designated ASCII characters (only single-character separators are supported). A line should contain only data values and separators, without extra whitespace. From a tabular view, each line of data is a row, and each row consists of a series of column values.

The GSQL language offers easy-to-understand and easy-to-use commands for data loading which perform many of the same data conversion, mapping, filtering, and merging operations which are found in enterprise ETL (Extract, Transform, and Load) systems.

Loading data is a two-step process. First, a loading job is defined.  Next, the job is executed with the RUN JOB statement. These two statements, and the components within the loading job, are detailed below.

Prior to v0.8, there were two versions of loading operations, online loading and offline loading. Starting from v0.8, all GSQL loading is online.  For backward compatibility, the syntax for offline loading jobs is still accepted. However, the loader will translate an offline job into one or more online jobs.

Online loading does not have to specify the data source until run time.  Online loading also employs multithreaded execution for faster processing. In an offline loading job, the data source file is specified at job definition time. Also, offline loading jobs can use a header line in the data file to define column names.

The structure of a loading job will be presented hierarchically, top-down:

CREATE ... JOB, which may contain a set of DEFINE and LOAD statements

  • DEFINE statements
  • LOAD statements, which can have several clauses

Example loading jobs and data files for the book_rating schema defined earlier in the document are available in the /doc/examples/gsql_ref folder in your TigerGraph platform installation.


CREATE JOB Block

There are two versions of loading JOB blocks. This in turn will lead to two versions of LOAD and RUN JOB statements.

CREATE ONLINE_POST JOB

The CREATE ONLINE_POST JOB statement defines a block of LOAD and other statements for online loading. An ONLINE_POST JOB is associated with a particular graph ( graph_name ) when it is defined, but it is not bound to a particular data file until the job is run. That is, ONLINE_POST JOB defines a mapping from hypothetical data columns to vertex and edge attributes.  The sequence of DEFINE and LOAD statements is enclosed in curly braces. Each statement in the block, including the last one, should end with a semicolon.

CREATE LOAD for online loading
CREATE ONLINE_POST JOB job_name FOR GRAPH graph_name {
  [zero or more DEFINE statements, each ending in a semicolon]
  [zero or more Online LOAD statements, each ending in a semicolon]
  [zero or more DELETE statements, each ending in a semicolon]
}

CREATE LOADING JOB (offline)

Specification change

Starting with v0.8, offline loading jobs are emulated by online loading jobs.

The CREATE LOADING JOB statement defines a block of LOAD statements for offline loading.  A LOADING JOB pertains to a particular graph ( graph_name ). Moreover, each of its LOAD statements is bound to a particular set of input data files; therefore, the data sources are a fixed feature of the job. The sequence of DEFINE and LOAD statements is enclosed in curly braces. Each statement in the block, including the last one, should end with a semicolon.

CREATE LOAD for offline loading
CREATE LOADING JOB job_name FOR GRAPH graph_name {
  [zero or more DEFINE statements, each ending in a semicolon]
  [one or more Offline LOAD statements, each ending in a semicolon]
}


DROP JOB statement

To drop (remove) a job, run "DROP JOB job_name". The job will be removed from GSQL. To drop all jobs, run either of the following commands:
DROP JOB ALL
DROP JOB *

DEFINE statements

A DEFINE statement is used to define a local variable or expression to be used by the subsequent LOAD statements in the loading job.

DEFINE HEADER

The DEFINE HEADER statement defines a sequence of column names for an input data file, either for online or offline loading. The first column name maps to the first column, the second column name maps to the second column, etc.

DEFINE HEADER header_name = "column_name" [, "column_name"]*;
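
For example, the statement below names four columns for a hypothetical user data file; the header name and column names are placeholders:

Example: DEFINE HEADER (hypothetical names)
DEFINE HEADER user_header = "user_id", "age", "gender", "postal_code";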

DEFINE INPUT_LINE_FILTER

The DEFINE INPUT_LINE_FILTER statement defines a named Boolean expression whose value depends on column attributes from a row of input data. When combined with a USING reject_line_rule clause in a LOAD statement, the filter determines whether an input line is ignored or not.

DEFINE INPUT_LINE_FILTER filter_name = boolean_expression_using_column_variables;

LOAD statements

A LOAD statement tells the GSQL loader how to parse a data line into column values (tokens), and then describes how the values should be used to create a new vertex or edge instance. One LOAD statement can be used to generate multiple vertices or edges, each vertex or edge having its own Destination_Clause , as shown below. Additionally, in offline loading, two or more LOAD statements may refer to the same input data file. In this case, the GSQL loader will merge their operations so that both of their operations are executed in a single pass through the data file.

The LOAD statement has many options. This reference guide provides examples of key features and options. The Platform Knowledge Base / FAQs and the tutorials, such as Get Started with TigerGraph , provide additional solution- and application-oriented examples.

Online LOAD statement

LOAD [TEMP_TABLE table_name] Destination_Clause [, Destination_Clause]* [USING clause];

Offline LOAD statement

LOAD (" file_path "|TEMP_TABLE table_name) Destination_Clause [, Destination_Clause ]* [USING clause ];

The remainder of this section provides details on the format and use of file_path, the Destination_Clause, and its subclauses. The USING clause is introduced later, in Section "Other Optional LOAD Clauses".

File Path (offline only)

For offline loading statements, file_path is the absolute or relative path to the input data file. The path should be enclosed in double quotes, as shown. In general, the input files for offline loading need to be specified at compile time. However, one way to set the relative location of data files at run-time is to use the session parameter sys.data_root .  The file_path value in the LOAD statement can begin with $sys.data_root, as in the following example:

Example: using sys.data_root in a loading job
CREATE LOADING JOB filePathEx FOR GRAPH gsql_demo {
  LOAD "$sys.data_root/persons.csv" TO ...
}

Then, when running this loading job, first set a value for the parameter, and then run the job:

Example: Setting sys.data_root session parameter
SET sys.data_root="/home/tigergraph/mydata" RUN JOB filePathEx

As the name implies, session parameters only retain their value for the duration of the current GSQL session.  If the user exits GSQL, the settings are lost.


If the session parameter sys.data_root was previously defined outside of the job definition, then the file_path may begin with $sys.data_root.

Example of sys.data_root in an Offline LOAD statement
# set the data root dir to a session parameter
SET sys.data_root="/graph/data_abc"

# in an offline loading job
LOAD "$sys.data_root/data_file_name.csv" ...

Destination Clause

A Destination_Clause describes how the tokens from a data source should be used to construct one of three types of data objects: a vertex, an edge, or a row in a temporary table (TEMP_TABLE). The destination clause formats for the three types are very similar, but we show them separately for clarity:

Vertex Destination Clause
TO VERTEX vertex_type_name VALUES (id_expr [, attr_expr]*) [WHERE conditions] [OPTION (options)]


Edge Destination Clause
TO EDGE edge_type_name VALUES (source_id_expr, target_id_expr [, attr_expr]*) [WHERE conditions] [OPTION (options)]


TEMP_TABLE Destination Clause
TO TEMP_TABLE table_name (id_name [, attr_name]*) VALUES (id_expr [, attr_expr]*) [WHERE conditions] [OPTION (options)]

For the TO VERTEX and TO EDGE destination clauses, the vertex_type_name or edge_type_name must match the name of a vertex or edge type previously defined in a CREATE VERTEX or CREATE UNDIRECTED|DIRECTED EDGE statement.  The values in the VALUES list (id_expr, attr_expr1, attr_expr2,...) are assigned to the id(s) and attributes of a new vertex or edge instance, in the same order in which they are listed in the CREATE statement. id_expr obeys the same attribute rules as attr_expr, except that only attr_expr can use the reducer function, which is introduced later.

In contrast, the TO TEMP_TABLE clause is defining a new, temporary data structure.  Its unique characteristics will be described in a separate subsection. For now, we focus on TO VERTEX and TO EDGE.
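
As a sketch, the clause below (which would appear inside a loading job) loads user_book_rating edges for the schema defined earlier; the file path is a hypothetical placeholder. $0 supplies the User id (source), $1 the Book id (target), $2 the rating, and $3 the date_time.

Example: edge destination clause (hypothetical file path)
LOAD "data/book_ratings.csv" TO EDGE user_book_rating VALUES ($0, $1, $2, $3)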

Attributes and Attribute Expressions

A LOAD statement processes each line of an input file, splitting each line (according to the SEPARATOR character, see Section "Other Optional LOAD Clauses" for more details) into a sequence of tokens. Each destination clause provides a token-to-attribute mapping which defines how to construct a new vertex, an edge, or a temp table row instance (e.g., one data object). The tokens can also be thought of as the column values in a table. There are two ways to refer to a column, by position or by name.  Assuming a column has a name, either method may be used, and both methods may be used within one expression.

By Position : The columns (tokens) are numbered from left to right, starting with $0.  The next column is $1, and so on.

By Name : Columns can be named, either through a header line in the input file, or through a DEFINE HEADER statement.  If a header line is used, then the first line of the input file should be structured like a data line, using the same separator characters, except that each column contains a column name string instead of a data value. Names are enclosed in double quotes, e.g. $"age".

Data file name: $sys.file_name refers to the current input data file.

In a simple case, a token value is copied directly to an attribute. For example, in the following LOAD statement,

Example: using $sys.file_name in an attribute expression
LOAD "xx/yy/a.csv" TO VERTEX person VALUES ($0, $1, $sys.file_name)
  • The PRIMARY_ID of a person vertex comes from column $0 of the file "xx/yy/a.csv".
  • The next attribute of a person vertex comes from column $1.
  • The next attribute of a person vertex is given the value "xx/yy/a.csv" (the filename itself).

Cumulative Loading

A basic principle in the GSQL Loader is cumulative loading. Cumulative loading means that a particular data object might be written to (i.e., loaded) multiple times, and the result of the multiple loads may depend on the full sequence of writes. This usually means that if a data line provides a valid data object, and the WHERE clause and OPTION clause are satisfied, then the data object is loaded.

  1. Valid input : For each input data line, each destination clause constructs one or more new data objects. To be a valid data object, it must have an ID value of the correct type, have correctly typed attribute values, and satisfy the optional WHERE clause. If the data object is not valid, the object is rejected (skipped) and counted as an error in the log file. The rules for invalid attributes values are summarized below:
    1. UINT: Any non-digit character. (Out-of-range values cause overflow instead of rejection)
    2. INT: Any non-digit or non-sign character. (Out-of-range values cause overflow instead of rejection)
    3. FLOAT and DOUBLE: Any wrong format
    4. STRING, STRING COMPRESS, FIXED_BINARY: N/A
    5. DATETIME: Wrong format, invalid date time, or out of range.
    6. Complex type: Depends on the field type or element type. Any invalid field (in UDT), element (in LIST or SET), key or value (in MAP) causes rejection.
  2. New data objects: If a valid data object has a new ID value, then the data object is added to the graph store.  Any attributes which are missing are assigned the default value for that data type or for that attribute.
  3. Overwriting existing data objects : If a valid data object has a ID value for an existing object, then the new object overwrites the existing data object, with the following clarifications and exceptions:
    1. The attribute values of the new object overwrite the attribute values of the existing data object.
    2. Missing tokens : If a token is missing from the input line so that the generated attribute is missing, then that attribute retains its previous value.

      A STRING token is never considered missing; if there are no characters, then the string is the empty string

  4. Skipping an attribute : A LOAD statement can specify that a particular attribute should NOT be loaded by using the special character _ (underscore) as its attribute expression (attr_expr).  For example,

    LOAD TO VERTEX person VALUES ($0, $1, _, $2)

    means to skip the next-to-last attribute.  This technique is used when it is known that the input data file does not contain data for every attribute.

    1. If the LOAD is creating a new vertex or edge, then the skipped attribute will be assigned the default value.
    2. If the LOAD is overwriting an existing vertex or edge, then the skipped attribute will retain its existing value.

More Complex Attribute Expressions

An attribute expression may use column tokens (e.g., $0), literals (constant numeric or string values), any of the built-in loader token functions, or a user-defined token function. Attribute expressions may not contain mathematical or boolean operators (such as +, *, AND). The rules for attribute expressions are the same as those for id expressions, but an attribute expression can additionally use a reducer function:

  • id_expr := $column_number | $"column_name" | constant | $sys.file_name | token_function_name(id_expr [, id_expr]*)
  • attr_expr := id_expr | REDUCE(reducer_function_name(id_expr))

Note that token functions can be nested, that is, a token function can be used as an input parameter for another token function. The built-in loader token/reducer functions and user-defined token functions are described in the section "Built-In Loader Token Functions".

The subsections below describe details about loading particular data types.

Loading a DOUBLE or FLOAT Attribute

A floating point value has the basic format
[sign][digits].digits[[sign](e|E)digits]

or

[sign]digits[.[digits]][[sign](e|E)digits]

In the first case, the decimal point and following digits are required. In the second case, some digits are required (looking like an integer), and the following decimal point and digits are optional.

In both cases, the leading sign ( "+" or "-") is optional. The exponent, using "e" or "E", is optional. Commas and extra spaces are not allowed.

Examples of valid and invalid floating point values
# Valid floating point values
-198256.03
+16.
-.00036
7.14285e15
9.99E-22

# Invalid floating point values
-198,256.03
9.99 E-22


Loading a DATETIME Attribute

When loading data into a DATETIME attribute, the GSQL loader will automatically read a string representation of datetime information and convert it to internal datetime representation.  The loader accepts any of the following string formats:

  • %Y-%m-%d %H:%M:%S (e.g., 2011-02-03 01:02:03)
  • %Y/%m/%d %H:%M:%S (e.g., 2011/02/03 01:02:03)
  • %Y-%m-%dT%H:%M:%S.000z (e.g., 2011-02-03T01:02:03.123z, 123 will be ignored)
  • %Y-%m-%d (only date, no time, e.g., 2011-02-03 )
  • %Y/%m/%d (only date, no time, e.g., 2011/02/03)
  • Any integer value (Unix Epoch time, where Jan 1, 1970 at 00:00:00 is integer 0)

Format notation:

%Y is a 4-digit year. A 2-digit year is not a valid value.

%m and %d are a month (1 to 12) and a day (1 to 31), respectively.  Leading zeroes are optional.

%H, %M, %S are hours (0 to 23), minutes (0 to 59) and seconds (0 to 59), respectively. Leading zeroes are optional.

When loading data, the loader checks whether the values of year, month, day, hour, minute, second are out of the valid range. If any invalid value is present, e.g. '2010-13-05' or '2004-04-31 00:00:00', the attribute is invalid and the object (vertex or edge) is not created.
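
As a hedged sketch, the job below loads a DATETIME attribute from the accepted string formats; the vertex type, graph name, job name, and data file are all hypothetical placeholders.

Example: loading a DATETIME attribute (hypothetical names)
CREATE VERTEX Event (PRIMARY_ID event_id STRING, created_at DATETIME)
CREATE GRAPH event_graph (*)
CREATE ONLINE_POST JOB load_events FOR GRAPH event_graph {
  LOAD TO VERTEX Event VALUES ($0, $1);
}
RUN JOB load_events USING FILENAME="./events.csv", SEPARATOR=",", EOL="\n"

events.csv
e1,2011-02-03 01:02:03
e2,2011/02/03
e3,1296694923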

Loading a User-Defined Type (UDT) Attribute

To load a UDT attribute, state the name of the UDT type, followed by the list of attribute expressions for the UDT's fields, in parentheses. See the example below.

syntax example
TYPEDEF TUPLE <f1 INT (1), f2 UINT, f3 STRING (10), f4 DOUBLE> myTuple   # define a UDT
CREATE VERTEX v_udt (PRIMARY_ID id STRING, att_udt myTuple)

CREATE ONLINE_POST JOB load_udt FOR GRAPH test_graph {
  LOAD TO VERTEX v_udt VALUES ($0, myTuple($1, $2, $3, $4));   # $1 is loaded as f1, $2 is loaded as f2, and so on
}

Loading a LIST or SET Attribute

There are three methods to load a LIST or a SET.

The first method is to load multiple rows of data which share the same id values and append the individual attribute values to form a collection of values. The collections are formed incrementally by reading one value from each eligible data line and appending the new value to the collection. When the loading job processes a line, it checks whether a vertex or edge with that id value already exists. If the id value is new, then a new vertex or edge is created with a new list/set containing the single value. If the id has been used before, then the value from the new line is appended to the existing list/set. An example is shown below:

Example: Cumulative loading of multiple rows to a SET/LIST
CREATE VERTEX test_vertex (PRIMARY_ID id STRING, iset SET<INT>, ilist LIST<INT>)
CREATE UNDIRECTED EDGE test_edge (FROM test_vertex, TO test_vertex)
CREATE GRAPH test_set_list (*)

CREATE ONLINE_POST JOB load_set_list FOR GRAPH test_set_list {
  LOAD TO VERTEX test_vertex VALUES ($0, $1, $1);
}
RUN JOB load_set_list USING FILENAME="./list_set_vertex.csv", SEPARATOR=",", EOL="\n"

list_set_vertex.csv

list_set_vertex.csv
1,10
3,30
1,20
3,30
3,40
1,20

The job load_set_list  will load two test_vertex vertices because there are two unique id values in the data file. Vertex 1 has attribute values with iset = [10,20] and ilist = [10,20,20]. Vertex 3 has values iset = [30,40] and ilist = [30, 30, 40]. Note that a set doesn't contain duplicate values, while a list can contain duplicate values.

Because GSQL loading is multi-threaded, the order of values loaded into a LIST might not match the input order.

If the input file contains multiple columns which should be all added to the LIST or SET, then a second method is available. Use the LIST() or SET() function as in the example below:

Example: loading multiple columns to a SET/LIST
CREATE VERTEX v_set (PRIMARY_ID id STRING, nick_names SET<STRING>)
CREATE VERTEX v_list (PRIMARY_ID id STRING, lucky_nums LIST<INT>)

CREATE ONLINE_POST JOB load_set_list FOR GRAPH test_graph {
  LOAD TO VERTEX v_set  VALUES ($0, SET($1,$2,$3));
  LOAD TO VERTEX v_list VALUES ($0, LIST($2,$4));
}

The third method is to use the SPLIT() function to read a compound token and split it into a collection of elements, to form a LIST or SET collection. The SPLIT() function takes two arguments: the column index and the element separator. The element separator must be different from the column separator used throughout the file. An example is shown below:

Example: SET/LIST loading by SPLIT() example
CREATE VERTEX test_vertex (PRIMARY_ID id STRING, ustrset SET<STRING>, ilist LIST<INT>)
CREATE UNDIRECTED EDGE test_edge (FROM test_vertex, TO test_vertex)
CREATE GRAPH test_split (*)

CREATE ONLINE_POST JOB set_list_job FOR GRAPH test_split {
  LOAD TO VERTEX test_vertex VALUES ($0, SPLIT($1,"|"), SPLIT($2,"#"));
}
RUN JOB set_list_job USING FILENAME="./split_list_set.csv", SEPARATOR=",", EOL="\n"

split_list_set.csv

split_list_set.csv

vid,names,numbers
v1,mike|tom|jack, 1 # 2 # 3
v2,john, 5 # 4 # 8


The SPLIT() function cannot be used for UDT type elements.

Loading a MAP Attribute

There are three methods to load a MAP.

The first method is to load multiple rows of data which share the same id values. The maps are formed incrementally by reading one key-value pair from each eligible data line. When the loading job processes a line, it checks to see whether a vertex or edge with that id value(s) already exists or not. If the id value(s) is new, then a new vertex or edge is created with a new map containing the single key-value pair. If the id(s) has been used before, then the loading job checks whether the key exists in the map or not. If the key doesn't exist in the map, the new key-value pair is inserted. Otherwise, the value will be replaced by the new value.

The loading order might not be the same as the order in the raw data. If a data file contains multiple lines with the same id and same key but different values, loading them together results in a nondeterministic final value for that key.

Method 1: Below is the syntax for loading a MAP by the first method. An arrow (->) separates the map's key and value.

Loading a MAP by method 1: -> separator
CREATE VERTEX v_map (PRIMARY_ID id STRING, att_map MAP<INT, STRING>)

CREATE ONLINE_POST JOB load_map FOR GRAPH test_graph {
  LOAD TO VERTEX v_map VALUES ($0, ($1 -> $2));
}


Method 2 : The second method is to use the MAP() function. If there are multiple key-value pairs among multiple columns, MAP() can load them together. Below is an example:

Loading a MAP by method 2: MAP() function
CREATE VERTEX v_map (PRIMARY_ID id STRING, att_map MAP<INT, STRING>)

CREATE ONLINE_POST JOB load_map FOR GRAPH test_graph {
  LOAD TO VERTEX v_map VALUES ($0, MAP(($1 -> $2), ($3 -> $4)));   # $1 and $3 are keys; $2 and $4 are the corresponding values
}


Method 3 : The third method is to use the SPLIT() function. Similar to the SPLIT() in loading LIST or SET, the SPLIT() function can be used when the key-value pair is in one column and separated by a key-value separator, or multiple key-value pairs are in one column and separated by element separators and key-value separators. SPLIT() here has three parameters: The first is the column index, the second is the key-value separator, and the third is the element separator. The third parameter is optional. If one row of raw data only has one key-value pair, the third parameter can be skipped. Below are the examples without and with the given element separator.

one_key_value.csv

example data with one key-value pair per line
vid,key_value
v1,1:mike
v2,2:tom
v1,3:lucy

multi_key_value.csv

example data with multiple key-value pairs per line
vid,key_value_list
v1,1:mike#4:lin
v2,2:tom
v1,3:lucy#1:john#6:jack


Loading a MAP by method 3: SPLIT() function
CREATE VERTEX v_map (PRIMARY_ID id STRING, att_map MAP<INT, STRING>)

CREATE ONLINE_POST JOB load_map FOR GRAPH test_graph {
  LOAD TO VERTEX v_map VALUES ($0, SPLIT($1, ":", "#"));
}

The SPLIT() function cannot be used for UDT type elements.

Loading Wildcard Type Edges

If an edge has been defined using a wildcard vertex type, a vertex type name must be specified, following the vertex id, in a load statement for the edge. An example is shown below:

Example: explicit vertex typing for an untyped edge
# schema setup
CREATE VERTEX user (PRIMARY_ID id UINT)
CREATE VERTEX product (PRIMARY_ID id UINT)
CREATE VERTEX picture (PRIMARY_ID id UINT)
CREATE UNDIRECTED EDGE purchase (FROM *, TO *)
CREATE GRAPH test_graph (*)

# offline loading job
CREATE LOADING JOB test FOR GRAPH test_graph {
  LOAD "a.csv"
    TO EDGE purchase VALUES ($0 user, $1 product),
    TO EDGE purchase VALUES ($0 user, $2 picture)
    USING SEPARATOR = ",";
}

# online loading job
CREATE ONLINE_POST JOB test2 FOR GRAPH test_graph {
  LOAD
    TO EDGE purchase VALUES ($0 user, $1 product),
    TO EDGE purchase VALUES ($0 user, $2 picture);
}


Built-in Loader Token Functions

The GSQL Loader provides several built-in functions which operate on tokens. Some may be used to construct attribute expressions and some may be used for conditional expressions in the WHERE clause.

Token Functions for Attribute Expressions

The following token functions can be used in an id or attribute expression

Function name and parameters

Output type

Description of function

gsql_reverse( main_string ) string Returns a string with the characters in the reverse order of the input string main_string .
gsql_concat( string1, string2,...,stringN ) string Returns a string which is the concatenation of all the input strings.
gsql_split_by_space( main_string ) string Returns a modified version of main_string , in which each space character is replaced with ASCII 30 (decimal).
gsql_to_bool( main_string ) bool Returns true if the main_string is either "t" or "true", with case insensitive checking. Returns false otherwise.
gsql_to_uint( main_string ) uint

If main_string is the string representation of an unsigned int, the function returns that integer.
If main_string is the string representation of a nonnegative float, the function returns that number cast as an int.

gsql_to_int( main_string ) int

If main_string is the string representation of an int, the function returns that integer.
If main_string is the string representation of a float, the function returns that number cast as an int.

gsql_ts_to_epoch_seconds( main_string ) uint Converts a timestamp in canonical string format to Unix epoch time, which is the int number of seconds since Jan. 1, 1970. The main_string should be in one of the following 3 formats:
"%Y-%m-%d %H:%M:%S"
"%Y/%m/%d %H:%M:%S"
"%Y-%m-%dT%H:%M:%S.000z"
// text after . is ignored
gsql_current_time_epoch(0)

uint

Returns the current time in Unix epoch seconds. *By convention, the input parameter should be 0, but it is ignored.

flatten( column_to_be_split, group_separator, 1 )

flatten( column_to_be_split, group_separator, sub_field_separator, number_of_sub_fields_in_one_group )


See the section "TEMP_TABLE and Flatten Functions" below.

flatten_json_array ( $"array_name" )

flatten_json_array ( $"array_name", $"sub_obj_1", $"sub_obj_2", ..., $"sub_obj_n" )


See the section "TEMP_TABLE and Flatten Functions" below.

split( column_to_be_split, element_separator )

split( column_to_be_split, key_value_separator, element_separator )


See the section "Loading a LIST or SET Attribute" above.

See the section "Loading a MAP Attribute" above.

Reducer Functions

A reducer function aggregates multiple values of a non-id attribute into one attribute value of a single vertex or edge. Reducer functions are computed incrementally; that is, each time a new input token is applied, a new resulting value is computed.

Discontinued feature

Prior to Version 0.8, there were separate reducer functions for offline loading and online loading. Starting in Version 0.8, the online loading functions (e.g., add(), max()) are now supported for offline loading as well. The previous offline loading functions (e.g., gsql_sum_uint(), gsql_max_real()) are no longer supported.


To reduce and load aggregate data to an attribute, the attribute expression has the form

REDUCE( reducer_function ( input_expr ) )

where reducer_function is one of the functions in the table below. input_expr can include non-reducer functions, but reducer functions cannot be nested.

Each reducer function is overloaded so that one function can be used for several different data types. For primitive data types, the output type is the same as the input_expr type. For LIST, SET, and MAP containers, the input_expr type is one of the allowed element types for these containers (see "Complex Types" in the Attribute Data Types section).  The output is the entire container.

Online Reducer Functions

Function name Data type of arg : Description of function's return value
max( arg ) INT, UINT, FLOAT, DOUBLE: maximum of all arg values cumulatively received
min( arg ) INT, UINT, FLOAT, DOUBLE: minimum of all arg values cumulatively received
add( arg )

INT, UINT, FLOAT, DOUBLE: sum of all arg values cumulatively received
STRING: concatenation of all arg values cumulatively received
LIST, SET element: list/set of all arg values cumulatively received
MAP (key -> value) pair: key-value dictionary of all key-value pair arg values cumulatively received

and( arg )

BOOL: AND of all arg values cumulatively received
INT, UINT: bitwise AND of all arg values cumulatively received

or( arg ) BOOL: OR of all arg values cumulatively received
INT, UINT: bitwise OR of all arg values cumulatively received
overwrite( arg )

non-container: arg
LIST, SET: new list/set containing only arg

ignore_if_exists( arg ) Any: If an attribute value already exists, return(retain) the existing value. Otherwise, return(load) arg .


Each function supports a certain set of attribute types. Calling a reducer function with an incompatible type crashes the service. To prevent this, use the WHERE clause (introduced below) together with IS NUMERIC or other type-checking operators, functions, or predicates when necessary.
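For illustration, the job below uses reducer functions to accumulate a running total into an INT attribute and to collect values into a SET attribute. The account vertex, its attributes, and the job name are assumptions made only for this sketch.

Hypothetical reducer function example
CREATE VERTEX account (PRIMARY_ID id STRING, total INT, tags SET<STRING>)
CREATE ONLINE_POST JOB load_account FOR GRAPH test_graph {
  # REDUCE(add($1)) accumulates a running sum of column 1 for each id;
  # REDUCE(add($2)) appends each column 2 value to the SET attribute
  LOAD TO VERTEX account VALUES ($0, REDUCE(add($1)), REDUCE(add($2)));
}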

WHERE Clause

The WHERE clause is an optional clause. The WHERE clause's condition is a boolean expression.  The expression may use column token variables, token functions, and operators which are described below. The expression is evaluated for each input data line. If the condition is true, then the vertex or edge instance is loaded into the graph store. If the condition is false, then this instance is skipped. Note that all attribute values are treated as string values in the expression, so the type conversion functions to_int() and to_float(), which are described below, are provided to enable numerical conditions.

Operators in the WHERE Clause

The GSQL Loader language supports most of the standard arithmetic, relational, and boolean operators found in C++. Standard operator precedence applies, and parentheses provide the usual override of precedence.

  • Arithmetic Operators: +, -, *, /, ^
    Arithmetic expressions can be used to build complex operations on numeric types. Just as in ordinary mathematical expressions, parentheses can be used to group terms and to modify the order of precedence.

    Because computers can only store approximations for most DOUBLE and FLOAT type values, it is not recommended to test for exact equality or inequality. Instead, one should allow for an acceptable amount of error. The following example checks whether $0 = 5, permitting an error of 0.00001:

    WHERE to_float($0) BETWEEN 5-0.00001 AND 5+0.00001


  • Relational Operators: <, >, ==, !=, <=, >=
    Comparisons can be performed between two numeric values or between two string values.

  • Predicate Operators:
    • AND, OR, NOT operators are the same as in SQL. They can be used to combine multiple conditions together.
      E.g., $0 < "abc" AND $1 > "abc" selects the rows with the first token less than "abc" and the second token greater than "abc".
      E.g., NOT $1 < "abc" selects the rows with the second token greater than or equal to "abc".

    • IS NUMERIC
      token IS NUMERIC returns true if token is in numeric format. Numeric format includes integers, decimal notation, and exponential notation. Specifically, IS NUMERIC is true if token matches the following regular expression: (+/-)?[0-9]+(.[0-9])?[0-9]*((e/E)(+/-)?[0-9]+)?. Leading and trailing spaces are skipped, but no other spaces are allowed.
      E.g., $0 IS NUMERIC checks whether the first token is in numeric format.
    • IS EMPTY
      token IS EMPTY returns true if token is an empty string.
      E.g., $1 IS EMPTY checks whether the second token is empty.
    • IN
      token IN ( set_of_values ) returns true if token is equal to one member of a set of specified values. The values may be string or numeric types.
      E.g., $2 IN ("abc", "def", "lhm") tests whether the third token equals one of the three strings in the given set.
      E.g., to_int($3) IN (10, 1, 12, 13, 19) tests whether the fourth token equals one of the specified five numbers.
    • BETWEEN ... AND
      token BETWEEN lowerVal AND upperVal returns true if token is within the specified range, inclusive of the endpoints. The values may be string or numeric types.
      E.g., $4 BETWEEN "abc" AND "def" checks whether the fifth token is greater than or equal to "abc" and also less than or equal to "def".
      E.g., to_float($5) BETWEEN 1 AND 100.5 checks whether the sixth token is greater than or equal to 1.0 and less than or equal to 100.5.
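Putting several of these operators together, a LOAD statement might filter its input as in the sketch below. The movie vertex and the column layout are illustrative assumptions, not part of the reference.

Hypothetical WHERE clause combining operators
LOAD TO VERTEX movie VALUES ($0, $1, $2, $3) WHERE $3 IS NUMERIC AND to_int($3) BETWEEN 1950 AND 2000 AND $2 IN ("USA", "FRA", "CHN");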

Token functions in the WHERE clause

The GSQL loading language provides several built-in functions for the WHERE clause.

Function name Output type Description of function
to_int( main_string ) int Converts main_string to an integer value.

to_float( main_string ) float Converts main_string to a float value.
concat( string1, string2,...,stringN ) string Returns a string which is the concatenation of all the input strings.
token_len( main_string ) int Returns the length of main_string.
gsql_is_not_empty_string( main_string ) bool Returns true if main_string is empty after removing white space. Returns false otherwise.
gsql_token_equal( string1, string2 ) bool Returns true if string1 is exactly the same (case sensitive) as string2 . Returns false otherwise.
gsql_token_ignore_case_equal( string1, string2 ) bool Returns true if string1 is exactly the same (case insensitive) as string2 . Returns false otherwise.
gsql_is_true( main_string ) bool Returns true if main_string is either "t" or "true" (case insensitive). Returns false otherwise.
gsql_is_false( main_string ) bool Returns true if main_string is either "f" or "false" (case insensitive). Returns false otherwise.


The token functions for the WHERE clause and the token functions for attribute expressions are different sets. They cannot be used interchangeably.
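As a small sketch (the company vertex and column layout are assumptions made for this example), WHERE-clause token functions are called directly on column tokens:

Hypothetical WHERE clause using built-in token functions
LOAD TO VERTEX company VALUES ($0, $1, $2) WHERE gsql_is_true($2) AND token_len($1) > 0;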

User-Defined Token Functions

Users can write their own token functions in C++ and install them in the GSQL system. The system installation already contains a source code file with sample functions, and users simply add their customized token functions to this file. The file for user-defined token functions for attribute expressions or WHERE clauses is <tigergraph.root.dir>/dev/gdk/gsql/src/TokenBank/TokenBank.cpp. There are a few examples in this file, and details are presented below.



Testing your functions is simple. In the same directory as the TokenBank.cpp file, there is a command script called compile.


  1. To test that your function compiles:

    ./compile
  2. To test that your function works correctly, write your own test and add it to the main() procedure in TokenBank.cpp. Then compile the file and run it. Note that files located in ../TokenLib need to be included:

    g++ -I../TokenLib TokenBank.cpp
    ./a.out

User-defined Token Functions for Attribute Expressions

Attribute type: Function signature

string or string compress: extern "C" void funcName (const char* const iToken[], uint32_t iTokenLen[], uint32_t iTokenNum, char* const oToken, uint32_t& oTokenLen)
bool: extern "C" bool funcName (const char* const iToken[], uint32_t iTokenLen[], uint32_t iTokenNum)
uint: extern "C" uint64_t funcName (const char* const iToken[], uint32_t iTokenLen[], uint32_t iTokenNum)
int: extern "C" int64_t funcName (const char* const iToken[], uint32_t iTokenLen[], uint32_t iTokenNum)
float: extern "C" float funcName (const char* const iToken[], uint32_t iTokenLen[], uint32_t iTokenNum)
double: extern "C" double funcName (const char* const iToken[], uint32_t iTokenLen[], uint32_t iTokenNum)

The parameters are as follows: iToken is the array of string tokens, iTokenLen is the array of the length of the string tokens, and iTokenNum is the number of tokens. Note that the input tokens are always in string (char*) format.

If the attribute type is neither string nor string compress, the return type should be the corresponding type: bool for bool, uint64_t for uint, int64_t for int, float for float, and double for double. If the attribute type is string or string compress, the return type should be void, and the extra parameters (char* const oToken, uint32_t& oTokenLen) are used for returning the string: oToken is the returned string value, and oTokenLen is the length of this string.

The built-in token function gsql_concat is used as an example below. It takes multiple token parameters and returns a string.

gsql_concat
extern "C" void gsql_concat(const char* const iToken[], uint32_t iTokenLen[], uint32_t iTokenNum, char* const oToken, uint32_t& oTokenLen) { int k = 0; for (int i=0; i < iTokenNum; i++) { for (int j =0; j < iTokenLen[i]; j++) { oToken[k++] = iToken[i][j]; } } oTokenLen = k; }

User-defined Token Functions for WHERE Clause

User-defined token functions (described above) can also be used to construct the boolean conditional expression in the WHERE clause. However, there are some restrictions in the WHERE clause:

In the clause "WHERE conditions ",

  • The only user-defined token functions allowed are those that return a boolean value.
  • At a given level of evaluation, the expression cannot combine user-defined token functions with built-in token functions. That is, if Udtf() is a user-defined token function and Btf() is a built-in token function, then
    • Illegal: Udtf( $1 ) AND Btf( $2 )
    • Legal: Udtf( Btf( $1 ), Btf( $2 ) )


The source code for the built-in token function gsql_token_equal is used as an example for how to write a user-defined token function.

gsql_token_equal
extern "C" bool gsql_token_equal(const char* const iToken[], uint32_t iTokenLen[], uint32_t iTokenNum) { if (iTokenNum != 2) { return false; } if (iTokenLen[0] != iTokenLen[1]) { return false; } for (int i =0; i < iTokenLen[0]; i++) { if (iToken[0][i] != iToken[1][i]) { return false; } } return true; }

Other Optional LOAD Clauses

OPTION clause

There are no supported options for the OPTION clause at this time.

USING clause

A USING clause contains one or more parameter value pairs:

USING parameter=value [parameter=value]*

There are three different places where a USING clause may exist:

  1. At the end of an online LOAD statement.
  2. At the end of an offline LOAD statement.
  3. At the end of an online RUN JOB statement.

Each case serves a different purpose. Accordingly, each location has a different set of USING parameters which either must appear or may appear. The rules are described in the table below.

If multiple LOAD statements use the same source (the same data file, TEMP_TABLE, or the same online job), the USING clauses in these LOAD statements must be the same. Therefore, if multiple destination clauses share the same source, we recommend putting all of them into the same LOAD statement, as in the sketch below.
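The vertex and edge names in this sketch are placeholders, not part of the reference schema; the point is that one LOAD statement carries the shared USING settings for both destination clauses.

Hypothetical example: two destination clauses sharing one source
CREATE ONLINE_POST JOB load_shared_source FOR GRAPH test_graph {
  LOAD
    TO VERTEX person VALUES ($0, $1, $2, $3),
    TO EDGE   friend VALUES ($0, $4)
    USING QUOTE="double";
}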


The following USING parameters are supported. In the Online LOAD Stmt, Offline LOAD Stmt, and Online RUN JOB fields below, the abbreviations have the following meanings:
"--" = "not applicable", "Opt" = "optional", "REQ" = "required".

FILENAME
  Online LOAD Stmt: --    Offline LOAD Stmt: --    Online RUN JOB: REQ
  Meaning of Value: name of the input data file
  Allowed Values: any valid path to a data file

SEPARATOR
  Online LOAD Stmt: --    Offline LOAD Stmt: Opt (default is comma)    Online RUN JOB: REQ
  Meaning of Value: specifies the special character that separates tokens (columns) in the data file
  Allowed Values: any single ASCII character; "\t" for tab; "\xy" for ASCII decimal code xy

EOL
  Online LOAD Stmt: --    Offline LOAD Stmt: --    Online RUN JOB: REQ
  Meaning of Value: the end-of-line character
  Allowed Values: any ASCII sequence. Default = "\n" (system-defined newline character or character sequence)

QUOTE
  Online LOAD Stmt: Opt    Offline LOAD Stmt: Opt    Online RUN JOB: --
  Meaning of Value: specifies explicit boundary markers for string tokens, either single or double quotation marks. See more details below.
  Allowed Values: "single" for ', "double" for "

USER_DEFINED_HEADER
  Online LOAD Stmt: Opt    Offline LOAD Stmt: Opt    Online RUN JOB: --
  Meaning of Value: specifies the name of the header variable, when a header has been defined in the loading job rather than in the data file
  Allowed Values: the variable name in the preceding DEFINE HEADER statement

REJECT_LINE_RULE
  Online LOAD Stmt: Opt    Offline LOAD Stmt: Opt    Online RUN JOB: --
  Meaning of Value: if the filter expression evaluates to true, then do not use this input data line
  Allowed Values: the name of a filter from a preceding DEFINE INPUT_LINE_FILTER statement

JSON_FILE
  Online LOAD Stmt: Opt    Offline LOAD Stmt: Opt    Online RUN JOB: --
  Meaning of Value: whether each line is a JSON object (see the section "Loading JSON Data" below for more details)
  Allowed Values: "true", "false". Default is "false"

HEADER
  Online LOAD Stmt: --    Offline LOAD Stmt: Opt    Online RUN JOB: Opt*
  Meaning of Value: whether the data file's first line is a header line. For offline loading, the header assigns names to the columns. *For online loading, HEADER="true" simply tells the loader to skip the first line.
  Allowed Values: "true", "false". Default is "false"

QUOTE parameter

The parser will not treat separator characters found within a pair of quotation marks as a separator. For example, if the parsing conditions are QUOTE="double", SEPARATOR=",", the comma in "Leonard,Euler" will not separate Leonard and Euler into separate tokens.

  • If QUOTE is not declared, quotation marks are treated as ordinary characters.
  • If QUOTE is declared, but a string does not contain a matching pair of quotation marks, then the string is treated as if QUOTE is not declared.
  • Only the string inside the first pair of quotation marks (reading from left to right) is loaded. For example, with QUOTE="double", the string a"b"c"d"e will be loaded as b.
  • There is no escape character in the loader, so the only way to include quotation marks within a string is for the string body to use one type of quote (single or double) and to declare the other type as the string boundary marker.
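As an illustration (the data line below is hypothetical), with SEPARATOR="," and QUOTE="double", the comma inside the quoted token is preserved:

Hypothetical QUOTE="double" parsing example
# input line:
101,"Leonard,Euler",mathematician
# parsed tokens: $0 = 101    $1 = Leonard,Euler    $2 = mathematician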


Loading JSON Data

When the USING option JSON_FILE="true" is used, the loader loads JSON objects instead of tabular data. A JSON object is an unordered set of key/value pairs, where each value may itself be an array or object, leading to nested structures.  A colon separates each key from its value, and a comma separates items in a collection.  A more complete description of JSON format is available at www.json.org. The JSON loader requires that each input line has exactly one JSON object. Instead of using column values as tokens, the JSON loader uses JSON values as tokens, that is, the second part of each JSON key/value pair. In a GSQL loading job, a JSON field is identified by a dollar sign $ followed by the colon-separated sequence of nested key names to reach the value from the top level. For example, given the JSON object {"abc":{"def": "this_value"}}, the identifier $"abc":"def" is used to access "this_value". The double quotes are mandatory.

An example is shown below:

USING JSON_FILE test schema and loading job
CREATE VERTEX encoding (PRIMARY_ID id STRING, length FLOAT default 10)
CREATE UNDIRECTED EDGE encoding_edge (FROM encoding, TO encoding)
CREATE GRAPH encoding_graph (*)

CREATE LOADING JOB json_load FOR GRAPH encoding_graph {
  LOAD "encoding.json" TO VERTEX encoding VALUES ($"encoding", $"indent":"length") USING JSON_FILE="true";
}
RUN JOB json_load

encoding.json

encoding.json
{"encoding": "UTF-7","plug-ins":["c"],"indent" : { "length" : 30, "use_space": true }} {"encoding":"UTF-1","indent":{"use_space": "dontloadme"}, "plug-ins" : [null, true, false] } {"plug-ins":["C","c++"],"indent":{"length" : 3, "use_space": false},"encoding":"UTF-6"}

In the above data file encoding.json, the order of fields is not fixed and some fields are missing. The JSON loader ignores the order and accesses the fields by the nested key names. The missing fields are loaded with default values. The resulting vertices are:

id length
"UTF-7" 30
"UTF-1" 10
"UTF-6" 3

TEMP_TABLE and Flatten Functions

The keyword TEMP_TABLE triggers the use of a temporary data table which is used to store data generated by one LOAD statement, for use by a later LOAD statement. Earlier we introduced the syntax for loading data to a TEMP_TABLE:

TEMP_TABLE Destination Clause
TO TEMP_TABLE table_name (id_name [, attr_name]*) VALUES (id_expr [, attr_expr]*) [WHERE conditions] [OPTION (options)]

This clause is designed to be used in conjunction with the flatten or flatten_json_array function in one of the attr_expr expressions. The flatten function splits a multi-value field into a set of records. Those records can first be stored into a temporary table, and then the temporary table can be loaded into vertices and/or edges. Only one flatten function is allowed in one temp table destination clause.

There are two versions of the flatten function: One parses single-level groups and the other parses two-level groups. There are also two versions of the flatten_json_array function: One splits an array of primitive values, and the other splits an array of JSON objects.

One-Level Flatten Function

flatten( column_to_be_split, separator, 1 ) is used to parse a one-level group into individual elements. An example is shown below:

book1.dat

book1.dat
101|"Harry Potter and the Philosopher's Stone"|"fiction,fantasy,young adult" 102|"The Three-Body Problem"|"fiction,science fiction,Chinese"


The loading job contains two LOAD statements.  The first one loads input data to Book vertices and to a TEMP_TABLE. The second one loads the TEMP_TABLE data to Genre vertices and book_genre edges.

One-level Flatten Function loading (book_flatten1_load.gsql)
CREATE ONLINE_POST JOB load_books_flatten1 FOR GRAPH Book_rating {
  LOAD
    TO VERTEX Book VALUES ($0, $1, _),
    TO TEMP_TABLE t1 (bookcode, genre) VALUES ($0, flatten($2, ",", 1))
    USING QUOTE="double";
  LOAD TEMP_TABLE t1
    TO VERTEX Genre VALUES ($"genre", $"genre"),
    TO EDGE book_genre VALUES ($"bookcode", $"genre");
}
RUN JOB load_books_flatten1 USING FILENAME="book1.dat", SEPARATOR="|", EOL="\n"

Line 4 says that the third column ($2) of each input line should be split into separate tokens, with comma "," as the separator. Each token will have its own row in table t1. The first column is labeled "bookcode" with value $0 and the second column is "genre" with one of the $2 tokens.  The contents of TEMP_TABLE t1 are shown below:

bookcode genre
101 fiction
101 fantasy
101 young adult
102 fiction
102 science fiction
102 Chinese

Then, lines 6 and 8 say to read TEMP_TABLE t1 and to do the following for each row:

  • Create a Genre vertex for each new value of "genre".
  • Create a book_genre edge from "bookcode" to "genre".  In this case, each row of TEMP_TABLE t1 generates one book_genre edge.

The final graph will contain two Book vertices (101 and 102), five Genre vertices, and six book_genre edges.

List of all book_genre edges after loading
{ "results": [{"@@edgeSet": [ { "from_type": "Book", "to_type": "Genre", "directed": false, "from_id": "101", "to_id": "fiction", "attributes": {}, "e_type": "book_genre" }, { "from_type": "Book", "to_type": "Genre", "directed": false, "from_id": "101", "to_id": "fantasy", "attributes": {}, "e_type": "book_genre" }, { "from_type": "Book", "to_type": "Genre", "directed": false, "from_id": "102", "to_id": "sciencevfiction", "attributes": {}, "e_type": "book_genre" }, { "from_type": "Book", "to_type": "Genre", "directed": false, "from_id": "101", "to_id": "young adult", "attributes": {}, "e_type": "book_genre" }, { "from_type": "Book", "to_type": "Genre", "directed": false, "from_id": "102", "to_id": "fiction", "attributes": {}, "e_type": "book_genre" }, { "from_type": "Book", "to_type": "Genre", "directed": false, "from_id": "102", "to_id": "Chinese", "attributes": {}, "e_type": "book_genre" } ]}] }


Two-Level Flatten Function

flatten( column_to_be_split, group_separator, sub_field_separator, number_of_sub_fields_in_one_group ) is used to parse a two-level group into individual elements. Each token in the main group may itself be a group, so there are two separators: one for the top level and one for the second level. An example is shown below.

book2.dat
101|"Harry Potter and the Philosopher's Stone"|"FIC:fiction,FTS:fantasy,YA:young adult" 102|"The Three-Body Problem"|"FIC:fiction,SF:science fiction,CHN:Chinese"

The flatten function now has four parameters instead of three.  The additional parameter is used to record the genre_name in the Genre vertices.

Two-level Flatten Function loading (book_flatten2_load.gsql)
CREATE ONLINE_POST JOB load_books_flatten2 FOR GRAPH Book_rating {
  LOAD
    TO VERTEX Book VALUES ($0, $1, _),
    TO TEMP_TABLE t2 (bookcode, genre_id, genre_name) VALUES ($0, flatten($2, ",", ":", 2))
    USING QUOTE="double";
  LOAD TEMP_TABLE t2
    TO VERTEX Genre VALUES ($"genre_id", $"genre_name"),
    TO EDGE book_genre VALUES ($"bookcode", $"genre_id");
}
RUN JOB load_books_flatten2 USING FILENAME="book2.dat", SEPARATOR="|", EOL="\n"

In this example, in the genres column ($2), there are multiple groups, and each group has two sub-fields, genre_id and genre_name. After running the online job, the file book2.dat will be loaded into the TEMP_TABLE t2 as shown below.

bookcode genre_id genre_name
101 FIC fiction
101 FTS fantasy
101 YA young adult
102 FIC fiction
102 SF science fiction
102 CHN Chinese

Flatten a JSON Array of Primitive Values

flatten_json_array($" array_name ") parses a JSON array of primitive (string, numberic, or bool) values, where "array_name" is the name of the array. Each value in the array creates a record. Below is an example:

flatten_json_array_values loading
CREATE VERTEX encoding (PRIMARY_ID id STRING, length FLOAT default 10)
CREATE UNDIRECTED EDGE encoding_edge (FROM encoding, TO encoding)
CREATE GRAPH encoding_graph (*)

CREATE LOADING JOB json_flatten FOR GRAPH encoding_graph {
  LOAD "encoding2.json"
    TO TEMP_TABLE t2 (name, length) VALUES (flatten_json_array($"plug-ins"), $"indent":"length")
    USING JSON_FILE="true";
  LOAD TEMP_TABLE t2
    TO VERTEX encoding VALUES ($"name", $"length");
}
RUN JOB json_flatten

encoding2.json

encoding2.json
{"plug-ins" : ["C", "c++"],"encoding" : "UTF-6","indent" : { "length" : 3, "use_space": false}}


The above data and loading job creates the following temporary table:

id length
C 3
c++ 3

Flatten a JSON Array of JSON Objects

flatten_json_array ( $"array_name", $"sub_obj_1", $"sub_obj_2", ..., $"sub_obj_n" ) parses a JSON array of JSON objects. "array_name" is the name of the array, and the following parameters $"sub_obj_1", $"sub_obj_2", ..., $"sub_obj_n" are the field key names in each object in the array. See complete example below:

encoding3.json

encoding3.json
{"encoding":"UTF-1","indent":{"use_space": "dontloadme"}, "plug-ins" : [null, true, false, {"lang":"golang","prop":{"age":"noidea"}}]} {"encoding": "UTF-8", "plug-ins" : [{"lang": "pascal", "score":"1.0", "prop":{"age":"old"}}, {"lang":"c++", "score":2.0}],"indent":{"length" :12,"use_space": true}} {"encoding": "UTF-7", "plug-ins" : [{"lang":"java", "score":2.22}, {"lang":"python", "score":3.0},{"lang":"go", "score":4.0, "prop":{"age":"new"}}],"indent" : { "length" : 30, "use_space": true }} {"plug-ins" : ["C", "c++"],"encoding" : "UTF-6","indent" : { "length" : 3, "use_space": false}}


json_flatten_array_test.gsql
CREATE VERTEX encoding3 (PRIMARY_ID id STRING, score FLOAT default -1.0, age STRING default "Unknown", length INT default -1)
CREATE UNDIRECTED EDGE encoding3_edge (FROM encoding3, TO encoding3)
CREATE GRAPH encoding_graph (*)

CREATE LOADING JOB json_flatten_array FOR GRAPH encoding_graph {
  LOAD "encoding3.json"
    TO TEMP_TABLE t3 (name, score, prop_age, indent_length) VALUES (flatten_json_array($"plug-ins", $"lang", $"score", $"prop":"age"), $"indent":"length")
    USING JSON_FILE="true";
  LOAD TEMP_TABLE t3
    TO VERTEX encoding3 VALUES ($"name", $"score", $"prop_age", $"indent_length");
}
RUN JOB json_flatten_array

When splitting a JSON array of JSON objects, the primitive values are skipped and only JSON objects are processed. As in the example above, the 4th line's "plug-ins" field will not generate any record because its "plug-ins" array doesn't contain any JSON object. Any field which does not exist in the object will be loaded with default value. The above example generates the temporary table shown below:

id score age length
"golang" default "noidea" default
"pascal" 1.0 "old" 12
"c++" 2.0 default 12
"java" 2.22 default 30
"python" 3.0 default 30
"go" 4.0 "new" 30

Flatten a JSON Array in a CSV file

flatten_json_array() can also be used to split a column of a tabular file, where the column contains JSON arrays. An example is given below:

encoding.csv

encoding.csv
golang|{"prop":{"age":"noidea"}} pascal|{"score":"1.0", "prop":{"age":"old"}} c++|{"score":2.0, "indent":{"length":12, "use_space": true}} java|{"score":2.22, "prop":{"age":"new"}, "indent":{"use_space":"true", "length":2}} python|{ "prop":{"compiled":"false"}, "indent":{"length":4}, "score":3.0} go|{"score":4.0, "prop":{"age":"new"}}


The second column in the csv file is a JSON array which we want to split. flatten_json_array() can be used in this case without  the USING JSON_FILE="true" clause:

json_flatten_cvs.gsql
CREATE VERTEX encoding3 (PRIMARY_ID id STRING, score FLOAT default -1.0, age STRING default "Unknown", length INT default -1)
CREATE UNDIRECTED EDGE encoding3_edge (FROM encoding3, TO encoding3)
CREATE GRAPH encoding_graph (*)

CREATE LOADING JOB json_flatten_cvs FOR GRAPH encoding_graph {
  LOAD "encoding.csv"
    TO TEMP_TABLE t4 (name, score, prop_age, indent_length) VALUES ($0, flatten_json_array($1, $"score", $"prop":"age", $"indent":"length"))
    USING SEPARATOR="|";
  LOAD TEMP_TABLE t4
    TO VERTEX encoding3 VALUES ($"name", $"score", $"prop_age", $"indent_length");
}
RUN JOB json_flatten_cvs

The above example generates the temporary table shown below:

id score age length
golang -1 (default) noidea -1 (default)
pascal 1 old -1 (default)
c++ 2 unknown (default) 12
java 2.22 new 2
python 3 unknown (default) 4
go 4 new -1 (default)


flatten_json_array in csv

flatten_json_array() does not work if the separator also appears within the JSON array column. For example, if the separator is a comma, the csv loader will erroneously divide the JSON array into multiple columns. Therefore, it is recommended that the csv file use a special column separator, such as "|" in the above example.

DELETE statement

In addition to loading data, an ONLINE_POST JOB can be used to perform the opposite operation: deleting vertices and edges, using the DELETE statement. DELETE cannot be used in offline loading. Just as a LOAD statement uses the tokens from each input line to set the id and attribute values of a vertex or edge to be created, a DELETE statement uses the tokens from each input line to specify the id value of the item(s) to be deleted.

There are four variations of the online_post DELETE statement. The syntax of the four cases is shown below.

DELETE VERTEX | EDGE Syntax
CREATE ONLINE_POST JOB abc FOR GRAPH graph_name {
  # 1. Delete each vertex which has the given vertex type and primary id.
  DELETE VERTEX vertex_type_name (PRIMARY_ID id_expr) [WHERE condition];

  # 2. Delete each edge which has the given edge type, source vertex id, and destination vertex id.
  DELETE EDGE edge_type_name (FROM id_expr, TO id_expr) [WHERE condition];

  # 3. Delete all edges which have the given edge type and source vertex id. (Destination vertex id is left open.)
  DELETE EDGE edge_type_name (FROM id_expr) [WHERE condition];

  # 4. Delete all edges which have the given source vertex id. (Edge type and destination vertex id are left open.)
  DELETE EDGE * (FROM id_expr vertex_type_name) [WHERE condition];
}

An example using book_rating data is shown below:

DELETE example
# Delete all user_occupation edges if the user is in the new file, then load the new file
CREATE ONLINE_POST JOB clean_user_occupation FOR GRAPH Book_rating {
  DELETE EDGE user_occupation (FROM $0);
}
CREATE ONLINE_POST JOB load_user_occupation FOR GRAPH Book_rating {
  LOAD TO EDGE user_occupation VALUES ($0, $1);
}
RUN JOB clean_user_occupation USING FILENAME="./data/occupation_updated.dat", SEPARATOR=",", EOL="\n"
RUN JOB load_user_occupation USING FILENAME="./data/occupation_updated.dat", SEPARATOR=",", EOL="\n"

There is a separate DELETE statement in the GSQL Query Language. The query delete statement can leverage the query language's ability to explore the graph and to use complex conditions to determine which items to delete. In contrast, the ONLINE_POST delete statement requires that the id values of the items to be deleted must be specified in advance in an input file.

offline2online Job Conversion

offline2online <offline_job_name>

The gsql command offline2online converts an installed offline loading job to an equivalent online loading job or set of jobs.

Online Job Names

An offline loading job contains one or more LOAD statements, each one specifying the name of an input data file. The offline2online command will convert each LOAD statement into a separate online loading job. The data filename will be appended to the offline job name to create the new online job name. For example, if the offline job has this format:

CREATE LOADING JOB loadEx FOR GRAPH graphEx {
  LOAD "fileA" TO ...
  LOAD "fileB" TO ...
}

then running the GSQL command offline2online loadEx will create two new online loading jobs, called loadEx_fileA and loadEx_fileB . The converted loading jobs are installed in the GSQL system; they are not available as text files. However, if there are already jobs with these names, then a version number will be appended: first "_1", then "_2", etc.

For example, if you were to execute offline2online loadEx three times, this would generate the following online jobs:

  • 1st time:  loadEx_fileA, loadEx_fileB
  • 2nd time: loadEx_fileA_1, loadEx_fileB_1
  • 3rd time:  loadEx_fileA_2, loadEx_fileB_2

Conversion and RUN JOB Details

Some parameters of a loading job which are built into offline loading jobs cannot be included in online jobs:

  • input data filename
  • SEPARATOR
  • HEADER

Instead, they should be provided when running the loading job. However, online jobs do not have full support for HEADER.

When running any online loading job, the input data filename and the separator character must be provided.  See sections on the USING clause and Running a Loading Job for more details.

If an online loading job is run with the HEADER="true" option, it will skip the first line in the data file, but it will not read that line to get the column names.  Therefore, offline jobs which read and use column header names must be manually converted to online jobs.

The following example is taken from the Social Network case in the GSQL Tutorial with Real-Life Examples . In version 0.2 of the tutorial, we used offline loading. The job below uses the same syntax as v0.2, but some names have been updated:

Offline loading example, based on social_load.gsql, version 0.2
CREATE LOADING JOB load_social FOR GRAPH gsql_demo {
  LOAD "data/social_users.csv"
    TO VERTEX SocialUser VALUES ($0,$1,$2,$3)
    USING QUOTE="double", SEPARATOR=",", HEADER="true";
  LOAD "data/social_connection.csv"
    TO EDGE SocialConn VALUES ($0, $1)
    USING SEPARATOR=",", HEADER="false";
}

To run this job:

RUN JOB load_social

Note that the first LOAD statement has HEADER="true", but it does not make use of column names. It simply uses column indices $0, $1, $2, and $3. Therefore, the HEADER option can still be used with the converted job. Running offline2online load_social creates two new jobs, called load_social_social_users.csv and load_social_social_connection.csv.

The equivalent run commands for the jobs are the following:

RUN JOB load_social_social_users.csv USING FILENAME="data/social_users.csv", SEPARATOR=",", EOL="\n", HEADER="true"
RUN JOB load_social_social_connection.csv USING FILENAME="data/social_connection.csv", SEPARATOR=",", EOL="\n"


For comparison, here is the online loading job in the current version of the Tutorial and its loading commands:

social_load.gsql, version 0.8.1
CREATE ONLINE_POST JOB load_social1 FOR GRAPH gsql_demo {
  LOAD TO VERTEX SocialUser VALUES ($0,$1,$2,$3) USING QUOTE="double";
}
CREATE ONLINE_POST JOB load_social2 FOR GRAPH gsql_demo {
  LOAD TO EDGE SocialConn VALUES ($0, $1);
}

# load the data
RUN JOB load_social1 USING FILENAME="../social/data/social_users.csv", SEPARATOR=",", EOL="\n", HEADER="true"
RUN JOB load_social2 USING FILENAME="../social/data/social_connection.csv", SEPARATOR=",", EOL="\n"






Running a Loading Job

Contents of this Section

Clearing and Initializing the Graph Store

There are two aspects to clearing the system: flushing the data and clearing the schema definitions in the catalog. Two different commands are available.

CLEAR GRAPH STORE

To flush all the data out of the graph store (database), use the CLEAR GRAPH STORE command. By default, the system will ask the user to confirm the intention to discard all the graph data. To force the clear operation and bypass the confirmation question, use the -HARD option, e.g.,

CLEAR GRAPH STORE -HARD

Clearing the graph store does not affect the schema.

  1. Use the -HARD option with extreme caution. There is no undo option. -HARD must be in all capital letters.
  2. CLEAR GRAPH STORE stops all the TigerGraph servers (GPE, GSE, RESTPP, Kafka, and Zookeeper).

DROP ALL

The DROP ALL statement clears the graph store and removes all definitions from the catalog: vertex types, edge types, graph types, jobs, and queries.


Discontinued feature

The INIT GRAPH command has been removed. Starting with v0.8, the graph store will automatically be initialized after it is cleared.

Running a Loading Job

Running a loading job executes a previously installed loading job.  The job reads lines from an input source, parses each line into data tokens, and applies loading rules and conditions to create new vertex and edge instances to store in the graph data store. Just as there are two versions of loading jobs, online and offline, there are two corresponding versions of the RUN JOB statement. Additionally, loading jobs can also be run by directly submitting an HTTP request to the REST++ server.

RUN JOB for Online Loading

RUN JOB syntax for online loading
RUN JOB [-n [ first_line_num,] last_line_num] job_name USING FILENAME="myFile", SEPARATOR="schar", EOL="echar"

Note that FILENAME and SEPARATOR are required fields in online loading.

Pre- and Post- Conditions for Online Loading (New for v0.8)

Pre-Conditions:

  • The graph schema must be defined.
  • The TigerGraph dictionary server must be on. All other services will be turned on automatically when needed.

Post-Condition:

  • All TigerGraph services will be on.


RUN JOB example for online loading
GSQL > RUN JOB load_cf USING FILENAME="data/cf_data.csv", SEPARATOR=",", EOL="\n"


RUN JOB for Offline Loading (Deprecated)

RUN JOB syntax for offline loading
RUN JOB [-dryrun] [-n [ first_line_num,] last_line_num] job_name

In offline loading jobs, the data filename, separator character, and EOL character are embedded in the job itself, so they are not needed when executing RUN JOB:

RUN JOB example for offline loading
GSQL > RUN JOB load_cf_offline


Pre- and Post- Conditions for Offline Loading (New for v0.8)

As of v0.8, offline loading is emulated with online loading. Consequently, the pre- and post-conditions for offline loading are the same as those for online loading.

RUN JOB Options

The -dryrun option is only for offline loading. It reads input files and processes data as instructed by the job, but it prevents data from actually being stored in the graph store. This option can be a useful diagnostic tool.

The -n option limits the loading job to processing only a range of lines of each input data file. The -n flag accepts one or two arguments. For example, -n 50 means read lines 1 to 50.
-n 10, 50 means read lines 10 to 50.  The special symbol $ is interpreted as "last line", so -n 10,$ means read from line 10 to the end.
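For instance, the command below (the file name is illustrative) loads only the first 1000 lines of the input file:

Hypothetical RUN JOB with the -n option
RUN JOB -n 1000 load_cf USING FILENAME="data/cf_data.csv", SEPARATOR=",", EOL="\n"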

The USING fields have the same meaning as presented previously for the Parsing Conditions.

Running Loading Jobs directly as REST++ Requests

The loading time can be reduced by directly submitting an HTTP request to the POST /ddl endpoint of the REST++ server. The Linux curl command is a handy way to make HTTP requests. This can be used for either online or offline loading jobs.

Curl/REST++ syntax for loading using the POST /ddl endpoint
curl -X POST --data-binary @<data_filename> "http://<server_ip>:9000/ddl?tag=<job_name><&optional_parameters>"

If there is a small amount of data, it can be included directly in the command line:

Curl/REST++ syntax for loading using the POST /ddl endpoint
curl -X POST -d "<data_string>" "http://<server_ip>:9000/ddl?tag=<job_name><&optional_parameters>"

For more information about sending REST++ requests, see the RESTPP API User Guide v1.0.

Example: The code block below shows three equivalent commands for the same loading job. The first uses the gsql command RUN JOB. The second uses the Linux curl command to submit an HTTP request, placing the parameter values in the URL's query string. The third gives the parameter values through the curl command's data payload -d option.

REST++ ddl loading examples
# Case 1: Using GSQL
GSQL RUN JOB load_cf USING FILENAME="../cf/data/cf_data.csv", SEPARATOR=",", EOL="\n"

# Case 2: Using REST++ Request with data in a file
curl -X POST --data-binary @data/cf_data.csv "http://localhost:9000/ddl?tag=load_cf&sep=,&eol=\n"

# Case 3: Using REST++ Request with data inline
curl -X POST -d "id2,id1\nid2,id3\nid3,id1\nid3,id4\nid5,id1\nid5,id2\nid5,id4" "http://localhost:9000/ddl?tag=load_cf&sep=,&eol=\n"


Verifying and Debugging a Loading Job

A report file, load_output.log, is generated when a loading job finishes. Relative to the gsql root directory, the log file is located at dev/gdk/gsql/output/load_output.log.

This report file records how many objects of each type were created and how many lines were invalid, grouped by the reason. The report also shows which lines caused the errors. The statistics in the report are of two types: file-level statistics (counting lines) and object-level statistics (counting objects). If a file-level error occurs, e.g., a line does not have enough columns, that line of data is skipped for all LOAD statements in the loading job. If an object-level error or a failed condition occurs, only the corresponding object is not created; all other objects in the same loading job are still created, provided they have no object-level error or failed condition of their own.

File level statistics Explanation
Valid lines
The number of valid lines in the source file
Reject lines
The number of lines which are rejected by reject_line_rules
Invalid Json format
The number of lines with invalid JSON format
Not enough token
The number of lines with missing column(s)
Oversize token
The number of lines with oversize token(s). Please increase "OutputTokenBufferSize" in the tigergraph/dev/gdk/gsql/config file.


Object level statistics Explanation
Valid Object
The number of objects which have been loaded successfully
No ID found
The number of objects in which PRIMARY_ID is empty
Invalid Attributes
The number of invalid objects caused by wrong data format for the attribute type
Invalid primary id
The number of invalid objects caused by wrong data format for the PRIMARY_ID type
Incorrect fixed binary length
The number of invalid objects caused by a mismatch between the length of the data and the type defined in the schema
Passed condition lines
If there is a WHERE clause, the number of objects which pass the condition
Failed condition lines
If there is a WHERE clause, the number of objects which fail the condition

Note that failing a WHERE clause is not necessarily a bad result.  If the user's intent for the WHERE clause is to select only certain lines, then it is natural for some lines to pass and some lines to fail.


Below is an example.

CREATE VERTEX movie (PRIMARY_ID id UINT, title STRING, country STRING COMPRESS, year UINT)
CREATE DIRECTED EDGE sequel_of (FROM movie, TO movie)
CREATE GRAPH movie_graph (*)

CREATE ONLINE_POST JOB load_movie FOR GRAPH movie_graph {
  LOAD TO VERTEX movie VALUES ($0, $1, $2, $3) WHERE to_int($3) < 2000;
}
RUN JOB load_movie USING FILENAME="movie.dat", SEPARATOR=",", EOL="\n"


movie.dat
0,abc,USA,-1990
1,abc,CHN,1990
2,abc,CHN,1990
3,abc,FRA,2015
4,abc,FRA,2005
5,abc,USA,1990
6,abc,1990

The above loading job and data generate the following report

load_output.log (tail)
--------------------Statistics------------------------------
Valid lines: 6
Reject lines: 0
Invalid Json format: 0
Not enough token: 1 [ERROR] (e.g. 7)
Oversize token: 0

Vertex: movie
Valid Object: 3
No ID found: 0
Invalid Attributes: 1 [ERROR] (e.g. 1:year)
Invalid primary id: 0
Incorrect fixed binary length: 0
Passed condition lines: 4
Failed condition lines: 2 (e.g. 4,5)


There are a total of 7 data lines. The report shows that

  • Six of the lines are valid data lines
  • One line (Line 7) does not have enough tokens.

Of the 6 valid lines,

  • Three of the 6 valid lines generate valid movie vertices.
  • One line has an invalid attribute  (Line 1: year)
  • Two lines (Lines 4 and 5) do not pass the WHERE clause.




Appendix A - DDL Keywords and Reserved Words

The following words are reserved for use by the Data Definition Language.  That is, a graph schema or loading job may not use any of these words for a user-defined identifier, for the name of a vertex type, edge type, graph, or attribute.
The compiler will reject the use of a Reserved Word as a user-defined identifier.

ABORT ACCESS ADD ADMIN AFTER ALL ALLOCATE ALTER ANALYZE AND ANY ARCHIVE ARE ARRANGE ARRAY AS ASC ASENSITIVE ASYMMETRIC AT ATOMIC ATTRIBUTE AUTHORIZATION AV AVG BAG BASIC BEFORE BEGIN BETWEEN BIGINT BINARY BINSTORAGE BLOB BOOL BOOLEAN BOTH BUCKET BUCKETS BY BYTEARRAY CACHE CALL CALLED CASCADE CASCADED CASE CAST CAT CD CHANGE CHAR CHARACTER CHARARRAY CHECK CLEAR CLOB CLOSE CLUSTER CLUSTERED CLUSTERSTATUS COGROUP COLLATE COLLECTION COLUMN COLUMNS COMMENT COMMIT COMPACT COMPACTIONS COMPRESS COMPUTE CONCAT CONCATENATE CONDITION CONF CONNECT CONST CONSTRAINT CONTINUE COPYFROMLOCAL COPYTOLOCAL CORRESPONDING COUNT CP CREATE CROSS CUBE CURRENT CURRENT_DATE CURRENT_PATH CURRENT_ROLE CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE DATA DATABASE DATABASES DATE DATETIME DAY DBPROPERTIES DD DEALLOCATE DEC DECIMAL DECLARE DECRYPT DEFAULT DEFERRED DEFINE DEFINED DELETE DELIMITED DEPENDENCY DEREF DESC DESCRIBE DETERMINISTIC DIFF DIRECTED DIRECTORIES DIRECTORY DISABLE DISCONNECT DISTINCT DISTRIBUTE DM DO DOUBLE DROP DRYRUN DU DUMP DYNAMIC EACH EDGE ELEMENT ELEM_TYPE ELSE ELSEIF EMPTY ENABLE END EOL ESCAPE ESCAPED EVAL EXCEPT EXCHANGE EXCLUSIVE EXEC EXECUTE EXISTS EXIT EXPLAIN EXPORT EXTENDED EXTERN EXTERNAL FALSE FETCH FIELDS FILE FILEFORMAT FILTER FIRST FIXED_BINARY FLATTEN FLATTEN_JSON_ARRAY FLOAT FOLLOWING FOR FOREACH FOREIGN FORMAT FORMATTED FREE FROM FULL FUNCTION FUNCTIONS GENERATE GET GLOBAL GPATH GPATH_QUERY GQL GQUERY GRANT GRAPH GRAPHSQL GROUP GROUPING GSHELL HANDLER HARD HASH_PARTITION HAVING HEADER HELP HOLD HOLD_DDLTIME HOST_GRAPH HOUR ICON IDENTIFIED IDENTITY IDXPROPERTIES IF IGNORE IGNORE_IF_EXISTED IGNORE_IF_EXISTS ILLUSTRATE IMMEDIATE IMPORT IN INCREMENTAL INDEX INDEXES INDICATOR INIT INNER INOUT INPATH INPUT INPUTDRIVER INPUTFORMAT INPUT_LINE_FILTER INSENSITIVE INSERT INSTALL INT INT16 INT32 INT32_T INT64_T INT8 INTEGER INTERSECT INTERVAL INTO INT_LIST INT_SET IS ITEMS ITERATE JAR JOB JOIN JSON KEY KEYS KEY_TYPE KILL LANGUAGE LARGE LATERAL LEADING LEAVE LEFT LESS LIKE LIMIT LINES LOAD LOADING LOCAL LOCALTIME LOCALTIMESTAMP LOCATION LOCK LOCKS LOGICAL LONG LOOP LS MACRO MAP MAPJOIN MATCH MATCHES MATERIALIZED MAX MERGE METHOD MIN MINUS MINUTE MKDIR MODIFIES MODULE MONTH MSCK MULTISET MV NATIONAL NATURAL NCHAR NCLOB NEW NO NONE NOSCAN NOT NO_DROP NULL NUMERIC OF OFFLINE OLD ON ONLINE_POST ONLY ONSCHEMA OPEN OPTIMIZE OPTION OR ORDER OUT OUTER OUTPUT OUTPUTDRIVER OUTPUTFORMAT OVER OVERLAPS OVERWRITE OWNER PARALLEL PARAMETER PARTIALSCAN PARTITION PARTITIONED PARTITIONS PERCENT PIG PIGDUMP PIGSTORAGE PLUS PRECEDING PRECISION PREPARE PRESERVE PRETTY PRIMARY PRIMARY_ID PRINCIPALS PROCEDURE PROTECTION PURGE PWD QUERY QUIT QUOTE RANGE RANGE_PARTITION READ READONLY READS REAL REBUILD RECORDREADER RECORDWRITER RECURSIVE REDUCE REF REFERENCES REFERENCING REFRESH REGEXP REGISTER RELEASE RENAME REPAIR REPEAT REPLACE RESIGNAL RESTRICT RESULT RETURN RETURNS REVERSE_EDGE REVOKE REWRITE RIGHT RLIKE RM RMF ROLE ROLES ROLLBACK ROLLUP ROW ROWS RUN SAMPLE SAVEPOINT SCHEMA SCHEMAS SCHEMA_CHANGE SCOPE SCROLL SEARCH SECOND SECONDARY_ID SELECT SEMI SENSITIVE SEPARATOR SERDE SERDEPROPERTIES SERVER SESSION_USER SET SETS SHARED SHIP SHOW SHOW_DATABASE SIGNAL SIMILAR SIZE SKEWED SMALLINT SOME SORT SORTED SPECIFIC SPECIFICTYPE SPLIT SQL SQLEXCEPTION SQLSTATE SQLWARNING SSL START START_ID STATIC STATISTICS STATS STDERR STDIN STDOUT STORE STORED STREAM STREAMTABLE STRING STRING_LIST STRING_SET STRUCT SUBMULTISET SUM SYMMETRIC SYSTEM SYSTEM_USER TABLE TABLES TABLESAMPLE TBLPROPERTIES TEMPORARY TEMP_TABLE TERMINATED 
TEXTLOADER THEN THROUGH TIME TIMESTAMP TIMEZONE_HOUR TIMEZONE_MINUTE TINYINT TO TOKENIZE TOKEN_LEN TOUCH TO_FLOAT TO_INT TRAILING TRANSACTIONS TRANSFORM TRANSLATION TREAT TRIGGER TRUE TRUNCATE TUPLE TYPE TYPEDEF UDF_PARTITION UINT UINT16 UINT32 UINT32_T UINT64_T UINT8 UINT_SET UNARCHIVE UNBOUNDED UNDIRECTED UNDO UNION UNIONTYPE UNIQUE UNIQUEJOIN UNKNOWN UNLOCK UNNEST UNSET UNSIGNED UNTIL UPDATE UPSERT URI USE USING UTC UTCTIMESTAMP VAL VALUE VALUES VALUE_TYPE VARCHAR VARYING VECTOR VERSION VERTEX VIEW VOID WHEN WHENEVER WHERE WHILE WINDOW WITH WITHIN WITHOUT YEAR CURRENT_DEFAULT_TRANSFORM_GROUP CURRENT_TRANSFORM_GROUP_FOR_TYPE INT32_INT32_KV_LIST UINT32_UDT_KV_LIST UINT32_UINT32_KV_LIST


Appendix B - GSQL Start-to-End Process and Data Flow

The figure below illustrates the sequence of steps and the dependencies to progress from no graph to a loaded graph and a query result, for TigerGraph platform version 0.8 and higher. Note that online and offline loading follow the same flow.


Figure B1: Complete GSQL Workflow