In this exercise, we will go through the three-step process of writing GSQL: define a schema, load data, and write a query.
Table of Contents
- 1 Get Set
- 2 Define a Schema
- 3 Load Data
- 4 Query Using Built-In SELECT Queries
- 5 Query Using A Parameterized GSQL Query
- 6 Review: What You've Learned
This tutorial is written so that you can follow along and perform the steps on your TigerGraph system as you read.
In this tutorial, we will show you how to create a graph schema, load data in your graph, write simple parameterized queries, and run your queries. Before you start, you need to have installed the TigerGraph system, verified that it is working, and cleared out any previous data. It'll also help to become familiar with our graph terminology.
What is a Graph?
A graph is a collection of data entities and the connections between them. That is, it's a network of data entities.
Many people call a data entity a node; at TigerGraph we call it a vertex (plural: vertices). We call a connection an edge. Both vertices and edges can have properties or attributes. The figure below is a visual representation of a graph containing 7 vertices (shown as circles) and 7 edges (the lines).
Figure 1 . Friendship Social Graph
A graph schema is the model which describes the types of vertices (nodes) and edges (connections) which can appear in your graph. The graph above has one type of vertex (person) and one type of edge (friendship).
A schema diagram looks like a small graph, except each node represents one type of vertex, and each link represents one type of edge.
Figure 2 . Friendship Social Graph Schema
The friendship loop shows that a friendship is between a person and another person.
For this tutorial, we will create and query the simple friendship social graph shown in Figure 1. The data for this graph consists of two files in csv (comma-separated values) format. To follow along with this tutorial, please save these two files, person.csv and friendship.csv, to your TigerGraph local disk. In our running example, we use the /home/tigergraph/ folder to store the two csv files.
Prepare Your TigerGraph Environment
First, let's check that you can access GSQL.
- Open a Linux shell.
Type gsql as below. A GSQL shell prompt should appear as below.
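A minimal session looks something like this sketch (the exact banner text varies by version):

```
$ gsql
Welcome to TigerGraph.
GSQL >
```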
If the GSQL shell does not launch, try restarting the system with "gadmin restart all". If you need further help, please see the TigerGraph Knowledge Base and FAQs.
If this is your first time using GSQL, the TigerGraph data store is probably empty. However, if you or someone else has already been working on the system, there may already be a database. You can check by listing out the database catalog with the "ls" command. This is what the catalog should look like if it is empty:
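On a fresh system, "ls" reports empty categories, along these lines (the exact headings vary by version):

```
GSQL > ls
---- Graph
Vertex Types:
Edge Types:

Graphs:
Jobs:
Queries:
```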
If the data catalog is not empty, you will need to empty it to start this tutorial. We'll assume you have your coworkers' permission. Use the command DROP ALL to delete all the database data, its schema, and all related definitions. This command takes about a minute to run.
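The session looks roughly like this (the exact progress messages vary by version):

```
GSQL > DROP ALL
Dropping all, about 1 minute ...
Everything is dropped.
```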
If you need to restart TigerGraph for any reason, use the following command sequence:
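A typical stop-then-start sequence from a Linux shell is sketched below (exact subcommands and flags may differ across versions):

```shell
$ gadmin stop all
$ gadmin start all
```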
Tip: Running GSQL commands from Linux
You can also run GSQL commands from a Linux shell. To run a single command, just use "gsql" followed by the command line enclosed in single quotes. (The quotes aren't necessary if there is no parsing ambiguity; it's safer to just use them.) For example,
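For instance, to list the catalog with a single quoted command:

```shell
$ gsql 'ls'
```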
You can also execute a series of commands which you have stored in a file, by simply invoking "gsql" followed by the name of the file.
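For example, if your commands are saved in a (hypothetical) file named schema_setup.gsql:

```shell
$ gsql schema_setup.gsql
```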
When you are done, you can exit the GSQL shell with the command "quit" (without the quotes).
Define a Schema
For this tutorial, we will work mostly in the GSQL shell, in interactive mode. A few commands will be run from a Linux shell. The first step in creating a GSQL graph is to define its schema. GSQL provides a set of DDL (Data Definition Language) commands, similar to SQL DDL commands, to model vertex types, edge types, and a graph.
Create a Vertex Type
Use the CREATE VERTEX command to define a vertex type named person. Here, PRIMARY_ID is required: each person must have a unique identifier. The rest is the optional list of attributes which characterize each person vertex, in the format attribute_name data_type, attribute_name data_type, ...
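A definition consistent with this tutorial might look like the sketch below; the attribute list is our reconstruction from the person.csv columns discussed later (name, gender, age), with state as an additional illustrative attribute:

```gsql
CREATE VERTEX person (PRIMARY_ID name STRING, name STRING, age INT, gender STRING, state STRING)
```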
We show GSQL keywords in ALL CAPS to highlight them, but they are case-insensitive.
GSQL will confirm the creation of the vertex type.
You can create as many vertex types as you need.
Create an Edge Type
Next, use the CREATE ... EDGE command to create an edge type named friendship. The keyword UNDIRECTED indicates this edge is a bidirectional edge, meaning that information can flow starting from either vertex. If you'd rather have a unidirectional connection where information flows only from the FROM vertex, use the DIRECTED keyword in place of UNDIRECTED. Here, FROM and TO are required to specify which two vertex types the edge type connects. An individual edge is specified by giving the primary_ids of its source (FROM) vertex and target (TO) vertex. These are followed by an optional list of attributes, just as in the vertex definition.
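A definition along these lines fits the friendship graph; the connect_day attribute is an illustrative example of an optional edge attribute, not something required by the schema:

```gsql
CREATE UNDIRECTED EDGE friendship (FROM person, TO person, connect_day DATETIME)
```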
GSQL will confirm the creation of the edge type.
You can create as many edge types as you need.
Create a Graph
Next, use the CREATE GRAPH command to create a graph named social. Here, we just list the vertex types and edge types that we want to include in this graph.
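With the person vertex type and friendship edge type defined above, the graph definition is simply:

```gsql
CREATE GRAPH social (person, friendship)
```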
GSQL will confirm the creation of the first graph after several seconds, during which it pushes the catalog information to all services, such as the GSE, GPE and RESTPP.
At this point, we have created a vertex type, an edge type, and a graph that includes them. You've now built your first graph schema! Let's take a look at what's in the catalog by typing the "ls" command in the GSQL shell.
Load Data
After creating a graph schema, the next step is to load data into it. The task here is to instruct the GSQL loader how to associate ("map") the fields in a set of data files to the attributes of the vertex types and edge types in the graph schema we just defined.
You should have the two data files person.csv and friendship.csv on your local disk. They don't need to be in any particular folder, as long as you know their file paths.
If you need to exit the GSQL shell for any reason, you can do so by typing "quit" without the quotes. Type gsql to enter again.
Define a Loading Job
The loading job below assumes that your data files are in the folder /home/tigergraph. If they are elsewhere, then in the loading job script below replace "/home/tigergraph/person.csv" and "/home/tigergraph/friendship.csv" with their corresponding file path respectively. Assuming you're (back) in the GSQL shell, enter the following set of commands.
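A loading job along these lines fits the schema sketched earlier; the job name load_social and filename variables file1 and file2 are our choices, and the attribute mapping assumes the person and friendship definitions shown above (person.csv mapped by column name, friendship.csv by column position):

```gsql
USE GRAPH social
BEGIN
CREATE LOADING JOB load_social FOR GRAPH social {
   DEFINE FILENAME file1="/home/tigergraph/person.csv";
   DEFINE FILENAME file2="/home/tigergraph/friendship.csv";

   LOAD file1 TO VERTEX person VALUES ($"name", $"name", $"age", $"gender", $"state")
       USING header="true", separator=",";

   LOAD file2 TO EDGE friendship VALUES ($0, $1, $2)
       USING header="true", separator=",";
}
END
```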
Let's walk through the commands:
USE GRAPH social :
Tells GSQL which graph you want to work with.
BEGIN ... END :
Indicates multi-line mode. The GSQL shell will treat everything between these markers as a single statement. They are only needed in interactive mode. If you run GSQL statements that are stored in a command file, the command interpreter reads your whole file, so it doesn't need the BEGIN and END hints.
CREATE LOADING JOB:
One loading job can describe the mappings from multiple files to multiple graph objects. Each file must be assigned to a filename variable. The field labels can be either by name or by position. By-name labelling requires a header line in the source file. By-position labelling uses integers to indicate source column position 0, 1,... In the example above, the first LOAD statement refers to the source file columns by name, whereas the second LOAD statement refers to the source file columns by position. Note the following details:
- The column "name" in file1 gets mapped to two fields, both the PRIMARY_ID and the "name" attribute of the person vertex.
- In file1, gender comes before age. In the person vertex, gender comes after age. When loading, state your attributes in the order needed by the target object (in this case, the person vertex).
- Each LOAD statement has a USING clause. Here it tells GSQL that both files contain a header (whether we choose to use the names or not, GSQL still needs to know whether to consider the first line as data or not). It also says the column separator is comma. GSQL can handle any single-character separator, not just commas.
When you run the CREATE LOADING JOB statement, GSQL checks for syntax errors and checks that you have data files in the locations specified. If it detects no errors, it compiles and saves your job.
Run a Loading Job
You can now run your loading job to load data into your graph:
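Assuming the job was named load_social as in the sketch earlier, the command is:

```gsql
RUN LOADING JOB load_social
```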
The result is shown below.
Notice the location of the loading log file. The example assumes that you installed TigerGraph in the default location, /home/tigergraph/. In your installation folder is the main product folder, tigergraph. Within the tigergraph folder are several subfolders, such as logs, document, config, bin, and gstore. If you installed in a different location, say /usr/local/, then you would find the product folder at /usr/local/tigergraph.
Query Using Built-In SELECT Queries
You now have a graph with data! You can run some simple built-in queries to inspect the data.
The following GSQL command reports the total number of person vertices. The person.csv data file had 7 lines after the header.
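Counting vertices uses the built-in SELECT syntax (sketch; "Tom"-style data values are not needed here):

```gsql
SELECT count(*) FROM person
```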
Similarly, the following GSQL command reports the total number of friendship edges. The friendship.csv file also had 7 lines after the header.
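The edge count names the source vertex type, edge type, and target vertex type:

```gsql
SELECT count(*) FROM person-(friendship)->person
```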
The results are illustrated below.
Why are there 14 edges? For an undirected edge, GSQL actually creates two edges, one in each direction.
If you want to see the details about a particular set of vertices, you can use "SELECT *" and the WHERE clause to specify a predicate condition. Here are some statements to try:
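Two illustrative examples, assuming the person attributes sketched earlier (the values "Tom" and "female" are just sample data from the friendship graph):

```gsql
SELECT * FROM person WHERE primary_id=="Tom"
SELECT name, gender FROM person WHERE gender=="female"
```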
The result is in JSON format as shown below.
In similar fashion, we can see details about edges. To describe an edge, you name the types of vertices and edges in the three parts, with some added punctuation to represent the traversal direction:
Note that the arrow -> is always used, whether it's an undirected or directed edge. That is because we are describing the direction of the query's traversal (search) through the graph, not the direction of the edge itself.
We can use the from_id predicate in the WHERE clause to select all friendship edges starting from the vertex identified by the "from_id". The keyword ANY indicates that any edge type or any target vertex type is allowed. The following two queries have the same result:
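A sketch of the two equivalent queries, using "Tom" as the sample starting vertex:

```gsql
SELECT * FROM person -(friendship)-> person WHERE from_id == "Tom"
SELECT * FROM person -(ANY)-> ANY WHERE from_id == "Tom"
```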
Restrictions on built-in edge select queries
To prevent queries which might return an excessive number of output items, built-in edge queries have the following restrictions:
- The source vertex type must be specified.
- The from_id condition must be specified.
There is no such restriction for user-defined queries.
The result is shown below.
Another way to check the graph's size is using one of the options of the administrator tool, gadmin. From a Linux shell, enter the command
gadmin status graph -v
Query Using A Parameterized GSQL Query
We just saw how easy and quick it is to run simple built-in queries. However, you'll undoubtedly want to create more customized or complex queries. GSQL puts maximum power in your hands through parameterized vertex set queries. Parameterized queries let you traverse the graph from one vertex set to an adjacent set of vertices, again and again, performing computations along the way, with built-in parallel execution and handy aggregation operations. You can even have one query call another query. But we'll start simple.
A GSQL parameterized query has three steps.
- Define your query in GSQL. This query will be added to the GSQL catalog.
- Install one or more queries in the catalog, generating a REST endpoint for each query.
- Run an installed query, supplying appropriate parameters, either as a GSQL command or by sending an HTTP request to the REST endpoint.
A Simple 1-Hop Query
Now, let's write our first GSQL query. We'll display all the direct (1-hop) neighbors of a person, given as an input parameter.
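A sketch of such a query, consistent with the walkthrough that follows (the query name hello is our choice):

```gsql
CREATE QUERY hello (VERTEX<person> p) FOR GRAPH social {
  Start = {p};
  Result = SELECT tgt
           FROM Start:s -(friendship:e)-> person:tgt;
  PRINT Result;
}
```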
This query features one SELECT statement. SELECT statements in parameterized queries are much more powerful than the built-in ones. The query starts by seeding a vertex set "Start" with the person vertex identified by parameter p passed in from the query call. The curly braces tell GSQL to construct a set containing the enclosed items.
Next, the SELECT statement describes a 1-hop traversal according to the pattern described in the FROM clause:
Start:s -(friendship:e)-> person:tgt
This is basically the same syntax we used for the built-in select edges query. Namely, we select all edges beginning from the given source set (Start), which have the given edge type (friendship) and which end at the given vertex type (person). A feature we haven't seen before is the use of vertex and edge set aliases defined by ":alias": "s" is the alias for the source vertex set, "e" is the edge set alias, and "tgt" is the target vertex set alias.
Refer back to the initial clause and the assignment ("Result = SELECT tgt"). Here we see the target set's alias tgt. This means that the SELECT statement should return the target vertex set (as filtered and processed by the full set of clauses in the SELECT query block) and assign that output set to the variable called Result.
Last, we print out the Result vertex set, in JSON format.
Create A Query
Rather than defining our query in interactive mode, we can store the query in a file and invoke the file from within the GSQL shell, using the @filename syntax. Copy and paste the above query into a file /home/tigergraph/hello.gsql. Then, enter the GSQL shell and invoke the file using @hello.gsql. (Note that if you are not in the /home/tigergraph folder when you start gsql, you can use the absolute path to invoke a gsql file, e.g., @/home/tigergraph/hello.gsql.) Then run the "ls" command to see that the query is now in the catalog.
Install a Query
However, the query is not installed yet; it is not ready to run. In the GSQL shell, type the following command to install the newly added query "hello".
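The install command names the query:

```gsql
INSTALL QUERY hello
```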
It takes about 1 minute for the database to install this new query. Be patient! For queries on large datasets, this small investment pays off many times over in faster query execution, particularly if you will run the query many times, with different parameters. The installation will generate machine instructions and a REST endpoint. After the progress bar reaches 100%, we are ready to run this query.
Run a Query in GSQL
To run a query in GSQL, use "RUN QUERY" followed by the query name and a set of parameter values.
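For example, to find the 1-hop neighbors of Tom:

```gsql
RUN QUERY hello("Tom")
```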
The result is presented in JSON format. Tom has two 1-hop neighbors, namely Dan and Jenny.
Run a Query as a REST Endpoint
Under the hood, installing a query also generates a REST endpoint, so that the parameterized query can be invoked by an http call. In Linux, the curl command is the most popular way to submit an http request. In the example below, the portion that is standard for all queries is shown in bold; the portion in normal weight pertains to this particular query and parameter value. The JSON result will be returned to the Linux shell's standard output. So, our parameterized query becomes an http service!
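A sketch of the http call, assuming the REST server listens on its default port 9000 on the local machine:

```shell
curl -X GET 'http://localhost:9000/query/social/hello?p=Tom'
```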
Finally, to see the GSQL text of a query in the catalog, you can use
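For example:

```gsql
SHOW QUERY hello
```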
Congratulations! At this point, you have gone through the whole process of defining, installing, and running a query.
A More Advanced Query
Now, let's do a more advanced query. This time, we are going to learn to use the powerful built-in accumulators, which serve as runtime attributes (properties) attachable to each vertex visited during our traversal of the graph. Runtime means they exist only while the query is running; they are called accumulators because they are specially designed to gather (accumulate) data during the implicitly parallel processing of the query.
In this query we will find all the persons which are exactly 2 hops away from the parameterized input person. Just for fun, let's also compute the average age of those 2-hop neighbors.
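A query along these lines matches the walkthrough that follows (the query name hello2 and the accumulator names @visited and @@aveAge are naming choices consistent with the discussion below):

```gsql
CREATE QUERY hello2 (VERTEX<person> p) FOR GRAPH social {
  OrAccum  @visited = false;
  AvgAccum @@aveAge;
  Start = {p};

  FirstNeighbors = SELECT tgt
                   FROM Start:s -(friendship:e)-> person:tgt
                   ACCUM tgt.@visited += true, s.@visited += true;

  SecondNeighbors = SELECT tgt
                    FROM FirstNeighbors:s -(:e)-> :tgt
                    WHERE tgt.@visited == false
                    POST-ACCUM @@aveAge += tgt.age;

  PRINT SecondNeighbors;
  PRINT @@aveAge;
}
```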
In the standard approach for this kind of graph traversal algorithm, you use a boolean variable to mark the first time that the algorithm "visits" a vertex, so that it knows not to count it again. To fit this need, we'll define a local accumulator of the type OrAccum. To declare a local accumulator, we prefix an identifier name with a single "@" symbol. Each accumulator type has a default initial value; the default value for boolean accumulators is false. Optionally, you can specify an initial value.
We also want to compute one average, so we will define a global AvgAccum. The identifier for a global accumulator begins with two "@"s.
After defining the Start set, we then have our first 1-hop traversal. The SELECT and FROM clauses are the same as in our first example, but there is an additional ACCUM clause. The += operator within an ACCUM clause means that for each edge matching the FROM clause pattern, we accumulate the right-hand-side expression (true) to the left-hand-side accumulator (tgt.@visited as well as s.@visited). Note that a source vertex or target vertex may be visited multiple times. Referring to Figure 1, if we start at vertex Tom, there are two edges incident to Tom, so the ACCUM clause in the first SELECT statement will visit Tom two times. Since the accumulator type is OrAccum, the cumulative effect of the two traversals is the following:
Tom.@visited <== (initial value: false) OR (true) OR (true)
Note that it does not matter which of the two edges was processed first, so this operation is suitable for multithreaded parallel processing. The net effect is that as long as a vertex is visited at least once, it will end up with @visited = true. The result of this first SELECT statement is assigned to the variable FirstNeighbors.
The second SELECT block will go one hop further, starting from the FirstNeighbors vertex set variable, and reaching the 2-hop neighbors. Note that this time, we have omitted the edge type friendship and the target vertex type person from the FROM clause, but we retained the aliases. If no type is mentioned for an alias, then it is interpreted as ALL types. Since our graph has only one vertex type and one edge type, it is logically the same as if we had specified the types. The WHERE clause filters out the vertices which have been marked as visited before (the 1-hop neighbors and the starting vertex p). This SELECT statement uses POST-ACCUM instead of ACCUM. The reason is that POST-ACCUM iterates over a vertex set instead of an edge set, guaranteeing that we do not double-count any vertices. Here, we accumulate the ages of the 2-hop neighbors to get their average.
Finally, the SecondNeighbors of p are printed out.
This time, we put all of the following GSQL commands into one file hello2.gsql:
- USE GRAPH social
- The query definition
- Installing the query
- Running the query
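The file's structure is a sketch like this (the full hello2 query definition replaces the comment):

```gsql
USE GRAPH social

// the CREATE QUERY hello2 ... definition goes here

INSTALL QUERY hello2
RUN QUERY hello2("Tom")
```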
We can execute this full set of commands without entering the GSQL shell. Please copy and paste the above GSQL commands into a Linux file named /home/tigergraph/hello2.gsql.
In a Linux shell, under /home/tigergraph, type the following:
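Invoking the file from the Linux shell:

```shell
$ gsql hello2.gsql
```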
The result is shown as below.
GSQL Query Summary:
- Queries are installed in the catalog and can have one or more input parameters, enabling reuse of queries.
- A GSQL query consists of a series of SELECT query blocks, each generating a named vertex set.
- Each SELECT query block can start traversing the graph from any of the previously defined vertex sets (that is, the sequence does not have to form a linear chain).
- Accumulators are runtime variables with built-in accumulation operations, for efficient multithreaded computation.
- Output is in JSON format.
Review: What You've Learned
You have learned a lot in GSQL 101!
With just the knowledge from GSQL 101 and a little practice, you should be able to do the following:
- Create a graph schema containing multiple vertex types and edge types.
- Define a loading job that takes one or more CSV files and maps the data directly to the vertices and edges of your graph.
- Write and run simple parameterized queries which start at one vertex and then traverse one or more hops to generate a final vertex set. Make a simple additive computation and return the results.
Want to learn more?
- To learn to do the same types of operations using the GraphStudio Visual SDK and UI, see the TigerGraph GraphStudio UI Guide .
- To see more GSQL examples, see GSQL Demo Examples .
- To get answers to common questions, see the TigerGraph Knowledge Base and FAQs .
To see the full GSQL specification (whose table of contents will give you an idea of what is available), see the GSQL Language Reference.