
TigerGraph Docs : GSQL Tutorial and Demo Examples

Version 1.0
Document updated:

Copyright © 2015-2017 TigerGraph. All Rights Reserved.

For technical support on this topic, contact support@tigergraph.com with a subject line beginning with "Examples"



Using the TigerGraph™ platform is as easy as 1-2-3. In this tutorial we will show you how to use the TigerGraph platform and the GSQL language by developing solutions for several use cases, using the following three-step method:

Step 1: Create a Graph Model for the use case using the GSQL™ language, TigerGraph's high-level graph definition and manipulation language.
Step 2: Load Initial Data: load and transform data into TigerGraph's graph engine.
Step 3: Write a Graph-based Solution by writing queries in the GSQL language.

In addition, this guide will also show you how to update your data: load more data, revise your data, or delete selected data.

Each example involves a data set and a simple example of a real-life query or task.  We develop a graph model, a loading job to load the data, and one or more queries to answer the question at hand. The applications for graph-based queries are limitless. The goal of these examples is to demonstrate the expressive power of GSQL queries, as well as how business intelligence is a natural fit for the graph analytics world.

We assume the user has a working installation of the TigerGraph system. If you have not installed the system, please refer to the TigerGraph Platform Installation Guide v1.0 .

This tutorial uses the console-based GSQL Shell. If you prefer to use the browser-based GraphStudio UI, see the TigerGraph GraphStudio UI Guide first. You can then return to this document to learn more about the language itself.

To start the GSQL Shell, type the command gsql. To exit the shell, type exit or quit. To run a command file from within the shell, precede the file name with "@":

GSQL> @load_demo.gsql

You can also run GSQL commands directly from Linux:

  • For single-line commands, type "gsql" followed by the command enclosed in single-quotation marks:
    $ gsql 'RUN QUERY topCoLiked("id1", 5)'
  • For command files, just type "gsql" followed by the filename:
    $ gsql cf_model.gsql

Common Graph Schema of Demo Examples

The examples in Part 1 of this tutorial have been designed so that all of them can be loaded together into one master graph, gsql_demo. This has several benefits:

  1. You can quickly load several demo examples by running just one script.
  2. After they are loaded, you can switch from one example to another with no delay.
  3. The format is modular, so additional examples can be added easily.

If you want to learn how to design your own graph data analyses, we recommend reading and doing Example 1, then Example 2, etc., rather than running the entire batch of examples at once.


Part 1: Common Applications

Data files for Part 1 examples: DemoExamples_1.0.zip


Example 1. Collaborative Filtering

Here is an observation about social networks: If a set of persons likes me, and many of them also like another person Z, it is probably true that person Z and I have some things in common.  The same observation works for products and services: if a set of customers likes product X, and many of them also like product Z, then products X and Z probably have something in common.  We say X and Z are "co-liked". This observation can be turned around into a search for recommendations: Given a user X, find a set of users Z which are highly co-liked.  For social networks, this can be used for friend recommendation: find a highly co-liked person Z to introduce to X.  For e-commerce, this can be used for purchase recommendation: someone who bought X may also be interested in buying Z. This technique of finding the top co-liked individuals is called collaborative filtering.

A graph analytics approach is a natural fit for collaborative filtering because the original problem is in a graph (social network), and the search criteria can easily be expressed as a path in the graph. We first find all people Y who like user X, then find other users Z who are liked by someone in group Y, and rank members of Z according to how many times they're liked by Y.

Figure 1 below shows a simple graph according to our model.  The circles represent three User vertices with id values id1, id2, and id3. There are two directed edges labeled "Liked" which show that User id2 likes id1, and id2 also likes id3.  (In this model, friendship is directional because in online social networks, one of the two persons initiates the friendship.) There are two more directed edges in the opposite directions labeled "Liked_By". Since id2 likes both id1 and id3, id1 and id3 are co-liked.


Figure 1 - Example graph for the collaborative filtering model

Quick Demo

To just see the basic operation of the TigerGraph system, follow the easy instructions below .  You can then continue to read for the explanation of the command files so you can learn to design your own examples.

Quick Demo Instructions

This example uses the graph below and asks the following query: "Who are the top co-liked persons of id1?"


Step 1: Obtain the data and command files. Create a graph model.

This example uses 4 small files: 3 command files (cf_model.gsql, cf_load.gsql, cf_query.gsql) and one data file (cf_data.csv). Their contents are shown below, so you can either copy from this document or download the files (look in the "cf" subfolder of Examples.zip).

Use DROP ALL to erase existing graph models and old data, and then create your new graph model:

> gsql 'DROP ALL'
> gsql cf_model.gsql
> gsql 'CREATE GRAPH gsql_demo(*)'


Step 2: Load data:

The command below loads our new data.

> gsql cf_load.gsql


Step 3: Install and execute the query:

The file cf_query.gsql creates a query called topCoLiked. Then we install the query. The creation step runs fast, but the installation (compiling) step may take about 1 minute. We then run the query, asking for the top 20 Users who are co-liked with User id1.

> gsql cf_query.gsql
> gsql 'INSTALL QUERY topCoLiked'
> gsql 'RUN QUERY topCoLiked("id1", 20)'

The query results should be the following. Interpretation: id4 has a score (@cnt) of 2, which means there are two persons who like both id1 and id4. Next, id2 and id3 each have a score of 1, meaning one person likes both id1 and that user.

{
  "error": false,
  "message": "",
  "results": [
    {"v_set": "L2", "v_id": "id4", "v": {"@cnt": 2}, "v_type": "User"},
    {"v_set": "L2", "v_id": "id3", "v": {"@cnt": 1}, "v_type": "User"},
    {"v_set": "L2", "v_id": "id2", "v": {"@cnt": 1}, "v_type": "User"}
  ]
}


We now begin a tutorial-style explanation of this TigerGraph example and the workflow in general.

The figure below outlines the steps to progress from an empty graph to a query solution.  Each of the blocks below corresponds to one of the steps in the Quick Demo above. The tutorial below will give you a deeper understanding of each step, so you can learn how it works and so you can design your own graph solutions.

Step 1: Create a graph model.

The first step is to create a model for your data which describes the types of vertices and edges you will have.

The TigerGraph platform handles one graph model at a time (though the user can make the model simple or complex, to handle multiple needs).  To clear an existing model and old data so you can install a new one, run the "DROP ALL" command.

The statements below describe the vertex types and edge types in our Co-Liked model:

CREATE VERTEX User (PRIMARY_ID id string)
CREATE DIRECTED EDGE Liked (FROM User, TO User) WITH REVERSE_EDGE = "Liked_By"

The first CREATE statement creates one vertex type called User. The second statement creates one directed edge type called Liked.  The WITH REVERSE_EDGE clause means that for every two vertices (x,y) connected by a Liked type of edge, the system will automatically generate a corresponding edge of type Liked_By pointing from y to x, and both edges will have the same edge attributes.

After defining all your vertex and edge types, execute the following command to create a graph which binds the vertices and edges into one graph model:

CREATE GRAPH command
CREATE GRAPH gsql_demo(*)

The name of the graph is gsql_demo. Within the parentheses, you can either list the specific vertex and edge types (User, Liked), or you can use *, which means include everything.  We chose to use * so that the same command can be used for all of our examples.

The CREATE commands can be stored in one file and executed together.

cf_model.gsql
CREATE VERTEX User (PRIMARY_ID id string)
CREATE DIRECTED EDGE Liked (FROM User, TO User) WITH REVERSE_EDGE = "Liked_By"
#CREATE GRAPH gsql_demo(*)

The CREATE GRAPH command is commented out for the following reason:

Our examples have been designed to run either as individual graphs or merged together into one multi-application graph.  The CREATE GRAPH command may be run only once, after all the vertex and edge types have been created. (Each of our demo examples uses unique vertex and edge names, to avoid conflicts.) In other words, we run 'CREATE GRAPH gsql_demo(*)' as a separate command after creating all the vertex and edge types. If you decide you want to modify the schema after running CREATE GRAPH, you can create and run a SCHEMA_CHANGE JOB.

  1. To execute these statements (DROP ALL, CREATE VERTEX, etc.), you can type them individually at the GSQL shell prompt, or you can first save them to a file, such as cf_model.gsql , and then run the command file.  From within the shell, you would run
    @cf_model.gsql
    From outside the shell, you would run
    > gsql cf_model.gsql

    Normally a user would put all their CREATE VERTEX, CREATE EDGE, and the final CREATE GRAPH statements in one file. In our example files, we have separated out the CREATE GRAPH statement because we want to merge all our example schemas together into one common graph.

  2. The vertex, edge, and graph types become part of the catalog . To see what is currently in your catalog, type the ls command from within the GSQL shell to see a report as below:

    Catalog contents, as reported by the "ls" command
    Vertex Types:
    - VERTEX User(PRIMARY_ID id STRING) WITH STATS="OUTDEGREE_BY_EDGETYPE"
    Edge Types:
    - directed edge Liked(from User, to User) with reverse_edge="Liked_By"
    - directed edge Liked_By(from User, to User) with reverse_edge="Liked"
    Graphs:
    Jobs:
    Queries:
  3. To remove a definition from the catalog, use some version of the DROP command.  Use the help command to see a summary of available GSQL commands.
  4. In our examples, we typically show keywords in ALL UPPERCASE to distinguish them from user-defined identifiers. Identifiers are case-sensitive but keywords are not.


In this example, the vertices and edges don't have attributes. In general, a TigerGraph graph can have attributes on both vertices and edges, and it can also have different types of edges connecting the same two vertices.  Please see GSQL Language Reference Part 1 - Defining Graphs and Loading Data v1.0 which provides a more complete description of the graph schema definition language with additional examples.

Step 2: Load initial data.

Figure 2 shows a larger graph with five vertices and several edges. To avoid crowding the figure, only the Liked edges are shown: For every Liked edge, there is a corresponding Liked_By edge in the reverse direction.

Figure 2 - Graph for Collaborative Filtering Calculation

The data file below describes the five vertices and seven edges of Figure 2.

cf_data.csv
id2,id1
id2,id3
id3,id1
id3,id4
id5,id1
id5,id2
id5,id4

The loading job below will read from a data file and create vertex and edge instances to put into the graph.

Per Example: Load data into the graph (file: cf_load.gsql)
# define the loading job
CREATE ONLINE_POST JOB load_cf FOR GRAPH gsql_demo {
  LOAD TO VERTEX User VALUES ($0),
       TO VERTEX User VALUES ($1),
       TO EDGE Liked VALUES ($0, $1);
}
# load the data
RUN JOB load_cf USING FILENAME="../cf/data/cf_data.csv", SEPARATOR=",", EOL="\n"

The CREATE ONLINE_POST JOB statement defines a job called load_cf. The job will read each line of the input file, create one vertex based on the value in the first column (referenced with column name $0), another vertex based on the value in the second column ($1), and one Liked edge pointing from the first vertex to the second vertex. In addition, since the Liked edge type definition includes the WITH REVERSE_EDGE clause, a Liked_By edge pointing in the opposite direction is also created.

After the job has been created, we run it. The RUN JOB command line includes details about the data source: the name of the file is cf_data.csv, commas are used to separate columns, and \n is used to end each line. (Data files should not contain any extra spaces before or after the separator character.)

The TigerGraph loader automatically filters out duplicates. If either of the two column values has been seen before, that vertex won't be created again; instead, the existing vertex will be used. For example, if we read the first two data lines in data file cf_data.csv, the first line will generate two User vertices, one Liked edge, and one Liked_By edge. For the second row, however, only one new vertex will be created, since id2 has been seen already. Two edges will still be created for the second row.
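The duplicate-filtering behavior can be illustrated with a small plain-Python sketch (illustrative only, not TigerGraph code; the function name load_rows is made up):

```python
def load_rows(rows, vertices=None, edges=None):
    # Mimic the loader's duplicate filtering: a vertex id seen before is
    # reused, not re-created; each row yields a Liked and a Liked_By edge.
    vertices = set() if vertices is None else vertices
    edges = set() if edges is None else edges
    for row in rows:
        src, dst = row.split(",")
        before = len(vertices)
        vertices.update((src, dst))        # set semantics: no duplicates
        edges.add((src, "Liked", dst))
        edges.add((dst, "Liked_By", src))
        print(f"{row}: {len(vertices) - before} new vertices")
    return vertices, edges

# First two lines of cf_data.csv:
# prints "id2,id1: 2 new vertices" then "id2,id3: 1 new vertices"
v, e = load_rows(["id2,id1", "id2,id3"])
```

The second row reuses the existing id2 vertex but still adds two edges (Liked and Liked_By), matching the loader behavior described above.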

  1. It is okay to run an ONLINE_POST JOB again, or to run a different loading job, to add more data to a graph store which already has some data.  For example, you could do the following:

    RUN JOB load_cf USING FILENAME="../cf/cf_data1.csv", SEPARATOR=",", EOL="\n"
    RUN JOB load_cf USING FILENAME="../cf/cf_data2.tsv", SEPARATOR="\t", EOL="\n"
  2. After loading, you can use the GraphStudio UI to visually inspect your data.  Refer to the TigerGraph GraphStudio UI Guide .

To clear all your data but to keep your graph model, run the "CLEAR GRAPH STORE -HARD" command. -HARD must be in all capital letters.

Be very careful using this command; there is no UNDO command.

For the querying and updating examples in the remainder of this use case, we will assume that Figure 2 has been loaded.

This loading example is basic. The GSQL language can do complex data extraction and transformation, such as dealing with JSON input format and key-value list input, all in high-level syntax.  Please see GSQL Language Reference Part 1 - Defining Graphs and Loading Data v1.0 for more examples.


Step 3: Write a graph-based query solution

The GSQL language includes not only data definition and simple inspection of the data, but also advanced querying which traverses the graph and which supports aggregation and iteration.

Built-In Queries

First , we can run some simple queries to verify that the data were loaded correctly. Below are some examples of some built-in GSQL queries which can be run in GSQL shell:

Simple queries for validation, with their meanings and comments:

SELECT count() FROM User
    Display the number of User vertices.

SELECT count() FROM User-(Liked)->User
    Display the number of directed Liked edges from User type to User type.

SELECT * FROM User LIMIT 3
    Display all id, type, and attribute information for up to 3 User vertices.
    A LIMIT or WHERE condition is required, to prevent the output from being too large.
    Note that there is also a system limit of 10240 vertices or edges returned by SELECT *.

SELECT * FROM User WHERE primary_id=="id2"
    Display all id, type, and attribute information for the User vertex whose primary_id is "id2".
    The keyword primary_id must be lowercase.
    The WHERE clause can also specify non-ID attributes.

SELECT * FROM User-(ANY)->ANY WHERE from_id=="id1"
    Display all id, type, and attribute information about any type of edge which starts from vertex "id1".
    1. The keyword from_id must be lowercase.
    2. To guard against queries which select too many edges, the WHERE clause is mandatory when selecting edges.

SELECT *  displays information in JSON format.  Below is an example of query output.

GSQL > SELECT * FROM User LIMIT 5
[
  {"id": "id2", "v_type": "User", "attributes": {}},
  {"id": "id3", "v_type": "User", "attributes": {}},
  {"id": "id1", "v_type": "User", "attributes": {}},
  {"id": "id4", "v_type": "User", "attributes": {}},
  {"id": "id5", "v_type": "User", "attributes": {}}
]


Create a Query

Now let's solve our original problem: find users who are co-liked with a user X. The following query demonstrates a 2-step traversal with aggregation.

The query below performs the co-liked collaborative filtering search.  The concept behind this query is to describe a "graph path" which represents the relationship between a person (the starting point) and a person that is co-liked (the ending point).  Figure 1 illustrates this path: id3 is a co-liked user of id1, because id2 likes both of them.  The path from id1 to co-liked users is: (1) traverse a Liked_By edge to a User, and then (2) traverse a Liked edge to another User. This query also calculates the magnitude of the relationship between the starting point and each ending point. The more users there are such as id2 which connect id1 and id3, the stronger the co-like relationship between id1 and id3.  Counting the number of paths that end at id3 serves to calculate this magnitude.

cf_query.gsql - Define the collaborative filtering query
CREATE QUERY topCoLiked (vertex<User> input_user, INT topk) FOR GRAPH gsql_demo {
  # @cnt is a runtime attribute to be associated with each User vertex
  # to record how many times a user is liked.
  SumAccum<int> @cnt = 0;
  L0 = {input_user};
  L1 = SELECT tgt FROM L0-(Liked_By)->User:tgt;
  L2 = SELECT tgt FROM L1-(Liked)->:tgt
       WHERE tgt != input_user
       ACCUM tgt.@cnt += 1
       ORDER BY tgt.@cnt DESC
       LIMIT topk;
  PRINT L2;
}

This query is structured like a procedure with two input parameters: an input vertex and a value k for the top-k ranking. The query body contains three statements executed in order. The L0 statement defines our initial list of vertices: a set containing the single user supplied by the input_user parameter. Suppose the input user is id1.

Next, the L1 statement starts from every vertex in the set L0, traverses every connected edge of type Liked_By, and returns every target vertex (that is, the other ends of the connected edges).  As a result, L1 is the set of all users who liked the input user. Referring to the graph in Figure 2, the query travels backwards along every Liked edge which points to id1, arriving at id2, id3, and id5. These three vertices form L1.

Next, the L2 statement starts from each user in L1, travels to every user liked by that starting user (via the Liked type of edges), and increments the count for each User reached. That is, the algorithm counts how many times each vertex is visited by a query path. The WHERE condition makes sure the original input user will not be returned in the result. ORDER BY and LIMIT have the same meaning as in SQL. Below, we show how the L2 step tallies the counts for each vertex encountered:

  1. From id2, Liked edges lead to id1 and id3. id1 is excluded due to the WHERE clause.  The cnt count for id3 is incremented from 0 to 1.
  2. From id3, Liked edges lead to id1 and id4. id1 is excluded due to the WHERE clause.  The cnt count for id4 is incremented from 0 to 1.
  3. From id5, Liked edges lead to id1, id2, and id4. id1 is excluded due to the WHERE clause. The cnt count of id2 is incremented from 0 to 1. The cnt count of id4 is incremented from 1 to 2.

The three co-liked users and their cnt scores are: id3 (cnt = 1), id4 (cnt = 2), and id2 (cnt = 1).  The ORDER BY clause indicates that the sorting should be in descending order, so that the LIMIT clause trims L2 to the topk vertices with the highest (as opposed to lowest) cnt values.  For the test graph, there are only 3 co-liked vertices, fewer than the limit, and id4 has the strongest co-liked relationship.
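As a cross-check of the tally above, the same two-hop counting can be sketched in plain Python (illustrative only, not TigerGraph code; top_co_liked is a made-up helper name), run on the Figure 2 edge list:

```python
from collections import Counter

def top_co_liked(liked_edges, input_user, topk):
    # Forward adjacency: who each user likes.
    likes = {}
    for src, dst in liked_edges:
        likes.setdefault(src, set()).add(dst)
    # L1: everyone who likes input_user (the Liked_By traversal).
    likers = {src for src, dsts in likes.items() if input_user in dsts}
    # L2: everyone liked by a member of L1, excluding input_user,
    # counting one per two-hop path.
    counts = Counter()
    for y in likers:
        for z in likes.get(y, ()):
            if z != input_user:
                counts[z] += 1
    return counts.most_common(topk)

# Edge list from Figure 2 (cf_data.csv)
edges = [("id2", "id1"), ("id2", "id3"), ("id3", "id1"), ("id3", "id4"),
         ("id5", "id1"), ("id5", "id2"), ("id5", "id4")]
result = top_co_liked(edges, "id1", 20)
print(result)  # id4 scores 2; id2 and id3 each score 1
```

This reproduces the walkthrough: id4 is reached twice (via id3 and id5), while id2 and id3 are each reached once.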

Install and Run a Query

After the query is defined (in the CREATE QUERY block), it needs to be installed.  The INSTALL QUERY command compiles the query.

INSTALL QUERY topCoLiked


If you have several queries, you can install them together in one command, which runs faster than installing each one separately. E.g.,

INSTALL QUERY query1, query2

or

INSTALL QUERY ALL

is faster than

INSTALL QUERY query1
INSTALL QUERY query2

After a query has been installed, it can be run as many times as desired. The command RUN QUERY invokes the query, with the given input arguments.

Using "id1" as the starting point and allowing up to 5 vertices in the output, the RUN QUERY command and its output on our test graph is shown below:

GSQL > RUN QUERY topCoLiked("id1", 5)
{
  "error": false,
  "message": "",
  "results": [
    {"v_set": "L2", "v_id": "id4", "v": {"@cnt": 2}, "v_type": "User"},
    {"v_set": "L2", "v_id": "id3", "v": {"@cnt": 1}, "v_type": "User"},
    {"v_set": "L2", "v_id": "id2", "v": {"@cnt": 1}, "v_type": "User"}
  ]
}


Instead of using the RUN QUERY command within the GSQL shell, the query can be invoked from the operating system via a RESTful GET endpoint (which is automatically created by the INSTALL QUERY command):

> curl -X GET "http://hostName:port/query/topCoLiked?input_user=id1&topk=5"

If you followed the standard installation instructions for the TigerGraph system, hostName for the REST server is localhost, and port is 9000.
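For illustration, the endpoint URL can be assembled programmatically. The sketch below is plain Python (the helper query_url is made up), using the /query/<queryName> path and default host/port mentioned above:

```python
from urllib.parse import urlencode

def query_url(host, port, query_name, **params):
    # Each installed query gets a GET endpoint at /query/<queryName>;
    # parameters are passed as an ordinary URL query string.
    return f"http://{host}:{port}/query/{query_name}?{urlencode(params)}"

url = query_url("localhost", 9000, "topCoLiked", input_user="id1", topk=5)
print(url)  # http://localhost:9000/query/topCoLiked?input_user=id1&topk=5
```

A script could then fetch this URL with any HTTP client (e.g., urllib.request.urlopen) and parse the JSON response shown earlier.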

Step 4 (Optional): Update Your Data.

You can update the stored graph at any time, to add new vertices and edges, to remove some, or to update existing values. The GSQL language includes ADD, DROP, ALTER, UPSERT, and DELETE operations which are similar to the SQL operations of the same name. The UPSERT operation is a combined UPDATE-INSERT operation: if the object exists, then UPDATE; else INSERT. Note that this is the default behavior for the GSQL language's 'smart' loading described above. There are three basic types of modifications to a graph:

  1. Adding or deleting objects
  2. Altering the schema of the graph
  3. Modifying the attributes of existing objects

We'll give a quick example of each type. To show the effect of each modification, we'll use the following simple built-in queries:

cf_mod_check.gsql
SELECT * FROM User LIMIT 1000
SELECT * FROM User-(Liked)->User WHERE from_id=="id2"

The current results, before making any modifications, are shown below.

Users vertices and Edges from id2, before any modifications
GSQL > SELECT * FROM User LIMIT 1000
[
  {"v_type": "User", "attributes": {}, "id": "id2"},
  {"v_type": "User", "attributes": {}, "id": "id1"},
  {"v_type": "User", "attributes": {}, "id": "id3"},
  {"v_type": "User", "attributes": {}, "id": "id4"},
  {"v_type": "User", "attributes": {}, "id": "id5"}
]
GSQL > SELECT * FROM User-(Liked)->User WHERE from_id=="id2"
[
  {"to_v_type": "User", "to_id": "id1", "etype": "Liked", "attributes": {}},
  {"to_v_type": "User", "to_id": "id3", "etype": "Liked", "attributes": {}}
]


Graph modification operations are performed by a distributed computing model which satisfies eventual consistency. For these examples, a brief one-second pause between updating and querying the graph should be sufficient.


Modification Type 1: Adding or deleting

Adding is simply running a loading job again with a new data file. The GSQL loader supports a special type of LOADING JOB called an ONLINE_POST LOADING JOB. This can be used to read data continuously from a streaming source. More details are in the GSQL Language Reference.

Deleting: Suppose we want to delete vertex id3 and all its connections:

cf_mod1.gsql
DELETE FROM User WHERE primary_id=="id3"


The GSQL DELETE operation is a cascading deletion. If a vertex is deleted, then all of the edges which connect to it are automatically deleted as well.
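The cascading rule can be sketched in plain Python over a simple in-memory vertex set and edge set (illustrative only, not TigerGraph code):

```python
def delete_vertex(vertices, edges, vid):
    # Cascading delete: removing a vertex also removes every edge
    # that starts or ends at it.
    vertices.discard(vid)
    edges -= {e for e in edges if vid in (e[0], e[2])}

# Liked edges from Figure 2
vertices = {"id1", "id2", "id3", "id4", "id5"}
edges = {("id2", "Liked", "id1"), ("id2", "Liked", "id3"),
         ("id3", "Liked", "id1"), ("id3", "Liked", "id4"),
         ("id5", "Liked", "id1"), ("id5", "Liked", "id2"),
         ("id5", "Liked", "id4")}
delete_vertex(vertices, edges, "id3")
# id3 is gone, along with edges (id2->id3), (id3->id1), and (id3->id4)
```

As in the GSQL result below, deleting id3 leaves four vertices and removes the Liked edge from id2 to id3 along with id3's own outgoing edges.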

Result: one fewer vertex and one fewer edge from id2.

Users vertices and Edges from id2, after Modification 1
GSQL > SELECT * FROM User LIMIT 1000
[
  {"v_type": "User", "attributes": {}, "id": "id2"},
  {"v_type": "User", "attributes": {}, "id": "id1"},
  {"v_type": "User", "attributes": {}, "id": "id4"},
  {"v_type": "User", "attributes": {}, "id": "id5"}
]
GSQL > SELECT * FROM User-(Liked)->User WHERE from_id=="id2"
[
  {"to_v_type": "User", "to_id": "id1", "etype": "Liked", "attributes": {}}
]



Modification Type 2: Altering the schema

The GSQL language supports four types of schema alterations:

  1. Adding a new type of vertex or edge: ADD    VERTEX | DIRECTED EDGE | UNDIRECTED EDGE
  2. Removing a type of vertex or edge: DROP    VERTEX | DIRECTED EDGE | UNDIRECTED EDGE
  3. Adding attributes to a vertex or edge type: ALTER VERTEX vertex_type | EDGE edge_type ADD ATTRIBUTE (name type)
  4. Removing attributes of a vertex or edge type: ALTER VERTEX vertex_type | EDGE edge_type DROP ATTRIBUTE (name)

To make schema changes, create a SCHEMA_CHANGE job. Running the SCHEMA_CHANGE JOB will automatically stop all services, update the graph store, and restart the service. For example, suppose we wish to add a name for Users and a weight to Liked edges to indicate how much User A likes User B.

cf_mod2.gsql
CREATE SCHEMA_CHANGE JOB cf_mod2 FOR GRAPH gsql_demo {
  ALTER VERTEX User ADD ATTRIBUTE (name string);
  ALTER EDGE Liked ADD ATTRIBUTE (weight float DEFAULT 1);
}
RUN JOB cf_mod2

Changing the schema may necessitate changing queries and other tasks, such as REST endpoints.  In this example, the collaborative filtering query will still run with the new weight attribute, but it will ignore the weight in its calculations.


Users vertices and Edges from id2, after Modification 2
GSQL > SELECT * FROM User LIMIT 1000
[
  {"v_type": "User", "attributes": {"name": ""}, "id": "id2"},
  {"v_type": "User", "attributes": {"name": ""}, "id": "id1"},
  {"v_type": "User", "attributes": {"name": ""}, "id": "id4"},
  {"v_type": "User", "attributes": {"name": ""}, "id": "id5"}
]
GSQL > SELECT * FROM User-(Liked)->User WHERE from_id=="id2"
[
  {"to_v_type": "User", "to_id": "id1", "etype": "Liked", "attributes": {"weight": 1}}
]


Modification Type 3: Modifying the attributes of existing objects

Now that we have added a weight attribute, we probably want to assign some weight values to the graph.  The following example updates the weight values of two edges. For edge upserts, the first two arguments in the VALUES list specify the FROM vertex id and the TO vertex id, respectively.  Similarly, for vertex upserts, the first argument in the VALUES list specifies the PRIMARY_ID value. Since id values may not be updated, the GSQL shell implicitly applies a conditional test: "If the specified id value(s) exist, then update the non-id attributes in the VALUES list; otherwise, insert a new data record using these values."

cf_mod3.gsql
UPSERT User VALUES ("id1", "Aaron")
UPSERT User VALUES ("id2", "Bobbie")
UPSERT User-(Liked)->User VALUES ("id2", "id1", 2.5)


Users vertices and Edges from id2, after Modification 3
GSQL > SELECT * FROM User LIMIT 1000
[
  {"v_type": "User", "attributes": {"name": "Bobbie"}, "id": "id2"},
  {"v_type": "User", "attributes": {"name": "Aaron"}, "id": "id1"},
  {"v_type": "User", "attributes": {"name": ""}, "id": "id4"},
  {"v_type": "User", "attributes": {"name": ""}, "id": "id5"}
]
GSQL > SELECT * FROM User-(Liked)->User WHERE from_id=="id2"
[
  {"to_v_type": "User", "to_id": "id1", "etype": "Liked", "attributes": {"weight": 2.5}}
]
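The conditional upsert rule ("if the id exists, update the non-id attributes; otherwise insert") can be sketched in plain Python (illustrative only, not TigerGraph code; the store is a simple dict, and id9 is a made-up extra id):

```python
def upsert_vertex(store, vid, **attrs):
    # If vid exists, update its non-id attributes; otherwise insert a
    # new record. The primary id itself is never changed.
    store.setdefault(vid, {}).update(attrs)

users = {"id1": {"name": ""}, "id2": {"name": ""}}
upsert_vertex(users, "id1", name="Aaron")   # existing id: update
upsert_vertex(users, "id9", name="Zed")     # new id: insert (hypothetical)
```

Untouched vertices (such as id2 here) keep their existing attribute values, just as in the GSQL results above.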


Other Modes for Graph Updates

In addition to making graph updates within the GSQL Shell, there are two other ways: sending a query string directly to the Standard Data Manipulation REST API, or writing a custom REST endpoint.  For details about the first method, see the GET, POST, and DELETE /graph endpoints in the RESTPP API User Guide v1.0 . The functionality in GSQL and in the Standard Query API is the same; GSQL commands are translated into REST GET, POST, and DELETE requests and submitted to the Standard Query API.

The REST API equivalent of the GSQL Modification 3 upsert example above is as follows:

curl -X POST --data-binary @data/cf_mod3_input.json http://serverIP:9000/graph

where serverIP is the IP address of your REST server (default = localhost) and data/cf_mod3_input.json is a text file containing the following JSON-encoded data:

cf_upsert.json
{
  "vertices": {
    "User": {
      "id1": {"name": {"value": "Aaron"}},
      "id2": {"name": {"value": "Bobbie"}}
    }
  },
  "edges": {
    "User": {
      "id2": {
        "Liked": {
          "User": {
            "id1": {"weight": {"value": 2.5}}
          }
        }
      }
    }
  }
}






Example 2. Page Rank

This example shows the use of WHILE loop iteration, global accumulators, and the built-in outdegree() function.

It is recommended that you do the Collaborative Filtering Use Case first, because it contains additional tips on running the TigerGraph system.

Remember that if you have a text file containing GSQL commands (e.g., commands.gsql), you can run it one of two ways:

  • From Linux: gsql commands.gsql
  • From inside the GSQL shell: @commands.gsql

To run a single command (such as DROP ALL):

  • From Linux: gsql 'DROP ALL'
  • From inside the GSQL shell: DROP ALL

Step 1: Create a graph model.

In this example, there is only one type of vertex and one type of edge, and edges are directed.

pagerank_model.gsql
CREATE VERTEX Page (PRIMARY_ID pid string, page_id string)
CREATE DIRECTED EDGE Linkto (FROM Page, TO Page)
#CREATE GRAPH gsql_demo(*)

Note how the Page vertex type has both a PRIMARY_ID  and a page_id attribute.  As will be seen in step 2, the same data will be loaded into both fields.  While this seems redundant, this is a useful technique in TigerGraph graph stores. The PRIMARY_ID is not treated as an ordinary attribute. In exchange for high-performance storage, the PRIMARY_ID lacks some of the filtering and querying features available to regular attributes.  The Linkto edge does not have any attributes. In general, a TigerGraph graph can have attributes on both vertices and edges, and it can also have different types of edges connecting the same two vertices.

The CREATE GRAPH command is commented out for the following reason:

Our examples have been designed to run either as individual graphs or merged together into one multi-application graph.  The CREATE GRAPH command should be run only once, after all the vertex and edge types for all the examples have been created. (Naturally, every model uses unique vertex and edge names, to avoid conflicts.) In other words, run ' CREATE GRAPH gsql_demo(*) ' as a separate command after you have created all your vertex and edge types.

Please see the GSQL Language Reference which provides a more complete description of the graph schema definition language with additional examples .

Step 2: Load initial data using Online Loading

A graph similar to the user-user network from the Collaborative Filtering example can be used here: each row has two values which are node IDs, meaning that there is a connection from the first node to the second node. However, we will introduce a difference to demonstrate the flexibility of the TigerGraph loading system: the data file uses the tab character as the field separator instead of the comma.

pagerank_data.tsv
1	2
1	3
2	3
3	4
4	1
4	2

Online loading job:

Create your loading job and load the data.

Per Example: Load data into the graph (file: pagerank_load.gsql)
# define the loading job
CREATE ONLINE_POST JOB load_pagerank FOR GRAPH gsql_demo {
  LOAD TO VERTEX Page VALUES ($0, $0),
       TO VERTEX Page VALUES ($1, $1),
       TO EDGE Linkto VALUES ($0, $1);
}
# load the data
RUN JOB load_pagerank USING FILENAME="../pagerank/pagerank_data.tsv", SEPARATOR="\t", EOL="\n"

The above loading job will read each line of the input file (pagerank_data.tsv), create one vertex based on the value in the first column (referenced as $0), another vertex based on the value in the second column ($1), and one edge pointing from the first vertex to the second vertex. If either of the two column values has been seen before, that vertex won't be created again; instead, the existing vertex will be used. For example, the first row of pagerank_data.tsv will create two vertices, with ids 1 and 2, and one edge (1, 2). The second row, however, will create only one new vertex, id 3, and one edge (1, 3), because id 1 already exists.

Note how an online loading RUN JOB command specifies the data filename, separator character, and end-of-line character.

Step 3: Write a Graph-based query solution

GSQL includes not only data definition and simple inspection of the data, but also advanced querying which traverses the graph and supports aggregation and iteration. This example uses iteration, repeating the computation block until the maximum score change at any vertex is no more than a user-provided threshold, or until it reaches a user-specified maximum number of iterations. Note the arrow -> in the FROM clause, which represents the direction of a directed edge.

pagerank_query.gsql
CREATE QUERY pageRank (float maxChange, int maxIteration, float dampingFactor) FOR GRAPH gsql_demo {
  # In each iteration, compute a score for each vertex:
  #   score = dampingFactor + (1-dampingFactor)*sum(received scores from its neighbors).
  # The pageRank algorithm stops when either of the following is true:
  #   a) it reaches maxIteration iterations;
  #   b) the maximum score difference of any vertex compared to the last iteration <= maxChange.
  # @@ prefix means a global accumulator;
  # @ prefix means an individual accumulator associated with each vertex
  MaxAccum<float> @@maxDifference = 9999;   # max score change in an iteration
  SumAccum<float> @received_score = 0;      # sum of scores each vertex receives from neighbors
  SumAccum<float> @score = 1;               # initial score for every vertex is 1

  AllV = {Page.*};   # start with all vertices of type Page
  WHILE @@maxDifference > maxChange LIMIT maxIteration DO
    @@maxDifference = 0;
    AllV = SELECT s
           FROM AllV:s-(Linkto)->:t
           ACCUM t.@received_score += s.@score/s.outdegree()
           POST-ACCUM t.@score = dampingFactor + (1-dampingFactor) * t.@received_score,
                      t.@received_score = 0,
                      @@maxDifference += abs(t.@score - t.@score');
    PRINT @@maxDifference;   # print to default json result
  END;   # end while loop
  PRINT AllV.page_id, AllV.@score;   # print the results
}   # end query

After executing the CREATE QUERY command, remember to install the query, either by itself or together with other queries:

Install the query
INSTALL QUERY pageRank


Run the query:

We will use the typical dampingFactor of 0.15 and iterate until the PageRank values change by less than 0.001, up to a maximum of 100 iterations. Under these conditions, the PageRank values for the 4 vertices (1,2,3,4) are (0.65551, 0.93379, 1.22156, 1.18914), respectively.

RUN QUERY pageRank(0.001, 100, 0.15)
{
  "error": false,
  "message": "",
  "results": [
    {"@@maxDifference": 0.425},
    {"@@maxDifference": 0.36125},
    {"@@maxDifference": 0.15353},
    {"@@maxDifference": 0.19575},
    {"@@maxDifference": 0.16639},
    {"@@maxDifference": 0.09429},
    {"@@maxDifference": 0.08014},
    {"@@maxDifference": 0.05961},
    {"@@maxDifference": 0.04705},
    {"@@maxDifference": 0.03999},
    {"@@maxDifference": 0.017},
    {"@@maxDifference": 0.02},
    {"@@maxDifference": 0.017},
    {"@@maxDifference": 0.00953},
    {"@@maxDifference": 0.0081},
    {"@@maxDifference": 0.00616},
    {"@@maxDifference": 0.00479},
    {"@@maxDifference": 0.00407},
    {"@@maxDifference": 0.00178},
    {"@@maxDifference": 0.00205},
    {"@@maxDifference": 0.00174},
    {"@@maxDifference": 9.6E-4},
    {"v_set": "AllV", "v_id": "2", "v": {"AllV.page_id": "2", "AllV.@score": 0.93379}, "v_type": "Page"},
    {"v_set": "AllV", "v_id": "4", "v": {"AllV.page_id": "4", "AllV.@score": 1.18914}, "v_type": "Page"},
    {"v_set": "AllV", "v_id": "1", "v": {"AllV.page_id": "1", "AllV.@score": 0.65551}, "v_type": "Page"},
    {"v_set": "AllV", "v_id": "3", "v": {"AllV.page_id": "3", "AllV.@score": 1.22156}, "v_type": "Page"}
  ]
}
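The same synchronous fixed-point iteration can be sketched in ordinary Python (a sketch of ours, using the six-edge graph and the stopping rule from this example; function and variable names are our own):

```python
# Sketch (ours) of the pageRank iteration on the four-page example graph.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 1), (4, 2)]  # pagerank_data.tsv
vertices = {1, 2, 3, 4}
out_deg = {v: sum(1 for s, _ in edges if s == v) for v in vertices}

def pagerank(max_change=0.001, max_iter=100, damping_factor=0.15):
    score = {v: 1.0 for v in vertices}          # initial score is 1
    for _ in range(max_iter):
        received = {v: 0.0 for v in vertices}
        for s, t in edges:                       # the ACCUM phase
            received[t] += score[s] / out_deg[s]
        max_diff = 0.0
        for v in vertices:                       # the POST-ACCUM phase
            new = damping_factor + (1 - damping_factor) * received[v]
            max_diff = max(max_diff, abs(new - score[v]))
            score[v] = new
        if max_diff <= max_change:               # WHILE ... LIMIT condition
            break
    return score

print(pagerank())
```

Running this converges to approximately (0.6555, 0.9338, 1.2216, 1.1891) for vertices 1 through 4, matching the query output above.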

Step 4 (Optional): Update Your Data.

Details about updating were discussed in Use Case 1 (Collaborative Filtering).  We will go right to the graph modification examples for the PageRank case.

To show the effect of each modification, we use two built-in queries. The first one lists all the Page vertices.  The second one lists all the edges which start at Page 4.

pagerank_mod_check.gsql
SELECT * FROM Page LIMIT 1000
SELECT * FROM Page-(Linkto)->Page WHERE from_id=="4"

These are the results of the diagnostic queries, before any graph modifications.  There are 4 vertices total and 2 edges which start at page 4.

Page vertices and Linkto edges from Page 4, before modifications
SELECT * FROM Page LIMIT 1000
[
  {"v_type": "Page", "attributes": {"page_id": "1"}, "id": "1"},
  {"v_type": "Page", "attributes": {"page_id": "2"}, "id": "2"},
  {"v_type": "Page", "attributes": {"page_id": "3"}, "id": "3"},
  {"v_type": "Page", "attributes": {"page_id": "4"}, "id": "4"}
]
SELECT * FROM Page-(Linkto)->Page WHERE from_id=="4"
[
  {"to_v_type": "Page", "to_id": "1", "etype": "Linkto", "attributes": {}},
  {"to_v_type": "Page", "to_id": "2", "etype": "Linkto", "attributes": {}}
]


Modification 1: Adding or deleting

Adding is simply running a loading job again with a new data file.

Deleting: Suppose we want to delete vertex 1 and all its connections:

pagerank_mod1.gsql
DELETE FROM Page WHERE page_id=="1"


The GSQL DELETE operation is a cascading deletion. If a vertex is deleted, then all of the edges which connect to it are automatically deleted as well.
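Cascading deletion can be sketched in plain Python (a sketch of ours, using the PageRank example graph; names are our own):

```python
# Sketch (ours) of cascading deletion: removing a vertex also removes
# every edge that connects to it, in either direction.
vertices = {"1", "2", "3", "4"}
edges = {("1", "2"), ("1", "3"), ("2", "3"), ("3", "4"), ("4", "1"), ("4", "2")}

def delete_vertex(v):
    vertices.discard(v)
    for e in [e for e in edges if v in e]:   # drop all edges touching v
        edges.remove(e)

delete_vertex("1")
print(sorted(e for e in edges if e[0] == "4"))   # [('4', '2')]
```

Deleting vertex 1 removes edges (1,2), (1,3), and (4,1), leaving only (4,2) among the edges starting at Page 4, as the diagnostic queries below show.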

Result: one fewer vertex and one fewer edge from Page 4.

Page vertices and Linkto edges from Page 4, after Modification 1
SELECT * FROM Page LIMIT 1000
[
  {"v_type": "Page", "attributes": {"page_id": "2"}, "id": "2"},
  {"v_type": "Page", "attributes": {"page_id": "3"}, "id": "3"},
  {"v_type": "Page", "attributes": {"page_id": "4"}, "id": "4"}
]
SELECT * FROM Page-(Linkto)->Page WHERE from_id=="4"
[
  {"to_v_type": "Page", "to_id": "2", "etype": "Linkto", "attributes": {}}
]

Modification Type 2: Altering the schema

For example, suppose we wish to add an attribute to the Page vertices to classify what type of Page it is and also a date to the edges.

pagerank_mod2.gsql
CREATE SCHEMA_CHANGE JOB pagerank_mod2 FOR GRAPH gsql_demo {
  ALTER VERTEX Page ADD ATTRIBUTE (pageType string DEFAULT "");
  ALTER EDGE Linkto ADD ATTRIBUTE (dateLinked string DEFAULT "");
}
RUN JOB pagerank_mod2


Changing the schema may necessitate revising and reinstalling loading jobs and queries. In this case, adding the pageType attribute does not harm the pageRank query.

Page vertices and Linkto edges from Page 4, after Modification 2
SELECT * FROM Page LIMIT 1000
[
  {"v_type": "Page", "attributes": {"page_id": "2", "pageType": ""}, "id": "2"},
  {"v_type": "Page", "attributes": {"page_id": "3", "pageType": ""}, "id": "3"},
  {"v_type": "Page", "attributes": {"page_id": "4", "pageType": ""}, "id": "4"}
]
SELECT * FROM Page-(Linkto)->Page WHERE from_id=="4"
[
  {"to_v_type": "Page", "to_id": "2", "etype": "Linkto", "attributes": {"dateLinked": ""}}
]


Modification Type 3: Modifying the attributes of existing objects

The following example updates the pageType values of two vertices and the dateLinked value of one edge.

UPSERT Page VALUES (2,2,"info")
UPSERT Page VALUES (3,3,"commerce")
UPSERT Page-(Linkto)->Page VALUES (4,2,"2016-08-31")

Page vertices and Linkto edges from Page 4, after Modification 3
SELECT * FROM Page LIMIT 1000
[
  {"v_type": "Page", "attributes": {"page_id": "2", "pageType": "info"}, "id": "2"},
  {"v_type": "Page", "attributes": {"page_id": "3", "pageType": "commerce"}, "id": "3"},
  {"v_type": "Page", "attributes": {"page_id": "4", "pageType": ""}, "id": "4"}
]
SELECT * FROM Page-(Linkto)->Page WHERE from_id=="4"
[
  {"to_v_type": "Page", "to_id": "2", "etype": "Linkto", "attributes": {"dateLinked": "2016-08-31"}}
]


Other Modes for Graph Updates

In addition to making graph updates within the GSQL Shell, there are two other ways: sending a query string directly to the Standard Data Manipulation REST API, or writing a custom REST endpoint.  For details about the first method, see the Standard Data Manipulation REST API User Guide . The functionality in GSQL and in the Standard Query API is essentially the same; GSQL commands are translated into REST GET, POST, and DELETE requests and submitted to the Standard Query API.

The REST API equivalent of the GSQL Modification 3 upsert example above is as follows:

curl -X POST --data-binary @data/pagerank_mod3_input.json http://serverIP:9000/graph

where serverIP is the IP address of your REST server, and data/pagerank_mod3_input.json is a text file containing the following JSON-encoded data:

{
  "vertices": {
    "Page": {
      "2": {"pageType": {"value": "info"}},
      "3": {"pageType": {"value": "commerce"}}
    }
  },
  "edges": {
    "Page": {
      "4": {
        "Linkto": {
          "Page": {
            "2": {"dateLinked": {"value": "2016-08-31"}}
          }
        }
      }
    }
  }
}



Back to top



Example 3. Simple Product Recommendation

This example introduces the technique of flattening – splitting a data field which contains a set of elements into individual vertices and edges, one for each element.


Input Data: A list of products. Each Product has a 64-bit image hash value and a list of words describing the product.
Query Task: Find the products which are most similar to a given product. Formally, given a product id P and an integer K, return the top K products similar to product P. The similarity between a product P and another product Q is based on the number of words found in the product descriptions of both P and Q.

Step 1: Create a graph model for the use case, using the data definition language (DDL) aspect of the GSQL language.


simprod_model.gsql
CREATE VERTEX Product (PRIMARY_ID pid string, image_hash uint)
CREATE VERTEX DescWord (PRIMARY_ID id string)
CREATE UNDIRECTED EDGE Has_desc (FROM Product, TO DescWord)

Then run

CREATE GRAPH gsql_demo(*)

The above statements create two types of vertices, Product and DescWord, and one type of edge connecting the two vertex types. The edge is undirected so that you can just as easily traverse from a Product to its descriptive words or from a descriptive word to Products which are described by it.

The generated graph schema for this case is shown below. The GSQL Language Reference manual provides a more complete description of the language, with more examples.

Step 2: Load Input Data.

In this example, the input data are all stored in a single file in a 3-column format with a header row. Below are the test data:

simprod_data.csv
id,hash,words
62abcax334,15243242,"word1,word2,word3"
dell laptop,1837845,"word2,word4,word5"
mac book, 128474373,"word4"
surface pro,8439828,"word1,word3,word6"
hp book,29398439828,"word2,word3,word1"
linux abc,298439234,"word4,word2,word1"
linux def,295839234,"word4,word2,word6,word7"


Column 1 is the product id, column 2 is the image hash code, and column 3 is a list of words describing the product. Note how double quotation marks are used to enclose the list of words. Each row of the input file may lead to the creation of one Product vertex, multiple DescWord vertices, and multiple edges, one edge connecting the Product to each DescWord vertex.

The loading job below makes use of several features of the loading language to intelligently transform this data file into the appropriate vertices and edges.

  1. GSQL online loading cannot directly use the header in a data file, but a loading job can accomplish something similar: DEFINE HEADER and USER_DEFINED_HEADER allow the loading job to use descriptive names for the columns ("id", "hash", "words") instead of numbers ($0, $1, $2).
  2. QUOTE="double" informs the loader that double quotation marks enclose strings.  This allows the separator character (e.g., comma) to appear in the string, without triggering the end of the token.  QUOTE="single" is also available.
  3. The special TEMP_TABLE and flatten() function are used to split the list of tokens into separate items and to store them temporarily. The temporary items are then used to assemble the final edge objects.
  4. The RUN JOB option " -n 2,$ " means "only read data lines from line 2 to the end." This is used to skip over the header in simprod_data.csv.
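The combined effect of quoted fields, header skipping, and flattening can be sketched in plain Python (a sketch of ours, using the first two data rows of simprod_data.csv; names are our own):

```python
import csv
import io

# Sketch (ours) of the loading job: parse quoted fields, skip the header
# line, and "flatten" each quoted word list into one edge per word.
data = '''id,hash,words
62abcax334,15243242,"word1,word2,word3"
dell laptop,1837845,"word2,word4,word5"
'''

products, desc_words, has_desc = {}, set(), []
reader = csv.reader(io.StringIO(data), delimiter=",", quotechar='"')
next(reader)                        # like RUN JOB -n 2,$ : skip the header
for pid, img_hash, words in reader:
    products[pid] = int(img_hash)   # one Product vertex per row
    for w in words.split(","):      # like flatten($"words", ",", 1)
        desc_words.add(w)           # one DescWord vertex per distinct word
        has_desc.append((pid, w))   # one Has_desc edge per (product, word)

print(len(products), len(desc_words), len(has_desc))   # 2 5 6
```

The quotechar setting plays the role of QUOTE="double": the commas inside the quoted word list do not end the field.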

simprod_load.gsql
# define the loading job
CREATE ONLINE_POST JOB load_simprod FOR GRAPH gsql_demo {
  DEFINE HEADER head1 = "id","hash","words";
  LOAD
    TO VERTEX Product VALUES ($"id", $"hash"),
    TO TEMP_TABLE t (pid, description) VALUES ($"id", flatten($"words", ",", 1))
    USING QUOTE="double", USER_DEFINED_HEADER="head1";
  LOAD TEMP_TABLE t
    TO VERTEX DescWord VALUES ($"description"),
    TO EDGE Has_desc VALUES ($"pid", $"description");
}

# load the data
RUN JOB -n 2,$ load_simprod USING FILENAME="../simprod/simprod_data.csv", SEPARATOR=",", EOL="\n"

In general, the GSQL language can map and transform multiple input files to multiple vertex and edge types.  More advanced data transformation and filtering features are also available. See the GSQL Language Reference manual for more information.

An example of the resulting data graph is shown below. Products (P1, P2, etc.) connect to various DescWords (Word1, Word2, etc.). Each Product connects to many DescWords, and each DescWord is used in multiple Products.


Step 3: Write a graph-based solution using TigerGraph's high-level GSQL query language, to solve the use case and auto-generate the REST GET/POST endpoints for real-time access to the TigerGraph system.

simprod_query.gsql
CREATE QUERY productSuggestion (vertex<Product> seed, int threshold_cnt, int k = 10) FOR GRAPH gsql_demo {
  # an accumulator variable attachable to any vertex
  SumAccum<int> @cnt = 0;

  # T0 is the set of products from which we want to start traversal in the graph
  T0 = {seed};

  /**
   * Compute the collection of words describing the input
   * product. tgt is the alias of vertex type DescWord.
   * In other words, for every edge of the given type (Has_desc)
   * that has one vertex in the set T0 and the other vertex being of
   * the DescWord type, add its DescWord vertex to the output set.
   */
  ProductWords = SELECT tgt
                 FROM T0-(Has_desc)-DescWord:tgt;

  /**
   * The output set of the previous query (ProductWords) becomes the input set
   * of this query. From each word in ProductWords, activate all other products
   * which contain the word in their description, and accumulate a count for
   * each activated product to record how many words it has in common with the
   * input product. Then rank each related product using the count of common
   * words; the count must exceed the query parameter threshold_cnt.
   */
  Results = SELECT tgt
            FROM ProductWords-(Has_desc)->Product:tgt
            WHERE tgt != seed
            ACCUM tgt.@cnt += 1
            HAVING tgt.@cnt > threshold_cnt
            ORDER BY tgt.@cnt DESC
            LIMIT k;

  PRINT Results;   # default print output is the REST call response in JSON format
}


Query Result:

For product id 62abcax334, find the top 3 similar products which have more than 1 descriptive word in common with product 62abcax334.


//INSTALL QUERY productSuggestion
RUN QUERY productSuggestion("62abcax334", 1, 3)
{
  "error": false,
  "message": "",
  "results": [
    {"v_set": "Results", "v_id": "hp book", "v": {"@cnt": 3, "image_hash": 29398439828}, "v_type": "Product"},
    {"v_set": "Results", "v_id": "surface pro", "v": {"@cnt": 2, "image_hash": 8439828}, "v_type": "Product"},
    {"v_set": "Results", "v_id": "linux abc", "v": {"@cnt": 2, "image_hash": 298439234}, "v_type": "Product"}
  ]
}

When the above GSQL query is installed, a REST GET endpoint for it is automatically generated. Instead of running the query as a GSQL command, clients can also invoke it by formatting the query as an HTTP request query string and sending a GET request, e.g.,

curl -X GET "http://hostName:port/query/productSuggestion?seed=62abcax334&threshold_cnt=1&k=3"

For the standard TigerGraph configuration, hostName:port is localhost:9000.


Back to top



Example 4. Same Name Search

This example introduces the CASE...WHEN...THEN structure, which can also be used as an if...then block.

Input Data: A social network, where each person has a first and last name and may also display a picture of themselves.

Query Task: Find the users who are most "similar" to a user X. Specifically, a user X searches for other users whose first or last name matches user X's name. The query returns the list of users (Y1, Y2, ..., Yk) within two steps (two steps means friend-of-friend) who have matching names and who display a picture. The list is sorted by the relevance score between X and each user Yi, where the score is a linear function of four factors:

  1. depth : how far X is from Yi (the shortest distance)
  2. count : the number of shortest paths between X and Yi
  3. match : whether Yi matches the input first name (match=1), the input last name (match=2), or both input names (match=3)
  4. profile : whether Yi has a profile picture


Using the graph above as an example, suppose we want to compute relevance scores for the social network of the Tom Smith on the left.

  • match=1 for Tom Lee (first names are the same)
  • match=2 for May Smith (last names are the same)
  • match=3  for Tom Smith on the right (both names are the same).

There is no direct connection to the other Tom Smith, but there are several paths:

  • Paths with a depth = 2:
    • Tom Smith → Ming Wu → Tom Smith
    • Tom Smith → Ron Glass → Tom Smith
    • Tom Smith → Tom Lee → Tom Smith
    • Tom Smith → May Smith → Tom Smith

There are also some longer paths (e.g., Tom Smith → Ron Glass → Tom Lee → Tom Smith), but since they are longer, they are not of interest.  Therefore, for the relationship (Tom Smith, Tom Smith), depth = 2 and count = 4.

The four factors (depth, count, match, hasPicture) are combined to compute an overall relevance score:

score = match*100 + (4 - depth)*50 + count + (hasPicture ? 200 : 0)

The expression hasPicture ? 200 : 0 uses the ternary conditional operator: if hasPicture is TRUE, it evaluates to 200; otherwise, it evaluates to 0.
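The scoring formula can be written as a plain function (a sketch of ours; the argument names are our own, and whether a given user has a picture comes from the data):

```python
# Sketch (ours) of the relevance scoring formula.
def relevance(match, depth, count, has_picture):
    return match * 100 + (4 - depth) * 50 + count + (200 if has_picture else 0)

# A full name match at depth 2 reached by four shortest paths, no picture:
print(relevance(match=3, depth=2, count=4, has_picture=False))   # 404

# A full name match at depth 1 reached by one path, with a picture:
print(relevance(match=3, depth=1, count=1, has_picture=True))    # 651
```

Both values appear in the query results below: user 5 scores 404 and user 6 scores 651.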

To design the graph schema, consider what attributes are needed for each vertex and edge. The User vertices need a first name, a last name, and a profile picture. We assume that the social network is stored in two data files, one for vertices and one for edges.

Vertex file format: id, firstname, lastname, imag_url
Edge file format: user1, user2

The following code creates the schema and loads the data:

name_model.gsql
CREATE VERTEX NameUser (PRIMARY_ID id string, firstname string, lastname string, imag_url string)
CREATE UNDIRECTED EDGE NameConn (FROM NameUser, TO NameUser)


CREATE GRAPH gsql_demo(*)
name_load.gsql
# define the loading jobs
CREATE ONLINE_POST JOB load_nameV FOR GRAPH gsql_demo {
  LOAD TO VERTEX NameUser VALUES ($0, $1, $2, $3);
}
CREATE ONLINE_POST JOB load_nameE FOR GRAPH gsql_demo {
  LOAD TO EDGE NameConn VALUES ($0, $1);
}

# run the loading jobs
RUN JOB load_nameV USING FILENAME="../name/name_search_vertex.csv", SEPARATOR=",", EOL="\n"
RUN JOB load_nameE USING FILENAME="../name/name_search_edge.csv", SEPARATOR=",", EOL="\n"

Test data files

name_search_vertex.csv
0,michael,jackson,
1,michael,franklin,abc.com
2,michael,lili,def.com
3,franklin,lili,
4,lucia,franklin,lucia.org
5,michael,jackson,
6,michael,jackson,abc.com
7,lucia,jackson,
8,hahah,jackson,haha.net
name_search_edge.csv
0,1
0,3
0,4
1,5
1,3
1,2
2,6
2,7
2,1
2,0
3,0
3,1
3,5
3,7
4,5
5,6
5,7
6,7
6,1
6,2
6,0
6,4
7,5
8,5



The query algorithm is a bit long but straightforward:

  1. Select all the depth=1 neighbors. For each neighbor:
    1. Use a CASE structure to check for matching first and last names and assign a match value.
    2. Check for an image.
    3. We know depth=1 and count=1, so compute the relevance score.
  2. Starting from the depth=1 neighbors, move to the depth=2 neighbors.  For each such neighbor:
    1. Use a CASE structure to check for matching first and last names and assign a match value.
    2. Use ACCUM to count up the paths.
    3. Check for an image.
    4. Depth=2.  Compute the relevance score.

The complete query is shown below:

name_query.gsql
CREATE QUERY namesSimilar (vertex<NameUser> seed, string firstName, string lastName, int k) FOR GRAPH gsql_demo {
  # define a tuple to store a neighbor and its score
  typedef tuple<vertex<NameUser> uid, float score> neighbor;

  # runtime variables used to compute each neighbor's score
  SumAccum<int> @count = 0;
  SumAccum<int> @depth = 0;
  SumAccum<int> @match = 0;
  SumAccum<float> @score = 0.0;
  SumAccum<int> @hasImgURL = 0;

  # global heap variable used to store the final top k users, sorted by the
  # score field of the neighbor tuple
  HeapAccum<neighbor>(k, score DESC) @@finalTopKUsers;

  # starting user
  StartP = {seed};

  # flag first level neighbors with @depth = 1;
  # count the number of incoming connections;
  # flag the match category;
  # flag an img_url count greater than 0;
  # finally, push the user and their score into the global top-k heap.
  FirstLevelConnection = SELECT u
    FROM StartP -(NameConn)-> :u
    ACCUM u.@depth = 1,
          u.@count += 1,
          CASE
            WHEN u.firstname == firstName AND u.lastname == lastName THEN u.@match = 3
            WHEN u.firstname != firstName AND u.lastname == lastName THEN u.@match = 2
            WHEN u.firstname == firstName AND u.lastname != lastName THEN u.@match = 1
          END,
          CASE WHEN u.imag_url != "" THEN u.@hasImgURL = 1 END
    POST-ACCUM @@finalTopKUsers += neighbor(u,
          u.@match * 100 + (4 - u.@depth) * 50 + u.@count + u.@hasImgURL * 200);

  # similarly, do the top-k heap update using second level neighbors
  SecondLevelConnection = SELECT u2
    FROM FirstLevelConnection -(NameConn)-> :u2
    WHERE u2 != seed AND u2.@depth != 1
    ACCUM u2.@depth = 2,
          u2.@count += 1,
          CASE
            WHEN u2.firstname == firstName AND u2.lastname == lastName THEN u2.@match = 3
            WHEN u2.firstname != firstName AND u2.lastname == lastName THEN u2.@match = 2
            WHEN u2.firstname == firstName AND u2.lastname != lastName THEN u2.@match = 1
          END,
          CASE WHEN u2.imag_url != "" THEN u2.@hasImgURL = 1 END
    POST-ACCUM @@finalTopKUsers += neighbor(u2,
          u2.@match * 100 + (4 - u2.@depth) * 50 + u2.@count + u2.@hasImgURL * 200);

  # print the result
  PRINT @@finalTopKUsers;
}


Query result

Starting from user 0, who is named "Michael Jackson", find the top 100 most similar persons, according to the scoring function described above.

//INSTALL QUERY namesSimilar
RUN QUERY namesSimilar (0,"michael","jackson",100)
{
  "error": false,
  "message": "",
  "results": [{"@@finalTopKUsers": [
    {"uid": "6", "score": 651},
    {"uid": "1", "score": 451},
    {"uid": "2", "score": 451},
    {"uid": "5", "score": 404},
    {"uid": "4", "score": 351},
    {"uid": "7", "score": 303},
    {"uid": "3", "score": 151}
  ]}]
}

Back to top



Example 5. Content-Based Filtering Recommendation of Videos

This example demonstrates conditional loading to be selective about which data records to load into which vertices or edges.

Input Data : A network of video programs, a set of tags which describe each video, and a set of users who have watched and rated videos.

Query Task: Recommend video programs that a given user might like.

Step 1: Create Graph Schema

The principle behind content-based recommendation is that people are often interested in products which have attributes similar to the ones which they have selected in the past. Suppose we have a video store. If the store tracks what videos each customer has selected in the past, and also records attributes about its videos, it can use this data to recommend more videos to the customer. Formally, for an input user (seed), first find which videos the user has watched. Then, from all the watched videos, find the top k attributes. From the top k attributes, find the top n videos that the seed user has not watched.

This suggests that we should have a graph with three types of vertices: user, video, and attributes (of a video). The schema is shown below.

video_model.gsql
CREATE VERTEX VidUser (PRIMARY_ID user_id uint, content string, date_time uint)
CREATE VERTEX Video (PRIMARY_ID content_id uint, content string, date_time uint)
CREATE VERTEX AttributeTag (PRIMARY_ID tag_id string, content string, att_type uint)
CREATE UNDIRECTED EDGE Video_AttributeTag (FROM Video, TO AttributeTag, weight float DEFAULT 1.0, date_time uint)
CREATE UNDIRECTED EDGE User_Video (FROM VidUser, TO Video, rating float DEFAULT 1.0, date_time uint)

Then run

CREATE GRAPH gsql_demo(*)

Step 2: Load Input Data

In this example, there is one data file which contains data for all three types of vertices – VidUser, Video, and AttributeTag. The first field of each line indicates the vertex type. Similarly, there is one edge data file for the two types of edges – User_Video and Video_AttributeTag. The WHERE clause is used to conditionally load only certain data into each type of vertex or edge. Further, these data files do not contain information for every attribute. When "_" is used in the VALUES list of a LOAD statement, that attribute is not loaded from the input; the default value is written (or the existing value is retained, if a vertex or edge with that ID already exists).
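The effect of the WHERE clauses can be sketched in plain Python (a sketch of ours, using a few rows of the test data below; dispatch each record on its first field):

```python
import csv
import io

# Sketch (ours) of conditional loading: route each record to a vertex
# type based on its first field, as the WHERE clauses do.
data = '''type,id,content
User,0,
Video,0,v0
Tag,action,
'''

users, videos, tags = {}, {}, {}
for rec in list(csv.reader(io.StringIO(data)))[1:]:   # skip the header
    kind, key, content = rec[0], rec[1], rec[2]
    if kind == "User":             # WHERE $0 == "User"
        users[key] = {}            # "_" columns keep their defaults
    elif kind == "Video":          # WHERE $0 == "Video"
        videos[key] = {"content": content}
    elif kind == "Tag":            # WHERE $0 == "Tag"
        tags[key] = {}

print(len(users), len(videos), len(tags))   # 1 1 1
```

A header line whose first field matches none of the conditions is simply skipped, which is why these loading jobs need no special header handling.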

Test data files

video_recommendation_v.csv
type,id,content
User,0,
User,1,
User,2,
User,3,
Video,0,v0
Video,1,v1
Video,2,v2
Video,3,v3
Video,4,v4
Tag,action,
Tag,comedy,
Tag,mystery,
Tag,technical,
video_recommendation_e.csv
type,from,to,rating
UV,0,0,6.8
UV,0,2,5.2
UV,0,3,10.0
UV,1,1,1.2
UV,2,0,7.4
UV,3,0,6.6
UV,3,4,8.4
VA,0,action,
VA,0,comedy,
VA,1,mystery,
VA,2,technical,
VA,2,mystery,
VA,2,action,
VA,3,comedy,
VA,4,technical,
VA,4,action,


Loading jobs

video_load.gsql
# define the loading jobs
CREATE ONLINE_POST JOB load_videoV FOR GRAPH gsql_demo {
  LOAD
    TO VERTEX VidUser VALUES ($1,_,_) WHERE $0 == "User",
    TO VERTEX Video VALUES ($1,$2,_) WHERE $0 == "Video",
    TO VERTEX AttributeTag VALUES ($1,_,_) WHERE $0 == "Tag";
}
CREATE ONLINE_POST JOB load_videoE FOR GRAPH gsql_demo {
  LOAD
    TO EDGE User_Video VALUES ($1,$2,$3,_) WHERE $0 == "UV",
    TO EDGE Video_AttributeTag VALUES ($1,$2,_,_) WHERE $0 == "VA";
}

# load the data
RUN JOB load_videoV USING FILENAME="../video/video_recommendation_v.csv", SEPARATOR=",", EOL="\n"
RUN JOB load_videoE USING FILENAME="../video/video_recommendation_e.csv", SEPARATOR=",", EOL="\n"


Step 3: Query the data

The query has the three stages characteristic of content-based recommendation:

  1. Find products (videos) previously selected
  2. Find the top attributes of those products
  3. Find the products which have the most attributes in common with the seed products
video_query.gsql
CREATE QUERY videoRecommendation (vertex<VidUser> seed, int k, int n) FOR GRAPH gsql_demo {
  OrAccum @viewedBySeed;
  SumAccum<float> @score;

  Start = {seed};

  # get viewed videos
  Viewed = SELECT v
           FROM Start -(User_Video:e)-> Video:v
           ACCUM v.@viewedBySeed += true,
                 v.@score += e.rating;

  # get attributes
  Attribute = SELECT att
              FROM Viewed:v -(Video_AttributeTag)-> AttributeTag:att
              ACCUM att.@score += v.@score
              ORDER BY att.@score
              LIMIT k;

  # get recommended videos
  Recommend = SELECT v
              FROM Attribute:att -(Video_AttributeTag)-> Video:v
              WHERE v.@viewedBySeed != true
              ACCUM v.@score += att.@score
              ORDER BY v.@score
              LIMIT n;

  PRINT Recommend;
}
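The three stages of the query can also be sketched in plain Python (a sketch of ours, with the test data hard-coded as dictionaries; names are our own):

```python
# Sketch (ours) of the three recommendation stages, using the test data.
watched = {"0": {"0": 6.8, "2": 5.2, "3": 10.0}}   # user -> video -> rating
video_tags = {"0": ["action", "comedy"], "1": ["mystery"],
              "2": ["technical", "mystery", "action"],
              "3": ["comedy"], "4": ["technical", "action"]}

def recommend(user, k, n):
    # 1. videos previously selected, scored by rating
    viewed = watched[user]
    # 2. the top-k attributes of those videos, weighted by rating
    att_score = {}
    for v, rating in viewed.items():
        for tag in video_tags[v]:
            att_score[tag] = att_score.get(tag, 0.0) + rating
    top_atts = sorted(att_score, key=att_score.get, reverse=True)[:k]
    # 3. unwatched videos sharing the most (weighted) attributes
    vid_score = {}
    for v, vtags in video_tags.items():
        if v in viewed:
            continue
        for tag in vtags:
            if tag in top_atts:
                vid_score[v] = vid_score.get(v, 0.0) + att_score[tag]
    ranked = sorted(vid_score.items(), key=lambda kv: -kv[1])[:n]
    return [(v, round(s, 1)) for v, s in ranked]

print(recommend("0", 10, 10))   # [('4', 17.2), ('1', 5.2)]
```

This reproduces the query result below: videos 4 and 1 are recommended to user 0, with scores 17.2 and 5.2.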


Query result

Recommend up to 10 videos to user 0, using the top 10 attributes  from the client's favorite videos.

//INSTALL QUERY videoRecommendation
RUN QUERY videoRecommendation (0, 10, 10)
{
  "error": false,
  "message": "",
  "results": [
    {"v_set": "Recommend", "v_id": "1",
     "v": {"date_time": 0, "@score": 5.2, "@viewedBySeed": false, "content": "v1"},
     "v_type": "Video"},
    {"v_set": "Recommend", "v_id": "4",
     "v": {"date_time": 0, "@score": 17.2, "@viewedBySeed": false, "content": "v4"},
     "v_type": "Video"}
  ]
}

Back to top



Example 6. People You May Know

This example shows a technique for passing intermediate results from one stage to another.

Input Data : A social network with weighted connections.

Query Task: Recommend the Top K people you may know but who are not yet in your set of connections.  Scoring is based on a variation of cosine similarity of two users:
score(A,B) = ∑x [ (connection strength A→x) × (connection strength x→B) / (degree(A) × degree(B)) ]

This example shows that the computation of a moderately complex formula is simple in the GSQL language. It also demonstrates a technique of copying an attribute from an edge or a source vertex to the (temporary) accumulator of the edge's target vertex:

    tgt.@edge_strength = e.strength

This is a way to "transport" a value as the query travels through the graph .

The graph schema and loading jobs:

Step 1

people_model.gsql
CREATE VERTEX Person (PRIMARY_ID id uint)
CREATE UNDIRECTED EDGE PersonConn (FROM Person, TO Person, strength float)


Then

CREATE GRAPH gsql_demo(*)


Step 2

people_load.gsql
# define the loading jobs
CREATE ONLINE_POST JOB load_peopleV FOR GRAPH gsql_demo {
  LOAD TO VERTEX Person VALUES ($0);
}
CREATE ONLINE_POST JOB load_peopleE FOR GRAPH gsql_demo {
  LOAD TO EDGE PersonConn VALUES ($0,$1,$2);
}

# load the data
RUN JOB load_peopleV USING FILENAME="../people/people_user.dat", SEPARATOR=",", EOL="\n"
RUN JOB load_peopleE USING FILENAME="../people/people_conn.dat", SEPARATOR=",", EOL="\n"

Test data:

people_user.dat
1
2
3
4
5
6
7
8
people_conn.dat
1,2,0.6
2,3,0.5
2,6,0.5
3,6,0.3
3,5,0.2
3,4,0.8
5,8,0.8
6,8,0.2


Step 3

If you have worked through the previous examples, you can perhaps see that we need a two-stage query: from A to A's neighbors, and then from A's neighbors to their neighbors. You may also realize that we will use the ACCUM clause to perform the summation in the second stage. But how will we know, during the second stage, what the strength of the first-stage edge was? By storing a copy of the edge's weight in an accumulator attached to the edge's target vertex, which becomes a source vertex in the second stage.
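The scoring formula can be sketched in plain Python first (a sketch of ours, using the test data below; the nested dictionary plays the role of the undirected weighted graph):

```python
# Sketch (ours) of the peopleYouMayKnow score on the test data.
conns = {("1", "2"): 0.6, ("2", "3"): 0.5, ("2", "6"): 0.5, ("3", "6"): 0.3,
         ("3", "5"): 0.2, ("3", "4"): 0.8, ("5", "8"): 0.8, ("6", "8"): 0.2}

strength = {}
for (a, b), w in conns.items():   # undirected: record both directions
    strength.setdefault(a, {})[b] = w
    strength.setdefault(b, {})[a] = w

def score(a, b):
    # sum over intermediate vertices x of strength(a,x)*strength(x,b),
    # normalized by the degrees of a and b
    total = sum(w * strength[x].get(b, 0.0)
                for x, w in strength[a].items() if x != b)
    return total / (len(strength[a]) * len(strength[b]))

print(score("1", "3"), score("1", "6"))   # approximately 0.075 and 0.1
```

These match the query result below: for Person 1, Person 3 scores 0.075 and Person 6 scores 0.1.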

people_query.gsql
CREATE QUERY peopleYouMayKnow(vertex<Person> startP, int TopK) FOR GRAPH gsql_demo {
  SumAccum<float> @edge_strength = 0;
  SumAccum<int> @depth = 0;
  SumAccum<float> @sum = 0;
  SumAccum<float> @score = 0;
  SumAccum<int> @@startPdegree = 0;

  Start = {startP};

  # first level connections
  L1 = SELECT tgt
       FROM Start:src-(PersonConn:e)->Person:tgt
       ACCUM tgt.@edge_strength = e.strength,    # copy edge strength to target vertex
             tgt.@depth = 1,
             @@startPdegree += src.outdegree();  # save seed outdegree

  # second level connections
  L2 = SELECT tgt2
       FROM L1:u-(PersonConn:e)->Person:tgt2
       WHERE tgt2 != startP AND tgt2.@depth != 1
       ACCUM tgt2.@sum += u.@edge_strength * e.strength
       POST-ACCUM tgt2.@score += tgt2.@sum / (@@startPdegree * tgt2.outdegree())
       ORDER BY tgt2.@score
       LIMIT TopK;

  PRINT L2.@score;
}


Query result:

Recommend up to 10 persons whom Person 1 might like to get to know.

# INSTALL QUERY peopleYouMayKnow
RUN QUERY peopleYouMayKnow (1, 10)
{
  "error": false,
  "message": "",
  "results": [
    {"v_set": "L2", "v_id": "3", "v": {"L2.@score": 0.075}, "v_type": "Person"},
    {"v_set": "L2", "v_id": "6", "v": {"L2.@score": 0.1}, "v_type": "Person"}
  ]
}



Back to top



Example 7. More Social Network Queries

Input Data: A social network in which each user has two attributes (besides their name): the time that they joined the network, and a boolean flag which says whether they are active or not.

Query Tasks: We show several query examples, making use of the time attribute and the directed links between users.

Part 1: Create Graph

social_model.gsql
CREATE VERTEX SocialUser (PRIMARY_ID uid string, name string, isActive bool, registration_timestamp uint)
CREATE DIRECTED EDGE SocialConn (FROM SocialUser, TO SocialUser) WITH REVERSE_EDGE="reverse_conn"

Then

CREATE GRAPH gsql_demo(*)


Part 2: Load Data

Test data:

social_users.csv
id,name,active,timestamp
0,luke,1,1400000000
1,john,1,1410000000
2,matthew,0,1420000000
3,mark,1,143000000
4,paul,1,144000000
5,steven,0,145000000
6,peter,1,146000000
7,james,1,147000000
8,joseph,1,148000000
9,thomas,1,149000000
social_connections.csv
0,1
0,2
0,3
0,4
0,5
1,3
1,4
1,5
1,6
0,7
7,0
7,3
7,4
7,5
0,8
8,3
8,4
0,9
9,3


We have two data files, so we define two ONLINE_POST loading jobs. Note that the file social_users.csv starts with a header line: "id,name,active,timestamp". Online loading does not support header lines in data files, so we use a trick to skip over it. When we run the job, we use the command

RUN JOB -n 2,$ load_social1 USING FILENAME="../social/social_users.csv", SEPARATOR=",", EOL="\n"

The option " -n 2,$ " means "only read data lines from line 2 to the end."

social_load.gsql
# define the loading jobs
CREATE ONLINE_POST JOB load_social1 FOR GRAPH gsql_demo {
  LOAD TO VERTEX SocialUser VALUES ($0,$1,$2,$3) USING QUOTE="double";
}
CREATE ONLINE_POST JOB load_social2 FOR GRAPH gsql_demo {
  LOAD TO EDGE SocialConn VALUES ($0, $1);
}

# load the data
RUN JOB -n 2,$ load_social1 USING FILENAME="../social/social_users.csv", SEPARATOR=",", EOL="\n"
RUN JOB load_social2 USING FILENAME="../social/social_connections.csv", SEPARATOR=",", EOL="\n"



Part 3: Create, install, and run queries.

This case study presents four queries and their results, one at a time, so there are four separate "INSTALL QUERY" commands. Alternatively, all four can be installed at once, which executes faster than separate install commands:
INSTALL QUERY socialFromUser, socialToUser, socialMutualConnections, socialOneWay

or

INSTALL QUERY ALL


Q1 (socialFromUser): find users who have a direct connection from a given input user, with filtering conditions on the candidate users' attributes


socialFromUser from social_query.gsql
CREATE QUERY socialFromUser(vertex<SocialUser> uid, bool is_active, int reg_time_min,
                            int reg_time_max, int k) FOR GRAPH gsql_demo {
  L0 = {uid};
  RESULT = SELECT tgt
           FROM L0:u-(SocialConn)->SocialUser:tgt
           WHERE tgt.registration_timestamp >= reg_time_min
             AND tgt.registration_timestamp <= reg_time_max
             AND tgt.isActive == is_active
           LIMIT k;
  PRINT RESULT;
}

Test query and result:

//INSTALL QUERY socialFromUser
RUN QUERY socialFromUser("0", "true", 0, 147000000, 10)
{
  "error": false,
  "message": "",
  "results": [
    {"v_set": "RESULT", "v_id": "4",
     "v": {"registration_timestamp": 144000000, "name": "paul", "isActive": true},
     "v_type": "SocialUser"},
    {"v_set": "RESULT", "v_id": "3",
     "v": {"registration_timestamp": 143000000, "name": "mark", "isActive": true},
     "v_type": "SocialUser"},
    {"v_set": "RESULT", "v_id": "7",
     "v": {"registration_timestamp": 147000000, "name": "james", "isActive": true},
     "v_type": "SocialUser"}
  ]
}
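For reference, the filtering that socialFromUser performs can be reproduced in a few lines of plain Python over the test data above. This is an illustration of the query's semantics, not TigerGraph code:

```python
# Test data transcribed from social_users.csv and social_connections.csv above.
users = {   # id: (name, is_active, registration_timestamp)
    0: ("luke", True, 1400000000), 1: ("john", True, 1410000000),
    2: ("matthew", False, 1420000000), 3: ("mark", True, 143000000),
    4: ("paul", True, 144000000), 5: ("steven", False, 145000000),
    6: ("peter", True, 146000000), 7: ("james", True, 147000000),
    8: ("joseph", True, 148000000), 9: ("thomas", True, 149000000),
}
edges = [(0,1),(0,2),(0,3),(0,4),(0,5),(1,3),(1,4),(1,5),(1,6),
         (0,7),(7,0),(7,3),(7,4),(7,5),(0,8),(8,3),(8,4),(0,9),(9,3)]

def social_from_user(uid, is_active, t_min, t_max, k):
    result = []
    for (src, tgt) in edges:                       # follow SocialConn edges
        name, active, ts = users[tgt]
        if src == uid and active == is_active and t_min <= ts <= t_max:
            result.append(tgt)
    return sorted(result)[:k]

print(social_from_user(0, True, 0, 147000000, 10))   # -> [3, 4, 7]
```

This matches the query result above: vertices 3 (mark), 4 (paul), and 7 (james). The reverse-direction query Q2 below is analogous, just matching on the edge's target instead of its source.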

Q2 (socialToUser): similar to Q1, but returns users who have a connection pointing to the input user.


socialToUser from social_query.gsql
CREATE QUERY socialToUser(vertex<SocialUser> uid, bool is_active, int reg_time_min,
                          int reg_time_max, int K) FOR GRAPH gsql_demo {
  L0 = {uid};
  Result = SELECT tgt
           FROM L0:u-(reverse_conn)->SocialUser:tgt
           WHERE tgt.registration_timestamp >= reg_time_min
             AND tgt.registration_timestamp <= reg_time_max
             AND tgt.isActive == is_active
           LIMIT K;
  PRINT Result;
}

Test query and result:

//INSTALL QUERY socialToUser
RUN QUERY socialToUser("4", "true", 0, 150000000, 10)
{
  "error": false,
  "message": "",
  "results": [
    {"v_set": "Result", "v_id": "8",
     "v": {"registration_timestamp": 148000000, "name": "joseph", "isActive": true},
     "v_type": "SocialUser"},
    {"v_set": "Result", "v_id": "7",
     "v": {"registration_timestamp": 147000000, "name": "james", "isActive": true},
     "v_type": "SocialUser"}
  ]
}

Q3 (socialMutualConnections): return the set of users who have connections from both input user A and input user B.


socialMutualConnection from social_query.gsql
CREATE QUERY socialMutualConnections(vertex<SocialUser> uid1, vertex<SocialUser> uid2,
                                     bool is_active, int reg_time_min, int reg_time_max,
                                     int k) FOR GRAPH gsql_demo {
  SumAccum<int> @cnt = 0;
  Start = {uid1, uid2};
  Result = SELECT tgt
           FROM Start-(SocialConn)->SocialUser:tgt
           WHERE tgt.registration_timestamp >= reg_time_min
             AND tgt.registration_timestamp <= reg_time_max
             AND tgt.isActive == is_active
           ACCUM tgt.@cnt += 1
           HAVING tgt.@cnt == 2
           LIMIT k;
  PRINT Result;
}


Test query and result:

//INSTALL QUERY socialMutualConnections
RUN QUERY socialMutualConnections("1", "7", "false", 0, 2000000000, 10)
{
  "error": false,
  "message": "",
  "results": [
    {"v_set": "Result", "v_id": "5",
     "v": {"registration_timestamp": 145000000, "@cnt": 2, "name": "steven", "isActive": false},
     "v_type": "SocialUser"}
  ]
}
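The HAVING tgt.@cnt == 2 condition amounts to intersecting the two users' out-neighbor sets. A plain-Python sketch over the same test data (out-edge sets and attributes transcribed from the CSV files above; this illustrates the semantics, not TigerGraph code):

```python
# out[u] = set of users that u points to (from social_connections.csv)
out = {0: {1,2,3,4,5,7,8,9}, 1: {3,4,5,6}, 7: {0,3,4,5}, 8: {3,4}, 9: {3}}
active = {0, 1, 3, 4, 6, 7, 8, 9}       # users whose active flag is 1
ts = {0: 1400000000, 1: 1410000000, 2: 1420000000, 3: 143000000, 4: 144000000,
      5: 145000000, 6: 146000000, 7: 147000000, 8: 148000000, 9: 149000000}

def mutual_connections(u1, u2, is_active, t_min, t_max, k):
    both = out.get(u1, set()) & out.get(u2, set())   # reached from both: @cnt == 2
    keep = [v for v in both
            if (v in active) == is_active and t_min <= ts[v] <= t_max]
    return sorted(keep)[:k]

print(mutual_connections(1, 7, False, 0, 2000000000, 10))   # -> [5]
```

Users 1 and 7 both point to {3, 4, 5}; only user 5 (steven) passes the inactive filter, matching the result above.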


Q4 (socialOneWay): find all A->B user relationships such that there is an edge from A to B but no edge from B to A, and such that A and B share at least a given number of common neighbors.

socialOneWay from social_query.gsql
CREATE QUERY socialOneWay(int mutual_contacts_min = 5, int mutual_contacts_max = 10)
FOR GRAPH gsql_demo {
  typedef tuple<vertex<SocialUser> id, string name, int cnt> recTuple;
  # SumAccum<list<recTuple>> @recList;  # v0.1 to v0.1.2
  ListAccum<recTuple> @recList;         # v0.2

  Start = {SocialUser.*};
  Result = SELECT B
           FROM Start:A-(SocialConn)->SocialUser:B
           # B.neighbors() is a built-in function which returns the list of neighbors which B points to.
           # B.neighbors('edgeType1') returns only the neighbors connected by the given edge type.
           WHERE B NOT IN A.neighbors("reverse_conn")
             AND COUNT(A.neighbors("SocialConn") INTERSECT B.neighbors("SocialConn")) >= mutual_contacts_min
             AND COUNT(A.neighbors("SocialConn") INTERSECT B.neighbors("SocialConn")) <= mutual_contacts_max
           ACCUM B.@recList += recTuple(A, A.name,
                 COUNT(A.neighbors("SocialConn") INTERSECT B.neighbors("SocialConn")));
  PRINT Result;  # the result includes B's static attributes and B.@recList.
}


Test query and result: There are three such pairs:

  1. From vertex 0 to 1.  Vertices 0 and 1 have 3 neighbors in common.
  2. From vertex 0 to 8.  Vertices 0 and 8 have 2 neighbors in common.
  3. From vertex 0 to 9.  Vertices 0 and 9 have 1 neighbor in common.

//INSTALL QUERY socialOneWay
RUN QUERY socialOneWay(1,10)
{
  "error": false,
  "message": "",
  "results": [
    {"v_set": "Result", "v_id": "9",
     "v": {"registration_timestamp": 149000000, "name": "thomas", "isActive": true,
           "@recList": [{"name": "luke", "cnt": 1, "id": "0"}]},
     "v_type": "SocialUser"},
    {"v_set": "Result", "v_id": "1",
     "v": {"registration_timestamp": 1410000000, "name": "john", "isActive": true,
           "@recList": [{"name": "luke", "cnt": 3, "id": "0"}]},
     "v_type": "SocialUser"},
    {"v_set": "Result", "v_id": "8",
     "v": {"registration_timestamp": 148000000, "name": "joseph", "isActive": true,
           "@recList": [{"name": "luke", "cnt": 2, "id": "0"}]},
     "v_type": "SocialUser"}
  ]
}
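The one-way condition itself is easy to check outside the engine. The following Python sketch (an illustration of the semantics, not TigerGraph code) transcribes the edge list above and reproduces the same three pairs:

```python
from collections import defaultdict

# Directed edge list transcribed from social_connections.csv above.
edges = {(0,1),(0,2),(0,3),(0,4),(0,5),(1,3),(1,4),(1,5),(1,6),
         (0,7),(7,0),(7,3),(7,4),(7,5),(0,8),(8,3),(8,4),(0,9),(9,3)}

out = defaultdict(set)
for a, b in edges:
    out[a].add(b)

def one_way(lo, hi):
    """(A, B, common) for each edge A->B with no B->A and lo <= common <= hi."""
    pairs = []
    for a, b in edges:
        if (b, a) not in edges:               # no reciprocal edge
            common = len(out[a] & out[b])     # shared out-neighbors of A and B
            if lo <= common <= hi:
                pairs.append((a, b, common))
    return sorted(pairs)

print(one_way(1, 10))   # the three one-way pairs found by the query above
```

The sketch finds (0, 1) with 3 common neighbors, (0, 8) with 2, and (0, 9) with 1, matching the result list.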

Suggested variant query:

  • socialTwoWay: Find all A↔B relationships such that there are edges both from A to B and from B to A, and A and B have some common neighbors.

Test query and result:

There is one such pair (0, 7), but the query reports it twice: first as (7, 0) and then as (0, 7). Vertices 7 and 0 have 3 neighbors in common.


RUN QUERY socialTwoWay(1,10)
{
  "error": false,
  "message": "",
  "results": [
    {"v_set": "Result", "v_id": "0",
     "v": {"registration_timestamp": 1400000000, "name": "luke", "isActive": true,
           "@recList": [{"name": "james", "cnt": 3, "id": "7"}]},
     "v_type": "SocialUser"},
    {"v_set": "Result", "v_id": "7",
     "v": {"registration_timestamp": 147000000, "name": "james", "isActive": true,
           "@recList": [{"name": "luke", "cnt": 3, "id": "0"}]},
     "v_type": "SocialUser"}
  ]
}
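The suggested variant can likewise be sketched in plain Python; note how each qualifying pair appears twice, once per direction, just as in the result above (illustration of the semantics, not TigerGraph code):

```python
from collections import defaultdict

# Directed edge list transcribed from social_connections.csv above.
edges = {(0,1),(0,2),(0,3),(0,4),(0,5),(1,3),(1,4),(1,5),(1,6),
         (0,7),(7,0),(7,3),(7,4),(7,5),(0,8),(8,3),(8,4),(0,9),(9,3)}

out = defaultdict(set)
for a, b in edges:
    out[a].add(b)

def two_way(lo, hi):
    """(A, B, common) for each edge A->B whose reciprocal B->A also exists."""
    pairs = []
    for a, b in edges:
        if (b, a) in edges:                   # reciprocal edge exists
            common = len(out[a] & out[b])
            if lo <= common <= hi:
                pairs.append((a, b, common))
    return sorted(pairs)

print(two_way(1, 10))   # the one mutual pair, reported in both directions
```

Only users 0 and 7 point at each other, and they share 3 common out-neighbors, so the pair is reported as both (0, 7) and (7, 0).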


Back to top



Part 2: Classic Graph Algorithms


Example 8. Single Pair Shortest Path (unweighted)

The shortest path problem is to find the path(s) between two given vertices S and T in a graph such that the path's total edge weight is minimized. If the edge weights are all the same (e.g., weight = 1), then the shortest paths are the ones with the fewest edges or steps from S to T. The classic solution to this problem is to start the search from one vertex S and walk one step at a time through the graph until the search meets the other input vertex T (unidirectional breadth-first search).

In addition, we present a more sophisticated way to solve this problem on the TigerGraph advanced graph computing platform. Instead of starting the search from one input vertex, our solution launches search agents from both input vertices, walking the graph concurrently until they meet. This greatly improves the algorithm's performance. To simplify the problem, this article assumes the graph is undirected and unweighted.

1. Graph Schema and Data

The following examples will use the graph presented below. Before we show the algorithms, their implementation, and examples, we present the graph schema and data used to create the graph. All files in this document are given here: work_graph_example.zip.

Figure 1: Example Graph used to demonstrate the Shortest Path algorithms.

Graph Schema

First, we give the graph schema. This will create the graph with vertices of type company, persons, and skill. It also creates undirected edges that go from person to company, from person to person, from any type to skill, and from any type to company.

Graph Schema
DROP ALL
CREATE VERTEX company (PRIMARY_ID companyId string, id string, company_name string default "unknown")
CREATE VERTEX persons (PRIMARY_ID pId string, id string, profileId string default "0", registrationDate uint default 0)
CREATE VERTEX skill (PRIMARY_ID skillId uint, id string)
CREATE UNDIRECTED EDGE person_work_company (FROM persons, TO company, positionId double, industryId float, startTime int)
CREATE UNDIRECTED EDGE person_person (FROM persons, TO persons, connect_time int)
CREATE UNDIRECTED EDGE all_to_skill (FROM *, TO skill)
CREATE UNDIRECTED EDGE all_to_company (FROM *, TO company)
CREATE GRAPH work_graph(*)

Data Set

Data source for company vertices.

small_companies
c1,1,com1,us
c2,0,com2,jp
c3,1,com3,de
c4,0,com4,cn

Data source for person vertices and skill vertices. The first line,

m1,i1,0,"s2|s3"

means that person m1 has skills s2 and s3.

small_persons
m1,i1,0,"s2|s3"
m2,i1,1,""
m3,i1,2,"s1|s3"
m4,i1,3,"s2"
m5,i2,4,""
m6,i2,5,""
m7,i2,6,""
m8,i3,7,"s1"

Data source for person_work_company edges.  The first line means that person m1 works for company c1.

small_person_company
m1,c1,1,1,1
m2,c1,2,1,3
m2,c2,2,1,4
m3,c1,2,1,5
m4,c4,2,2,6
m5,c2,3,2,7
m6,c3,3,2,8
m6,c4,3,2,9
m7,c3,3,2,10
m7,c4,3,2,11
m8,c3,3,3,12

Data source for person_person edges.

small_person_person
m1,m4,1
m6,m7,3
m7,m5,4

Data source for all_to_skill edges such as all_to_skill (m1, s2) or all_to_skill (c2, s3). While the schema supports all_to_company edges, this particular data set does not use any.

small_all_to_all
s,m,m1,s2,0
s,m,m4,s2,0
s,m,m1,s3,0
s,m,m3,s3,0
s,c,c2,s3,0
s,m,m3,s1,1
s,m,m8,s1,1

Loading the Data

To load all of this data into the graph, we can use the following GSQL command file (which also includes the graph schema creation commands).

graph_create.gsql
DROP ALL
CREATE VERTEX company (PRIMARY_ID companyId string, id string, company_name string default "unknown")
CREATE VERTEX persons (PRIMARY_ID pId string, id string, profileId string default "0", registrationDate uint default 0)
CREATE VERTEX skill (PRIMARY_ID skillId uint, id string)
CREATE UNDIRECTED EDGE person_work_company (FROM persons, TO company, positionId double, industryId float, startTime int)
CREATE UNDIRECTED EDGE person_person (FROM persons, TO persons, connect_time int)
CREATE UNDIRECTED EDGE all_to_skill (FROM *, TO skill)
CREATE UNDIRECTED EDGE all_to_company (FROM *, TO company)
CREATE GRAPH work_graph(*)

SET sys.data_root="./"

CREATE LOADING JOB load_data FOR GRAPH work_graph {
  LOAD "$sys.data_root/small_companies"
    TO VERTEX company VALUES ($0, $0, $2)
    USING HEADER="false", SEPARATOR=",", QUOTE="double";
  LOAD "$sys.data_root/small_persons"
    TO VERTEX persons VALUES ($0, $0, $1, $2) WHERE to_int($2) >= 0
    USING HEADER="false", SEPARATOR=",", QUOTE="double";
  # Example of flattening a multi-valued field
  LOAD "$sys.data_root/small_persons"
    TO temp_table member_skill_table (memberID, skillID) VALUES ($0, flatten($3, "|", 1))
    USING HEADER="false", SEPARATOR=",", QUOTE="double";
  LOAD temp_table member_skill_table
    TO VERTEX skill VALUES ($"skillID", $"skillID");
  LOAD "$sys.data_root/small_person_company"
    TO EDGE person_work_company VALUES ($0, $1, $2, $3, $4) WHERE to_int($4) >= 0
    USING HEADER="false", SEPARATOR=",", QUOTE="double";
  LOAD "$sys.data_root/small_person_person"
    TO EDGE person_person VALUES ($0, $1, $2) WHERE to_int($2) >= 0
    USING HEADER="false", SEPARATOR=",", QUOTE="double";
  # Note how $0 and $1 indicate what type of data is in $3 and $2, respectively,
  # so that the VALUES $2 and $3 can explicitly state the data type.
  LOAD "$sys.data_root/small_all_to_all"
    TO EDGE all_to_skill   VALUES ($2 company, $3 skill)   WHERE $0 == "s" AND $1 == "c",
    TO EDGE all_to_skill   VALUES ($2 persons, $3 skill)   WHERE $0 == "s" AND $1 == "m",
    TO EDGE all_to_skill   VALUES ($2 skill,   $3 skill)   WHERE $0 == "s" AND $1 == "s",
    TO EDGE all_to_company VALUES ($2 company, $3 company) WHERE $0 == "c" AND $1 == "c",
    TO EDGE all_to_company VALUES ($2 persons, $3 company) WHERE $0 == "c" AND $1 == "m",
    TO EDGE all_to_company VALUES ($2 skill,   $3 company) WHERE $0 == "c" AND $1 == "s"
    USING HEADER="false", SEPARATOR=",", QUOTE="double";
}

RUN JOB load_data
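The flatten($3, "|", 1) call in the job above splits the multi-valued skill field into one temp-table row per skill. The semantics can be sketched in plain Python (flatten_rows is a hypothetical helper for illustration, not part of GSQL):

```python
def flatten_rows(csv_line):
    """Expand one small_persons row into (memberID, skillID) pairs,
    one per '|'-separated skill, mimicking flatten($3, '|', 1)."""
    fields = csv_line.split(",")
    member, skills = fields[0], fields[3].strip('"')
    return [(member, s) for s in skills.split("|") if s]

print(flatten_rows('m1,i1,0,"s2|s3"'))   # one (member, skill) row per skill of m1
```

A row with an empty skill field, such as m2's, simply produces no temp-table rows.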

To run a command file, simply enter gsql name_of_file

Create Graph and Load Data
> gsql graph_create.gsql

2. Unidirectional (BFS) Algorithm

If the edges are unweighted, then the shortest path can be found using the classic Breadth-First Search (BFS) algorithm. Below is an implementation in the GSQL Query Language:

shortest_path_1D.gsql
/**
 * This query assumes every edge in the graph is undirected.
 * It uses breadth-first search to find the shortest path between S and T.
 */
CREATE QUERY shortest_path_1D (VERTEX<company> S, VERTEX<company> T, INT maxDepth)
FOR GRAPH work_graph {
  OrAccum @@found = false;
  OrAccum @notSeen = true;
  ListAccum<STRING> @pathResult;

  Start (ANY) = {S};
  Start = SELECT v
          FROM Start:v
          // assume each vertex has an id attribute
          ACCUM v.@notSeen = false,
                v.@pathResult = v.id;

  WHILE NOT @@found LIMIT maxDepth DO
    Start = SELECT v
            FROM Start - (:e) -> :v
            WHERE v.@notSeen
            ACCUM v.@notSeen = false,
                  // add partial result paths to target v
                  v.@pathResult += (Start.@pathResult + "-") + v.id,
                  CASE WHEN v == T THEN @@found += true END;
  END;
  IF @@found THEN
    Result = {T};
    PRINT Result.@pathResult;
  ELSE
    PRINT "Can't find shortest path within max steps";
  END;
}
INSTALL QUERY shortest_path_1D

The algorithm works by expanding the search path through all vertices that were seen in the previous step. Each step is taken by one iteration of the WHILE loop. In the first iteration of the WHILE loop, we start at vertex S and travel to all its neighbors. In each of the following iterations, we travel from previously reached vertices to their neighbors that have not already been seen by the path.
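The same layer-by-layer expansion can be sketched in plain Python over the example graph's edge list (transcribed from the small_* data files above). Like the GSQL query, it keeps a list of path strings per vertex so that multiple shortest paths are all recorded; this is an illustration of the algorithm, not TigerGraph code:

```python
from collections import defaultdict

# Undirected edges transcribed from small_person_company, small_person_person,
# and small_all_to_all above.
edge_list = [
    ("m1","c1"),("m2","c1"),("m2","c2"),("m3","c1"),("m4","c4"),
    ("m5","c2"),("m6","c3"),("m6","c4"),("m7","c3"),("m7","c4"),("m8","c3"),
    ("m1","m4"),("m6","m7"),("m7","m5"),
    ("m1","s2"),("m4","s2"),("m1","s3"),("m3","s3"),("c2","s3"),
    ("m3","s1"),("m8","s1"),
]
nbrs = defaultdict(set)
for a, b in edge_list:
    nbrs[a].add(b)
    nbrs[b].add(a)

def shortest_paths(s, t, max_depth):
    paths = {s: [s]}              # vertex -> list of path strings reaching it
    frontier, seen = {s}, {s}
    for _ in range(max_depth):    # one BFS layer per iteration of the WHILE loop
        nxt = defaultdict(list)
        for u in frontier:
            for v in nbrs[u]:
                if v not in seen:                      # v.@notSeen
                    nxt[v] += [p + "-" + v for p in paths[u]]
        if not nxt:
            return []
        seen.update(nxt)
        paths.update(nxt)
        frontier = set(nxt)
        if t in nxt:              # @@found
            return sorted(paths[t])
    return []

print(shortest_paths("c1", "c3", 10))   # -> ['c1-m3-s1-m8-c3']
print(shortest_paths("c3", "c4", 10))   # -> ['c3-m6-c4', 'c3-m7-c4']
```

The two calls reproduce the query results shown later: a unique shortest path from c1 to c3, and two equally short paths from c3 to c4.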

To install the query, run the following command:

Install Query
> gsql shortest_path_1D.gsql

Example of Unidirectional BFS Search

Let us show a running example of this algorithm. We will be trying to find the shortest path from c1 to c3. First, we have our initial graph, where we have not traveled along any edges yet.

Figure 2: The starting state for our graph. From here, we go on to the first step of the algorithm. We start at c1, and go along each of its edges.


Figure 3: This is the graph after one step. We have traveled from c1 to all of its neighbors, labeling them as visited. For each one that we visit, we update its @pathResult accumulator value in order to keep track of our path as we traverse the graph.


Figure 4. This graph shows where we have traveled after two steps. We traveled to our new vertices s1, s2, s3, c2 and m4 by traveling one edge away from the nodes that we had visited in step 1. Note that the blue edges also tell us how we can get from c1 to a vertex.  For example, we notice that e21 is not labeled blue. This means that we did not travel along this edge. That is, we must have gotten to c2 using a different edge. Indeed, we can see that the path c1-m2-c2 is shorter than c1-m3-s3-c2. This explains why e9 is blue, but e21 is not.

Each time that the query travels from a starting vertex (m1, m2, or m3) to a target vertex (s1, s2, s3, c3, or m4), the target vertex's @pathResult ListAccum<string> is updated (the v.@pathResult += ... statement in the query's ACCUM clause). A new string is added to the list (the += operator), which means that there is a path string for each time that the target vertex is reached. The path string consists of the path string from the source vertex, followed by this target vertex. That is equivalent to the path from the query's starting vertex (e.g., c1) to the current target vertex.


Figure 5: At the third step of our algorithm, we have reached the nodes m8, m5 and c4. We got here by moving one edge away from the vertices that we reached in step 2.


Figure 6: Finally, we have reached the end of our algorithm. Note that when we travel one edge away from m8, we arrive at our target node of c3. Working backwards, we can reconstruct the shortest path. We reached c3 from m8, m8 from s1, s1 from m3 and m3 from c1. Thus, we get that the shortest path is indeed c1-m3-s1-m8-c3. It is important to note that if we reached a target vertex V at the same time from more than one of its neighbors, then this implies there are multiple shortest paths. If N paths arrive at target vertex v, then our algorithm records N strings in the @pathResult ListAccum<string> accumulator.

To run the query with starting vertex c1, ending vertex c3, and a maximum distance of 10:

Query
> gsql 'RUN QUERY shortest_path_1D("c1","c3",10)'

This will give the following result.

Results
{
  "error": false,
  "message": "",
  "results": [
    {"v_set": "Result", "v_id": "c3",
     "v": {"Result.@pathResult": ["c1-m3-s1-m8-c3"]},
     "v_type": "company"}
  ]
}

As we can see, the algorithm tells us that the shortest path from c1 to c3 goes through m3, followed by s1, then m8, finally arriving at c3. The single-element result list also tells us that this is the unique shortest path. Indeed, if we instead run

Multiple Shortest Paths Query
> gsql 'RUN QUERY shortest_path_1D("c3", "c4", 10)'

our results are:

Multiple Shortest Paths Results
{
  "error": false,
  "message": "",
  "results": [
    {"v_set": "Result", "v_id": "c4",
     "v": {"Result.@pathResult": ["c3-m6-c4", "c3-m7-c4"]},
     "v_type": "company"}
  ]
}

Note that here we have two paths. The first is from c3 to m6, and then to c4. The other is from c3, to m7, to c4. We are presented with both paths because each of them has the least possible weight: exactly two edges. As explained earlier, this happens because we arrive at a vertex at the same time through two different paths. When we started at c3, we traveled to m6, m7, and m8. At the second step, both m6 and m7 arrive at c4 at the exact same time. That means that two path strings are written to c4.@pathResult, recording two shortest paths.

3. Bi-Directional Shortest Path Search Algorithm

Bi-Directional search will launch two search agents, each from a given vertex. The two agents concurrently walk one step at a time, until they meet at an intermediate vertex. The shortest path length may be odd or even.  For example, in Figure 7 below, Case II is an even-length case, and Case III is an odd-length case. Case I is a special case of an odd-length path.

The core of this solution is that in each step, a set of previously unvisited vertices will be discovered by the search frontiers of S and T.  The newly visited vertices will become the new frontier of  S or T. The algorithm will repeat this process until the frontiers of the two agents meet.



Figure 7 : Three cases for terminating a bi-directional path search.

Because this algorithm is more complicated than one directional search, we first give pseudocode to help explain the algorithm.

bi-directional shortest path search algorithm
void find_shortest_path_bi_directional_search(Vertex S, Vertex T) {
  bool stop = false;
  vertex.pathFromS = "";
  vertex.pathFromT = "";
  vertex.visitedByS = false;
  vertex.visitedByT = false;
  final_path = "";

  Activate vertex S and T as the starting vertices;
  S.visitedByS = true;
  T.visitedByT = true;

  // VERTEX GROUP: if a vertex V is visited by a search originating from vertex T,
  // then V is part of vertex group T. The vertices with the longest path
  // from their origin vertex are called the FRONTIER of the vertex group.
  // Initially, S is the frontier and only member of vertex group S,
  // and T is the frontier and only member of vertex group T.
  while (!stop) {
    VS = (frontier of Vertex Group S) union (frontier of Vertex Group T);
    for each vertex v in VS {
      for each neighbor vertex b of v {
        if ((v.visitedByS && b.visitedByT) || (v.visitedByT && b.visitedByS)) {
          // The frontiers of S and T are neighbors (Case III in Figure 7)
          if (v.visitedByS) { final_path = v.pathFromS + v.ID + b.ID + b.pathFromT; }
          if (v.visitedByT) { final_path = v.pathFromT + v.ID + b.ID + b.pathFromS; }
          stop = true;
          break;
        } else if ((v.visitedByS && not b.visitedByS) || (v.visitedByT && not b.visitedByT)) {
          // If b (the neighbor of v) is not yet part of v's vertex group,
          // then add b to the vertex group, and update b's path to the origin.
          if (v.visitedByS) { b.visitedByS = true; b.pathFromS = v.pathFromS + v.ID; }
          if (v.visitedByT) { b.visitedByT = true; b.pathFromT = v.pathFromT + v.ID; }
        }
      }
      // If a vertex is visited by S & T in the same iteration (Case II, Figure 7)
      if (v.visitedByT && v.visitedByS) {
        final_path = v.pathFromS + v.ID + v.pathFromT;
        stop = true;
        break;
      }
    }
  }
  print out final_path;
}

This algorithm essentially works by running two versions of the algorithm from the first example at the same time, just with different starting vertices. The algorithm continues with these two paths until there is an intersection. Once the two paths cross, we know that the shortest path goes through this intersection, as explained in the previous section.
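The frontier bookkeeping can be reduced to a length-only sketch in plain Python: expand both frontiers one layer per round, and stop as soon as some vertex carries a distance from both sides (path-string reconstruction is omitted here; the GSQL query handles that part). The edge list is transcribed from the small_* data files above; this is an illustration, not TigerGraph code:

```python
from collections import defaultdict

# Undirected edges transcribed from small_person_company, small_person_person,
# and small_all_to_all above.
edge_list = [
    ("m1","c1"),("m2","c1"),("m2","c2"),("m3","c1"),("m4","c4"),
    ("m5","c2"),("m6","c3"),("m6","c4"),("m7","c3"),("m7","c4"),("m8","c3"),
    ("m1","m4"),("m6","m7"),("m7","m5"),
    ("m1","s2"),("m4","s2"),("m1","s3"),("m3","s3"),("c2","s3"),
    ("m3","s1"),("m8","s1"),
]
nbrs = defaultdict(set)
for a, b in edge_list:
    nbrs[a].add(b)
    nbrs[b].add(a)

def bidir_shortest_length(s, t, max_depth):
    if s == t:
        return 0
    dist_s, dist_t = {s: 0}, {t: 0}       # distances from each origin
    front_s, front_t = {s}, {t}           # current frontiers
    for _ in range(max_depth):
        # Advance each frontier by one layer.
        for dist, front in ((dist_s, front_s), (dist_t, front_t)):
            new = set()
            for u in list(front):
                for v in nbrs[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        new.add(v)
            front.clear()
            front.update(new)
        # Any vertex settled from both sides closes a path.
        meets = dist_s.keys() & dist_t.keys()
        if meets:
            return min(dist_s[v] + dist_t[v] for v in meets)
    return -1   # no path within max_depth rounds

print(bidir_shortest_length("c1", "c3", 10))   # meets at s1 after two rounds: 4
print(bidir_shortest_length("c3", "c4", 10))   # meets at m6/m7 after one round: 2
```

On the example graph, the two frontiers meet at s1 after two rounds (length 4), mirroring the walkthrough in Figures 8-10 below.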

Below is an implementation in the GSQL Query Language.

shortest_path_2D.gsql
CREATE QUERY shortest_path_2D (VERTEX<company> S, VERTEX<company> T, INT maxDepth)
FOR GRAPH work_graph {
  OrAccum @@stop = false;          // global flag controlling whether to continue the search
  OrAccum @seenS = false;          // a vertex has been seen by S
  OrAccum @seenT = false;          // a vertex has been seen by T
  OrAccum @meet = false;           // vertex flag indicating whether it is a 'meeting' point
  SumAccum<int> @sLength = 0;      // vertex runtime attribute: # steps from origin vertex S
  SumAccum<int> @tLength = 0;      // vertex runtime attribute: # steps from origin vertex T
  SumAccum<int> @resultLength = 0; // the final length of the shortest path
  ListAccum<string> @pathS;        // list of paths so far from S
  ListAccum<string> @pathT;        // list of paths so far from T
  ListAccum<string> @pathResults;  // final set of shortest paths

  Start = {S,T};  // initialize S, T vertices
  StartSet (ANY) = SELECT v   // ANY means StartSet can contain any vertex type
          FROM Start:v
          ACCUM CASE
                  WHEN v == S THEN v.@seenS = true, v.@pathS += ""
                  WHEN v == T THEN v.@seenT = true, v.@pathT += ""
                END;

  WHILE (@@stop == false) LIMIT maxDepth DO
    StartSet = SELECT v
        // Consider each edge from S or T's frontier (u) to outside (v),
        // i.e., each edge that moves "out" from the frontier.
        // Note how StartSet is updated to be v (pushing the frontier forward).
        FROM StartSet:u-(:e)->:v
        WHERE ((u.@seenS == true AND v.@seenS != true) OR  // from S frontier to outside
               (u.@seenT == true AND v.@seenT != true))    // from T frontier to outside
        ACCUM
          // If u->v joins the S and T frontiers, an odd-length path is found (Case III)
          CASE WHEN ((u.@seenS == true AND v.@seenT == true) OR
                     (u.@seenT == true AND v.@seenS == true)) THEN
            @@stop += true,
            // We don't want to record the results twice:
            // v.@pathResults stores all shortest paths between S and T
            // where v is in the middle of every such path,
            // so we only record the result once, from the S side.
            CASE WHEN (u.@seenS == true AND v.@seenT == true) THEN
              v.@pathResults += (u.@pathS + (u.id + "-" + v.id + "-")) * v.@pathT,
              v.@meet = true,
              v.@resultLength = u.@sLength + v.@tLength + 1
            END
          // Else, since u->v does not complete a path, move the frontiers forward.
          // If u is in S's frontier, then extend S's frontier to v. Assign path info to v.
          WHEN u.@seenS == true THEN
            v.@seenS = true,
            v.@pathS += u.@pathS + (u.id + "-"),
            v.@sLength = u.@sLength + 1
          // If u is in T's frontier, then extend T's frontier to v. Assign path info to v.
          WHEN u.@seenT == true THEN
            v.@seenT = true,
            v.@pathT += (u.id + "-") + u.@pathT,
            v.@tLength = u.@tLength + 1
          END
        POST-ACCUM
          // If the two frontiers meet at v, an even-length path is found (Case II)
          CASE WHEN (v.@seenS == true AND v.@seenT == true AND @@stop == false) THEN
            @@stop += true,
            // Append v.id to every path in @pathS
            v.@pathS = v.@pathS + (v.id + "-"),
            v.@pathResults = v.@pathS * v.@pathT,
            v.@resultLength = v.@sLength + v.@tLength,
            v.@meet = true
          END;
  END;
  // Print the final result stored at the vertex marked as the meeting vertex.
  StartSet = SELECT v
             FROM StartSet:v
             WHERE v.@meet == true;
  PRINT StartSet.@resultLength, StartSet.@pathResults;
}
INSTALL QUERY shortest_path_2D

Example of Bidirectional BFS Search

The following is a running example to demonstrate the algorithm of finding the shortest path in a bi-directional way. The graph below (Figure 8) shows vertices c1 and c3, with several other vertices between them. The algorithm will demonstrate the two search directions by using two different colors and border thicknesses:

  • Blue and thin border for c1's search frontier
  • Orange and thick border for c3's search frontier


Figure 8: Initialization - prepare to start the search process. The two given vertices (c1 and c3) are activated and colored as Blue and Orange respectively. The rest of the graph remains untouched.


Figure 9: The graph after the first step. The search process starts simultaneously from c1 and c3. If a vertex is seen by the agent starting from c1 (c3), we will say it is seen by c1 (c3).

  • From the vertex c1, the algorithm goes to the neighbors of c1 that have not yet been seen. As a result, the unseen vertices m1, m2 and m3 are discovered and become the frontier of c1's vertex group.
  • From the vertex c3, in a similar fashion, the vertices m6, m7 and m8 are discovered and become the frontier of c3's vertex group.


Figure 10: As the two groups have not yet met, the search process continues.

  • From c1's search agent, the vertices m4, s2, c2, s3 and s1 are all discovered.
  • From c3's search agent, the vertices c4, m5 and s1 are all discovered.

Notice that both search agents have found the vertex s1. Thus, the algorithm should stop, and return the path going through s1. In this case, this path is c1-m3-s1-m8-c3.

In order to get this result in the TigerGraph Query Language (GSQL), first install the query, for which the code was given earlier.

Install Query
> gsql shortest_path_2D.gsql

Now, run the query using c1 as a starting node, c3 as the ending node, and a maximum distance of 10:

Query
> gsql 'RUN QUERY shortest_path_2D("c1","c3",10)'

This will return the following result:

Results
{
  "error": false,
  "message": "",
  "results": [
    {"v_set": "StartSet", "v_id": "s1",
     "v": {"StartSet.@pathResults": ["c1-m3-s1-m8-c3-"],
           "StartSet.@resultLength": 4},
     "v_type": "skill"}
  ]
}


However, in order to demonstrate the odd-length case, assume that s1 does not exist.


Figure 11: 2nd iteration in a modified graph in which s1 does not exist. We got here by traveling one edge away from the vertices that were visited in the previous step. However, as we do not yet have a crossing, we must complete one more iteration.



Figure 12: Here, the paths from c1 have finally found a vertex that was previously found by the paths from c3 (and vice versa). That is, the blue paths traveled from c2 to m5 and from m4 to c4. In Figure 11, m5 and c4 were both orange. In Figure 12, we change a vertex's color to purple when one frontier meets the other. This tells us that the shortest path from c1 to c3 either goes through e8 or e3. If we go through e8, we go along the path c1-m2-c2-m5-m7-c3. Note that if we go through e3, we are given two paths. This is almost identical to the multiple path example from the first algorithm. From c4, we can either take e4 or e12 to get to c3. Thus, when going from c1 to c3 through e3, we are actually given two paths. These paths are c1-m1-m4-c4-m6-c3 and c1-m1-m4-c4-m7-c3.

The * operator (used in the ACCUM and POST-ACCUM clauses of shortest_path_2D) handles the case of multiple paths from one direction merging with multiple paths from the other direction. For example, we know there are two shortest paths from c4 to c3. Pretend for a moment that there are 3 shortest paths from c1 to m4. Then, when m4 and c4 meet, there would be (3 * 2) = 6 shortest paths from c1 to c3.
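The merging behaves like a Cartesian product of path lists, which a short Python sketch makes concrete (the path strings here are hypothetical placeholders, not real paths in the example graph):

```python
from itertools import product

# Hypothetical partial paths meeting at some middle vertex:
paths_from_s = ["c1-a-", "c1-b-", "c1-c-"]   # 3 prefixes ending at the meeting point
paths_to_t   = ["x-c3-", "y-c3-"]            # 2 suffixes leaving the meeting point

# Every prefix pairs with every suffix, just like the * operator on ListAccums.
combined = [p + q for p, q in product(paths_from_s, paths_to_t)]
print(len(combined))   # 3 * 2 = 6 complete shortest paths
```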

Once again, we can implement this alternate graph in GSQL by using the DELETE keyword. First, we delete the vertex s1 from the graph by doing the following:

Remove "s1"
> gsql 'DELETE FROM skill WHERE primary_id=="s1"'

Now, we can run our query once again:

Query
> gsql 'RUN QUERY shortest_path_2D("c1","c3",10)'

Notice that this time, we are given the three paths that we previously described.

Results
{
  "error": false,
  "message": "",
  "results": [
    {"v_set": "StartSet", "v_id": "c4",
     "v": {"StartSet.@pathResults": ["c1-m1-m4-c4-m6-c3-", "c1-m1-m4-c4-m7-c3-"],
           "StartSet.@resultLength": 5},
     "v_type": "company"},
    {"v_set": "StartSet", "v_id": "m5",
     "v": {"StartSet.@pathResults": ["c1-m2-c2-m5-m7-c3-"],
           "StartSet.@resultLength": 5},
     "v_type": "persons"}
  ]
}

Back to top


Appendix: Start-to-End Process and Data Flow

The figures below illustrate the sequence of steps and the dependencies to progress from no graph to a loaded graph and a query result.

Figure A1: Complete GSQL Workflow, using Online Loading