
Demo Highlights of the Relational Knowledge Graph System

We are building a Relational Knowledge Graph System. We have a big vision for this product, and you can learn more about that vision in other videos on our website.

Today we are going to focus on Data Applications built on Knowledge Graphs.

We’re going to walk through some highlights of a demo that uses a Knowledge Graph to solve a business problem.

This demo is called the “Knowledge Graph to Learn Knowledge Graphs”, or KGLKG. It parallels my own journey from programming in imperative languages like Java, C++, and Python to RAI’s declarative language, Rel. For many years, I used those legacy languages plus SQL in an application-centric, database-oriented way of solving business problems. But today I use Rel and a data-centric business modeling approach to create business solutions.

If you have seen the Overview video, or had a presentation from our Sales team, you will recognize this reference architecture diagram for Data Apps built on the RAI platform.

Here is how the KGLKG Data App fits into this architecture.

A full demo would walk through these components, but we’re going to jump ahead to the demo itself.

A Jupyter notebook will serve as the front-end for today’s demo.

The business problem revolves around choosing the most relevant learning materials to bring a new Sales or Sales Engineering hire up to speed. Documents are suggested based on their conceptual content, quality, degree of difficulty, and appropriateness for a sequenced learning plan.

We model the business problem with this Knowledge Graph. Let’s build it up piece by piece.

We begin with the core nodes and relationships (or edges), which center on the learning materials and the concepts or topics they contain. These learning materials include PDFs, HTML pages such as blog posts and e-zine articles, videos, and so on. We’re going to focus on PDFs and HTML pages.

We ingest a lemma CSV containing document names, concept names, and concept weights (the number of occurrences of that concept in the document).
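
To make that concrete, here is a minimal sketch of the ingest in Rel. The file path and column names are assumptions for illustration, not the demo’s actual data source.

    // hypothetical location and column names for the lemma CSV
    def lemma_config:path = "azure://example-account/kglkg/lemmas.csv"
    def lemma_config:syntax:header_row = 1
    def lemma_config:schema:document = "string"
    def lemma_config:schema:concept = "string"
    def lemma_config:schema:weight = "int"

    // load the CSV into a relation keyed by column name and row position
    def lemma_csv = load_csv[lemma_config]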

We define relations in Rel for the document and concept nodes. These are based on the lemma CSV. Here are the documents…
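
Assuming the columns sketched above, the node relations can be read straight off the CSV; this is a hedged sketch, not the demo’s exact code.

    // each distinct value in a column becomes a node
    def document(d) = lemma_csv:document(_, d)
    def concept(c) = lemma_csv:concept(_, c)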

The About edges link the Document and Concept nodes. A document is about some set of concepts. And each concept has some number of documents that mention it. Each such About relationship has a weight attribute, the number of times that concept appears in that document.
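
Under the same assumptions, the About edges and their weight attributes come from joining the CSV columns on the row position:

    // an About edge exists when a row pairs document d with concept c
    def about(d, c) = exists(pos :
        lemma_csv:document(pos, d) and lemma_csv:concept(pos, c))

    // the weight attribute: occurrences of concept c in document d
    def about_weight(d, c, w) = exists(pos :
        lemma_csv:document(pos, d) and
        lemma_csv:concept(pos, c) and
        lemma_csv:weight(pos, w))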

We can list all the About edges or about_weight attributes, but here let’s query the about_weight attribute and use a regular expression to find the documents and about_weights for concept words that contain the substring “graph”.
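
One plausible way to write that query in Rel (hedged: regex_match tests the whole string against the pattern, so it needs leading and trailing wildcards):

    // documents and weights for concepts containing "graph"
    def output(d, c, w) =
        about_weight(d, c, w) and regex_match(".*graph.*", c)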

Here’s the Knowledge Graph fragment we’ve created so far annotated with example data.

In addition to querying attributes from the loaded data, we can compute knowledge from the nodes and relationships.

These computations can be executed as runtime queries.

Or, they can be generalized and declaratively defined as relations so that this computed knowledge becomes part of our Knowledge Graph, available for simple query.
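
For example, a per-document total of concept occurrences can be declared once and then stays current like any other relation. A sketch using Rel’s sum aggregate over the about_weight relation assumed earlier:

    // total concept occurrences per document, grouped by d
    def doc_total[d] = sum[about_weight[d]]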

Our diagram highlights the new computed attributes added to our Knowledge Graph. These are automatically recomputed as the underlying data changes, say when a new document and its concepts are added to the graph. Contrast this with other graph databases, which require you to explicitly rerun queries to recompute attributes that may be affected by data updates.

To suggest documents based on their content, we compute the “focus” of each document on each of the concepts it mentions. The about_focus attribute becomes a queryable part of our knowledge graph. But for convenience, we create an additional relation, called “suggested” to get us a Top-N list of documents for a specified concept.

The definition shown here is unbounded, too general to be precomputed, so we mark it with “@inline” to defer evaluation of the relation until it is used in a query for a specific concept. You might even think of the suggested relation as similar to a function in a procedural language.
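
The video doesn’t spell out the formulas, so the following is only one plausible shape for them, building on the doc_total sketch above: focus as a concept’s share of a document’s total occurrences, and suggested as an @inline top-N wrapper.

    // a plausible focus measure: concept c's share of document d's occurrences
    def about_focus[d in document, c in concept] =
        about_weight[d, c] / doc_total[d]

    // deferred top-N: evaluated only when a concept is supplied at query time;
    // top selects by the relation's sort order, so we lead with the negated
    // focus to rank the most focused documents first
    @inline def suggested[c] =
        top[10, {(nf, d) : nf = -about_focus[d, c]}]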

Here’s a query that joins the top-N list for “graph” with the top-N list for “ai” to get a top-N list of documents about BOTH topics.
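
With the suggested sketch above (where top returns a rank before the tuple), that join might look like this:

    // documents appearing in the top-N for both "graph" and "ai"
    def output(d) =
        suggested["graph"](_, _, d) and suggested["ai"](_, _, d)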

So far we have based our suggestions purely on statistical analysis of the documents. But now we want to include knowledge from Reviewers who rank and rate these documents.

We add a Person node with a dynamic role attribute that can track their status as learner, employee, or curator of the learning materials.
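
A toy sketch of the Person node and its role attribute; the names and role assignments here are hypothetical.

    // hypothetical people and a many-valued, changeable role attribute
    def person = {"Ben"; "Steve"}
    def has_role = {("Ben", "employee"); ("Ben", "curator");
                    ("Steve", "employee")}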

The full demo shows how we dynamically add data, perform ELT, and use integrity constraints to guarantee our data conforms to our business rules. Right now, we’re going to skip ahead to the Reviewer role.
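
As one example of such a constraint, here is a hedged sketch that restricts roles to a known set; the set itself (including the Reviewer role used next) is an assumption.

    // every assigned role must come from the known role set
    def valid_role = {"learner"; "employee"; "curator"; "reviewer"}

    ic role_is_valid(p, r) {
        has_role(p, r) implies valid_role(r)
    }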

Any employee can review documents, and they are not confined to the curated library of materials. In fact, reviewing documents outside the curated library is how new material is found for the library.

In this example, we have two reviewers, Ben and Steve. Ben leads a sales team and reviews documents with an eye towards onboarding new sales hires. He tracks his reading in a Google Sheet called Sales Onboarding Plan, or SOP for short.

Steve leads a sales engineering team and reviews documents with an eye towards onboarding new sales engineering hires. He tracks his reading in a Google Sheet called Hersker’s Learning Curve, or HLC for short.

Reflecting the real world, there was no standard format for these tracking sheets, so Ben and Steve each created different formats that must be reconciled for our analysis. In his sheet, Ben created document titles that were linked to the URL. Steve used separate columns for the title and the URL. Both used some Google Drive URLs which convey no human-readable information, but which can be resolved to the actual on-drive file names.

We define brief and detailed views of the input CSVs from each reviewer’s Google Sheet. Here’s the brief data for each.
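
Assuming the two sheets are loaded as sop_csv and hlc_csv with a title column (names invented for illustration), a brief view can be as simple as:

    // brief views: just the titles each reviewer tracked
    def sop_brief(title) = sop_csv:title(_, title)
    def hlc_brief(title) = hlc_csv:title(_, title)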

The detailed views include extracted URLs and names generated during data preparation. But the data-ingest process left some entries we don’t want.

For example, there are many empty strings in the extractedName relation, which we want to eliminate because they violate sixth normal form (6NF).

So, we perform some ELT to clean up our reviewer data.
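
A sketch of that cleanup, assuming extractedName is a base relation keyed by row and updated with Rel’s delete control relation:

    // remove ingest artifacts: delete every empty-string name
    def delete[:extractedName](row, name) =
        extractedName(row, name) and name = ""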

Now we have a clean extractedName relation.

To reconcile these reviewer lists, we want to match names and find the intersections between the two lists, and between each list and the curated library. But we will explore those next steps in other videos.
