Isoxya 2.1.0

Isoxya web crawler is an internet data processing system, representing years of research into building next-generation crawlers and scrapers. It comes in two editions: Community Edition (CE), a free and open-source (BSD 3-Clause) mini crawler, suitable for small crawls on a single computer; and Pro Edition (PE), a commercial and closed-source distributed crawler, suitable for small, large, and humongous crawls on high-availability clusters of multiple computers. Both editions utilise flexible plugins, allowing numerous programming languages to be used to extend the core engine via JSON interfaces. Plugins written for Isoxya CE should typically scale to Isoxya PE with minimal or no changes.

Features

Feature Community Edition (CE) Pro Edition (PE)
Licence open-source commercial
API
CLI scripts
Plugins 3+ 3+
·
Authentication Tigrosa
Database SQLite PostgreSQL
Cache Redis
Message broker RabbitMQ
·
High-availability
Horizontal scaling
Error recovery
Resource management
·
Concurrent crawls 1 ∞¹
Pages/crawl ∞²⁺³ ∞¹
User-agents ∞¹
Rate-limit (reqs/s) 1/10³ ∞¹⁺⁴
·
Robots.txt
Crawl max pages
Crawl max depth
List crawls
External link check
Crawl cancellation
Organisations
·
Crawler channels 1 ∞¹
Processor channels 1 ∞¹
Streamer channels 1 ∞¹
·
OS variant Linux Linux
Packaging container container
Support community direct
·
Price free on request

Features and limits are indicative only, not guarantees. ∞ indicates many, not infinite!
¹ depending on licence and infrastructure
² no hard limit, but small as single-process
³ not configurable
⁴ set globally per-site; configurable for on-prem only

Installation

SQLite

(CE)

SQLite 3 is required as the main datastore for Isoxya CE. Since this is an embedded database, it should typically work out of the box with no setup.

Tigrosa

(PE)

Tigrosa 2 is required as an authentication proxy for Isoxya PE. The full set of routes supported by Isoxya is the set of Tigrosa routes plus the Isoxya routes detailed here. Tigrosa can be installed before or after Isoxya, but the two depend on each other at runtime: Tigrosa will not run until Isoxya is running, and Isoxya cannot be initialised until Tigrosa is running.

Different databases within the same PostgreSQL and Redis servers can be used if required. Installing Tigrosa within the same PostgreSQL or Redis databases as Isoxya is not supported. It is strongly recommended to use different PostgreSQL users for Tigrosa, as detailed in the documentation.

PostgreSQL

(PE)

create users and database

CREATE USER isx_dev;

CREATE DATABASE isx_dev OWNER isx_dev;

CREATE USER isx_dev_pe_auto;
CREATE USER isx_dev_pe_crwl;
CREATE USER isx_dev_pe_proc;
CREATE USER isx_dev_pe_strm;

GRANT isx_dev TO isx_dev_pe_auto;
GRANT isx_dev TO isx_dev_pe_crwl;
GRANT isx_dev TO isx_dev_pe_proc;
GRANT isx_dev TO isx_dev_pe_strm;

PostgreSQL 13 is required as the main datastore for Isoxya PE. Other recent versions may also work.

For high-availability (PE), a number of options are possible, such as active-passive with automatic failover, or multi-master where available. Depending on the strategy used, a minimum of either 2 or 3 nodes is recommended.

Redis

(PE)

Redis 6 is required as a cache for Isoxya PE. Other recent versions are also likely to work.

For high-availability (PE), an active-passive setup with automatic failover is possible. Depending on the strategy used, a minimum of either 2 or 3 nodes is recommended.

RabbitMQ

(PE)

create vhost and users

rabbitmqctl add_vhost isx_dev &&

rabbitmqctl add_user isx_dev         $(openssl rand -base64 32) &&
rabbitmqctl add_user isx_dev_pe_auto $(openssl rand -base64 32) &&
rabbitmqctl add_user isx_dev_pe_crwl $(openssl rand -base64 32) &&
rabbitmqctl add_user isx_dev_pe_proc $(openssl rand -base64 32) &&
rabbitmqctl add_user isx_dev_pe_strm $(openssl rand -base64 32) &&

rabbitmqctl clear_password isx_dev         &&
rabbitmqctl clear_password isx_dev_pe_auto &&
rabbitmqctl clear_password isx_dev_pe_crwl &&
rabbitmqctl clear_password isx_dev_pe_proc &&
rabbitmqctl clear_password isx_dev_pe_strm &&

rabbitmqctl set_permissions -p isx_dev isx_dev         ".*" ".*" ".*" &&
rabbitmqctl set_permissions -p isx_dev isx_dev_pe_auto ".*" ".*" ".*" &&
rabbitmqctl set_permissions -p isx_dev isx_dev_pe_crwl ".*" ".*" ".*" &&
rabbitmqctl set_permissions -p isx_dev isx_dev_pe_proc ".*" ".*" ".*" &&
rabbitmqctl set_permissions -p isx_dev isx_dev_pe_strm ".*" ".*" ".*" &&

true

set policies

rabbitmqctl set_policy -p isx_dev base ".*" \
    '{"ha-mode":"all","ha-sync-mode":"automatic"}' \
    --priority 0 &&
rabbitmqctl set_policy -p isx_dev hlth "_\.healthcheck" \
    '{"ha-mode":"all","ha-sync-mode":"automatic","message-ttl":0}' \
    --priority 1 --apply-to queues &&
rabbitmqctl set_policy -p isx_dev dyn "(crwl|proc|strm)\..*" \
    '{"ha-mode":"all","ha-sync-mode":"automatic","expires":604800000}' \
    --priority 1 --apply-to queues &&

true

RabbitMQ 3 is required for messaging between the main programs for Isoxya PE.

For high-availability (PE), RabbitMQ's built-in multi-broker setup is possible. A minimum of 3 nodes is recommended, with mirrored queues applied via a policy, and pause-minority cluster partition handling.
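As a rough illustration of how the three policies above interact, the following Python sketch picks the policy that would apply to a given queue name. It assumes RabbitMQ's usual behaviour of applying only the highest-priority matching policy and treating the pattern as an unanchored regular expression; the queue names are hypothetical and are not taken from Isoxya.

```python
import re

# Policies copied from the rabbitmqctl commands above:
# (name, pattern, priority, definition).
POLICIES = [
    ("base", r".*", 0, {"ha-mode": "all", "ha-sync-mode": "automatic"}),
    ("hlth", r"_\.healthcheck", 1,
     {"ha-mode": "all", "ha-sync-mode": "automatic", "message-ttl": 0}),
    ("dyn", r"(crwl|proc|strm)\..*", 1,
     {"ha-mode": "all", "ha-sync-mode": "automatic", "expires": 604800000}),
]


def effective_policy(queue_name):
    """Return the name of the highest-priority policy whose pattern matches."""
    matches = [p for p in POLICIES if re.search(p[1], queue_name)]
    return max(matches, key=lambda p: p[2])[0]


# Hypothetical queue names, chosen only to exercise each pattern.
for name in ("crwl.example", "api._.healthcheck", "other"):
    print(name, "->", effective_policy(name))

# The "expires" value is in milliseconds; converted to days:
EXPIRES_DAYS = 604800000 / (1000 * 60 * 60 * 24)
print(EXPIRES_DAYS)
```

Under these assumptions, dynamically created crawler, processor, and streamer queues pick up the expiry of 7 days, while everything else falls back to the base mirroring policy.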

Containers

(CE/PE)

Podman or Docker are recommended for running the main programs for Isoxya CE/PE. Docker Swarm or Kubernetes can also be used for Isoxya PE. Alternatively, it is possible to run binaries with few dependencies instead, but these are not currently supplied separately, meaning it would be necessary to extract the binaries from the container images.

For high-availability (PE), it is possible to either let the container orchestrator handle this, or alternatively set up multiple instances behind an HTTP load-balancer. Depending on the strategy used, a minimum of either 2 or 3 nodes is recommended.

Dynamic Resources

(PE)

Pacemaker with other ClusterLabs components is required for dynamic resource management for Isoxya PE. In future, it is hoped that other orchestrators will also be supported, such as Docker Swarm or Kubernetes (get in touch if you'd like to discuss an extension such as this).

For high-availability (PE), Pacemaker can be configured to allow resource migration across the cluster automatically. A minimum of 3 nodes is recommended.

Plugins

(CE/PE)

The Isoxya engine requires plugins to run for Isoxya CE/PE. Which plugins are used can change the utility of the web crawler dramatically. For example, one set of plugins could turn Isoxya into an SEO crawler, another set of plugins into a large-scale spellchecker, and another set of plugins into an image search engine.

The installation instructions for plugins vary; consult their documentation for specific steps to take. The References section includes a list of open-source or proprietary plugins known to be available.

For high-availability (PE), it is possible to either let the container orchestrator handle this, or alternatively set up multiple instances behind an HTTP load-balancer. Depending on the strategy used, a minimum of either 2 or 3 nodes is recommended.

Configuration

isx-ce-api

(CE)

isx-ce-api is the Isoxya CE API, controlling the main engine. It is typically installed on the Containers servers.

Environment Variables

Variable Default Description
SQLITE_FILE sqlite3.db SQLite file

isx-pe-api

(PE)

isx-pe-api is the Isoxya PE API, controlling the main engine. It is typically installed on the Containers servers.

Environment Variables

Variable Default Description
CONFIG_FILE config.yml config file
LICENSE_FILE license.yml licence file
POSTGRESQL_URL postgres://postgres:postgres@pg:5432 PostgreSQL URL
RABBITMQ_URL amqp://guest:guest@rmq:5672/ RabbitMQ URL
REDIS_URL redis://rds:6379 Redis URL
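Before starting a container, it can be handy to check which values the variables above would resolve to. The Python sketch below merges overrides over the documented defaults and splits each service URL into scheme, host, and port. It is only a pre-flight check written for illustration: the function names and the `cache.internal` host are hypothetical, and Isoxya itself may read these variables differently.

```python
import os
from urllib.parse import urlsplit

# Defaults copied from the table above.
DEFAULTS = {
    "CONFIG_FILE": "config.yml",
    "LICENSE_FILE": "license.yml",
    "POSTGRESQL_URL": "postgres://postgres:postgres@pg:5432",
    "RABBITMQ_URL": "amqp://guest:guest@rmq:5672/",
    "REDIS_URL": "redis://rds:6379",
}


def load_settings(environ=os.environ):
    """Merge environment overrides over the documented defaults."""
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}


def describe_url(url):
    """Split a service URL into (scheme, host, port) for a sanity check."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.port


# Hypothetical override: only REDIS_URL is set, the rest use defaults.
settings = load_settings({"REDIS_URL": "redis://cache.internal:6380"})
for name in ("POSTGRESQL_URL", "RABBITMQ_URL", "REDIS_URL"):
    print(name, describe_url(settings[name]))
```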

isx-pe-auto

(PE)

isx-pe-auto is the Isoxya PE dynamic resource manager allocator, responsible for launching crawlers, processors, and streamers. It is typically installed on the Dynamic Resources servers.

Container Mounts

Location Target
/ /mnt/chroot.d

Environment Variables

Variable Default Description
CONFIG_FILE config.yml config file
LICENSE_FILE license.yml licence file
POSTGRESQL_URL postgres://postgres:postgres@pg:5432 PostgreSQL URL
RABBITMQ_URL amqp://guest:guest@rmq:5672/ RabbitMQ URL
REDIS_URL redis://rds:6379 Redis URL

isx-pe-auto-stop-inactive

(PE)

isx-pe-auto-stop-inactive is the Isoxya PE dynamic resource manager deallocator, responsible for cleaning up crawlers, processors, and streamers. It is typically installed on the Dynamic Resources servers.

Container Mounts

Location Target
/ /mnt/chroot.d

Environment Variables

Variable Default Description
CONFIG_FILE config.yml config file
LICENSE_FILE license.yml licence file
POSTGRESQL_URL postgres://postgres:postgres@pg:5432 PostgreSQL URL
RABBITMQ_URL amqp://guest:guest@rmq:5672/ RabbitMQ URL
REDIS_URL redis://rds:6379 Redis URL

isx-pe-auto-validate-ext

(PE)

isx-pe-auto-validate-ext is the Isoxya PE validate external scheduled task, generating new crawls automatically where required. It is typically installed on the Dynamic Resources servers.

Container Mounts

Location Target
/ /mnt/chroot.d

Environment Variables

Variable Default Description
CONFIG_FILE config.yml config file
LICENSE_FILE license.yml licence file
POSTGRESQL_URL postgres://postgres:postgres@pg:5432 PostgreSQL URL
RABBITMQ_URL amqp://guest:guest@rmq:5672/ RabbitMQ URL
REDIS_URL redis://rds:6379 Redis URL

isx-pe-crwl

(PE)

isx-pe-crwl is the Isoxya PE crawler, crawling a single site. One of these is run for every site being crawled, potentially more than one in parallel if multiple channels are being used. It is typically installed on the Dynamic Resources servers.

Environment Variables

Variable Default Description
CONFIG_FILE config.yml config file
LICENSE_FILE license.yml licence file
POSTGRESQL_URL postgres://postgres:postgres@pg:5432 PostgreSQL URL
RABBITMQ_URL amqp://guest:guest@rmq:5672/ RabbitMQ URL
REDIS_URL redis://rds:6379 Redis URL

isx-pe-proc

(PE)

isx-pe-proc is the Isoxya PE processor, connecting to a processor plugin. One of these is run for every processor plugin, potentially more than one in parallel if multiple channels are being used. It is typically installed on the Dynamic Resources servers.

Environment Variables

Variable Default Description
CONFIG_FILE config.yml config file
LICENSE_FILE license.yml licence file
POSTGRESQL_URL postgres://postgres:postgres@pg:5432 PostgreSQL URL
RABBITMQ_URL amqp://guest:guest@rmq:5672/ RabbitMQ URL
REDIS_URL redis://rds:6379 Redis URL

isx-pe-strm

(PE)

isx-pe-strm is the Isoxya PE streamer, connecting to a streamer plugin. One of these is run for every streamer plugin, potentially more than one in parallel if multiple channels are being used. It is typically installed on the Dynamic Resources servers.

Environment Variables

Variable Default Description
CONFIG_FILE config.yml config file
LICENSE_FILE license.yml licence file
POSTGRESQL_URL postgres://postgres:postgres@pg:5432 PostgreSQL URL
RABBITMQ_URL amqp://guest:guest@rmq:5672/ RabbitMQ URL
REDIS_URL redis://rds:6379 Redis URL

Initialisation

Use the Isoxya Scripts to complete setup, either directly or by using the scripts as reference.

Log in

(PE)

Log in using Tigrosa.

Create an organisation

(PE)

Create an Org using Tigrosa.

Register processor plugins

(CE/PE)

isx-create-plug-proc

Register a PlugProc, pointing to the processor plugin endpoint (potentially through a load-balancer terminating SSL).

Repeat this step as many times as needed, to register multiple processor plugins.

Register streamer plugins

(CE/PE)

isx-create-plug-strm

Register a PlugStrm, pointing to the streamer plugin endpoint (potentially through a load-balancer terminating SSL).

Repeat this step as many times as needed, to register multiple streamer plugins.

Create user-agent identities

(PE)

isx-create-user-agent

Create a UserAgent identity, used by the crawlers for requests.

Repeat this step as many times as needed, to create multiple user-agent identities.

Usage

Log in

(PE)

Log in using Tigrosa.

Register a site

(CE/PE)

isx-create-site

Register a Site which you want to crawl.

Start a crawl

(CE/PE)

isx-create-crwl

Start a Crwl.

Read resources

(CE/PE)

isx-read

Read a Crwl or other resources.

References

Isoxya Scripts

Isoxya Scripts is an open-source (BSD 3-Clause) collection of scripts for Isoxya web crawler. With these, it's possible to crawl sites and perform other operations using the Isoxya API. These are useful not only in development, but also as a demo of Isoxya's main capabilities, a quick way of performing actions even in production, and also in providing a functional reference for those wishing to develop their own programs on top of Isoxya.

Isoxya plugin: Crawler HTML

Isoxya plugin: Crawler HTML is an open-source (BSD 3-Clause) processor plugin for Isoxya web crawler. This plugin uses Isoxya 2 JSON interfaces to provide a core run loop for the crawling engine, receiving data for each page post-request, parsing it as static HTML, constructing URL metadata, and responding with a set of outbound URLs.

Isoxya plugin: Elasticsearch

Isoxya plugin: Elasticsearch is an open-source (BSD 3-Clause) streamer plugin for Isoxya web crawler. This plugin uses Isoxya 2 JSON interfaces to stream data into an Elasticsearch cluster, making it possible to query using all the normal features provided by Elasticsearch and Kibana.

Isoxya plugin: Spellchecker

Isoxya plugin: Spellchecker is an open-source (BSD 3-Clause) processor plugin for Isoxya web crawler. This plugin uses Isoxya 2 JSON interfaces to provide spellchecking capabilities to entire websites, even if they have millions of pages.

Interfaces

Processor /* POST

Request

POST /* HTTP/1.1
content-type: application/json
{
  "body": "PCFkb2N0eXBlIGh0bWw+CjxodG1sPgo8aGVhZD4KICAgIDx0aXRsZT5FeGFtcGxlIERvbWFpbjwvdGl0bGU+CgogICAgPG1ldGEgY2hhcnNldD0idXRmLTgiIC8+CiAgICA8bWV0YSBodHRwLWVxdWl2PSJDb250ZW50LXR5cGUiIGNvbnRlbnQ9InRleHQvaHRtbDsgY2hhcnNldD11dGYtOCIgLz4KICAgIDxtZXRhIG5hbWU9InZpZXdwb3J0IiBjb250ZW50PSJ3aWR0aD1kZXZpY2Utd2lkdGgsIGluaXRpYWwtc2NhbGU9MSIgLz4KICAgIDxzdHlsZSB0eXBlPSJ0ZXh0L2NzcyI+CiAgICBib2R5IHsKICAgICAgICBiYWNrZ3JvdW5kLWNvbG9yOiAjZjBmMGYyOwogICAgICAgIG1hcmdpbjogMDsKICAgICAgICBwYWRkaW5nOiAwOwogICAgICAgIGZvbnQtZmFtaWx5OiAtYXBwbGUtc3lzdGVtLCBzeXN0ZW0tdWksIEJsaW5rTWFjU3lzdGVtRm9udCwgIlNlZ29lIFVJIiwgIk9wZW4gU2FucyIsICJIZWx2ZXRpY2EgTmV1ZSIsIEhlbHZldGljYSwgQXJpYWwsIHNhbnMtc2VyaWY7CiAgICAgICAgCiAgICB9CiAgICBkaXYgewogICAgICAgIHdpZHRoOiA2MDBweDsKICAgICAgICBtYXJnaW46IDVlbSBhdXRvOwogICAgICAgIHBhZGRpbmc6IDJlbTsKICAgICAgICBiYWNrZ3JvdW5kLWNvbG9yOiAjZmRmZGZmOwogICAgICAgIGJvcmRlci1yYWRpdXM6IDAuNWVtOwogICAgICAgIGJveC1zaGFkb3c6IDJweCAzcHggN3B4IDJweCByZ2JhKDAsMCwwLDAuMDIpOwogICAgfQogICAgYTpsaW5rLCBhOnZpc2l0ZWQgewogICAgICAgIGNvbG9yOiAjMzg0ODhmOwogICAgICAgIHRleHQtZGVjb3JhdGlvbjogbm9uZTsKICAgIH0KICAgIEBtZWRpYSAobWF4LXdpZHRoOiA3MDBweCkgewogICAgICAgIGRpdiB7CiAgICAgICAgICAgIG1hcmdpbjogMCBhdXRvOwogICAgICAgICAgICB3aWR0aDogYXV0bzsKICAgICAgICB9CiAgICB9CiAgICA8L3N0eWxlPiAgICAKPC9oZWFkPgoKPGJvZHk+CjxkaXY+CiAgICA8aDE+RXhhbXBsZSBEb21haW48L2gxPgogICAgPHA+VGhpcyBkb21haW4gaXMgZm9yIHVzZSBpbiBpbGx1c3RyYXRpdmUgZXhhbXBsZXMgaW4gZG9jdW1lbnRzLiBZb3UgbWF5IHVzZSB0aGlzCiAgICBkb21haW4gaW4gbGl0ZXJhdHVyZSB3aXRob3V0IHByaW9yIGNvb3JkaW5hdGlvbiBvciBhc2tpbmcgZm9yIHBlcm1pc3Npb24uPC9wPgogICAgPHA+PGEgaHJlZj0iaHR0cHM6Ly93d3cuaWFuYS5vcmcvZG9tYWlucy9leGFtcGxlIj5Nb3JlIGluZm9ybWF0aW9uLi4uPC9hPjwvcD4KPC9kaXY+CjwvYm9keT4KPC9odG1sPgo=",
  "header": {
    "Vary": "Accept-Encoding",
    "Content-Type": "text/html; charset=UTF-8",
    "Content-Encoding": "gzip",
    "Etag": "\"3147526947+gzip\"",
    "Expires": "Tue, 02 Feb 2021 11:19:21 GMT",
    "Age": "264946",
    "Last-Modified": "Thu, 17 Oct 2019 07:18:26 GMT",
    "Date": "Tue, 26 Jan 2021 11:19:21 GMT",
    "Server": "ECS (dcb/7EC6)",
    "Content-Length": "648",
    "Cache-Control": "max-age=604800",
    "X-Cache": "HIT"
  },
  "meta": {
    "status": 200,
    "config": null,
    "url": "http://example.com:80/",
    "method": "GET",
    "err": null,
    "duration": {
      "denominator": 1000000000,
      "numerator": 119165427
    }
  }
}

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "data": {
    "status": 200,
    "method": "GET",
    "header": {
      "Vary": "Accept-Encoding",
      "Content-Type": "text/html; charset=UTF-8",
      "Content-Encoding": "gzip",
      "Etag": "\"3147526947+gzip\"",
      "Expires": "Tue, 02 Feb 2021 11:19:21 GMT",
      "Age": "264946",
      "Last-Modified": "Thu, 17 Oct 2019 07:18:26 GMT",
      "Date": "Tue, 26 Jan 2021 11:19:21 GMT",
      "Server": "ECS (dcb/7EC6)",
      "Content-Length": "648",
      "Cache-Control": "max-age=604800",
      "X-Cache": "HIT"
    },
    "err": null,
    "duration": {
      "denominator": 1000000000,
      "numerator": 119165427
    }
  },
  "urls": [
    "https://www.iana.org/domains/example"
  ]
}

Send page metadata, header, and body to a processor plugin, and receive extracted data and outbound URLs to crawl.

Request Parameters

Parameter Type Description
body string Base-64 encoded page body, e.g. HTML or image
header object HTTP header of response
meta.status number? HTTP status code, if available
meta.config object? processor config
meta.url string URL of page
meta.method string HTTP method of request
meta.err string? error (e.g. RobotDisallowed)
meta.duration number? duration of request; a simplified rational, shown in the example as an object with numerator and denominator
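Since the duration arrives as a rational rather than a plain number, a plugin will usually want to convert it. A minimal Python sketch, using the values from the example request; the unit is not stated in this section, so the result is left unlabelled, and `duration_value` is a hypothetical helper name.

```python
from fractions import Fraction


def duration_value(duration):
    """Convert a {"numerator", "denominator"} object to a Fraction, or None."""
    if duration is None:
        return None
    return Fraction(duration["numerator"], duration["denominator"])


# Values taken from the example request above.
d = duration_value({"denominator": 1000000000, "numerator": 119165427})
print(float(d))
```

Keeping the value as a `Fraction` until the last moment avoids rounding when durations are summed or averaged.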

Response Parameters

Parameter Type Description
data object free-form extracted data; processor may define own schema
urls array.string outbound URLs to crawl, if any
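To make the processor interface more concrete, here is a minimal Python sketch of the core of a processor plugin: it decodes the Base-64 body, collects link targets, and returns a response with `data` and `urls`. It is not the implementation of any existing plugin; the `LinkCollector` and `process` names, the `link_count` field, and the tiny page body are all assumptions made for illustration, and the HTTP server around it is left out.

```python
import base64
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def process(request):
    """Build a processor response from a processor request (both plain dicts)."""
    meta = request["meta"]
    html = base64.b64decode(request["body"]).decode("utf-8", errors="replace")
    collector = LinkCollector()
    collector.feed(html)
    # Resolve relative links against the page URL.
    urls = [urljoin(meta["url"], href) for href in collector.hrefs]
    # "data" is free-form; this sketch echoes a few request fields.
    data = {"status": meta["status"], "method": meta["method"], "link_count": len(urls)}
    return {"data": data, "urls": urls}


# Hypothetical request with a tiny page body.
page = b'<a href="/about">About</a> <a href="https://www.iana.org/domains/example">More</a>'
request = {
    "body": base64.b64encode(page).decode("ascii"),
    "header": {"Content-Type": "text/html; charset=UTF-8"},
    "meta": {"status": 200, "config": None, "url": "http://example.com:80/",
             "method": "GET", "err": None, "duration": None},
}
response = process(request)
print(response["urls"])
```

A real plugin would also need to decide what to do when `meta.err` is set or the body is not HTML; the sketch ignores both cases.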

Streamer /* POST

Request

POST /* HTTP/1.1
content-type: application/json
{
  "crwl": {
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-26T11:46:21.411346Z",
    "t_begin": "2021-01-26T11:46:21.411346Z"
  },
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "t_retrieval": "2021-01-26T11:46:21.908809003Z",
  "data": {
    "status": 200,
    "method": "GET",
    "header": {
      "Vary": "Accept-Encoding",
      "Content-Type": "text/html; charset=UTF-8",
      "Content-Encoding": "gzip",
      "Etag": "\"3147526947\"",
      "Expires": "Tue, 02 Feb 2021 11:46:21 GMT",
      "Age": "425384",
      "Last-Modified": "Thu, 17 Oct 2019 07:18:26 GMT",
      "Date": "Tue, 26 Jan 2021 11:46:21 GMT",
      "Server": "ECS (dcb/7F15)",
      "Content-Length": "648",
      "Cache-Control": "max-age=604800",
      "Accept-Ranges": "bytes",
      "X-Cache": "HIT"
    },
    "err": null,
    "duration": {
      "denominator": 250000000,
      "numerator": 30304037
    }
  },
  "url": "http://example.com:80/",
  "plug_proc": {
    "tag": "crawler-html",
    "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f"
  },
  "site": {
    "url": "http://example.com:80",
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw"
  }
}

Response 200

HTTP/1.1 200 OK
content-type: application/json

Send extracted data and page metadata to a streamer plugin.

Request Parameters

Parameter Type Description
crwl.href string Crwl Href
crwl.t_begin string Crwl time crawl began
org.href string Org Href
t_retrieval string time page retrieved
data object free-form extracted data; processor may define own schema
url string URL of page
plug_proc.tag string PlugProc tag, used for conditional logic in the streamer, and of the kind that might appear on a financial invoice
plug_proc.href string PlugProc Href
site.url string Site URL
site.href string Site Href
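The streamer side can be sketched in a similar way. The Python below flattens a streamer request into a single record and dispatches on `plug_proc.tag`, which is one way to read the "conditional logic" mentioned in the table. The `to_record` and `route` helpers and the handler table are hypothetical; a real streamer plugin would write the record to its datastore instead of returning a tuple.

```python
def to_record(message):
    """Flatten a streamer request (plain dict) into a single-level record."""
    return {
        "url": message["url"],
        "site_url": message["site"]["url"],
        "org": message["org"]["href"],
        "crawl": message["crwl"]["href"],
        "retrieved": message["t_retrieval"],
        "processor": message["plug_proc"]["tag"],
        "data": message["data"],
    }


def route(message, handlers):
    """Dispatch on plug_proc.tag; messages with unknown tags return None."""
    handler = handlers.get(message["plug_proc"]["tag"])
    return handler(to_record(message)) if handler else None


# Trimmed copy of the example request above.
message = {
    "crwl": {"href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-26T11:46:21.411346Z",
             "t_begin": "2021-01-26T11:46:21.411346Z"},
    "org": {"href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"},
    "t_retrieval": "2021-01-26T11:46:21.908809003Z",
    "data": {"status": 200, "method": "GET"},
    "url": "http://example.com:80/",
    "plug_proc": {"tag": "crawler-html",
                  "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f"},
    "site": {"url": "http://example.com:80",
             "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw"},
}

# Hypothetical handler table: one callable per processor tag of interest.
handlers = {"crawler-html": lambda record: (record["url"], record["data"]["status"])}
result = route(message, handlers)
print(result)
```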

Apex

(CE/PE)

Crwl

(CE/PE)

Crawl: A crawl of a Site. Belongs to an Org. A crawl can be a web crawl starting from the site apex, a web crawl starting from a list of pages, or a list crawl visiting only the pages on the list. Crawls can optionally validate external links, in which case new crawls are spawned after the parent crawl completes. A crawl covers a single site only; to crawl several sites, use multiple crawls.

/site/:site_id/crwl POST

(CE/PE)

Request

POST /site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl HTTP/1.1
content-type: application/json
{
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "plug_proc": [
    {
      "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f"
    }
  ],
  "plug_strm": [
    {
      "href": "/plug_strm/da5c6942-813a-4923-bacc-019c4c102585"
    }
  ],
  "user_agent": {
    "href": "/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7"
  }
}

Response 201

HTTP/1.1 201 Created
location: /site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T19:20:56.518498Z
content-type: application/json
{
  "depth_max": null,
  "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T19:20:56.518498Z",
  "list": null,
  "method": "GET",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "p": null,
  "pages": null,
  "pages_max": null,
  "plug_proc": [
    {
      "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f"
    }
  ],
  "plug_proc_conf": null,
  "plug_strm": [
    {
      "href": "/plug_strm/da5c6942-813a-4923-bacc-019c4c102585"
    }
  ],
  "progress": null,
  "site": {
    "chans": 1,
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
    "rate_lim": {
      "denominator": 10,
      "numerator": 1
    },
    "url": "http://example.com:80"
  },
  "status": "pending",
  "t_begin": "2021-01-25T19:20:56.518498Z",
  "t_end": null,
  "user_agent": {
    "href": "/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7"
  },
  "validate_ext": false
}

Create a Crwl.

Request Parameters

Parameter Type Description
depth_max number? max depth to crawl before terminating early (approximate) (PE)
list object? List, if list crawl rather than web crawl (PE)
org object Org (PE)
pages_max number? max pages to crawl before terminating early (approximate) (PE)
plug_proc[] array.object PlugProcs (CE/PE)
plug_proc_conf object? processor config (CE/PE)
plug_strm[] array.object PlugStrms (CE/PE)
user_agent object UserAgent (PE)
validate_ext boolean? whether to validate external links by auto-generating child crawls after completion (PE)

Response Parameters

Response Parameters are as for /site/:site_id/crwl/:site_v GET.
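The request above can be assembled programmatically. In the examples, the `:site_id` path segment decodes as the Base-64 form of the site URL (`aHR0cDovL2V4YW1wbGUuY29tOjgw` is `http://example.com:80`); the Python sketch below relies on that observation, which is an inference from the examples rather than a documented rule, and it does not cover URLs whose encoding would contain `+`, `/`, or padding. The helper names are hypothetical.

```python
import base64
import json


def site_id(site_url):
    """Base-64 encode a site URL, matching the identifiers seen in the examples."""
    return base64.b64encode(site_url.encode("utf-8")).decode("ascii")


def create_crwl_request(site_url, plug_proc, plug_strm, org=None, user_agent=None):
    """Return (path, JSON body) for a create-crawl request."""
    body = {
        "plug_proc": [{"href": href} for href in plug_proc],
        "plug_strm": [{"href": href} for href in plug_strm],
    }
    # org and user_agent are marked (PE) in the table above, so they are optional here.
    if org is not None:
        body["org"] = {"href": org}
    if user_agent is not None:
        body["user_agent"] = {"href": user_agent}
    return f"/site/{site_id(site_url)}/crwl", json.dumps(body, indent=2)


# Hrefs copied from the example request above.
path, body = create_crwl_request(
    "http://example.com:80",
    plug_proc=["/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f"],
    plug_strm=["/plug_strm/da5c6942-813a-4923-bacc-019c4c102585"],
    org="/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332",
    user_agent="/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7",
)
print("POST", path)
print(body)
```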

/site/:site_id/crwl GET

(CE/PE)

Response 200

HTTP/1.1 200 OK
link: </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl>; rel="first", </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl?_next=2021-01-25T19:20:56.518498Z>; rel="next", </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl?_prev=2021-01-25T19:20:56.518498Z>; rel="prev"
content-type: application/json
[
  {
    "depth_max": null,
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T19:20:56.518498Z",
    "list": null,
    "method": "GET",
    "org": {
      "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
    },
    "p": null,
    "pages": 1,
    "pages_max": null,
    "plug_proc": [
      {
        "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f"
      }
    ],
    "plug_proc_conf": null,
    "plug_strm": [
      {
        "href": "/plug_strm/da5c6942-813a-4923-bacc-019c4c102585"
      }
    ],
    "progress": 0,
    "site": {
      "chans": 1,
      "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
      "rate_lim": {
        "denominator": 10,
        "numerator": 1
      },
      "url": "http://example.com:80"
    },
    "status": "pending",
    "t_begin": "2021-01-25T19:20:56.518498Z",
    "t_end": null,
    "user_agent": {
      "href": "/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7"
    },
    "validate_ext": false
  }
]

List Crwls.

Response Parameters

Response Parameters are as for /site/:site_id/crwl/:site_v GET.
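List responses carry their pagination targets in the `link` header shown above. A small Python sketch of splitting that header into a mapping from `rel` name to target follows; the regular expression handles the form used in these examples only and is not a complete parser for the header, and `parse_link_header` is a hypothetical helper name.

```python
import re

# Matches one '<target>; rel="name"' pair as it appears in the examples.
LINK_PART = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def parse_link_header(value):
    """Map each rel name to its target from a link header value."""
    return {rel: target for target, rel in LINK_PART.findall(value)}


# Header value copied from the example response above.
header = (
    '</site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl>; rel="first", '
    '</site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl?_next=2021-01-25T19:20:56.518498Z>; rel="next", '
    '</site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl?_prev=2021-01-25T19:20:56.518498Z>; rel="prev"'
)
links = parse_link_header(header)
print(links["next"])
```

A client can then follow `links["next"]` until a page comes back empty or the header no longer offers a `next` target; which of the two signals the end is not specified in this section.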

/site/:site_id/crwl/:site_v GET

(CE/PE)

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "depth_max": null,
  "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T19:20:56.518498Z",
  "list": null,
  "method": "GET",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "p": null,
  "pages": 1,
  "pages_max": null,
  "plug_proc": [
    {
      "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f"
    }
  ],
  "plug_proc_conf": null,
  "plug_strm": [
    {
      "href": "/plug_strm/da5c6942-813a-4923-bacc-019c4c102585"
    }
  ],
  "progress": 0,
  "site": {
    "chans": 1,
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
    "rate_lim": {
      "denominator": 10,
      "numerator": 1
    },
    "url": "http://example.com:80"
  },
  "status": "pending",
  "t_begin": "2021-01-25T19:20:56.518498Z",
  "t_end": null,
  "user_agent": {
    "href": "/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7"
  },
  "validate_ext": false
}

Read a Crwl.

Response Parameters

Parameter Type Description
depth_max number? max depth to crawl before terminating early (approximate) (PE)
href string Href (CE/PE)
list object? List, if list crawl rather than web crawl (PE)
method string HTTP method for requests (PE)
org object Org (PE)
p object? parent Crwl, if crawl auto-generated during another crawl (PE)
pages number? total pages discovered (CE/PE)
pages_max number? max pages to crawl before terminating early (approximate) (PE)
plug_proc[] array.object PlugProcs (CE/PE)
plug_proc_conf object? processor config (CE/PE)
plug_strm[] array.object PlugStrms (CE/PE)
progress number? progress of crawl (%) (CE/PE)
site object Site (CE/PE)
status string status of crawl (pending, completed, limited, canceled) (CE/PE)
t_begin string time crawl began (CE/PE)
t_end string? time crawl ended, if finished (CE/PE)
user_agent object UserAgent (PE)
validate_ext boolean whether to validate external links by auto-generating child crawls after completion (PE)
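When polling a Crwl, a client typically needs only a few of these fields. The Python sketch below derives a short summary from a response. It makes two assumptions that this section does not confirm: that every status other than `pending` is terminal, and that `site.rate_lim` is a rational in requests per second, as the features table suggests for the rate-limit. `summarise` is a hypothetical helper name.

```python
from fractions import Fraction

# Assumption: every status other than "pending" is terminal.
FINISHED = {"completed", "limited", "canceled"}


def summarise(crwl):
    """Derive a few convenience values from a Crwl response (plain dict)."""
    rate = crwl["site"]["rate_lim"]
    reqs_per_s = Fraction(rate["numerator"], rate["denominator"])
    return {
        "finished": crwl["status"] in FINISHED,
        "is_child": crwl["p"] is not None,
        "reqs_per_s": float(reqs_per_s),
        "seconds_per_req": float(1 / reqs_per_s),
    }


# Fields trimmed from the example response above.
crwl = {
    "p": None,
    "site": {"rate_lim": {"denominator": 10, "numerator": 1}},
    "status": "pending",
}
summary = summarise(crwl)
print(summary)
```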

/site/:site_id/crwl/:site_v PATCH

(PE)

Request

PATCH /site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T19:20:56.518498Z HTTP/1.1
content-type: application/json
{
  "status": "canceled"
}

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "depth_max": null,
  "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T19:20:56.518498Z",
  "list": null,
  "method": "GET",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "p": null,
  "pages": 1,
  "pages_max": null,
  "plug_proc": [
    {
      "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f"
    }
  ],
  "plug_proc_conf": null,
  "plug_strm": [
    {
      "href": "/plug_strm/da5c6942-813a-4923-bacc-019c4c102585"
    }
  ],
  "progress": 0,
  "site": {
    "chans": 1,
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
    "rate_lim": {
      "denominator": 10,
      "numerator": 1
    },
    "url": "http://example.com:80"
  },
  "status": "canceled",
  "t_begin": "2021-01-25T19:20:56.518498Z",
  "t_end": null,
  "user_agent": {
    "href": "/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7"
  },
  "validate_ext": false
}

Update a Crwl.

Request Parameters

Parameter Type Description
status string status of crawl; canceled cancels crawl (PE)

Response Parameters

Response Parameters are as for /site/:site_id/crwl/:site_v GET.

/site/:site_id/crwl/:site_v/crwl GET

(PE)

Response 200

HTTP/1.1 200 OK
link: </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T20:09:57.106349Z/crwl>; rel="first", </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T20:09:57.106349Z/crwl?_next=2021-01-25T20:27:03.480965Z>; rel="next", </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T20:09:57.106349Z/crwl?_prev=2021-01-25T20:27:03.480965Z>; rel="prev"
content-type: application/json
[
  {
    "depth_max": 1,
    "href": "/site/aHR0cHM6Ly93d3cuaWFuYS5vcmc6NDQz/crwl/2021-01-25T20:27:03.480965Z",
    "list": {
      "href": "/list/cf7faaf0-2fbc-4f6f-a96d-379efef34d3e"
    },
    "method": "GET",
    "org": {
      "href": "/org/2db5e7d7-60b9-4a81-8649-e60ee3b05d38"
    },
    "p": {
      "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T20:09:57.106349Z"
    },
    "pages": 1,
    "pages_max": null,
    "plug_proc": [
      {
        "href": "/plug_proc/057eae7b-5779-4549-be13-d88808708ea9"
      }
    ],
    "plug_proc_conf": null,
    "plug_strm": [
      {
        "href": "/plug_strm/52822786-1343-47b6-b3fa-761e773c9ba5"
      }
    ],
    "progress": 100,
    "site": {
      "chans": 1,
      "href": "/site/aHR0cHM6Ly93d3cuaWFuYS5vcmc6NDQz",
      "rate_lim": {
        "denominator": 10,
        "numerator": 1
      },
      "url": "https://www.iana.org:443"
    },
    "status": "completed",
    "t_begin": "2021-01-25T20:27:03.480965Z",
    "t_end": "2021-01-25T20:28:04.391544Z",
    "user_agent": {
      "href": "/user_agent/1e25f62f-ef6c-4ca9-93ad-63210e28ced9"
    },
    "validate_ext": false
  }
]

List child Crwls of a parent Crwl.

Response Parameters

Response Parameters are as for /site/:site_id/crwl/:site_v GET.

/site/:site_id/crwl/:site_v/list GET

(PE)

Response 200

HTTP/1.1 200 OK
link: </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T20:09:57.106349Z/list>; rel="first", </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T20:09:57.106349Z/list?_next=2021-01-25T20:10:00.015549Z>; rel="next", </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T20:09:57.106349Z/list?_prev=2021-01-25T20:10:00.015549Z>; rel="prev"
content-type: application/json
[
  {
    "crwl": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T20:09:57.106349Z",
    "href": "/list/cf7faaf0-2fbc-4f6f-a96d-379efef34d3e",
    "org": {
      "href": "/org/2db5e7d7-60b9-4a81-8649-e60ee3b05d38"
    },
    "pages": 1,
    "site": {
      "chans": 1,
      "href": "/site/aHR0cHM6Ly93d3cuaWFuYS5vcmc6NDQz",
      "rate_lim": {
        "denominator": 10,
        "numerator": 1
      },
      "url": "https://www.iana.org:443"
    }
  }
]

List child Lists of a parent Crwl.

Response Parameters

Response Parameters are as for /list/:list_id GET.

FinLdgr

(PE)

Finance-Ledger: A ledger showing completed Crwls, which PlugProc and PlugStrm plugins they used, how many pages were crawled, and which Org is responsible for paying the bill.

/fin_ldgr GET

(PE)

Response 200

HTTP/1.1 200 OK
link: </fin_ldgr>; rel="first", </fin_ldgr?_next=2021-01-21T13:27:04.628437Z>; rel="next", </fin_ldgr?_prev=2021-01-21T13:27:04.628437Z>; rel="prev"
content-type: application/json
[
  {
    "fin_prod": {
      "href": "/fin_prod/da5c6942-813a-4923-bacc-019c4c102585",
      "t_ins": "2021-01-15T11:16:06.854017Z",
      "tag": "plug_strm.elasticsearch"
    },
    "href": "/fin_ldgr/cfdc7e29-5325-59b2-bf71-859c17198da9",
    "org": {
      "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
    },
    "qty": 3,
    "t_ins": "2021-01-21T13:27:04.628437Z"
  },
  {
    "fin_prod": {
      "href": "/fin_prod/76ce4d4a-a965-4ab4-9fae-2c375750cb0f",
      "t_ins": "2021-01-15T11:16:06.854017Z",
      "tag": "plug_proc.crawler-html"
    },
    "href": "/fin_ldgr/bcd6aad4-428e-5996-b707-370c8507f51c",
    "org": {
      "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
    },
    "qty": 3,
    "t_ins": "2021-01-21T13:27:04.628437Z"
  }
]

List FinLdgr entries.

Response Parameters

Response Parameters are as for /fin_ldgr/:fin_ldgr_id GET.

/fin_ldgr/:fin_ldgr_id GET

(PE)

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "fin_prod": {
    "href": "/fin_prod/da5c6942-813a-4923-bacc-019c4c102585",
    "t_ins": "2021-01-15T11:16:06.854017Z",
    "tag": "plug_strm.elasticsearch"
  },
  "href": "/fin_ldgr/cfdc7e29-5325-59b2-bf71-859c17198da9",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "qty": 3,
  "t_ins": "2021-01-21T13:27:04.628437Z"
}

Read a FinLdgr entry.

Response Parameters

Parameter Type Description
fin_prod object FinProd (PE)
href string Href (PE)
org object Org (PE)
qty number quantity of units used (PE)
t_ins string time associated with usage (PE)
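Because each FinLdgr entry records one product and one quantity, totals per Org and product have to be computed by the client. A minimal Python sketch of that aggregation, using the two entries from the list example; the function name is hypothetical, and no pricing is implied.

```python
from collections import defaultdict


def usage_by_org_and_tag(entries):
    """Sum qty per (org href, fin_prod tag) over FinLdgr entries."""
    totals = defaultdict(int)
    for entry in entries:
        key = (entry["org"]["href"], entry["fin_prod"]["tag"])
        totals[key] += entry["qty"]
    return dict(totals)


# Entries trimmed from the /fin_ldgr GET example above.
entries = [
    {"fin_prod": {"tag": "plug_strm.elasticsearch"},
     "org": {"href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"}, "qty": 3},
    {"fin_prod": {"tag": "plug_proc.crawler-html"},
     "org": {"href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"}, "qty": 3},
]
totals = usage_by_org_and_tag(entries)
for (org, tag), qty in sorted(totals.items()):
    print(org, tag, qty)
```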

FinProd

(PE)

Finance-Product: A PlugProc or PlugStrm plugin, used within the FinLdgr to record usage.

/fin_prod GET

(PE)

Response 200

HTTP/1.1 200 OK
link: </fin_prod>; rel="first", </fin_prod?_next=2021-01-15T11:16:06.854017Z>; rel="next", </fin_prod?_prev=2021-01-21T11:16:38.118334Z>; rel="prev"
content-type: application/json
[
  {
    "href": "/fin_prod/58ce5bbe-5a0a-43df-8860-01e70820e6d8",
    "t_ins": "2021-01-21T11:16:38.118334Z",
    "tag": "plug_proc.spellchecker"
  },
  {
    "href": "/fin_prod/da5c6942-813a-4923-bacc-019c4c102585",
    "t_ins": "2021-01-15T11:16:06.854017Z",
    "tag": "plug_strm.elasticsearch"
  },
  {
    "href": "/fin_prod/76ce4d4a-a965-4ab4-9fae-2c375750cb0f",
    "t_ins": "2021-01-15T11:16:06.854017Z",
    "tag": "plug_proc.crawler-html"
  }
]

List FinProds.

Response Parameters

Response Parameters are as for /fin_prod/:fin_prod_id GET.

/fin_prod/:fin_prod_id GET

(PE)

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "href": "/fin_prod/76ce4d4a-a965-4ab4-9fae-2c375750cb0f",
  "t_ins": "2021-01-15T11:16:06.854017Z",
  "tag": "plug_proc.crawler-html"
}

Read a FinProd.

Response Parameters

Parameter Type Description
href string Href (PE)
t_ins string time of auto-creation (PE)
tag string auto-derived tag, such as might appear on a financial invoice (PE)
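
In the examples here, each FinProd tag reads as a plugin kind (plug_proc or plug_strm), a dot, then the plugin's own tag. Assuming that pattern holds in general, which this reference does not state, a client could split a tag as follows. The name split_fin_prod_tag is hypothetical.

```python
def split_fin_prod_tag(tag):
    """Split a FinProd tag at the first dot into (kind, plugin_tag).

    Assumes the 'kind.plugin-tag' shape seen in the examples.
    """
    kind, _, plugin_tag = tag.partition(".")
    return kind, plugin_tag


kind, plugin_tag = split_fin_prod_tag("plug_proc.crawler-html")
print(kind, plugin_tag)
```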

List

(PE)

List: A list of pages within a Site. Belongs to an Org. Used for list crawls and web crawls with external link validation.

/site/:site_id/list POST

(PE)

Request

POST /site/aHR0cDovL2V4YW1wbGUuY29tOjgw/list HTTP/1.1
content-type: application/json
{
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  }
}

Response 201

HTTP/1.1 201 Created
location: /list/d8996e4e-4942-4b8a-9d23-00901dc010ff
content-type: application/json
{
  "crwl": null,
  "href": "/list/d8996e4e-4942-4b8a-9d23-00901dc010ff",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "pages": 0,
  "site": {
    "chans": 1,
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
    "rate_lim": {
      "denominator": 10,
      "numerator": 1
    },
    "url": "http://example.com:80"
  }
}

Create a List.

Request Parameters

Parameter Type Description
org object Org (PE)

Response Parameters

Response Parameters are as for /list/:list_id GET.

/site/:site_id/list GET

(PE)

Response 200

HTTP/1.1 200 OK
link: </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/list>; rel="first", </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/list?_next=2021-01-25T18:38:40.423053Z>; rel="next", </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/list?_prev=2021-01-25T18:38:40.423053Z>; rel="prev"
content-type: application/json
[
  {
    "crwl": null,
    "href": "/list/d8996e4e-4942-4b8a-9d23-00901dc010ff",
    "org": {
      "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
    },
    "pages": 1,
    "site": {
      "chans": 1,
      "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
      "rate_lim": {
        "denominator": 10,
        "numerator": 1
      },
      "url": "http://example.com:80"
    }
  }
]

List Lists.

Response Parameters

Response Parameters are as for /list/:list_id GET.

/list/:list_id GET

(PE)

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "crwl": null,
  "href": "/list/d8996e4e-4942-4b8a-9d23-00901dc010ff",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "pages": 1,
  "site": {
    "chans": 1,
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
    "rate_lim": {
      "denominator": 10,
      "numerator": 1
    },
    "url": "http://example.com:80"
  }
}

Read a List.

Response Parameters

Parameter Type Description
crwl object? Crwl, if list auto-generated during another crawl (PE)
href string Href (PE)
org object Org (PE)
pages number total pages in list (PE)
site object Site (PE)

/list/:list_id DELETE

(PE)

Response 204

HTTP/1.1 204 No Content

Delete a List.

ListPage

(PE)

ListPage: A page within a List, which belongs to a Site.

/list/:list_id/list_page POST

(PE)

Request

POST /list/d8996e4e-4942-4b8a-9d23-00901dc010ff/list_page HTTP/1.1
content-type: application/json
{
  "url": ["/a"]
}

Response 204

HTTP/1.1 204 No Content

Insert one or more pages into a List.

Request Parameters

Parameter Type Description
url array.string site URLs to add to list (PE)
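
As a rough illustration of the request above, the following sketch builds (but does not send) a POST that inserts several pages into a List at once. BASE_URL and list_page_request are assumptions made for the example, and authentication via Tigrosa is left out entirely.

```python
import json
from urllib.request import Request

# Assumption: where the API is reachable; adjust for a real deployment.
BASE_URL = "http://localhost:8000"


def list_page_request(list_id, urls):
    """Build a POST request adding the given site URLs to a List."""
    body = json.dumps({"url": list(urls)}).encode("utf-8")
    return Request(
        f"{BASE_URL}/list/{list_id}/list_page",
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


req = list_page_request("d8996e4e-4942-4b8a-9d23-00901dc010ff", ["/a", "/b"])
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it; a 204 response carries no body to decode.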

Org

(PE)

Organisation: The business, organisation, or human entity registered with the system, and responsible for paying any bills. Does not log in directly.

PlugProc

(CE/PE)

Plugin-Processor: A processor plugin, belonging to an Org. This registers the endpoint of a processor plugin within the system, making it available to Crwls. A PlugProc can be registered as public, in which case it is available to every Org. Multiple channels can be used to run more processor instances simultaneously, increasing throughput (up to a point, and at the expense of increased resource requirements).

/org/:org_id/plug_proc POST

(PE)

Request

POST /org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/plug_proc HTTP/1.1
content-type: application/json
{
  "tag": "crawler-html",
  "url": "http://crawler-html.plugin.dev.isoxya.com:8000/data"
}

Response 201

HTTP/1.1 201 Created
location: /plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f
content-type: application/json
{
  "chans": 1,
  "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "pub": false,
  "tag": "crawler-html",
  "url": "http://crawler-html.plugin.dev.isoxya.com:8000/data"
}

Create a PlugProc.

Request Parameters

Parameter Type Description
chans number? channels, i.e. simultaneous processors (PE)
pub boolean? public, making available to every Org to use (PE)
tag string tag, used for conditional data-streamer logic, and such as might appear on a financial invoice (CE/PE)
url string URL of data-processor endpoint (CE/PE)

Response Parameters

Response Parameters are as for /plug_proc/:plug_proc_id GET.

/org/:org_id/plug_proc GET

(PE)

Response 200

HTTP/1.1 200 OK
link: </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/plug_proc>; rel="first", </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/plug_proc?_next=2021-01-15T11:15:23.366804Z>; rel="next", </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/plug_proc?_prev=2021-01-21T11:15:06.05113Z>; rel="prev"
content-type: application/json
[
  {
    "chans": 1,
    "href": "/plug_proc/58ce5bbe-5a0a-43df-8860-01e70820e6d8",
    "org": {
      "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
    },
    "pub": false,
    "tag": "spellchecker",
    "url": "http://spellchecker.plugin.dev.isoxya.com:8000/data"
  },
  {
    "chans": 1,
    "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f",
    "org": {
      "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
    },
    "pub": false,
    "tag": "crawler-html",
    "url": "http://crawler-html.plugin.dev.isoxya.com:8000/data"
  }
]

List PlugProcs.

Response Parameters

Response Parameters are as for /plug_proc/:plug_proc_id GET.

/plug_proc POST

(CE)

See /org/:org_id/plug_proc POST.

/plug_proc GET

(CE)

See /org/:org_id/plug_proc GET.
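
The CE and PE routes for creating a PlugProc differ only in the path prefix, and the optional parameters are PE-only. A client targeting both editions might pick the path and body as sketched below; plug_proc_create is a hypothetical helper, and leaving unset optional fields out of the body is a choice made for this example.

```python
def plug_proc_create(tag, url, org_id=None, chans=None, pub=None):
    """Return (path, body) for creating a PlugProc.

    Without org_id the CE route is used; with org_id the PE route is used.
    """
    path = "/plug_proc" if org_id is None else f"/org/{org_id}/plug_proc"
    body = {"tag": tag, "url": url}
    # chans and pub are optional and documented as PE-only.
    if chans is not None:
        body["chans"] = chans
    if pub is not None:
        body["pub"] = pub
    return path, body


ce = plug_proc_create(
    "crawler-html", "http://crawler-html.plugin.dev.isoxya.com:8000/data"
)
pe = plug_proc_create(
    "crawler-html",
    "http://crawler-html.plugin.dev.isoxya.com:8000/data",
    org_id="1df9ee03-3c25-4ea6-9276-d4c7a58de332",
    chans=2,
)
print(ce[0], pe[0])
```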

/plug_proc/:plug_proc_id GET

(CE/PE)

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "chans": 1,
  "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "pub": false,
  "tag": "crawler-html",
  "url": "http://crawler-html.plugin.dev.isoxya.com:8000/data"
}

Read a PlugProc.

Response Parameters

Parameter Type Description
chans number channels, i.e. simultaneous processors (PE)
href string Href (CE/PE)
org object Org (PE)
pub boolean public, making available to every Org to use (PE)
tag string tag, used for conditional data-streamer logic, and such as might appear on a financial invoice (CE/PE)
url string URL of data-processor endpoint (CE/PE)

/plug_proc/:plug_proc_id DELETE

(CE/PE)

Response 204

HTTP/1.1 204 No Content

Delete a PlugProc.

PlugStrm

(CE/PE)

Plugin-Streamer: A streamer plugin, belonging to an Org. This registers the endpoint of a streamer plugin within the system, making it available to Crwls. A PlugStrm can be registered as public, in which case it is available to every Org. Multiple channels can be used to run more streamer instances simultaneously, increasing throughput (up to a point, and at the expense of increased resource requirements).

/org/:org_id/plug_strm POST

(PE)

Request

POST /org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/plug_strm HTTP/1.1
content-type: application/json
{
  "tag": "elasticsearch",
  "url": "http://elasticsearch.plugin.dev.isoxya.com:8000/data"
}

Response 201

HTTP/1.1 201 Created
location: /plug_strm/da5c6942-813a-4923-bacc-019c4c102585
content-type: application/json
{
  "chans": 1,
  "href": "/plug_strm/da5c6942-813a-4923-bacc-019c4c102585",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "pub": false,
  "tag": "elasticsearch",
  "url": "http://elasticsearch.plugin.dev.isoxya.com:8000/data"
}

Create a PlugStrm.

Request Parameters

Parameter Type Description
chans number? channels, i.e. simultaneous streamers (PE)
pub boolean? public, making available to every Org to use (PE)
tag string tag, such as might appear on a financial invoice (CE/PE)
url string URL of data-streamer endpoint (CE/PE)

Response Parameters

Response Parameters are as for /plug_strm/:plug_strm_id GET.

/org/:org_id/plug_strm GET

(PE)

Response 200

HTTP/1.1 200 OK
link: </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/plug_strm>; rel="first", </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/plug_strm?_next=2021-01-15T11:15:27.265412Z>; rel="next", </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/plug_strm?_prev=2021-01-15T11:15:27.265412Z>; rel="prev"
content-type: application/json
[
  {
    "chans": 1,
    "href": "/plug_strm/da5c6942-813a-4923-bacc-019c4c102585",
    "org": {
      "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
    },
    "pub": false,
    "tag": "elasticsearch",
    "url": "http://elasticsearch.plugin.dev.isoxya.com:8000/data"
  }
]

List PlugStrms.

Response Parameters

Response Parameters are as for /plug_strm/:plug_strm_id GET.

/plug_strm POST

(CE)

See /org/:org_id/plug_strm POST.

/plug_strm GET

(CE)

See /org/:org_id/plug_strm GET.

/plug_strm/:plug_strm_id GET

(CE/PE)

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "chans": 1,
  "href": "/plug_strm/da5c6942-813a-4923-bacc-019c4c102585",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "pub": false,
  "tag": "elasticsearch",
  "url": "http://elasticsearch.plugin.dev.isoxya.com:8000/data"
}

Read a PlugStrm.

Response Parameters

Parameter Type Description
chans number channels, i.e. simultaneous streamers (PE)
href string Href (CE/PE)
org object Org (PE)
pub boolean public, making available to every Org to use (PE)
tag string tag, such as might appear on a financial invoice (CE/PE)
url string URL of data-streamer endpoint (CE/PE)

/plug_strm/:plug_strm_id DELETE

(CE/PE)

Response 204

HTTP/1.1 204 No Content

Delete a PlugStrm.

Site

(CE/PE)

Site: A website. Sites must be registered within the system, after which Crwls may be created. Rate-limits are set per-site. Multiple channels can be used to open more than one simultaneous connection (not recommended for most cases, since it could take a site down).

/site POST

(CE/PE)

Request

POST /site HTTP/1.1
content-type: application/json
{
  "url": "http://example.com:80"
}

Response 201

HTTP/1.1 201 Created
location: /site/aHR0cDovL2V4YW1wbGUuY29tOjgw
content-type: application/json
{
  "chans": 1,
  "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
  "rate_lim": {
    "denominator": 10,
    "numerator": 1
  },
  "url": "http://example.com:80"
}

Create a Site.

Request Parameters

Parameter Type Description
chans number? channels, i.e. simultaneous connections (PE)
rate_lim number? rate-limit (requests/second), as a simplified rational; e.g. 1/10 means 1 request every 10 seconds (PE)
url string URL (CE/PE)

Response Parameters

Response Parameters are as for /site/:site_id GET.
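
In the examples here, the Site identifier aHR0cDovL2V4YW1wbGUuY29tOjgw is the Base64 encoding of http://example.com:80. Assuming identifiers are always derived this way, a client could compute the href for a known URL without listing Sites first. Whether the server uses the URL-safe alphabet, and how it treats padding, is not stated here; the sketch uses URL-safe Base64 with padding kept, and both function names are hypothetical.

```python
import base64


def site_id_for(url):
    """Encode a Site URL as URL-safe Base64 (assumed identifier format)."""
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")


def url_for(site_id):
    """Decode a Site identifier back to its URL (assumes padding is present)."""
    return base64.urlsafe_b64decode(site_id.encode("ascii")).decode("utf-8")


site_id = site_id_for("http://example.com:80")
print(f"/site/{site_id}")
```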

/site/:site_id GET

(CE/PE)

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "chans": 1,
  "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
  "rate_lim": {
    "denominator": 10,
    "numerator": 1
  },
  "url": "http://example.com:80"
}

Read a Site.

Response Parameters

Parameter Type Description
chans number channels, i.e. simultaneous connections (PE)
href string Href (CE/PE)
rate_lim object rate-limit (requests/second), as a simplified rational with numerator and denominator; e.g. 1/10 means 1 request every 10 seconds (PE)
url string URL (CE/PE)
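
In the response examples, rate_lim is an object with a numerator and a denominator. A client can turn that into the delay between requests using exact rational arithmetic, as in this sketch. The helper name seconds_between_requests is hypothetical.

```python
from fractions import Fraction


def seconds_between_requests(rate_lim):
    """Convert a rate_lim object (requests/second) into seconds per request."""
    rate = Fraction(rate_lim["numerator"], rate_lim["denominator"])
    return 1 / rate


# rate_lim copied from the /site/:site_id GET example: 1/10 requests/second.
delay = seconds_between_requests({"denominator": 10, "numerator": 1})
print(delay)

# Fraction reduces to lowest terms, so 2/20 compares equal to 1/10.
print(Fraction(2, 20) == Fraction(1, 10))
```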

UserAgent

(PE)

User-Agent: A user-agent identifier, belonging to an Org. A Crwl sends it as identification with each request. A UserAgent can be registered as public, in which case it is available to every Org.

/org/:org_id/user_agent POST

(PE)

Request

POST /org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/user_agent HTTP/1.1
content-type: application/json
{
  "str": "Isoxya/${VERSION} (+https://www.isoxya.com/)"
}

Response 201

HTTP/1.1 201 Created
location: /user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7
content-type: application/json
{
  "href": "/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "pub": false,
  "str": "Isoxya/${VERSION} (+https://www.isoxya.com/)"
}

Create a UserAgent.

Request Parameters

Parameter Type Description
pub boolean? public, making available to every Org to use (PE)
str string user-agent sent as identifier during requests; ${VERSION} is interpolated (PE)

Response Parameters

Response Parameters are as for /user_agent/:user_agent_id GET.

/org/:org_id/user_agent GET

(PE)

Response 200

HTTP/1.1 200 OK
link: </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/user_agent>; rel="first", </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/user_agent?_next=2021-01-15T11:15:32.420728Z>; rel="next", </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/user_agent?_prev=2021-01-15T11:15:32.420728Z>; rel="prev"
content-type: application/json
[
  {
    "href": "/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7",
    "org": {
      "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
    },
    "pub": false,
    "str": "Isoxya/${VERSION} (+https://www.isoxya.com/)"
  }
]

List UserAgents.

Response Parameters

Response Parameters are as for /user_agent/:user_agent_id GET.

/user_agent/:user_agent_id GET

(PE)

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "href": "/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "pub": false,
  "str": "Isoxya/${VERSION} (+https://www.isoxya.com/)"
}

Read a UserAgent.

Response Parameters

Parameter Type Description
href string Href (PE)
org object Org (PE)
pub boolean public, making available to every Org to use (PE)
str string user-agent sent as identifier during requests; ${VERSION} is interpolated (PE)
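
To preview what a str value might look like once ${VERSION} is interpolated, the substitution can be imitated locally. This sketch uses Python's string.Template, which happens to share the ${NAME} syntax. It only illustrates the idea and makes no claim about how Isoxya performs the interpolation; render_user_agent is a hypothetical name, and the version passed in is taken from this document's title.

```python
from string import Template


def render_user_agent(template, version):
    """Replace ${VERSION} in a user-agent template, leaving other text as-is."""
    return Template(template).safe_substitute(VERSION=version)


ua = render_user_agent("Isoxya/${VERSION} (+https://www.isoxya.com/)", "2.1.0")
print(ua)
```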

/user_agent/:user_agent_id DELETE

(PE)

Response 204

HTTP/1.1 204 No Content

Delete a UserAgent.