
Isoxya 2.0.2

Isoxya is a web crawler and data processing system, representing years of research into building a next-generation web crawler. It can process websites with tens of millions of pages, and extract and transform that data in myriad ways, including streaming data into Elasticsearch. A flexible plugin system interfaces via JSON with software written in numerous other programming languages, allowing its functionality to be extended to support multiple industries.

Installation

Tigrosa

Tigrosa 2 is required as an authentication proxy. The full set of routes supported by Isoxya is the set of Tigrosa routes plus the Isoxya routes detailed here. Tigrosa can be installed before or after Isoxya, although Tigrosa will not run until Isoxya is running, and equally it won't be possible to initialise Isoxya until Tigrosa is running.

Different databases within the same PostgreSQL and Redis servers can be used if required. Installing Tigrosa within the same PostgreSQL or Redis databases as Isoxya is not supported. It is strongly recommended to use different PostgreSQL users for Tigrosa, as detailed in the documentation.

PostgreSQL

create users and database

CREATE USER isx_dev;

CREATE DATABASE isx_dev OWNER isx_dev;

CREATE USER isx_dev_pe_auto;
CREATE USER isx_dev_pe_crwl;
CREATE USER isx_dev_pe_proc;
CREATE USER isx_dev_pe_strm;

GRANT isx_dev TO isx_dev_pe_auto;
GRANT isx_dev TO isx_dev_pe_crwl;
GRANT isx_dev TO isx_dev_pe_proc;
GRANT isx_dev TO isx_dev_pe_strm;

PostgreSQL 13 is required as the main datastore. Other recent versions may also work.

For high-availability, a number of options are possible, such as active-passive with automatic failover, or multi-master where available. Depending on the strategy used, a minimum of either 2 or 3 nodes is recommended.

Redis

Redis 6 is required as a cache. Other recent versions are also likely to work.

For high-availability, an active-passive setup with automatic failover is possible. Depending on the strategy used, a minimum of either 2 or 3 nodes is recommended.

RabbitMQ

create vhost and users

rabbitmqctl add_vhost isx_dev &&

rabbitmqctl add_user isx_dev         $(openssl rand -base64 32) &&
rabbitmqctl add_user isx_dev_pe_auto $(openssl rand -base64 32) &&
rabbitmqctl add_user isx_dev_pe_crwl $(openssl rand -base64 32) &&
rabbitmqctl add_user isx_dev_pe_proc $(openssl rand -base64 32) &&
rabbitmqctl add_user isx_dev_pe_strm $(openssl rand -base64 32) &&

rabbitmqctl clear_password isx_dev         &&
rabbitmqctl clear_password isx_dev_pe_auto &&
rabbitmqctl clear_password isx_dev_pe_crwl &&
rabbitmqctl clear_password isx_dev_pe_proc &&
rabbitmqctl clear_password isx_dev_pe_strm &&

rabbitmqctl set_permissions -p isx_dev isx_dev         ".*" ".*" ".*" &&
rabbitmqctl set_permissions -p isx_dev isx_dev_pe_auto ".*" ".*" ".*" &&
rabbitmqctl set_permissions -p isx_dev isx_dev_pe_crwl ".*" ".*" ".*" &&
rabbitmqctl set_permissions -p isx_dev isx_dev_pe_proc ".*" ".*" ".*" &&
rabbitmqctl set_permissions -p isx_dev isx_dev_pe_strm ".*" ".*" ".*" &&

true

set policies

rabbitmqctl set_policy -p isx_dev base ".*" \
    '{"ha-mode":"all","ha-sync-mode":"automatic"}' \
    --priority 0 &&
rabbitmqctl set_policy -p isx_dev hlth "_\.healthcheck" \
    '{"ha-mode":"all","ha-sync-mode":"automatic","message-ttl":0}' \
    --priority 1 --apply-to queues &&
rabbitmqctl set_policy -p isx_dev dyn "(crwl|proc|strm)\..*" \
    '{"ha-mode":"all","ha-sync-mode":"automatic","expires":604800000}' \
    --priority 1 --apply-to queues &&

true

RabbitMQ 3 is required for messaging between the main programs.

For high-availability, RabbitMQ's built-in multi-broker setup is possible. A minimum of 3 nodes is recommended, with mirrored queues applied via a policy, and pause-minority cluster partition handling.

Containers

Podman, Docker, Docker Swarm, Kubernetes, or an alternative are recommended for running the main programs. Alternatively, it is possible to run binaries with few dependencies instead, but these are not currently supplied separately, meaning it would be necessary to extract the binaries from the container images.

For high-availability, it is possible to either let the container orchestrator handle this, or alternatively set up multiple instances behind an HTTP load-balancer. Depending on the strategy used, a minimum of either 2 or 3 nodes is recommended.

Dynamic Resources

Pacemaker with other ClusterLabs components is required for dynamic resource management. In future, it is hoped that other orchestrators will also be supported, such as Docker Swarm or Kubernetes (get in touch if you'd like to discuss an extension such as this).

For high-availability, Pacemaker can be configured to allow resource migration across the cluster automatically. A minimum of 3 nodes is recommended.

Plugins

The Isoxya engine requires plugins to run. Which plugins are used can change the utility of the web crawler dramatically. For example, one set of plugins could turn Isoxya into an SEO crawler, another set of plugins into a large-scale spellchecker, and another set of plugins into an image search engine.

The installation instructions for plugins vary; consult their documentation for specific steps to take. References includes a list of open-source or proprietary plugins known to be available.

For high-availability, it is possible to either let the container orchestrator handle this, or alternatively set up multiple instances behind an HTTP load-balancer. Depending on the strategy used, a minimum of either 2 or 3 nodes is recommended.

Configuration

isx-pe-api

isx-pe-api is the Isoxya Pro Edition API, controlling the main engine. It is typically installed on the Containers servers.

Environment Variables

Variable Default Description
ADDRESS localhost address to bind to; must be local
PORT 8000 port to bind to
CONFIG_FILE config.yml config file
LICENSE_FILE license.yml licence file
POSTGRESQL_URL postgres://postgres:postgres@pg:5432 PostgreSQL URL
RABBITMQ_URL amqp://guest:guest@rmq:5672/ RabbitMQ URL
REDIS_URL redis://rds:6379 Redis URL
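
As a pre-flight check before launching the container, it can help to see which of these variables are actually overridden. The following is a minimal sketch, not part of Isoxya: the helper names `resolve_settings` and `overridden` are hypothetical, and the defaults are simply copied from the table above.

```python
import os
from typing import Mapping

# Defaults copied from the isx-pe-api table above.
DEFAULTS = {
    "ADDRESS": "localhost",
    "PORT": "8000",
    "CONFIG_FILE": "config.yml",
    "LICENSE_FILE": "license.yml",
    "POSTGRESQL_URL": "postgres://postgres:postgres@pg:5432",
    "RABBITMQ_URL": "amqp://guest:guest@rmq:5672/",
    "REDIS_URL": "redis://rds:6379",
}


def resolve_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Return the effective value of each documented variable."""
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


def overridden(env: Mapping[str, str] = os.environ) -> list:
    """List the documented variables whose value differs from the default."""
    settings = resolve_settings(env)
    return sorted(name for name, value in settings.items() if value != DEFAULTS[name])
```

Calling `overridden()` with no arguments inspects the current process environment; an empty result means every service URL still points at the default hostnames (`pg`, `rmq`, `rds`).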

isx-pe-auto

isx-pe-auto is the Isoxya Pro Edition dynamic resource manager allocator, responsible for launching crawlers, processors, and streamers. It is typically installed on the Dynamic Resources servers.

Container Mounts

Location Target
/ /mnt/chroot.d

Environment Variables

Variable Default Description
CONFIG_FILE config.yml config file
LICENSE_FILE license.yml licence file
POSTGRESQL_URL postgres://postgres:postgres@pg:5432 PostgreSQL URL
RABBITMQ_URL amqp://guest:guest@rmq:5672/ RabbitMQ URL
REDIS_URL redis://rds:6379 Redis URL

isx-pe-auto-stop-inactive

isx-pe-auto-stop-inactive is the Isoxya Pro Edition dynamic resource manager deallocator, responsible for cleaning up crawlers, processors, and streamers. It is typically installed on the Dynamic Resources servers.

Container Mounts

Location Target
/ /mnt/chroot.d

Environment Variables

Variable Default Description
CONFIG_FILE config.yml config file
LICENSE_FILE license.yml licence file
POSTGRESQL_URL postgres://postgres:postgres@pg:5432 PostgreSQL URL
RABBITMQ_URL amqp://guest:guest@rmq:5672/ RabbitMQ URL
REDIS_URL redis://rds:6379 Redis URL

isx-pe-auto-validate-ext

isx-pe-auto-validate-ext is the Isoxya Pro Edition validate external scheduled task, generating new crawls automatically where required. It is typically installed on the Dynamic Resources servers.

Container Mounts

Location Target
/ /mnt/chroot.d

Environment Variables

Variable Default Description
CONFIG_FILE config.yml config file
LICENSE_FILE license.yml licence file
POSTGRESQL_URL postgres://postgres:postgres@pg:5432 PostgreSQL URL
RABBITMQ_URL amqp://guest:guest@rmq:5672/ RabbitMQ URL
REDIS_URL redis://rds:6379 Redis URL

isx-pe-crwl

isx-pe-crwl is the Isoxya Pro Edition crawler, crawling a single site. One of these is run for every site being crawled, potentially more than one in parallel if multiple channels are being used. It is typically installed on the Dynamic Resources servers.

Environment Variables

Variable Default Description
CONFIG_FILE config.yml config file
LICENSE_FILE license.yml licence file
POSTGRESQL_URL postgres://postgres:postgres@pg:5432 PostgreSQL URL
RABBITMQ_URL amqp://guest:guest@rmq:5672/ RabbitMQ URL
REDIS_URL redis://rds:6379 Redis URL

isx-pe-proc

isx-pe-proc is the Isoxya Pro Edition processor, connecting to a processor plugin. One of these is run for every processor plugin, potentially more than one in parallel if multiple channels are being used. It is typically installed on the Dynamic Resources servers.

Environment Variables

Variable Default Description
CONFIG_FILE config.yml config file
LICENSE_FILE license.yml licence file
POSTGRESQL_URL postgres://postgres:postgres@pg:5432 PostgreSQL URL
RABBITMQ_URL amqp://guest:guest@rmq:5672/ RabbitMQ URL
REDIS_URL redis://rds:6379 Redis URL

isx-pe-strm

isx-pe-strm is the Isoxya Pro Edition streamer, connecting to a streamer plugin. One of these is run for every streamer plugin, potentially more than one in parallel if multiple channels are being used. It is typically installed on the Dynamic Resources servers.

Environment Variables

Variable Default Description
CONFIG_FILE config.yml config file
LICENSE_FILE license.yml licence file
POSTGRESQL_URL postgres://postgres:postgres@pg:5432 PostgreSQL URL
RABBITMQ_URL amqp://guest:guest@rmq:5672/ RabbitMQ URL
REDIS_URL redis://rds:6379 Redis URL

Initialisation

Use the Isoxya x Bin Scripts to complete setup, either directly or by using the scripts as reference.

Log in

Log in using Tigrosa.

Create an organisation

Create an Org using Tigrosa.

Register processor plugins

isx-create-plug-proc

Register a PlugProc, pointing to the processor plugin endpoint (potentially through a load-balancer terminating SSL).

Repeat this step as many times as needed, to register multiple processor plugins.

Register streamer plugins

isx-create-plug-strm

Register a PlugStrm, pointing to the streamer plugin endpoint (potentially through a load-balancer terminating SSL).

Repeat this step as many times as needed, to register multiple streamer plugins.

Create user-agent identities

isx-create-user-agent

Create a UserAgent identity, used by the crawlers for requests.

Repeat this step as many times as needed, to create multiple user-agent identities.

Usage

Log in

Log in using Tigrosa.

Register a site

isx-create-site

Register a Site which you want to crawl.

Start a crawl

isx-create-crwl

Start a Crwl.

Read resources

isx-read

Read a Crwl or other resources.

References

Isoxya x Bin Scripts

Isoxya x Bin Scripts is an open-source (BSD 3-Clause) collection of scripts for Isoxya web crawler. With these, it's possible to crawl sites and perform other operations using the Isoxya API. These are useful not only in development, but also as a demo of Isoxya's main capabilities, a quick way of performing actions even in production, and also in providing a functional reference for those wishing to develop their own programs on top of Isoxya.

Isoxya plugin: Crawler HTML

Isoxya plugin: Crawler HTML is an open-source (BSD 3-Clause) processor plugin for Isoxya web crawler. This plugin uses Isoxya 2 JSON interfaces to provide a core run loop for the crawling engine, receiving data for each page post-request, parsing it as static HTML, constructing URL metadata, and responding with a set of outbound URLs.

Isoxya plugin: Elasticsearch

Isoxya plugin: Elasticsearch is an open-source (BSD 3-Clause) streamer plugin for Isoxya web crawler. This plugin uses Isoxya 2 JSON interfaces to stream data into an Elasticsearch cluster, making it possible to query using all the normal features provided by Elasticsearch and Kibana.

Isoxya plugin: Spellchecker

Isoxya plugin: Spellchecker is an open-source (BSD 3-Clause) processor plugin for Isoxya web crawler. This plugin uses Isoxya 2 JSON interfaces to provide spellchecking capabilities to entire websites, even if they have millions of pages.

Interfaces

Processor /* POST

Request

POST /* HTTP/1.1
content-type: application/json
{
  "body": "PCFkb2N0eXBlIGh0bWw+CjxodG1sPgo8aGVhZD4KICAgIDx0aXRsZT5FeGFtcGxlIERvbWFpbjwvdGl0bGU+CgogICAgPG1ldGEgY2hhcnNldD0idXRmLTgiIC8+CiAgICA8bWV0YSBodHRwLWVxdWl2PSJDb250ZW50LXR5cGUiIGNvbnRlbnQ9InRleHQvaHRtbDsgY2hhcnNldD11dGYtOCIgLz4KICAgIDxtZXRhIG5hbWU9InZpZXdwb3J0IiBjb250ZW50PSJ3aWR0aD1kZXZpY2Utd2lkdGgsIGluaXRpYWwtc2NhbGU9MSIgLz4KICAgIDxzdHlsZSB0eXBlPSJ0ZXh0L2NzcyI+CiAgICBib2R5IHsKICAgICAgICBiYWNrZ3JvdW5kLWNvbG9yOiAjZjBmMGYyOwogICAgICAgIG1hcmdpbjogMDsKICAgICAgICBwYWRkaW5nOiAwOwogICAgICAgIGZvbnQtZmFtaWx5OiAtYXBwbGUtc3lzdGVtLCBzeXN0ZW0tdWksIEJsaW5rTWFjU3lzdGVtRm9udCwgIlNlZ29lIFVJIiwgIk9wZW4gU2FucyIsICJIZWx2ZXRpY2EgTmV1ZSIsIEhlbHZldGljYSwgQXJpYWwsIHNhbnMtc2VyaWY7CiAgICAgICAgCiAgICB9CiAgICBkaXYgewogICAgICAgIHdpZHRoOiA2MDBweDsKICAgICAgICBtYXJnaW46IDVlbSBhdXRvOwogICAgICAgIHBhZGRpbmc6IDJlbTsKICAgICAgICBiYWNrZ3JvdW5kLWNvbG9yOiAjZmRmZGZmOwogICAgICAgIGJvcmRlci1yYWRpdXM6IDAuNWVtOwogICAgICAgIGJveC1zaGFkb3c6IDJweCAzcHggN3B4IDJweCByZ2JhKDAsMCwwLDAuMDIpOwogICAgfQogICAgYTpsaW5rLCBhOnZpc2l0ZWQgewogICAgICAgIGNvbG9yOiAjMzg0ODhmOwogICAgICAgIHRleHQtZGVjb3JhdGlvbjogbm9uZTsKICAgIH0KICAgIEBtZWRpYSAobWF4LXdpZHRoOiA3MDBweCkgewogICAgICAgIGRpdiB7CiAgICAgICAgICAgIG1hcmdpbjogMCBhdXRvOwogICAgICAgICAgICB3aWR0aDogYXV0bzsKICAgICAgICB9CiAgICB9CiAgICA8L3N0eWxlPiAgICAKPC9oZWFkPgoKPGJvZHk+CjxkaXY+CiAgICA8aDE+RXhhbXBsZSBEb21haW48L2gxPgogICAgPHA+VGhpcyBkb21haW4gaXMgZm9yIHVzZSBpbiBpbGx1c3RyYXRpdmUgZXhhbXBsZXMgaW4gZG9jdW1lbnRzLiBZb3UgbWF5IHVzZSB0aGlzCiAgICBkb21haW4gaW4gbGl0ZXJhdHVyZSB3aXRob3V0IHByaW9yIGNvb3JkaW5hdGlvbiBvciBhc2tpbmcgZm9yIHBlcm1pc3Npb24uPC9wPgogICAgPHA+PGEgaHJlZj0iaHR0cHM6Ly93d3cuaWFuYS5vcmcvZG9tYWlucy9leGFtcGxlIj5Nb3JlIGluZm9ybWF0aW9uLi4uPC9hPjwvcD4KPC9kaXY+CjwvYm9keT4KPC9odG1sPgo=",
  "header": {
    "Vary": "Accept-Encoding",
    "Content-Type": "text/html; charset=UTF-8",
    "Content-Encoding": "gzip",
    "Etag": "\"3147526947+gzip\"",
    "Expires": "Tue, 02 Feb 2021 11:19:21 GMT",
    "Age": "264946",
    "Last-Modified": "Thu, 17 Oct 2019 07:18:26 GMT",
    "Date": "Tue, 26 Jan 2021 11:19:21 GMT",
    "Server": "ECS (dcb/7EC6)",
    "Content-Length": "648",
    "Cache-Control": "max-age=604800",
    "X-Cache": "HIT"
  },
  "meta": {
    "status": 200,
    "config": null,
    "url": "http://example.com:80/",
    "method": "GET",
    "err": null,
    "duration": {
      "denominator": 1000000000,
      "numerator": 119165427
    }
  }
}

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "data": {
    "status": 200,
    "method": "GET",
    "header": {
      "Vary": "Accept-Encoding",
      "Content-Type": "text/html; charset=UTF-8",
      "Content-Encoding": "gzip",
      "Etag": "\"3147526947+gzip\"",
      "Expires": "Tue, 02 Feb 2021 11:19:21 GMT",
      "Age": "264946",
      "Last-Modified": "Thu, 17 Oct 2019 07:18:26 GMT",
      "Date": "Tue, 26 Jan 2021 11:19:21 GMT",
      "Server": "ECS (dcb/7EC6)",
      "Content-Length": "648",
      "Cache-Control": "max-age=604800",
      "X-Cache": "HIT"
    },
    "err": null,
    "duration": {
      "denominator": 1000000000,
      "numerator": 119165427
    }
  },
  "urls": [
    "https://www.iana.org/domains/example"
  ]
}

Send page metadata, header, and body to a processor plugin, and receive extracted data and outbound URLs to crawl.

Request Parameters

Parameter Type Description
body string Base64-encoded page body, e.g. HTML or image
header object HTTP header of response
meta.status number? HTTP status code, if available
meta.config object? processor config
meta.url string URL of page
meta.method string HTTP method of request
meta.err string? error (e.g. RobotDisallowed)
meta.duration number? duration of request, expressed as a simplified rational (numerator and denominator)

Response Parameters

Parameter Type Description
data object free-form extracted data; processor may define own schema
urls array.string outbound URLs to crawl, if any
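
The request and response shapes above can be exercised with a small handler. This is only a sketch of what a processor plugin might do, not the implementation of any existing plugin: the function name `process`, the `links` field in `data`, and the handling of `meta.err` are assumptions made for illustration. The HTTP server around it is left out.

```python
import base64
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def process(request: dict) -> dict:
    """Build a processor response from a processor request (hypothetical)."""
    meta = request["meta"]
    # Assumption: when the crawler reports an error, there is nothing to parse.
    if meta.get("err"):
        return {"data": {"err": meta["err"]}, "urls": []}
    html = base64.b64decode(request["body"]).decode("utf-8", errors="replace")
    collector = _LinkCollector()
    collector.feed(html)
    # Resolve relative links against the page URL.
    urls = [urljoin(meta["url"], href) for href in collector.hrefs]
    return {
        # `data` is free-form; this schema is made up for the example.
        "data": {"status": meta.get("status"), "links": len(urls)},
        "urls": urls,
    }
```

Returning absolute URLs in `urls` mirrors the example response above; whether relative URLs would also be accepted is not stated here.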

Streamer /* POST

Request

POST /* HTTP/1.1
content-type: application/json
{
  "crwl": {
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-26T11:46:21.411346Z",
    "t_begin": "2021-01-26T11:46:21.411346Z"
  },
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "t_retrieval": "2021-01-26T11:46:21.908809003Z",
  "data": {
    "status": 200,
    "method": "GET",
    "header": {
      "Vary": "Accept-Encoding",
      "Content-Type": "text/html; charset=UTF-8",
      "Content-Encoding": "gzip",
      "Etag": "\"3147526947\"",
      "Expires": "Tue, 02 Feb 2021 11:46:21 GMT",
      "Age": "425384",
      "Last-Modified": "Thu, 17 Oct 2019 07:18:26 GMT",
      "Date": "Tue, 26 Jan 2021 11:46:21 GMT",
      "Server": "ECS (dcb/7F15)",
      "Content-Length": "648",
      "Cache-Control": "max-age=604800",
      "Accept-Ranges": "bytes",
      "X-Cache": "HIT"
    },
    "err": null,
    "duration": {
      "denominator": 250000000,
      "numerator": 30304037
    }
  },
  "url": "http://example.com:80/",
  "plug_proc": {
    "tag": "crawler-html",
    "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f"
  },
  "site": {
    "url": "http://example.com:80",
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw"
  }
}

Response 200

HTTP/1.1 200 OK
content-type: application/json

Send extracted data and page metadata to a streamer plugin.

Request Parameters

Parameter Type Description
crwl.href string Crwl Href
crwl.t_begin string Crwl time crawl began
org.href string Org Href
t_retrieval string time page retrieved
data object free-form extracted data; processor may define own schema
url string URL of page
plug_proc.tag string PlugProc tag; useful for conditional logic in the streamer, and of the kind that might appear on a financial invoice
plug_proc.href string PlugProc Href
site.url string Site URL
site.href string Site Href
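
A streamer plugin typically reshapes this request before passing it on to its destination. The sketch below shows one way that might look, including the conditional logic on `plug_proc.tag` mentioned above. The function names, the flattened record layout, and the destination names are all hypothetical, and the unit of the duration rational is not stated in this document, so it is kept as an exact fraction.

```python
from fractions import Fraction


def to_record(request: dict) -> dict:
    """Flatten a streamer request into one record (hypothetical layout)."""
    data = request["data"]
    # `data` is free-form; a duration rational is present in the example above,
    # but a processor is not obliged to include one.
    duration = data.get("duration") if isinstance(data, dict) else None
    return {
        "org": request["org"]["href"],
        "site": request["site"]["url"],
        "crwl": request["crwl"]["href"],
        "url": request["url"],
        "t_retrieval": request["t_retrieval"],
        "plug_proc": request["plug_proc"]["tag"],
        "duration": (
            Fraction(duration["numerator"], duration["denominator"])
            if duration
            else None
        ),
        "data": data,
    }


def destination(request: dict) -> str:
    """Pick a destination name from the PlugProc tag (names are made up)."""
    tag = request["plug_proc"]["tag"]
    return "pages" if tag == "crawler-html" else f"misc-{tag}"
```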

Apex

Crwl

Crawl: A crawl of a Site. Belongs to an Org. A crawl can be a web crawl starting from the site apex, a web crawl starting from a list of pages, or a list crawl visiting only the pages on the list. Crawls can optionally validate external links, in which case new crawls are spawned after the parent crawl completes. A single crawl covers a single site; to crawl several sites, use multiple crawls.

/site/:site_id/crwl POST

Request

POST /site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl HTTP/1.1
content-type: application/json
{
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "plug_proc": [
    {
      "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f"
    }
  ],
  "plug_strm": [
    {
      "href": "/plug_strm/da5c6942-813a-4923-bacc-019c4c102585"
    }
  ],
  "user_agent": {
    "href": "/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7"
  }
}

Response 201

HTTP/1.1 201 Created
location: /site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T19:20:56.518498Z
content-type: application/json
{
  "depth_max": null,
  "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T19:20:56.518498Z",
  "list": null,
  "method": "GET",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "p": null,
  "pages": null,
  "pages_max": null,
  "plug_proc": [
    {
      "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f"
    }
  ],
  "plug_proc_conf": null,
  "plug_strm": [
    {
      "href": "/plug_strm/da5c6942-813a-4923-bacc-019c4c102585"
    }
  ],
  "progress": null,
  "site": {
    "chans": 1,
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
    "rate_lim": {
      "denominator": 10,
      "numerator": 1
    },
    "url": "http://example.com:80"
  },
  "status": "pending",
  "t_begin": "2021-01-25T19:20:56.518498Z",
  "t_end": null,
  "user_agent": {
    "href": "/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7"
  },
  "validate_ext": false
}

Create a Crwl.

Request Parameters

Parameter Type Description
depth_max number? max depth to crawl before terminating early (approximate)
list object? List, if list crawl rather than web crawl
org object Org
pages_max number? max pages to crawl before terminating early (approximate)
plug_proc[] array.object PlugProcs
plug_proc_conf object? processor config
plug_strm[] array.object PlugStrms
user_agent object UserAgent
validate_ext boolean? whether to validate external links by auto-generating child crawls after completion

Response Parameters

Response Parameters are as for /site/:site_id/crwl/:site_v GET.
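
For clients building this request themselves, the following sketch assembles the path and JSON body. In the examples in this document, the site identifier in the path matches the standard Base64 encoding of the site URL (`http://example.com:80` encodes to `aHR0cDovL2V4YW1wbGUuY29tOjgw`); that this holds in general, and which Base64 alphabet and padding apply, is an assumption here. The function names are hypothetical.

```python
import base64
import json


def site_id(site_url: str) -> str:
    """Encode a site URL the way the example paths appear to (assumption)."""
    return base64.b64encode(site_url.encode("utf-8")).decode("ascii")


def create_crwl_request(site_url, org, plug_procs, plug_strms, user_agent, **optional):
    """Return (path, JSON body) for POST /site/:site_id/crwl."""
    body = {
        "org": {"href": org},
        "plug_proc": [{"href": href} for href in plug_procs],
        "plug_strm": [{"href": href} for href in plug_strms],
        "user_agent": {"href": user_agent},
    }
    # Optional parameters (depth_max, pages_max, validate_ext, ...) are only
    # sent when given, matching the minimal request shown above.
    body.update({key: value for key, value in optional.items() if value is not None})
    return f"/site/{site_id(site_url)}/crwl", json.dumps(body)
```

For example, `create_crwl_request("http://example.com:80", org_href, [proc_href], [strm_href], ua_href, pages_max=100)` yields the path used in the request above, with `pages_max` added to the body.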

/site/:site_id/crwl GET

Response 200

HTTP/1.1 200 OK
link: </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl>; rel="first", </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl?_next=2021-01-25T19:20:56.518498Z>; rel="next", </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl?_prev=2021-01-25T19:20:56.518498Z>; rel="prev"
content-type: application/json
[
  {
    "depth_max": null,
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T19:20:56.518498Z",
    "list": null,
    "method": "GET",
    "org": {
      "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
    },
    "p": null,
    "pages": 1,
    "pages_max": null,
    "plug_proc": [
      {
        "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f"
      }
    ],
    "plug_proc_conf": null,
    "plug_strm": [
      {
        "href": "/plug_strm/da5c6942-813a-4923-bacc-019c4c102585"
      }
    ],
    "progress": 0,
    "site": {
      "chans": 1,
      "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
      "rate_lim": {
        "denominator": 10,
        "numerator": 1
      },
      "url": "http://example.com:80"
    },
    "status": "pending",
    "t_begin": "2021-01-25T19:20:56.518498Z",
    "t_end": null,
    "user_agent": {
      "href": "/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7"
    },
    "validate_ext": false
  }
]

List Crwls.

Response Parameters

Response Parameters are as for /site/:site_id/crwl/:site_v GET.
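
List responses carry their pagination in the `link` header, as in the example above. A minimal sketch of following it is shown below. The `fetch` callable is supplied by the caller and is hypothetical, as is the stopping rule: this document does not say how the last page is signalled, so the loop stops on an empty page or a missing `next` link.

```python
import re

_LINK = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def parse_link_header(value: str) -> dict:
    """Map each rel (first, next, prev) to its target path."""
    return {rel: target for target, rel in _LINK.findall(value)}


def iter_items(fetch, path):
    """Yield items page by page; `fetch(path)` returns (items, link_header)."""
    while path:
        items, link = fetch(path)
        # Assumption: an empty page means there is nothing further to read.
        if not items:
            return
        yield from items
        path = parse_link_header(link or "").get("next")
```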

/site/:site_id/crwl/:site_v GET

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "depth_max": null,
  "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T19:20:56.518498Z",
  "list": null,
  "method": "GET",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "p": null,
  "pages": 1,
  "pages_max": null,
  "plug_proc": [
    {
      "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f"
    }
  ],
  "plug_proc_conf": null,
  "plug_strm": [
    {
      "href": "/plug_strm/da5c6942-813a-4923-bacc-019c4c102585"
    }
  ],
  "progress": 0,
  "site": {
    "chans": 1,
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
    "rate_lim": {
      "denominator": 10,
      "numerator": 1
    },
    "url": "http://example.com:80"
  },
  "status": "pending",
  "t_begin": "2021-01-25T19:20:56.518498Z",
  "t_end": null,
  "user_agent": {
    "href": "/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7"
  },
  "validate_ext": false
}

Read a Crwl.

Response Parameters

Parameter Type Description
depth_max number? max depth to crawl before terminating early (approximate)
href string Href
list object? List, if list crawl rather than web crawl
method string HTTP method for requests
org object Org
p object? parent Crwl, if crawl auto-generated during another crawl
pages number? total pages discovered
pages_max number? max pages to crawl before terminating early (approximate)
plug_proc[] array.object PlugProcs
plug_proc_conf object? processor config
plug_strm[] array.object PlugStrms
progress number? progress of crawl (%)
site object Site
status string status of crawl (pending, completed, limited, canceled)
t_begin string time crawl began
t_end string? time crawl ended, if finished
user_agent object UserAgent
validate_ext boolean whether to validate external links by auto-generating child crawls after completion

/site/:site_id/crwl/:site_v PATCH

Request

PATCH /site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T19:20:56.518498Z HTTP/1.1
content-type: application/json
{
  "status": "canceled"
}

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "depth_max": null,
  "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T19:20:56.518498Z",
  "list": null,
  "method": "GET",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "p": null,
  "pages": 1,
  "pages_max": null,
  "plug_proc": [
    {
      "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f"
    }
  ],
  "plug_proc_conf": null,
  "plug_strm": [
    {
      "href": "/plug_strm/da5c6942-813a-4923-bacc-019c4c102585"
    }
  ],
  "progress": 0,
  "site": {
    "chans": 1,
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
    "rate_lim": {
      "denominator": 10,
      "numerator": 1
    },
    "url": "http://example.com:80"
  },
  "status": "canceled",
  "t_begin": "2021-01-25T19:20:56.518498Z",
  "t_end": null,
  "user_agent": {
    "href": "/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7"
  },
  "validate_ext": false
}

Update a Crwl.

Request Parameters

Parameter Type Description
status string status of crawl; canceled cancels crawl

Response Parameters

Response Parameters are as for /site/:site_id/crwl/:site_v GET.

/site/:site_id/crwl/:site_v/crwl GET

Response 200

HTTP/1.1 200 OK
link: </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T20:09:57.106349Z/crwl>; rel="first", </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T20:09:57.106349Z/crwl?_next=2021-01-25T20:27:03.480965Z>; rel="next", </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T20:09:57.106349Z/crwl?_prev=2021-01-25T20:27:03.480965Z>; rel="prev"
content-type: application/json
[
  {
    "depth_max": 1,
    "href": "/site/aHR0cHM6Ly93d3cuaWFuYS5vcmc6NDQz/crwl/2021-01-25T20:27:03.480965Z",
    "list": {
      "href": "/list/cf7faaf0-2fbc-4f6f-a96d-379efef34d3e"
    },
    "method": "GET",
    "org": {
      "href": "/org/2db5e7d7-60b9-4a81-8649-e60ee3b05d38"
    },
    "p": {
      "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T20:09:57.106349Z"
    },
    "pages": 1,
    "pages_max": null,
    "plug_proc": [
      {
        "href": "/plug_proc/057eae7b-5779-4549-be13-d88808708ea9"
      }
    ],
    "plug_proc_conf": null,
    "plug_strm": [
      {
        "href": "/plug_strm/52822786-1343-47b6-b3fa-761e773c9ba5"
      }
    ],
    "progress": 100,
    "site": {
      "chans": 1,
      "href": "/site/aHR0cHM6Ly93d3cuaWFuYS5vcmc6NDQz",
      "rate_lim": {
        "denominator": 10,
        "numerator": 1
      },
      "url": "https://www.iana.org:443"
    },
    "status": "completed",
    "t_begin": "2021-01-25T20:27:03.480965Z",
    "t_end": "2021-01-25T20:28:04.391544Z",
    "user_agent": {
      "href": "/user_agent/1e25f62f-ef6c-4ca9-93ad-63210e28ced9"
    },
    "validate_ext": false
  }
]

List child Crwls of a parent Crwl.

Response Parameters

Response Parameters are as for /site/:site_id/crwl/:site_v GET.

/site/:site_id/crwl/:site_v/list GET

Response 200

HTTP/1.1 200 OK
link: </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T20:09:57.106349Z/list>; rel="first", </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T20:09:57.106349Z/list?_next=2021-01-25T20:10:00.015549Z>; rel="next", </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T20:09:57.106349Z/list?_prev=2021-01-25T20:10:00.015549Z>; rel="prev"
content-type: application/json
[
  {
    "crwl": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crwl/2021-01-25T20:09:57.106349Z",
    "href": "/list/cf7faaf0-2fbc-4f6f-a96d-379efef34d3e",
    "org": {
      "href": "/org/2db5e7d7-60b9-4a81-8649-e60ee3b05d38"
    },
    "pages": 1,
    "site": {
      "chans": 1,
      "href": "/site/aHR0cHM6Ly93d3cuaWFuYS5vcmc6NDQz",
      "rate_lim": {
        "denominator": 10,
        "numerator": 1
      },
      "url": "https://www.iana.org:443"
    }
  }
]

List child Lists of a parent Crwl.

Response Parameters

Response Parameters are as for /list/:list_id GET.

FinLdgr

Finance-Ledger: A ledger showing completed Crwls, which PlugProc and PlugStrm plugins they used, how many pages were crawled, and which Org is responsible for paying the bill.

/fin_ldgr GET

Response 200

HTTP/1.1 200 OK
link: </fin_ldgr>; rel="first", </fin_ldgr?_next=2021-01-21T13:27:04.628437Z>; rel="next", </fin_ldgr?_prev=2021-01-21T13:27:04.628437Z>; rel="prev"
content-type: application/json
[
  {
    "fin_prod": {
      "href": "/fin_prod/da5c6942-813a-4923-bacc-019c4c102585",
      "t_ins": "2021-01-15T11:16:06.854017Z",
      "tag": "plug_strm.elasticsearch"
    },
    "href": "/fin_ldgr/cfdc7e29-5325-59b2-bf71-859c17198da9",
    "org": {
      "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
    },
    "qty": 3,
    "t_ins": "2021-01-21T13:27:04.628437Z"
  },
  {
    "fin_prod": {
      "href": "/fin_prod/76ce4d4a-a965-4ab4-9fae-2c375750cb0f",
      "t_ins": "2021-01-15T11:16:06.854017Z",
      "tag": "plug_proc.crawler-html"
    },
    "href": "/fin_ldgr/bcd6aad4-428e-5996-b707-370c8507f51c",
    "org": {
      "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
    },
    "qty": 3,
    "t_ins": "2021-01-21T13:27:04.628437Z"
  }
]

List FinLdgr entries.

Response Parameters

Response Parameters are as for /fin_ldgr/:fin_ldgr_id GET.

/fin_ldgr/:fin_ldgr_id GET

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "fin_prod": {
    "href": "/fin_prod/da5c6942-813a-4923-bacc-019c4c102585",
    "t_ins": "2021-01-15T11:16:06.854017Z",
    "tag": "plug_strm.elasticsearch"
  },
  "href": "/fin_ldgr/cfdc7e29-5325-59b2-bf71-859c17198da9",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "qty": 3,
  "t_ins": "2021-01-21T13:27:04.628437Z"
}

Read a FinLdgr entry.

Response Parameters

Parameter Type Description
fin_prod object FinProd
href string Href
org object Org
qty number quantity of units used
t_ins string time associated with usage
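
Since each entry records a quantity against one FinProd and one Org, usage can be totalled client-side once the entries have been read. A minimal sketch, assuming the entries are the decoded JSON objects shown above; the function name `usage_by_product` is hypothetical.

```python
from collections import defaultdict


def usage_by_product(entries):
    """Sum qty per (org href, fin_prod tag) over FinLdgr entries."""
    totals = defaultdict(int)
    for entry in entries:
        key = (entry["org"]["href"], entry["fin_prod"]["tag"])
        totals[key] += entry["qty"]
    return dict(totals)
```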

FinProd

Finance-Product: A PlugProc or PlugStrm plugin, used within the FinLdgr to record usage.

/fin_prod GET

Response 200

HTTP/1.1 200 OK
link: </fin_prod>; rel="first", </fin_prod?_next=2021-01-15T11:16:06.854017Z>; rel="next", </fin_prod?_prev=2021-01-21T11:16:38.118334Z>; rel="prev"
content-type: application/json
[
  {
    "href": "/fin_prod/58ce5bbe-5a0a-43df-8860-01e70820e6d8",
    "t_ins": "2021-01-21T11:16:38.118334Z",
    "tag": "plug_proc.spellchecker"
  },
  {
    "href": "/fin_prod/da5c6942-813a-4923-bacc-019c4c102585",
    "t_ins": "2021-01-15T11:16:06.854017Z",
    "tag": "plug_strm.elasticsearch"
  },
  {
    "href": "/fin_prod/76ce4d4a-a965-4ab4-9fae-2c375750cb0f",
    "t_ins": "2021-01-15T11:16:06.854017Z",
    "tag": "plug_proc.crawler-html"
  }
]

List FinProds.

Response Parameters

Response Parameters are as for /fin_prod/:fin_prod_id GET.

/fin_prod/:fin_prod_id GET

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "href": "/fin_prod/76ce4d4a-a965-4ab4-9fae-2c375750cb0f",
  "t_ins": "2021-01-15T11:16:06.854017Z",
  "tag": "plug_proc.crawler-html"
}

Read a FinProd.

Response Parameters

Parameter Type Description
href string Href
t_ins string time of auto-creation
tag string auto-derived tag, such as might appear on a financial invoice

List

List: A list of pages within a Site. Belongs to an Org. Used for list crawls and web crawls with external link validation.

/site/:site_id/list POST

Request

POST /site/aHR0cDovL2V4YW1wbGUuY29tOjgw/list HTTP/1.1
content-type: application/json
{
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  }
}

Response 201

HTTP/1.1 201 Created
location: /list/d8996e4e-4942-4b8a-9d23-00901dc010ff
content-type: application/json
{
  "crwl": null,
  "href": "/list/d8996e4e-4942-4b8a-9d23-00901dc010ff",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "pages": 0,
  "site": {
    "chans": 1,
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
    "rate_lim": {
      "denominator": 10,
      "numerator": 1
    },
    "url": "http://example.com:80"
  }
}

Create a List.

Request Parameters

Parameter Type Description
org object Org

Response Parameters

Response Parameters are as for /list/:list_id GET.

/site/:site_id/list GET

Response 200

HTTP/1.1 200 OK
link: </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/list>; rel="first", </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/list?_next=2021-01-25T18:38:40.423053Z>; rel="next", </site/aHR0cDovL2V4YW1wbGUuY29tOjgw/list?_prev=2021-01-25T18:38:40.423053Z>; rel="prev"
content-type: application/json
[
  {
    "crwl": null,
    "href": "/list/d8996e4e-4942-4b8a-9d23-00901dc010ff",
    "org": {
      "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
    },
    "pages": 1,
    "site": {
      "chans": 1,
      "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
      "rate_lim": {
        "denominator": 10,
        "numerator": 1
      },
      "url": "http://example.com:80"
    }
  }
]

List Lists.

Response Parameters

Response Parameters are as for /list/:list_id GET.
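
List responses such as the one above carry a `link` header with `first`, `next`, and `prev` targets. A client might extract them as sketched below. The regular expression handles only the simple form shown in the examples, not every variant of the Link header syntax, and `parse_link_header` is a hypothetical helper.

```python
import re

# Matches one '<target>; rel="name"' entry of a link header.
LINK_PART = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def parse_link_header(value: str) -> dict[str, str]:
    """Map each rel value (first, next, prev) to its target path."""
    return {rel: target for target, rel in LINK_PART.findall(value)}


header = (
    '</site/aHR0cDovL2V4YW1wbGUuY29tOjgw/list>; rel="first", '
    '</site/aHR0cDovL2V4YW1wbGUuY29tOjgw/list?_next=2021-01-25T18:38:40.423053Z>; rel="next", '
    '</site/aHR0cDovL2V4YW1wbGUuY29tOjgw/list?_prev=2021-01-25T18:38:40.423053Z>; rel="prev"'
)
links = parse_link_header(header)
print(links["next"])
```

To walk a whole collection, a client would request the `next` target repeatedly; the condition that marks the last page is not covered here.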

/list/:list_id GET

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "crwl": null,
  "href": "/list/d8996e4e-4942-4b8a-9d23-00901dc010ff",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "pages": 1,
  "site": {
    "chans": 1,
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
    "rate_lim": {
      "denominator": 10,
      "numerator": 1
    },
    "url": "http://example.com:80"
  }
}

Read a List.

Response Parameters

Parameter Type Description
crwl object? Crwl, if the list was auto-generated during a crawl; otherwise null
href string Href
org object Org
pages number total pages in list
site object Site

/list/:list_id DELETE

Response 204

HTTP/1.1 204 No Content

Delete a List.

ListPage

ListPage: A page within a List, which belongs to a Site.

/list/:list_id/list_page POST

Request

POST /list/d8996e4e-4942-4b8a-9d23-00901dc010ff/list_page HTTP/1.1
content-type: application/json
{
  "url": ["/a"]
}

Response 204

HTTP/1.1 204 No Content

Insert one or more pages into a List.

Request Parameters

Parameter Type Description
url array.string site URLs to add to list
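
The request example passes `/a` rather than a full URL, which suggests that pages are given relative to the Site. Assuming that is the case, the sketch below turns absolute page URLs into such a request body. `list_page_body` is a hypothetical helper, and comparing only scheme and host (ignoring the port) is a simplification.

```python
import json
from urllib.parse import urlsplit


def list_page_body(site_url: str, page_urls: list[str]) -> str:
    """Build a JSON body for /list/:list_id/list_page from absolute page URLs."""
    site = urlsplit(site_url)
    paths = []
    for page_url in page_urls:
        page = urlsplit(page_url)
        # Simplification: compare scheme and host only, ignoring the port.
        if (page.scheme, page.hostname) != (site.scheme, site.hostname):
            raise ValueError(f"{page_url} does not belong to {site_url}")
        path = page.path or "/"
        if page.query:
            path += "?" + page.query
        paths.append(path)
    return json.dumps({"url": paths})


print(list_page_body("http://example.com:80", ["http://example.com/a"]))
```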

Org

Organisation: The business, organisation, or human entity registered with the system, and responsible for paying any bills. Does not log in directly.

PlugProc

Plugin-Processor: A processor plugin, belonging to an Org. This registers the endpoint of a processor plugin within the system, making it available to Crwls. A PlugProc can be registered as public, in which case it is made available to every Org. Multiple channels can be used to run more processor instances simultaneously, increasing throughput (up to some level, and at the expense of increased resource requirements).

/org/:org_id/plug_proc POST

Request

POST /org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/plug_proc HTTP/1.1
content-type: application/json
{
  "tag": "crawler-html",
  "url": "http://crawler-html.plugin.dev.isoxya.com:8000/data"
}

Response 201

HTTP/1.1 201 Created
location: /plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f
content-type: application/json
{
  "chans": 1,
  "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "pub": false,
  "tag": "crawler-html",
  "url": "http://crawler-html.plugin.dev.isoxya.com:8000/data"
}

Create a PlugProc.

Request Parameters

Parameter Type Description
chans number? channels, i.e. simultaneous processors
pub boolean? public, making available to every Org to use
tag string tag, used for conditional data-streamer logic; may also appear on a financial invoice
url string URL of data-processor endpoint
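
The request example omits `chans` and `pub`, and the response reports `1` and `false` for them, so the `?` suffix appears to mark parameters that may be left out. Under that assumption, a client could build the request body as sketched below; `plug_proc_body` is a hypothetical helper, not part of any Isoxya client library.

```python
import json
from typing import Optional


def plug_proc_body(tag: str, url: str,
                   chans: Optional[int] = None,
                   pub: Optional[bool] = None) -> str:
    """Serialise a PlugProc create request, leaving out unset optional fields."""
    body: dict = {"tag": tag, "url": url}
    if chans is not None:
        body["chans"] = chans
    if pub is not None:
        body["pub"] = pub
    return json.dumps(body)


print(plug_proc_body("crawler-html",
                     "http://crawler-html.plugin.dev.isoxya.com:8000/data"))
```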

Response Parameters

Response Parameters are as for /plug_proc/:plug_proc_id GET.

/org/:org_id/plug_proc GET

Response 200

HTTP/1.1 200 OK
link: </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/plug_proc>; rel="first", </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/plug_proc?_next=2021-01-15T11:15:23.366804Z>; rel="next", </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/plug_proc?_prev=2021-01-21T11:15:06.05113Z>; rel="prev"
content-type: application/json
[
  {
    "chans": 1,
    "href": "/plug_proc/58ce5bbe-5a0a-43df-8860-01e70820e6d8",
    "org": {
      "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
    },
    "pub": false,
    "tag": "spellchecker",
    "url": "http://spellchecker.plugin.dev.isoxya.com:8000/data"
  },
  {
    "chans": 1,
    "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f",
    "org": {
      "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
    },
    "pub": false,
    "tag": "crawler-html",
    "url": "http://crawler-html.plugin.dev.isoxya.com:8000/data"
  }
]

List PlugProcs.

Response Parameters

Response Parameters are as for /plug_proc/:plug_proc_id GET.

/plug_proc/:plug_proc_id GET

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "chans": 1,
  "href": "/plug_proc/76ce4d4a-a965-4ab4-9fae-2c375750cb0f",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "pub": false,
  "tag": "crawler-html",
  "url": "http://crawler-html.plugin.dev.isoxya.com:8000/data"
}

Read a PlugProc.

Response Parameters

Parameter Type Description
chans number channels, i.e. simultaneous processors
href string Href
org object Org
pub boolean public, making available to every Org to use
tag string tag, used for conditional data-streamer logic; may also appear on a financial invoice
url string URL of data-processor endpoint

/plug_proc/:plug_proc_id DELETE

Response 204

HTTP/1.1 204 No Content

Delete a PlugProc.

PlugStrm

Plugin-Streamer: A streamer plugin, belonging to an Org. This registers the endpoint of a streamer plugin within the system, making it available to Crwls. A PlugStrm can be registered as public, in which case it is made available to every Org. Multiple channels can be used to run more streamer instances simultaneously, increasing throughput (up to some level, and at the expense of increased resource requirements).

/org/:org_id/plug_strm POST

Request

POST /org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/plug_strm HTTP/1.1
content-type: application/json
{
  "tag": "elasticsearch",
  "url": "http://elasticsearch.plugin.dev.isoxya.com:8000/data"
}

Response 201

HTTP/1.1 201 Created
location: /plug_strm/da5c6942-813a-4923-bacc-019c4c102585
content-type: application/json
{
  "chans": 1,
  "href": "/plug_strm/da5c6942-813a-4923-bacc-019c4c102585",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "pub": false,
  "tag": "elasticsearch",
  "url": "http://elasticsearch.plugin.dev.isoxya.com:8000/data"
}

Create a PlugStrm.

Request Parameters

Parameter Type Description
chans number? channels, i.e. simultaneous streamers
pub boolean? public, making available to every Org to use
tag string tag, such as might appear on a financial invoice
url string URL of data-streamer endpoint

Response Parameters

Response Parameters are as for /plug_strm/:plug_strm_id GET.

/org/:org_id/plug_strm GET

Response 200

HTTP/1.1 200 OK
link: </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/plug_strm>; rel="first", </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/plug_strm?_next=2021-01-15T11:15:27.265412Z>; rel="next", </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/plug_strm?_prev=2021-01-15T11:15:27.265412Z>; rel="prev"
content-type: application/json
[
  {
    "chans": 1,
    "href": "/plug_strm/da5c6942-813a-4923-bacc-019c4c102585",
    "org": {
      "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
    },
    "pub": false,
    "tag": "elasticsearch",
    "url": "http://elasticsearch.plugin.dev.isoxya.com:8000/data"
  }
]

List PlugStrms.

Response Parameters

Response Parameters are as for /plug_strm/:plug_strm_id GET.

/plug_strm/:plug_strm_id GET

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "chans": 1,
  "href": "/plug_strm/da5c6942-813a-4923-bacc-019c4c102585",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "pub": false,
  "tag": "elasticsearch",
  "url": "http://elasticsearch.plugin.dev.isoxya.com:8000/data"
}

Read a PlugStrm.

Response Parameters

Parameter Type Description
chans number channels, i.e. simultaneous streamers
href string Href
org object Org
pub boolean public, making available to every Org to use
tag string tag, such as might appear on a financial invoice
url string URL of data-streamer endpoint

/plug_strm/:plug_strm_id DELETE

Response 204

HTTP/1.1 204 No Content

Delete a PlugStrm.

Site

Site: A website. Sites must be registered within the system, after which Crwls may be created. Rate-limits are set per-site. Multiple channels can be used to open more than one simultaneous connection (not recommended for most cases, since it could take a site down).

/site POST

Request

POST /site HTTP/1.1
content-type: application/json
{
  "url": "http://example.com:80"
}

Response 201

HTTP/1.1 201 Created
location: /site/aHR0cDovL2V4YW1wbGUuY29tOjgw
content-type: application/json
{
  "chans": 1,
  "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
  "rate_lim": {
    "denominator": 10,
    "numerator": 1
  },
  "url": "http://example.com:80"
}

Create a Site.

Request Parameters

Parameter Type Description
chans number? channels, i.e. simultaneous connections
rate_lim number? rate limit in requests per second; e.g. 1/10 means 1 request every 10 seconds; this is a simplified rational
url string URL
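
As a worked illustration of the rate limit as a simplified rational, the sketch below uses Python's `fractions.Fraction` to reduce a rate to lowest terms and to derive the average gap between requests. `seconds_between_requests` is a hypothetical helper; how Isoxya schedules requests internally is not described here.

```python
from fractions import Fraction


def seconds_between_requests(numerator: int, denominator: int) -> Fraction:
    """Return the average gap in seconds implied by a requests-per-second rate."""
    rate = Fraction(numerator, denominator)  # reduced to lowest terms automatically
    if rate <= 0:
        raise ValueError("rate limit must be positive")
    return 1 / rate


rate = Fraction(2, 20)
print(rate.numerator, rate.denominator)  # 1 10
print(seconds_between_requests(1, 10))   # 10
```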

Response Parameters

Response Parameters are as for /site/:site_id GET.

/site/:site_id GET

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "chans": 1,
  "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
  "rate_lim": {
    "denominator": 10,
    "numerator": 1
  },
  "url": "http://example.com:80"
}

Read a Site.

Response Parameters

Parameter Type Description
chans number channels, i.e. simultaneous connections
href string Href
rate_lim number rate limit in requests per second, returned as a simplified rational with numerator and denominator; e.g. 1/10 means 1 request every 10 seconds
url string URL
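
The Site ID in the examples, `aHR0cDovL2V4YW1wbGUuY29tOjgw`, matches the Base64 encoding of `http://example.com:80`. The sketch below converts in both directions on that basis. Whether Isoxya uses the URL-safe alphabet or keeps padding is an assumption, since the example needs neither, and both function names are hypothetical.

```python
import base64


def site_id_from_url(url: str) -> str:
    """Encode a Site URL as Base64 (assumed scheme, inferred from the examples)."""
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")


def url_from_site_id(site_id: str) -> str:
    """Decode a Site ID back to its URL, restoring any stripped padding."""
    padded = site_id + "=" * (-len(site_id) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


print(site_id_from_url("http://example.com:80"))  # aHR0cDovL2V4YW1wbGUuY29tOjgw
```

In practice, the `href` or `location` value returned by /site POST is the safer source for the ID.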

UserAgent

User-Agent: A user-agent identifier, belonging to an Org. A Crwl sends it as identification with each request. A UserAgent can be registered as public, in which case it is made available to every Org.

/org/:org_id/user_agent POST

Request

POST /org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/user_agent HTTP/1.1
content-type: application/json
{
  "str": "Isoxya/${VERSION} (+https://www.isoxya.com/)"
}

Response 201

HTTP/1.1 201 Created
location: /user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7
content-type: application/json
{
  "href": "/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "pub": false,
  "str": "Isoxya/${VERSION} (+https://www.isoxya.com/)"
}

Create a UserAgent.

Request Parameters

Parameter Type Description
pub boolean? public, making available to every Org to use
str string user-agent sent as identifier during requests; ${VERSION} is interpolated

Response Parameters

Response Parameters are as for /user_agent/:user_agent_id GET.

/org/:org_id/user_agent GET

Response 200

HTTP/1.1 200 OK
link: </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/user_agent>; rel="first", </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/user_agent?_next=2021-01-15T11:15:32.420728Z>; rel="next", </org/1df9ee03-3c25-4ea6-9276-d4c7a58de332/user_agent?_prev=2021-01-15T11:15:32.420728Z>; rel="prev"
content-type: application/json
[
  {
    "href": "/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7",
    "org": {
      "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
    },
    "pub": false,
    "str": "Isoxya/${VERSION} (+https://www.isoxya.com/)"
  }
]

List UserAgents.

Response Parameters

Response Parameters are as for /user_agent/:user_agent_id GET.

/user_agent/:user_agent_id GET

Response 200

HTTP/1.1 200 OK
content-type: application/json
{
  "href": "/user_agent/9529f38e-c75b-4361-818d-b9ac3c0c81e7",
  "org": {
    "href": "/org/1df9ee03-3c25-4ea6-9276-d4c7a58de332"
  },
  "pub": false,
  "str": "Isoxya/${VERSION} (+https://www.isoxya.com/)"
}

Read a UserAgent.

Response Parameters

Parameter Type Description
href string Href
org object Org
pub boolean public, making available to every Org to use
str string user-agent sent as identifier during requests; ${VERSION} is interpolated
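
To illustrate the effect of `${VERSION}` interpolation, the sketch below renders the example string with Python's `string.Template`, which happens to use the same `${NAME}` placeholder syntax. The version `2.0.2` is an assumed value for illustration, and how Isoxya performs the substitution internally is not specified here.

```python
from string import Template


def render_user_agent(template: str, version: str) -> str:
    """Substitute ${VERSION} in a user-agent string."""
    return Template(template).substitute(VERSION=version)


print(render_user_agent("Isoxya/${VERSION} (+https://www.isoxya.com/)", "2.0.2"))
```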

/user_agent/:user_agent_id DELETE

Response 204

HTTP/1.1 204 No Content

Delete a UserAgent.