CREATE an External Collection

  • Capella Columnar
  • reference
    +
    This topic describes how you use the CREATE statement to create a collection so that you can query OLAP data on an external data source.
     CreateRemoteCollection | CreateExternalCollection | CreateStandaloneCollection

    To create a link to an external data source, you use the Capella Columnar UI. See Set Up an External Data Source.

    Syntax

    CreateExternalCollection EBNF
    CreateExternalCollection ::= "CREATE" "EXTERNAL" "ANALYTICS"? "COLLECTION" ("IF" "NOT" "EXISTS")?
                                 QualifiedName
                                 CollectionTypeDef?
                                 "ON" Identifier
                                 "AT" LinkName
                                  ( ( 'PATH' | 'USING' ) StringLiteral )?
                                 "WITH" ObjectConstructor
    CreateExternalCollection Diagram
    CREATE
    You must include the keyword EXTERNAL when creating this type of collection.
    Show QualifiedName Diagram
    (DatabaseAndScopeName ".")? Identifier
    QualifiedName
    Show CollectionTypeDef Diagram
    "(" CollectionFieldDef ("," CollectionFieldDef )* ")"
    CollectionTypeDef
    Show CollectionFieldDef Diagram
    Identifier ( "BOOLEAN" | "BIGINT" | "INT" | "DOUBLE" | "STRING" ) ( "NOT" "UNKNOWN" )?
    CollectionFieldDef

    Examples

    The following examples create external Capella Columnar collections for data stored in an Amazon S3 bucket.

    Use backtick characters (``) to delimit identifiers that include the - operator symbol.
      CREATE EXTERNAL COLLECTION music.myPlaylist.countrySongs
        ON `data-music`
        AT musicLink
        PATH "music/myPlaylist/countrySongs"
        WITH {"format": "json"};
    Show another example
      CREATE EXTERNAL COLLECTION music.myPlaylist.rockSongs
        ON `data-music`
        AT musicLink
        PATH "music/myPlaylist/rockSongs"
        WITH {"format": "json"};

    Arguments

    ON

    The ON clause identifies the bucket name on the external data source, such as an Amazon S3 bucket. Supply only the name of the bucket, not a URL.

    AT

    The AT clause specifies the name of the link that contains credentials for the S3 bucket name. The specified link must have a type of S3.

    PATH

    The PATH, or USING, clause is a string that specifies the location of the data files relative to the bucket name provided by the ON clause. Do not supply a filename as part of the path.

    For example, the path can contain one or more Amazon S3 prefixes delimited by slashes /, such as:

    PATH "music/myPlaylist/rockSongs"

    For your query to include files located at the bucket, or top, level of the storage organization, you supply an empty path:

    PATH ""

    External collections use the provided PATH to query the files in the specified location in the external object store. You can use the PATH to point to a desired subset of data only, leading to better performance as Capella Columnar does not query any files outside the provided PATH. For information about using dynamic prefixes in the path, which give you the ability to specify the exact prefix name in the WHERE clauses of your queries, see Design a Location Path and Optimize Performance of External Analytics Collections in Couchbase.

    To specify particular filenames in the path to query, you can supply either an include or an exclude parameter in the WITH clause.

    WITH

    The WITH clause enables you to specify parameters for the collection. Its ObjectConstructor represents an object containing key-value pairs, one for each parameter. You can define the following parameters.

    Name Description Schema

    format
    Required

    Specifies the format of the external data. Accepts one of the following string values:

    json — JSON Lines
    csv — Comma-separated values
    tsv — Tab-separated values
    parquet — Apache Parquet
    avro — Apache Avro

    For files that have a format of CSV or TSV, you also supply a list of field names and data types in the CollectionTypeDef, described below. For information about the mapping that Capella Columnar performs for Avro and Parquet data types, see Data Type Mapping.

    enum: json, csv, tsv, parquet, avro

    table format
    Required

    An external dataset pointing to a Deltalake Table. This enables you to create external datasets based on Delta Lake tables.

    "table-format": "delta"

    Parquet

    decimal-to-double
    Optional

    Delta DECIMAL values are converted to doubles, with the possibility of precision loss. The flag decimal-to-double must be set upon creating the dataset.

    Boolean

    timestamp-to-long
    Optional

    By default, Delta timestamps are converted to long. Set this flag to false to parse them as datetime.

    Boolean

    date-to-int
    Optional

    By default, Delta date are converted to int. Set this flag to false to parse them as date.

    Boolean

    timezone
    Optional

    The timezone flag adjusts temporal types (e.g., DATETIME) stored in UTC within Delta table to a specified local timezone (e.g., PST).

    String

    header
    Required

    Only used if the format is CSV or TSV.

    When true, skip the first row of the file.

    Boolean

    redact-warnings
    Optional

    Only used if the format is CSV or TSV.

    When true, redact sensitive information—such as the filename—from warning messages.

    Default : false

    Boolean

    null
    Optional

    Only used if the format is CSV or TSV.

    The string used in the external collection to represent a null value. Use a backslash \ to escape special characters: see Literals.

    See the NOT UNKNOWN flag in the CollectionTypeDef definition that follows.

    Example : "\\N"

    string

    exclude
    Optional

    Applies only if the include parameter is not present.

    The names of the files in the specified path to exclude from querying. The supplied value can include a prefix, or subdirectory, of the location specified by the PATH clause. Capella Columnar queries any files that do not match the specification.

    You can use the following wildcard characters, in common with the include parameter:

    * — Matches anything
    ? — Matches any single character
    [ sequence ] — Matches any characters in sequence
    [! sequence ] — Matches any characters not in sequence
    Example : "*.?sv"

    string, or array of strings

    include
    Optional

    Applies only if the exclude parameter is not present.

    The names of the files in the specified path to include in queries. The supplied value can include a prefix, or subdirectory, of the location specified by the PATH clause. Capella Columnar queries only files that match the specification.

    You can use the same wildcard characters as for the exclude parameter.
    Example : ["*2018*.json", "*2019*.json"]

    string, or array of strings

    CollectionTypeDef

    You use the CollectionTypeDef if the files in the external data store have a format of CSV or TSV. It consists of a comma-separated list of field definitions and their desired field types. These definitions guide the transformation of each CSV or TSV record into a JSON object. Each field definition consists of:

    • The name to assign to the field.

    • The data type of the field. This can be any of the primitive data types, where INT is an alias for BIGINT. If the field does not contain a value of this data type, Capella Columnar ignores the record and issues a warning.

    • Optionally, the NOT UNKNOWN flag. When this flag is present, if this field is missing or null, Capella Columnar ignores the record.

    Data in JSON, CSV, or TSV format can be in compressed gzip files, with the extension .gz or .gzip.

    For more information about external collections, see Set Up an External Data Source.

    Data Type Mapping: Parquet and Avro

    For files that have a format of Parquet or Avro, Capella Columnar maps data types as follows:

    Parquet/Avro Data Type Capella Columnar Data Type

    INT

    INT

    LONG

    INT

    FLOAT

    DOUBLE

    DOUBLE

    DOUBLE

    BYTES

    BINARY

    STRING

    STRING

    BOOLEAN

    BOOLEAN

    NULL

    NULL

    Union (nullable)*

    NULL/ANY

    Record

    OBJECT

    Array

    ARRAY

    MAP

    ARRAY of OBJECT in the form
    [{"key": <mapKey>, "value": <mapValue>}, …​]

    *Avro only