Atlas Data Federation supports S3 buckets as federated database instance stores. You must define mappings in your federated database instance to your S3 bucket to run queries against your data.
Example Configuration for S3 Data Store
Example
Consider an S3 bucket datacenter-alpha containing data
collected from a datacenter:
|--metrics
   |--hardware
The /metrics/hardware path stores JSON files with metrics
derived from the datacenter hardware, where each filename is
the UNIX timestamp in milliseconds of the 24-hour period
covered by that file:
/hardware/1564671291998.json
The following configuration:
- Defines a federated database instance store on the datacenter-alpha S3 bucket in the us-east-1 AWS region. The federated database instance store is specifically restricted to only data files in the metrics folder path.
- Maps files from the hardware folder to a MongoDB database datacenter-alpha-metrics and collection hardware. The configuration mapping includes parsing logic for capturing the timestamp implied in the filename.
{
  "stores" : [
    {
      "name" : "datacenter-alpha",
      "provider" : "s3",
      "region" : "us-east-1",
      "bucket" : "datacenter-alpha",
      "additionalStorageClasses" : [
        "STANDARD_IA"
      ],
      "prefix" : "/metrics",
      "delimiter" : "/"
    }
  ],
  "databases" : [
    {
      "name" : "datacenter-alpha-metrics",
      "collections" : [
        {
          "name" : "hardware",
          "dataSources" : [
            {
              "storeName" : "datacenter-alpha",
              "path" : "/hardware/{date date}"
            }
          ]
        }
      ]
    }
  ]
}
Atlas Data Federation parses the S3 bucket datacenter-alpha and processes all
files under /metrics/hardware/. The collection uses the
path parsing syntax to map the filename to
the date field, which is an ISO-8601 date, in each document. If
a matching date field does not exist in a document, it will be
added.
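To make the filename-to-date mapping concrete, the following Python sketch converts the example filename into an ISO-8601 date. It assumes the file stem is a UNIX timestamp in milliseconds interpreted as UTC. The function name filename_to_iso_date is hypothetical, and the sketch only illustrates the conversion, not how Atlas Data Federation parses paths internally.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def filename_to_iso_date(key: str) -> str:
    """Read the file stem as UNIX milliseconds and return an ISO-8601 UTC string."""
    millis = int(PurePosixPath(key).stem)  # "1564671291998.json" -> 1564671291998
    seconds, remainder_ms = divmod(millis, 1000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    moment = moment.replace(microsecond=remainder_ms * 1000)
    return moment.isoformat(timespec="milliseconds")


iso_date = filename_to_iso_date("/hardware/1564671291998.json")
print(iso_date)
```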
Users connected to the federated database instance can use the MongoDB Query Language
and supported aggregations to analyze data in the S3 bucket
through the datacenter-alpha-metrics.hardware collection.
Configuration Format
The federated database instance configuration has the following format:
{
  "stores" : [
    {
      "name" : "<string>",
      "provider": "<string>",
      "region" : "<string>",
      "bucket" : "<string>",
      "additionalStorageClasses" : ["<string>"],
      "prefix" : "<string>",
      "includeTags": <boolean>,
      "delimiter": "<string>",
      "public": <boolean>
    }
  ],
  "databases" : [
    {
      "name" : "<string>",
      "collections" : [
        {
          "name" : "<string>",
          "dataSources" : [
            {
              "storeName" : "<string>",
              "path" : "<string>",
              "defaultFormat" : "<string>",
              "provenanceFieldName": "<string>",
              "omitAttributes": true | false
            }
          ]
        }
      ],
      "maxWildcardCollections" : <integer>,
      "views" : [
        {
          "name" : "<string>",
          "source" : "<string>",
          "pipeline" : "<string>"
        }
      ]
    }
  ]
}
stores
The stores object defines each data store associated with the federated database instance. The federated database instance store captures files in an S3 bucket, documents in an Atlas cluster, or files stored at publicly accessible URLs. Data Federation can only access data stores defined in the stores object.
databases
The databases object defines the mapping between each federated database instance store defined in stores and MongoDB collections in the databases.
stores
"stores" : [
  {
    "name" : "<string>",
    "provider" : "<string>",
    "region" : "<string>",
    "bucket" : "<string>",
    "additionalStorageClasses" : ["<string>"],
    "prefix" : "<string>",
    "delimiter" : "<string>",
    "includeTags": <boolean>,
    "public": <boolean>
  }
]
stores
Array of objects where each object represents a data store to associate with the federated database instance. The federated database instance store captures files in an S3 bucket, documents in an Atlas cluster, or files stored at publicly accessible URLs. Atlas Data Federation can only access data stores defined in the stores object.
stores.[n].name
Name of the federated database instance store. The databases.[n].collections.[n].dataSources.[n].storeName field references this value as part of mapping configuration.
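Because every storeName must refer to a defined store name, a configuration can be checked for dangling references before it is saved. The following Python sketch is one way to do that over a configuration loaded as a dictionary. The helper unknown_store_names and the misspelled store name are hypothetical, and this is a local check written for illustration, not a validation routine that Atlas provides.

```python
def unknown_store_names(config: dict) -> list:
    """Return (database, collection, storeName) triples whose storeName is not defined in stores."""
    known = {store["name"] for store in config.get("stores", [])}
    dangling = []
    for database in config.get("databases", []):
        for collection in database.get("collections", []):
            for source in collection.get("dataSources", []):
                if source["storeName"] not in known:
                    dangling.append((database["name"], collection["name"], source["storeName"]))
    return dangling


config = {
    "stores": [{"name": "datacenter-alpha", "provider": "s3"}],
    "databases": [{
        "name": "datacenter-alpha-metrics",
        "collections": [{
            "name": "hardware",
            "dataSources": [
                {"storeName": "datacenter-alpha", "path": "/hardware/{date date}"},
                {"storeName": "datacenter-alfa", "path": "/hardware/*"},  # deliberate typo
            ],
        }],
    }],
}
problems = unknown_store_names(config)
print(problems)
```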
stores.[n].region
Name of the AWS region in which the S3 bucket is hosted. For a list of valid region names, see Amazon Web Services (AWS).
stores.[n].bucket
Name of the AWS S3 bucket. Must exactly match the name of an S3 bucket which Atlas Data Federation can access with the configured AWS IAM credentials.
stores.[n].additionalStorageClasses
Optional. Array of AWS S3 storage classes. Atlas Data Federation will include the files in these storage classes in the query results. Valid values are:
- INTELLIGENT_TIERING to include files in the Intelligent Tiering storage class
- STANDARD_IA to include files in the Standard-Infrequent Access storage class
Note
Files in the Standard storage class are supported by default.
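The effect of additionalStorageClasses can be pictured as a filter over the objects in the bucket. The Python sketch below models that rule under the assumption that the Standard class is reported as STANDARD. The object keys and the helper is_included are hypothetical, and the code is an illustration of the documented rule rather than Atlas behavior verified against a live bucket.

```python
DEFAULT_STORAGE_CLASS = "STANDARD"  # assumption: label used for the Standard class


def is_included(storage_class: str, additional_storage_classes=()) -> bool:
    """Standard objects are always included; other classes only when listed."""
    return storage_class == DEFAULT_STORAGE_CLASS or storage_class in additional_storage_classes


# Hypothetical objects and their storage classes.
objects = {
    "/metrics/hardware/a.json": "STANDARD",
    "/metrics/hardware/b.json": "STANDARD_IA",
    "/metrics/hardware/c.json": "INTELLIGENT_TIERING",
}
included = sorted(key for key, cls in objects.items() if is_included(cls, ["STANDARD_IA"]))
print(included)
```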
stores.[n].prefix
Optional. Prefix Atlas Data Federation applies when searching for files in the S3 bucket.
For example, consider an S3 bucket metrics with the following structure:
metrics
|--hardware
|--software
   |--computed
The federated database instance store prepends the value of prefix to the databases.[n].collections.[n].dataSources.[n].path to create the full path for files to ingest. Setting the prefix to /software restricts any databases objects using the federated database instance store to only subpaths /software.
If omitted, Atlas Data Federation searches all files from the root of the S3 bucket.
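The way prefix and path combine can be sketched in a few lines of Python. This is a minimal illustration that assumes exactly one delimiter is kept between the two parts. The helper full_search_path is hypothetical, and how Atlas Data Federation normalizes leading or trailing delimiters is not stated in this section.

```python
def full_search_path(prefix: str, path: str, delimiter: str = "/") -> str:
    """Prepend the store prefix to a dataSources path, keeping one delimiter between them."""
    if not prefix:
        return path  # no prefix: search from the root of the bucket
    return prefix.rstrip(delimiter) + delimiter + path.lstrip(delimiter)


print(full_search_path("/software", "/computed"))
print(full_search_path("/metrics", "/hardware/{date date}"))
print(full_search_path("", "/hardware"))
```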
stores.[n].delimiter
Optional. The delimiter that separates databases.[n].collections.[n].dataSources.[n].path segments in the federated database instance store. Data Federation uses the delimiter to efficiently traverse S3 buckets with a hierarchical directory structure. You can specify any character supported by the S3 object keys as the delimiter. For example, you can specify an underscore (_) or a plus sign (+) or multiple characters such as double underscores (__) as the delimiter.
If omitted, defaults to "/".
stores.[n].includeTags
Optional. Determines whether or not to use S3 tags on the files in the given path as additional partition attributes. Valid values are true and false.
If omitted, defaults to false.
If set to true, Atlas Data Federation does the following:
- Adds the S3 tags as additional partition attributes.
- Adds new top level BSON elements that associate each tag to each document for the tagged files.
Warning
If set to true, Atlas Data Federation processes the files for additional partition attributes by making extra calls to S3 to get the tags. This behavior might impact performance.
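As a rough picture of what includeTags does to query results, the Python sketch below merges a file's S3 tags into each document as top-level fields. The tag names, the document fields, and the helper with_tag_attributes are hypothetical. Which value wins when a tag key collides with an existing field is an assumption made here for illustration and is not stated in this section.

```python
def with_tag_attributes(document: dict, s3_tags: dict, include_tags: bool = False) -> dict:
    """Return a copy of the document, adding S3 tags as top-level fields when enabled."""
    merged = dict(document)
    if include_tags:
        # Assumption: on a key collision the tag value replaces the document value.
        merged.update(s3_tags)
    return merged


doc = {"host": "rack-12", "cpu": 0.71}          # hypothetical document
tags = {"team": "infra", "environment": "prod"}  # hypothetical S3 tags on the file

print(with_tag_attributes(doc, tags, include_tags=True))
print(with_tag_attributes(doc, tags))
```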
stores.[n].public
Optional. Specifies whether the bucket is public.
If set to true, Atlas Data Federation doesn't use the configured AWS IAM role to access the S3 bucket. If set to false, the configured AWS IAM role must include permissions to access the S3 bucket, even if that bucket is public.
If omitted, defaults to false.
databases
"databases" : [
  {
    "name" : "<string>",
    "collections" : [
      {
        "name" : "<string>",
        "dataSources" : [
          {
            "storeName" : "<string>",
            "defaultFormat" : "<string>",
            "path" : "<string>",
            "provenanceFieldName": "<string>",
            "omitAttributes": <boolean>
          }
        ]
      }
    ],
    "maxWildcardCollections" : <integer>,
    "views" : [
      {
        "name" : "<string>",
        "source" : "<string>",
        "pipeline" : "<string>"
      }
    ]
  }
]
databases
Array of objects where each object represents a database, its collections, and, optionally, any views on the collections. Each database can have multiple collections and views objects.
databases.[n].name
Name of the database to which Atlas Data Federation maps the data contained in the data store.
databases.[n].collections
Array of objects where each object represents a collection and data sources that map to a stores federated database instance store.
databases.[n].collections.[n].name
Name of the collection to which Atlas Data Federation maps the data contained in each databases.[n].collections.[n].dataSources.[n].storeName. Each object in the array represents the mapping between the collection and an object in the stores array.
You can generate collection names dynamically from file paths by specifying * for the collection name and the collectionName() function in the path field. See Generate Dynamic Collection Names from File Path for examples.
databases.[n].collections.[n].dataSources
Array of objects where each object represents a stores federated database instance store to map with the collection.
databases.[n].collections.[n].dataSources.[n].storeName
Name of a federated database instance store to map to the <collection>. Must match the name of an object in the stores array.
databases.[n].collections.[n].dataSources.[n].path
Controls how Atlas Data Federation searches for and parses files in the storeName before mapping them to the <collection>. The federated database instance prepends the stores.[n].prefix to the path to build the full path to search within. Specify / to capture all files and folders from the prefix path.
For example, consider an S3 bucket metrics with the following structure:
metrics
|--hardware
|--software
   |--computed
- A path of / directs Atlas Data Federation to search all files and folders in the metrics bucket.
- A path of /hardware directs Atlas Data Federation to search only that path for files to ingest.
- If the prefix is software and the path is /computed, Atlas Data Federation searches for files only in the path /software/computed.
Appending the * wildcard character to the path directs Atlas Data Federation to include all files and folders from that point in the path. For example, /software/computed* would match files like /software/computed-detailed, /software/computedArchive, and /software/computed/errors.
path supports additional syntax for parsing filenames, including:
- Generating document fields from filenames.
- Using regular expressions to control field generation.
- Setting boundaries for bucketing filenames by timestamp.
See Define Path for S3 Data for more information.
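The trailing * wildcard can be tried out locally with Python's fnmatch module, whose * also matches across path separators. This is only a stand-in for the matching that Atlas Data Federation performs. The two non-matching keys are hypothetical, and no claim is made that other fnmatch metacharacters behave the same way in a path.

```python
from fnmatch import fnmatchcase

pattern = "/software/computed*"

keys = [
    "/software/computed-detailed",
    "/software/computedArchive",
    "/software/computed/errors",
    "/software/raw",        # hypothetical key that should not match
    "/hardware/computed",   # hypothetical key that should not match
]

# fnmatchcase compares case-sensitively; "*" matches any run of characters, "/" included.
matches = [key for key in keys if fnmatchcase(key, pattern)]
print(matches)
```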
When specifying the path:
- Specify the data type for the partition attribute.
- Ensure that the partition attribute type matches the data type to parse.
- Use the delimiter specified in delimiter.
When specifying attributes of the same type, do any of the following:
- Add a constant separator between the attributes.
- Use regular expressions to describe the search pattern. To learn more, see Unsupported Parsing Functions.
databases.[n].collections.[n].dataSources.[n].defaultFormat
Optional. Default format that Data Federation assumes if it encounters a file without an extension while searching the databases.[n].collections.[n].dataSources.[n].storeName.
The following values are valid for the defaultFormat field:
.json, .json.gz, .bson, .bson.gz, .avro, .avro.gz, .orc, .tsv, .tsv.gz, .csv, .csv.gz, .parquet
Note
If your file format is CSV or TSV, you must include a header row in your data. See CSV and TSV for more information.
If omitted, Data Federation attempts to detect the file type by processing a few bytes of the file.
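The fallback order described for defaultFormat can be sketched as follows: use the file's own extension when it has a supported one, otherwise use defaultFormat, otherwise leave the format undecided so that content-based detection can take over. The helper resolve_format is hypothetical, and returning None to stand for "detect from the file's bytes" is a modelling choice of this sketch.

```python
SUPPORTED_FORMATS = (
    ".json", ".json.gz", ".bson", ".bson.gz", ".avro", ".avro.gz",
    ".orc", ".tsv", ".tsv.gz", ".csv", ".csv.gz", ".parquet",
)


def resolve_format(key: str, default_format=None):
    """Pick a format from the extension, then defaultFormat, else None (needs detection)."""
    filename = key.rsplit("/", 1)[-1]
    for extension in SUPPORTED_FORMATS:
        if filename.endswith(extension):
            return extension
    if default_format is not None and default_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported defaultFormat: {default_format}")
    return default_format


print(resolve_format("/metrics/hardware/archive.csv.gz"))
print(resolve_format("/metrics/hardware/1564671291998", ".json"))
print(resolve_format("/metrics/hardware/1564671291998"))
```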
databases.[n].collections.[n].dataSources.[n].provenanceFieldName
Name for the field that includes the provenance of the documents in the results. If you specify this setting in the storage configuration, Atlas Data Federation returns the following fields for each document in the result:
Field Name      Description
provider        Provider (stores.[n].provider) in the federated database instance storage configuration
region          AWS region (stores.[n].region)
bucket          Name of the AWS S3 bucket (stores.[n].bucket)
key             Path (databases.[n].collections.[n].dataSources.[n].path) to the document
lastModified    Date and time the document was last modified.
You can't configure this setting using the Visual Editor in the Atlas UI.
databases.[n].collections.[n].dataSources.[n].omitAttributes
Optional. Flag that specifies whether to omit the attributes (key and value pairs) that Atlas Data Federation adds to documents in the collection. You can specify one of the following values:
- false - to add the attributes
- true - to omit the attributes
If omitted, defaults to false and Atlas Data Federation adds the attributes.
Example
Consider a file named /employees/949-555-0195.json for which you configure the path /employees/{phone string}. Atlas Data Federation adds the attribute phone: 949-555-0195 to documents in this file if omitAttributes is false, regardless of whether the key-value pair already exists in the document. If you set omitAttributes to true, Atlas Data Federation doesn't add the attribute to the document in the virtual collection.
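The employees example can be reproduced with a small Python sketch. It supports only {name string} placeholders and treats everything after the captured value as file extensions, which is an assumption of this sketch. The helpers parse_attributes and apply_attributes and the sample document are hypothetical and do not reflect how Atlas Data Federation implements path parsing.

```python
import re


def parse_attributes(template: str, key: str) -> dict:
    """Extract {name string} placeholders from a file key, ignoring trailing extensions."""
    parts = re.split(r"\{(\w+) string\}", template)  # literal, name, literal, ...
    regex = "".join(
        re.escape(part) if index % 2 == 0 else f"(?P<{part}>[^/]+?)"
        for index, part in enumerate(parts)
    )
    match = re.fullmatch(regex + r"(?:\.[A-Za-z0-9]+)*", key)
    return match.groupdict() if match else {}


def apply_attributes(document: dict, attributes: dict, omit_attributes: bool = False) -> dict:
    """Add the parsed attributes unless omitAttributes is true."""
    return dict(document) if omit_attributes else {**document, **attributes}


attrs = parse_attributes("/employees/{phone string}", "/employees/949-555-0195.json")
doc = {"name": "Example Employee"}  # hypothetical document stored in the file

print(apply_attributes(doc, attrs))
print(apply_attributes(doc, attrs, omit_attributes=True))
```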
databases.[n].maxWildcardCollections
Optional. Maximum number of wildcard * collections in the database. Each wildcard collection can have only one data source. Value can be between 1 and 1000, inclusive. If omitted, defaults to 100.
databases.[n].views
Array of objects where each object represents an aggregation pipeline on a collection. To learn more about views, see Views.
databases.[n].views.[n].source
Name of the source collection for the view. If you want to create a view with a $sql stage, you must omit this field as the SQL statement will specify the source collection.
databases.[n].views.[n].pipeline
Aggregation pipeline stage(s) to apply to the source collection. You can also create views using the $sql stage.
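Because the configuration format lists pipeline as a string, one way to produce a view entry is to write the stages as ordinary Python data and serialize them. The sketch below assumes the string holds the JSON-encoded array of stages. The view name hot-hardware and the cpu field are hypothetical.

```python
import json

# Hypothetical pipeline: keep documents with high CPU usage, newest first.
stages = [
    {"$match": {"cpu": {"$gt": 0.9}}},
    {"$sort": {"date": -1}},
]

view = {
    "name": "hot-hardware",           # hypothetical view name
    "source": "hardware",             # collection from the example configuration
    "pipeline": json.dumps(stages),   # assumption: stages serialized as a JSON string
}

print(json.dumps(view, indent=2))
```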