    Connecting to Google Cloud Storage

    Overview

    You can run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage with the help of the Google Cloud Storage connector, an open-source Java library. Google Cloud supports the connector when it is used with Google Cloud products and use cases.

    The Google Cloud Storage connector offers multiple benefits, such as direct data access, HDFS compatibility, interoperability, data accessibility, high data availability, no storage management overhead, and quick startup.
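
    The connector exposes a bucket through the gs:// scheme, so existing Hadoop and Spark jobs can read Cloud Storage objects as if they were HDFS paths. The following minimal PySpark sketch (not taken from SkyPoint) illustrates the idea; it assumes the gcs-connector jar is already on the Spark classpath, the configuration key names are the ones commonly used by the connector and may vary by version, and the key-file path and bucket name are placeholders.

```python
# Minimal sketch: read a CSV object from Cloud Storage through the GCS connector.
# Assumes the gcs-connector jar is on the Spark classpath; paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gcs-connector-example")
    # Authenticate with a service account JSON key file
    # (key names may vary by connector version).
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/service-account.json")
    .getOrCreate()
)

# Address the object directly with the gs:// scheme.
df = spark.read.csv("gs://my-example-bucket/customers.csv", header=True)
df.show(5)
```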

    Prerequisites

    You will need the following details to configure and import data using Google Cloud Storage:

    • Access key ID and Secret access key
    • Project ID
    • Service URL
    • Service account JSON file
    • Google Cloud bucket
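
    Before configuring the connector, you may want to confirm that the service account JSON file and the bucket work together. The sketch below (not part of SkyPoint) uses the google-cloud-storage Python client to authenticate with the key file and list a few objects in the bucket; the project ID, key-file path, and bucket name are placeholders.

```python
# Sketch: verify the service account JSON key and bucket access.
# All identifiers below are placeholders for your own values.
from google.cloud import storage  # pip install google-cloud-storage

PROJECT_ID = "my-project-id"
KEY_FILE = "/path/to/service-account.json"
BUCKET_NAME = "my-import-bucket"

# Authenticate with the same JSON key file you will upload in the connector.
client = storage.Client.from_service_account_json(KEY_FILE, project=PROJECT_ID)

# Listing a few objects confirms the service account can read the bucket.
for blob in client.list_blobs(BUCKET_NAME, max_results=5):
    print(blob.name, blob.size)
```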

    Permissions required for Service account

    You can control access to the resources in your Google Cloud project with IAM (Identity and Access Management). These resources include Cloud Storage buckets and the objects stored within them, along with other Google Cloud entities such as Compute Engine instances.

    Permissions allow principals to perform specific actions on buckets or objects in Cloud Storage. There are two options to prepare the Google Cloud permissions and storage:

    • Option 1: Using the Transfer Appliance Cloud Setup Application
    • Option 2: Configuring Google Cloud permissions and Cloud Storage step-by-step.

    ❕ Note: You need to create the JSON file and permissions by logging in to the Google Cloud Console. Refer to the Setting up Google Cloud Storage document to learn more.
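
    If you prefer to manage bucket-level permissions programmatically rather than through the Cloud Console, the following sketch shows one way to grant a service account read access to the import bucket. It is an illustration only; the role, bucket name, and service account address are placeholders, and your environment may require different roles.

```python
# Sketch: grant a service account read access to the import bucket via IAM.
# Bucket name, service account, and role are placeholders; adjust to your setup.
from google.cloud import storage

BUCKET_NAME = "my-import-bucket"
SERVICE_ACCOUNT = "data-import@my-project-id.iam.gserviceaccount.com"

client = storage.Client()  # uses your default credentials
bucket = client.bucket(BUCKET_NAME)

# Fetch the current IAM policy and append an objectViewer binding.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectViewer",
    "members": {f"serviceAccount:{SERVICE_ACCOUNT}"},
})
bucket.set_iam_policy(policy)
print("Granted roles/storage.objectViewer on", BUCKET_NAME)
```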


    Import data using Google Cloud Storage connector

    Follow these steps to create a new dataflow for the Google Cloud Storage import connector:

    1. Go to Dataflow > Imports.
    2. Click New dataflow.

    The Set dataflow name page appears.


    3. In the Set dataflow name page, enter a name for the dataflow in the Name text area.
    4. Click Next.

    The Choose connector page appears.


    To add the Google Cloud Storage connector

    1. In the Choose connector page, select the Google Cloud Storage connector.

    ❕ Note: You can also use the Search feature to find the connector. The Google Cloud Storage connector is listed under the Cloud category.



    2. Enter a Display Name for your dataflow in the text area.
    3. Enter a Description for your dataflow in the text area.
    4. Click Next.

    The Connect to Google Cloud Storage page appears.


    To configure Google Cloud Storage

    Follow these steps to configure the connection to Google Cloud Storage:

    1. Enter your credentials, such as the Access key ID, Secret access key, Project ID, and Service URL, to connect to Google Cloud Storage (see the verification sketch after these steps).
    2. Click Choose file to upload the service account JSON file.
    3. Click the Folder icon in the Google cloud bucket text area and select your bucket.
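
    For context on what the credential fields expect: the Access key ID and Secret access key for Cloud Storage are assumed here to be an HMAC key pair, and the Service URL is assumed to be the interoperability endpoint https://storage.googleapis.com. The sketch below (not part of SkyPoint) checks that such a key pair can list the bucket through the S3-compatible API; all values are placeholders.

```python
# Sketch: verify a Cloud Storage HMAC key pair against the interoperability endpoint.
# Assumes the Service URL is https://storage.googleapis.com; values are placeholders.
import boto3  # pip install boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id="GOOG1EXAMPLEACCESSKEYID",        # placeholder Access key ID
    aws_secret_access_key="example-secret-access-key",  # placeholder Secret access key
)

# List a handful of objects to confirm the credentials and bucket are valid.
resp = s3.list_objects_v2(Bucket="my-import-bucket", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```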

    Once you select the cloud bucket, the Table Details columns appear.


    4. Enter the Table Details to process the data.

    • Purpose: Option to assign a purpose (Data or Metadata) for each table. Data loads customer data; Metadata loads metadata.
    • File Name: Displays the name of the file that you imported.
    • Table Name: Displays the imported table name.
    • Datetime format: Displays the available datetime formats. SkyPoint’s Modern Data Stack Platform is set to detect them automatically.
    • Delimiter: Displays the available separators for the variables in the imported data.
    • First Row as Header: Select the checkbox so the system reads the data according to the header row contents.
    • Advanced Settings: Options to fine-tune the import process.

    5. Click Advanced settings for the desired file name.

    The Advanced settings pop-up appears.


    • Compression type: The method used to compress the source files in Google Cloud Storage.
    • Row delimiter: The character that marks the boundary of each row in the data stream. If the source uses a different row delimiter, update this value for more accurate data ingestion.
    • Encoding: The character encoding used to read the incoming data stream. The default encoding is UTF-8.
    • Escape character: The character used to escape special characters in the data. Select it from the drop-down list.
    • Quote character: The character used to quote field values. Select it from the drop-down list.

    6. Click Save on the Advanced settings pop-up to save the advanced settings.
    7. Click Save.
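
    To make these options concrete, the sketch below is a rough analogue (not SkyPoint’s internal logic) of how the Table Details and Advanced settings map onto parsing a delimited file in Python; the file name and option values are placeholders.

```python
# Sketch: how delimiter, header, encoding, quote/escape characters, and
# compression settings map onto a typical CSV read. Values are placeholders.
import pandas as pd

df = pd.read_csv(
    "customers.csv.gz",
    sep=",",              # Delimiter
    header=0,             # First Row as Header selected
    encoding="utf-8",     # Encoding (UTF-8 by default)
    quotechar='"',        # Quote character
    escapechar="\\",      # Escape character
    compression="gzip",   # Compression type
)
print(df.dtypes)
```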

    Run, edit, and delete the imported data

    Once you save the connector, the Google Cloud Storage dataflow is displayed in the list of dataflows on the Dataflow page.


    • Name: Displays the name of the imported dataflow.
    • Type: Displays the connector type symbol.
    • Status: Indicates whether the data was imported successfully.
    • Tables Count: Displays the number of tables.
    • Created Date: Displays the date of creation.
    • Last refresh type: Displays whether the last data refresh was Full or Incremental.
    • Updated Date: Displays the last modified date.
    • Last Refresh: Displays the latest refresh date. This date is updated whenever you refresh the data.
    • Group by: Option to view the items grouped by a specific attribute (for example, name, type, or status).
    Select the horizontal ellipsis in the Actions column and do one of the following:

    • Modify the dataflow: Select Edit and modify the dataflow. Click Save to apply your changes.
    • Execute the dataflow: Select Run.
    • Bring the data back to its previous state: Select Rollback.
    • Delete the dataflow: Select Remove and then click the Delete button. All tables in the data source get deleted.
    • See the run history of the dataflow: Select Run history.

    Next step

    After completing the data import, start the Master Data Management (MDM) - Stitch process to develop a unified view of your customers.
