Philipsamppython package
Philipsamppython.sdk module
Philips AMP (python)
-
class amp
Bases: future.types.newobject.newobject
Philipsamppython class
-
amp_dummies(dataframe, column, drop_column=False, drop_first_dummy=False)
Create dummy variables for categorical columns.
Parameters: - dataframe (pandas.DataFrame/pyspark.sql.DataFrame) – specifies pandas or pyspark data frame.
- column (string) – specifies categorical column of the dataframe which has multiple levels.
- drop_column (bool, optional) – Whether the original categorical column should be dropped after the dummies are created.
- drop_first_dummy (bool, optional) – Whether the first (reference) dummy level should be dropped.
Returns: returns a dataframe of the same type as the input.
Example
>>> import Philipsamppython
>>> import os
>>> # create amp object
>>> ex = Philipsamppython.amp()
>>> # load data
>>> data_path_str = "BreastCancer"
>>> py_spark_df = ex.amp_read_csv(file=data_path_str, header="true")
>>> pyspark_df_with_dummies = ex.amp_dummies(py_spark_df, column="Species")
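The optional flags documented above can be passed in the same call; a minimal sketch reusing py_spark_df and the column name from the example:
>>> # pass both optional flags described in the parameter list above
>>> dummies_df = ex.amp_dummies(py_spark_df, column="Species", drop_column=True, drop_first_dummy=True)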
-
amp_handle_na(dataframe, percent=70)
Replace NAs using mean imputation.
Parameters: - dataframe (pyspark.sql.DataFrame) – Specifies spark dataframe.
- percent (int) – Specifies percentage of NAs in an integer column of the dataframe, above which the column will be dropped and below which NAs will be centrally imputed. The percentage must be between 0 and 100.
Returns: returns spark dataframe devoid of NAs.
Example
>>> import Philipsamppython
>>> import os
>>> # create amp object
>>> ex = Philipsamppython.amp()
>>> # load data
>>> data_path_str = "airquality"
>>> py_spark_df = ex.amp_read_csv(file=data_path_str, header="true")
>>> py_spark_df_with_out_na = ex.amp_handle_na(py_spark_df, 10)
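If no percent value is passed, the default of 70 from the signature applies; a one-line sketch reusing py_spark_df from the example above:
>>> # columns above the default 70% NA threshold are dropped, the rest are imputed
>>> py_spark_df_default = ex.amp_handle_na(py_spark_df)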
-
amp_interrupt_handler(worker_function, handler_function)
API to execute functions that can handle interruptions and exit gracefully, for example by closing database connections. This API is intended to be used along with AMPExecutePlugin to terminate long-running scripts after the configured timerValue.
Parameters: - worker_function (function) – Primary function, which is executed until it encounters an interrupt signal.
- handler_function (function) – Interrupt handler, which will be called when an interrupt signal is sent.
Returns: returns none.
Example
>>> import Philipsamppython
>>> import time
>>> import sys
>>> import pandas as pd
>>> ex = Philipsamppython.amp()
>>> def actual_function(params):
...     print("You're awesome")
...     print(params)
...     time.sleep(80)
...     print("Completed main function")
...     Val1 = "TRUE"
...     Message = "Success in execution of main program"
...     Message1 = "It was long wait of 80 sec"
...     df = pd.DataFrame([Val1, Message, Message1])
...     ex.amp_set_result(df)
>>> def interrupt_function(signum, frame):
...     print("Got SIGINT interruption")
...     print("Now sleeping for 80 secs to check whether script quits with default sleep time in this function!!")
...     # time.sleep(80)
...     print("Slept for 80 secs did not quit")
...     Val1 = "FALSE"
...     Message = "Missed flight too much of waiting :("
...     Message1 = "Bailing out"
...     df1 = pd.DataFrame([Val1, Message, Message1])
...     ex.amp_set_result(df1)
...     sys.exit(0)
>>> def perform(fun, *args):
...     fun(*args)
>>> from functools import partial
>>> ex.amp_interrupt_handler(partial(actual_function, "abcd"), interrupt_function)
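A more compact sketch of the same worker/handler pattern, using only the documented two-argument API; the function names and bodies here are illustrative placeholders:
>>> import sys
>>> import Philipsamppython
>>> ex = Philipsamppython.amp()
>>> def worker():
...     pass  # long-running work goes here
>>> def on_interrupt(signum, frame):
...     # release resources (e.g. close database connections) and exit gracefully
...     sys.exit(0)
>>> ex.amp_interrupt_handler(worker, on_interrupt)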
-
amp_normalize(dataframe, method='zscore')
Normalize a dataframe column-wise.
Parameters: - dataframe (pandas.DataFrame/pyspark.sql.DataFrame) – Specifies spark dataframe or pandas dataframe.
- method (string) – Specifies method of normalization to be applied. Takes either "zscore" or "range".
Returns: returns normalized dataframe.
Example
>>> import Philipsamppython
>>> import os
>>> # create amp object
>>> ex = Philipsamppython.amp()
>>> # load data
>>> data_path_str = "BreastCancer"
>>> py_spark_df = ex.amp_read_csv(file=data_path_str, header="true")
>>> ex.amp_normalize(dataframe=py_spark_df, method="zscore")
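The other documented method can be selected the same way; a one-line sketch reusing py_spark_df from the example:
>>> normalized_df = ex.amp_normalize(dataframe=py_spark_df, method="range")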
-
amp_read_csv(file, header='true')
Reads CSV files using Spark.
Parameters: - file (string) – File to load as a character string.
- header (logical) – Indicates whether the first line of the file contains the variable names. By default it is set to TRUE. If the input file has no header and header is not set to FALSE, an error will be thrown.
Returns: pandas.DataFrame or pyspark.sql.DataFrame.
Example
>>> import Philipsamppython
>>> import os
>>> # create amp object
>>> ex = Philipsamppython.amp()
>>> # load data
>>> data_path_str = "BreastCancer"
>>> py_spark_df = ex.amp_read_csv(file=data_path_str, header="true")
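For a file whose first line is data rather than column names, the header flag described above must be set explicitly; the file name below is a placeholder:
>>> headerless_df = ex.amp_read_csv(file="HeaderlessData", header="false")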
-
amp_read_jdbc_data(url, query, driver, username, password, output_dataframe_type='pandas', implementationType='python')
Queries the database using JDBC and returns a pandas or pyspark dataframe.
Parameters: - url (string) – Specifies jdbc url to connect to.
- query (string) – SQL statement for querying.
- driver (string) – Class name of the jdbc driver to use to connect to this url.
- username (string) – Specifies username if any.
- password (string) – Specifies password if username provided.
- output_dataframe_type (string) – Field specifying the output data frame type, pyspark or pandas.
Returns: returns pyspark or pandas dataframe.
Example
>>> import Philipsamppython
>>> # create amp object
>>> ex = Philipsamppython.amp()
>>> # output datatype: spark dataframe
>>> pyspark_df = ex.amp_read_jdbc_data('jdbc:postgresql://host:<port>/dbname', 'select * from dsp_db_schema.dsp_tbl_1', 'org.postgresql.Driver', 'usrname', 'psswd')
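The output_dataframe_type field documented above can also be passed explicitly; a sketch reusing the same placeholder connection details and requesting a pandas dataframe (the parameter's default):
>>> pandas_df = ex.amp_read_jdbc_data('jdbc:postgresql://host:<port>/dbname', 'select * from dsp_db_schema.dsp_tbl_1', 'org.postgresql.Driver', 'usrname', 'psswd', output_dataframe_type='pandas')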
-
amp_read_jdbcprofile_data(query, profile, output_dataframe_type='pandas')
Queries the database through a stored JDBC connection profile and fetches the result as a spark or pandas dataframe.
Parameters: - query (string) – SQL statement for querying.
- profile (string) – Name of the JDBC connection profile to use.
- output_dataframe_type (string) – Field specifying the output data frame type, pyspark or pandas.
Returns: spark or pandas dataframe, based on the input passed.
Example
>>> import Philipsamppython
>>> ex = Philipsamppython.amp()
>>> ex.amp_read_jdbcprofile_data('SELECT TOP 3 * FROM alert', 'sqlserver')
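As with amp_read_jdbc_data, the output type can be requested explicitly; a one-line sketch reusing the placeholder query and profile from the example:
>>> alerts_df = ex.amp_read_jdbcprofile_data('SELECT TOP 3 * FROM alert', 'sqlserver', output_dataframe_type='pandas')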
-
amp_read_s3_data(file, profile)
Reads a CSV from S3 and creates a spark dataframe.
Parameters: - file (string) – S3 path of the CSV file to read.
- profile (string) – Name of the S3 credentials profile to use.
Note
In order to use this functionality, the user needs to call ex.sparkContext.stop() to close any existing SparkContext that might have been created with different credentials.
Returns: spark dataframe.
Example
>>> import Philipsamppython
>>> ex = Philipsamppython.amp()
>>> ex.amp_read_s3_data('data/BreastCancer/', 'S3west')
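Per the Note above, any existing SparkContext created with different credentials should be stopped before reading; a minimal sketch reusing the placeholder path and profile from the example:
>>> import Philipsamppython
>>> ex = Philipsamppython.amp()
>>> ex.sparkContext.stop()  # close any SparkContext created earlier with different credentials
>>> s3_df = ex.amp_read_s3_data('data/BreastCancer/', 'S3west')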
-
amp_set_result(dataframe)
Caches intermediate data sets generated.
Parameters: dataframe (pandas.DataFrame) – specifies dataframe to be cached.
Returns: returns none.
Example
>>> import Philipsamppython
>>> import pandas as pd
>>> # create amp object
>>> ex = Philipsamppython.amp()
>>> # cache an intermediate result (any pandas dataframe)
>>> df = pd.DataFrame({"status": ["success"]})
>>> ex.amp_set_result(df)
-
amp_write_csv(dataframe, file, import_column_names='true')
Save the contents of the dataframe to a CSV file.
Parameters: - dataframe (pyspark.sql.DataFrame) – Specifies spark dataframe.
- file (string) – File to which the dataframe has to be saved.
- import_column_names (logical) – If TRUE, the column names of the dataframe are written to the CSV file; if FALSE, they are omitted.
Returns: returns none.
Example
>>> import Philipsamppython
>>> import os
>>> # create amp object
>>> ex = Philipsamppython.amp()
>>> # load data
>>> data_path_str = "BreastCancer"
>>> py_spark_df = ex.amp_read_csv(file=data_path_str, header="true")
>>> ex.amp_write_csv(py_spark_df, file="breast_cancer_dataset", import_column_names="true")
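To write the file without a header row, the import_column_names flag documented above can be set to "false"; a one-line sketch reusing py_spark_df from the example:
>>> ex.amp_write_csv(py_spark_df, file="breast_cancer_dataset_no_header", import_column_names="false")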
-