Skip to main content

Featured

Spark Transformations, Actions and Lazy Evaluation.

Apache Spark RDD supports two types of Operations: Transformations Actions A Transformation is a function that produces new RDD from the existing RDDs but when we want to work with the actual dataset, at that point Action is performed.

Pyspark functions - to_date() vs date_format()


PySpark to_date() – Convert String to Date Format

to_date() function is used to format string (StringType) to date (DateType) column.

Syntax: to_date(column,format)
Example: to_date(col("string_column"),"MM/dd/yyyy") 

Lets create a sample dataframe to work on
from pyspark.sql.functions import to_date,date_format,expr,col

columns = ["Date1","Date2"]
data = [("20220131", "2022-01-31"), ("20220228", "2022-02-28"),
        ("20220331", "2022-03-31"), ("20220430", "2022-04-30") ,
        ("20220531", ""), ("", "2022-06-30")]
df = spark.createDataFrame(data).toDF(*columns)
df.show()



to_date() function takes the first argument as a date string and the second argument as format. 
If the format is not provide, to_date() takes the default value as 'yyyy-MM-dd'.

If the input column values does not match with the format specified (second argument) then to_date() populates the new column with null.

df2 = df.withColumn("TO_DATE_1",to_date(col("Date1"))) \
        .withColumn("TO_DATE_2",to_date(col("Date2")))
df2.show()


to_date() accepts the first argument in any date format

df2=df.withColumn('TO_DATE_1',to_date(col('Date1'),'yyyyMMdd')) \
    .withColumn('TO_DATE_2',to_date(col('Date2'),'yyyy-MM-dd'))
df2.show()





to_date() will only convert the input column to date and will not change the date format



PySpark date_format() – Convert Date to String format

date_format() function is used to format date (DateType / StringType) to (StringType) in the specified format

Syntax: date_format(column,format)
Example: date_format(current_timestamp(),"yyyy MM dd").alias("date_format")


It takes the first argument as a date/string and the second argument as format.
If the format is not provide, date_format() throws a TypeError.

df3=df.withColumn("DATE_FORMAT_1",date_format(col("Date1"))) \
    .withColumn("DATE_FORMAT_2",date_format(col("Date2")))
df3.show()

date_format() requires the first argument to be in 'yyyy-MM-dd' format, else it populates the new column with null.

df3=df.withColumn("DATE_FORMAT_1",date_format(col("Date1"),'yyyyMMdd')) \
    .withColumn("DATE_FORMAT_2",date_format(col("Date2"),'yyyy-MM-dd'))
df3.show()

If the input column values does not match with the format specified (second argument) then date_format() converts it in the specified format

df3 = df.select('Date2') \
    .withColumn("DATE_FORMAT_yyyy-MM-dd",date_format(col("Date2"),'yyyy-MM-dd')) \
    .withColumn("DATE_FORMAT_yyyyMMdd",date_format(col("Date2"),'yyyyMMdd'))\
    .withColumn("DATE_FORMAT_MM/yyyy/dd",date_format(col("Date2"),'MM/yyyy/dd'))\
    .withColumn("DATE_FORMAT_MM/yyyy",date_format(col("Date2"),'MM/yyyy'))\
    .withColumn("DATE_FORMAT_2yyyyMM",date_format(col("Date2"),'yyyyMM'))
df3.show()




Conclusion

To date
  1. to_date() function is used to format string (StringType) to date (DateType) column.
  2. If the format is not provide, to_date() takes the default value as 'yyyy-MM-dd'
  3. to_date() accepts the first argument in any date format
  4. If the input column values does not match with the format specified (second argument) then to_date() populates the new column with null.
Date format
  1. date_format() function is used to format date (DateType / StringType) to (StringType) in the specified format
  2. If the format is not provide, date_format() throws a TypeError.
  3. date_format() requires the first argument to be in 'yyyy-MM-dd' format, else it populates the new column with null.
  4. If the input column values does not match with the format specified (second argument) then date_format() converts it in the specified format

Comments