PySpark to_date() – Convert String to Date Format
to_date() function is used to format string (StringType) to date (DateType) column.
Syntax: to_date(column,format)
Example: to_date(col("string_column"),"MM/dd/yyyy")
Lets create a sample dataframe to work on
from pyspark.sql.functions import to_date,date_format,expr,col
columns = ["Date1","Date2"]
data = [("20220131", "2022-01-31"), ("20220228", "2022-02-28"),
("20220331", "2022-03-31"), ("20220430", "2022-04-30") ,
("20220531", ""), ("", "2022-06-30")]
df = spark.createDataFrame(data).toDF(*columns)
df.show()
to_date() function takes the first argument as a date string and the second argument as format.
If the format is not provide, to_date() takes the default value as 'yyyy-MM-dd'.
If the input column values does not match with the format specified (second argument) then to_date() populates the new column with null.
df2 = df.withColumn("TO_DATE_1",to_date(col("Date1"))) \
.withColumn("TO_DATE_2",to_date(col("Date2")))
df2.show()
to_date() accepts the first argument in any date format
df2=df.withColumn('TO_DATE_1',to_date(col('Date1'),'yyyyMMdd')) \
.withColumn('TO_DATE_2',to_date(col('Date2'),'yyyy-MM-dd'))
df2.show()
to_date() will only convert the input column to date and will not change the date format
PySpark date_format() – Convert Date to String format
date_format() function is used to format date (DateType / StringType) to (StringType) in the specified format
Syntax: date_format(column,format)
Example: date_format(current_timestamp(),"yyyy MM dd").alias("date_format")
It takes the first argument as a date/string and the second argument as format.
If the format is not provide, date_format() throws a TypeError.
df3=df.withColumn("DATE_FORMAT_1",date_format(col("Date1"))) \
.withColumn("DATE_FORMAT_2",date_format(col("Date2")))
df3.show()
date_format() requires the first argument to be in 'yyyy-MM-dd' format, else it populates the new column with null.
df3=df.withColumn("DATE_FORMAT_1",date_format(col("Date1"),'yyyyMMdd')) \
.withColumn("DATE_FORMAT_2",date_format(col("Date2"),'yyyy-MM-dd'))
df3.show()
If the input column values does not match with the format specified (second argument) then date_format() converts it in the specified format
df3 = df.select('Date2') \
.withColumn("DATE_FORMAT_yyyy-MM-dd",date_format(col("Date2"),'yyyy-MM-dd')) \
.withColumn("DATE_FORMAT_yyyyMMdd",date_format(col("Date2"),'yyyyMMdd'))\
.withColumn("DATE_FORMAT_MM/yyyy/dd",date_format(col("Date2"),'MM/yyyy/dd'))\
.withColumn("DATE_FORMAT_MM/yyyy",date_format(col("Date2"),'MM/yyyy'))\
.withColumn("DATE_FORMAT_2yyyyMM",date_format(col("Date2"),'yyyyMM'))
df3.show()
Conclusion
To date
- to_date() function is used to format string (StringType) to date (DateType) column.
- If the format is not provide, to_date() takes the default value as 'yyyy-MM-dd'
- to_date() accepts the first argument in any date format
- If the input column values does not match with the format specified (second argument) then to_date() populates the new column with null.
Date format
- date_format() function is used to format date (DateType / StringType) to (StringType) in the specified format
- If the format is not provide, date_format() throws a TypeError.
- date_format() requires the first argument to be in 'yyyy-MM-dd' format, else it populates the new column with null.
- If the input column values does not match with the format specified (second argument) then date_format() converts it in the specified format
Comments
Post a Comment