# pyspark

## Getting Started
```python
from pyspark.sql import SparkSession

def init_spark():
    spark = SparkSession.builder.appName("HelloWorld").getOrCreate()
    # spark = SparkSession.builder.master("spark://192.168.2.99:7077").getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def main():
    spark, sc = init_spark()
    nums = sc.parallelize([1, 2, 3, 4])
    print(nums.map(lambda x: x * x).collect())

if __name__ == '__main__':
    main()
```
## Submit

- python - Pyspark - Error related to SparkContext - no attribute _jsc - Stack Overflow
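A script like the one in Getting Started can be handed to Spark with `spark-submit`; a minimal sketch of the invocation, assuming the script is saved as `hello_world.py` (the file name and master URLs here are assumptions, the standalone URL is the one commented out in the script above):

```shell
# Run locally with 2 cores
spark-submit --master "local[2]" hello_world.py

# Or target the standalone master from the commented-out line above
# spark-submit --master spark://192.168.2.99:7077 hello_world.py
```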
## TODO
- PySpark Union and UnionAll Explained — SparkByExamples
- udacity/nd027-c3-data-lakes-with-spark
- AlexIoannides/pyspark-example-project: Example project implementing best practices for PySpark ETL jobs and applications.
- https://sparkbyexamples.com/pyspark-tutorial/
- Running PySpark as a Spark standalone job — Anaconda documentation
- First Steps With PySpark and Big Data Processing – Real Python
- Docker/Jupyter PySpark - charlesreid1
- aljavier/spark-docker-setup: Small setup of development environment for Apache Spark with docker
- python - How divide or multiply every non-string columns of a PySpark dataframe with a float constant? - Stack Overflow
- apache spark - Pyspark dataframe write to single json file with specific name - Stack Overflow
- python - How do I select rows from a DataFrame based on column values? - Stack Overflow
- python - Concatenate two PySpark dataframes - Stack Overflow
## Cool Sources
- 7 Bilingual PySpark: blending Python and SQL code - Data Analysis with Python and PySpark MEAP V14
- INSERT INTO - Spark 3.2.1 Documentation
wiki.software.list.json
- Working with JSON in Apache Spark | by Neeraj Bhadani | Expedia Group Technology | Medium
- "column_name" function - Stack Overflow
- apache spark - Pyspark dataframe write to single json file with specific name - Stack Overflow
## Cast as new column

```python
changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))
```
- python - How to change a dataframe column from String type to Double type in PySpark? - Stack Overflow
- [[dentropydaemon-wiki/Software/List/Spark|Spark]] - Stack Overflow
- python - Mapping columns from one dataframe to another to create a new column - Stack Overflow
## Lists

## JOIN

## Notes
- [[dentropydaemon-wiki/Software/List/pyspark|pyspark]] - Stack Overflow
- Select Column
```python
df[["column_name"]]
```
- Do basic math in the middle of a query
- Check a dataframe's schema
```python
df.schema
df.printSchema()
```
- scala - How to check the schema of DataFrame? - Stack Overflow
- Append Dataframe / Add a Row
- scala - How to add a Spark Dataframe to the bottom of another dataframe? - Stack Overflow
- Add Row to Dataframe in Pandas – thisPointer
- Copy columns, possibly into another table
  - [How to copy columns to a new Pandas DataFrame in Python](https://www.adamsmith.haus/python/answers/how-to-copy-columns-to-a-new-pandas-dataframe-in-python)
- Set column name in select query with `alias`
```python
import pyspark.sql.functions as F

unioned_df.select(F.to_timestamp(F.from_unixtime(F.col("ts") / 1000)).alias('time_stamp')).show()
```
- Cast a dataframe column in pyspark
  - [python - How to change a dataframe column from String type to Double type in PySpark? - Stack Overflow](https://stackoverflow.com/questions/32284620/how-to-change-a-dataframe-column-from-string-type-to-double-type-in-pyspark)
- List data already in spark
```python
spark.sql('show databases').show()
spark.sql("show tables in default").show()
spark.catalog.listTables('default')
```
  - [List Tables & Databases in Apache Spark | by Swaroop | Medium](https://medium.com/@durgaswaroop/list-tables-and-databases-in-spark-2d03594d2883)
- Save dataframe to a spark sql table
```python
unioned_df.write.saveAsTable('default.log_data')
```
- Access data already in spark
```python
spark.sql("SELECT * FROM log_data").toPandas()
```
* Timestamp stuff
* scala - How to convert unix timestamp to date in Spark - Stack Overflow
* Case
* pyspark - Apache spark dealing with case statements - Stack Overflow
## Dataframe Examples
- Example Data Frames
- PySpark - Create an Empty DataFrame & RDD — SparkByExamples
- PySpark - Create DataFrame from List - GeeksforGeeks
- Put a list of JSON files or CSVs into a single data frame
- python - Merging multiple data frames row-wise in PySpark - Data Science Stack Exchange
```python
import glob
import os
import functools

log_files = glob.glob(os.getcwd() + '/log-data/*')

list_of_dfs = []
for log_file in log_files:
    list_of_dfs.append(spark.read.json(log_file))

def unionAll(dfs):
    return functools.reduce(lambda df1, df2: df1.union(df2.select(df1.columns)), dfs)

unioned_df = unionAll(list_of_dfs)
```
## S3

## Documentation

### Work
- spark-examples/pyspark-examples: Pyspark RDD, DataFrame and Dataset Examples in Python language
- From sparkbyexamples.com
- mahmoudparsian/pyspark-tutorial: PySpark-Tutorial provides basic algorithms using PySpark
- Shows examples for the CLI; no jobs or Jupyter notebooks
### Does not work
- [[dentropydaemon-wiki/Software/List/pyspark|pySpark]] tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
- Seems to use Python2
- 6 Years Old
- Includes command to start spark
- Lots of jupyter notebooks