[WIP] Phi3poc #2301
base: master
Conversation
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2301      +/-   ##
==========================================
- Coverage   84.55%   84.53%   -0.03%
==========================================
  Files         328      328
  Lines       16848    16848
  Branches     1513     1513
==========================================
- Hits        14246    14242       -4
- Misses       2602     2606       +4

☔ View full report in Codecov by Sentry.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
core/src/main/python/synapse/ml/llm/HuggingFaceCausallmTransform.py
self.config.update(kwargs)

def camel_to_snake(text):
There might already be one in a library to use.
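If nothing suitable turns up in an existing dependency, a minimal regex-based helper is enough; the sketch below is an illustration of the idea, not code from this PR:

```python
import re


def camel_to_snake(text):
    # Insert an underscore at each boundary between a lowercase letter or
    # digit and an uppercase letter, then lowercase the result,
    # e.g. "maxNewTokens" -> "max_new_tokens".
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", text).lower()
```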
"output column", | ||
typeConverter=TypeConverters.toString, | ||
) | ||
modelParam = Param(Params._dummy(), "modelParam", "Model Parameters") |
Maybe explain difference between model params and other params (you can just link to other docs if easier)
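For context, the notebook added in this PR uses setModelParam for generation-time keyword arguments, as opposed to Spark pipeline params such as inputCol and outputCol; a short doc note to that effect would cover it. The usage below is taken from the notebook, and the distinction drawn in the comments is my reading of the intent, to be confirmed by the author:

```python
from synapse.ml.llm.HuggingFaceCausallmTransform import HuggingFaceCausalLM

# modelParam holds keyword arguments applied at text-generation time
# (e.g. max_new_tokens), while inputCol/outputCol/modelName are ordinary
# Spark pipeline params describing DataFrame columns and the model id.
phi3_transformer = (
    HuggingFaceCausalLM()
    .setModelName("microsoft/Phi-3-mini-4k-instruct")
    .setInputCol("content")
    .setOutputCol("result")
    .setModelParam(max_new_tokens=1000)
)
```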
    typeConverter=TypeConverters.toString,
)
modelParam = Param(Params._dummy(), "modelParam", "Model Parameters")
modelConfig = Param(Params._dummy(), "modelConfig", "Model configuration")
Maybe explain difference between model config and other params (you can just link to other docs if easier)
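Likewise, the notebook passes model-loading options through setModelConfig; the snippet below uses the notebook's own call, with my interpretation of the split in the comments (an assumption, not a confirmed spec):

```python
from synapse.ml.llm.HuggingFaceCausallmTransform import HuggingFaceCausalLM

# modelConfig appears to carry options used when the HuggingFace model is
# loaded (e.g. trust_remote_code, local_files_only), whereas modelParam
# carries generation settings such as max_new_tokens.
phi3_transformer = (
    HuggingFaceCausalLM()
    .setModelName("microsoft/Phi-3-mini-4k-instruct")
    .setModelParam(max_new_tokens=1000)
    .setModelConfig(local_files_only=False, trust_remote_code=True)
)
```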
useFabricLakehouse = Param(
    Params._dummy(),
    "useFabricLakehouse",
    "Use FabricLakehouse",
If this is for a local cache, you might be able to make the verbiage generic, like useLocalCache.
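For reference, the notebook's local-cache flow already goes through setCachePath rather than anything Fabric-specific, which supports the more generic naming; the path below is the one used in the notebook, and the comment is my framing of the behavior:

```python
from synapse.ml.llm.HuggingFaceCausallmTransform import HuggingFaceCausalLM

# Load the model from a pre-populated local path (a Fabric Lakehouse mount
# in the notebook) instead of downloading it from the HuggingFace hub.
phi3_transformer = (
    HuggingFaceCausalLM()
    .setCachePath("/lakehouse/default/Files/microsoft/Phi-3-mini-4k-instruct")
    .setInputCol("content")
    .setOutputCol("result")
    .setModelParam(max_new_tokens=1000)
)
```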
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@@ -0,0 +1 @@
# Apply Phi3 model with HuggingFace Causal ML

![HuggingFace Logo](https://huggingface.co/front/assets/huggingface_logo-noborder.svg)

**HuggingFace** is a popular open-source platform that develops computation tools for building applications using machine learning. It is widely known for its Transformers library, which contains open-source implementations of transformer models for text, image, and audio tasks.

[**Phi 3**](https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/) is a family of AI models developed by Microsoft, designed to redefine what is possible with small language models (SLMs). Phi-3 models are the most capable and cost-effective SLMs, [outperforming models of the same size and even larger ones in language](https://news.microsoft.com/source/features/ai/the-phi-3-small-language-models-with-big-potential/?msockid=26355e446adb6dfa06484f956b686c27), reasoning, coding, and math benchmarks.

<img src="https://pub-66c8c8c5ae474e9a9161c92b21de2f08.r2.dev/2024/04/The-Phi-3-small-language-models-with-big-potential-1.jpg" alt="Phi 3 model performance" width="600">

To make it easier to scale up causal language model prediction on a large dataset, we have integrated [HuggingFace Causal LM](https://huggingface.co/docs/transformers/tasks/language_modeling) with SynapseML. This integration makes it easy to use the Apache Spark distributed computing framework to process large datasets for text generation tasks.

This tutorial shows how to apply the [Phi3 model](https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3) at scale with no extra setup.

```python
chats = [
    (1, "fix grammar: helol mi friend"),
    (2, "What is SynapseML"),
    (3, "translate to Spanish: hello"),
]

chat_df = spark.createDataFrame(chats, ["row_index", "content"])
chat_df.show()
```

```
+---------+--------------------+
|row_index|             content|
+---------+--------------------+
|        1|fix grammar: helo...|
|        2|   What is SynapseML|
|        3|translate to Span...|
+---------+--------------------+
```

## Define and Apply Phi3 model

```python
from synapse.ml.llm.HuggingFaceCausallmTransform import HuggingFaceCausalLM

phi3_transformer = (
    HuggingFaceCausalLM()
    .setModelName("microsoft/Phi-3-mini-4k-instruct")
    .setInputCol("content")
    .setOutputCol("result")
    .setModelParam(max_new_tokens=1000)
    .setModelConfig(local_files_only=False, trust_remote_code=True)
)
result_df = phi3_transformer.transform(chat_df).collect()
display(result_df)
```

## Use local cache

By caching the model, you can reduce initialization time. On Fabric, store the model in a Lakehouse and use setCachePath to load it.

```python
# %%sh
# azcopy copy "https://mmlspark.blob.core.windows.net/huggingface/microsoft/Phi-3-mini-4k-instruct" "/lakehouse/default/Files/microsoft/" --recursive=true
```

```python
# phi3_transformer = (
#     HuggingFaceCausalLM()
#     .setCachePath("/lakehouse/default/Files/microsoft/Phi-3-mini-4k-instruct")
#     .setInputCol("content")
#     .setOutputCol("result")
#     .setModelParam(max_new_tokens=1000)
# )
# result_df = phi3_transformer.transform(chat_df).collect()
# display(result_df)
```
Before check-in, this style needs to be fixed by running black . in the top-level dir.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Related Issues/PRs
#xxx
What changes are proposed in this pull request?
Briefly describe the changes included in this Pull Request.
How is this patch tested?
Does this PR change any dependencies?
Does this PR add a new feature? If so, have you added samples on website?
- Add the samples under the website/docs/documentation folder.
- Make sure you choose the correct class (estimators/transformers) and namespace.
- DocTable points to the correct API link.
- Run yarn run start to make sure the website renders correctly.
- Add <!--pytest-codeblocks:cont--> before each Python code block to enable auto-tests for Python samples.
- Make sure the WebsiteSamplesTests job passes in the pipeline.