
[WIP] Phi3poc #2301 (Open)

Wants to merge 26 commits into master.
Conversation

JessicaXYWang (Contributor)

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Briefly describe the changes included in this Pull Request.

How is this patch tested?

  • I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

Does this PR change any dependencies?

  • No. You can skip this section.
  • Yes. Make sure the dependencies are resolved correctly, and list changes here.

Does this PR add a new feature? If so, have you added samples on the website?

  • No. You can skip this section.
  • Yes. Make sure you have added samples following the steps below.
  1. Find the corresponding markdown file for your new feature in the website/docs/documentation folder.
    Make sure you choose the correct class (estimators/transformers) and namespace.
  2. Follow the pattern in the markdown file and add another section for your new API, including PySpark, Scala (and potentially .NET) samples.
  3. Make sure the DocTable points to the correct API link.
  4. Navigate to the website folder and run yarn run start to make sure the website renders correctly.
  5. Don't forget to add <!--pytest-codeblocks:cont--> before each Python code block to enable auto-tests for Python samples (see the example after this list).
  6. Make sure the WebsiteSamplesTests job passes in the pipeline.
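For step 5, a hedged illustration of how the marker sits directly above a Python sample in the markdown file (the snippet below reuses example data from this PR's notebook and assumes a Spark session carried over from an earlier block):

    <!--pytest-codeblocks:cont-->
    ```python
    chats = [(1, "fix grammar: helol mi friend"), (2, "What is SynapseML")]
    chat_df = spark.createDataFrame(chats, ["row_index", "content"])
    chat_df.show()
    ```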

@JessicaXYWang (Contributor Author)

/azp run

Azure Pipelines successfully started running 1 pipeline(s).


codecov-commenter commented Oct 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.53%. Comparing base (bab6aed) to head (7a3e315).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2301      +/-   ##
==========================================
- Coverage   84.55%   84.53%   -0.03%     
==========================================
  Files         328      328              
  Lines       16848    16848              
  Branches     1513     1513              
==========================================
- Hits        14246    14242       -4     
- Misses       2602     2606       +4     


@JessicaXYWang (Contributor Author)

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

self.config.update(kwargs)


def camel_to_snake(text):
Collaborator

there might already be one in the library to use
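For reference, a minimal regex-based sketch of such a helper (illustrative only, not this PR's implementation; reusing an existing library utility would be preferable if one exists):

    import re

    def camel_to_snake(text):
        # Insert an underscore before each interior uppercase letter, then lowercase,
        # e.g. camel_to_snake("modelParam") -> "model_param".
        return re.sub(r"(?<!^)(?=[A-Z])", "_", text).lower()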

"output column",
typeConverter=TypeConverters.toString,
)
modelParam = Param(Params._dummy(), "modelParam", "Model Parameters")
Collaborator

Maybe explain difference between model params and other params (you can just link to other docs if easier)

typeConverter=TypeConverters.toString,
)
modelParam = Param(Params._dummy(), "modelParam", "Model Parameters")
modelConfig = Param(Params._dummy(), "modelConfig", "Model configuration")
Collaborator

Maybe explain difference between model config and other params (you can just link to other docs if easier)
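One way to document the distinction, assuming (as the accompanying notebook suggests) that modelParam carries generation-time keyword arguments while modelConfig carries model-loading options; the exact semantics should be confirmed against the implementation:

    from synapse.ml.llm.HuggingFaceCausallmTransform import HuggingFaceCausalLM

    phi3 = (
        HuggingFaceCausalLM()
        .setModelName("microsoft/Phi-3-mini-4k-instruct")
        # modelParam: kwargs forwarded to the text-generation call (e.g. max_new_tokens)
        .setModelParam(max_new_tokens=200)
        # modelConfig: kwargs applied when loading the model (e.g. trust_remote_code)
        .setModelConfig(local_files_only=False, trust_remote_code=True)
    )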

useFabricLakehouse = Param(
Params._dummy(),
"useFabricLakehouse",
"Use FabricLakehouse",
@mhamilton723 (Collaborator) commented Jan 2, 2025

If this is for a local cache, then you might be able to make the verbiage generic, like useLocalCache.
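A hypothetical sketch of the suggested generic naming (not part of this PR; the description string and boolean typeConverter are assumptions):

    from pyspark.ml.param import Param, Params, TypeConverters

    # Generic name so the parameter is not tied to Fabric Lakehouse specifically.
    useLocalCache = Param(
        Params._dummy(),
        "useLocalCache",
        "Whether to load the model from a local cache",
        typeConverter=TypeConverters.toBoolean,
    )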

@JessicaXYWang (Contributor Author)

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@@ -0,0 +1 @@
{"cells":[{"cell_type":"markdown","source":["# Apply Phi3 model with HuggingFace Causal ML"],"metadata":{"nteract":{"transient":{"deleting":false}}},"id":"7a355394-5b22-4c09-8d4f-9467a2fcfce4"},{"cell_type":"markdown","source":["![HuggingFace Logo](https://huggingface.co/front/assets/huggingface_logo-noborder.svg)\n","\n","**HuggingFace** is a popular open-source platform that develops computation tools for building application using machine learning. It is widely known for its Transformers library which contains open-source implementation of transformer models for text, image, and audio task.\n","\n","[**Phi 3**](https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/) is a family of AI models developed by Microsoft, designed to redefine what is possible with small language models (SLMs). Phi-3 models are the most compatable and cost-effective SLMs, [outperforming models of the same size and even larger ones in language](https://news.microsoft.com/source/features/ai/the-phi-3-small-language-models-with-big-potential/?msockid=26355e446adb6dfa06484f956b686c27), reasoning, coding, and math benchmarks. \n","\n","<img src=\"https://pub-66c8c8c5ae474e9a9161c92b21de2f08.r2.dev/2024/04/The-Phi-3-small-language-models-with-big-potential-1.jpg\" alt=\"Phi 3 model performance\" width=\"600\">\n","\n","To make it easier to scale up causal language model prediction on a large dataset, we have integrated [HuggingFace Causal LM](https://huggingface.co/docs/transformers/tasks/language_modeling) with SynapseML. This integration makes it easy to use the Apache Spark distributed computing framework to process large data on text generation tasks.\n","\n","This tutorial shows hot to apply [phi3 model](https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3) at scale with no extra setting.\n"],"metadata":{"nteract":{"transient":{"deleting":false}},"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"aa35ae52-6a9e-458d-91ee-ae3962ab5b68"},{"cell_type":"code","source":["chats = [\n"," (1, \"fix grammar: helol mi friend\"),\n"," (2, \"What is SynapseML\"),\n"," (3, \"translate to Spanish: hello\"),\n","]\n","\n","chat_df = spark.createDataFrame(chats, [\"row_index\", \"content\"])\n","chat_df.show()"],"outputs":[{"output_type":"display_data","data":{"application/vnd.livy.statement-meta+json":{"spark_pool":null,"statement_id":9,"statement_ids":[9],"state":"finished","livy_statement_state":"available","session_id":"0c9f61cd-1288-4e0e-9c81-e054702855b3","normalized_state":"finished","queued_time":"2025-01-16T17:14:07.1864063Z","session_start_time":null,"execution_start_time":"2025-01-16T17:18:56.122231Z","execution_finish_time":"2025-01-16T17:19:03.4236677Z","parent_msg_id":"11078688-e7a5-4a37-8e95-6485d95aa809"},"text/plain":"StatementMeta(, 0c9f61cd-1288-4e0e-9c81-e054702855b3, 9, Finished, Available, Finished)"},"metadata":{}},{"output_type":"stream","name":"stdout","text":["+---------+--------------------+\n|row_index| content|\n+---------+--------------------+\n| 1|fix grammar: helo...|\n| 2| What is SynapseML|\n| 3|translate to Span...|\n+---------+--------------------+\n\n"]}],"execution_count":3,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"7e76b540-466f-4ab3-9aa9-da8de5517fc1"},{"cell_type":"markdown","source":["## Define and Apply Phi3 
model"],"metadata":{"nteract":{"transient":{"deleting":false}},"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"ac0687e7-6609-4af4-a1a4-c098cb404374"},{"cell_type":"code","source":["from synapse.ml.llm.HuggingFaceCausallmTransform import HuggingFaceCausalLM\n","\n","phi3_transformer = (\n"," HuggingFaceCausalLM()\n"," .setModelName(\"microsoft/Phi-3-mini-4k-instruct\")\n"," .setInputCol(\"content\")\n"," .setOutputCol(\"result\")\n"," .setModelParam(max_new_tokens=1000)\n"," .setModelConfig(local_files_only=False, trust_remote_code=True)\n",")\n","result_df = phi3_transformer.transform(chat_df).collect()\n","display(result_df)"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"},"collapsed":false,"jupyter":{"outputs_hidden":true},"editable":true,"run_control":{"frozen":false}},"id":"f8db55d9-b89d-420f-80e9-618041def698"},{"cell_type":"markdown","source":["## Use local cache\n","\n","By caching the model, you can reduce initialization time. On Fabric, store the model in a Lakehouse and use setCachePath to load it."],"metadata":{"nteract":{"transient":{"deleting":false}},"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"4c839ac6-f92e-4615-a0c3-977a96231cc6"},{"cell_type":"code","source":["# %%sh\n","# azcopy copy \"https://mmlspark.blob.core.windows.net/huggingface/microsoft/Phi-3-mini-4k-instruct\" \"/lakehouse/default/Files/microsoft/\" --recursive=true"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"9bc5edf1-35cb-45d6-b1dc-49a22a01484b"},{"cell_type":"code","source":["# phi3_transformer = (\n","# HuggingFaceCausalLM()\n","# .setCachePath(\"/lakehouse/default/Files/microsoft/Phi-3-mini-4k-instruct\")\n","# .setInputCol(\"content\")\n","# .setOutputCol(\"result\")\n","# .setModelParam(max_new_tokens=1000)\n","# )\n","# result_df = phi3_transformer.transform(chat_df).collect()\n","# display(result_df)"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"ee52c891-3be2-48fe-87b3-648e299a794e"}],"metadata":{"kernel_info":{"name":"synapse_pyspark"},"kernelspec":{"name":"synapse_pyspark","language":"Python","display_name":"Synapse PySpark"},"language_info":{"name":"python"},"microsoft":{"language":"python","language_group":"synapse_pyspark","ms_spell_check":{"ms_spell_check_language":"en"}},"nteract":{"version":"nteract-front-end@1.0.0"},"spark_compute":{"compute_id":"/trident/default","session_options":{"conf":{"spark.synapse.nbs.session.timeout":"1200000"}}},"synapse_widget":{"version":"0.1","state":{}},"dependencies":{"lakehouse":{"default_lakehouse":"cf3f397e-6a87-43ab-b8e0-bb9342e11c7a","default_lakehouse_name":"jessiwang_phi3","default_lakehouse_workspace_id":"4751a5bb-6a44-4164-8b31-c3b6a4cf1f8d"},"environment":{}}},"nbformat":4,"nbformat_minor":5}
Collaborator

Before check-in, this style needs to be fixed by running black . in the top-level dir.
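For reference, a typical local invocation (assuming the standard black CLI; the jupyter extra is only needed if notebooks should be formatted as well):

    pip install "black[jupyter]"
    black .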

@JessicaXYWang (Contributor Author)

/azp run

Azure Pipelines successfully started running 1 pipeline(s).
